I saw the following chart in an article about the ordering of author names in research papers. This is a topic of great interest to researchers looking to become famous. Memorably, the article was written by two European researchers with the same first and last names.

The article investigates conventions for sequencing authors in scientific papers with more than one author. The two most popular conventions are ordering by significance of contribution ("importance") and ordering alphabetically by last name.

It appears that the following analysis was performed to make the above chart: take all papers with exactly two authors, group by first character of first author's last name, then compute the probability that the last names are listed in alphabetical order, given the first character of the first author's last name.
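Here is a minimal sketch of that computation as I understand it. The function name, the input format (a list of surname pairs) and the toy data are my own assumptions, not taken from the article.

```python
from collections import defaultdict


def alphabetical_share_by_first_letter(papers):
    """Share of alphabetically ordered pairs, keyed by the first author's initial.

    `papers` is an iterable of (first_author_surname, second_author_surname)
    tuples, one per two-author paper (an assumed input format).
    """
    counts = defaultdict(lambda: [0, 0])  # initial -> [alphabetical, total]
    for first, second in papers:
        a, b = first.strip().lower(), second.strip().lower()
        if not a or not b:
            continue  # skip records with a missing surname
        initial = a[0].upper()
        counts[initial][1] += 1
        if a <= b:  # the listed order agrees with alphabetical order
            counts[initial][0] += 1
    return {k: alpha / total for k, (alpha, total) in sorted(counts.items())}


# Toy data, invented purely for illustration.
papers = [("Adams", "Baker"), ("Adams", "Aaron"),
          ("Zhang", "Young"), ("Zhang", "Zhou")]
shares = alphabetical_share_by_first_letter(papers)
```

Each value in `shares` corresponds to one point on a discipline's line in the chart.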

There is a natural downward slope to this line: if the first author has a last name starting with A, the proportion of papers for which the second author has a last name starting with B, C, D, ... or Z should be close to 100%. If the first author's last name begins with Z, there is an overwhelming chance that the second author's last name starts with a letter before Z, which means that the pair of names would not be alphabetical.
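To make the slope concrete, here is a rough sketch of a chance baseline. It assumes the second author's initial is drawn independently from a given letter distribution and counts same-initial pairs as half alphabetical. Both are my simplifications, and the uniform distribution is only a placeholder for real surname frequencies.

```python
import string


def baseline_alphabetical_probability(letter_freq):
    """Chance that a pair is alphabetical, given the first author's initial.

    `letter_freq` maps each initial to its (unnormalized) frequency.
    Same-initial pairs count as half alphabetical (an assumption).
    """
    total = sum(letter_freq.values())
    later = total  # frequency mass of the letters not yet visited
    baseline = {}
    for letter in sorted(letter_freq):
        f = letter_freq[letter]
        later -= f  # `later` now holds the mass strictly after `letter`
        baseline[letter] = (later + 0.5 * f) / total
    return baseline


# Placeholder distribution: every initial equally likely.
uniform = {c: 1 for c in string.ascii_uppercase}
baseline = baseline_alphabetical_probability(uniform)
```

Under the uniform placeholder, the baseline falls steadily from about 98% at A to about 2% at Z.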

This chart caught my attention because it's one of those charts that may be uninsightful by design.

***

Why can't it be insightful?

Let's start by finding a statistical model for the observed data (whether the pair of names are alphabetical). The order convention (by importance or by alphabet) is unobservable.

Assume we have a pair of authors. For a specific research paper, the pair is either listed alphabetically or not. If the authors are listed non-alphabetically, then we are sure that the order convention is not alphabetical - which for the purpose of this analysis, we assume to be by importance. By contrast, if the observed order is alphabetical, then either order convention could have generated it -- names being listed alphabetically does not exclude the possibility that they were ordered by importance!
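One way to write this down, in my own notation: let $\pi$ be the share of papers that follow the alphabetical convention, and let $q$ be the probability that a paper ordered by importance happens to come out alphabetical. Then

$$
\Pr(\text{listed alphabetically}) = \pi + (1 - \pi)\, q
$$

Only the left-hand side is observable. A high share of alphabetical listings is therefore consistent with a large $\pi$, a large $q$, or both.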

In the article, the authors draw the following conclusion from the chart:

Statistics [green line] exceeds this baseline [bold black line] the most, indicating that many authors intentionally ordered their surnames alphabetically. By contrast, psychology [blue line] appears to be the discipline where the list of authors is mostly ordered by the amount of work contributed (i.e., the percentages are close to the baseline probabilities).

They see the signal in the gap between the value of a particular discipline and the "baseline" probability. If the proportion of papers with alphabetical ordering is higher, then these authors conclude that that discipline is more likely to prefer alphabetical ordering.

Must it be so?

***

Let's take an extreme case as a thought experiment. Assume that no publication uses alphabetical ordering. All authors are listed by importance of their contributions.

If we create the above chart, computing the proportion of alphabetically ordered names given the first letter of the first author's surname, it would look exactly the same - except that the horizontal axis should be relabeled "First letter of the primary author's last name" (instead of "First letter of the first author").

However, the gap between these lines gives us zero information about the preference for alphabetical ordering because all names are listed by importance, and some just happen to also be alphabetical. In this experiment, the gap is purely driven by the (chance?) pairing of authors.
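A small simulation of this thought experiment, as a sketch under assumptions of my own: initials are uniform over A to Z, the primary author is chosen without regard to name, and same-initial pairs are dropped because initials alone cannot order them.

```python
import random
import string


def simulate_importance_only(n_papers, seed=0):
    """Share of alphabetical pairs by primary author's initial when every
    paper is ordered by importance (no alphabetical convention at all)."""
    rng = random.Random(seed)
    letters = string.ascii_uppercase
    counts = {c: [0, 0] for c in letters}  # initial -> [alphabetical, total]
    for _ in range(n_papers):
        primary = rng.choice(letters)    # listed first because of importance
        secondary = rng.choice(letters)
        if primary == secondary:
            continue  # tie on initials: dropped (a simplification)
        counts[primary][1] += 1
        if primary < secondary:
            counts[primary][0] += 1
    return {c: a / t for c, (a, t) in counts.items() if t > 0}


shares = simulate_importance_only(100_000)
```

Even though no simulated paper uses the alphabetical convention, the share of alphabetical pairs still runs from 100% at A down to 0% at Z. Any departure from this curve in real data would have to come from how authors are paired, which the chart cannot separate from a preference for alphabetical ordering.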

Either the research doesn't accomplish what the researchers think it does, or I'm missing the point. What do you think?
