Tennis greats at the top of their game

The following chart of world No. 1 tennis players looks pretty but the payoff of spending time to understand it isn't high enough. The light colors against the tennis net backdrop don't work as intended. The annotation is well done, and it's always neat to tuck a legend inside the text.

Tableautennisnumberones

The original is found at Tableau Public (link).

The topic of the analysis appears to be the ages at which tennis players attained world #1 ranking. Here are the male players visualized differently:

Redo_junkcharts_no1tennisplayers

Some players like Jimmy Connors and Federer have second springs after dominating the game in their late twenties. It's relatively rare for players to get to #1 after 30.


Choosing between individuals and aggregates

Friend/reader Thomas B. alerted me to this paper that describes some of the key chart forms used by cancer researchers.

It strikes me that many of the "new" charts plot granular data at the individual level. This heatmap of gene expressions shows one column per patient:

Jnci_genemap

This so-called swimmer plot shows one bar per patient:

Jnci_swimlanes

This spider plot shows the progression of individual patients over time. Key events are marked with symbols.

Jnci_spaghetti

These chart forms are distinguished from other ones that plot aggregated statistics: statistical averages, medians, subgroup averages, and so on.

One obvious limitation of such charts is their lack of scalability. The number of patients, the variability of the metric, and the timing of trends all drive up the amount of messiness.

I am left wondering what Question is being addressed by these plots. If we are concerned about treatment of an individual patient, then showing each line by itself would be clearer. If we are interested in the average trends of patients, then a chart that plots the overall average, or subgroup averages would be more accurate. If the interpretation of the individual's trend requires comparing with similar patients, then showing that individual's line against the subgroup average would be preferred.
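
To make the third option concrete, here is a minimal sketch that places one patient's line against the subgroup average. The patient names and the percent-change values are invented for illustration; nothing here comes from the paper.

```python
from statistics import mean

# Hypothetical data: percent change from baseline at each visit,
# one list per patient. All names and values are made up.
subgroup = {
    "patient_A": [0, -10, -20, -25],
    "patient_B": [0, 5, -5, -15],
    "patient_C": [0, -25, -35, -20],
}

def subgroup_average(trajectories):
    """Average across patients at each visit (lists must have equal length)."""
    return [mean(values) for values in zip(*trajectories.values())]

def gap_from_average(patient_id, trajectories):
    """One patient's value minus the subgroup average, visit by visit."""
    avg = subgroup_average(trajectories)
    return [v - a for v, a in zip(trajectories[patient_id], avg)]

average = subgroup_average(subgroup)
gap_b = gap_from_average("patient_B", subgroup)
print(average)  # the single line that would stand in for the subgroup
print(gap_b)    # how far patient_B sits from that line at each visit
```

The point of the sketch is that the comparison needs just two lines on the chart, the individual and the average, rather than every patient at once.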

When shown these charts of individual lines, readers are tempted to play the statistician - without using appropriate tools! Readers draw aggregate conclusions, performing the aggregation in their heads.

The authors of the paper note: "Spider plots only provide good visual qualitative assessment but do not allow for formal statistical inference." I agree with the second part. The first part is a fallacy - if the visual qualitative assessment is good enough, then no formal inference is necessary! The same argument is often made when people say they don't need advanced analysis because their simple analysis is "directionally accurate". When is something "directionally inaccurate"? How would one know?

Reference: Chia, Gedye, et al., "Current and Evolving Methods to Visualize Biological Data in Cancer Research", JNCI, 2016, 108(8). (link)

***

Meteorologists, whom I featured in the previous post, also have their own spider-like chart for hurricanes. They call it a spaghetti map:

Dorian_spaghetti

Compare this to the "cone of uncertainty" map that was featured in the prior post:

AL052019_5day_cone_with_line_and_wind

These two charts build upon the same dataset. The cone map, as we discussed, shows the range of probable paths of the storm center, based on all simulations of all acceptable models for projection. The spaghetti map shows selected individual simulations. Each line is the most likely trajectory of the storm center as predicted by a single simulation from a single model.

The problem is that each predictive model type has its own historical accuracy (known as "skill"), and so the lines embody different levels of importance. Further, it's not immediately clear if all possible lines are drawn so any reader making conclusions of, say, the envelope containing x percent of these lines is likely to be fooled. Eyeballing the "cone" that contains x percent of the lines is not trivial either. We tend to naturally drift toward aggregate statistical conclusions without the benefit of appropriate tools.
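
Here is a rough sketch of what the eye is trying to do. It computes the band containing a given percent of simulated positions at one forecast hour, first from every simulation and then from only the lines that happen to be drawn. The latitudes are invented, the function name is mine, and the calculation treats every line equally, which ignores model skill.

```python
def central_interval(values, coverage_pct):
    """Range left after trimming an equal number of points from each tail.

    coverage_pct is an integer percentage (e.g. 80) so that the number of
    trimmed points is computed with integer arithmetic.
    """
    ordered = sorted(values)
    n = len(ordered)
    drop = n * (100 - coverage_pct) // 200  # points trimmed from each end
    return ordered[drop], ordered[n - 1 - drop]

# Hypothetical latitudes (degrees N) of the storm center at one forecast hour,
# one value per simulation. These numbers are invented for illustration.
all_runs = [25.1, 26.0, 26.4, 26.8, 27.0, 27.3, 27.9, 28.5, 29.2, 31.0]
drawn_runs = all_runs[2:8]  # suppose only these six lines appear on the map

print(central_interval(all_runs, 80))    # (26.0, 29.2)
print(central_interval(drawn_runs, 80))  # (26.4, 28.5)
```

The band shrinks when lines are left off the map, which is why a reader who eyeballs the envelope of the drawn lines can be fooled.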

Plots of individuals should be used to address the specific problem of assessing individuals.


This Wimbledon beauty will be ageless

Ft_wimbledonage


This Financial Times chart paints the picture of the emerging trend in Wimbledon men’s tennis: the average age of players has been rising, and hits 30 years old for the first time ever in 2019.

The chart works brilliantly. Let's look at the design decisions that contributed to its success.

The chart contains a good amount of data and the presentation is carefully layered, with the layers nicely tied to some visual cues.

Readers are drawn immediately to the average line, which conveys the key statistical finding. The blue dot reinforces the key message, aided by the dotted line drawn at 30 years old. The single data label that shows a number also highlights the message.

Next, readers may notice the large font that is applied to selected players. This device draws attention to the human stories behind the dry data. Knowledgeable fans may recall fondly when Borg, Becker and Chang burst onto the scene as teenagers.

Then, readers may pick up on the ticker-tape data that display the spread of ages of Wimbledon players in any given year. There is some shading involved, not clearly explained, but we surmise that it illustrates the range of ages of most of the contestants. In a sense, the range of probable ages and the average age tell the same story. The current trend of rising ages began around 2005.

Finally, a key data processing decision is disclosed in the chart header and sub-header. The chart only plots the players who reached the fourth round (the last 16). Like most decisions involved in data analysis, this choice has both desirable and undesirable effects. I like it because it thins out the data. The chart would have appeared more cluttered otherwise, in a negative way.

The removal of players eliminated in the early rounds limits the conclusion that one can draw from the chart. We are tempted to generalize the finding, saying that the average men’s player has increased in age – that was what I said in the first paragraph. Thinking about that for a second, I am not so sure the general statement is valid.

The overall field might have gone younger or not grown older, even as the older players assert their presence in the tournament. (This article provides side evidence that the conjecture might be true: the author looked at the average age of players in the top 100 ATP ranking versus top 1000, and learned that the average age of the top 1000 has barely shifted while the top 100 players have definitely grown older.)
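
A toy example shows how the two averages can part ways. The ages below are invented, not real tournament data, and the split into "last_16" and "eliminated_early" is only a stand-in for the FT's selection rule.

```python
from statistics import mean

# Hypothetical ages for a tiny field in two different years.
field_then = {"last_16": [21, 23, 24, 26], "eliminated_early": [25, 27, 28, 30]}
field_now = {"last_16": [27, 29, 30, 32], "eliminated_early": [19, 21, 22, 24]}

def averages(field):
    """Return (average age of the last 16, average age of the whole field)."""
    everyone = field["last_16"] + field["eliminated_early"]
    return mean(field["last_16"]), mean(everyone)

print(averages(field_then))  # (23.5, 25.5)
print(averages(field_now))   # (29.5, 25.5)
```

In this made-up field, the players who go deep age by six years while the average age of the whole field does not move at all.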

So kudos to these reporters for writing a careful headline that stays true to the analysis.

I also found this video at FT that discussed the chart.

***

This chart about Wimbledon players hits the Trifecta. It has an interesting – to some, surprising – message (Q). It demonstrates thoughtful processing and analysis of the data (D). And the visual design fits well with its intended message (V). (For a comprehensive guide to the Trifecta Checkup, see here.)


SCMP's fantastic infographic on Hong Kong protests

In the past month, there have been several large-scale protests in Hong Kong. The largest one featured up to two million residents taking to the streets on June 16 to oppose an extradition act that was working its way through the legislature. If the count was accurate, about 25 percent of the city’s population joined in the protest. Another large demonstration occurred on July 1, the anniversary of Hong Kong’s return to Chinese rule.

South China Morning Post, which can be considered the New York Times of Hong Kong, is well known for its award-winning infographics, and they rose to the occasion with this effort.

This is one of the rare infographics that you’d not regret spending time reading. After reading it, you have learned a few new things about protesting in Hong Kong.

In particular, you’ll learn that the recent demonstrations are part of a larger pattern in which Hong Kong residents express their dissatisfaction with the city’s governing class, frequently accused of acting as puppets of the Chinese state. Under the “one country, two systems” arrangement, the city’s officials occupy an unenviable position of mediating the various contradictions of the two systems.

This bar chart shows the growth in the protest movement. The recent massive protests didn't come out of nowhere. 

Scmp_protestsovertime

This line chart offers a possible explanation for burgeoning protests. Residents perceived their freedoms eroding in the last decade.

Scmp_freedomsurvey

If you have seen videos of the protests, you’ll have noticed the peculiar protest costumes. Umbrellas are used to block pepper sprays, for example. The following lovely graphic shows how the costumes have evolved:

Scmp_protestcostume

The scale of these protests captures the imagination. The last part of the infographic places the number of protestors in context, by expressing it in terms of football pitches (as soccer fields are known outside the U.S.). This is a sort of universal measure due to the popularity of football almost everywhere. (Nevertheless, according to Wikipedia, the fields do not have one fixed dimension even though fields used for international matches are standardized to 105 m by 68 m.)

Scmp_protestscale_pitches

This chart could be presented as a bar chart. It’s just that the data have been re-scaled – from counting individuals to counting football pitches-ful of individuals. 
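
The re-scaling is simple arithmetic, sketched below. The pitch dimensions are the standardized ones mentioned above; the crowd density is my own placeholder, since I don't know what figure the SCMP designers used.

```python
# Standardized international pitch, as cited above.
PITCH_LENGTH_M = 105
PITCH_WIDTH_M = 68
PITCH_AREA_M2 = PITCH_LENGTH_M * PITCH_WIDTH_M  # 7,140 square meters

def pitches_filled(people, people_per_m2):
    """Number of pitches needed to hold a crowd at a given density.

    people_per_m2 is an assumption supplied by the caller, not a value
    taken from the infographic.
    """
    return people / (people_per_m2 * PITCH_AREA_M2)

# Hypothetical density of 2 people per square meter.
print(round(pitches_filled(2_000_000, 2)))  # 140
```

Changing the unit from individuals to pitches divides every value by the same constant, so the relative sizes of the bars are unchanged.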

***
Here is the entire infographic.


A chart makes an appearance in my new video

Been experimenting with short videos recently. My latest is a short explainer on why some parents are willing to spend over a million dollars to open back doors to college admissions. I even inserted a chart showing some statistics. Click here to see the video.

Also, subscribe to my channel to see future episodes of Inside the Black Box.

***

Here are a couple of recent posts related to college admissions.

  • About those so-called adversity scores (link)
  • A more detailed post on various college admissions statistics (link)


Clarifying comparisons in censored cohort data: UK housing affordability

If you're pondering over the following chart for five minutes or more, don't be ashamed. I took longer than that.

Ft_ukgenerationalhousing

The chart accompanied a Financial Times article about inter-generational fairness in the U.K. To cut to the chase, a recently released study found that younger generations are spending substantially higher proportions of their incomes to pay for housing costs. The FT article is here (behind paywall). FT actually slightly modified the original chart, which I pulled from the Home Affront report by the Intergenerational Commission.

Uk_generational_propincomehousing

One stumbling block is to figure out what is plotted on the horizontal axis. The label "Age" has gone missing. Even though I am familiar with cohort analysis (here, generational analysis), it took effort to understand why the lines are not uniformly growing in lengths. Typically, the older generation is observed for a longer period of time, and thus should have a longer line.

In particular, the orange line, representing people born before 1895, only shows up for a five-year range, from ages 70 to 75. This was confusing because surely these people have lived through ages 20 to 70. I'm assuming the "left censoring" (missing data on the left side) is because of non-existence of old records.

The dataset is also right-censored (missing data on the right side). This occurs with the younger generations (the top three lines) because those cohorts have not yet reached certain ages. The interpretation is further complicated by the range of birth years in each cohort but let me not go there.

TL;DR ... each line represents a generation of Britons, defined by their birth years. The generations are compared by how much of their incomes they spent on housing costs. The twist is that we control for age, meaning that we compare these generations at the same age (i.e. at each life stage).
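
The uneven line lengths follow directly from the arithmetic of age = year minus birth year. Here is a minimal sketch, assuming records that run from 1961 to 2017 and an age axis from 20 to 80. Those years and ages are my placeholders, not figures from the report.

```python
def observed_ages(birth_year, first_data_year, last_data_year,
                  min_age=20, max_age=80):
    """Ages at which someone born in birth_year falls inside the data window."""
    youngest = max(min_age, first_data_year - birth_year)
    oldest = min(max_age, last_data_year - birth_year)
    if youngest > oldest:
        return None  # never observed within the plotted age range
    return youngest, oldest

def censoring(birth_year, first_data_year, last_data_year,
              min_age=20, max_age=80):
    """Describe which side(s) of the line are missing for this birth year."""
    span = observed_ages(birth_year, first_data_year, last_data_year,
                         min_age, max_age)
    if span is None:
        return "unobserved"
    youngest, oldest = span
    flags = []
    if youngest > min_age:
        flags.append("left-censored")   # early ages predate the records
    if oldest < max_age:
        flags.append("right-censored")  # later ages have not happened yet
    return ", ".join(flags) or "complete"

# Hypothetical data window: 1961 to 2017.
print(observed_ages(1890, 1961, 2017), censoring(1890, 1961, 2017))
print(observed_ages(1985, 1961, 2017), censoring(1985, 1961, 2017))
```

Under these assumptions, someone born in 1890 is only seen from age 71 onward, while someone born in 1985 is only seen up to age 32.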

***

Here is my version of the same chart:

Junkcharts_redo_ukgenerationalhousing_1

Here are some of the key edits:

  • Vertical blocks are introduced to break up the analysis by life stage. These guide readers to compare the lines vertically i.e. across generations
  • The generations are explicitly described as cohorts by birth years
  • The labels for the generations are placed next to the lines
  • Gridlines are pushed to the back
  • The age axis is explicitly labeled
  • Age labels are thinned
  • A hierarchy on colors
  • The line segments with incomplete records are dimmed

The harmful effect of colors can be seen below. This chart is the same as the one above, except for retaining the colors of the original chart:

Junkcharts_redo_ukgenerationalhousing_2


Visually exploring the relationship between college applicants and enrollment

In a previous post, we learned that top U.S. colleges have become even more selective over the last 15 years, driven by a doubling of the number of applicants while class sizes have nudged up by just 10 to 20 percent. 

Redo_pewcollegeadmissions

The top 25 most selective colleges are included in the first group. Between 2002 and 2017, their average rate of admission dropped from about 20% to about 10%, almost entirely explained by applicants per student doubling from 10 to almost 20. A similar upward movement in selectivity is found in the first four groups of colleges, which on average accept at least half of their applicants.

Most high school graduates however are not enrolling in colleges in the first four groups. Actually, the majority of college enrollment belongs to the bottom two groups of colleges. These groups also attracted twice as many applicants in 2017 relative to 2002 but the selectivity did not change. They accepted 75% to 80% of applicants in 2002, as they did in 2017.

***

In this post, we look at a different view of the same data. The following charts focus on the growth rates, indexed to 2002. 
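
For readers unfamiliar with indexing, the calculation divides every value in a series by its value in the reference year and multiplies by 100. A short sketch follows; the applicant counts are placeholders, not the Pew figures.

```python
def index_to_base(series, base_year):
    """Rescale a {year: value} series so that the base year equals 100."""
    base = series[base_year]
    return {year: 100 * value / base for year, value in series.items()}

# Hypothetical counts, for illustration only.
applicants = {2002: 400_000, 2010: 600_000, 2017: 800_000}
print(index_to_base(applicants, 2002))
# {2002: 100.0, 2010: 150.0, 2017: 200.0}
```

Once every series starts at 100, groups of very different sizes can share one vertical axis, and the lines compare growth rather than volume.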

Collegeadmissions_5

To my surprise, the number of college-age Americans grew by about 10% initially but by 2017 had dropped back to the level of 2002. Meanwhile, the number of applications to the colleges continued to climb across all eight groups of colleges.

The jump in applications made selectivity surge at the most selective colleges, but at the less selective colleges, where the vast majority of students enroll, the admission rate stayed put because they gave out many more offers as applications mounted. As the Pew headline asserted, "the rich gets richer."

Enrollment has not kept up. Class sizes expanded about 10 to 30 percent in those 15 years, lagging way behind applications and admissions.

How do we explain the incremental applications?

  • Applicants increasing the number of schools they apply to
  • The untapped market: applicants who in the past would not have applied to college
  • Non-U.S. applicants: this is part of the untapped market, but much larger


An exercise in decluttering

My friend Xan found the following chart by Pew hard to understand. Why is the chart so taxing to look at? 

Pew_collegeadmissions

It's packing too much.

I first notice the shaded areas. Shading usually signifies "look here". On this chart, the shading is highlighting the least important part of the data. Since the top line shows applicants and the bottom line admitted students, the shaded gap displays the rejections.

The numbers printed on the chart are growth rates but they confusingly do not sync with the slopes of the lines because the vertical axis plots absolute numbers, not rates. 

Pew_collegeadmissions_growth

The vertical axis presents the total number of applicants, and the total number of admitted students, in each "bucket" of colleges, grouped by their admission rate in 2017. On the right, I drew in two lines, both growth rates of 100%, from 500K to 1 million, and from 1 to 2 million. The slopes are not the same even though the rates of growth are.

Therefore, the growth rates printed on the chart must be read as extraneous data unrelated to other parts of the chart. Attempts to connect those rates to the slopes of the corresponding lines are frustrated.
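
The mismatch between slopes and growth rates can be checked with the two lines I drew. In this sketch, the 15-year span is an assumption borrowed from the 2002 to 2017 range of the data, and the log-scale slope is included only to show one axis on which equal growth rates do produce equal slopes.

```python
import math

def growth_rate(start, end):
    """Relative growth, e.g. 1.0 means +100%."""
    return (end - start) / start

def linear_slope(start, end, years):
    """Slope on an axis of absolute numbers (units per year)."""
    return (end - start) / years

def log_slope(start, end, years):
    """Slope on a log10 axis (orders of magnitude per year)."""
    return (math.log10(end) - math.log10(start)) / years

YEARS = 15  # assumed span, 2002 to 2017
for start, end in [(500_000, 1_000_000), (1_000_000, 2_000_000)]:
    print(growth_rate(start, end),
          round(linear_slope(start, end, YEARS)),
          round(log_slope(start, end, YEARS), 4))
```

Both lines have a growth rate of 1.0, yet the second has twice the slope of the first on the absolute axis. On the log axis the two slopes agree.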

Another lurking factor is the unequal sizes of the buckets of colleges. There are fewer than 10 colleges in the most selective bucket, and over 300 colleges in the largest bucket. We are unable to interpret properly the total number of applicants (or admissions). The quantity of applications in a bucket depends not just on the popularity of the colleges but also the number of colleges in each bucket.

The solution isn't to resize the buckets but to select a more appropriate metric: the number of applicants per enrolled student. The most selective colleges are attracting about 20 applicants per enrolled student while the least selective colleges (those that accept almost everyone) are getting 4 applicants per enrolled student, in 2017.
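
The metric is a plain ratio, which is what removes the dependence on bucket size. The totals below are invented, chosen only so that the ratios land on the 20 and 4 cited above; the bucket names are mine.

```python
# Hypothetical bucket totals. Only the resulting ratios echo the text.
buckets = {
    "most selective": {"applicants": 400_000, "enrolled": 20_000},
    "least selective": {"applicants": 1_200_000, "enrolled": 300_000},
}

def applicants_per_enrolled(bucket):
    """Applicants divided by enrolled students within one bucket."""
    return bucket["applicants"] / bucket["enrolled"]

for name, bucket in buckets.items():
    print(name, applicants_per_enrolled(bucket))
```

In this made-up example, the less selective bucket has three times as many applicants in total, yet the ratio shows it draws far fewer applicants for each seat.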

As the following chart shows, the number of applicants has doubled across the board in 15 years. This raises an intriguing question: why would a college that accepts pretty much all applicants need more applicants than enrolled students?

Redo_pewcollegeadmissions

Depending on whether you are a school administrator or a student, a virtuous (or vicious) cycle has been realized. The top four most selective groups of colleges have been able to progressively attract more applicants. Since class size did not expand appreciably, more applicants result in an ever-lower admit rate. A lower admit rate reduces the chance of getting admitted, which causes prospective students to apply to even more colleges, which further suppresses the admit rate.


Nice example of visual story-telling in the FT

I came across this older chart in the Financial Times, which is a place to find some nice graphics:

Ft_uklifeexpectancy

The key to success here is having a good story to tell. Blackpool is an outlier when it comes to improvement in life expectancy since 1993. Its average life expectancy has improved, but the magnitude of improvement lags other areas by quite a margin.

The design then illustrates this story in two ways.

On the right side, one sees Blackpool occupying a lone spot on the left side of the histogram. On the left chart, the gap between Blackpool and the national average is plotted over time. The gap is clearly widening; the size of the gap is labeled so the reader immediately knows it went from 1.8 to 4.9.

Although they're not labeled, the reader understands that the other two lines are the best and worst areas. The comparison between Glasgow City and Blackpool is also informative. Glasgow City, which has the worst life expectancy in the U.K., is fast catching up with Blackpool, the second worst.

I also like the color-coded titles. They draw attention to Blackpool and link the conclusion to both charts in an efficient manner.


Five steps to let the young ones shine

Knife stabbings are in the news in the U.K. and the Economist has a quartet of charts to illustrate what's going on.

Economist_20190309_WOC479

I'm going to focus on the chart on the bottom right. This shows the trend in hospital admissions due to stabbings in England from 2000 to 2018. The three lines show all ages, and two specific age groups: under 16 and 16-18.

The first edit I made was to spell out all years in four digits. For this chart, numbers like 15 and 18 can be confused with ages.

Redo_econ_ukknives_1

The next edit corrects an error in the subtitle. The reference year is not 2010 as those three lines don't cross 100. It appears that the reference year is 2000. Another reason to use four-digit years on the horizontal axis is to be consistent with the subtitle.

Redo_econ_ukknives_2

The next edit removes the black dot which draws attention to itself. The chart though is not about the year 2000, which has the least information since all data have been forced to 100.

Redo_econ_ukknives_3

The next edit makes the vertical axis easier to interpret. The index values 150 and 200 are much better stated as +50% and +100%. The red line can be labeled "at 2000 level". One can even remove the subtitle 2000=100 if desired.
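
The relabeling amounts to subtracting 100 from each index value. Here is a small sketch of a label function; the function name and its default base label are my own choices.

```python
def index_label(index_value, base_label="at 2000 level"):
    """Turn an index value (base = 100) into a percent-change axis label."""
    change = index_value - 100
    if change == 0:
        return base_label
    return f"{change:+.0f}%"  # the + flag forces an explicit sign

print([index_label(v) for v in (100, 150, 200)])
# ['at 2000 level', '+50%', '+100%']
```

Readers then see the change directly instead of having to subtract 100 in their heads.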

Redo_econ_ukknives_4

Finally, I surmise the message the designer wants to get across is the above-average jump in hospital admissions among children under 16 and those aged 16 to 18. Therefore, the "All" line exists to provide context. Thus, I made it a dashed line, pushing it to the background.

Redo_econ_ukknives_5