## Finding the story in complex datasets

##### Mar 16, 2023

In CT Mirror's feature about Connecticut, which I wrote about in the previous post, there is one graphic that did not rise to the same level as the others.

This section deals with graduation rates of the state's high school districts. The above chart focuses on exactly five districts. The line charts are organized in a stack. No year labels are provided. The time window is 11 years from 2010 to 2021. The column of numbers show the difference in graduation rates over the entire time window.

The five lines look basically the same, if we ignore what looks to be noisy year-to-year fluctuations. This is due to the weird aspect ratio imposed by stacking.

Why are those five districts chosen? Upon investigation, we learn that these are the five districts with the biggest improvement in graduation rates during the 11-year time window.

The same five schools also had some of the lowest graduation rates at the start of the analysis window (2010). This must be so because if a school graduated 90% of its class in 2010, it would be mathematically impossible for it to attain a 35% percent point improvement! This is a dissatisfactory feature of the dataviz.

***

In preparing an alternative version, I start by imagining how readers might want to utilize a visualization of this dataset. I assume that the readers may have certain school(s) they are particularly invested in, and want to see its/their graduation performance over these 11 years.

How does having the entire dataset help? For one thing, it provides context. What kind of context is relevant? As discussed above, it's futile to compare a school at the top of the ranking to one that is near the bottom. So I created groups of schools. Each school is compared to other schools that had comparable graduation rates at the start of the analysis period.

Amistad School District, which takes pole position in the original dataviz, graduated only 58% of its pupils in 2010 but vastly improved its graduation rate by 35% over the decade. In the chart below (left panel), I plotted all of the schools that had graduation rates between 50 and 74% in 2010. The chart shows that while Amistad is a standout, almost all schools in this group experienced steady improvements. (Whether this phenomenon represents true improvement, or just grade inflation, we can't tell from this dataset alone.)

The right panel shows the group of schools with the next higher level of graduation rates in 2010. This group of schools too increased their graduation rates almost always. The rate of improvement in this group is lower than in the previous group of schools.

The next set of charts show school districts that already achieved excellent graduation rates (over 85%) by 2010. The most interesting group of schools consists of those with 85-89% rates in 2010. Their performance in 2021 is the most unpredictable of all the school groups. The majority of districts did even better while others regressed.

Overall, there is less variability than I'd expect in the top two school groups. They generally appeared to have been able to raise or maintain their already-high graduation rates. (Note that the scale of each chart is different, and many of the lines in the second set of charts are moving within a few percentages.)

One more note about the charts: The trend lines are "smoothed" to focus on the trends rather than the year to year variability. Because of smoothing, there is some awkward-looking imprecision e.g. the end-to-end differences read from the curves versus the observed differences in the data. These discrepancies can easily be fixed if these charts were to be published.

##### Mar 08, 2023

This dataviz project by CT Mirror is excellent. The project walks through key statistics of the state of Connecticut.

Here are a few charts I enjoyed.

The first one shows the industries employing the most CT residents. The left and right arrows are perfect, much better than the usual dot plots.

The industries are sorted by decreasing size from top to bottom, based on employment in 2019. The chosen scale is absolute, showing the number of employees. The relative change is shown next to the arrow heads in percentages.

The inclusion of both absolute and relative scales may be a source of confusion as the lengths of the arrows encode the absolute differences, not the relative differences indicated by the data labels. This type of decision is always difficult for the designer. Selecting one of the two scales may improve clarity but induce loss aversion.

***

The next example is a bumps chart showing the growth in residents with at least a bachelor's degree.

This is more like a slopegraph as it appears to draw straight lines between two time points 9 years apart, omitting the intervening years. Each line represents a state. Connecticut's line is shown in red. The message is clear. Connecticut is among the most highly educated out of the 50 states. It maintained this advantage throughout the period.

I'd prefer to use solid lines for the background states, and the axis labels can be sparser.

It's a little odd that pretty much every line has the same slope. I'm suspecting that the numbers came out of a regression model, with varying slopes by state, but the inter-state variance is low.

In the online presentation, one can click on each line to see the values.

***

The final example is a two-sided bar chart:

This shows migration in and out of the state. The red bars represent the number of people who moved out, while the green bars represent those who moved into the state. The states are arranged from the most number of in-migrants to the least.

I have clipped the bottom of the chart as it extends to 50 states, and the bottom half is barely visible since the absolute numbers are so small.

I'd suggest showing the top 10 states. Then group the rest of the states by region, and plot them as regions. This change makes the chart more compact, as well as more useful.

***

There are many other charts, and I encourage you to visit and support this data journalism.

## If you blink, you'd miss this axis trick

##### Jan 31, 2023

When I set out to write this post, I was intending to make a quick point about the following chart found in the current issue of Harvard Magazine (link):

This chart concerns the "tectonic shift" of undergraduates to STEM majors at the expense of humanities in the last 10 years.

I like the chart. The dot plot is great for showing this data. They placed the long text horizontally. The use of color is crucial, allowing us to visually separate the STEM majors from the humanities majors.

My intended post is to suggest dividing the chart into four horizontal slices, each showing one of the general fields. It's a small change that makes the chart even more readable. (It has the added benefit of not needing a legend box.)

***

Then, the axis announced itself.

I was baffled, then disgusted.

Here is a magnified view of the axis:

It's not a linear scale, as one would have expected. What kind of transformation did they use? It's baffling.

Notice the following features of this transformed scale:

• It can't be a log scale because many of the growth values are negative.
• The interval for 0%-25% is longer than for 25%-50%. The interval for 0%-50% is also longer than for 50%-100%. On the positive side, the larger values are pulled in and the smaller values are pushed out.
• The interval for -20%-0% is the same length as that for 0%-25%. So, the transformation is not symmetric around 0

I have no idea what transformation was applied. I took the growth values, measured the locations of the dots, and asked Excel to fit a polynomial function, and it gave me a quadratic fit, R square > 99%.

This formula fits the values within the range extremely well. I hope this isn't the actual transformation. That would be disgusting. Regardless, they ought to have advised readers of their unusual scale.

***

Without having the fitted formula, there is no way to retrieve the actual growth values except for those that happen to fall on the vertical gridlines. Using the inverse of the quadratic formula, I deduced what the actual values were. The hardest one is for Computer Science, since the dot sits to the right of the last gridline. I checked that value against IPEDS data.

The growth values are not extreme, falling between -50% and 125%. There is nothing to be gained by transforming the scale.

The following chart undoes the transformation, and groups the majors by field as indicated above:

***

Yesterday, I published a version of this post at Andrew's blog. Several readers there figured out that the scale is the log of the relative ratio of the number of degrees granted. In the above notation, it is log10(100%+x), where x is the percent change in number of degrees between 2011 and 2021.

Here is a side-by-side view of the two scales:

The chart on the right spreads the negative growth values further apart while slightly compressing the large positive values. I still don't think there is much benefit to transforming this set of data.

P.S. [1/31/2023]

(1) A reader on Andrew's blog asked what's wrong with using the log relative ratio scale. What's wrong is exactly what this post is about. For any non-linear scale, the reader can't make out the values between gridlines. In the original chart, there are four points that exist between 0% and 25%. What values are those? That chart is even harder because now that we know what the transform is, we'd need to first think in terms of relative ratios, so 1.25 instead of 25%, then think in terms of log, if we want to know what those values are.

(2) The log scale used for change values is often said to have the advantage that equal distances on either side represent counterbalancing values. For example, (1.5) (0.66) = (3/2) (2/3)  = 1. But this is a very specific scenario that doesn't actually apply to our dataset.  Consider these scenarios:

History: # degrees went from 1000 to 666 i.e. Relative ratio = 2/3
Psychology: # degrees went from 2000 to 3000 i.e. Relative ratio = 3/2

The # of History degrees dropped by 334 while the number of Psychology degrees grew by 1000 (Psychology I think is the more popular major)

History: # degrees went from 1000 to 666 i.e. Relative ratio = 2/3
Psychology: from 1000 to 1500, i.e. Relative ratio = 3/2

The # of History degrees dropped by 334 while # of Psychology degrees grew by 500
(Assume same starting values)

History: # degrees went from 1000 to 666 i.e. Relative ratio = 2/3
Psychology: from 666 to 666*3/2 = 999 i.e. Relative ratio = 3/2

The # of History degrees dropped by 334 while # of Psychology degrees grew by 333
(Assume Psychology's starting value to be History's ending value)

Psychology: # degrees went from 1000 to 1500 i.e. Relative ratio = 3/2
History: # degrees went from 1500 to 1000 i.e. Relative ratio = 2/3

The # of Psychology degrees grew by 500 while the # of History degrees dropped by 500
(Assume History's starting value to be Psychology's ending value)

## Where have the graduates gone?

##### Sep 27, 2022

Someone submitted this chart on Twitter as an example of good dataviz.

The chart shows the surprising leverage colleges have on where students live after graduation.

The primary virtue of this chart is conservation of space. If our main line of inquiry is the destination states of college graduations - by state, then it's hard to beat this chart's efficiency at delivering this information. For each state, it's easy to see what proportion of graduates leave the state after graduation, and then within those who leave, the reader can learn which are the most popular destination states, and their relative importance.

The colors link the most popular destination states (e.g. Texas in orange) but they are not enough because the designer uses state labels also. A next set of states are labeled without being differentiated by color. In particular, New York and Massachusetts share shades of blue, which also is the dominant color on the left side.

***

The following is a draft of a concept I have in my head.

I imagine this to be a tile map. The underlying data are not public so I just copied down a bunch of interesting states. This view brings out the spatial information, as we expect graduates are moving to neighboring states (or the states with big cities).

The students in the Western states are more likely to stay in their own state, and if they move, they stay in the West Coast. The graduates in the Eastern states also tend to stay nearby, except for California.

I decided to use groups of color - blue for East, green for South, red for West. Color is a powerful device, if used well. If the reader wants to know which states send graduates to New York, I'm hoping the reader will see the chart this way:

## Funnels and scatters

##### Jul 29, 2022

I took a peek at some of the work submitted by Ray Vella's students in his NYU dataviz class recently.

The following chart by Hosanah Bryan caught my eye:

The data concern the GDP gap between rich and poor regions in various countries. In some countries, especially in the U.K., the gap is gigantic. In other countries, like Spain and Sweden, the gap is much smaller.

The above chart uses a funnel metaphor to organize the data, although the funnel does not add more meaning (not that it has to). Between that, the color scheme and the placement of text, it's visually clean and pleasant to look at.

The data being plotted are messy. They are not actual currency values of GDP. Each number is an index, and represents the relative level of the GDP gap in a given year and country. The gap being shown by the colored bars are differences in these indices 15 years apart. (The students were given this dataset to work with.)

So the chart is very hard to understand if one focuses on the underlying data. Nevertheless, the same visual form can hold other datasets which are less complicated.

One can nitpick about the slight misrepresentation of the values due to the slanted edges on both sides of the bars. This is yet another instance of the tradeoff between beauty and precision.

***

The next chart by Liz Delessert engages my mind for a different reason.

The scatter plot sets up four quadrants. The top right is "everyone gets richer". The top left, where most of the dots lie, is where "the rich get richer, the poor get poorer".  This chart shows a thoughtfulness about organizing the data, and the story-telling.

The grid setup cues readers toward a particular way of looking at the data.

But power comes with responsibility. Such scatter plots are particularly susceptible to the choice of data, in this case, countries. It is tempting to conclude that there are no countries in which everyone gets poorer. But that statement more likely tells us more about which countries were chosen than the real story.

I like to see the chart applied to other data transformations that are easier. For example, we can start with the % change in GDP computed separately for rich and for poor. Then we can form a ratio of these two percent changes.

## The what of visualization, beyond the how

##### Mar 16, 2022

A long-time reader sent me the following chart from a Nature article, pointing out that it is rather worthless.

The simple bar chart plots the number of downloads, organized by country, from the website called Sci-Hub, which I've just learned is where one can download scientific articles for free - working around the exorbitant paywalls of scientific journals.

The bar chart is a good example of a Type D chart (Trifecta Checkup). There is nothing wrong with the purpose or visual design of the chart. Nevertheless, the chart paints a misleading picture. The Nature article addresses several shortcomings of the data.

The first - and perhaps most significant - problem is that many Sci-Hub users are expected to access the site via VPN servers that hide their true countries of origin. If the proportion of VPN users is high, the entire dataset is called into doubt. The data would contain both false positives (in countries with VPN servers) and false negatives (in countries with high numbers of VPN users).

The second problem is seasonality. The dataset covered only one month. Many users are expected to be academics, and in the southern hemisphere, schools are on summer vacation in January and February. Thus, the data from those regions may convey the wrong picture.

Another problem, according to the Nature article, is that Sci-Hub has many competitors. "The figures include only downloads from original Sci-Hub websites, not any replica or ‘mirror’ site, which can have high traffic in places where the original domain is banned."

This mirror-site problem may be worse than it appears. Yes, downloads from Sci-Hub underestimate the entire market for "free" scientific articles. But these mirror sites also inflate Sci-Hub statistics. Presumably, these mirror sites obtain their inventory from Sci-Hub by setting up accounts, thus contributing lots of downloads.

***

Even if VPN and seasonality problems are resolved, the total number of downloads should be adjusted for population. The most appropriate adjustment factor is the population of scientists, but that statistic may be difficult to obtain. A useful proxy might be the number of STEM degrees by country - obtained from a UNESCO survey (link).

A metric of the type "number of Sci-Hub downloads per STEM degree" sounds odd and useless. I'd argue it's better than the unadjusted total number of Sci-Hub downloads. Just don't focus on the absolute values but the relative comparisons between countries. Even better, we can convert the absolute values into an index to focus attention on comparisons.

## Why you should expunge the defaults from Excel or (insert your favorite graphing program)

##### Sep 25, 2020

Yesterday, I posted the following chart in the post about Cornell's Covid-19 case rate after re-opening for in-person instruction.

This is an edited version of the chart used in Peter Frazier's presentation.

The original chart carries with it the burden of Excel defaults.

What did I change and why?

I switched away from the default color scheme, which ignores the relationships between the two lines. In particular, the key comparison on this chart should be the actual case rate versus the nominal case rate. In addition, the three lines at the top are related as they all come from the same underlying mathematical model. I used the same color but different shades.

Also, instead of placing the legend as far away from the data labels as possible, I moved the line labels next to the data labels.

Instead of daily date labels, I moved to weekly labels, and set the month names on a separate level than the day names.

The dots were removed from the top three lines but I'd have retained them, perhaps with some level of transparency, if I spent more time making the edits. I'd definitely keep the last dot to make it clear that the blue lines contain one extra dot.

***

Every graphing program has defaults, typically computed by some algorithm tuned to the average chart. Don't settle for the average chart. Get rid of any default setting that slows down understanding.

## A testing mess: one chart, four numbers, four colors, three titles, wrong units, wrong lengths, wrong data

##### Aug 25, 2020

Twitterstan wanted to vote the following infographic off the island:

(The publisher's website is here but I can't find a direct link to this graphic.)

The mishap is particularly galling given the controversy swirling around this year's A-Level results in the U.K. For U.S. readers, you can think of A-Levels as SAT Subject Tests, which in the U.K. are required of all university applicants, and represent the most important, if not the sole, determinant of admissions decisions. Please see the upcoming post on my book blog for coverage of the brouhaha surrounding the statistical adjustments (to be posted sometime this week, it's here.).

The first issue you may notice about the chart is that the bar lengths have no relationship with the numbers printed on them. Here is a scatter plot correlating the bar lengths and the data.

As you can see, nothing.

Then, you may wonder what the numbers mean. The annotation at the bottom right says "Average number of A level qualifications per student". Wow, the British (in this case, English) education system is a genius factory - with the average student mastering close to three thousand subjects in secondary (high) school!

TES is the cool name for what used to be the Times Educational Supplement. I traced the data back to Ofqual, which is the British regulator for these examinations. This is the Ofqual version of the above chart:

The data match. You may see that the header of the data table reads "Number of students in England getting 3 x A*". This is a completely different metric than number of qualifications - in fact, this metric measures geniuses. "A*" is the U.K. equivalent of "A+". When I studied under the British system, there was no such grade. I guess grade inflation is happening all over the world. What used to be A is now A+, and what used to be B is now A. Scoring three A*s is tops - I wonder if this should say 3 or more because I recall that you can take as many subjects as you desire but most students max out at three (may have been four).

The number of students attaining the highest achievement has increased in the last two years compared to the two years before. We can't interpret these data unless we know if the number of students also grew at similar rates.

The units are students while the units we expect from the TES graphic should be subjects. The cutoff for the data defines top students while the TES graphic should connote minimum qualification, i.e. a passing grade.

***
Now, the next section of the Ofqual infographic resolves the mystery. Here is the chart:

This dataset has the right units and measurement. There is almost no meaningful shift in the last four years. The average number of qualifications per student is only different at the second decimal place. Replacing the original data with this set removes the confusion.

While I was re-making this chart, I also cleaned out the headers and sub-headers. This is an example of software hegemony: the designer wouldn't have repeated the same information three times on a chart with four numbers if s/he wasn't prompted by software defaults.

***

The corrected chart violates one of the conventions I described in my tutorial for DataJournalism.com: color difference should reflect data difference.

In the following side-by-side comparison, you see that the use of multiple colors on the left chart signals different data - note especially the top and bottom bars which carry the same number, but our expectation is frustrated.

***

[P.S. 8/25/2020. Dan V. pointed out another problem with these bar charts: the bars were truncated so that the bar lengths are not proportional to the data. The corrected chart is shown on the right below:

## Cornell must remove the logs before it reopens the campus in the fall

##### Jul 06, 2020

Against all logic, Cornell announced last week it would re-open in the fall because a mathematical model under development by several faculty members and grad students predicts that a "full re-opening" would lead to 80 percent fewer infections than a scenario of full virtual instruction. That's what was reported by the media.

The model is complicated, with loads of assumptions, and the report is over 50 pages long. I will put up my notes on how they attained this counterintuitive result in the next few days. The bottom line is - and the research team would agree - it is misleading to describe the analysis as "full re-open" versus "no re-open". The so-called full re-open scenario assumes the entire community including students, faculty and staff submit to a full program of test-trace-isolate, including (mandatory) PCR diagnostic testing once every five days throughout the 16-week semester, and immediate quarantine and isolation of new positive cases, as well as those in contact with such persons, plus full compliance with this program. By contrast, it assumes students do not get tested in the online instruction scenario. In other words, the researchers expect Cornell to get done what the U.S. governments at all levels failed to do until now.

[7/8/2020: The post on the Cornell model is now up on the book blog. Here.]

The report takes us back to the good old days of best-base-worst-case analysis. There is no data for validating such predictions so they performed sensitivity analyses, defined as changing one factor at a time assuming all other factors are fixed at "nominal" (i.e. base case) values. In a large section of the report, they publish a series of charts of the following style:

Each line here represents one of the best-base-worst cases (respectively, orange-blue-green). Every parameter except one is given the "nominal" value (which represents the base case). The parameter that is manpulated is shown on the horizontal axis, and for the above chart, the variable is the assumption of average number of daily contacts per person. The vertical axis shows the main outcome variable, which is the percentage of the community infected by the end of term.

This flatness of the lines in the above chart appears to say that the outcome is quite insensitive to the change in the average daily contact rate under all three scenarios - until the daily contact rises above 10 per person per day. It also appears to show that the blue line is roughly midway between the orange and the green so the percent infected is slightly less-than halved under the optimistic scenario, and a bit more than doubled under the pessimistic scenario, relative to the blue line.

Look again.

The vertical axis is presented in log scale, and only labeled at values 1% and 10%. About midway between 1 and 10 on the horizontal axis, the outcome value has already risen above 10%. Because of the log transformation, above 10%, each tick represents an increase of 10% in proportion. So, the top of the vertical axis indicates 80% of the community being infected! Nothing in the description or labeling of the vertical axis prepares the reader for this.

The report assumes a fixed value for average daily contacts of 8 (I rounded the number for discussion), which is invariable across all three scenarios. Drawing a vertical line about eight-tenths of the way towards 10 appears to signal that this baseline daily contact rate places the outcome in the relatively flat part of the curve.

Look again.

The horizontal axis too is presented in log scale. To birth one log-scale may be regarded as a misfortune; to birth two log scales looks like carelessness.

Since there exists exactly one tick beyond 10 on the horizontal axis, the right-most value is 20. The model has been run for values of average daily contacts from 1 to 20, with unit increases. I can think of no defensible reason why such a set of numbers should be expressed in a log scale.

For the vertical axis, the outcome is a proportion, which is confined to within 0 percent and 100 percent. It's not a number that can explode.

***

Every log scale on a chart is birthed by its designer. I know of no software that automatically performs log transforms on data without the user's direction. (I write this line with trepidation wishing that I haven't planted a bad idea in some software developer's head.)

Here is what the shape of the original data looks like - without any transformation. All software (I'm using JMP here) produces something of this type:

At the baseline daily contact rate value of 8, the model predicts that 3.5% of the Cornell community will get infected by the end of the semester (again, assuming strict test-trace-isolate fully implemented and complied).  Under the pessimistic scenario, the proportion jumps to 14%, which is 4 or 5 times higher than the base case. In this worst-case scenario, if the daily contact rate were about twice the assumed value (just over 16), half of the community would be infected in 16 weeks!

I actually do not understand how there could only be 8 contacts per person per day when the entire student body has returned to 100% in-person instruction. (In the report, they even say the 8 contacts could include multiple contacts with the same person.) I imagine an undergrad student in a single classroom with 50 students. This assumption says the average student in this class only comes into contact with at most 8 of those. That's one class. How about other classes? small tutorials? dining halls? dorms? extracurricular activities? sports? parties? bars?

Back to graphics. Something about the canonical chart irked the report writers so they decided to try a log scale. Here is the same chart with the vertical axis in log scale:

The log transform produces a visual distortion. On the right side, where the three lines are diverging rapidly, the log transform pulls them together. On the left side, where the three lines are close together, the log transform pulls them apart.

Recall that on the log scale, a straight line is exponential growth. Look at the green line (worst case). That line is approximately linear so in the pessimistic scenario, despite assuming full compliance to a strict test-trace-isolate regimen, the cases are projected to grow exponentially.

Something about that last chart still irked the report writers so they decided to birth a second log scale. Here is the chart they ultimately settled on:

As with the other axis, the effect of the log transform is to squeeze the larger values (on the right side) and spread out the smaller values (on the left side). After this cosmetic surgery, the left side looks relatively flat while the right side looks steep.

In the next version of the Cornell report, they should replace all these charts with ones using linear scales.

***

Upon discovering this graphical mischief, I wonder if the research team received a mandate that includes a desired outcome.

[P.S. 7/8/2020. For more on the Cornell model, see this post.]

## Gazing at petals

##### Jan 21, 2020

Reader Murphy pointed me to the following infographic developed by Altmetric to explain their analytics of citations of journal papers. These metrics are alternative in that they arise from non-academic media sources, such as news outlets, blogs, twitter, and reddit.

The key graphic is the petal diagram with a number in the middle.

I have a hard time thinking of this object as “data visualization”. Data visualization should visualize the data. Here, the connection between the data and the visual design is tenuous.

There are eight petals arranged around the circle. The legend below the diagram maps the color of each petal to a source of data. Red, for example, represents mentions in news outlets, and green represents mentions in videos.

Each petal is the same size, even though the counts given below differ. So, the petals are like a duplicative legend.

The order of the colors around the circle does not align with its order in the table below, for a mysterious reason.

Then comes another puzzle. The bluish-gray petal appears three times in the diagram. This color is mapped to tweets. Does the number of petals represent the much higher counts of tweets compared to other mentions?

To confirm, I pulled up the graphic for a different paper.

Here, each petal has a different color. Eight petals, eight colors. The count of tweets is still much larger than the frequencies of the other sources. So, the rule of construction appears to be one petal for each relevant data source, and if the total number of data sources fall below eight, then let Twitter claim all the unclaimed petals.

A third sample paper confirms this rule:

None of the places we were hoping to find data – size of petals, color of petals, number of petals – actually contain any data. Anything the reader wants to learn can be directly read. The “score” that reflects the aggregate “importance” of the corresponding paper is found at the center of the circle. The legend provides the raw data.

***

Some years ago, one of my NYU students worked on a project relating to paper citations. He eventually presented the work at a conference. I featured it previously.

Notice how the visual design provides context for interpretation – by placing each paper/researcher among its peers, and by using a relative scale (percentiles).

***

I’m ignoring the D corner of the Trifecta Checkup in this post. For any visualization to be meaningful, the data must be meaningful. The type of counting used by Altmetric treats every tweet, every mention, etc. as a tally, making everything worth the same. A mention on CNN counts as much as a mention by a pseudonymous redditor. A pan is the same as a rave. Let’s not forget the fake data menace (link), which  affects all performance metrics.