Graph workflow and defaults wreak havoc

For the past week or 10 days, every time I visited one news site, it insisted on showing me an article about precipitation in North Platte. It's baiting me to write a post about this lamentable bar chart (link):

Northplatte_rainfall

***

This chart has problems, and the problems start with the tooling, which dictates a workflow.

I imagine what the chart designer had to deal with.

For a bar chart, the tool requires one data series to be numeric, and the other to be categorical. A four-digit year is a number, which can be treated either as numeric or categorical. In most cases, and by default, numbers are considered numeric. To make this chart, the user asked the tool to treat years as categorical.
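I don't know which tool produced this chart, but the same fork in the road appears in any plotting library. Below is a minimal sketch in Python (pandas and matplotlib), using made-up rainfall values rather than the actual North Platte data, to show how casting the years to strings forces the categorical treatment a bar chart needs:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up years and rainfall values (inches), for illustration only
df = pd.DataFrame({
    "year": [1934, 1936, 1952, 1974, 1989, 2002, 2012, 2020, 2021, 2022, 2023],
    "rainfall": [9.8, 10.2, 11.5, 12.0, 12.3, 12.8, 13.1, 13.4, 13.9, 14.2, 8.5],
})

# Left as numbers, many tools would default to a time-series line chart.
# Casting the years to strings makes them categorical, which yields a bar chart.
df["year_cat"] = df["year"].astype(str)

fig, ax = plt.subplots()
ax.barh(df["year_cat"], df["rainfall"])
ax.set_xlabel("Rainfall in inches")
plt.show()
```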

Junkcharts_northplattedry_datatypes

Many tools treat categories as distinct entities ("nominal"), mapping each category to a distinct color. So they have 11 colors for 11 years, which is surely excessive.

This happens because the year data are not truly categorical. These eleven years were picked based on the amount of rainfall. No year appears twice, nor could it. The years are just irregularly spaced indices. Nevertheless, the tool misbehaves if the year data are regarded as numeric. (It automatically selects a time-series line chart, because someone's data visualization flowchart says so.) Mis-specifying the data type in order to trick the tool has consequences.

The designer's intention is to compare the current year 2023 to the driest years in history. This is obvious from the subtitle in which 2023 is isolated and its purple color is foregrounded.

Junkcharts_northplattedry_titles

How unfortunate then that among the 11 colors, this tool grabbed 4 variations of purple! I like to think that the designer wanted to keep 2023 purple, and turn the other bars gray -- but the tool thwarted this effort.

Junkcharts_northplattedry_purples

The tool does other offensive things. By default, it makes a legend for categorical data. I like the placement of the legend right beneath the title, a recognition that on most charts, the reader must look at the legend first to comprehend what's on the chart.

Not so in this case. The legend is entirely redundant. Removing the legend does not affect our cognition one bit. That's because the colors encode nothing.

Worse, the legend sows confusion because it presents the same set of years in chronological order while the bars below are sorted by amount of precipitation: thus, the order of colors in the legend differs from that in the bar chart.

Junkcharts_northplattedry_legend

I can imagine the frustration of the designer who finds out that the tool offers no option to delete the legend. (I don't know this particular tool but I have encountered tools that are rigid in this manner.)

***

Something else went wrong. What's the variable being plotted on the numeric (horizontal) axis?

The answer is inches of rainfall, but that answer is not found anywhere on the chart. How is it possible that a graphing tool does not indicate the variable being plotted?

I imagine the workflow went like this: by default, the tool puts down an axis label that uses the name of the column holding the data. That column may have a name that is not reader-friendly, e.g. PRECIP. The designer edits the label to "Rainfall in inches". Being a fan of the Economist graphics style, they move the axis label to the chart title area.
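As a rough reconstruction of those steps, again in Python with matplotlib, again with invented data, and with the column name PRECIP assumed purely for illustration:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented data with a database-style column name
df = pd.DataFrame({"YEAR": ["1934", "2002", "2023"], "PRECIP": [9.8, 12.8, 8.5]})

fig, ax = plt.subplots()
ax.barh(df["YEAR"], df["PRECIP"])
ax.set_xlabel("PRECIP")                    # tool default: the raw column name

ax.set_xlabel("Rainfall in inches")        # step 1: a reader-friendly axis label

# Step 2 (Economist style): drop the axis label and fold the unit into the title area
ax.set_xlabel("")
ax.set_title("Rainfall in inches", loc="left", fontsize=10)
fig.suptitle("North Platte is experiencing a historically dry year", x=0.125, ha="left")
plt.show()
```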

The designer now works the chart title. The title is made to spell out the story, which is that North Platte is experiencing a historically dry year. Instead of mentioning rainfall, the new title emphasizes the lack thereof.

The individual steps of this workflow make a lot of sense. It's great that the title is informative and tells the story. It's great that the axis label was fixed to describe rainfall in words, not database-speak. But the end result is a confusing mess.

The reader must now infer that the values being plotted are inches of rainfall.

Further, the tool imposes a default sorting of the bars. The bars run from longest to shortest; in this case, the longest bar has the most rainfall. After reading the title, we expect to find the top 11 driest years, running from the driest of the driest to the least dry of the driest. But we encounter the opposite order.

Junkcharts_northplattedry_sorting

Most graphics software behaves like this because it plots the ranks of the categories, with the driest being rank 1, counting up. Because the vertical axis moves upward from zero, the top-ranked item ends up at the bottom of the chart.
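If the tool allows it, the fix is a one-liner. A sketch in matplotlib, with the same invented data as above (sort_values and invert_yaxis are my choices, not necessarily options in the original tool):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented rainfall values, sorted so the driest year comes first
df = pd.DataFrame({
    "year": ["2023", "1934", "1989", "2002", "2012"],
    "rainfall": [8.5, 9.8, 12.3, 12.8, 13.1],
}).sort_values("rainfall")

fig, ax = plt.subplots()
ax.barh(df["year"], df["rainfall"])   # default: the first (driest) row is drawn at the bottom
ax.invert_yaxis()                     # flip the axis so the driest year reads first, at the top
plt.show()
```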

***

_trifectacheckup_image

Moving now from the V corner to the D corner of the Trifecta checkup (link), I can't end this post without pointing out that the comparisons shown on the chart don't work. It's the first few months of 2023 versus the full years of the others.

The fix is to plot the same number of months for all years. This can be done in two ways: find the partial year data for the historical years, or project the 2023 data for the full year.

(If the rainy season is already over, then the chart will look exactly the same at the end of 2023 as it does now. In that case, I'd just add a note to explain this.)
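For the projection route, the crudest version is simple proration, which I'd only trust if rainfall were spread evenly across the year (it usually isn't, which is the point of the parenthetical above). A sketch, with invented numbers:

```python
# Naive full-year projection from year-to-date rainfall (invented numbers)
ytd_rainfall = 8.5        # inches observed so far in 2023
months_elapsed = 9
projected_full_year = ytd_rainfall * 12 / months_elapsed
print(round(projected_full_year, 1))   # 11.3
```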

***

Here is a version of the chart after doing away with unhelpful default settings:


Redo_junkcharts_northplattedry


Showing both absolute and relative values on the same chart 1

Visual Capitalist has a helpful overview on the "uninsured" deposits problem that has become the talking point of the recent banking crisis. Here is a snippet of the chart that you can see in full at this link:

Visualcapitalist_uninsureddeposits_top

This is in the infographics style. It's a bar chart that shows the top X banks. Even though the headline says "by uninsured deposits", the sort order is really based on the proportion of deposits that are uninsured, i.e. residing in accounts that exceed $250K. They used a red color to highlight the two failed banks, both of which have at least 90% of deposits uninsured.

The right column provides further context: the total amounts of deposits, presented both as a list of numbers and as a column of bubbles. As readers know, bubbles are not self-sufficient; if the list of numbers were removed, the bubbles would lose most of their power of communication. Big, small, but how much smaller?

There are little nuggets of text in various corners that provide other information.

Overall, this is a pretty good one as far as infographics go.

***

I'd prefer to elevate information about the Too Big to Fail banks (which are hiding in plain sight). Addressing this surfaces the usual battle between relative and absolute values. While the smaller banks have some of the highest concentrations of uninsured deposits, each TBTF bank holds many multiples of the uninsured dollars held by any of the smaller banks.

Here is a revised version:

Redo_visualcapitalist_uninsuredassets_1

The banks are still ordered in the same way, by the proportions of uninsured value. The data being plotted are not the proportions but the actual deposit amounts. Thus, the three TBTF banks (Citibank, Chase and Bank of America) stand out from the crowd. Aside from Citibank, the other two have relatively moderate proportions of uninsured assets, but the red bars for any of these three dwarf those of the smaller banks.

Notice that I added the gray segments, which portray the amount of deposits that are FDIC protected. I did this not just to show the relative sizes of the banks. Having the other part of the deposits allows readers to answer additional questions, such as which banks have the most insured deposits. The gray segments also visually present the relative proportions.
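The construction is a plain stacked bar chart. A sketch in matplotlib, with rough illustrative figures standing in for the actual deposit data:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative figures only (in $ billions), not the Visual Capitalist data
banks = pd.DataFrame({
    "bank": ["Bank of America", "Chase", "Citibank", "Signature Bank", "Silicon Valley Bank"],
    "deposits": [1900, 2000, 1300, 89, 170],
    "pct_uninsured": [0.37, 0.47, 0.74, 0.90, 0.94],
}).sort_values("pct_uninsured")

uninsured = banks["deposits"] * banks["pct_uninsured"]
insured = banks["deposits"] - uninsured

fig, ax = plt.subplots()
ax.barh(banks["bank"], uninsured, color="crimson", label="Uninsured")
ax.barh(banks["bank"], insured, left=uninsured, color="lightgray", label="FDIC insured")
ax.set_xlabel("Deposits ($ billions)")
ax.legend()
plt.show()
```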

***

The most amazing part of this dataset is the amount of uninsured money. I'm trying to think who these account holders are. It would seem like a very small collection of people and/or businesses would be holding these accounts. If they are mostly businesses, is FDIC insurance designed to protect business deposits? If they are mostly personal accounts, then surely only very wealthy individuals hold most of these accounts.

In the above chart, I'm assuming that deposits and assets are referring to the same thing. This may not be the correct interpretation. Deposits may be only a portion of the assets. It would be strange though that the analysts only have the proportions but not the actual deposit amounts at these banks. Nevertheless, until proven otherwise, you should see my revision as a sketch - what you can do if you have both the total deposits and the proportions uninsured.


Finding the story in complex datasets

In CT Mirror's feature about Connecticut, which I wrote about in the previous post, there is one graphic that did not rise to the same level as the others.

Ctmirror_highschools

This section deals with graduation rates of the state's high school districts. The above chart focuses on exactly five districts. The line charts are organized in a stack. No year labels are provided. The time window is the 11 years from 2010 to 2021. The column of numbers shows the difference in graduation rates over the entire time window.

The five lines look basically the same, if we ignore what look to be noisy year-to-year fluctuations. This is due to the weird aspect ratio imposed by stacking.

Why are those five districts chosen? Upon investigation, we learn that these are the five districts with the biggest improvement in graduation rates during the 11-year time window.

The same five schools also had some of the lowest graduation rates at the start of the analysis window (2010). This must be so: if a school graduated 90% of its class in 2010, it would be mathematically impossible for it to attain a 35-percentage-point improvement! This is an unsatisfying feature of the dataviz.

***

In preparing an alternative version, I start by imagining how readers might want to utilize a visualization of this dataset. I assume that the readers may have certain school(s) they are particularly invested in, and want to see its/their graduation performance over these 11 years.

How does having the entire dataset help? For one thing, it provides context. What kind of context is relevant? As discussed above, it's futile to compare a school at the top of the ranking to one that is near the bottom. So I created groups of schools. Each school is compared to other schools that had comparable graduation rates at the start of the analysis period.
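A sketch of the grouping step in pandas, assuming a long-format table of district-year graduation rates (the data and band boundaries below are invented, chosen to roughly echo the bands used in the charts):

```python
import pandas as pd

# Invented long-format data: one row per district per year
df = pd.DataFrame({
    "district": ["Amistad", "Amistad", "District B", "District B"],
    "year": [2010, 2021, 2010, 2021],
    "grad_rate": [58, 93, 87, 91],
})

# Band each district by its starting (2010) graduation rate, then compare within bands
start = df[df["year"] == 2010].set_index("district")["grad_rate"]
bands = pd.cut(start, bins=[0, 49, 74, 84, 89, 100],
               labels=["<50%", "50-74%", "75-84%", "85-89%", "90%+"])
df["band"] = df["district"].map(bands)
print(df)
```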

Amistad School District, which takes pole position in the original dataviz, graduated only 58% of its pupils in 2010 but vastly improved its graduation rate, by 35 percentage points, over the period. In the chart below (left panel), I plotted all of the schools that had graduation rates between 50 and 74% in 2010. The chart shows that while Amistad is a standout, almost all schools in this group experienced steady improvements. (Whether this phenomenon represents true improvement, or just grade inflation, we can't tell from this dataset alone.)

Redo_junkcharts_ctmirrorhighschoolsgraduation_1

The right panel shows the group of schools with the next higher level of graduation rates in 2010. Almost all of the schools in this group also increased their graduation rates, though the rate of improvement is lower than in the previous group.

The next set of charts shows school districts that had already achieved excellent graduation rates (over 85%) by 2010. The most interesting group consists of those with rates of 85-89% in 2010. Their performance in 2021 is the most unpredictable of all the school groups: the majority of districts did even better, while others regressed.

Redo_junkcharts_ctmirrorhighschoolsgraduation_2

Overall, there is less variability than I'd expect in the top two school groups. They generally appear to have raised or maintained their already-high graduation rates. (Note that the scale of each chart is different, and many of the lines in the second set of charts move within a range of a few percentage points.)

One more note about the charts: the trend lines are "smoothed" to focus on the trends rather than the year-to-year variability. Because of smoothing, there is some awkward-looking imprecision, e.g. the end-to-end differences read from the curves differ from the observed differences in the data. These discrepancies could easily be fixed if these charts were to be published.
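The post doesn't pin down the smoother; LOWESS is one common choice, and a sketch of it (using statsmodels, with invented yearly rates) shows the end-point issue directly:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

# Invented yearly graduation rates for one district, 2010-2021
years = np.arange(2010, 2022)
rates = np.array([58, 60, 59, 64, 67, 66, 72, 75, 78, 83, 88, 93])

# frac controls the amount of smoothing; larger values iron out more year-to-year noise
smoothed = lowess(rates, years, frac=0.6, return_sorted=False)

# The smoothed curve's endpoints need not match the observed 2010 and 2021 values,
# which is the source of the "awkward-looking imprecision" mentioned above
print(rates[0], round(smoothed[0], 1), rates[-1], round(smoothed[-1], 1))
```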


Funnels and scatters

I took a peek at some of the work submitted by Ray Vella's students in his NYU dataviz class recently.

The following chart by Hosanah Bryan caught my eye:

Rich Get Richer_Hosanah Bryan (v2)

The data concern the GDP gap between rich and poor regions in various countries. In some countries, especially in the U.K., the gap is gigantic. In other countries, like Spain and Sweden, the gap is much smaller.

The above chart uses a funnel metaphor to organize the data, although the funnel does not add more meaning (not that it has to). Between that, the color scheme and the placement of text, it's visually clean and pleasant to look at.

The data being plotted are messy. They are not actual currency values of GDP. Each number is an index, representing the relative level of the GDP gap in a given year and country. The gaps shown by the colored bars are differences in these indices 15 years apart. (The students were given this dataset to work with.)

So the chart is very hard to understand if one focuses on the underlying data. Nevertheless, the same visual form can hold other datasets which are less complicated.

One can nitpick about the slight misrepresentation of the values due to the slanted edges on both sides of the bars. This is yet another instance of the tradeoff between beauty and precision.

***

The next chart by Liz Delessert engages my mind for a different reason.

The Rich Get Richerv2

The scatter plot sets up four quadrants. The top right is "everyone gets richer". The top left, where most of the dots lie, is where "the rich get richer, the poor get poorer". This chart shows thoughtfulness about organizing the data, and about storytelling.

The grid setup cues readers toward a particular way of looking at the data.

But power comes with responsibility. Such scatter plots are particularly susceptible to the choice of data, in this case, countries. It is tempting to conclude that there are no countries in which everyone gets poorer. But that statement likely tells us more about which countries were chosen than about the real story.

I'd like to see this chart form applied to other, simpler data transformations. For example, we can start with the percent change in GDP computed separately for the rich and for the poor regions. Then we can form the ratio of these two percent changes.
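A sketch of that transformation in pandas, with invented GDP per capita figures for one country:

```python
import pandas as pd

# Invented GDP per capita for the rich and poor regions of one country, in 2000 and 2015
gdp = pd.DataFrame({"rich": [42000, 55000], "poor": [18000, 19000]}, index=[2000, 2015])

pct_change = gdp.loc[2015] / gdp.loc[2000] - 1      # percent change, computed separately
ratio = pct_change["rich"] / pct_change["poor"]     # > 1 means the rich regions grew faster
print(pct_change.round(3))
print(round(ratio, 1))
```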



A German obstacle course

Tagesschau_original

A twitter user sent me this chart from Germany.

It came with a translation:

"Explanation: The chart says how many car drivers plan to purchase a new state-sponsored ticket for public transport. And of those who do, how many plan to use their car less often."

Because visual language should be universal, we shouldn't be deterred by not knowing German.

The structure of the data can be readily understood: we expect three values that add up to 100% from the pie chart. The largest category accounts for 58% of the data, followed by the blue category (40%). The last and smallest category therefore has 2% of the data.

The blue category is of the most interest, and the designer breaks that up into four sub-groups, three of which are roughly similarly popular.

The puzzle is the identities of these categories.

The sub-categories are directly labeled so these are easy for German speakers. From a handy online translator, these labels mean "definitely", "probably", "rather not", "definitely not". Well, that's not too helpful when we don't know what the survey question is.

According to our correspondent, the question should be "of those who plan to buy the new ticket, how many plan to use their car less often?"

I suppose the question is found above the column chart, under the car icon. The translator dutifully outputs "Thus rarer (i.e. less) car use". There is no visual cue to let readers know we are supposed to read the right hand side as a single column. In fact, this reader was reading horizontally, row by row, from top to bottom.

Now, the two icons on the left and the middle of the top row should map to not buying and buying the ticket. The check mark and cross convey that message. But... what do these icons map to on the chart below? We get no clue.

In fact, the will-buy ticket group is the 40% blue category while the will-not group is the 58% light gray category.

What about the dark gray thin sector? Well, one needs to read the fine print. The footnote says "I don't know/ no response".

Since this group is small and uninformative, it's fine to push it into the footnote. However, the choice of a dark color, and placing it at the 12-o'clock angle of the pie chart run counter to de-emphasizing this category!

Another twitter user visually depicts the journey we take to understand this chart:

Tagesschau_reply

The structure of the data is revealed better with something like this:

Redo_tagesschau_newticket

The chart doesn't need this many colors but why not? It's summer.



Improving simple bar charts

Here's another bar chart I came across recently. The chart - apparently published by Kaggle - appeared to present challenges data scientists face in industry:

Kaggle

This chart is pretty standard, and inoffensive. But we can still make it better.

Version 1

Redo_kaggle_nodecimals

I removed the decimals from the data labels.

Version 2

Redo_kaggle_noaxislabels

Since every bar is labelled, is anyone looking at the axis labels?

Version 3

Redo_kaggle_nodatalabels

You love axis labels? Then let's drop the data labels.
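For anyone who wants to try these tweaks, here is a sketch in matplotlib with invented survey numbers (not the actual Kaggle percentages), covering versions 1 through 3:

```python
import matplotlib.pyplot as plt

# Invented percentages, not the actual Kaggle survey results
challenges = {"Dirty data": 36, "Lack of talent": 30, "Company politics": 27, "Unclear questions": 22}

fig, ax = plt.subplots()
bars = ax.barh(list(challenges.keys()), list(challenges.values()))
ax.invert_yaxis()

# Versions 1 and 2: data labels without decimals, axis labels dropped
ax.bar_label(bars, fmt="%.0f%%")
ax.set_xticks([])

# Version 3 would instead keep the axis ticks and skip the bar_label call
plt.show()
```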

Version 4

Redo_kaggle_categories

Ahh, so data scientists struggle with data problems, and people issues. They don't need better tools.


Ringing in the data

There is a lot of great stuff at Visual Capitalist.

This circular design isn't one of their best.

Visualcapitalist_GDPDebt2021_1800px_Finalized

***

A self-sufficiency test helps diagnose the problem. Notice that every data point is printed on the diagram. If the data labels were removed, there isn't much one can learn from the chart other than the ranking of countries from most indebted to least. It would be impossible to know the difference in debt levels between any pair of countries.

In other words, the data labels rather than visual elements are doing most of the work. In a good dataviz, we like the visual elements to carry the weight.

***

The concentric rings embed a visual hierarchy: Japan is singled out; the next tier of countries includes Sudan, Greece, Eritrea, Cape Verde, Italy, Suriname, and Barbados; and so on.

What is the clustering algorithm? What determines which countries fall into the same group?

It's implicitly determined by how many countries can fit inside the next ring. The designer carefully computed the number of rings, the widths of the rings, the density of the circles, etc. in such a way that there is no unsightly white space on the outer ring. Score a 10/10 for effort!
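To see why the chart form drives the grouping, consider the rough geometry: the number of country circles that fit on a ring grows with the ring's radius. A toy calculation (my simplification, ignoring padding between circles):

```python
import math

# How many circles of diameter d fit around a ring of radius R, ignoring padding
def ring_capacity(R, d):
    return math.floor(2 * math.pi * R / d)

print(ring_capacity(R=1, d=1))   # about 6 circles on the innermost ring
print(ring_capacity(R=2, d=1))   # about 12 on the next ring out
```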

So the clustering of countries is not data-driven but constrained by the chart form. This limitation is similar to that found on maps used to illustrate spatial data.



Visualizing composite ratings

A twitter reader submitted the following chart from Autoevolution (link):

Google-maps-is-no-longer-the-top-app-for-navigation-and-offline-maps-179196_1

This is not a successful chart for the simple reason that readers want to look away from it. It's too busy. There is so much going on that one doesn't know where to look.

The underlying dataset is quite common in the marketing world. Through surveys, people are asked to rate some product along a number of dimensions (here, seven). Each dimension has a weight, and the weighted sum of the dimension ratings becomes a composite rating (shown here in gray).
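In code, the composite is a one-liner. The weights and ratings below are invented; only "Customers" and "Developers Ecosystem" are dimension names taken from the discussion that follows, and the rest are placeholders:

```python
# Invented weights (summing to 1) and dimension ratings for one product
weights = {"Customers": 0.25, "Developers Ecosystem": 0.15, "Features": 0.20,
           "Pricing": 0.15, "Coverage": 0.10, "Offline": 0.10, "Privacy": 0.05}
ratings = {"Customers": 8.5, "Developers Ecosystem": 7.0, "Features": 9.0,
           "Pricing": 6.5, "Coverage": 8.0, "Offline": 5.5, "Privacy": 7.5}

# The composite rating is the weighted sum across dimensions
composite = sum(weights[d] * ratings[d] for d in weights)
print(composite)
```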

Nothing in the chart stands out as particularly offensive, even though the overall effect is repellent. Adding the overall rating on top of each column is not the best idea, as it distorts the perception of the column heights. But with all these ingredients, the food comes out bland.

***

The key is editing. Find the stories you want to tell, and then deconstruct the chart to showcase them.

I start with a simple way to show the composite ranking, without any fuss:

Redo_junkcharts_autoevolution_top

[Since these are mockups, I have not copied all of the data, just the top 11 items.]

Then, I want to know if individual products have particular strengths or weaknesses along specific dimensions. In a ranking like this, one should expect that some component ratings correlate highly with the overall rating while other components deviate from the overall average.

An example of correlated ratings is the Customers dimension.

Redo_junkcharts_autoevolution_customer

The general pattern of the red dots clings closely to that of the gray bars. The gray bars are the overall composite ratings (re-scaled to the rating range for the Customers dimension). This dimension does not tell us more than what we know from the composite rating.
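The post doesn't say exactly how the composite was re-scaled; a simple min-max rescaling onto the range of the Customers ratings would do it. A sketch with invented numbers:

```python
import numpy as np

# Invented composite ratings and Customers-dimension ratings for a handful of products
composite = np.array([38.2, 35.6, 33.1, 30.4, 28.9])
customers = np.array([9.1, 8.4, 8.0, 7.2, 6.8])

# Min-max rescale the composite onto the Customers range, so gray bars and red dots share one axis
lo, hi = customers.min(), customers.max()
rescaled = lo + (composite - composite.min()) / (composite.max() - composite.min()) * (hi - lo)
print(rescaled.round(2))
```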

By contrast, the Developers Ecosystem dimension provides additional information.

Redo_junkcharts_autoevolution_developer

Esri, AzureMaps and Mapbox performed much better on this dimension than they did on average.

***

The following construction puts everything together in one package:

Redo_mapsplatformsratings.002


Two commendable student projects, showing different standards of beauty

A few weeks ago, I did a guest lecture for Ray Vella's dataviz class at NYU, and discussed a particularly hairy dataset that he assigns to students.

I'm happy to see the work of the students, and there are two pieces in particular that show promise.

The following dot plot by Christina Barretto shows the disparities between the richest and poorest nations increasing between 2000 and 2015.

BARRETTO  Christina - RIch Gets Richer Homework - 2021-04-14

The underlying dataset has the average GDP per capita for the richest and the poorest regions in each of nine countries, for two years (2000 and 2015). Within each year, the data are indexed to the national average income (= 100). In the U.K., the gap increased from around 800 to 1,100 over the 15 years. It's evidence that the richer regions are getting richer, and the poorer regions are getting poorer.
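To make the indexing concrete, here is the arithmetic with invented numbers (not the actual U.K. figures):

```python
# Invented GDP per capita figures, for illustration of the indexing only
national_avg = 40000           # the base: index = 100
richest_region = 400000
poorest_region = 32000

rich_index = richest_region / national_avg * 100    # 1000
poor_index = poorest_region / national_avg * 100    # 80
gap = rich_index - poor_index                       # 920, on the same index scale
print(rich_index, poor_index, gap)
```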

(For those into interpreting data, you should notice that I didn't say the rich are getting richer. During the lecture, I explained how to interpret regional averages.)

Christina's chart reflects the tidy, minimalist style advocated by Tufte. The countries are sorted by the 2000-to-2015 difference, with Britain showing up as an extreme outlier.

***

The next chart by Adrienne Umali is more infographic than Tufte.

Adrienne Umali_v2

It's great story-telling. The top graphic explains the underlying data. It shows the four numbers and how the gap between the richest and poorest regions is computed. Then, it summarizes these four numbers into a single metric, "gap increase". She chooses to measure the change as a ratio while Christina's chart uses the difference, encoded as a vertical line.

Adrienne's chart is successful because she filters our attention to a single country - the U.S. It's much too hard to drink data from nine countries in one gulp.

This then sets her up for the second graphic. Now, she presents the other eight countries. Because of the work she did in the first graphic, the reader understands what those red and green arrows mean, without having to know the underlying index values.

Two small suggestions: a) order the countries from greatest to smallest change; b) leave off the decimals. These are minor flaws in a brilliant piece of work.



These are the top posts of 2020

It's always very interesting as a writer to look back at a year's worth of posts and find out which ones were most popular with my readers.

Here are the top posts on Junk Charts from 2020:

How to read this chart about coronavirus risk

This post about a New York Times scatter plot dates from February, a time when many Americans were debating whether Covid-19 was just the flu.

Proportions and rates: we are no dupes

This post about an ArsTechnica chart on the effects of Covid-19 by age is an example of designing the visual to reflect the structure of the data.

When the pie chart is more complex than the data

This post shows a 3D pie chart which is worse than a 2D pie chart.

Twitter people upset with that Covid symptoms diagram

This post discusses some complicated graphics designed to illustrate complicated datasets on Covid-19 symptoms.

Cornell must remove the logs before it reopens in the fall

This post is another warning to think twice before you use log scales.

What is the price of objectivity?

This post turns an "objective" data visualization into a piece of visual story-telling.

The snake pit chart is the best election graphic ever

This post introduces my favorite U.S. presidential election graphic, designed by the FiveThirtyEight team.

***

Here is a list of posts that deserve more attention:

Locating the political center

An example of bringing readers as close to the insights as possible

Visualizing change over time

An example of designing data visualization to reflect the structure of multivariate data

Bloomberg made me digest these graphics slowly

An example of simple and thoughtful graphics

The hidden bad assumption behind most dual-axis time-series charts

Read this before you make a dual-axis chart

Pie chart conventions

Read this before you make a pie chart

***
Looking forward to bringing you more content in 2021!

Happy new year.