Using disaggregation in dataviz

This chart appears in a journal article on the use of AI (artificial intelligence) in healthcare (link).

Ai_healthcare_barchart

It's a stacked bar chart in which each bar is subdivided into four segments. The authors are interested in the relative frequency of research using AI by disease type. The chart only shows the top 10 disease types.

What is unusual is that the subdivisions are years. So these authors revealed four years of journal articles, and while the overall ranking of the disease types is by the aggregated four-year total counts, each total count has been disaggregated by color so readers can also see the annual counts.

***

A slight rearrangement yields the following:

Junkcharts_redo_aihealthcare

Most readers will only care about the left chart showing the total counts. More invested readers may consider the colored charts that show annual totals. These are arranged so that the annual counts are easily read and compared.

***

One annoying aspect of this type of presentation is that in almost all cases, the top 10 types in aggregate will not be the top 10 types by individual year. In some of those years, I expect that the 10 types shown do not include all of the top 10 types for a particular year.


The most dangerous day

Our World in Data published this interesting chart about infant mortality in the U.S.

Mostdangerousday

The article that sent me to this chart called the first day of life the "most dangerous day". This dot plot seems to support the notion, as the "per-day" death rate is the highest on the day of birth, and then drops quite fast (note log scale) over the the first year of life.

***

Based on the same dataset, I created the following different view of the data, using the same dot plot form:

Junkcharts_redo_ourworldindata_infantmortality

By this measure, a baby has 99.63% chance of surviving the first 30 days while the survival rate drops to 99.5% by day 180.

There is an important distinction between these two metrics.

The "per day" death rate is the chance of dying on a given day, conditional on having survived up to that day. The chance of dying on day 2 is lower partly because some of the more vulnerable ones have died on day 1 or day 0,  etc.

The survival rate metric is cumulative: it measures how many babies are still alive given they were born on day 0. The survival rate can never go up, so long as we can't bring back the dead.

***

If we are assessing a 5-day-old baby's chance of surviving day 6, the "per-day" death rate is relevant since that baby has not died in the first 5 days.

If the baby has just been born, and we want to know the chance it might die in the first five days (or survive beyond day 5), then the cumulative survival rate curve is the answer. If we use the per-day death rate, we can't add the first five "per-day" death rates It's a more complicated calculation of dying on day 0, then having not died on day 0, dying on day 1, then having not died on day 0 or day 1, dying on day 2, etc.

 


One doesn't have to plot raw data

Visual Capitalist chose a treemap to show us where gold is produced (link):

Viscap_gold2023

The treemap is embedded into a brick of gold. Any treemap is difficult to read, mostly because some block are vertical, others horizontal. A rough understanding is nevertheless possible: the entire global production can be roughly divided into four parts: China plus three other Asian producers account for roughly (not quite) a quarter; "rest of the world" (i.e. all countries not individually listed) is a quarter; Russia and Australia together is again a bit less than a quarter.

***

When I look at datasets that rank countries by some metric, I'm hoping to present insights, rather than the raw data. Insights typically involve comparing countries, or sets of countries, or one country against a set of countries. So, I made the following chart that includes some of these insights I found in the gold production dataset:

Junkcharts_redo_viscap_gold2023

For example, the top 4 producers in Asia account for almost a quarter of the world's output; Canada, U.S. and Australia together also roughly produce a quarter; the rest of the world has a similar output. In Asia, China's output is about the sum of the next 3 producers, which is about the same as U.S. and Canada, which is about the same as the top 5 in Africa.

 


The curse of dimensions

Usually the curse of dimensions concerns data with many dimensions. But today I want to talk about a different kind of curse. This is the curse of dimensions in mapping.

We are only talking about a few dimensions, typically between 3 and 6, so small number of dimensions. And yet it's already a curse. Maps are typically drawn in two dimensions. Those two dimensions are usually spoken for: they show the x- and y-coordinate of space. If we want to include a third, fourth or fifth dimension of data on the map, we have to appeal to colors, shapes, and so on. Cartographers have long realized that adding dimensions involves tradeoffs.

***

Andrew featured some colored bubble maps in a recent post. Here is one example:

Dorlingmap_percenthispanic

The above map shows the proportion of population in each U.S. county that is Hispanic. Each county is represented by a bubble pinned to the centroid of the county. The color of the bubble shows the data, divided into demi-deciles so they are using a equal-width binning method. The size of a bubble indicates the size of a county.

The map is sometimes called a "Dorling map" after its presumptive original designer.

I'm going to use this map to explore the curse of dimensions.

***

It's clear from the design that county-level details are regarded as extremely important. As there are about 3,000 counties in the U.S., I don't see how any visual design can satisfy this requirement without giving up clarity.

More details require more objects, which spread readers' attention. More details contain more stories, but that too dilutes their focus.

Another principle of this map is to not allow bubbles to overlap. Of course, having bubbles overlap or print on top of one another is a visual faux pas. But to prevent such behavior on this particular design means the precise locations are sacrificed. Consider the eastern seaboard where there are densely populated counties: they are not pinned to their centroids. Instead, the counties are pushed out of their normal positions, similar to making a cartogram.

I remarked at the start – erroneously but deliberately – that each bubble is centered at the centroid of each county. I wonder how many of you noticed the inaccuracy of that statement. If that rule were followed, then the bubbles in New England would have overlapped and overprinted. 

This tradeoff affects how we perceive regional patterns, as all the densely populated regions are bent out of shape.

Another aspect of the data that the designer treats as important is county population, or rather relative county population. Relative – because bubble size don't portray absolutes, plus the designer didn't bother to provide a legend to decipher bubble sizes.

The tradeoff is location. The varying bubble sizes, coupled with the previous stipulation of no overlapping, push bubbles from their proper centroids. This forced displacement disproportionately affects larger counties.

***

What if we are willing to sacrifice county-level details?

In this setting, we are not obliged to show every single county. One alternative is to perform spatial smoothing. Intuitively, think about the following steps: plot all these bubbles in their precise locations, turn the colors slightly transparent, let them overlap, blend away the edges, and then we have a nice picture of where the Hispanic people are located.

I have sacrificed the county-level details but the regional pattern becomes much clearer, and we don't need to deviate from the well-understood shape of the standard map.

This version reminds me of the language maps that Josh Katz made.

Joshkatz_languagemap

Here is an old post about these maps.

This map design only reduces but does not eliminate the geographical inaccuracy. It uses the same trick as the Dorling map: the "vertical" density of population has been turned into "horizontal" span. It's a bit better because the centroids are not displaced.

***

Which map is better depends on what tradeoffs one is making. In the above example, I'd have made different choices.

 

One final thing – it's minor but maybe not so minor. Most of the bubbles on the map especially in the middle are tiny; as most of them have Hispanic proportions that are on the left side of the scale, they should be showing light orange. However, all of them appear darker than they ought to be. That's because each bubble has a dark border. For small bubbles, the ratio of ink on the border is a high proportion of the ink for the entire object.


What's a histogram?

Almost all graphing tools make histograms, and almost all dataviz books cover the subject. But I've always felt there are many unanswered questions. In my talk this Thursday in NYC, I'll provide some answers. You can reserve a spot here.

***

Here's the most generic histogram:

Salaries_count_histogram

Even Excel can make this kind of histogram. Notice that we have counts in the y-axis. Is this really a useful chart?

I haven't found this type of histogram useful ever, since I don't do analyses in which I needed to know the exact count of something - when I analyze data, I'm generalizing from the observed sample to a larger group.

Speaking of Excel, I felt that the developers have always hated histograms. Why is it much harder to make histograms than other basic charts?

***

Another question. We often think of histograms as a crude approximation to a probability density function (PDF). An example of a PDF is the famous bell curve. Textbooks sometimes show the concept like this:

Histogram_normal_pdf

This is true of only some types of histograms (and not the one shown in the first section!) Instead, we often face the following situation:

Normals_histogram50_undercurve

This isn't a trick. The data in the histogram above were generated by sampling the pink bell curve.

***

If you've used histograms, you probably also have run into strange issues. I haven't found much materials out there to address these questions, and they have been lingering in my mind, hidden, for a long time.

My Thursday talk will hopefully fill in some of these gaps.


Do you want a taste of the new hurricane cone?

The National Hurricane Center (NHC) put out a press release (link to PDF) to announce upcoming changes (in August 2024) to their "hurricane cone" map. This news was picked up by Miami Herald (link).

New_hurricane_map_2024

The above example is what the map looks like. (The data are probably fake since the new map is not yet implemented.)

The cone map has been a focus of research because experts like Alberto Cairo have been highly critical of its potential to mislead. Unfortunately, the more attention paid to it, the more complicated the map has become.

The latest version of this map comprises three layers.

The bottom layer is the so-called "cone". This is the white patch labeled below as the "potential track area (day 1-5)".  Researchers dislike this element because they say readers tend to misinterpret the cone as predicting which areas would be damaged by hurricane winds when the cone is intended to depict the uncertainty about the path of the hurricane. Prior criticism has led the NHC to add the text at the top of the chart, saying "The cone contains the probable path of the storm center but does not show the size of the storm. Hazardous conditions can occur outside of the cone."

The middle layer are the multi-colored bits. Two of these show the areas for which the NHC has issued "watches" and "warnings". All of these color categories represent wind speeds at different times. Watches and warnings are forecasts while the other colors indicate "current" wind speeds. 

The top layer consists of black dots. These provide a single forecast of the most likely position of the storm, with the S, H, M labels indicating the most likely range of wind speeds at forecast times.

***

Let's compare the new cone map to a real hurricane map from 2020. (This older map came from a prior piece also by NHC.)

Old_hurricane_map_2020

Can we spot the differences?

To my surprise, the differences were minor, in spite of the pre-announced changes.

The first difference is a simplification. Instead of dividing the white cone (the bottom layer) into two patches -- a white patch for days 1-3, and a dotted transparent patch for days 4-5, the new map aggregates the two periods. Visually, simplifying makes the map less busy but loses the implicit acknowledge found in the old map that forecasts further out are not as reliable.

The second point of departure is the addition of "inland" warnings and watches. Notice how the red and blue areas on the old map hugged the coastline while the red and blue areas on the new map reach inland.

Both changes push the bottom layer, i.e. the cone, deeper into the background. It's like a shrink-flation ice cream cone that has a tiny bit of ice cream stuffed deep in its base.

***

How might one improve the cone map? I'd start by dismantling the layers. The three layers present answers to different problems, albeit connected.

Let's begin with the hurricane forecasting problem. We have the current location of the storm, and current measurements of wind speeds around its center. As a first requirement, a forecasting model predicts the path of the storm in the near future. At any time, the storm isn't a point in space but a "cloud" around a center. The path of the storm traces how that cloud will move, including any expansion or contraction of its radius.

That's saying a lot. To start with, a forecasting model issues the predicted average path -- the expected path of the storm's center. This path is (not competently) indicated by the black dots in the top layer of the cone map. These dots offer only a sampled view of the average path.

Not surprisingly, there is quite a bit of uncertainty about the future path of any storm. Many models simulate future worlds, generating many predictions of the average paths. The envelope of the most probable set of paths is the "cone". The expanding width of the cone over time reflects the higher uncertainty of our predictions further into the future. Confusingly, this cone expansion does not depict spatial expansion of either the storm's size or the potential areas that may suffer the greatest damage. Both of those tend to shrink as hurricanes move inland.

Nevertheless, the cone and the black dots are connected. The path drawn out by the black dots should be the average path of the center of the storm.

The forecasting model also generates estimates of wind speeds. Those are given as labels inside the black dots. The cone itself offers no information about wind speeds. The map portrays the uncertainty of the position of the storm's center but omits the uncertainty of the projected wind speeds.

The middle layer of colored patches also inform readers about model projections - but in an interpreted manner. The colors portray hurricane warnings and watches for specific areas, which are based on projected wind speeds from the same forecasting models described above. The colors represent NHC's interpretation of these model outputs. Each warning or watch simultaneously uses information on location, wind speed and time. The uncertainty of the projected values is suppressed.

I think it's better to use two focused maps instead of having one that captures a bit of this and a bit of that.

One map can present the interpreted data, and show the areas that have current warnings and watches. This map is about projected wind strength in the next 1-3 days. It isn't about the center of the storm, or its projected path. Uncertainty can be added by varying the tint of the colors, reflecting the confidence of the model's prediction.

Another map can show the projected path of the center of the storm, plus the cone of uncertainty around that expected path. I'd like to bring more attention to the times of forecasting, perhaps shading the cone day by day, if the underlying model has this level of precision.

***

Back in 2019, I wrote a pretty long post about these cone maps. Well worth revisiting today!


Lost in the middle class

Washington Post asks people what it means to be middle class in the U.S. (link; paywall)

The following graphic illustrates one type of definition, purely based on income ranges.

Wpost_middleclass

For me, this chart is more taxing to read than it appears.

It can be read column by column. Each column represents a hypotheticial annual income for a family of four. People are asked whether they consider that family lower/working class, middle class or upper class. Be careful as the increments from column to column are not uniform.

Now, what's the question again? We're primarily interested in what incomes constitute middle class.

So, we should be looking at the deep green blocks that hang in the middle of each column. It's not easy to read the proportion of middle blocks in a stacked column chart.

***

I tried separating out the three perceived income classes, using a small-multiples design.

Junkcharts_redo_wpost_middleclass

One can more directly see what income ranges are most popularly perceived as being in each income class.

***

The article also goes into alternative definitions of middle class, using more qualitative metrics, such as "able to pay all bills on time without worry". That's a whole other post.

 


Messing with expectations

A co-worker sent me to the following map, found in Forbes:

Forbes_gastaxmap

It shows the amount of state tax surcharge per gallon of gas in the U.S. And it's got one of the most common issues found in choropleth maps - the color scheme runs opposite to reader expectations.

Typically, if we see a red-green color scale, we would expect red to represent large numbers and green, small numbers. This map reverses the typical setup: California, the state with the heftiest gas tax, is shown green.

I know, I know - if we apply the typical color scheme, California would bleed red, and it's a blue state, damn it.

The solution is to avoid the red color. Just don't use red or blue.

Junkcharts_redo_forbes_gastaxmap_green

There is no need to use two colors either.

***

A few minor fixes. Given that all dollar amounts on the map are shown to two decimal places, the legend labels should also be shown to 2 decimal places, and with dollar signs.

Forbes_gastaxmap_legend

The subtitle should read "Dollars per gallon" instead of "Cents per gallon". Alternatively, keep "Cents per gallon" but convert all data labels into cents.

Some of the states are missing data labels.

***

I recast this as a small-multiples by categorizing states into four subgroups.

Junkcharts_redo_forbes_gastaxmap_split

With this change, one can almost justify using maps because there is sort of a spatial pattern.

 

 


Stranger things found on scatter plots

Washington Post published a nice scatter plot which deconstructs scores from the recent World Championships in Gymnastics. (link)

Wpost_simonebiles

The chart presents the main message clearly - the winner Simone Biles scored the highest on both components of the score (difficulty and execution), by quite some margin.

What else can we learn from this chart?

***

Every athlete who qualified for the final scored at or above average on both components.

Scoring below average on either component is a death knell: no athlete scored enough on the other component to compensate. (The top left and bottom right quadrants would have had some yellow dots otherwise.)

Several athletes in the top right quadrant presumably scored enough to qualify but didn't. The footnote likely explains it: each country can send at most two athletes to the final. It may be useful to mark out these "unlucky" athletes using a third color.

Curiously, it's not easy to figure out who these unlucky athletes were from this chart alone. We need two pieces of data: the minimum qualifying score, and the total score for each athlete. The scatter plot isn't the best chart form to show totals, but qualification to the final is based on the sum of the difficulty and execution scores. (Note also, neither axis starts at zero, compounding the challenge.)

***

This scatter plot is most memorable for shattering one of my expectations about risk and reward in sports.

I expect risk-seeking athletes to suffer from higher variance in performance. The tennis player who goes for big serves tend to also commit more double faults. The sluggers who hit home runs tend to strike out more often. Similarly, I expect gymnasts who attempt more difficult skills to receive lower execution scores.

Indeed, the headline writer seemed to agree, suggesting that Biles is special because she's both high in difficulty and strong in execution.

The scatter plot, however, sends the opposite message - this should not surprise. The entire field shows a curiously strong positive correlation between difficulty and execution scores. The more difficult is the routine, the higher the excution score!

It's hard to explain such a pattern. My guesses are:

a) judges reward difficult routines, and subconsciously confound execution and difficulty scores. They use separate judges for excecution and difficulty. Paradoxically, this arrangement may have caused separation anxiety - the judges for execution might just feel the urge to reward high difficulty.

b) those athletes who are skilled enough to attempt more difficult routines are also those who are more consistent in execution. This is a type of self-selection bias frequently found in observational data.

Regardless of the reasons for the strong correlation, the chart shows that these two components of the total score are not independent, i.e. the metrics have significant overlap in what they measure. Thus, one cannot really talk about a difficult routine without also noting that it's a well-executed routine, and vice versa. In an ideal scoring design, we'd like to have independent components.


The choice to encode data using colors

NBC News published the following heatmap that shows inflation by product category in the last year or so:

Nbcnews_inflationtracker

The general story might be that inflation was rampant in airfare and electricity prices about a year ago but these prices have moderated recently, especially in airfare. Gas prices appear to have inflated far less than overall inflation during these months.

***

Now, if you're someone who cares about the magnitude of differences, not just the direction, then revisit the above statements, and you'll feel a sense of inadequacy.

When we choose to encode data in colors, we're giving up on showing magnitudes or precision. The color scale shown up top sends the message that the continuous nature of the number line is being displayed but it really isn't.

The largest value of the chart is found on the left side of the airfare row:

Nbcnews_inflationtracker_highest

The value is about 36% which strangely enough is far larger than the maximum value shown in the legend above. Even if those values align, it is still impossible to guess what values the different colors and shades in the cells map to from the legend.

***

The following small-multiples chart shows the underlying values more precisely:

Redo_junkcharts_nbcnewsinflation

I have transformed the data differently. In these line charts, the data are indexed to the first month (100) so each chart shows the cumulative change in prices from that month to the current month, for each category, compared to the overall.

The two most interesting categories are airfare and gas. Airfare has recently decreased quite drastically relative to September 2022, and thus the line is far below the overall inflation trend. Gas prices moved in reverse: they dropped in the last quarter of 2022 but have steadily risen over 2023, and in the most recent month, is tracking overall inflation.