Tidying up the details

This column chart caught my attention because of the color labels.

Thall_financials2023_pandl

Well, it also concerns me that the chart takes longer to take in than you'd think.

***

The color labels say "FY2123", "FY2022", and "FY1921". It's possible but unlikely that the author is making comparisons across centuries. The year 2123 hasn't yet passed, so such an interpretation would map the three categories to long-ago past, present and far-into-the-future.

Perhaps hyphens were inadvertently left off so "FY2123" means "FY2021 - FY2023". It's odd to report financial metrics in multi-year aggregations. I rule this out because the three categories would then also overlap.

Here's what I think the mistake is: somehow the prefix is rolled forward when it is applied to the years. "FY23", "FY22", "FY21" got turned into "FY[21]23", "FY[20]22", "FY[19]21" instead of putting 20 in all three slots.

The chart appeared in an annual financial report, and the comparisons were mostly about the reporting year versus the year before so I'm pretty confident the last two digits are accurately represented.

Please let me know if you have another key to this puzzle.

In the following, I'm going to assume that the three colors represent the most recent three fiscal years.

***

A few details conspire to blow up our perception time.

There was no extra spacing between groups of columns.

The columns are arranged in reverse time order, with the most recent year shown on the left. (This confuses those of us that use the left-to-right convention.)

The colors are not ordered. If asked to sort the three colors, you will probably suggest what is described as "intuitive" below:

Junkcharts_color_order

The intuitive order aligns with the amount of black added to a base color (hue). But this isn't the order assigned to the three years on the original chart.

***

Some of the other details on the chart are well done. For example, I like the handling of the gridlines and the axes.

The following revision tidies up some of the details mentioned above, without changing the key features of the chart:

Junkcharts_redo_trinhallfinancials

 


The most dangerous day

Our World in Data published this interesting chart about infant mortality in the U.S.

Mostdangerousday

The article that sent me to this chart called the first day of life the "most dangerous day". This dot plot seems to support the notion, as the "per-day" death rate is the highest on the day of birth, and then drops quite fast (note log scale) over the the first year of life.

***

Based on the same dataset, I created the following different view of the data, using the same dot plot form:

Junkcharts_redo_ourworldindata_infantmortality

By this measure, a baby has 99.63% chance of surviving the first 30 days while the survival rate drops to 99.5% by day 180.

There is an important distinction between these two metrics.

The "per day" death rate is the chance of dying on a given day, conditional on having survived up to that day. The chance of dying on day 2 is lower partly because some of the more vulnerable ones have died on day 1 or day 0,  etc.

The survival rate metric is cumulative: it measures how many babies are still alive given they were born on day 0. The survival rate can never go up, so long as we can't bring back the dead.

***

If we are assessing a 5-day-old baby's chance of surviving day 6, the "per-day" death rate is relevant since that baby has not died in the first 5 days.

If the baby has just been born, and we want to know the chance it might die in the first five days (or survive beyond day 5), then the cumulative survival rate curve is the answer. If we use the per-day death rate, we can't add the first five "per-day" death rates It's a more complicated calculation of dying on day 0, then having not died on day 0, dying on day 1, then having not died on day 0 or day 1, dying on day 2, etc.

 


Pie charts and self-sufficiency

This graphic shows up in a recent issue of Princeton alumni magazine, which has a series of pie charts.

Pu_aid sm

The story being depicted is clear: the school has been generously increasing the amount of financial aid given to students since 1998. The proportion receiving any aid went from 43% to 67% so about two out of three students who enrolled in 2023 are getting aid.

The key components of the story are the values in 1998 and 2023, and the growth trend over this period.

***

Here is an exercise worth doing. Think about how you figured out the story components.

Is it this?

Junkcharts_redo_pu_aid_1

Or is it this?

Junkcharts_redo_pu_aid_2

***

This is what I've been calling a "self-sufficiency test" (link). How much work are the visual elements doing in conveying the graph's message to you? If the visual elements aren't doing much, then the designer hasn't taken advantage of the visual medium.


Expert handling of multiple dimensions of data

I enjoyed reading this Washington Post article about immigration in America. It features a number of graphics. Here's one graphic I particularly like:

Wpost_smallmultiplesmap

This is a small multiples of six maps, showing the spatial distribution of immigrants from different countries. The maps reveal some interesting patterns: Los Angeles is a big favorite of Guatamalans while Houston is preferred by Hondurans. Venezuelans like Salt Lake City and Denver (where there are also some Colombians and Mexicans). The breadth of the spatial distribution surprises me.

The dataset behind this graphic is complex. It's got country of origin, place of settlement, and time of arrival. The maps above collapsed the time dimension, while drawing attention to the other two dimensions.

***

They have another set of charts that highlight the time dimension while collapsing the place of settlement dimension. Here's one view of it:

Wpost_inkblot_overall

There are various names for this chart form. Stream river is one. I like to call it "inkblot", where the two sides are symmetric around the middle vertical line. The chart shows that "migrants in the U.S. immigration court" system have grown substantially since the end of the Covid-19 pandemic, during which they stopped coming.

I'm not a fan of the inkblot. One reason is visible in the following view, which showcases three Central American countries.

Wpost_inkblot_centralamerica

The main message is clear enough. The volume of immigrants from these three countries have been relatively stable over the last decade, with a bulge in the late 2000s. The recent spurt in migrants have come from other places.

But try figuring out what proportion of total immigration is accounted for by these three countries say in 2024. It's a task that is tougher than it should be, and the culprit is that the "other countries" category has been split in half with the two halves separated.

 


The radial is still broken

It's puzzling to me why people like radial charts. Here is a recent set of radial charts that appear in an article in Significance magazine (link to paywall, currently), analyzing NBA basketball data.

Significance radial nba

This example is not as bad as usual (the color scheme notwithstanding) because the story is quite simple.

The analysts divided the data into three time periods: 1980-94, 1995-15, 2016-23. The NBA seasons were summarized using a battery of 15 metrics arranged in a circle. In the first period, all but 3 of the metrics sat much above the average level (indicated by the inner circle). In the second period, all 15 metrics reduced below the average, and the third period is somewhat of a mirror image of the first, which is the main message.

***

The puzzle: why prefer this circular arrangement to a rectangular arrangement?

Here is what the same graph looks like in a rectangular arrangement:

Junkcharts_redo_significanceslamdunkstats

One plausible justification for the circular arrangement is if the metrics can be clustered so that nearby metrics are semantically related.

Nevertheless, the same semantics appear in a rectangular arrangement. For example, P3-P3A are three point scores and attempts while P2-P2A are two-pointers. That is a key trend. They are neighborhoods in this arrangement just as they are in the circular arrangement.

So the real advantage is when the metrics have some kind of periodicity, and the wraparound point matters. Or, that the data are indexed to directions so north, east, south, west are meaningful concepts.

If you've found other use cases, feel free to comment below.

***


I can't end this post without returning to the colors. If one can take a negative image of the original chart, one should. Notice that the colors that dominate our attention - the yellow background, and the black lines - have no data in them: yellow being the canvass, and black being the gridlines. The data are found in the white polygons.

The other informative element, as one learns from the caption, is the "blue dashed line" that represents the value zero (i.e. average) in the standardized scale. Because the size of the image was small in the print magazine that I was reading, and they selected a dark blue encroaching on black, I had to squint hard to find the blue line.

 

 


Adjust, and adjust some more

This Financial Times report illustrates the reason why we should adjust data.

The story explores the trend in economic statistics during 14 years of governing by conservatives. One of those metrics is so-called council funding (local governments). The graphic is interactive: as the reader scrolls the page, the chart transforms.

The first chart shows the "raw" data.

Ft_councilfunding1

The vertical axis shows year-on-year change in funding. It is an index relative to the level in 2010. From this line chart, one concludes that council funding decreased from 2010 to around 2016, then grew; by 2020, funding has recovered to the level of 2010 and then funding expanded rapidly in recent years.

When the reader scrolls down, this chart is replaced by another one:

Ft_councilfunding2

This chart contains a completely different picture. The line dropped from 2010 to 2016 as before. Then, it went flat, and after 2021, it started raising, even though by 2024, the value is still 10 percent below the level in 2010.

What happened? The data journalist has taken the data from the first chart, and adjusted the values for inflation. Inflation was rampant in recent years, thus, some of the raw growth have been dampened. In economics, adjusting for inflation is also called expressing in "real terms". The adjustment is necessary because the same dollar (hmm, pound) is worth less when there is inflation. Therefore, even though on paper, council funding in 2024 is more than 25 percent higher than in 2010, inflation has gobbled up all of that and more, to the point in which, in real terms, council funding has fallen by 20 percent.

This is one material adjustment!

Wait, they have a third chart:

Ft_councilfunding3

It's unfortunate they didn't stabilize the vertical scale. Relative to the middle chart, the lowest point in this third chart is about 5 percent lower, while the value in 2024 is about 10 percent lower.

This means, they performed a second adjustment - for population change. It is a simple adjustment of dividing by the population. The numbers look worse probably because population has grown during these years. Thus, even if the amount of funding stayed the same, the money would have to be split amongst more people. The per-capita adjustment makes this point clear.

***

The final story is much different from the initial one. Not only was the magnitude of change different but the direction of change reversed.

Whenever it comes to adjustments, remember that all adjustments are subjective. In fact, choosing not to adjust is also subjective. Not adjusting is usually much worse.

 

 

 

 


Excess delay

The hot topic in New York at the moment is congestion pricing for vehicles entering Manhattan, which is set to debut during the month of June. I found this chart (link) that purports to prove the effectiveness of London's similar scheme introduced a while back.

Transportxtra_2

This is a case of the visual fighting against the data. The visual feels very busy and yet the story lying beneath the data isn't that complex.

This chart was probably designed to accompany some text which isn't available free from that link so I haven't seen it. The reader's expectation is to compare the periods before and after the introduction of congestion charges. But even the task of figuring out the pre- and post-period is taking more time than necessary. In particular, "WEZ" is not defined. (I looked this up, it's "Western Extension Zone" so presumably they expanded the area in which charges were applied when the travel rates went back to pre-charging levels.)

The one element of the graphic that raises eyebrows is the legend which screams to be read.

Transportxtra_londoncongestioncharge_legend

Why are there four colors for two items? The legend is not self-sufficient. The reader has to look at the chart itself and realize that purple is the pre-charging period while green (and blue) is the post-charging period (ignoring the distinction between CCZ and WEZ).

While we are solving this puzzle, we also notice that the bottom two colors are used to represent an unchanging quantity - which is the definition of "no congestion". This no-congestion travel rate is a constant throughout the chart and yet a lot of ink of two colors have been spilled on it. The real story is in the excess delay, which the congestion charging scheme was supposed to reduce.

The excess on the chart isn't harmless. The excess delay on the roads has been transferred to the chart reader. It actually distracts from the story the analyst is wanting to tell. Presumably, the story is that the excess delays dropped quite a bit after congestion charging was introduced. About four years later, the travel rates had creeped back to pre-charging levels, whereupon the authorities responded by extending the charging zone to WEZ (which as of the time of the chart, wasn't apparently bringing the travel rate down.)

Instead of that story, the excess of the chart makes me wonder... the roads are still highly congested with travel rates far above the level required to achieve no congestion, even after the charging scheme was introduced.

***

I started removing some of the excess from the chart. Here's the first cut:

Junkcharts_redo_transportxtra_londoncongestioncharge

This is better but it is still very busy. One problem is the choice of columns, even though the data are found strictly on the top of each column. (Besides, when I chop off the unchanging sections of the columns, I created a start-not-from-zero problem.) Also, the labeling of the months leaves much to be desired, there are too many grid lines, etc.

***

Here is the version I landed on. Instead of columns, I use lines. When lines are used, there is no need for month labels since we can assume a reader knows the structure of months within a year.

Junkcharts_redo_transportxtra_londoncongestioncharge-2

A priniciple I hold dear is not to have legends unless it is absolutely required. In this case, there is no need to have a legend. I also brought back the notion of a uncongested travel speed, with a single line (and annotation).

***

The chart raises several questions about the underlying analysis. I'd interested in learning more about "moving car observer surveys". What are those? Are they reliable?

Further, for evidence of efficacy, I think the pre-charging period must be expanded to multiple years. Was 2002 a particularly bad year?

Thirdly, assuming WEZ indicates the expansion of the program to a new geographical area, I'm not sure whether the data prior to its introduction represents the travel rate that includes the WEZ (despite no charging) or excludes it. Arguments can be made for each case so the key from a dataviz perspective is to clarify what was actually done.

 

P.S. [6-6-24] On the day I posted this, NY State Governer decided to cancel the congestion pricing scheme that was set to start at the end of June.


Do you want a taste of the new hurricane cone?

The National Hurricane Center (NHC) put out a press release (link to PDF) to announce upcoming changes (in August 2024) to their "hurricane cone" map. This news was picked up by Miami Herald (link).

New_hurricane_map_2024

The above example is what the map looks like. (The data are probably fake since the new map is not yet implemented.)

The cone map has been a focus of research because experts like Alberto Cairo have been highly critical of its potential to mislead. Unfortunately, the more attention paid to it, the more complicated the map has become.

The latest version of this map comprises three layers.

The bottom layer is the so-called "cone". This is the white patch labeled below as the "potential track area (day 1-5)".  Researchers dislike this element because they say readers tend to misinterpret the cone as predicting which areas would be damaged by hurricane winds when the cone is intended to depict the uncertainty about the path of the hurricane. Prior criticism has led the NHC to add the text at the top of the chart, saying "The cone contains the probable path of the storm center but does not show the size of the storm. Hazardous conditions can occur outside of the cone."

The middle layer are the multi-colored bits. Two of these show the areas for which the NHC has issued "watches" and "warnings". All of these color categories represent wind speeds at different times. Watches and warnings are forecasts while the other colors indicate "current" wind speeds. 

The top layer consists of black dots. These provide a single forecast of the most likely position of the storm, with the S, H, M labels indicating the most likely range of wind speeds at forecast times.

***

Let's compare the new cone map to a real hurricane map from 2020. (This older map came from a prior piece also by NHC.)

Old_hurricane_map_2020

Can we spot the differences?

To my surprise, the differences were minor, in spite of the pre-announced changes.

The first difference is a simplification. Instead of dividing the white cone (the bottom layer) into two patches -- a white patch for days 1-3, and a dotted transparent patch for days 4-5, the new map aggregates the two periods. Visually, simplifying makes the map less busy but loses the implicit acknowledge found in the old map that forecasts further out are not as reliable.

The second point of departure is the addition of "inland" warnings and watches. Notice how the red and blue areas on the old map hugged the coastline while the red and blue areas on the new map reach inland.

Both changes push the bottom layer, i.e. the cone, deeper into the background. It's like a shrink-flation ice cream cone that has a tiny bit of ice cream stuffed deep in its base.

***

How might one improve the cone map? I'd start by dismantling the layers. The three layers present answers to different problems, albeit connected.

Let's begin with the hurricane forecasting problem. We have the current location of the storm, and current measurements of wind speeds around its center. As a first requirement, a forecasting model predicts the path of the storm in the near future. At any time, the storm isn't a point in space but a "cloud" around a center. The path of the storm traces how that cloud will move, including any expansion or contraction of its radius.

That's saying a lot. To start with, a forecasting model issues the predicted average path -- the expected path of the storm's center. This path is (not competently) indicated by the black dots in the top layer of the cone map. These dots offer only a sampled view of the average path.

Not surprisingly, there is quite a bit of uncertainty about the future path of any storm. Many models simulate future worlds, generating many predictions of the average paths. The envelope of the most probable set of paths is the "cone". The expanding width of the cone over time reflects the higher uncertainty of our predictions further into the future. Confusingly, this cone expansion does not depict spatial expansion of either the storm's size or the potential areas that may suffer the greatest damage. Both of those tend to shrink as hurricanes move inland.

Nevertheless, the cone and the black dots are connected. The path drawn out by the black dots should be the average path of the center of the storm.

The forecasting model also generates estimates of wind speeds. Those are given as labels inside the black dots. The cone itself offers no information about wind speeds. The map portrays the uncertainty of the position of the storm's center but omits the uncertainty of the projected wind speeds.

The middle layer of colored patches also inform readers about model projections - but in an interpreted manner. The colors portray hurricane warnings and watches for specific areas, which are based on projected wind speeds from the same forecasting models described above. The colors represent NHC's interpretation of these model outputs. Each warning or watch simultaneously uses information on location, wind speed and time. The uncertainty of the projected values is suppressed.

I think it's better to use two focused maps instead of having one that captures a bit of this and a bit of that.

One map can present the interpreted data, and show the areas that have current warnings and watches. This map is about projected wind strength in the next 1-3 days. It isn't about the center of the storm, or its projected path. Uncertainty can be added by varying the tint of the colors, reflecting the confidence of the model's prediction.

Another map can show the projected path of the center of the storm, plus the cone of uncertainty around that expected path. I'd like to bring more attention to the times of forecasting, perhaps shading the cone day by day, if the underlying model has this level of precision.

***

Back in 2019, I wrote a pretty long post about these cone maps. Well worth revisiting today!


Neither the forest nor the trees

On the NYT's twitter feed, they featured an article titled "These Seven Tech Stocks are Driving the Market". The first sentence of the article reads: "The S&P 500 is at an all-time high, and investors have just a handful of stocks to thank for it."

Without having seen any data, I'd surmise from that line that (a) the S&P 500 index has gone up recently, and (b) most if not all of the gain in the index can be attributed to gains in the tech stocks mentioned in the headline. (For purists, a handful is five, not seven.)

The chart accompanying the tweet is a treemap:

Nyt_magnificentseven

The treemap is possibly the most overhyped chart type of the modern era. Its use here is tangential to the story of surging market value. That's because the treemap presents a snapshot of the composition of the index, but contains nothing about the trend (change over time) of the average index value or of its components.

***

Even in representing composition, the treemap is inferior to, gasp, a pie chart. Of course, we can only use a pie chart for small numbers of components. The following illustration takes the data from the NYT chart on the Magnificent Seven tech stocks, and compares a treemap versus a pie chart side by side:

Junkcharts_redo_nyt_magnificent7

The reason why the treemap is worse is that both the width and the height of the boxes are changing while only the radius (or angle) of the pie slices is varying. (Not saying use a pie chart, just saying the treemap is worse.)

There is a reason why the designer appended data labels to each of the seven boxes. The effect of not having those labels is readily felt when our eyes reach the next set of stocks – which carry company names but not their market values. What is the market value of Berkshire Hathaway?

Even more so, what proportion of the total is the market value of Berkshire Hathaway? Indeed, if the designer did not write down 29%, it would take a bit of work to figure out the aggregate value of yellow boxes relative to the entire box!

This design sucessfully draws our attention to the structural importance of various components of the whole. There are three layers - the yellow boxes (Magnificent Seven), the gray boxes with company names, and the other gray boxes. I also like how they positioned the text on the right column.

***

Going inside the NYT article itself, we find two line charts that convey the story as told.

Here's the first one:

Nyt_magnificent7_linechart1

They are comparing the most recent stock prices with those from October 12 2022, which is identified as the previous "low". (I'm actually confused by how the most recent "low" is defined, but that's a different subject.)

This chart carries a lot of good information, even though it does not plot "all the data", as in each of the 500 S&P components individually. Over the period under analysis, the average index value has gone up about 35% while the Magnificent Seven's value have skyrocketed by 65% in aggregate. The latter accounted for 30% of the total value at the most recent time point.

If we set the S&P 500 index value in 2024 as 100, then the M7 value in 2024 is 30. After unwinding the 65% growth, the M7 value in October 2022 was 18; the S&P 500 in October 2022 was 74. Thus, the weight of M7 was 24% (18/74) in October 2022, compared to 30% now. Consequently, the weight of the other 473 stocks declined from 76% to 70%.

This isn't even the full story because most of the action within the M7 is in Nvidia, the stock most tightly associated with the current AI hype, as shown in the other line chart.

Nyt_magnificent7_linechart2

Nvidia's value jumped by 430% in that time window. From the treemap, the total current value of M7 is $12.3 b while Nvidia's value is $1.4 b, thus Nvidia is 11.4% of M7 currently. Since M7 is 29% of the total S&P 500, Nvidia is 11.4%*29% = 3% of the S&P. Thus, in 2024, against 100 for the S&P, Nvidia's share is 3. After unwinding the 430% growth, Nvidia's share in October 2022 was 0.6, about 0.8% of 74. Its weight tripled during this period of time.


The cult of raw unadjusted data

Long-time reader Aleks came across the following chart on Facebook:

Unadjusted temp data fgfU4-ia fb post from aleks

The author attached a message: "Let's look at raw, unadjusted temperature data from remote US thermometers. What story do they tell?"

I suppose this post came from a climate change skeptic, and the story we're expected to take away from the chart is that there is nothing to see here.

***

What are we looking at, really?

"Nothing to see" probably refers to the patch of blue squares that cover the entire plot area, as time runs left to right from the 1910s to the present.

But we can't really see what's going on in the middle of the patch. So, "nothing to see" is effectively only about the top-to-bottom range of roughly 29.8 to 82.0. What does that range signify?

The blue patch is subdivided into vertical lines consisting of blue squares. Each line is a year's worth of temperature measurements. Each square is the average temperature on a specific day. The vertical range is the difference between the maximum and minimum daily temperatures in a given year. These are extreme values that say almost nothing about the temperatures in the other ~363 days of the year.

We know quite a bit more about the density of squares along each vertical line. They are broken up roughly by seasons. Those values near the top came from summers while the values near the bottom came from winters. The density is the highest near the middle, where the overplotting is so severe that we can barely see anything.

Within each vertical line, the data are not ordered chronologically. This is a very key observation. From left to right, the data are ordered from earliest to latest but not from top to bottom! Therefore, it is impossible for the human eye to trace the entire trajectory of the daily temperature readings from this chart. At best, you can trace the yearly average temperature – but only extremely roughly by eyeballing where the annual averages are inside the blue patch.

Indeed, there is "nothing to see" on this chart because its design has pulverized the data.

***

_numbersense_bookcoverIn Numbersense (link), I wrote "not adjusting the raw data is to knowingly publish bad information. It is analogous to a restaurant's chef knowingly sending out spoilt fish."

It's a fallacy to think that "raw unadjusted" data are the best kind of data. It's actually the opposite. Adjustments are designed to correct biases or other problems in the data. Of course, adjustments can be subverted to introduce biases in the data as well. It is subversive to presume that all adjustments are of the subversive kind.

What kinds of adjustments are of interest in this temperature dataset?

Foremost is the seasonal adjustment. See my old post here. If we want to learn whether temperatures have risen over these decades, we can't do so without separating out the seasons.

The whole dataset can be simplified by drawing the smoothed annual average temperature grouped by season of the year, and when that is done, the trend of rising temperatures is obvious.

***

The following chart by the EPA roughly implements the above:

Epa-seasonal-temperature_2022

The original can be found here. They made one adjustment which isn't the one I expected.

Note the vertical scale is titled "temperature anomaly". So, they are not plotting the actual recorded average temperatures, but the "anomalies", i.e. the difference between the recorded temperatures and some kind of "expected" temperature. This is a type of data adjustment as well. The purpose is to focus attention on the relative rather than absolute values. Think of this formula: recorded value = expected value + anomaly. The chart shows how many degrees above or below expectation, rather than how many degrees.

For a chart like this, there should be a required footnote that defines what "anomaly" is. Specifically, the reader should know about the model behind the "expectation". Typically, it's a kind of long-term average value.

For me, this adjustment is not necessary. Without the adjustment, the four panels can be combined into one panel with four lines. That's because the data nicely fit into four levels based on seasons.

The further adjustment I'd have liked to see is "smoothing". Each line above has a "smooth" trend, as well as some variability around this trend. The latter is not a big part of the story.

***

It's weird to push back on climate change advocacy by attacking data adjustments. The more productive direction, in my view, is to ask whether the observed trend is caused by human activities or part of some long-term up-and-down cycle. That is a very challenging question to answer.