Patiently looking

Voronoi (aka Visual Economist) made this map about service times at emergency rooms around the U.S.

 

Voronoi_EmergencyRoomWaitTImes

This map shows why one shouldn’t just stick state-level data into a state-level map by default.

The data are median service times, defined as the duration of the visit from the moment a patients arrive to the moment they leave. For reasons to be explained below, I don’t like this metric. The data are in terms of hours and minutes, and encoded in the color scale.

As with any choropleth, the dominant features of this map are the shapes and sizes of various pieces but these don’t carry any data. The eastern seaboard contains many states that are small in area but dense in population, and always produces a messy, crowded smorgasbord of labels and guiding lines.

The color scale is progressive (continuous) making it even harder to gain an appreciation of the spatial pattern. For the sake of argument, imagine a truly continuous color scale tuned to the median service times in number of minutes. There would be as many shades as there are unique number of values on the map. For example, the state with 2 hr 12 min median time would receive a different shade than the one with 2 hr 11 min. Looking at the dataset, I found 43 unique values of median service time in the 52 states and territories. Thus, almost every state would wear its unique shade, making it hard to answer such common questions as: which cluster of states have high/medium/low median service times?

(As the underlying software may only be capable of printing a finite number of shades so in reality, there aren’t any true continuous scales. A continuous scale is just a discrete scale with many levels of shades. For this map, I’d group the states into at most five categories, requiring five shades.)

***

We’re now reaching the D corner of the Trifecta Checkup (link). _trifectacheckup_image

I’d transform the data to relative values, such as an index against the median or average in the nation. The colors now indicate how much higher or lower is the state’s median service time than that of the nation. With this transformed data, it makes more sense to use a bidirectional color scale so that there are different colors for higher vs lower than average.

Lastly, I’m not sure about the use of median service time, as opposed to average (mean) service time. I suspect that the distribution is heavily skewed toward longer values so that the median service time falls below the mean service time. If, however, the service time distribution is roughly symmetric around the median, then the mean and median service times will be very similar, and thus the metric selection doesn’t matter.

Imagine you're the healthcare provider and your bonus is based on managing median service times. You have an incentive to let a small number of patients wait an extraordinary amount of time, while serving a bunch of patients who require relatively simple procedures. If it's a mean service time, the values of the extreme outliers will be spread over all the patients while the median service time is affected by the number of such outliers but not their magnitudes.

When I pulled down the publicly available data (link), I found additional data fields. The emergency room visits are further broken into four categories (low, medium, high, very high), and a median is reported within each category. Thus, we have a little idea how extreme the top values can be.

The following dotplot shows this:

Junkcharts_redo_voronoi_emergencyrooms

A chart like this is still challenging to read since there are 52 territories, ordered by the value on a metric. If the analyst can say what are interesting questions, e.g. breaking up the territories into regions, then a grouping can be applied to the above chart to aid comprehension.

 


The canonical U.S. political map

The previous posts feature the canonical political map of U.S. presidential elections, the vote margin shift map. The following realization of it, made by NBC News (link), drills down to the counties with the largest Asian-American populations:

Nbcnews_votemarginshiftmap_asians

How does this map form encode the data?

***

The key visual element is the arrow. The arrow has a color, a length and also an angle.

The color scheme is fixed to the canonical red-blue palette attached to America's two major political parties.

The angle of the arrow, as seen in the legend, carries no data at all. All arrows are slanted at the same angles. Not quite; the political party is partially encoded into the angle, as the red arrows slant one way while the blue arrows always slant the other way. The degree of slant is constant everywhere, though.

So only the lengths of the arrows contain the vote margin gain/loss data. The legend shows arrows of two different lengths but vote margins have not been reduced to two values. As evident on the map, the arrow lengths are continuous.

The designer has a choice when it comes to assigning colors to these arrows. The colors found on the map above depicts the direction of the vote margin shift so red arrows indicate counties in which the Republicans gained share. (The same color encoding is used by the New York Times.)

Note that a blue county could have shifted to the right, and therefore appear as a red arrow even though the county voted for Kamala Harris in 2024. Alternatively, the designer could have encoded the 2024 vote margin in the arrow color. While this adds more data to the map, it could wreak havoc with our perception as now all four combinations are possible: red, pointing left; red, pointing right; blue, pointing left; and blue, pointing right.

***

To sum this all up, the whole map is built from a single data series, the vote margin shift expressed as a positive or negative percentage, in which a positive number indicates Republicans increased the margin. The magnitude of this data is encoded in the arrow length, ignoring the sign. The sign (direction) of the data, a binary value, is encoded into the arrow color as well as the direction of the arrow.

In other words, it's a proportional symbol map in which each geographical region is represented by a symbol (typically a bubble), and a single numeric measure is encoded in the size of the symbol. In many situations, the symbol's color is used to display a classification of the geographical regions.

The symbol used for the "wind map" are these slanted arrows. The following map, pulled from CNN (link), makes it clear that the arrows play only the role of a metaphor, the left-right axis of political attitude.

Cnn_votemarginshiftmap_triangles

This map is essentially the same as the "wind map" used by the New York Times and NBC News, the key difference being that instead of arrows, the symbol is a triangle. On proportional triangle maps, the data is usually encoded in the height of the triangles, so that the triangles can be interpreted as "hills". Thus, the arrow length in the wind map is the hill height in the triangle map. The only thing left behind is the left-right metaphor.

The CNN map added a detail. Some of the counties have a dark gray color. These are "flipped". A flip is defined as a change in "sign" of the vote margin from 2020 to 2024. A flipped county can exhibit either a blue or a red hill. The direction of the flip is actually constrained by the hill color. If it's a red hill, we know there is a shift towards Republicans, and in addition, the county flipped, it must be that Democrats won that county in 2020, and it flipped to Republicans. Similiar, if a blue hill sits on a dark gray county, then the county must have gone for Republicans in 2020 and flipped to Democrats in 2024.

 


Dot plots with varying dot sizes

In a prior post, I appreciated the effort by the Bloomberg Graphics team to describe the diverging fortunes of Japanese and Chinese car manufacturers in various Asian markets.

The most complex chart used in that feature is the following variant of a dot plot:

Bloomberg_japancars_chinamarket

This chart plots the competitors in the Chinese domestic car market. Each bubble represents a car brand. Using the styling of the entire article, the red color is associated with Japanese brands while the medium gray color indicates Chinese brands. The light gray color shows brands from the rest of the world. (In my view, adding the pink for U.S. and blue for German brands - seen on the first chart in this series - isn't too much.)

The dot size represents the current relative market share of the brand. The main concern of the Bloomberg article is the change in market share in the period 2019-2024. This is placed on the horizontal axis, so the bubbles on the right side represent growing brands while the bubbles on the left, weakening brands.

All the Japanese brands are stagnating or declining, from the perspective of market share.

The biggest loser appears to be Volkswagen although it evidently started off at a high level since its bubble size after shrinkage is still among the largest.

***

This chart form is a composite. There are at least two ways to describe it. I prefer to see it as a dot plot with an added dimension of dot size. A dot plot typically plots a single dimension on a single axis, and here, a second dimension is encoded in the sizes of the dots.

An alternative interpretation is that it is a scatter plot with a third dimension in the dot size. Here, the vertical dimension is meaningless, as the dots are arbitrarily spread out to prevent overplotting. This arrangement is also called the bubble plot if we adopt a convention that a bubble is a dot of variable size. In a typical bubble plot, both vertical and horizontal axes carry meaning but here, the vertical axis is arbitrary.

The bubble plot draws attention to the variable in the bubble size, the scatter plot emphasizes two variables encoded in the grid while the dot plot highlights a single metric. Each shows secondary metrics.

***

Another revelation of the graph is the fragmentation of the market. There are many dots, especially medium gray dots. There are quite a few Chinese local manufacturers, most of which experienced moderate growth. Most of these brands are startups - this can be inferred because the size of the dot is about the same as the change in market share.

The only foreign manufacturer to make material gains in the Chinese market is Tesla.

The real story of the chart is BYD. I almost missed its dot on first impression, as it sits on the far right edge of the chart (in the original webpage, the right edge of the chart is aligned with the right edge of the text). BYD is the fastest growing brand in China, and its top brand. The pedestrian gray color chosen for Chinese brands probably didn't help. Besides, I had a little trouble figuring out if the BYD bubble is larger than the largest bubble in the size legend shown on the opposite end of BYD. (I measured, and indeed the BYD bubble is slightly larger.)

This dot chart (with variable dot sizes) is nice for highlighting individual brands. But it doesn't show aggregates. One of the callouts on the chart reads: "Chinese cars' share rose by 23%, with BYD at the forefront". These words are necessary because it's impossible to figure out that the total share gain by all Chinese brands is 23% from this chart form.

They present this information in the line chart that I included in the last post, repeated here:

Bloomberg_japancars_marketshares

The first chart shows that cumulatively, Chinese brands have increased their share of the Chinese market by 23 percent while Japanese brands have ceded about 9 percent of market share.

The individual-brand view offers other insights that can't be found in the aggregate line chart. We can see that in addition to BYD, there are a few local brands that have similar market shares as Tesla.

***

It's tough to find a single chart that brings out insights at several levels of analysis, which is why we like to talk about a "visual story" which typically comprises a sequence of charts.

 


Fantastic auto show from the Bloomberg crew

I really enjoyed the charts in this Bloomberg feature on the state of Japanese car manufacturers in the Southeast Asian and Chinese markets (link). This article contains five charts, each of which is both engaging and well-produced.

***

Each chart has a clear message, and the visual display is clearly adapted for purpose.

The simplest chart is the following side-by-side stacked bar chart, showing the trend in share of production of cars:

Bloomberg_japancars_production

Back in 1998, Japan was the top producer, making about 22% of all passenger cars in the world. China did not have much of a car industry. By 2023, China has dominated global car production, with almost 40% of share. Japan has slipped to second place, and its share has halved.

The designer is thoughtful about each label that is placed on the chart. If something is not required to tell the story, it's not there. Consistently across all five charts, they code Japan in red, and China in a medium gray color. (The coloring for the rest of the world is a bit inconsistent; we'll get to that later.)

Readers may misinterpret the cause of this share shift if this were the only chart presented to them. By itself, the chart suggests that China simply "stole" share from Japan (and other countries). What is true is that China has invested in a car manufacturing industry. A more subtle factor is that the global demand for cars has grown, with most of the growth coming from the Chinese domestic market and other emerging markets - and many consumers favor local brands. Said differently, the total market size in 2023 is much higher than that in 1998.

***

Bloomberg also made a chart that shows market share based on demand:

Bloomberg_japancars_marketshares

This is a small-multiples chart consisting of line charts. Each line chart shows market share trends in five markets (China and four Southeast Asian nations) from 2019 to 2024. Take the Chinese market for example. The darker gray line says Chinese brands have taken 20 percent additional market share since 2019; note that the data series is cumulative over the entire window. Meanwhile, brands from all other countries lost market share, with the Japanese brands (in red) losing the most.

The numbers are relative, which means that the other brands have not necessarily suffered declines in sales. This chart by itself doesn't tell us what happened to sales; all we know is the market shares of brands from different countries relative to their baseline market share in 2019. (Strange period to pick out as it includes the entire pandemic.)

The designer demonstrates complete awareness of the intended message of the chart. The lines for Chinese and Japanese brands were bolded to highlight the diverging fortunes, not just in China, but also in Southeast Asia, to various extents.

On this chart, the designer splits out US and German brands from the rest of the world. This is an odd decision because the categorization is not replicated in the other four charts. Thus, the light gray color on this chart excludes U.S. and Germany while the same color on the other charts includes them. I think they could have given U.S. and Germany their own colors throughout.

***

The primacy of local brands is hinted at in the following chart showing how individual brands fared in each Southeast Asian market:

Bloomberg_japancars_seasiamarkets

 

This chart takes the final numbers from the line charts above, that is to say, the change in market share from 2019 to 2024, but now breaks them down by individual brand names. As before, the red bubbles represent Japanese brands, and the gray bubbles Chinese brands. The American and German brands are lumped in with the rest of the world and show up as light gray bubbles.

I'll discuss this chart form in a next post. For now, I want to draw your attention to the Malaysia market which is the last row of this chart.

What we see there are two dominant brands (Perodua, Proton), both from "rest of the world" but both brands are Malaysian. These two brands are the biggest in Malaysia and they account for two of the three highest growing brands there. The other high-growth brand is Chery, which is a Chinese brand; even though it is growing faster, its market share is still much smaller than the Malaysian brands, and smaller than Toyota and Honda. Honda has suffered a lot in this market while Toyota eked out a small gain.

The impression given by this bubble chart is that Chinese brands have not made much of a dent in Malaysia. But that would not be correct, if we believe the line chart above. According to the line chart, Chinese brands roughly earned the same increase in market share (about 3%) as "other" brands.

What about the bubble chart might be throwing us off?

It seems that the Chinese brands were starting from zero, thus the growth is the whole bubble. For the Malaysian brands, the growth is in the outer ring of the bubbles, and the larger the bubble, the thinner is the ring. Our attention is dominated by the bubble size which represents a snapshot in the ending year, providing no information about the growth (which is shown on the horizontal axis).

***

For more discussion of Bloomberg graphics, see here.


the wtf moment

You're reading some article that contains a standard chart. You're busy looking for the author's message on the chart. And then, the wtf moment strikes.

It's the moment when you discover that the chart designer has done something unexpected, something that changes how you should read the chart. It's when you learn that time is running right to left, for example. It's when you realize that negative numbers are displayed up top. It's when you notice that the columns are ordered by descending y-value despite time being on the x-axis.

Tell me about your best wtf moments!

***

The latest case of the wtf moment occurred to me when I was reading Rajiv Sethi's blog post on his theory that Kennedy voters crowded out Cheney voters in the 2024 Presidential election (link). Was the strategy to cosy up to Cheney and push out Kennedy wise?

In the post, Rajiv has included this chart from Pew:

Pew_science_confidence

The chart is actually about the public's confidence in scientists. Rajiv summarizes the message as: 'Public confidence in scientists has fallen sharply since the early days of the pandemic, especially among Republicans. There has also been a shift among Democrats, but of a slightly different kind—the proportion with “a great deal” of trust in scientists to act in our best interests rose during the first few months of the pandemic but has since fallen back.'

Pew produced a stacked column chart, with three levels for each demographic segment and month of the survey. The question about confidence in scientists admits three answers: a great deal, a fair amount, and not too much/None at all. [It's also possible that they offered 4 responses, with the bottom two collapsed as one level in the visual display.]

As I scan around the chart understanding the data, suddenly I realized that the three responses were not listed in the expected order. The top (light blue) section is the middling response of "a fair amount", while the middle (dark blue) section is the "a great deal" answer.

wtf?

***

Looking more closely, this stacked column chart has bells and whistles, indicating that the person who made it expended quite a bit of effort. Whether it's worthwhile effort, it's for us readers to decide.

By placing "a great deal" right above the horizon, the designer made it easier to see the trend in the proportion responding with "a great deal". It's also easy to read the trend of those picking the "negative" response because of how the columns are anchored. In effect, the designer is expressing the opinion that the middle group (which is also the most popular answer) is just background, and readers should not pay much attention to it.

The designer expects readers to care about one other trend, that of the "top 2 box" proportion. This is why sitting atop the columns are the data labels called "NET" which is the sum of those responding "a great deal" or "a fair amount".

***

For me, it's interesting to know whether the prior believers in science who lost faith in science went down one notch or two. Looking at the Republicans, the proportion of "a great deal" roughly went down by 10 percentage points while the proportion saying "Not too much/None at all" went up about 13%. Thus, the shift in the middle segment wasn't enough to explain all of the jump in negative sentiment; a good portion went from believer to skeptic during the pandemic.

As for Democrats, the proportion of believers also dropped by about 10 percentage points while the proportion saying "a fair amount" went up by almost 10 percent, accounting for most of the shift. The proportion of skeptics increased by about 2 percent.

So, for Democrats, I'm imagining a gentle slide in confidence that applies to the whole distribution while for Republicans, if someone loses confidence, it's likely straight to the bottom.

If I'm interested in the trends of all three responses, it's more effective to show the data in a panel like this:

Junkcharts_redo_pew_scientists

***

Remember to leave a comment when you hit your wtf moment next time!

 


Dizziness

Statista uses side-by-side stacked column charts to show the size of different religious groups in the world:

Statista_religiousgroups

It's hard to know where to look. It's so colorful and even the middle section is filled in whereas the typical such chart would only show guiding lines.

What's more, the chart includes gridlines, as well as axis labels.

The axis labels compete with the column section labels, the former being cumulative while the latter isn't.

The religious groups are arranged horizontally in two rows at the top while they are stacked up from bottom to top inside the columns.

The overall effect is dizzying.

***

The key question this chart purportedly address is the change in the importance of religions over the time frame depicted.

Look at the green sections in the middle of the chart, signifying "Unaffiliated" people. The change between the two time points is 16 vs 13 which is -3 percent.

Where is this -3 percent encoded?

It's in the difference in height between the two green blocks. On this design, that's a calculation readers have to do themselves.

One might take the slope of the guiding line that links the tops of the green blocks as indicative of the change, but it's not. In fact the top guiding line slopes upwards, implying an increase over time. That increase is associated with the cumulative total of the top three religious groups, not the share of the Unaffiliated group.

So, if we use those guiding lines, we have to take the difference of two lines, not just the top line. The line linking the bottoms of the green blocks is also relevant. However, the top and bottom lines will in general not be parallel, so readers have to somehow infer from the parallelogram bounded by the guiding lines and vertical block edges that the change in the Unaffiliated group is 3 percent.

Ouch.

***

I generally like to use Bumps charts (also called slopegraphs) to show change across two points in time:

Junkcharts_redo_statistareligiousgroups

What's sacrificed is the cumulation of percentages. I also am pleased that Christian and Muslim, where the movements are greatest, are found at the top of the chart. (There isn't a need to use so many colors; I just inherited them from the original chart.)


Visualizing extremes

The New York Times published this chart to illustrate the extreme ocean heat driving the formation of hurricane Milton (link):

Nyt_oceanheatmilton

The chart expertly shows layers of details.

The red line tracks the current year's data on ocean heat content up to yesterday.

Meaning is added through the gray line that shows the average trajectory of the past decade. With the addition of this average line, we can readily see how different the current year's data is from the past. In particular, we see that the current season can be roughly divided into three parts: from May to mid June, and also from August to October, the ocean heat this year was quite a bit higher than the 10-year average, whereas from mid June to August, it was just above average.

Even more meaning is added when all the individual trajectories from the last decade are shown (in light gray lines). With this addition, we can readily learn how extreme is this year's data. For the post-August period, it's clear that the Gulf of Mexico is the hottest it's been in the past decade. Also, this extreme is not too far from the previous extreme. On the other hand, the extreme in late May-early June is rather more scary. 

 

 


Aligning the visual and the message to hot things up

The headline of this NBC News chart (link) tells readers that Phoenix (Arizona) has been very, very hot this year. It has over 120 days in which the average temperature exceeded 100F (38 C).

Nbcnews_phoenix_tmax

It's not obvious how extreme this situation is. To help readers, it would be useful to add some kind of reference points.

A couple of possibilities come to mind:

First, how many days are depicted in the chart? Since there is one cell for each day of the year, and the day of week is plotted down the vertical axis, we just need to count the number of columns. There are 38 columns, but the first column has one missing cell while the last column has only 3 cells. Thus, the number of days depicted is (36*7)+6+3 = 261. So, the average temperature in Phoenix exceeded 100F on about 46% of the days of the year thus far.

That sounds like a high number. For a better reference point, we'd also like to know the historical average. Is Phoenix just a very hot place? Is 2024 hotter than usual?

***

Let's walk through how one reads the Phoenix "heatmap".

We already figured out that each column represents a week of the year, and each row shows a cross-section of a given day of week throughout the year.

The first column starts on a Monday because the first day of 2024 falls on a Monday. The last column ends on a Tuesday, which corresponds to Sept 17, 2024, the last day of data when this chart was created.

The columns are grouped into months, although such division is complicated by the fact that the number of days in a month (except for a leap month) isn't ever divisible by seven. The designer subtly inserted a thicker border between months. This feature allows readers to comment on the average temperature in a given month. It also lets readers learn quickly that we are two weeks and three days into September.

The color legend explains that temperature readings range from yellow (lower) to red (higher). The range of average daily temperatures during 2024 was 54-118F (12-48C). The color scale is progressive.

Nbcnews_phoenix_colorlegend

Given that 100F is used as a threshold to define "hot days," it makes sense to accentuate this in the visual presentation. For example:

Junkcharts_redo_nbcnewsphoenixmaxtemp

Here, all days with maximum temperature at 100F or above have a red hue.


Small tweaks that make big differences

It's one of those days that a web search led me to an unfamiliar corner, and I found myself poring over a pile of column charts that look like this:

GO-and-KEGG-diagrams-A-Forty-nine-different-GO-term-annotations-of-the-parental-genes

This pair of charts appears to be canonical in a type of genetics analysis. I'll focus on the column chart up top.

The chart plots a variety of gene functions along the horizontal axis. These functions are classified into three broad categories, indicated using axis annotation.

What are some small tweaks that readers will enjoy?

***

First, use colors. Here is an example in which the designer uses color to indicate the function classes:

Fcvm-09-810257-g006-3-colors

The primary design difference between these two column charts is using three colors to indicate the three function classes. This little change makes it much easier to recognize the ending of one class and the start of the other.

Color doesn't have to be limited to column areas. The following example extends the colors to the axis labels:

Fcell-09-755670-g004-coloredlabels

Again, just a smallest of changes but it makes a big difference.

***

It bugs me a lot that the long axis labels are printed in a slanted way, forcing every serious reader to read with slanted heads.

Slanting it the other way doesn't help:

Fig7-swayright

Vertical labels are best read...

OR-43-05-1413-g06-vertical

These vertical labels are best read while doing side planks.

Side-Plank

***

I'm surprised the horizontal alignment is rather rare. Here's one:

Fcell-09-651142-g004-horizontal

 


Tidying up the details

This column chart caught my attention because of the color labels.

Thall_financials2023_pandl

Well, it also concerns me that the chart takes longer to take in than you'd think.

***

The color labels say "FY2123", "FY2022", and "FY1921". It's possible but unlikely that the author is making comparisons across centuries. The year 2123 hasn't yet passed, so such an interpretation would map the three categories to long-ago past, present and far-into-the-future.

Perhaps hyphens were inadvertently left off so "FY2123" means "FY2021 - FY2023". It's odd to report financial metrics in multi-year aggregations. I rule this out because the three categories would then also overlap.

Here's what I think the mistake is: somehow the prefix is rolled forward when it is applied to the years. "FY23", "FY22", "FY21" got turned into "FY[21]23", "FY[20]22", "FY[19]21" instead of putting 20 in all three slots.

The chart appeared in an annual financial report, and the comparisons were mostly about the reporting year versus the year before so I'm pretty confident the last two digits are accurately represented.

Please let me know if you have another key to this puzzle.

In the following, I'm going to assume that the three colors represent the most recent three fiscal years.

***

A few details conspire to blow up our perception time.

There was no extra spacing between groups of columns.

The columns are arranged in reverse time order, with the most recent year shown on the left. (This confuses those of us that use the left-to-right convention.)

The colors are not ordered. If asked to sort the three colors, you will probably suggest what is described as "intuitive" below:

Junkcharts_color_order

The intuitive order aligns with the amount of black added to a base color (hue). But this isn't the order assigned to the three years on the original chart.

***

Some of the other details on the chart are well done. For example, I like the handling of the gridlines and the axes.

The following revision tidies up some of the details mentioned above, without changing the key features of the chart:

Junkcharts_redo_trinhallfinancials