Small tweaks that make big differences

It's one of those days that a web search led me to an unfamiliar corner, and I found myself poring over a pile of column charts that look like this:

GO-and-KEGG-diagrams-A-Forty-nine-different-GO-term-annotations-of-the-parental-genes

This pair of charts appears to be canonical in a type of genetics analysis. I'll focus on the column chart up top.

The chart plots a variety of gene functions along the horizontal axis. These functions are classified into three broad categories, indicated using axis annotation.

What are some small tweaks that readers will enjoy?

***

First, use colors. Here is an example in which the designer uses color to indicate the function classes:

Fcvm-09-810257-g006-3-colors

The primary design difference between these two column charts is using three colors to indicate the three function classes. This little change makes it much easier to recognize the ending of one class and the start of the other.

Color doesn't have to be limited to column areas. The following example extends the colors to the axis labels:

Fcell-09-755670-g004-coloredlabels

Again, just a smallest of changes but it makes a big difference.

***

It bugs me a lot that the long axis labels are printed in a slanted way, forcing every serious reader to read with slanted heads.

Slanting it the other way doesn't help:

Fig7-swayright

Vertical labels are best read...

OR-43-05-1413-g06-vertical

These vertical labels are best read while doing side planks.

Side-Plank

***

I'm surprised the horizontal alignment is rather rare. Here's one:

Fcell-09-651142-g004-horizontal

 


Excess delay

The hot topic in New York at the moment is congestion pricing for vehicles entering Manhattan, which is set to debut during the month of June. I found this chart (link) that purports to prove the effectiveness of London's similar scheme introduced a while back.

Transportxtra_2

This is a case of the visual fighting against the data. The visual feels very busy and yet the story lying beneath the data isn't that complex.

This chart was probably designed to accompany some text which isn't available free from that link so I haven't seen it. The reader's expectation is to compare the periods before and after the introduction of congestion charges. But even the task of figuring out the pre- and post-period is taking more time than necessary. In particular, "WEZ" is not defined. (I looked this up, it's "Western Extension Zone" so presumably they expanded the area in which charges were applied when the travel rates went back to pre-charging levels.)

The one element of the graphic that raises eyebrows is the legend which screams to be read.

Transportxtra_londoncongestioncharge_legend

Why are there four colors for two items? The legend is not self-sufficient. The reader has to look at the chart itself and realize that purple is the pre-charging period while green (and blue) is the post-charging period (ignoring the distinction between CCZ and WEZ).

While we are solving this puzzle, we also notice that the bottom two colors are used to represent an unchanging quantity - which is the definition of "no congestion". This no-congestion travel rate is a constant throughout the chart and yet a lot of ink of two colors have been spilled on it. The real story is in the excess delay, which the congestion charging scheme was supposed to reduce.

The excess on the chart isn't harmless. The excess delay on the roads has been transferred to the chart reader. It actually distracts from the story the analyst is wanting to tell. Presumably, the story is that the excess delays dropped quite a bit after congestion charging was introduced. About four years later, the travel rates had creeped back to pre-charging levels, whereupon the authorities responded by extending the charging zone to WEZ (which as of the time of the chart, wasn't apparently bringing the travel rate down.)

Instead of that story, the excess of the chart makes me wonder... the roads are still highly congested with travel rates far above the level required to achieve no congestion, even after the charging scheme was introduced.

***

I started removing some of the excess from the chart. Here's the first cut:

Junkcharts_redo_transportxtra_londoncongestioncharge

This is better but it is still very busy. One problem is the choice of columns, even though the data are found strictly on the top of each column. (Besides, when I chop off the unchanging sections of the columns, I created a start-not-from-zero problem.) Also, the labeling of the months leaves much to be desired, there are too many grid lines, etc.

***

Here is the version I landed on. Instead of columns, I use lines. When lines are used, there is no need for month labels since we can assume a reader knows the structure of months within a year.

Junkcharts_redo_transportxtra_londoncongestioncharge-2

A priniciple I hold dear is not to have legends unless it is absolutely required. In this case, there is no need to have a legend. I also brought back the notion of a uncongested travel speed, with a single line (and annotation).

***

The chart raises several questions about the underlying analysis. I'd interested in learning more about "moving car observer surveys". What are those? Are they reliable?

Further, for evidence of efficacy, I think the pre-charging period must be expanded to multiple years. Was 2002 a particularly bad year?

Thirdly, assuming WEZ indicates the expansion of the program to a new geographical area, I'm not sure whether the data prior to its introduction represents the travel rate that includes the WEZ (despite no charging) or excludes it. Arguments can be made for each case so the key from a dataviz perspective is to clarify what was actually done.

 

P.S. [6-6-24] On the day I posted this, NY State Governer decided to cancel the congestion pricing scheme that was set to start at the end of June.


Reading log: HBR's specialty bar charts

Today, I want to talk about a type of analysis that I used to ask students to do. I'm calling it a reading log analysis – it's a reading report that traces how one consumes a dataviz work from where your eyes first land to the moment of full comprehension (or abandonment, if that is the outcome). Usually, we do this orally during a live session, but it's difficult to arrive at a full report within the limited class time. A written report overcomes this problem. A stack of reading logs should be a gift to any chart designer.

My report below is very detailed, reflecting the amount of attention I pay to the craft. Most readers won't spend as much time consuming a graphic. The value of the report is not only in what it covers but also in what it does not mention.

***

The chart being analyzed showed up in a Harvard Business Review article (link), and it was submitted by longtime reader Howie H.

Hbr_specialbarcharts

First and foremost, I recognized the chart form as a bar chart. It's an advanced bar chart in which each bar has stacked sections and a vertical line in the middle. Now, I wanted to figure out how data enter the picture.

My eyes went to the top legend which tells me the author was comparing the proportion of respondents who said "business should take responsibility" to the proportion who rated "business is doing well". The difference in proportions is called the "performance gap". I glanced quickly at the first row label to discover the underlying survey addresses social issues such as environmental concerns.

Next, I looked at the first bar, trying to figure out its data encoding scheme. The bold, blue vertical line in the middle of the bar caused me to think each bar is split into left and right sections. The right section is shaded and labeled with the performance gap numbers so I focused on the segment to the left of the blue line.

My head started to hurt a little. The green number (76%) is associated with the left edge of the left section of the bar. And if the blue line represents the other number (29%), then the width of the left section should map to the performance gap. This interpretation was obviously incorrect since the right section already showed the gap, and the width of the left section was not equal to that of the right shaded section.

I jumped to the next row. My head hurt a little bit more. The only difference between the two rows is the green number being 74%, 2 percent smaller. I couldn't explain how the left sections of both bars have the same width, which confirms that the left section doesn't display the performance gap (assuming that no graphical mistakes have been made). It also appeared that the left edge of the bar was unrelated to the green number. So I retreated to square one. Let's start over. How were the data encoded in this bar chart?

I scrolled down to the next figure, which applies the same chart form to other data.

Hbr_specialbarcharts_2

I became even more confused. The first row showed labels (green number 60%, blue number 44%, performance gap -16%). This bar is much bigger than the one in the previous figure, even though 60% was less than 76%. Besides, the left section, which is bracketed by the green number on the left and the blue number on the right, appeared much wider than the 16% difference that would have been merited. I again lapsed into thinking that the left section represents performance gaps.

Then I noticed that the vertical blue lines were roughly in proportion. Soon, I realized that the total bar width (both sections) maps to the green number. Now back to the first figure. The proportion of respondents who believe business should take responsibility (green number) is encoded in the full bar. In other words, the left edges of all the bars represent 0%. Meanwhile the proportion saying business is doing well is encoded in the left section. Thus, the difference between the full width and the left-section width is both the right-section width and the performance gap.

Here is an edited version that clarifies the encoding scheme:

Hbr_specialbarcharts_2

***

That's my reading log. Howie gave me his take:

I had to interrupt my reading of the article for quite a while to puzzle this one out. It's sorted by performance gap, and I'm sure there's a better way to display that. Maybe a dot plot, similar to here - https://junkcharts.typepad.com/junk_charts/2023/12/the-efficiency-of-visual-communications.html.

A dot plot might look something like this:

Junkcharts_redo_hbr_specialcharts_2
Howie also said:

I interpret the authros' gist to be something like "Companies underperform public expectations on a wide range of social challenges" so I think I'd want to focus on the uniform direction and breadth of the performance gap more than the specifics of each line item.

And I agree.


Neither the forest nor the trees

On the NYT's twitter feed, they featured an article titled "These Seven Tech Stocks are Driving the Market". The first sentence of the article reads: "The S&P 500 is at an all-time high, and investors have just a handful of stocks to thank for it."

Without having seen any data, I'd surmise from that line that (a) the S&P 500 index has gone up recently, and (b) most if not all of the gain in the index can be attributed to gains in the tech stocks mentioned in the headline. (For purists, a handful is five, not seven.)

The chart accompanying the tweet is a treemap:

Nyt_magnificentseven

The treemap is possibly the most overhyped chart type of the modern era. Its use here is tangential to the story of surging market value. That's because the treemap presents a snapshot of the composition of the index, but contains nothing about the trend (change over time) of the average index value or of its components.

***

Even in representing composition, the treemap is inferior to, gasp, a pie chart. Of course, we can only use a pie chart for small numbers of components. The following illustration takes the data from the NYT chart on the Magnificent Seven tech stocks, and compares a treemap versus a pie chart side by side:

Junkcharts_redo_nyt_magnificent7

The reason why the treemap is worse is that both the width and the height of the boxes are changing while only the radius (or angle) of the pie slices is varying. (Not saying use a pie chart, just saying the treemap is worse.)

There is a reason why the designer appended data labels to each of the seven boxes. The effect of not having those labels is readily felt when our eyes reach the next set of stocks – which carry company names but not their market values. What is the market value of Berkshire Hathaway?

Even more so, what proportion of the total is the market value of Berkshire Hathaway? Indeed, if the designer did not write down 29%, it would take a bit of work to figure out the aggregate value of yellow boxes relative to the entire box!

This design sucessfully draws our attention to the structural importance of various components of the whole. There are three layers - the yellow boxes (Magnificent Seven), the gray boxes with company names, and the other gray boxes. I also like how they positioned the text on the right column.

***

Going inside the NYT article itself, we find two line charts that convey the story as told.

Here's the first one:

Nyt_magnificent7_linechart1

They are comparing the most recent stock prices with those from October 12 2022, which is identified as the previous "low". (I'm actually confused by how the most recent "low" is defined, but that's a different subject.)

This chart carries a lot of good information, even though it does not plot "all the data", as in each of the 500 S&P components individually. Over the period under analysis, the average index value has gone up about 35% while the Magnificent Seven's value have skyrocketed by 65% in aggregate. The latter accounted for 30% of the total value at the most recent time point.

If we set the S&P 500 index value in 2024 as 100, then the M7 value in 2024 is 30. After unwinding the 65% growth, the M7 value in October 2022 was 18; the S&P 500 in October 2022 was 74. Thus, the weight of M7 was 24% (18/74) in October 2022, compared to 30% now. Consequently, the weight of the other 473 stocks declined from 76% to 70%.

This isn't even the full story because most of the action within the M7 is in Nvidia, the stock most tightly associated with the current AI hype, as shown in the other line chart.

Nyt_magnificent7_linechart2

Nvidia's value jumped by 430% in that time window. From the treemap, the total current value of M7 is $12.3 b while Nvidia's value is $1.4 b, thus Nvidia is 11.4% of M7 currently. Since M7 is 29% of the total S&P 500, Nvidia is 11.4%*29% = 3% of the S&P. Thus, in 2024, against 100 for the S&P, Nvidia's share is 3. After unwinding the 430% growth, Nvidia's share in October 2022 was 0.6, about 0.8% of 74. Its weight tripled during this period of time.


Messing with expectations

A co-worker sent me to the following map, found in Forbes:

Forbes_gastaxmap

It shows the amount of state tax surcharge per gallon of gas in the U.S. And it's got one of the most common issues found in choropleth maps - the color scheme runs opposite to reader expectations.

Typically, if we see a red-green color scale, we would expect red to represent large numbers and green, small numbers. This map reverses the typical setup: California, the state with the heftiest gas tax, is shown green.

I know, I know - if we apply the typical color scheme, California would bleed red, and it's a blue state, damn it.

The solution is to avoid the red color. Just don't use red or blue.

Junkcharts_redo_forbes_gastaxmap_green

There is no need to use two colors either.

***

A few minor fixes. Given that all dollar amounts on the map are shown to two decimal places, the legend labels should also be shown to 2 decimal places, and with dollar signs.

Forbes_gastaxmap_legend

The subtitle should read "Dollars per gallon" instead of "Cents per gallon". Alternatively, keep "Cents per gallon" but convert all data labels into cents.

Some of the states are missing data labels.

***

I recast this as a small-multiples by categorizing states into four subgroups.

Junkcharts_redo_forbes_gastaxmap_split

With this change, one can almost justify using maps because there is sort of a spatial pattern.

 

 


An elaborate data vessel

Visualcapitalist_globaloilproductionI recently came across the following dataviz showing global oil production (link).

This is an ambitious graphic that addresses several questions of composition.

The raw data show the amount of production by country adding up to the global total. The countries are then grouped by region. Further, the graph presents an oil-and-gas specific grouping, as indicated by the legend shown just below the chart title. This grouping is indicated by the color of the circumference of the circle containing the flag of the country.

This chart form is popular in modern online graphics programs. It is like an elaborate data vessel. Because the countries are lined up around the barrel, a space has been created on three sides to admit labels and text annotations. This is a strength of this chart form.

***

The chart conveys little information about the underlying data. Each country is given a unique odd shaped polygon, making it impossible to compare sizes. It’s definitely possible to pick out U.S., Russia, Saudi Arabia as the top producers. But in presenting the ranks of the data, this chart form pales in comparison to a straightforward data table, or a bar chart. The less said about presenting values, the better.

Indeed, our self-sufficiency test exposes the inability of these polygons to convey the data. This is precisely why almost all values of the dataset are present on the chart.

***

The dataviz subtly presumes some knowledge on the part of the readers.

The regions are not directly labeled. The readers must know that Saudi Arabia is in the Middle East, U.S. is part of North America, etc. Admittedly this is not a big ask, but it is an ask.

It is also assumed that readers know their flags, especially those of smaller countries. Some of the small polygons have no space left for country names and they are labeled with just flags.

Visualcapitalist_globaloilproduction_nocountrylabels

In addition, knowing country acronyms is required for smaller countries as well. For example, in Africa, we find AGO, COG and GAB.

Visualcapitalist_globaloilproduction_countryacronyms

For this chart form the designer treats each country according to the space it has on the chart (except those countries that found themselves on the edges of the barrel). Font sizes, icons, labels, acronyms, data labels, etc. vary.

The readers are assumed to know the significance of OPEC and OPEC+. This grouping is given second fiddle, and can be found via the color of the circumference of the flag icons.

Visualcapitalist_globaloilproduction_opeclegend

I’d have not assigned a color to the non-OPEC countries, and just use the yellow and blue for OPEC and OPEC+. This is a little edit but makes the search for the edges more efficient.

Visualcapitalist_globaloilproduction_twoopeclabels

***

Let’s now return to the perception of composition.

In exactly the same manner as individual countries, the larger regions are represented by polygons that have arbitrary shapes. One can strain to compile the rank order of regions but it’s impossible to compare the relative values of production across regions. Perhaps this explains the presence of another chart at the bottom that addresses this regional comparison.

The situation is worse for the OPEC/OPEC+ grouping. Now, the readers must find all flag icons with edges of a specific color, then mentally piece together these arbitrarily shaped polygons, then realizing that they won’t fit together nicely, and so must now mentally morph the shapes in an area-preserving manner, in order to complete this puzzle.

This is why I said earlier this is an elaborate data vessel. It’s nice to look at but it doesn’t convey information about composition as readers might expect it to.

Visualcapitalist_globaloilproduction_excerpt


What is the question is the question

I picked up a Fortune magazine while traveling, and saw this bag of bubbles chart.

Fortune_global500 copy

This chart is visually appealing, that must be said. Each circle represents the reported revenues of a corporation that belongs to the “Global 500 Companies” list. It is labeled by the location of the company’s headquarters. The largest bubble shows Beijing, the capital of China, indicating that companies based in Beijing count $6 trillion dollars of revenues amongst them. The color of the bubbles show large geographical units; the red bubbles are cities in Greater China.

I appreciate a couple of the design decisions. The chart title and legend are placed on the top, making it easy to find one’s bearing – effective while non-intrusive. The labeling signals a layering: the first and biggest group have icons; the second biggest group has both name and value inside the bubbles; the third group has values inside the bubbles but names outside; the smallest group contains no labels.

Note the judgement call the designer made. For cities that readers might not be familiar with, a country name (typically abbreviated) is added. This is a tough call since mileage varies.

***

As I discussed before (link), the bag of bubbles does not elevate comprehension. Just try answering any of the following questions, which any of us may have, using just the bag of bubbles:

  • What proportion of the total revenues are found in Beijing?
  • What proportion of the total revenues are found in Greater China?
  • What are the top 5 cities in Greater China?
  • What are the ranks of the six regions?

If we apply the self-sufficiency test and remove all the value labels, it’s even harder to figure out what’s what.

***

_trifectacheckup_image

Moving to the D corner of the Trifecta Checkup, we aren’t sure how to interpret this dataset. It’s unclear if these companies derive most of their revenues locally, or internationally. A company headquartered in Washington D.C. may earn most of its revenues in other places. Even if Beijing-based companies serve mostly Chinese customers, only a minority of revenues would be directly drawn from Beijing. Some U.S. corporations may choose its headquarters based on tax considerations. It’s a bit misleading to assign all revenues to one city.

As we explore this further, it becomes clear that the designer must establish a target – a strong idea of what question s/he wants to address. The Fortune piece comes with a paragraph. It appears that an important story is the spatial dispersion of corporate revenues in different countries. They point out that U.S. corporate HQs are more distributed geographically than Chinese corporate HQs, which tend to be found in the key cities.

There is a disconnect between the Question and the Data used to create the visualization. There is also a disconnect between the Question and the Visual display.


When words speak louder than pictures

I've been staring at this chart from the Wall Street Journal (link) about U.S. workers working remotely:

Wsj_remotework_byyear

It's one of those offerings I think on which the designer spent a lot of effort, but ultimately didn't realize that the reader would spend equal if not more effort deciphering.

However, the following paragraph lifted straight from the article says exactly what needs to be said:

Workers overall spent an average of 5 hours and 25 minutes a day working from home in 2022. That is about two hours more than in 2019, the year before Covid-19 sent millions of workers scrambling to set up home oces, and down just 12 minutes from 2021, according to the Labor Department’s American Time Use Survey.

***

Why is the chart so hard to read?

_trifectacheckup_imageIt's mostly because the visual is fighting the message. In the Trifecta Checkup (link), this is represented by a disconnect between the Q(uestion) and the V(isual) corners - note the green arrow between these two corners.

The message concentrates on two comparisons: first, the increase in amount of remote work after the pandemic; and second, the mild decrease in 2022 relative to 2021.

On the chart, the elements that grab my attention are (a) the green and orange columns (b) the shading in the bottom part of those green and orange columns (c) the thick black line that runs across the chart (d) the indication on the left side that tells me one unit is an hour.

None of those visual elements directly addresses the comparisons. The first comparison - before and after the pandemic - is found by how much the green column spikes above the thick black line. Our comprehension is retarded by the decision to forego the typical axis labels in favor of chopping columns into one-hour blocks.

The second comparison - between 2022 and 2021 - is found in the white space above the top of the orange column.

So, in reality, the text labels that say exactly what needs to be said are carrying a lot of weight. A slight edit to the pointers helps connect those descriptions to the visual depiction, like this:

Redo_junkcharts_wsj_remotework

I've essentially flipped the tactics used in the various pointers. For the average level of remote work pre-pandemic, I dispense of any pointers while I'm using double-headed arrows to indicate differences across time.

Nevertheless, this modified chart is still too complex.

***

Here is a version that aligns the visual to the message:

Redo_junkcharts_wsj_remotework_2

It's a bit awkward because the 2 hour 48 minutes calculation is the 2021 number minus the average of 2015-19, skipping the 2020 year.

 


Parsons Student Projects

I had the pleasure of attending the final presentations of this year's graduates from Parsons's MS in Data Visualization program. You can see the projects here.

***

A few of the projects caught my eye.

A project called "Authentic Food in NYC" explores where to find "authentic" cuisine in New York restaurants. The project is notable for plowing through millions of Yelp reviews, and organizing the information within. Reviews mentioning "authentic" or "original" were extracted.

During the live presentation, the student clicked on Authentic Chinese, and the name that popped up was Nom Wah Tea Parlor, which serves dim sum in Chinatown that often has lines out the door.

Shuyaoxiao_authenticfood_parsons

Curiously, the ranking is created from raw counts of authentic reviews, which favors restaurants with more reviews, such as restaurants that have been operating for a longer time. It's unclear what rule is used to transfer authenticity from reviews to restaurants: does a single review mentioning "authentic" qualify a restaurant as "authentic", or some proportion of reviews?

Later, we see a visualization of the key words found inside "authentic" reviews for each cuisine. Below are words for Chinese and Italian cuisines:

Shuyaoxiao_authenticcuisines_parsons_words

These are word clouds with a twist. Instead of encoding the word counts in the font sizes, she places each word inside a bubble, and uses bubble sizes to indicate relative frequency.

Curiously, almost all the words displayed come from menu items. There isn't any subjective words to be found. Algorithms that extract keywords frequently fail in the sense that they surface the most obvious, uninteresting facts. Take the word cloud for Taiwanese restaurants as an example:

Shuyaoxiao_authenticcuisines_parsons_taiwan

The overwhelming keyword found among reviews of Taiwanese restaurants is... "taiwanese". The next most important word is "taiwan". Among the remaining words, "886" is the name of a specific restaurant, "bento" is usually associated with Japanese cuisine, and everything else is a menu item.

Getting this right is time-consuming, and understandably not a requirement for a typical data visualization course.

The most interesting insight is found in this data table.

Shuyaoxiao_authenticcuisines_ratios

It appears that few reviewers care about authenticity when they go to French, Italian, and Japanese restaurants but the people who dine at various Asian restaurants, German restaurants, and Eastern European restaurants want "authentic" food. The student concludes: "since most Yelp reviewers are Americans, their pursuit of authenticity creates its own trap: Food authenticity becomes an americanized view of what non-American food is."

This hits home hard because I know what authentic dim sum is, and Nom Wah Tea Parlor it ain't. Let me check out what Yelpers are saying about Nom Wah:

  1. Everything was so authentic and delicious - and cheap!!!
  2. Your best bet is to go around the corner and find something more authentic.
  3. Their dumplings are amazing everything is very authentic and tasty!
  4. The food was delicious and so authentic, and the staff were helpful and efficient.
  5. Overall, this place has good authentic dim sum but it could be better.
  6. Not an authentic experience at all.
  7. this dim sum establishment is totally authentic
  8. The onions, bean sprouts and scallion did taste very authentic and appreciated that.
  9. I would skip this and try another spot less hyped and more authentic.
  10. I would have to take my parents here the next time I visit NYC because this is authentic dim sum.

These are the most recent ten reviews containing the word "authentic". Seven out of ten really do mean authentic, the other three are false friends. Text mining is tough business! The student removed "not authentic" which helps. As seen from above, "more authentic" may be negative, and there may be words between "not" and "authentic". Also, think "not inauthentic", "people say it's authentic, and it's not", etc.

One thing I learned from this project is that "authentic" may be a synonym for "I like it" when these diners enjoy the food at an ethnic restaurant. I'm most curious about what inauthentic onions, bean sprouts and scallion taste like.

I love the concept and execution of this project. Nice job!

***

Another project I like is about tourism in Venezuela. The back story is significant. Since a dictatorship took over the country, the government stopped reporting tourism statistics. It's known that tourism collapsed, and that it may be gradually coming back in recent years.

This student does not have access to ready-made datasets. But she imaginatively found data to pursue this story. Specifically, she mentioned grabbing flight schedules into the country from the outside.

The flow chart is a great way to explore this data:

Ibonnet_parsons_dataviz_flightcities

A map gives a different perspective:

Ibonnet_parsons_dataviz_flightmap

I'm glad to hear the student recite some of the limitations of the data. It's easy to look at these visuals and assume that the data are entirely reliable. They aren't. We don't know that what proportion of the people traveling on those flights are tourists, how full those planes are, or the nationalities of those on board. The fact that a flight originated from Panama does not mean that everyone on board is Panamanian.

***

The third project is interesting in its uniqueness. This student wants to highlight the effect of lead in paint on children's health. She used the weight of lead marbles to symbolize the impact of lead paint. She made a dress with two big pockets to hold these marbles.

Scherer_parsons_dataviz_leaddress sm

It's not your standard visualization. One can quibble that dividing the marbles into two pockets doesn't serve a visualziation purpose, and so on. But at the end, it's a memorable performance.


Deconstructing graphics as an analysis tool in dataviz

One of the useful exercises I like to do with charts is to "deconstruct" them. (This amounts to a deeper version of the self-sufficiency test.)

Here is a chart stripped down to just the main visual elements.

Junkcharts_cbcrevenues_deconstructed1

The game is to guess what is the structure of the data given these visual elements.

I guessed the following:

  • The data has a top-level split into two groups
  • Within each group, the data is further split into 3 parts, corresponding to the 3 columns
  • With each part, there are a variable number of subparts, each of which is given a unique color
  • The color legend suggests that each group's data are split into 7 subparts, so I'm guessing that the 7 subparts are aggregated into 3 parts
  • The core chart form is a stacked column chart with absolute values so relative proportions within each column (part) is important
  • Comparing across columns is not supported because each column has its own total value
  • Comparing same-color blocks across the two groups is meaningful. It's easier to compare their absolute values but harder to compare the relative values (proportions of total)

If I knew that the two groups are time periods, I'd also guess that the group on the left is the earlier time period, and the one on the right is the later time period. In addition to the usual left-to-right convention for time series, the columns are getting taller going left to right. Many things (not all, obviously) grow over time.

The color choice is a bit confusing because if the subparts are what I think they are, then it makes more sense to use one color and different shades within each column.

***

The above guesses are a mixed bag. What one learns from the exercise is what cues readers are receiving from the visual structure.

Here is the same chart with key contextual information added back:

Junkcharts_cbcrevenues_deconstructed2

Now I see that the chart concerns revenues of a business over two years.

My guess on the direction of time was wrong. The more recent year is placed on the left, counter to convention. This entity therefore suffered a loss of revenues from 2017-8 to 2018-9.

The entity receives substantial government funding. In 2017-8, it has 1 dollar of government funds for every 2 dollars of revenues. In 2018-9, it's roughly 2 dollars of government funds per every 3 dollars of revenues. Thus, the ratio of government funding to revenues has increased.

On closer inspection, the 7 colors do not represent 7 components of this entity's funding. The categories listed in the color legend overlap.

It's rather confusing but I missed one very important feature of the chart in my first assessment: the three columns within each year group are nested. The second column breaks down revenues into 3 parts while the third column subdivides advertising revenues into two parts.

What we've found is that this design does not offer any visual cues to help readers understand how the three columns within a year-group relates to each other. Adding guiding lines or changing the color scheme helps.

***

Next, I add back the data labels:

Cbc_revenues_original

The system of labeling can be described as: label everything that is not further broken down into parts on the chart.

Because of the nested structure, this means two of the column segments, which are the sums of subparts, are not labeled. This creates a very strange appearance: usually, the largest parts are split into subparts, so such a labeling system means the largest parts/subparts are not labeled while the smaller, less influential, subparts are labeled!

You may notice another oddity. The pink segment is well above $1 billion but it is roughly the size of the third column, which represents $250 million. Thus, these columns are not drawn to scale. What happened? Keep reading.

***

Here is the whole chart:

Cbc_revenues_original

A twitter follower sent me this chart. Elon Musk has been feuding with the Canadian broadcaster CBC.

Notice the scale of the vertical axis. It has a discontinuity between $700 million and $1.7 billion. In other words, the two pink sections are artificially shortened. The erased section contains $1 billion (!) Notice that the erased section is larger than the visible section.

The focus of Musk's feud with CBC is on what proportion of the company's funds come from the government. On this chart, the only way to figure that out is to copy out the data and divide. It's roughly 1.2/1.7 = 70% approx.

***

The exercise of deconstructing graphics helps us understand what parts are doing what, and it also reveals what cues certain parts send to readers.

In better dataviz, every part of the chart is doing something useful, it's free of redundant parts that take up processing time for no reason, and the cues to readers move them towards the intended message, not away from it.

***

A couple of additional comments:

I'm not sure why old data was cited because in the most recent accounting report, the proportion of government funding was around 65%.

Source of funding is not a useful measure of pro- or anti-government bias, especially in a democracy where different parties lead the government at different times. There are plenty of mouthpiece media that do not apparently receive government funding.