An elaborate data vessel

Visualcapitalist_globaloilproduction

I recently came across the following dataviz showing global oil production (link).

This is an ambitious graphic that addresses several questions of composition.

The raw data show the amount of production by country adding up to the global total. The countries are then grouped by region. Further, the graph presents an oil-and-gas specific grouping, as indicated by the legend shown just below the chart title. This grouping is indicated by the color of the circumference of the circle containing the flag of the country.

This chart form is popular in modern online graphics programs. It is like an elaborate data vessel. Because the countries are lined up around the barrel, a space has been created on three sides to admit labels and text annotations. This is a strength of this chart form.

***

The chart conveys little information about the underlying data. Each country is given a unique odd-shaped polygon, making it impossible to compare sizes. It’s definitely possible to pick out the U.S., Russia and Saudi Arabia as the top producers. But in presenting the ranks of the data, this chart form pales in comparison to a straightforward data table, or a bar chart. The less said about presenting values, the better.

Indeed, our self-sufficiency test exposes the inability of these polygons to convey the data. This is precisely why almost all values of the dataset are present on the chart.

***

The dataviz subtly presumes some knowledge on the part of the readers.

The regions are not directly labeled. The readers must know that Saudi Arabia is in the Middle East, the U.S. is part of North America, etc. Admittedly this is not a big ask, but it is an ask.

It is also assumed that readers know their flags, especially those of smaller countries. Some of the small polygons have no space left for country names and they are labeled with just flags.

Visualcapitalist_globaloilproduction_nocountrylabels

In addition, knowing country acronyms is required for smaller countries as well. For example, in Africa, we find AGO, COG and GAB.

Visualcapitalist_globaloilproduction_countryacronyms

For this chart form the designer treats each country according to the space it has on the chart (except those countries that found themselves on the edges of the barrel). Font sizes, icons, labels, acronyms, data labels, etc. vary.

The readers are assumed to know the significance of OPEC and OPEC+. This grouping plays second fiddle, and can be found via the color of the circumference of the flag icons.

Visualcapitalist_globaloilproduction_opeclegend

I would not have assigned a color to the non-OPEC countries, and would have used just the yellow and blue for OPEC and OPEC+. This is a small edit but it makes the search for the edges more efficient.

Visualcapitalist_globaloilproduction_twoopeclabels

***

Let’s now return to the perception of composition.

In exactly the same manner as individual countries, the larger regions are represented by polygons that have arbitrary shapes. One can strain to compile the rank order of regions but it’s impossible to compare the relative values of production across regions. Perhaps this explains the presence of another chart at the bottom that addresses this regional comparison.

The situation is worse for the OPEC/OPEC+ grouping. Now, the readers must find all flag icons with edges of a specific color, then mentally piece together these arbitrarily shaped polygons, then realize that they won’t fit together nicely, and finally morph the shapes mentally in an area-preserving manner in order to complete this puzzle.

This is why I said earlier this is an elaborate data vessel. It’s nice to look at but it doesn’t convey information about composition as readers might expect it to.

Visualcapitalist_globaloilproduction_excerpt


Dataviz in camouflage

This subway timetable in Tokyo caught my eye:

Tokyosubway_timetable_red

It lists the departure times of all trains going toward Shibuya on Saturdays and holidays.

It's a "stem and leaf" plot.

The stem-and-leaf plot is a crude histogram. In this version, the stem is the hour of the day (24-hour clock) and the leaf is the minute (between 0 and 59). The longer the leaf, the higher the frequency of trains.
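To make the construction concrete, here is a minimal Python sketch that turns a list of departure times into stem-and-leaf rows. The function name and the sample times are made up for illustration; they are not read off the Tokyo timetable.

```python
from collections import defaultdict

def stem_and_leaf(times):
    """Group "HH:MM" departure times by hour (stem) and list the minutes (leaves)."""
    rows = defaultdict(list)
    for t in times:
        hour, minute = t.split(":")
        rows[int(hour)].append(int(minute))
    # One text row per hour; a longer row means more trains in that hour.
    return [
        f"{hour:02d} | " + " ".join(f"{m:02d}" for m in sorted(minutes))
        for hour, minutes in sorted(rows.items())
    ]

# Hypothetical departures, for illustration only.
sample = ["05:12", "05:40", "06:03", "06:18", "06:33", "06:48"]
for row in stem_and_leaf(sample):
    print(row)
```

Printed in a monospaced font, the rows form the same horizontal histogram as the timetable: the hour with the most departures has the longest row.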

We can see that there isn't one peak but rather a plateau between hours 9 and 18.

***

Contrast this with the weekday schedule in blue:

Tokyosubway_timetable_blue

We can clearly see two rush hours, one peak at hour 8 and a second one at hours 17-18.

Love seeing dataviz in camouflage!

 


What is the question is the question

I picked up a Fortune magazine while traveling, and saw this bag of bubbles chart.

Fortune_global500 copy

This chart is visually appealing, that must be said. Each circle represents the combined reported revenues of the corporations on the “Global 500 Companies” list that are headquartered in one city, and it is labeled by that city. The largest bubble shows Beijing, the capital of China, indicating that companies based in Beijing count $6 trillion of revenues amongst them. The colors of the bubbles show large geographical units; the red bubbles are cities in Greater China.

I appreciate a couple of the design decisions. The chart title and legend are placed on the top, making it easy to find one’s bearings – effective while non-intrusive. The labeling signals a layering: the first and biggest group has icons; the second biggest group has both name and value inside the bubbles; the third group has values inside the bubbles but names outside; the smallest group contains no labels.

Note the judgement call the designer made. For cities that readers might not be familiar with, a country name (typically abbreviated) is added. This is a tough call since mileage varies.

***

As I discussed before (link), the bag of bubbles does not elevate comprehension. Just try answering any of the following questions, which any of us may have, using just the bag of bubbles:

  • What proportion of the total revenues are found in Beijing?
  • What proportion of the total revenues are found in Greater China?
  • What are the top 5 cities in Greater China?
  • What are the ranks of the six regions?

If we apply the self-sufficiency test and remove all the value labels, it’s even harder to figure out what’s what.

***

_trifectacheckup_image

Moving to the D corner of the Trifecta Checkup, we aren’t sure how to interpret this dataset. It’s unclear if these companies derive most of their revenues locally, or internationally. A company headquartered in Washington D.C. may earn most of its revenues in other places. Even if Beijing-based companies serve mostly Chinese customers, only a minority of revenues would be directly drawn from Beijing. Some U.S. corporations may choose their headquarters based on tax considerations. It’s a bit misleading to assign all revenues to one city.

As we explore this further, it becomes clear that the designer must establish a target – a strong idea of what question s/he wants to address. The Fortune piece comes with a paragraph. It appears that an important story is the spatial dispersion of corporate revenues in different countries. They point out that U.S. corporate HQs are more distributed geographically than Chinese corporate HQs, which tend to be found in the key cities.

There is a disconnect between the Question and the Data used to create the visualization. There is also a disconnect between the Question and the Visual display.


Losing the plot while stacking up the bars

I came across this chart from an infographic that claims to show which zip codes in the U.S. are the "dirtiest" (link). I won't go into the data analysis in this post - it's the usual "open data" style analysis that takes whatever data they could find (in this case, 311 calls) and makes some hay out of it.

03_Dirtiest-Zip-Codes-in-New-York

It's amazing how such analyses frequently land on the Top N, Bottom N table. Top/Bottom N is euphemistically called "insights". But "insights" should answer at least one of the following questions: Where are these zip codes? Why does 11216 have the highest rate of complaints while 11040 has the lowest? What measures can be taken to make the city cleaner?

***

The basic form chosen for this graphic is the bar chart. The data concerns the number of complaints per 100,000 people (about sanitation - they didn't disclose how they classified a complaint as about sanitation).

To mitigate the "boredom" of bar charts, the designer made the edges of the bars squiggly, and added icons of items found in trash inside the bars. These are thankfully not too intrusive.

Why are all the data printed on the chart? Try mentally wiping the data labels, and you'll understand why the designer did it.

If readers look at data labels rather than the bars, then the data visualization surely has failed. I'd prefer to use an axis.

If you spend a few more minutes on the chart, you may notice the gray parts. This is not the simple bar chart but a stacked bar chart. In effect, every bar is referenced to the first bar, which shows the maximum number of complaints per 100K people. For example, zip code 10474 has about 90% of the complaints experienced in zip code 11216, the "dirtiest" place in New York.

***

The infographic then moves on to Los Angeles, and repeats the Top N/Bottom N presentation:

04_Dirtiest-Zip-Codes-in-Los-Angeles

With this, the plot is lost.

For an inexplicable reason, the dirtiest zip code in LA does not occupy the entire length of the bar. The worst zip code here fills out 87% of the bar length, implying that the entire bar represents the value of 34,978 complaints per 100K people. How did the designer decide on this number?

As a result, every other value is referenced to 34,978 and not to the rate of complaints in the dirtiest zip code!
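The arithmetic behind that observation can be sketched in a few lines of Python. The only inputs are the two figures quoted above (a full bar worth 34,978 complaints per 100K people, and a top bar filled to about 87%); the function names are mine, and the implied top value is approximate because 87% is a rounded reading.

```python
def fill_fraction(value, full_bar_value):
    """Fraction of the bar that is filled when the whole bar stands for full_bar_value."""
    return value / full_bar_value

def implied_value(filled_fraction, full_bar_value):
    """Value implied by a bar filled to filled_fraction of its length."""
    return filled_fraction * full_bar_value

FULL_BAR = 34_978  # complaints per 100K people represented by the entire bar
TOP_FILL = 0.87    # approximate share of the bar filled by the dirtiest LA zip code

top_value = implied_value(TOP_FILL, FULL_BAR)
print(round(top_value))  # roughly 30,400 complaints per 100K people

# Referenced to the dirtiest zip code itself, as in the New York chart,
# the top bar would be completely filled.
print(fill_fraction(top_value, top_value))  # 1.0
```

Any other zip code's bar then shows its value divided by 34,978, which understates its standing relative to the dirtiest zip code by the same factor of about 0.87.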

***

The infographic eventually covers Houston. Here are the dirtiest two zip codes in Houston:

Housefresh_houston_dirtiest2

How does one interpret the orange section of the second bar? The original intention is for us to see that this zip code is about 80% as dirty as the dirtiest zip code. However, the full length of the bar does not here represent the dirtiest zip code.

***

We also got a hint as to why this entire analysis is problematic. The values in LA are way bigger than those in NY, about 4 times higher at the top of the table. Is LA really that much dirtier than NY? Or perhaps the data have not been properly aligned between cities?

 

P.S. [8-26-2023] Added link to the infographic.

 


Partition of Europe

A long-time reader sent me the following map via twitter:

Europeelects_map

This map tells how the major political groups divide up the European Parliament. I’ll spare you the counting. There are 27 countries, and nine political groups (including the "unaffiliated").

The key chart type is a box of dots. Each country gets its own box. Each box has its own width. What determines the width? If you ask me, it’s the relative span of the countries on the map. For example, the narrow countries like Ireland and Portugal have three dots across while the wider countries like Spain, Germany and Italy have 7, 10 and 8 dots across respectively.

Each dot represents one seat in the Parliament. Each dot has one of 9 possible colors. Each color shows a political lean e.g. the green dots represent Green parties while the maroon dots display “Left” parties.

The end result is a counting game. If we are interested in counts of seats, we have to literally count each dot. If we are interested in proportion of seats, pick your poison: either eyeball it or count each color and count the total.

Who does the underlying map serve? Only readers who know the map of Europe. If you don’t know where Hungary or Latvia is, good luck. The physical constraints of the map work against the small-multiples setup of the data. In a small-multiples layout, you want each chart to be identical, except for the country-specific data. The small-multiples structure requires a panel of equal-sized cells. The map does not offer this feature, as many small countries are crammed into Eastern Europe. Also, Europe has a few tiny states e.g. Luxembourg (population 660K) and Malta (population 520K). To overcome the map, the designer produces boxes of different sizes, substantially adding to the cognitive burden on readers.

The map also dictates where the boxes are situated. The centroids of the countries form the scaffolding, with adjustments required when the charts overlap. This restriction ensures a disorderly appearance. By contrast, the regular panel layout of small multiples facilitates comparisons.

***

Here is something I sketched using a tile map.

Eu parties print sm

First, I have to create a tile map of European countries. Some parts, e.g. the western part, are straightforward. The eastern side becomes very congested.

The tile map encodes location in an imprecise sense. Think about the scaffolding of centroids of countries mentioned earlier. The tile map imposes order on the madness - we're shifting these centroids so that they line up in a tidier pattern. What we gain in comparability we concede in location precision.

For the EU tile map, I decided to show the Baltic countries in a row rather than a column; the latter would have been more faithful to the true geography. Malta is shown next to Italy even though it could have been placed below. Similarly, Cyprus in relation to Greece. I also included several key countries that are not part of the EU for context.

Instead of raw seat counts, I'm showing the proportion of seats within each country claimed by each political group. I think this metric is more useful to readers.
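The conversion from seat counts to within-country proportions is simple. Here is a minimal Python sketch; the function name and the example counts are hypothetical and do not come from the Europe Elects data.

```python
def seat_shares(seat_counts):
    """Convert one country's seat counts by political group into proportions."""
    total = sum(seat_counts.values())
    if total == 0:
        raise ValueError("country has no seats")
    return {group: count / total for group, count in seat_counts.items()}

# Hypothetical country with 10 seats, for illustration only.
example = {"Group A": 6, "Group B": 3, "Unaffiliated": 1}
for group, share in seat_shares(example).items():
    print(f"{group}: {share:.0%}")
```

Because every country's shares sum to 100%, each tile can use the same scale, which is what makes the tiles directly comparable.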

The legend is itself a chart that shows the aggregate statistics for all 27 countries.


Graphics that stretch stomachs and make merry

The Washington Post has a fun article about the Hot Dog Eating Contest in Coney Island here.

This graphic shows various interesting insights about the annual competition:

Washingtonpost_hotdogeating_scatter

Joey Chestnut is the recent king of hot-dog eating. Since the late 2000s, he's dominated the competition. He typically chows down over 60 hot dogs in 10 minutes. This is shown by the yellow line. Even at that high level, Chestnut has shown steady growth over time.

The legend tells us that the chart shows the results of all the other competitors. It's pretty clear that few have been able to even get close to Chestnut all these years. Most contestants were able to swallow 30 hot dogs or fewer.

It doesn't appear that the general standard has increased over time.

In 2011, a separate competition for women started. There is also a female champion (Miki Sudo) who has won almost every competition since she started playing.

One strange feature is the lack of competition in the early years. The footnote informs us that the trend is not real - they simply did not keep records of other competitors in early contests.

The only question I can't answer from this chart concerns the general standard and number of female competitors. The chart designer chooses not to differentiate between male and female contestants, other than the champions. I can understand that. Adding another dimension to the chart is a double-edged sword.

***

There is even more fun. There is a little video illustrating theories about what kind of human bodies can take in that many hot dogs in a short time. Here is a screen shot of it:

Washingtonpost_hotdogeating_body

 

 


One bubble is a tragedy, and a bag of bubbles is...

From Kathleen Tyson's twitter account, I came across a graphic showing the destinations of Ukraine's grain exports since 2022 under the auspices of a UN deal. This graphic, made by AFP, uses one of the chart forms that baffle me - the bag of bubbles.

Ukraine_grains_bubbles

The first trouble with a bag of bubbles is the single bubble. The human brain is just not fit for comparing bubble sizes. The self-sufficiency test is my favorite device for demonstrating this weakness. The following is the European section of the above chart, with the data labels removed.

Redo_junkcharts_afp_ukrainegrains_europe_1

How much bigger is Spain than the Netherlands? What's the difference between Italy and the Netherlands? The answers don't come easily to mind. (The Netherlands is about 40% the size of Spain, and Italy is about 20% larger than the Netherlands.)
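One reason these answers are elusive: if the bubbles are scaled by area, which is the usual convention and an assumption here since the AFP chart does not say, then the widths we actually see shrink much more slowly than the values. A small Python sketch (the function name is mine) converts a ratio of values into the ratio of diameters:

```python
import math

def diameter_ratio(value_ratio):
    """Ratio of bubble diameters when bubble AREA is proportional to the value."""
    return math.sqrt(value_ratio)

print(round(diameter_ratio(0.4), 2))  # 0.63 (Netherlands relative to Spain)
print(round(diameter_ratio(1.2), 2))  # 1.1  (Italy relative to the Netherlands)
```

So a bubble worth 40% of Spain's value is still about 63% as wide, and a 20% difference in value shows up as roughly a 10% difference in width.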

While comparing relative circular areas is a struggle, figuring out the relative ranks is not. Sure, it gets tougher with small differences (Germany vs S. Korea, Belgium vs Portugal) but saying those pairs are tied isn't a tragedy.

***

Another issue with bubble charts is how difficult it is to assess absolute values. A circle on its own has no reference point. The designer needs to add data labels or a legend. Adding data labels is an act of giving up. The data labels become the primary instrument for communicating the data, not the visual construct. Adding one data label is not enough, as the following shows:

Redo_junkcharts_afpukrainegrains_2

Being told that Spain's value is 4.1 does little to help estimate the values for the non-labelled bubbles.

The chart does come with the following legend:

Afp_ukrianegrains_legend

For this legend to work, the sample bubble sizes should span the range of the data. Notice that it's difficult to extrapolate from the size of the 1-million-ton bubble to 2-million, 4-million, etc. The analogy is a column chart in which the vertical axis does not extend through the full range of the dataset.

The designer totally gets this. The chart therefore contains both selected data labels and the partial legend. Every bubble larger than 1 million tons has an explicit data label. That's one solution for the above problem.

Nevertheless, why not use another chart form that avoids these problems altogether?

***

In Tyson's tweet, she showed another chart that pretty much contains the same information, this one from TASS.

Ukraine_grains_flows

This chart uses the flow diagram concept - in an abstract way, as I explained in a previous post.

This chart form imposes structure on the data. The relative ranks of the countries within each region are listed from top to bottom. The relative amounts of grains are shown in black columns (and also in the thickness of the flows).

The aggregate value of movements within each region is called out in that middle section. It is impossible to learn this from the bag of bubbles version.

The designer did print the entire dataset onto this chart (except for the smallest countries grouped together as "other"). This decision takes away from the power of the underlying flow chart. Instead of thinking about the proportional representation of each country within its respective region, or the distribution of grains among regions, our eyes home in on the data labels.

This brings me back to the principle of self-sufficiency: if we expect readers to consume the data labels - which comprise the entire dataset, why not just print a data table? If we decide to visualize, make the visual elements count!


When words speak louder than pictures

I've been staring at this chart from the Wall Street Journal (link) about U.S. workers working remotely:

Wsj_remotework_byyear

It's one of those offerings on which, I think, the designer spent a lot of effort without realizing that the reader would spend equal if not more effort deciphering it.

However, the following paragraph lifted straight from the article says exactly what needs to be said:

Workers overall spent an average of 5 hours and 25 minutes a day working from home in 2022. That is about two hours more than in 2019, the year before Covid-19 sent millions of workers scrambling to set up home offices, and down just 12 minutes from 2021, according to the Labor Department’s American Time Use Survey.

***

Why is the chart so hard to read?

_trifectacheckup_image

It's mostly because the visual is fighting the message. In the Trifecta Checkup (link), this is represented by a disconnect between the Q(uestion) and the V(isual) corners - note the green arrow between these two corners.

The message concentrates on two comparisons: first, the increase in amount of remote work after the pandemic; and second, the mild decrease in 2022 relative to 2021.

On the chart, the elements that grab my attention are (a) the green and orange columns (b) the shading in the bottom part of those green and orange columns (c) the thick black line that runs across the chart (d) the indication on the left side that tells me one unit is an hour.

None of those visual elements directly addresses the comparisons. The first comparison - before and after the pandemic - is found by how much the green column spikes above the thick black line. Our comprehension is slowed by the decision to forgo the typical axis labels in favor of chopping columns into one-hour blocks.

The second comparison - between 2022 and 2021 - is found in the white space above the top of the orange column.

So, in reality, the text labels that say exactly what needs to be said are carrying a lot of weight. A slight edit to the pointers helps connect those descriptions to the visual depiction, like this:

Redo_junkcharts_wsj_remotework

I've essentially flipped the tactics used in the various pointers. For the average level of remote work pre-pandemic, I dispense with any pointers while I'm using double-headed arrows to indicate differences across time.

Nevertheless, this modified chart is still too complex.

***

Here is a version that aligns the visual to the message:

Redo_junkcharts_wsj_remotework_2

It's a bit awkward because the 2 hour 48 minutes calculation is the 2021 number minus the average of 2015-19, skipping the 2020 year.
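The figures can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in this post (5 hours 25 minutes in 2022, a 12-minute drop from 2021, and the 2 hour 48 minute difference); the variable names are mine, and the 2015-19 average it prints is implied by those numbers rather than taken from the survey.

```python
from datetime import timedelta

hours_2022 = timedelta(hours=5, minutes=25)       # quoted from the article
drop_from_2021 = timedelta(minutes=12)            # quoted from the article
gap_vs_baseline = timedelta(hours=2, minutes=48)  # 2021 minus the 2015-19 average

hours_2021 = hours_2022 + drop_from_2021
baseline_2015_19 = hours_2021 - gap_vs_baseline   # implied, not reported

def fmt(delta):
    """Format a timedelta as hours and minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    return f"{total_minutes // 60}h {total_minutes % 60:02d}m"

print(fmt(hours_2021))        # 5h 37m
print(fmt(baseline_2015_19))  # 2h 49m
```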

 


Redundancy is great

I have been watching some tennis recently, and noticed that some venues (or broadcasters) have adopted a more streamlined way of showing tiebreak results.

Tennis_tiebreak

(This is an old example I found online. Can't seem to find more recent ones. Will take a screenshot next time I see this on my TV.)

For those not familiar with tennis scoring, the match is best-of-three sets (for Grand Slam men's tournaments, it's best-of-five sets); each set is first to six games, but if the scoreline reaches 5-5, a player must win two consecutive games to win the set at 7-5, or else, the scoreline reaches 6-6, and a tiebreak is played. The tiebreak is first to seven points, or if 6-6 is reached, it's first player to get two points clear. Thus, the possible tiebreak scores are 7-0, 7-1, ..., 7-5, 8-6, 9-7, etc.

A tiebreak score is usually represented in two parts, e.g., 7-6 (7-2).

At some point, some smart person discovered that the score 7-2 contains redundant information. In fact, it is sufficient to show just the score of the losing side in a tiebreak - because the winner's points can be inferred from it.

The rule can be stated as: if the displayed number is 5 or below, then the winner of the tiebreak scored exactly 7 points; and if the displayed number is 6 or above, then the winner scored two points more than that number.

For example, in the attached image, Djokovic won a tiebreak 7-6 (2) which means 7-6 (7-2) while Del Potro won a tiebreak 7-6 (6) which means 7-6 (8-6).
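The inference rule is short enough to write down as code. This is a minimal Python sketch of the rule as stated above; the function name is mine.

```python
def full_tiebreak_score(loser_points):
    """Recover (winner, loser) tiebreak points from the loser's points alone."""
    if loser_points < 0:
        raise ValueError("points cannot be negative")
    # 5 or below: the winner stopped at exactly 7 points.
    # 6 or above: the winner finished two points clear.
    winner_points = 7 if loser_points <= 5 else loser_points + 2
    return winner_points, loser_points

print(full_tiebreak_score(2))  # (7, 2)
print(full_tiebreak_score(6))  # (8, 6)
```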

***

While this discovery satisfies my mathematical side - we always like to find the most concise way to do a proof or computation - it is bad for data communications!

It's just bad practice to make readers do calculations in their heads when the information can be displayed visually.

I found where I saw this single-digit display. It's on the official ATP Tour website.

Atptour score display

***

Just for fun, if we applied the same principle to the display of the entire scoreline, we would arrive at something even more succinct :)

4-6, 7-6(6), 6-4 can simply be written as 4-, -6(6), -4

6-3, 7-6(4), 6-3 is -3, -6(4), -3

6-1, 6-4 is -1, -4

7-5, 4-6, 6-1 is -5, 4-, -1

The shortened display contains the minimal information needed to recover the long-form scoreline. But it fails at communications.
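To show how much work the shortened form pushes onto the reader, here is a Python sketch that expands it back into the long form. It assumes every set is a standard tiebreak set, so the set winner has 6 games when the loser has 4 or fewer and 7 games otherwise; the function names are mine.

```python
import re

def expand_set(token):
    """Expand one shortened set score, e.g. "-6(6)" -> "7-6(6)", "4-" -> "4-6"."""
    match = re.fullmatch(r"(\d*)-(\d*)(\(\d+\))?", token.strip())
    if match is None:
        raise ValueError(f"unrecognized set score: {token!r}")
    left, right, tiebreak = match.group(1), match.group(2), match.group(3) or ""
    if bool(left) == bool(right):
        raise ValueError("exactly one side of the score should be shown")
    loser_games = int(left or right)
    winner_games = 6 if loser_games <= 4 else 7
    if left:
        # The number is on the left, so the first-listed player lost this set.
        return f"{loser_games}-{winner_games}{tiebreak}"
    return f"{winner_games}-{loser_games}{tiebreak}"

def expand_scoreline(short):
    """Expand a comma-separated shortened scoreline."""
    return ", ".join(expand_set(token) for token in short.split(","))

print(expand_scoreline("4-, -6(6), -4"))  # 4-6, 7-6(6), 6-4
print(expand_scoreline("-5, 4-, -1"))     # 7-5, 4-6, 6-1
```

A dozen lines of parsing and inference, just to read a score: that is the calculation the shortened display asks readers to do in their heads.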

In this case, redundancy is great.

 


Tile maps on a trip

My friend Ray sent me to a recent blog about tile maps. Typical tile maps use squares or hexagons, although in theory many other shapes will do. Unsurprisingly, the field follows the latest developments from math researchers who study the space packing problem. The space packing problem concerns how to pack a space with objects. The study of tessellations concerns packing a space with one or a few shapes.

It was an open question until recently whether there exists an "aperiodic monotile," that is to say, a single shape that can cover space in a non-repeating manner. We all know that we can use squares to cover a space, which creates the familiar grid of squares, but in that case, a pattern repeats itself all over the space.

Now, some researchers have found an elusive aperiodic monotile, which they dubbed the Einstein monotile. Below is a tessellation using these tiles:

Einsteintiles

Within this design, one cannot find a set of contiguous tiles that repeats itself.

The blogger then made a tile map using this new tessellation. Here's one:

Gravitywitheinsteintiles

It doesn't matter what this is illustrating. The blog author cites a coworker, who said: "I can think of no proper cartographic use for Penrose binning, but it’s fun to look at, and so that’s good enough for me." Penrose tiles are another mathematical invention that can be used in a tessellation. The story is still the same: there is no benefit from using these strange-looking shapes, other than the curiosity factor.

***

Let's review the pros and cons of using tile maps.

Compare a typical choropleth map of the United States (by state) and a tile map by state. The former has the well-known problem that states with the largest areas usually have the lowest population densities, and thus, if we plot demographic data on such maps, the states that catch the most attention are the ones that don't weigh as much - by contrast, the densely populated states in New England barely show up.

The tile map removes this area bias, thus resolving this problem. Every state is represented by equal area.

While the tessellated design is frequently better, it's not always. In many data visualizations, we do intend to convey the message that not all states are equal!

The grid arrangement of the state tiles also makes it easier to find regional patterns. A regional pattern is defined here as a set of neighboring states that share similar data (encoded in the color of the tiles). Note that the area of each state is of zero interest here, and thus, the accurate descriptions of relative areas found on the usual map are a distraction.

However, on the tile map, these regional patterns are conceptual. One must not read anything into the shape of the aggregated region, or its boundaries. Indeed, if we use strange-looking shapes like Einstein tiles, the boundaries are completely meaningless, and even misleading.

There also usually is some distortion of the spatial coordinates on a tile map because we'd like to pack the squares or hexagons into a lattice-like structure.

Lastly, the tile map is not scalable. We haven't seen a tile map of the U.S. by county or precinct but we have enjoyed many choropleth maps displaying county- or precinct-level data, e.g. the famous Purple Map of America. There is a reason for this.

***

Here is an old post that contains links to various other posts I've written about tile maps.