Reading log: HBR's specialty bar charts

Today, I want to talk about a type of analysis that I used to ask students to do. I'm calling it a reading log analysis – it's a reading report that traces how one consumes a dataviz work from where your eyes first land to the moment of full comprehension (or abandonment, if that is the outcome). Usually, we do this orally during a live session, but it's difficult to arrive at a full report within the limited class time. A written report overcomes this problem. A stack of reading logs should be a gift to any chart designer.

My report below is very detailed, reflecting the amount of attention I pay to the craft. Most readers won't spend as much time consuming a graphic. The value of the report is not only in what it covers but also in what it does not mention.

***

The chart being analyzed showed up in a Harvard Business Review article (link), and it was submitted by longtime reader Howie H.

Hbr_specialbarcharts

First and foremost, I recognized the chart form as a bar chart. It's an advanced bar chart in which each bar has stacked sections and a vertical line in the middle. Now, I wanted to figure out how data enter the picture.

My eyes went to the top legend which tells me the author was comparing the proportion of respondents who said "business should take responsibility" to the proportion who rated "business is doing well". The difference in proportions is called the "performance gap". I glanced quickly at the first row label to discover the underlying survey addresses social issues such as environmental concerns.

Next, I looked at the first bar, trying to figure out its data encoding scheme. The bold, blue vertical line in the middle of the bar caused me to think each bar is split into left and right sections. The right section is shaded and labeled with the performance gap numbers so I focused on the segment to the left of the blue line.

My head started to hurt a little. The green number (76%) is associated with the left edge of the left section of the bar. And if the blue line represents the other number (29%), then the width of the left section should map to the performance gap. This interpretation was obviously incorrect since the right section already showed the gap, and the width of the left section was not equal to that of the right shaded section.

I jumped to the next row. My head hurt a little bit more. The only difference between the two rows is the green number being 74%, 2 percent smaller. I couldn't explain how the left sections of both bars have the same width, which confirms that the left section doesn't display the performance gap (assuming that no graphical mistakes have been made). It also appeared that the left edge of the bar was unrelated to the green number. So I retreated to square one. Let's start over. How were the data encoded in this bar chart?

I scrolled down to the next figure, which applies the same chart form to other data.

Hbr_specialbarcharts_2

I became even more confused. The first row showed labels (green number 60%, blue number 44%, performance gap -16%). This bar is much bigger than the one in the previous figure, even though 60% was less than 76%. Besides, the left section, which is bracketed by the green number on the left and the blue number on the right, appeared much wider than the 16% difference that would have been merited. I again lapsed into thinking that the left section represents performance gaps.

Then I noticed that the vertical blue lines were roughly in proportion. Soon, I realized that the total bar width (both sections) maps to the green number. Now back to the first figure. The proportion of respondents who believe business should take responsibility (green number) is encoded in the full bar. In other words, the left edges of all the bars represent 0%. Meanwhile the proportion saying business is doing well is encoded in the left section. Thus, the difference between the full width and the left-section width is both the right-section width and the performance gap.

Here is an edited version that clarifies the encoding scheme:

Hbr_specialbarcharts_2

***

That's my reading log. Howie gave me his take:

I had to interrupt my reading of the article for quite a while to puzzle this one out. It's sorted by performance gap, and I'm sure there's a better way to display that. Maybe a dot plot, similar to here - https://junkcharts.typepad.com/junk_charts/2023/12/the-efficiency-of-visual-communications.html.

A dot plot might look something like this:

Junkcharts_redo_hbr_specialcharts_2
Howie also said:

I interpret the authros' gist to be something like "Companies underperform public expectations on a wide range of social challenges" so I think I'd want to focus on the uniform direction and breadth of the performance gap more than the specifics of each line item.

And I agree.


Do you want a taste of the new hurricane cone?

The National Hurricane Center (NHC) put out a press release (link to PDF) to announce upcoming changes (in August 2024) to their "hurricane cone" map. This news was picked up by Miami Herald (link).

New_hurricane_map_2024

The above example is what the map looks like. (The data are probably fake since the new map is not yet implemented.)

The cone map has been a focus of research because experts like Alberto Cairo have been highly critical of its potential to mislead. Unfortunately, the more attention paid to it, the more complicated the map has become.

The latest version of this map comprises three layers.

The bottom layer is the so-called "cone". This is the white patch labeled below as the "potential track area (day 1-5)".  Researchers dislike this element because they say readers tend to misinterpret the cone as predicting which areas would be damaged by hurricane winds when the cone is intended to depict the uncertainty about the path of the hurricane. Prior criticism has led the NHC to add the text at the top of the chart, saying "The cone contains the probable path of the storm center but does not show the size of the storm. Hazardous conditions can occur outside of the cone."

The middle layer are the multi-colored bits. Two of these show the areas for which the NHC has issued "watches" and "warnings". All of these color categories represent wind speeds at different times. Watches and warnings are forecasts while the other colors indicate "current" wind speeds. 

The top layer consists of black dots. These provide a single forecast of the most likely position of the storm, with the S, H, M labels indicating the most likely range of wind speeds at forecast times.

***

Let's compare the new cone map to a real hurricane map from 2020. (This older map came from a prior piece also by NHC.)

Old_hurricane_map_2020

Can we spot the differences?

To my surprise, the differences were minor, in spite of the pre-announced changes.

The first difference is a simplification. Instead of dividing the white cone (the bottom layer) into two patches -- a white patch for days 1-3, and a dotted transparent patch for days 4-5, the new map aggregates the two periods. Visually, simplifying makes the map less busy but loses the implicit acknowledge found in the old map that forecasts further out are not as reliable.

The second point of departure is the addition of "inland" warnings and watches. Notice how the red and blue areas on the old map hugged the coastline while the red and blue areas on the new map reach inland.

Both changes push the bottom layer, i.e. the cone, deeper into the background. It's like a shrink-flation ice cream cone that has a tiny bit of ice cream stuffed deep in its base.

***

How might one improve the cone map? I'd start by dismantling the layers. The three layers present answers to different problems, albeit connected.

Let's begin with the hurricane forecasting problem. We have the current location of the storm, and current measurements of wind speeds around its center. As a first requirement, a forecasting model predicts the path of the storm in the near future. At any time, the storm isn't a point in space but a "cloud" around a center. The path of the storm traces how that cloud will move, including any expansion or contraction of its radius.

That's saying a lot. To start with, a forecasting model issues the predicted average path -- the expected path of the storm's center. This path is (not competently) indicated by the black dots in the top layer of the cone map. These dots offer only a sampled view of the average path.

Not surprisingly, there is quite a bit of uncertainty about the future path of any storm. Many models simulate future worlds, generating many predictions of the average paths. The envelope of the most probable set of paths is the "cone". The expanding width of the cone over time reflects the higher uncertainty of our predictions further into the future. Confusingly, this cone expansion does not depict spatial expansion of either the storm's size or the potential areas that may suffer the greatest damage. Both of those tend to shrink as hurricanes move inland.

Nevertheless, the cone and the black dots are connected. The path drawn out by the black dots should be the average path of the center of the storm.

The forecasting model also generates estimates of wind speeds. Those are given as labels inside the black dots. The cone itself offers no information about wind speeds. The map portrays the uncertainty of the position of the storm's center but omits the uncertainty of the projected wind speeds.

The middle layer of colored patches also inform readers about model projections - but in an interpreted manner. The colors portray hurricane warnings and watches for specific areas, which are based on projected wind speeds from the same forecasting models described above. The colors represent NHC's interpretation of these model outputs. Each warning or watch simultaneously uses information on location, wind speed and time. The uncertainty of the projected values is suppressed.

I think it's better to use two focused maps instead of having one that captures a bit of this and a bit of that.

One map can present the interpreted data, and show the areas that have current warnings and watches. This map is about projected wind strength in the next 1-3 days. It isn't about the center of the storm, or its projected path. Uncertainty can be added by varying the tint of the colors, reflecting the confidence of the model's prediction.

Another map can show the projected path of the center of the storm, plus the cone of uncertainty around that expected path. I'd like to bring more attention to the times of forecasting, perhaps shading the cone day by day, if the underlying model has this level of precision.

***

Back in 2019, I wrote a pretty long post about these cone maps. Well worth revisiting today!


To a new year of pleasant surprises

Happy new year!

This year promises to be the year of AI. Already last year, we pretty much couldn't lift an eyebrow without someone making an AI claim. This year will be even noisier. Visual Capitalist acknowledged this by making the noisiest map of 2023:

Visualcapitalist_01_Generative_AI_World_map sm

I kept thinking they have a geography teacher on the team, who really, really wants to give us a lesson of where each country is on the world map.

All our attention is drawn to the guiding lines and the random scatter of numbers. We have to squint to find the country names. All this noise drowns out the attempt to make sense of the data, namely, the inset of the top 10 countries in the lower left corner, and the classification of countries into five colored groups.

A small dose of editing helps. Remove most data labels except for the countries for which they have a story. Provide a data table below for those who want details.

***

In the Methodology section, the data analysts (possibly from a third party called ElectronicsHub) indicated that they used Google search volume of "over 90 of the most popular generative AI tools", calculating the "overall volume across all tools per 100k population". Then came a baffling line: "all search volumes were scaled up according to the search engine market share in each country, using figures from statscounter.com." (Note: in the following, I'm calling the data "AI-related search" for simplicity even though their measurement is restricted to the terms described above.)

It took me a while to comprehend what they could have meant by that line. I believe this is what that sentence means: Google is not the only search engine out there so by only researching Google search volume, they undercount the true search volume. How did they deal with the missing data problem? They "scaled up" so if Google is 80% of the search volume in a country, then they divide the Google volume by 80% to "scale up" to 100%.

Whenever we use heuristics like this, we should investigate its foundations. What is the implicit assumption behind this scaling-up procedure? It is that all search engines are effectively the same. The users of non-Google search engines behave exactly as the Google search engine users. If the analysts somehow could get their hands on the data of other search engines, they would discover that the proportion of search volume that is AI-related is effectively the same as seen on Google.

This is one of those convenient, and obviously wrong assumptions – if true, the market would have no need for more than one search engine. Each search engine's audience is just a random sample from the population of all users.

Let's make up some numbers. Let's say Google has 80% share of search volume in Country A, and AI-related search 10% of the overall Google search volume. The remaining search engines have 20% share. Scaling up here means taking the 8% of Google AI-related search volume, divide by 80%, which yields 10%. Since Google owns 8% of the 10%, the other search engines see 2% of overall search volume attributed to AI searches in Country A. Thus, the proportion of AI-related searches on those other search engines is 2%/20% = 10%.

Now, in certain countries, Google is not quite as dominant. Let's say Google only has 20% share of Country B's search volume. AI-related search on Google is 2%, which is 10% of its total. Using the same scaling-up procedure, the analysts have effectively assumed that the proportion of AI-related search volume in the dominant search engines in Country B to be also 10%.

I'm using the above calculations to illustrate a shortcoming of this heuristic. Using this procedure inflates the search volume in countries in which Google is less dominant because the inflation factor is the reciprocal of Google's market share. The less dominant Google is, the larger the inflation factor.

What's also true? The less dominant Google is, the smaller proportion of the total data the analysts are able to see, the lower the quality of the available information. So the heuristic is the most influential where it has the greatest uncertainty.

***

Hope your new year is full of uncertainty, and your heuristics shall lead you to pleasant surprises.

If you like the blog's content, please spread the word. I'm looking forward to sharing more content as the world of data continues to evolve at an amazing pace.

Disclosure: This blog post is not written by AI.


Several tips for visualizing matrices

Continuing my review of charts that were spammed to my inbox, today I look at the following visualization of a matrix of numbers:

Masterworks_chart9

The matrix shows pairwise correlations between the returns of 16 investment asset classes. Correlation is a number between -1 and 1. It is a symmetric scale around 0. It embeds two dimensions: the magnitude of the correlation, and its direction (positive or negative).

The correlation matrix is a special type of matrix: a bit easier to deal with as the data already come “standardized”. As with the other charts in this series, there is a good number of errors in the chart's execution.

I’ll leave the details maybe for a future post. Just check two key properties of a correlation matrix: the diagonal consisting of self-correlations should contain all 1s; and the matrix should be symmetric across that diagonal.

***

For this post, I want to cover nuances of visualizing matrices. The chart designer knows exactly what the message of the chart is - that the asset class called "art" is attractive because it has little correlation with other popular asset classes. Regardless of the chart's errors, it’s hard for the reader to find the message in the matrix shown above.

That's because the specific data carrying the message sit in the bottom row (and the rightmost column). The cells in this row (and column) has a light purple color, which has been co-opted by the even lighter gray color used for the diagonal cells. These diagonal cells pop out of the chart despite being the least informative (they have the same values for all correlation matrices!)

***

Several tactics can be deployed to push the message to the fore.

First, let's bring the key data to the prime location on the chart - this is the top row and left column (for cultures which read top to bottom, left to right).

Redo_masterwork9_matrix_arttop

For all the drafts in this post, I have dropped the text descriptions of the asset classes, and replaced them with numbers so that it's easier to follow the changes. (For those who're paying attention, I also edited the data to make the matrix symmetric.)

Second, let's look at the color choice. Here, the designer made a wise choice of restricting the number of color levels to three (dark, medium and light). I retained that decision in the above revision - actually, I used four colors but there are no values in one of the four sections, therefore, effectively, only three colors appear. But let's look at what happens when the number of color levels is increased.

Redo_masterwork9_matrix_colors

The more levels of color, the more strain it puts on our processing... with little reward.

Third, and most importantly, the order of the categories affects perception majorly. I have no idea what the designer used as the sorting criterion. In step one of the fix, I moved the art category to the front but left all the other categories in the original order.

The next chart has the asset classes organized from lowest to highest average correlation. Conveniently, using this sorting metric leaves the art category in its prime spot.

Redo_masterwork9_matrix_orderbyavg

Notice that the appearance has completely changed. The new version brings out clusters in the data much more effectively. Most of the assets in the bottom of the chart have high correlation with each other.

Finally, because the correlation matrix is symmetric across the diagonal of self-correlations, the two halves are mirror images and thus redundant. The following removes one of the mirrored halves, and also removes the diagonal, leading to a much cleaner look.

Redo_masterwork9_matrix_orderbyavg_tri

Next time you visualize a matrix, think about how you sort the rows/columns, how you choose the color scale, and whether to plot the mirrored image and the diagonal.

 

 

 


Two metrics in-fighting

The Wall Street Journal shows the following chart which pits two metrics against each other:

Wsj_salaries25to29

The primary metric is the change in median yearly salary between the two periods of time. We presume it's primary because of its presence in the chart title, and the blue bars being more readable than the green bubbles. The secondary metric is the median yearly salary in the later period.

That, I believe, was the intended design. When I saw this chart, my eyes went to the numbers inside the green bubbles. Perhaps it's because I didn't read the chart title first, and the horizontal axis wasn't labelled so it wasn't obvious what the blue bars coded.

As with most bubble charts, the data labels exist to cover up the inadequacy of circular areas. The self-sufficiency test - removing the data labels - shows this well:

Redo_wsj_salaries25to29

It's simply impossible to know what values should be in each bubble, or to perceive the relative sizes of those bubbles.

***

Reversing the order of the blue bars also helps:

Redo_wsjsalaries25to29_2

The original order is one of the more annoying features in most visualization packages. Because internally, the categories are numbered 1, 2, 3, ..., and because the convention is to have values run higher as they run up the vertical axis, these packages would place the top-ranked item at the bottom of the chart.

Most people read top to bottom, which means that they read the least important item first, and the most important item last!

In most visualization packages, it takes only 1 click or 1 action to reverse the order of the items. Please do it!

***

For change over time, I like using a Bumps chart, otherwise called a slope graph:

Redo_wsjsalaries25to29_3


An elaborate data vessel

Visualcapitalist_globaloilproductionI recently came across the following dataviz showing global oil production (link).

This is an ambitious graphic that addresses several questions of composition.

The raw data show the amount of production by country adding up to the global total. The countries are then grouped by region. Further, the graph presents an oil-and-gas specific grouping, as indicated by the legend shown just below the chart title. This grouping is indicated by the color of the circumference of the circle containing the flag of the country.

This chart form is popular in modern online graphics programs. It is like an elaborate data vessel. Because the countries are lined up around the barrel, a space has been created on three sides to admit labels and text annotations. This is a strength of this chart form.

***

The chart conveys little information about the underlying data. Each country is given a unique odd shaped polygon, making it impossible to compare sizes. It’s definitely possible to pick out U.S., Russia, Saudi Arabia as the top producers. But in presenting the ranks of the data, this chart form pales in comparison to a straightforward data table, or a bar chart. The less said about presenting values, the better.

Indeed, our self-sufficiency test exposes the inability of these polygons to convey the data. This is precisely why almost all values of the dataset are present on the chart.

***

The dataviz subtly presumes some knowledge on the part of the readers.

The regions are not directly labeled. The readers must know that Saudi Arabia is in the Middle East, U.S. is part of North America, etc. Admittedly this is not a big ask, but it is an ask.

It is also assumed that readers know their flags, especially those of smaller countries. Some of the small polygons have no space left for country names and they are labeled with just flags.

Visualcapitalist_globaloilproduction_nocountrylabels

In addition, knowing country acronyms is required for smaller countries as well. For example, in Africa, we find AGO, COG and GAB.

Visualcapitalist_globaloilproduction_countryacronyms

For this chart form the designer treats each country according to the space it has on the chart (except those countries that found themselves on the edges of the barrel). Font sizes, icons, labels, acronyms, data labels, etc. vary.

The readers are assumed to know the significance of OPEC and OPEC+. This grouping is given second fiddle, and can be found via the color of the circumference of the flag icons.

Visualcapitalist_globaloilproduction_opeclegend

I’d have not assigned a color to the non-OPEC countries, and just use the yellow and blue for OPEC and OPEC+. This is a little edit but makes the search for the edges more efficient.

Visualcapitalist_globaloilproduction_twoopeclabels

***

Let’s now return to the perception of composition.

In exactly the same manner as individual countries, the larger regions are represented by polygons that have arbitrary shapes. One can strain to compile the rank order of regions but it’s impossible to compare the relative values of production across regions. Perhaps this explains the presence of another chart at the bottom that addresses this regional comparison.

The situation is worse for the OPEC/OPEC+ grouping. Now, the readers must find all flag icons with edges of a specific color, then mentally piece together these arbitrarily shaped polygons, then realizing that they won’t fit together nicely, and so must now mentally morph the shapes in an area-preserving manner, in order to complete this puzzle.

This is why I said earlier this is an elaborate data vessel. It’s nice to look at but it doesn’t convey information about composition as readers might expect it to.

Visualcapitalist_globaloilproduction_excerpt


Partition of Europe

A long-time reader sent me the following map via twitter:

Europeelects_map

This map tells how the major political groups divide up the European Parliament. I’ll spare you the counting. There are 27 countries, and nine political groups (including the "unaffiliated").

The key chart type is a box of dots. Each country gets its own box. Each box has its own width. What determines the width? If you ask me, it’s the relative span of the countries on the map. For example, the narrow countries like Ireland and Portugal have three dots across while the wider countries like Spain, Germany and Italy have 7, 10 and 8 dots across respectively.

Each dot represents one seat in the Parliament. Each dot has one of 9 possible colors. Each color shows a political lean e.g. the green dots represent Green parties while the maroon dots display “Left” parties.

The end result is a counting game. If we are interested in counts of seats, we have to literally count each dot. If we are interested in proportion of seats, take your poison: either eyeball it or count each color and count the total.

Who does the underlying map serve? Only readers who know the map of Europe. If you don’t know where Hungary or Latvia is, good luck. The physical constraints of the map work against the small-multiples set up of the data. In a small multiples, you want each chart to be identical, except for the country-specific data. The small-multiples structure requires a panel of equal-sized cells. The map does not offer this feature, as many small countries are cramped into Eastern Europe. Also, Europe has a few tiny states e.g. Luxembourg (population 660K)  and Malta (population 520K). To overcome the map, the designer produces boxes of different sizes, substantially loading up the cognitive burden on readers.

The map also dictates where the boxes are situated. The centroids of each country form the scaffolding, with adjustments required when the charts overlap. This restriction ensures a disorderly appearance. By contrast, the regular panel layout of a small multiples facilitates comparisons.

***

Here is something I sketched using a tile map.

Eu parties print sm

First, I have to create a tile map of European countries. Some parts, e.g. western part, are straightforward. The eastern side becomes very congested.

The tile map encodes location in an imprecise sense. Think about the scaffolding of centroids of countries referred to prior. The tile map imposes an order to the madness - we're shifting these centroids so that they line up in a tidier pattern. What we gain in comparability we concede in location precision.

For the EU tile map, I decided to show the Baltic countries in a row rather than a column; the latter would have been more faithful to the true geography. Malta is shown next to Italy even though it could have been placed below. Similarly, Cyprus in relation to Greece. I also included several key countries that are not part of the EU for context.

Instead of raw seat counts, I'm showing the proportion of seats within each country claimed by each political group. I think this metric is more useful to readers.

The legend is itself a chart that shows the aggregate statistics for all 27 countries.


Tile maps on a trip

My friend Ray sent me to a recent blog about tile maps. Typical tile maps use squares or hexagons, although in theory many other shapes will do. Unsurprisingly, the field follows the latest development of math researchers who study the space packing problem. The space packing problem concerns how to pack a space with objects. The study of tesselations is to pack space with one or a few shapes.

It was an open question until recently whether there exists an "aperiodic monotile," that is to say, a single shape that can cover space in a non-repeating manner. We all know that we can use squares to cover a space, which creates the familiar grid of squares, but in that case, a pattern repeats itself all over the space.

Now, some researchers have found an elusive aperiodic monotile, which they dubbed the Einstein monotile. Below is a tesselation using these tiles:

Einsteintiles

Within this design, one cannot find a set of contiguous tiles that repeats itself.

The blogger then made a tile map using this new tesselation. Here's one:

Gravitywitheinsteintiles

It doesn't matter what this is illustrating. The blog author cites a coworker, who said: "I can think of no proper cartographic use for Penrose binning, but it’s fun to look at, and so that’s good enough for me." Penrose tiles is another mathematical invention that can be used in a tesselation. The story is still the same: there is no benefit from using these strange-looking shapes. Other than the curiosity factor.

***

Let's review the pros and cons of using tile maps.

Compare a typical choropleth map of the United States (by state) and a tile map by state. The former has the well-known problem that states with the largest areas usually have the lowest population densities, and thus, if we plot demographic data on such maps, the states that catch the most attention are the ones that don't weigh as much - by contrast, the densely populated states in New England barely show up.

The tile map removes this area bias, thus resolving this problem. Every state is represented by equal area.

While the tesselated design is frequently better, it's not always. In many data visualization, we do intend to convey the message that not all states are equal!

The grid arrangement of the state tiles also makes it easier to find regional patterns. A regional pattern is defined here as a set of neighboring states that share similar data (encoded in the color of the tiles). Note that the area of each state is of zero interest here, and thus, the accurate descriptions of relative areas found on the usual map is a distractor.

However, on the tile map, these regional patterns are conceptual. One must not read anything into the shape of the aggregated region, or its boundaries. Indeed, if we use strange-looking shapes like Einstein tiles, the boundaries are completely meaningless, and even misleading.

There also usually is some distortion of the spatial coordinates on a tile map because we'd like to pack the squares or hexagons into a lattice-like structure.

Lastly, the tile map is not scalable. We haven't seen a tile map of the U.S. by county or precinct but we have enjoyed many choropleth maps displaying county- or precinct-level data, e.g. the famous Purple Map of America. There is a reason for this.

***

Here is an old post that contains links to various other posts I've written about tile maps.


Visual story-telling: do you know or do you think?

One of the most important data questions of all time is: do you know? or do you think?

And one of the easiest traps to fall into is: I think, therefore I know.

***

Visual story-telling can be great but it can also mislead. Deception sometimes happens when readers are nudged to "fill in the blanks" with stuff they think they know, but they don't.

A Twitter reader asked me to look at the map in this Los Angeles Times (paywall) opinion column.

Latimes_lifeexpectancy_postcovid

The column promptly announces its premise:

Years of widening economic inequality, compounded by the pandemic and political storm and stress, have given Americans the impression that the country is on the wrong track. Now there’s empirical data to show just how far the country has run off the rails: Life expectancies have been falling.

The writer creates the expectation that he will reveal evidence in the form of data to show that life expectancies have been driven down by economic inequality, pandemic, and politics. Does he succeed?

***

The map portrays average life expectancy (at birth) for some mysterious, presumably very recent, year for every county in the United States. From the color legend, we learn that the bottom-to-top range is about 20 years. There is a clear spatial pattern, with the worst results in the south (excepting south Florida).

The choice of colors is telling. Red and blue on a U.S. map has heavy baggage, as they signify the two main political parties in the country. Given that the author believes politics to be a key driver of health outcomes, the usage of red and blue here is deliberate. Throughout the article, the columnist connects the lower life expectancies in southern states to its politics.

For example, he said "these geographical disparities aren't artifacts of pure geography or demographics; they're the consequences of policy decisions at the state level... Of the 20 states with the worst life expectancies, eight are among the 12 that have not implemented Medicaid expansion under the Affordable Care Act..."

Casual readers may fall into a trap here. There is nothing on the map itself that draws the connection between politics and life expectancies; the idea is evoked purely through the red-blue color scheme. So, as readers, we are filling in the blanks with our own politics.

What could have been done instead? Let's look at the life expectancy map side by side with the map of the U.S. 2020 Presidential election.

Junkcharts_lifeexpectancy_elections

Because of how close recent elections have been, we may think the political map has a nice balance of red and blue but it isn't. The Democrats' votes are heavily concentrated in densely-populated cities so most of the Presidential election map is red. When placed next to each other, it's obvious that politics don't explain the variance in life expectancy well. The Midwest is deep red and yet they have above average life expectancies. I have circled out various regions that contradict the claim that Republican politics drove life expectancies down.

It's not sufficient to point to the South, in which Republican votes and life expectancy are indeed inversely correlated. A good theory has to explain most of the country.

***

The columnist also suggests that poverty is the cause of low life expectancy. That too cannot be gleaned from the published map. Again, readers are nudged to use their wild imagination to fill in the blank.

Data come to the rescue. Here is a side-by-side comparison of the map of life expectancies and the map of median incomes.

Junkcharts_lifeexpectancy_income

A similar conundrum. While the story feels right in the South, it fails to explain the northwest, Florida, and various other parts of the country. Take a look again at the circled areas. Lower income brackets are also sometimes associated with high life expectancies.

***

The author supplies a third cause of lower life expectancies: Covid-19 response. Because Covid-19 was the "most obvious and convenient" explanation for the loss of life expectancy during the pandemic, this theory suggests that the red areas on the life expectancy map should correspond to the regions most ravaged by Covid-19.

Let's see the data.

Junkcharts_lifeexpectancy_covidcases

The map on the right shows the number of confirmed cases until June 2021. As before, the correlation holds somewhat in the South but there are notable exceptions, e.g. the Midwest. We also have states with low Covid-19 cases but below-average life expectancy.

***

What caused the decline of life expectancy in the U.S. - which began before the pandemic, and has continued beyond - is highly complex, beyond what a single map or a pair of maps or a few pairs of maps could convey. Showing a red-blue map presents a trap for readers to fall into, in which they start thinking, without knowing.

 


Parsons Student Projects

I had the pleasure of attending the final presentations of this year's graduates from Parsons's MS in Data Visualization program. You can see the projects here.

***

A few of the projects caught my eye.

A project called "Authentic Food in NYC" explores where to find "authentic" cuisine in New York restaurants. The project is notable for plowing through millions of Yelp reviews, and organizing the information within. Reviews mentioning "authentic" or "original" were extracted.

During the live presentation, the student clicked on Authentic Chinese, and the name that popped up was Nom Wah Tea Parlor, which serves dim sum in Chinatown that often has lines out the door.

Shuyaoxiao_authenticfood_parsons

Curiously, the ranking is created from raw counts of authentic reviews, which favors restaurants with more reviews, such as restaurants that have been operating for a longer time. It's unclear what rule is used to transfer authenticity from reviews to restaurants: does a single review mentioning "authentic" qualify a restaurant as "authentic", or some proportion of reviews?

Later, we see a visualization of the key words found inside "authentic" reviews for each cuisine. Below are words for Chinese and Italian cuisines:

Shuyaoxiao_authenticcuisines_parsons_words

These are word clouds with a twist. Instead of encoding the word counts in the font sizes, she places each word inside a bubble, and uses bubble sizes to indicate relative frequency.

Curiously, almost all the words displayed come from menu items. There isn't any subjective words to be found. Algorithms that extract keywords frequently fail in the sense that they surface the most obvious, uninteresting facts. Take the word cloud for Taiwanese restaurants as an example:

Shuyaoxiao_authenticcuisines_parsons_taiwan

The overwhelming keyword found among reviews of Taiwanese restaurants is... "taiwanese". The next most important word is "taiwan". Among the remaining words, "886" is the name of a specific restaurant, "bento" is usually associated with Japanese cuisine, and everything else is a menu item.

Getting this right is time-consuming, and understandably not a requirement for a typical data visualization course.

The most interesting insight is found in this data table.

Shuyaoxiao_authenticcuisines_ratios

It appears that few reviewers care about authenticity when they go to French, Italian, and Japanese restaurants but the people who dine at various Asian restaurants, German restaurants, and Eastern European restaurants want "authentic" food. The student concludes: "since most Yelp reviewers are Americans, their pursuit of authenticity creates its own trap: Food authenticity becomes an americanized view of what non-American food is."

This hits home hard because I know what authentic dim sum is, and Nom Wah Tea Parlor it ain't. Let me check out what Yelpers are saying about Nom Wah:

  1. Everything was so authentic and delicious - and cheap!!!
  2. Your best bet is to go around the corner and find something more authentic.
  3. Their dumplings are amazing everything is very authentic and tasty!
  4. The food was delicious and so authentic, and the staff were helpful and efficient.
  5. Overall, this place has good authentic dim sum but it could be better.
  6. Not an authentic experience at all.
  7. this dim sum establishment is totally authentic
  8. The onions, bean sprouts and scallion did taste very authentic and appreciated that.
  9. I would skip this and try another spot less hyped and more authentic.
  10. I would have to take my parents here the next time I visit NYC because this is authentic dim sum.

These are the most recent ten reviews containing the word "authentic". Seven out of ten really do mean authentic, the other three are false friends. Text mining is tough business! The student removed "not authentic" which helps. As seen from above, "more authentic" may be negative, and there may be words between "not" and "authentic". Also, think "not inauthentic", "people say it's authentic, and it's not", etc.

One thing I learned from this project is that "authentic" may be a synonym for "I like it" when these diners enjoy the food at an ethnic restaurant. I'm most curious about what inauthentic onions, bean sprouts and scallion taste like.

I love the concept and execution of this project. Nice job!

***

Another project I like is about tourism in Venezuela. The back story is significant. Since a dictatorship took over the country, the government stopped reporting tourism statistics. It's known that tourism collapsed, and that it may be gradually coming back in recent years.

This student does not have access to ready-made datasets. But she imaginatively found data to pursue this story. Specifically, she mentioned grabbing flight schedules into the country from the outside.

The flow chart is a great way to explore this data:

Ibonnet_parsons_dataviz_flightcities

A map gives a different perspective:

Ibonnet_parsons_dataviz_flightmap

I'm glad to hear the student recite some of the limitations of the data. It's easy to look at these visuals and assume that the data are entirely reliable. They aren't. We don't know that what proportion of the people traveling on those flights are tourists, how full those planes are, or the nationalities of those on board. The fact that a flight originated from Panama does not mean that everyone on board is Panamanian.

***

The third project is interesting in its uniqueness. This student wants to highlight the effect of lead in paint on children's health. She used the weight of lead marbles to symbolize the impact of lead paint. She made a dress with two big pockets to hold these marbles.

Scherer_parsons_dataviz_leaddress sm

It's not your standard visualization. One can quibble that dividing the marbles into two pockets doesn't serve a visualziation purpose, and so on. But at the end, it's a memorable performance.