Lost in the middle class

Washington Post asks people what it means to be middle class in the U.S. (link; paywall)

The following graphic illustrates one type of definition, purely based on income ranges.

Wpost_middleclass

For me, this chart is more taxing to read than it appears.

It can be read column by column. Each column represents a hypotheticial annual income for a family of four. People are asked whether they consider that family lower/working class, middle class or upper class. Be careful as the increments from column to column are not uniform.

Now, what's the question again? We're primarily interested in what incomes constitute middle class.

So, we should be looking at the deep green blocks that hang in the middle of each column. It's not easy to read the proportion of middle blocks in a stacked column chart.

***

I tried separating out the three perceived income classes, using a small-multiples design.

Junkcharts_redo_wpost_middleclass

One can more directly see what income ranges are most popularly perceived as being in each income class.

***

The article also goes into alternative definitions of middle class, using more qualitative metrics, such as "able to pay all bills on time without worry". That's a whole other post.

 


Stranger things found on scatter plots

Washington Post published a nice scatter plot which deconstructs scores from the recent World Championships in Gymnastics. (link)

Wpost_simonebiles

The chart presents the main message clearly - the winner Simone Biles scored the highest on both components of the score (difficulty and execution), by quite some margin.

What else can we learn from this chart?

***

Every athlete who qualified for the final scored at or above average on both components.

Scoring below average on either component is a death knell: no athlete scored enough on the other component to compensate. (The top left and bottom right quadrants would have had some yellow dots otherwise.)

Several athletes in the top right quadrant presumably scored enough to qualify but didn't. The footnote likely explains it: each country can send at most two athletes to the final. It may be useful to mark out these "unlucky" athletes using a third color.

Curiously, it's not easy to figure out who these unlucky athletes were from this chart alone. We need two pieces of data: the minimum qualifying score, and the total score for each athlete. The scatter plot isn't the best chart form to show totals, but qualification to the final is based on the sum of the difficulty and execution scores. (Note also, neither axis starts at zero, compounding the challenge.)

***

This scatter plot is most memorable for shattering one of my expectations about risk and reward in sports.

I expect risk-seeking athletes to suffer from higher variance in performance. The tennis player who goes for big serves tend to also commit more double faults. The sluggers who hit home runs tend to strike out more often. Similarly, I expect gymnasts who attempt more difficult skills to receive lower execution scores.

Indeed, the headline writer seemed to agree, suggesting that Biles is special because she's both high in difficulty and strong in execution.

The scatter plot, however, sends the opposite message - this should not surprise. The entire field shows a curiously strong positive correlation between difficulty and execution scores. The more difficult is the routine, the higher the excution score!

It's hard to explain such a pattern. My guesses are:

a) judges reward difficult routines, and subconsciously confound execution and difficulty scores. They use separate judges for excecution and difficulty. Paradoxically, this arrangement may have caused separation anxiety - the judges for execution might just feel the urge to reward high difficulty.

b) those athletes who are skilled enough to attempt more difficult routines are also those who are more consistent in execution. This is a type of self-selection bias frequently found in observational data.

Regardless of the reasons for the strong correlation, the chart shows that these two components of the total score are not independent, i.e. the metrics have significant overlap in what they measure. Thus, one cannot really talk about a difficult routine without also noting that it's a well-executed routine, and vice versa. In an ideal scoring design, we'd like to have independent components.


Graphics that stretch stomachs and make merry

Washington Post has a fun article about the Hot Dog Eating Contest in Coney Island here.

This graphic shows various interesting insights about the annual competition:

Washingtonpost_hotdogeating_scatter

Joey Chestnut is the recent king of hot-dog eating. Since the late 2000s, he's dominated the competition. He typically chows down over 60 hot dogs in 10 minutes. This is shown by the yellow line. Even at that high level, Chestnut has shown steady growth over time.

The legend tells us that the chart shows the results of all the other competitors. It's pretty clear that few have been able to even get close to Chestnut all these years. Most contestants were able to swallow 30 hot dogs or fewer.

It doesn't appear that the general standard has increased over time.

In 2011, a separate competition for women started. There is also a female champion (Miki Sudo) who has won almost every competition since she started playing.

One strange feature is the lack of competition in the early years. The footnote informs us that the trend is not real - they simply did not keep records of other competitors in early contests.

The only question I can't answer from this chart is the general standard and number of female competitors. The chart designer chooses not to differentiate between male and female contestants, other than the champions. I can understand that. Adding another dimension to the chart is a double-edged sword.

***

There is even more fun. There is a little video illustrating theories about what kind of human bodies can take in that many hot dogs in a short time. Here is a screen shot of it:

Washingtonpost_hotdogeating_body

 

 


A graphical compass

A Twitter user pointed me to this article from Washington Post, ruminating about the correlation between gas prices and measures of political sentiment (such as Biden's approval rating or right-track-wrong-track). As common in this genre, the analyst proclaims that he has found something "counter intuitive".

The declarative statement strikes me as odd. In the first two paragraphs, he said the data showed "as gas prices fell, American optimism rose. As prices rose, optimism fell... This seems counterintuitive."

I'm struggling to see what's counterintuitive. Aren't the data suggesting people like lower prices? Is that not what we think people like?

The centerpiece of the article concerns the correlation between metrics. "If two numbers move in concert, they can be depicted literally moving in concert. One goes up, the other moves either up or down consistently." That's a confused statement and he qualifies it by typing "That sort of thing."

He's reacting to the following scatter plot with lines. The Twitter user presumably found it hard to understand. Count me in.

Washingtonpost_gasprices

Why is this chart difficult to grasp?

The biggest puzzle is: what differentiates those two lines? The red and the gray lines are not labelled. One would have to consult the article to learn that the gray line represents the "raw" data at weekly intervals. The red line is aggregated data at monthly intervals. In other words, each red dot is an average of 4 or 5 weekly data points. The red line is just a smoothed version of the gray line. Smoothed lines show the time trend better.

The next missing piece is the direction of time, which can only be inferred by reading the month labels on the red line. But the chart without the direction of time is like a map without a compass. Take this segment for example:

Wpost_gaspricesapproval_directionoftime

If time is running up to down, then approval ratings are increasing over time while gas prices are decreasing. If time is running down to up, then approval ratings are decreasing over time while gas prices are increasing. Exactly the opposite!

The labels on the red line are not sufficient. It's possible that time runs in the opposite direction on the gray line! We only exclude that possibility if we know that the red line is a smoothed version of the gray line.

This type of chart benefits from having a compass. Here's one:

Wpost_gaspricesapproval_compass

It's useful for readers to know that the southeast direction is "good" (higher approval ratings, lower gas prices) while the northwest direction is "bad". Going back to the original chart, one can see that the metrics went in the "bad" direction at the start of the year and has reverted to a "good" direction since.

***

What does this chart really say? The author remarked that "correlation is not causation". "Just because Biden’s approval rose as prices dropped doesn’t mean prices caused the drop."

Here's an alternative: People have general sentiments. When they feel good, they respond more positively to polls, as in they rate everything more positively. The approval ratings are at least partially driven by this general sentiment. The same author apparently has another article saying that the right-track-wrong-track sentiment also moved in tandem with gas prices.

One issue with this type of scatter plot is that it always cues readers to make an incorrect assumption: that the outcome variables (approval rating) is solely - or predominantly - driven by the one factor being visualized (gas prices). This visual choice completely biases the reader's perception.

P.S. [11-11-22] The source of the submission was incorrectly attributed.


Where have the graduates gone?

Someone submitted this chart on Twitter as an example of good dataviz.

Washingtonpost_aftercollege

The chart shows the surprising leverage colleges have on where students live after graduation.

The primary virtue of this chart is conservation of space. If our main line of inquiry is the destination states of college graduations - by state, then it's hard to beat this chart's efficiency at delivering this information. For each state, it's easy to see what proportion of graduates leave the state after graduation, and then within those who leave, the reader can learn which are the most popular destination states, and their relative importance.

The colors link the most popular destination states (e.g. Texas in orange) but they are not enough because the designer uses state labels also. A next set of states are labeled without being differentiated by color. In particular, New York and Massachusetts share shades of blue, which also is the dominant color on the left side.

***

The following is a draft of a concept I have in my head.

Junkcharts_redo_washpost_postgraddestinations_1

I imagine this to be a tile map. The underlying data are not public so I just copied down a bunch of interesting states. This view brings out the spatial information, as we expect graduates are moving to neighboring states (or the states with big cities).

The students in the Western states are more likely to stay in their own state, and if they move, they stay in the West Coast. The graduates in the Eastern states also tend to stay nearby, except for California.

I decided to use groups of color - blue for East, green for South, red for West. Color is a powerful device, if used well. If the reader wants to know which states send graduates to New York, I'm hoping the reader will see the chart this way:

Junkcharts_redo_washpost_postgraddestinations_2

 


Dataviz is good at comparisons if we make the right comparisons

In an article about gas prices around the world, the Washington Post uses the following bar chart (link):

Wpost_gasprices_highincome

There are a few wrinkles in this one compared to the most generic bar chart one can produce:

Redo_wpost_gasprices_0

(The numbers on my chart are not the same as Washington Post's. That's because the data vendor charges for data, except for the most recent week. So, my data is from a different week.)

_trifectacheckup_imageThe gas prices are not expressed in dollars but a transformation turns prices into a cost-effectiveness metric: miles per dollar, or more precisely, miles per $40 dollars of gas. The metric has a reverse direction - the higher the price, the lower the miles. The data transformation belongs to the D corner of the Trifecta Checkup framework (link). Depending on how one poses the Q(uestion) of the chart, the shift from dollars to miles can bring the Q and the D in sync.

In the V(isual) corner, the designer embellishes the bars. A car icon is placed at the tip of each bar while the bar itself is turned into a wavy path, symbolizing a dirt path. The driving metaphor is in full play. In fact, the video makes the most out of it. There is no doubt that the embellishment has turned a mere scientific presentation into a form of entertainment.

***

Did the embellishment harm visual clarity? For the most part, no.

The worst it can get is when they compared U.S. and India/South Africa:

Redo_wpost_gasprices_indiasouthafrica

The left column shows the original charts from the article. In  both charts, the two cars are so close together that it is impossible to learn the scale of the difference. The amount of difference is a fraction of the width of a car icon.

The right column shows the "self-sufficiency test". Imagine the data labels are not on the chart. What we learn is that if we wanted to know how big of a gap is between the two countries, when reading the charts on the left, we are relying on the data labels, not the visual elements. On the right side, if we really want to learn the gaps, we have to look through the car icons to find the tips of the bars!

This discussion does not necessarily doom the appealing chart. If the message one wants to send with the India/South Afrcia charts is that there is negligible difference between them, then it is not crucial to present the precise differences in prices.

***

The real problem with this dataviz is in the D corner. Comparing countries is hard.

As shown above, by the miles per $40 spend metric, U.S. and India are rated essentially the same. So is the average American and the average Indian suffering equally?

Far from it. The clue comes from the aggregate chart, in which countries are divided into three tiers: high income, upper middle income and lower middle income. The U.S. belongs to the high-income tier while India falls into the lower-middle-income tier.

The cost of living in India is much lower than in the US. Forty dollars is a much bigger chunk of an Indian paycheck than an American one.

To adjust for cost of living, economists use a PPP (purchasing power parity) value. The following chart shows the difference:

Redo_wpost_gasprices_1

The right graph contains cost-of-living adjustments. It shows a completely different picture. Nominally (left chart), the price of gas in about the same in dollar terms between U.S. and India. In terms of cost of living, gas is actually 5 times more expensive in India. Thus, the adjusted miles per $40 gas number is much smaller for India than the unadjusted. (Because PPP is relative to U.S. prices, the U.S. numbers are not affected.)

PPP is not the end-all here. According to the Economic Times (India), only 22 out of 1,000 Indians own cars, compared to 980 out of 1,000 Americans. Think about the implication of using any statistic that averages the entire population!

***

Why is gas more expensive in California than the U.S. average? The talking point I keep hearing is environmental regulations. Gas prices may be higher in Europe for a similar reason. Residents in those places may be willing to pay higher prices because they get satisfaction from playing their part in preserving the planet for future generations.

The footnote discloses this not-trivial issue.

Wpost_gasprices_footnote

When converting from dollars per gallon/liter into miles per $40, we need data on miles per gallon/liter. Americans notoriously drive cars (trucks, SUVs, etc.) that have much lower mileage than those driven by other countries. However, this factor is artificially removed by assuming the same car with 32 mpg on all countries. A quick hop to the BTS website tells us that the average mpg of American cars is a third of that assumption. [See note below.]

Ignoring cross-country comparisons for the time being, the true number for U.S. is not 247 miles per $40 spent on gas as claimed. It is a third of that value: 82 miles per $40 spent.

It's tough to find data on fuel economy of all passenger cars, not just new passenger cars. I found Australia's number, which is 21 mpg. So this brings the miles per $40 number down from about 230 to 115. These are not small adjustments.

Washington Post's analysis paints a simplistic picture that presupposes that price is the only thing people care about. I call this issue xyopia. It's when the analyst frames the problem as factor x explaining outcome y, and when factor x is not the only, and frequently not even the most important, factor affecting y.

More on xyopia.

More discussion of Washington Post graphics.

 

[P.S. 7-25-2022. Reader Cody Curtis pointed out in the comments that the Bureau of Transportation Statistics report was using km/liter as units, not miles per gallon. The 10 km/liter number for average cars is roughly 23 mpg. I'll leave the text as is in the post as the larger point is valid: that there is variation in average fuel economy between nations - partly due to environemental regulation and consumer behavior - and thus, a proper comparison requires adjusting for this factor.]


The time has arrived for cumulative charts

Long-time reader Scott S. asked me about this Washington Post chart that shows the disappearance of pediatric flu deaths in the U.S. this season:

Washingtonpost_pediatricfludeaths

The dataset behind this chart is highly favorable to the designer, because the signal in the data is so strong. This is a good chart. The key point is shown clearly right at the top, with an informative title. Gridlines are very restrained. I'd draw attention to the horizontal axis. The master stroke here is omitting the week labels, which are likely confusing to all but the people familiar with this dataset.

Scott suggested using a line chart. I agree. And especially if we plot cumulative counts, rather than weekly deaths. Here's a quick sketch of such a chart:

Junkcharts_redo_wppedflu_panel

(On second thought, I'd remove the week numbers from the horizontal axis, and just go with the month labels. The Washington Post designer is right in realizing that those week numbers are meaningless to most readers.)

The vaccine trials have brought this cumulative count chart form to the mainstream. For anyone who have seen the vaccine efficacy charts, the interpretation of the panel of line charts should come naturally.

Instead of four plots, I prefer one plot with four superimposed lines. Like this:

Junkcharts_redo_wppeddeaths_superpose2

 

 

 


Water stress served two ways

Via Alberto Cairo (whose new book How Charts Lie can be pre-ordered!), I found the Water Stress data visualization by the Washington Post. (link)

The main interest here is how they visualized the different levels of water stress across the U.S. Water stress is some metric defined by the Water Resources Institute that, to my mind, measures the demand versus supply of water. The higher the water stress, the higher the risk of experiencing droughts.

There are two ways in which the water stress data are shown: the first is a map, and the second is a bubble plot.

Wp_waterstress

This project provides a great setting to compare and contrast these chart forms.

How Data are Coded

In a map, the data are usually coded as colors. Sometimes, additional details can be coded as shades, or moire patterns within the colors. But the map form locks down a number of useful dimensions - including x and y location, size and shape. The outline map reserves all these dimensions, rendering them unavailable to encode data.

By contrast, the bubble plot admits a good number of dimensions. The key ones are the x- and y- location. Then, you can also encode data in the size of the dots, the shape, and the color of the dots.

In our map example, the colors encode the water stress level, and a moire pattern encodes "arid areas". For the scatter plot, x = daily water use, y = water stress level, grouped by magnitude, color = water stress level, size = population. (Shape is constant.)

Spatial Correlation

The map is far superior in displaying spatial correlation. It's visually obvious that the southwestern states experience higher stress levels.

This spatial knowledge is relinquished when using a bubble plot. The designer relies on the knowledge of the U.S. map in the head of the readers. It is possible to code this into one of the available dimensions, e.g. one could make x = U.S. regions, but another variable is sacrificed.

Non-contiguous Spatial Patterns

When spatial patterns are contiguous, the map functions well. Sometimes, spatial patterns are disjoint. In that case, the bubble plot, which de-emphasizes the physcial locations, can be superior. In our example, the vertical axis divides the states into five groups based on their water stress levels. Try figuring out which states are "medium to high" water stress from the map, and you'll see the difference.

Finer Geographies

The map handles finer geographical units like counties and precincts better. It's completely natural.

In the bubble plot, shifting to finer units causes the number of dots to explode. This clutters up the chart. Besides, while most (we hope) Americans know the 50 states, most of us can't recite counties or precincts. Thus, the designer can't rely on knowledge in our heads. It would be impossible to learn spatial patterns from such a chart.

***

The key, as always, is to nail down your message, then select the right chart form.

 

 


Transforming the data to fit the message

A short time ago, there were reports that some theme-park goers were not happy about the latest price hike by Disney. One of these report, from the Washington Post (link), showed a chart that was intended to convey how much Disney park prices have outpaced inflation. Here is the chart:

Wapo_magickingdom_price_changes

I had a lot of trouble processing this chart. The two lines are labeled "original price" and "in 2014 dollars". The lines show a gap back in the 1970s, which completely closes up by 2014. This gives the reader an impression that the problem has melted away - which is the opposite of the designer intended.

The economic concept being marshalled here is the time value of money, or inflation. The idea is that $3.50 in 1971 is equivalent to a much higher ticket price in "2014 dollars" because by virtue of inflation, putting that $3.50 in the bank in 1971 and holding till 2014 would make that sum "nominally" higher. In fact, according to the chart, the $3.50 would have become $20.46, an approx. 7-fold increase.

The gap thus represents the inflation factor. The gap melting away is a result of passing of time. The closer one is to the present, the less the effect of cumulative inflation. The story being visualized is that Disney prices are increasing quickly whether or not one takes inflation into account. Further, if inflation were to be considered, the rate of increase is lower (red line).

What about the alternative story - Disney's price increases are often much higher than inflation? We can take the nominal price increase, and divide it into two parts, one due to inflation (of the prior-period price), and the other in excess of inflation, which we will interpret as a Disney premium.

The following chart then illustrates this point of view:

Redo_disneypricehikes

Most increases are small, and stay close to the inflation rate. But once in a while, and especially in 2010s, the price increases have outpaced inflation by a lot.

Note: since 2013, Disney has introduced price tiers, starting with two and currently at four levels. In the above chart, I took the average of the available prices, making the assumption that all four levels are equally popular. The last number looks like a price decrease because there is a new tier called "Low". The data came from AllEars.net.


A gem among the snowpack of Olympics data journalism

It's not often I come across a piece of data journalism that pleases me so much. Here it is, the "Happy 700" article by Washington Post is amazing.

Wpost_happy700_map2

 

When data journalism and dataviz are done right, the designers have made good decisions. Here are some of the key elements that make this article work:

(1) Unique

The topic is timely but timeliness heightens both the demand and supply of articles, which means only the unique and relevant pieces get the readers' attention.

(2) Fun

The tone is light-hearted. It's a fun read. A little bit informative - when they describe the towns that few have heard of. The notion is slightly silly but the reader won't care.

(3) Data

It's always a challenge to make data come alive, and these authors succeeded. Most of the data work involves finding, collecting and processing the data. There isn't any sophisticated analysis. But a powerful demonstration that complex analysis is not always necessary.

(4) Organization

The structure of the data is three criteria (elevation, population, and terrain) by cities. A typical way of showing such data might be an annotated table, or a Bumps-type chart, grouped columns, and so on. All these formats try to stuff the entire dataset onto one chart. The designers chose to highlight one variable at a time, cumulatively, on three separate maps. This presentation fits perfectly with the flow of the writing. 

(5) Details

The execution involves some smart choices. I am a big fan of legend/axis labels that are informative, for example, note that the legend doesn't say "Elevation in Meters":

Wpost_happy700_legend

The color scheme across all three maps shows a keen awareness of background/foreground concerns.