## Aligning V and Q by way of D

##### Apr 08, 2024

In the Trifecta Checkup (link), there is a green arrow between the Q (question) and V (visual) corners, indicating that they should align. This post illustrates what I mean by that.

I saw the following chart in a Washington Post article comparing dairy milk and plant-based "milks".

The article contains a whole series of charts. The one shown here focuses on vitamins.

The red color screams at the reader. At first, it appears to suggest that dairy milk is a standout on all four categories of vitamins. But that's not what the data say.

Let's take a look at the chart form: it's a grid of four plots, each containing one square for each of four types of "milk". The data are encoded in the areas of the squares. The red and green colors represent category labels and do not reflect data values.

Whenever we make bubble plots (the closest relative of these square plots), we have to solve a scale problem. What is the relationship between the scales of the four plots?

I noticed the largest square is the same size across all four plots. So, the size of each square is made relative to the maximum value in each plot, which is assigned a fixed size. In effect, the data encoding scheme is that the areas of the squares show the index values relative to the group maximum of each vitamin category. So, soy milk has 72% as much potassium as dairy milk while oat and almond milks have roughly 45% as much as dairy.

The same encoding scheme is applied also to riboflavin. Oat milk has the most riboflavin, so its square is the largest. Soy milk is 80% of oat, while dairy has 60% of oat.

***

Let's step back to the Trifecta Checkup (link). What's the question being asked in this chart? We're interested in the amount of vitamins found in plant-based milk relative to dairy milk. We're less interested in which type of "milk" has the highest amount of a particular vitamin.

Thus, I'd prefer the indexing tied to the amount found in dairy milk, rather than the maximum value in each category. The following set of column charts show this encoding:

I changed the color coding so that blue columns represent higher amounts than dairy while yellow represent lower.

From the column chart, we find that plant-based "milks" contain significantly less potassium and phosphorus than dairy milk while oat and soy "milks" contain more riboflavin than dairy. Almond "milk" has negligible amounts of riboflavin and phosphorus. There is vritually no difference between the four "milk" types in providing vitamin D.

***

In the above redo, I strengthen the alignment of the Q and V corners. This is accomplished by making a stop at the D corner: I change how the raw data are transformed into index values.

Just for comparison, if I only change the indexing strategy but retain the square plot chart form, the revised chart looks like this:

The four squares showing dairy on this version have the same size. Readers can evaluate the relative sizes of the other "milk" types.

## One bubble is a tragedy, and a bag of bubbles is...

##### Jul 19, 2023

From Kathleen Tyson's twitter account, I came across a graphic showing the destinations of Ukraine's grain exports since 2022 under the auspices of a UN deal. This graphic, made by AFP, uses one of the chart forms that baffle me - the bag of bubbles.

The first trouble with a bag of bubbles is the single bubble. The human brain is just not fit for comparing bubble sizes. The self-sufficiency test is my favorite device for demonstrating this weakness. The following is the European section of the above chart, with the data labels removed.

How much bigger is Spain than the Netherlands? What's the difference between Italy and the Netherlands? The answers don't come easily to mind. (The Netherlands is about 40% the size of Spain, and Italy is about 20% larger than the Netherlands.)

While comparing relative circular areas is a struggle, figuring out the relative ranks is not. Sure, it gets tougher with small differences (Germany vs S. Korea, Belgium vs Portugal) but saying those pairs are tied isn't a tragedy.

***

Another issue with bubble charts is how difficult it is to assess absolute values. A circle on its own has no reference point. The designer needs to add data labels or a legend. Adding data labels is an act of giving up. The data labels become the primary instrument for communicating the data, not the visual construct. Adding one data label is not enough, as the following shows:

Being told that Spain's value is 4.1 does little to help estimate the values for the non-labelled bubbles.

The chart does come with the following legend:

For this legend to work, the sample bubble sizes should span the range of the data. Notice that it's difficult to extrapolate from the size of the 1-million-ton bubble to 2-million, 4-million, etc. The analogy is a column chart in which the vertical axis does not extend through the full range of the dataset.

The designer totally gets this. The chart therefore contains both selected data labels and the partial legend. Every bubble larger than 1 million tons has an explicit data label. That's one solution for the above problem.

Nevertheless, why not use another chart form that avoids these problems altogether?

***

In Tyson's tweet, she showed another chart that pretty much contains the same information, this one from TASS.

This chart uses the flow diagram concept - in an abstract way, as I explained in previous post.

This chart form imposes structure on the data. The relative ranks of the countries within each region are listed from top to bottom. The relative amounts of grains are shown in black columns (and also in the thickness of the flows).

The aggregate value of movements within each region is called out in that middle section. It is impossible to learn this from the bag of bubbles version.

The designer did print the entire dataset onto this chart (except for the smallest countries grouped together as "other"). This decision takes away from the power of the underlying flow chart. Instead of thinking about the proportional representation of each country within its respective region, or the distribution of grains among regions, our eyes hone in on the data labels.

This brings me back to the principle of self-sufficiency: if we expect readers to consume the data labels - which comprise the entire dataset, why not just print a data table? If we decide to visualize, make the visual elements count!

## Parsons Student Projects

##### May 19, 2023

I had the pleasure of attending the final presentations of this year's graduates from Parsons's MS in Data Visualization program. You can see the projects here.

***

A few of the projects caught my eye.

A project called "Authentic Food in NYC" explores where to find "authentic" cuisine in New York restaurants. The project is notable for plowing through millions of Yelp reviews, and organizing the information within. Reviews mentioning "authentic" or "original" were extracted.

During the live presentation, the student clicked on Authentic Chinese, and the name that popped up was Nom Wah Tea Parlor, which serves dim sum in Chinatown that often has lines out the door.

Curiously, the ranking is created from raw counts of authentic reviews, which favors restaurants with more reviews, such as restaurants that have been operating for a longer time. It's unclear what rule is used to transfer authenticity from reviews to restaurants: does a single review mentioning "authentic" qualify a restaurant as "authentic", or some proportion of reviews?

Later, we see a visualization of the key words found inside "authentic" reviews for each cuisine. Below are words for Chinese and Italian cuisines:

These are word clouds with a twist. Instead of encoding the word counts in the font sizes, she places each word inside a bubble, and uses bubble sizes to indicate relative frequency.

Curiously, almost all the words displayed come from menu items. There isn't any subjective words to be found. Algorithms that extract keywords frequently fail in the sense that they surface the most obvious, uninteresting facts. Take the word cloud for Taiwanese restaurants as an example:

The overwhelming keyword found among reviews of Taiwanese restaurants is... "taiwanese". The next most important word is "taiwan". Among the remaining words, "886" is the name of a specific restaurant, "bento" is usually associated with Japanese cuisine, and everything else is a menu item.

Getting this right is time-consuming, and understandably not a requirement for a typical data visualization course.

The most interesting insight is found in this data table.

It appears that few reviewers care about authenticity when they go to French, Italian, and Japanese restaurants but the people who dine at various Asian restaurants, German restaurants, and Eastern European restaurants want "authentic" food. The student concludes: "since most Yelp reviewers are Americans, their pursuit of authenticity creates its own trap: Food authenticity becomes an americanized view of what non-American food is."

This hits home hard because I know what authentic dim sum is, and Nom Wah Tea Parlor it ain't. Let me check out what Yelpers are saying about Nom Wah:

1. Everything was so authentic and delicious - and cheap!!!
2. Your best bet is to go around the corner and find something more authentic.
3. Their dumplings are amazing everything is very authentic and tasty!
4. The food was delicious and so authentic, and the staff were helpful and efficient.
5. Overall, this place has good authentic dim sum but it could be better.
6. Not an authentic experience at all.
7. this dim sum establishment is totally authentic
8. The onions, bean sprouts and scallion did taste very authentic and appreciated that.
9. I would skip this and try another spot less hyped and more authentic.
10. I would have to take my parents here the next time I visit NYC because this is authentic dim sum.

These are the most recent ten reviews containing the word "authentic". Seven out of ten really do mean authentic, the other three are false friends. Text mining is tough business! The student removed "not authentic" which helps. As seen from above, "more authentic" may be negative, and there may be words between "not" and "authentic". Also, think "not inauthentic", "people say it's authentic, and it's not", etc.

One thing I learned from this project is that "authentic" may be a synonym for "I like it" when these diners enjoy the food at an ethnic restaurant. I'm most curious about what inauthentic onions, bean sprouts and scallion taste like.

I love the concept and execution of this project. Nice job!

***

Another project I like is about tourism in Venezuela. The back story is significant. Since a dictatorship took over the country, the government stopped reporting tourism statistics. It's known that tourism collapsed, and that it may be gradually coming back in recent years.

This student does not have access to ready-made datasets. But she imaginatively found data to pursue this story. Specifically, she mentioned grabbing flight schedules into the country from the outside.

The flow chart is a great way to explore this data:

A map gives a different perspective:

I'm glad to hear the student recite some of the limitations of the data. It's easy to look at these visuals and assume that the data are entirely reliable. They aren't. We don't know that what proportion of the people traveling on those flights are tourists, how full those planes are, or the nationalities of those on board. The fact that a flight originated from Panama does not mean that everyone on board is Panamanian.

***

The third project is interesting in its uniqueness. This student wants to highlight the effect of lead in paint on children's health. She used the weight of lead marbles to symbolize the impact of lead paint. She made a dress with two big pockets to hold these marbles.

It's not your standard visualization. One can quibble that dividing the marbles into two pockets doesn't serve a visualziation purpose, and so on. But at the end, it's a memorable performance.

##### Mar 24, 2023

I forgot where I found this chart but here it is:

The designer realizes the flaw of the design, which is why the number 50 is placed in a red box, and there is another big red box  placed right in our faces telling us that any number above 50 represents growing, while all below 50 shrinking.

The real culprit is the column chart design, which treats zero as the baseline, not 50. Thus, the real solution is to move away from a column chart design.

There are many possibilities. Here's one using the Bumps chart form:

There are several interesting insights buried in that column chart!

First we learn that almost all segments were contracting in both years.

Next, there are some clustering of segments. The Premium Regular and Cider segments were moving in sync. Craft, FMB/SEltzer and Below Premium were similar in 2022; intriguingly, Below Premium diverged from the other two segments.

In fact, Below Premium has distinguished itself as the only segment that experienced an improved index relative to 2022!

## Happy holidays

##### Dec 25, 2020

A message of hope.

In past years, I've featured pictures from great food from my travels. In this very different year, I'm showing some joyful creations from my kitchen.

## Consumption patterns during the pandemic

##### May 29, 2020

The impact of Covid-19 on the economy is sharp and sudden, which makes for some dramatic data visualization. I enjoy reading the set of charts showing consumer spending in different categories in the U.S., courtesy of Visual Capitalist.

The designer did a nice job cleaning up the data and building a sequential story line. The spending are grouped by categories such as restaurants and travel, and then sub-categories such as fast food and fine dining.

Spending is presented as year-on-year change, smoothed.

Here is the chart for the General Commerce category:

The visual design is clean and efficient. Even too sparse because one has to keep returning to the top to decipher the key events labelled 1, 2, 3, 4. Also, to find out that the percentages express year-on-year change, the reader must scroll to the bottom, and locate a footnote.

As you move down the page, you will surely make a stop at the Food Delivery category, noting that the routine is broken.

I've featured this device - an element of surprise - before. Remember this Quartz chart that depicts drinking around the world (link).

The rule for small multiples is to keep the visual design identical but vary the data from chart to chart. Here, the exceptional data force the vertical axis to extend tremendously.

This chart contains a slight oversight - the red line should be labeled "Takeout" because food delivery is the label for the larger category.

Another surprise is in store for us in the Travel category.

I kept staring at the Cruise line, and how it kept dipping below -100 percent. That seems impossible mathematically - unless these cardholders are receiving more refunds than are making new bookings. Not only must the entire sum of 2019 bookings be wiped out, but the records must also show credits issued to these credit (or debit) cards. It's curious that the same situation did not befall the airlines. I think many readers would have liked to see some text discussing this pattern.

***

Now, let me put on a data analyst's hat, and describe some thoughts that raced through my head as I read these charts.

Data analysis is hard, especially if you want to convey the meaning of the data.

The charts clearly illustrate the trends but what do the data reveal? The designer adds commentary on each chart. But most of these comments count as "story time." They contain speculation on what might be causing the trend but there isn't additional data or analyses to support the storyline. In the General Commerce category, the 50 to 100 percent jump in all subcategories around late March is attributed to people stockpiling "non-perishable food, hand sanitizer, and toilet paper". That might be true but this interpretation isn't supported by credit or debit card data because those companies do not have details about what consumers purchased, only the total amount charged to the cards. It's a lot more work to solidify these conclusions.

A lot of data do not mean complete or unbiased data.

The data platform provided data on 5 million consumers. We don't know if these 5 million consumers are representative of the 300+ million people in the U.S. Some basic demographic or geographic analysis can help establish the validity. Strictly speaking, I think they have data on 5 million card accounts, not unique individuals. Most Americans use more than one credit or debit cards. It's not likely the data vendor have a full picture of an individual's or a family's spending.

It's also unclear how much of consumer spending is captured in this dataset. Credit and debit cards are only one form of payment.

Data quality tends to get worse.

One thing that drives data analyst nuts. The spending categories are becoming blurrier. In the last decade or so, big business has come to dominate the American economy. Big business, with bipartisan support, has grown by (a) absorbing little guys, and (b) eliminating boundaries between industry sectors. Around me, there is a Walgreens, several Duane Reades, and a RiteAid. They currently have the same owner, and increasingly offer the same selection. In the meantime, Walmart (big box), CVS (pharmacy), Costco (wholesale), etc. all won regulatory relief to carry groceries, fresh foods, toiletries, etc. So, while CVS or Walgreens is classified as a pharmacy, it's not clear that what proportion of the spending there is for medicines. As big business grows, these categories become less and less meaningful.

## Graphing the economic crisis of Covid-19

##### Mar 23, 2020

My friend Ray Vella at The Conference Board has a few charts up on their coronavirus website. TCB is a trusted advisor and consultant to large businesses and thus is a good place to learn how the business community is thinking about this crisis.

I particularly like the following chart:

This puts the turmoil in the stock market in perspective. We are roughly tracking the decline of the Great Recession of the late 2000s. It's interesting that 9/11 caused very mild gyrations in the S&P index compared to any of the other events.

The chart uses an index with value 100 at Day 0. Day 0 is defined by the trigger event for each crisis. About three weeks into the current crisis, the S&P has lost over 30% of its value.

The device of a gray background for the bottom half of the chart is surprisingly effective.

***

Here is a chart showing the impact of the Covid-19 crisis on different sectors.

So the full-service restaurant industry is a huge employer. Restaurants employ 7-8 times more people than airlines. Airlines employ about the same numbers of people as "beverage bars" (which I suppose is the same as "bars" which apparently is different from "drinking places"). Bars employ 7 times more people than "Cafeterias, etc.".

The chart describes where the jobs are, and which sectors they believe will be most impacted. It's not clear yet how deeply these will be impacted. Being in NYC, the complete shutdown is going to impact 100% of these jobs in certain sectors like bars, restaurants and coffee shops.

## Food coma and self-sufficiency in dataviz

##### Jan 27, 2020

The Hustle wrote a strong analysis of the business of buffets. If you've read my analysis of Groupon's business model in Numbersense (link), you'll find some similarities. A key is to not think of every customer as an average customer; there are segments of customers who behave differently, and creating a proper mix of different types of customers is the management's challenge. I will make further comments on the statistics in a future post on the sister blog.

At Junk Charts, we'll focus on visualizing and communciating data. The article in The Hustle comes with the following dataviz:

This dataviz fails my self-sufficiency test. Recall: self-sufficiency is a basic requirement of visualizing data - that the graphical elements should be sufficient to convey the gist of the data. Otherwise, there is no point in augmenting the data with graphical elements.

The self-sufficiency test is to remove the dataset from the dataviz, and ask whether the graphic can stand on its own. So here:

The entire set of ingredient costs appears on the original graphic. When these numbers are removed, the reader gets the wrong message - that the cost is equally split between these five ingredients.

This chart reminds me of the pizza chart that everyone thought was a pie chart except its designer! I wrote about it here. Food coma is a thing.

The original chart may be regarded as an illustration rather than data visualization. If so, it's just a few steps from becoming a dataviz. Like this:

P.S. A preview of what I'll be talking about at the sister blog. The above diagram illustrates the average case - for the average buffet diner. Underneath these costs is an assumption about the relative amounts of each food that is eaten. But eaten by whom?

Also, if you have Numbersense (link), the chapter on measuring the inflation rate is relevant here. Any inflation metric must assume a basket of goods, but then the goods within the basket have to be weighted by the amount of expenditure. It's much harder to get the ratio of expenditures correct compared to getting price data.

## Bubble charts, ratios and proportionality

##### Jan 13, 2020

A recent article in the Wall Street Journal about a challenger to the dominant weedkiller, Roundup, contains a nice selection of graphics. (Dicamba is the up-and-comer.)

The change in usage of three brands of weedkillers is rendered as a small-multiples of choropleth maps. This graphic displays geographical and time changes simultaneously.

The staircase chart shows weeds have become resistant to Roundup over time. This is considered a weakness in the Roundup business.

***

In this post, my focus is on the chart at the bottom, which shows complaints about Dicamba by state in 2019. This is a bubble chart, with the bubbles sorted along the horizontal axis by the acreage of farmland by state.

Below left is a more standard version of such a chart, in which the bubbles are allowed to overlap. (I only included the bubbles that were labeled in the original chart).

The WSJ’s twist is to use the vertical spacing to avoid overlapping bubbles. The vertical axis serves a design perogative and does not encode data.

I’m going to stick with the more traditional overlapping bubbles here – I’m getting to a different matter.

***

The question being addressed by this chart is: which states have the most serious Dicamba problem, as revealed by the frequency of complaints? The designer recognizes that the amount of farmland matters. One should expect the more acres, the more complaints.

Let's consider computing directly the number of complaints per million acres.

The resulting chart (shown below right) – while retaining the design – gives a wholly different feeling. Arkansas now owns the largest bubble even though it has the least acreage among the included states. The huge Illinois bubble is still large but is no longer a loner.

Now return to the original design for a moment (the chart on the left). In theory, this should work in the following manner: if complaints grow purely as a function of acreage, then the bubbles should grow proportionally from left to right. The trouble is that proportional areas are not as easily detected as proportional lengths.

The pair of charts below depict made-up data in which all states have 30 complaints for each million acres of farmland. It’s not intuitive that the bubbles on the left chart are growing proportionally.

Now if you look at the right chart, which shows the relative metric of complaints per million acres, it’s impossible not to notice that all bubbles are the same size.

## Taking small steps to bring out the message

##### Jan 06, 2020

Happy new year! Good luck and best wishes!

***

We'll start 2020 with something lighter. On a recent flight, I saw a chart in The Economist that shows the proportion of operating income derived from overseas markets by major grocery chains - the headline said that some of these chains are withdrawing from international markets.

The designer used one color for each grocery chain, and two shades within each color. The legend describes the shades as "total" and "of which: overseas". As with all stacked bar charts, it's a bit confusing where to find the data. The "total" is actually the entire bar, not just the darker shaded part. The darker shaded part is better labeled "home market" as shown below:

The designer's instinct to bring out the importance of international markets to each company's income is well placed. A second small edit helps: plot the international income amounts first, so they line up with the vertical zero axis. Like this:

This is essentially the same chart. The order of international and home market is reversed. I also reversed the shading, so that the international share of income is displayed darker. This shading draws the readers' attention to the key message of the chart.

A stacked bar chart of the absolute dollar amounts is not ideal for showing proportions, because each bar is a different length. Sometimes, plotting relative values summing to 100% for each company may work better.

As it stands, the chart above calls attention to a different message: that Walmart dwarfs the other three global chains. Just the international income of Walmart is larger than the total income of Costco.

***

Please comment below or write me directly if you have ideas for this blog as we enter a new decade. What do you want to see more of? less of?