What is the price for objectivity

I knew I had to remake this chart.


The simple message of this chart is hidden behind layers of visual complexity. What the analyst wants readers to focus on (as discerned from the text on the right) is the red line, the seven-day moving average of new hospital admissions due to Covid-19 in Texas.

My eyes kept wandering away from the line. It's the sideway data labels on the columns. It's the columns that take up vastly more space than the red line. It's the sideway date labels on the horizontal axis. It's the redundant axis labels for hospitalizations when the entire data set has already been printed. It's the two hanging diamonds, for which the clues are filed away in the legend above.

Here's a version that brings out the message: after Phase 2 re-opening, the number of hospital admissions has been rising steadily.


Dots are used in place of columns, which push these details to the background. The line as well as periods of re-opening are directly labeled, removing the need for a legend.

Here's another visualization:


This chart plots the weekly average new hospital admissions, instead of the seven-day moving average. In the previous chart, the raggedness of moving average isn't transmitting any useful information to the average reader. I believe this weekly average metric is easier to grasp for many readers while retaining the general story.


On the original chart by TMC, the author said "the daily hospitalization trend shows an objective view of how COVID-19 impacts hospital systems." Objectivity is an impossible standard for any kind of data analysis or visualization. As seen above, the two metrics for measuring the trend in hospitalizations have pros and cons. Even if one insists on using a moving average, there are choices of averaging methods and window sizes.

Scientists are trained to believe in objectivity. It frequently disappoints when we discover that the rest of the world harbors no such notion. If you observe debates between politicians or businesspeople or social scientists, you rarely hear anyone claim one analysis is more objective - or less subjective - than another. The economist who predicts Dow to reach a new record, the business manager who argues for placing discounted products in the front not the back of the store, the sportscaster who maintains Messi is a better player than Ronaldo: do you ever hear these people describe their methods as objective?

Pursuing objectivity leads to the glorification of data dumps. The scientist proclaims disinterest in holding an opinion about the data. This is self-deception though. We clearly have opinions because when someone else  "misinterprets" the data, we express dismay. What is the point of pretending to hold no opinions when most of the world trades in opinions? By being "objective," we never shape the conversation, and forever play defense.

Designs of two variables: map, dot plot, line chart, table

The New York Times found evidence that the richest segments of New Yorkers, presumably those with second or multiple homes, have exited the Big Apple during the early months of the pandemic. The article (link) is amply assisted by a variety of data graphics.

The first few charts represent different attempts to express the headline message. Their appearance in the same article allows us to assess the relative merits of different chart forms.

First up is the always-popular map.


The advantage of a map is its ease of comprehension. We can immediately see which neighborhoods experienced the greater exoduses. Clearly, Manhattan has cleared out a lot more than outer boroughs.

The limitation of the map is also in view. With the color gradient dedicated to the proportions of residents gone on May 1st, there isn't room to express which neighborhoods are richer. We have to rely on outside knowledge to make the correlation ourselves.

The second attempt is a dot plot.


We may have to take a moment to digest the horizontal axis. It's not time moving left to right but income percentiles. The poorest neighborhoods are to the left and the richest to the right. I'm assuming that these percentiles describe the distribution of median incomes in neighborhoods. Typically, when we see income percentiles, they are based on households, regardless of neighborhoods. (The former are equal-sized segments, unlike the latter.)

This data graphic has the reverse features of the map. It does a great job correlating the drop in proportion of residents at home with the income distribution but it does not convey any spatial information. The message is clear: The residents in the top 10% of New York neighborhoods are much more likely to have left town.

In the following chart, I attempted a different labeling of both axes. It cuts out the need for readers to reverse being home to not being home, and 90th percentile to top 10%.


The third attempt to convey the income--exit relationship is the most successful in my mind. This is a line chart, with time on the horizontal axis.


The addition of lines relegates the dots to the background. The lines show the trend more clearly. If directly translated from the dot plot, this line chart should have 100 lines, one for each percentile. However, the closeness of the top two lines suggests that no meaningful difference in behavior exists between the 20th and 80th percentiles. This can be conveyed to readers through a short note. Instead of displaying all 100 percentiles, the line chart selectively includes only the 99th , 95th, 90th, 80th and 20th percentiles. This is a design choice that adds by subtraction.

Along the time axis, the line chart provides more granularity than either the map or the dot plot. The exit occurred roughly over the last two weeks of March and the first week of April. The start coincided with New York's stay-at-home advisory.

This third chart is a statistical graphic. It does not bring out the raw data but features aggregated and smoothed data designed to reveal a key message.

I encourage you to also study the annotated table later in the article. It shows the power of a well-designed table.

[P.S. 6/4/2020. On the book blog, I have just published a post about the underlying surveillance data for this type of analysis.]



Pulling the multi-national story out, step by step

Reader Aleksander B. found this Economist chart difficult to understand.


Given the chart title, the reader is looking for a story about multinationals producing lower return on equity than local firms. The first item displayed indicates that multinationals out-performed local firms in the technology sector.

The pie charts on the right column provide additional information about the share of each sector by the type of firms. Is there a correlation between the share of multinationals, and their performance differential relative to local firms?


We can clean up the presentation. The first changes include using dots in place of pipes, removing the vertical gridlines, and pushing the zero line to the background:


The horizontal gridlines attached to the zero line can also be removed:


Now, we re-order the rows. Start with the aggregate "All sectors". Then, order sectors from the largest under-performance by multinationals to the smallest.


The pie charts focus only on the share of multinationals. Taking away the remainders speeds up our perception:


Help the reader understand the data by dividing the sectors into groups, organized by the performance differential:


For what it's worth, re-sort the sectors from largest to smallest share of multinationals:


Having created groups of sectors by share of multinationals, I simplify further by showing the average pie chart within each group:



To recap all the edits, here is an animated gif: (if it doesn't play automatically, click on it)



Judging from the last graphic, I am not sure there is much correlation between share of multinationals and the performance differentials. It's interesting that in aggregate, local firms and multinationals performed the same. The average hides the variability by sector: in some sectors, local firms out-performed multinationals, as the original chart title asserted.

Re-thinking a standard business chart of stock purchases and sales

Here is a typical business chart.


A possible story here: institutional investors are generally buying AMD stock, except in Q3 2018.

Let's give this chart a three-step treatment.

STEP 1: The Basics

Remove the data labels, which stand sideways awkwardly, and are redundant given the axis labels. If the audience includes people who want to take the underlying data, then supply a separate data table. It's easier to copy and paste from, and doing so removes clutter from the visual.

The value axis is probably created by an algorithm - hard to imagine someone deliberately placing axis labels  $262 million apart.

The gridlines are optional.


STEP 2: Intermediate

Simplify and re-organize the time axis labels; show the quarter and year structure. The years need not repeat.

Align the vocabulary on the chart. The legend mentions "inflows and outflows" while the chart title uses the words "buying and selling". Inflows is buying; outflows is selling.


STEP 3: Advanced

This type of data presents an interesting design challenge. Arguably the most important metric is the net purchases (or the net flow), i.e. inflows minus outflows. And yet, the chart form leaves this element in the gaps, visually.

The outflows are numerically opposite to inflows. The sign of the flow is encoded in the color scheme. An outflow still points upwards. This isn't a criticism, but rather a limitation of the chart form. If the red bars are made to point downwards to indicate negative flow, then the "net flow" is almost impossible to visually compute!

Putting the columns side by side allows the reader to visually compute the gap, but it is hard to visually compare gaps from quarter to quarter because each gap is hanging off a different baseline.

The following graphic solves this issue by focusing the chart on the net flows. The buying and selling are still plotted but are deliberately pushed to the side:


The structure of the data is such that the gray and pink sections are "symmetric" around the brown columns. A purist may consider removing one of these columns. In other words:


Here, the gray columns represent gross purchases while the brown columns display net purchases. The reader is then asked to infer the gross selling, which is the difference between the two column heights.

We are almost back to the original chart, except that the net buying is brought to the foreground while the gross selling is pushed to the background.


Quick example of layering

The New York Times uses layering to place the Alabama tornadoes in context. (link)

Today's wide availability of detailed data allows designers to create dense data graphics like this:


The graphic shows the starting and ending locations and trajectory of each tornado, as well as the wind speeds (shown in color).

Too much data slows down our understanding of the visual message. The remedy is to subtract. Here is a second graphic that focuses only on the strongest tornadoes (graded 4 or 5 on a 5-point scale):


Another goal of the data visualization is to place in context the tornado that hit Beauregard:


The area around Beauregard is not typically visited by strong tornadoes. Also, the tornadoes were strong but there have been stronger ones.


The designer unfolds the story in three stages. There are no knobs and sliders and arrows, and that's a beauty. It's usually not a good idea to make readers find the story themselves.

Five steps to let the young ones shine

Knife stabbings are in the news in the U.K. and the Economist has a quartet of charts to illustrate what's going on.


I'm going to focus on the chart on the bottom right. This shows the trend in hospital admissions due to stabbings in England from 2000 to 2018. The three lines show all ages, and two specific age groups: under 16 and 16-18.

The first edit I made was to spell out all years in four digits. For this chart, numbers like 15 and 18 can be confused with ages.


The next edit corrects an error in the subtitle. The reference year is not 2010 as those three lines don't cross 100. It appears that the reference year is 2000. Another reason to use four-digit years on the horizontal axis is to be consistent with the subtitle.


The next edit removes the black dot which draws attention to itself. The chart though is not about the year 2000, which has the least information since all data have been forced to 100.


The next edit makes the vertical axis easier to interpret. The indices 150, 200, are much better stated as + 50%, + 100%. The red line can be labeled "at 2000 level". One can even remove the subtitle 2000=100 if desired.


Finally, I surmise the message the designer wants to get across is the above-average jump in hospital admissions among children under 16 and 16 to 18. Therefore, the "All" line exists to provide context. Thus, I made it a dashed line pushing it to the background.





Appreciating population mountains

Tim Harford tweeted about a nice project visualizing of the world's distribution of population, and wondered why he likes it so much. 

That's the question we'd love to answer on this blog! Charts make us emotional - some we love, some we hate. We like to think that designers can control those emotions, via design choices.

I also happen to like the "Population Mountains" project as well. It fits nicely into a geography class.

1. Chart Form

The key feature is to adopt a 3D column chart form, instead of the more conventional choropleth or dot density. The use of columns is particularly effective here because it is natural - cities do tend to expand vertically upwards when ever more people cramp into the same amount of surface area. 


Imagine the same chart form is used to plot the number of swimming pools per square meter. It just doesn't make the same impact. 

2. Color Scale

The designer also made judicious choices on the color scale. The discrete, 5-color scheme is a clear winner over the more conventional, continuous color scale. The designer made a deliberate choice because most software by default uses a continuous color scale for continuous data (population density per square meter).


Also, notice that the color intervals in 5-color scale is not set uniformly because there is a power law in effect - the dense areas are orders of magnitude denser than the sparsely populated areas, and most locations are low-density. 

These decisions have a strong influence on the perception of the information: it affects the heights of the peaks, the contrasts between the highs and lows, etc. It also injects a degree of subjectivity into the data visualization exercise that some find offensive.

3. Background

The background map is stripped of unnecessary details so that the attention is focused on these "population mountains". No unnecessary labels, roads, relief, etc. This demonstrates an acute awareness of foreground/background issues.

4. Insights on the "shape" of the data 

The article makes the following comment:

What stands out is each city’s form, a unique mountain that might be like the steep peaks of lower Manhattan or the sprawling hills of suburban Atlanta. When I first saw a city in 3D, I had a feel for its population size that I had never experienced before.

I'd strike out population size and replace with population density. In theory, the sum of the areas of the columns in any given surface area gives you the "population size" but given the fluctuating heights of these columns, and the different surface areas (sprawls) of different cities, it is an Olympian task to estimate the volumes of the population mountains!

The more salient features of these mountains, most easily felt by readers, are the heights of the peak columns, the sprawl of the cities, and the general form of the mass of columns. The volume of the mountain is one of the tougher things to see. Similarly, the taller 3D columns hide what's behind them, and you'd need to spin and rotate the map to really get a good feel.

Here is the contrast between Paris and London, with comparable population sizes. You can see that the population in Paris (and by extension, France) is much more concentrated than in the U.K. This difference is a surprise to me.


5. Sourcing

Some of the other mountains, especially those in India and China, look a bit odd to me, which leads me to wonder about the source of the data. This project has a very great set of footnotes that not only point to the source of the data but also a discussion of its limitations, including the possibility of inaccuracies in places like India and China. 


Check out Population Mountains!






Well-structured, interactive graphic about newsrooms

Today, I take a detailed look at one of the pieces that came out of an amazing collaboration between Alberto Cairo, and Google's News Lab. The work on diversity in U.S. newsrooms is published here. Alberto's introduction to this piece is here.

The project addresses two questions: (a) gender diversity (representation of women) in U.S. newsrooms and (b) racial diversity (representation of white vs. non-white) in U.S. newsrooms.

One of the key strengths of the project is how the complex structure of the underlying data is displayed. The design incorporates the layering principle everywhere to clarify that structure.

At the top level, the gender and race data are presented separately through the two tabs on the top left corner. Additionally, newsrooms are classified into three tiers: brand-names (illustrated with logos), "top" newsrooms, and the rest.


The brand-name newsrooms are shown with logos while the reader has to click on individual bubbles to see the other newsrooms. (Presumably, the size of the bubble is the size of each newsroom.)

The horizontal scale is the proportion of males (or females), with equality positioned in the middle. The higher the proportion of male staff, the deeper is the blue. The higher the proportion of female staff, the deeper is the red. The colors are coordinated between the bubbles and the horizontal axis, which is a nice touch.

I am not feeling this color choice. The key reference level on this chart is the 50/50 split (parity), which is given the pale gray. So the attention is drawn to the edges of the chart, to those newsrooms that are the most gender-biased. I'd rather highlight the middle, celebrating those organizations with the best gender balance.


The red-blue color scheme unfortunately re-appeared in a subsequent chart, with a different encoding.


Now, blue means a move towards parity while red indicates a move away from parity between 2001 and 2017. Gray now denotes lack of change. The horizontal scale remains the same, which is why this can cause some confusion.

Despite the colors, I like the above chart. The arrows symbolize trends. The chart delivers an insight. On average, these newsrooms are roughly 60% male with negligible improvement over 16 years.


Back to layering. The following chart shows that "top" newsrooms include more than just the brand-name ones.


The dot plot is undervalued for showing simple trends like this. This is a good example of this use case.

While I typically recommend showing balanced axis for bipolar scale, this chart may be an exception. Moving to the right side is progress but the target sits in the middle; the goal isn't to get the dots to the far right so much of the right panel is wasted space.


A gem among the snowpack of Olympics data journalism

It's not often I come across a piece of data journalism that pleases me so much. Here it is, the "Happy 700" article by Washington Post is amazing.



When data journalism and dataviz are done right, the designers have made good decisions. Here are some of the key elements that make this article work:

(1) Unique

The topic is timely but timeliness heightens both the demand and supply of articles, which means only the unique and relevant pieces get the readers' attention.

(2) Fun

The tone is light-hearted. It's a fun read. A little bit informative - when they describe the towns that few have heard of. The notion is slightly silly but the reader won't care.

(3) Data

It's always a challenge to make data come alive, and these authors succeeded. Most of the data work involves finding, collecting and processing the data. There isn't any sophisticated analysis. But a powerful demonstration that complex analysis is not always necessary.

(4) Organization

The structure of the data is three criteria (elevation, population, and terrain) by cities. A typical way of showing such data might be an annotated table, or a Bumps-type chart, grouped columns, and so on. All these formats try to stuff the entire dataset onto one chart. The designers chose to highlight one variable at a time, cumulatively, on three separate maps. This presentation fits perfectly with the flow of the writing. 

(5) Details

The execution involves some smart choices. I am a big fan of legend/axis labels that are informative, for example, note that the legend doesn't say "Elevation in Meters":


The color scheme across all three maps shows a keen awareness of background/foreground concerns. 

What do we think of the "packed" bar chart?

Xan Gregg - my partner in the #onelesspie campaign to replace terrible Wikipedia pie charts one at a time - has come up with a new chart form that he calls "packed bars". It's a combination of bar charts and the treemap.

Here is an example of a packed barchart, in which the top 10 companies on the S&P500 index are displayed:


What he's doing is to add context to help interpret the data. So frequently these days, we encounter data analyses of the "Top X" or "Bottom Y" type. Such analyses are extremely limited in utility as it ignores the bulk of the data. The extreme values have little to nothing to say about the rest of the data. This problem is particularly acute in skewed data.

Compare the two versions:


The left chart is a Top 10 analysis. The reader knows nothing about the market cap of the other 490 companies. The right chart provides the context. We can see that the Top 10 companies have a combined market cap that is roughly a quarter of the total market cap in the S&P 500. We also learn about the size of the next 10 versus the Top 10, etc.

As with any chart form, a nice dataset can really surface its power. I really like what the packed barchart reveals about the election data by county:


(Thanks to Xan for providing me this image.)

Notice the preponderance of red on the right side and the gradual shift from blue/purple to pink/red moving left to right. This is very effective at showing one of the most important patterns in American politics - the small counties are mostly deep red while the Democratic base is to be found primarily in large metropolitan areas. I have previously featured a number of interesting election graphics here. Washington Post's nation of peaks is another way to surface this pattern.

Xan would love to get feedback about this chart type. He has put up a blog post here with more details. I also love this animation he created to show how the packing occurs.