Too many colors on a chart is bad, but why?

The following chart is bad, but how so?

Junkcharts_colors_columnchart

The chart is annoying because of the misuse of colors.

What is the purpose of the multiple colors used in this chart? It's not encoding any data. Colors are used here to differentiate one bar from its two neighbors. Or perhaps to make the chart more "appealing".

The coloring scheme backfires because readers may look for meaning in the colors. What do Iceland, the United States and Germany have in common that they are all assigned green? What about Japan, New Zealand, Spain and France, all of which are shown in yellow?

The readers' instinct is driven by a set of unspoken rules that govern the production of data visualization. Specifically, the rule here is: color differences reflect data differences. When such a rule is violated, the reader is misled and confused.

***

For more about this rule, other rules related to making bar charts, and other rules for making data graphics, please read my Long Read article, here.

 


The unspoken rules of visualization

My latest is at DataJournalism.com.

Ejc_unspokenrulesbanner

It's an essay on the following observation:

The efficiency and multidimensionality of the visual medium arise from a set of conventions and rules, which regularises the communications between producers of data visualisation and its consumers. These conventions and rules are often unspoken: it's the visual equivalent of saying 'it goes without saying'.

There are lots of little things visualization designers do in their sleep that don't get mentioned. When a visual design deviates from these rules, the readers may get confused.

Here is one example I discussed in the article (hat tip to Xan Gregg).

Fig04_piechart_diverging

This pie chart is not easy to read beyond the obvious point that English is the most popular. The following pie chart is much easier on the readers:

Fig03_piechart_conforming

Why?

The designer follows some common conventions, such as placing the first slice at the top vertical, sorting the slices from largest to smallest (excepting the "other"), and introducing multiple colors only to encode data differences.
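The sorting convention is simple enough to sketch in a few lines of Python. This is only an illustration: the function name `order_slices`, the "Other" label, and the shares are all made up for the example and do not come from the charts above.

```python
def order_slices(shares, other_label="Other"):
    """Return (label, value) pairs sorted largest to smallest,
    keeping the catch-all slice (if any) at the end."""
    named = [(label, value) for label, value in shares.items()
             if label != other_label]
    named.sort(key=lambda pair: pair[1], reverse=True)
    if other_label in shares:
        named.append((other_label, shares[other_label]))
    return named


# Hypothetical shares, for illustration only (not the data in the pie charts).
example_shares = {"B": 10, "Other": 25, "A": 50, "C": 15}
ordered = order_slices(example_shares)
labels_in_order = [label for label, _ in ordered]
```

Note that "Other" goes last even though it is larger than two of the named slices. If you draw the result with matplotlib, passing `startangle=90` and `counterclock=False` to `pie` should place the first slice at the top vertical and run clockwise from there.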

These rules are silently applied, and are not announced to the reader. There is a network effect: the more practitioners use these rules, the stronger they stick.

My essay attempts to outline some of the most important unspoken rules of visualization. For more, see here.


When design goes awry

One can't accuse the following chart of lacking design. The evidence of departure from convention is strong, but the design decisions appear wayward. (The original is on Money, here.)

Mc_cellphones_money17

 

The donut chart (right) has nine sections. Eight of the sections (excepting A) have clearly all been bent out of shape. It turns out that section A does not have the right size either. The middle gray circle is not really in the middle, as seen below.

Redo_mc_cellphone

The bar charts (left) suffer from two ills. Firstly, the full width of the chart is at the 50 percent mark, so readers are forced to read the data labels to understand the data. Secondly, only the top two categories are shown, thus the size of the whole is lost. A stacked bar chart would serve better here.

Here is a bardot chart; the "dot" part of it makes it easier to see a Top 2 box analysis.

Redo_jc_mc_cellphone_2

I explain the bardot chart here.

 

 PS. Here is Jamie's version (from the comment below):

Jamie_mc_cellphone

 

 


Political winds and hair styling

Washington Post (link) and New York Times (link) published dueling charts last week, showing the swing-swang of the political winds in the U.S. Of course, you know that the pendulum has shifted riotously rightward towards Republican red in this election.

The Post focused its graphic on the urban / not urban division within the country:

Wp_trollhair

Over Twitter, Lazaro Gamio told me they are calling these troll-hair charts. You certainly can see the imagery of hair blowing with the wind. In small counties (right), the wind is strongly to the right. In urban counties (left), the straight hair style has been in vogue since 2008. The numbers at the bottom of the chart drive home the story.

Previously, I discussed the Two Americas map by the NY Times, which covers a similar subject. The Times version emphasizes the geography, and is a snapshot while the Post graphic reveals longer trends.

Meanwhile, the Times published its version of a hair chart.

Nyt_hair_election

This particular graphic highlights the movement among the swing states. (Time moves bottom to top in this chart.) These states shifted left for Obama and marched right for Trump.

The two sets of charts have many similarities. They both use curvy lines (hair) as the main aesthetic feature. The left-right dimension is the anchor of both charts, and sways to the left or right are important tropes. In both presentations, the charts provide visual aid, and are nicely embedded within the story. Neither is intended as exploratory graphics.

But the designers diverged on many decisions, mostly in the D(ata) or V(isual) corner of the Trifecta framework.

***

The Times chart is at the state level while the Post uses county-level data.

The Times plots absolute values while the Post focuses on relative values (cumulative swing from the 2004 position). In the Times version, the reader can see the popular vote margin for any state in any election. The middle vertical line is keyed to the electoral vote (plurality of the popular vote in most states). It is easy to find the crossover states and times.

The Post's designer did some data transformations. Everything is indexed to 2004. Each number in the chart is the county's current leaning relative to 2004. Thus, left of vertical means said county has shifted more blue compared to 2004. The numbers are cumulative moving top to bottom. If a county is 10% left of center in the 2016 election, this effect may have come about this year, or 4 years ago, or 8 years ago, or some combination of the above. Again, left of center does not mean the county voted Democratic in that election. So, the chart must be read with some care.

One complaint about anchoring the data is the arbitrary choice of the starting year. Indeed, the Times chart goes back to 2000, another arbitrary choice. But clearly, the two teams were aiming to address slightly different variations of the key question.

There is a design advantage to anchoring the data. The Times chart is noticeably more entangled than the Post chart. There is far more criss-crossing. This is particularly glaring given that the Times chart contains many fewer lines than the Post chart, due to state versus county.

Anchoring the data to a starting year has the effect of combing one's unruly hair. Mathematically, they are just shifting the lines so that they start at the same location, without altering the curvature. Of course, this is double-edged: the re-centering means the left-blue / right-red interpretation is co-opted.
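Here is a small Python sketch of that shift. The margins are hypothetical (percentage points, negative for left of center and positive for right of center), and the helper names are mine, not anything the Post has published.

```python
def anchor_to_start(series):
    """Shift a series so that its first value becomes zero."""
    base = series[0]
    return [value - base for value in series]


def steps(series):
    """Election-to-election changes, which set the shape of each line."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Hypothetical margins for two counties over four elections (illustration only).
county_a = [-20, -15, -18, -5]   # left of center throughout
county_b = [10, 12, 15, 30]      # right of center throughout

anchored_a = anchor_to_start(county_a)   # [0, 5, 2, 15]
anchored_b = anchor_to_start(county_b)   # [0, 2, 5, 20]

# The shape of each line is unchanged by the shift.
same_shape = (steps(county_a) == steps(anchored_a)
              and steps(county_b) == steps(anchored_b))
```

After anchoring, both lines start at the same point, which is what untangles the hair. County A ends at +15 even though its raw margin of -5 is still left of center, which is the double edge: position on the anchored chart no longer tells you which party won.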

On the Times chart, they used a different coping strategy. Each version of their charts has a filter: they highlight the set of lines to demonstrate different vignettes: the swing states moved slightly to the right, the Republican states marched right, and the Democratic states also moved right. Without these filters, the readers would be winking at the Times's bad-hair day.

***

Another decision worth noting: the direction of time. The Post's choice of top to bottom seems more natural to me than the Times's reverse order but I am guessing some of you may have different inclinations.

Finally, what about the thickness of the lines? The Post encoded population (voter) size while the Times used electoral votes. This decision is partly driven by the choice of state versus county level data.

One can consider electoral votes as a kind of log transformation. The effect of electorizing the popular vote is to pull the extreme values to the center. This significantly simplifies the designer's life. To wit, in the Post chart (shown below), they have to apply a filter to highlight key counties, and you notice that those lines are so thick that all the other counties become barely visible.

  Wp_trollhair_texas

 


Depicting imbalance, straying from the standard chart

My friend Tonny M. sent me a tip to two pretty nice charts depicting the state of U.S. healthcare spending (link).

The first shows U.S. as an outlier:

FtotHealthExp_pC_USD_long-1-768x871

This chart is a replica of the Lane Kenworthy chart, with some added details, that I have praised here before. This chart remains one of the most impactful charts I have seen. The added time-series details allow us to see a divergence from about 1980.

Lanekenworthy-250wi

The second chart shows the inequity of healthcare spending among Americans. The top 10% of spenders consume about 6.5 times as much as the average while the bottom 16% do not spend anything at all.

Ourworldindata_nihcm-spending-concentration-titled

This chart form is standard for depicting imbalance in scientific publications. But the general public finds this chart difficult to interpret, mostly because both axes operate on a cumulative scale. Further, encoding inequity in the bend of the curve is not particularly intuitive.

So I tried out some other possibilities. Both alternatives are based on incremental, not cumulative, metrics. I take the spend of the individual ten groups (deciles) and work with those dollars. Also, I provide a reference point, which is the level of spend of each decile if the spend were to be distributed evenly among all ten groups.
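The arithmetic behind both alternatives can be sketched in Python as follows. The decile figures are hypothetical round numbers chosen so the total is 100; they are not the actual spending data, and the function names are made up for the illustration.

```python
def to_incremental(cumulative):
    """Convert cumulative totals (as on the original curve) into
    the amount belonging to each individual group."""
    previous = [0] + list(cumulative[:-1])
    return [cum - prev for prev, cum in zip(previous, cumulative)]


def excess_over_even_share(group_spend):
    """Return the even-split reference level and each group's
    excess (positive) or deficient (negative) spend against it."""
    even_share = sum(group_spend) / len(group_spend)
    return even_share, [spend - even_share for spend in group_spend]


# Hypothetical cumulative spend by decile, lowest spenders first (illustration only).
cumulative_spend = [0, 0, 1, 3, 6, 11, 19, 31, 50, 100]

decile_spend = to_incremental(cumulative_spend)
even_share, excess = excess_over_even_share(decile_spend)
```

By construction the excess and deficient amounts cancel out, so the reference level sits at the overall average and every group is read as being above or below it.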

The first alternative depicts the "excess" or "deficient" spend as column segments.

Redo_healthcarespend1

The second alternative shows the level of excess or deficient spending as slopes of lines. I am aiming for a bit more drama here.

Redo_healthcarespend2

Now, the interpretation of this chart is not simple. Since illness is not evenly spread out within the population, this distribution might just be the normal state of affairs. Nevertheless, this pattern can also result from the top spenders purchasing very expensive experimental treatments with little chance of success, for example.

 


Raining, data art, if it ain't broke

Via Twitter, reader Joe D. asked a few of us to comment on the SparkRadar graphic by WeatherSpark.

At the time of writing, the picture for Baltimore is very pretty:

Sparkradar

The picture for New York is not as pretty but still intriguing. We are having a bout of summer and hence the white space (no precipitation):

Sparkradar_newyork

Interpreting this innovative chart is a tough task - this is a given with any innovative chart. Explaining the chart requires all the text on this page.

The difficulty of interpreting the SparkRadar chart is twofold.

Firstly, the axes are unnatural. Time runs vertically, defying the horizontal convention. Also, "now" - the most recent time depicted - is at the very bottom, which tempts readers to read bottom to top, meaning we are reading time running backwards into the past. In most charts, time runs left to right from past to present (at least in the left-right-centric part of the world that I live in).

Location has been reduced to one dimension. The labels "Distance Inside" and "Distance from Storm" confuse me - perhaps those who follow weather more closely can justify the labels. Conventionally, location is shown in two dimensions.

The second difficulty is created by the inclusion of irrelevant data (aka noise). The square grid prescribes a fixed box inside which all data are depicted. In the New York graphic, something is going on in the top right corner - far away in both time and space - how does it help the reader?

***

Now, contrast this chart to the more standard one, a map showing rain "clouds" moving through space.

Bing_precipitationradar_baltimore

(From Bing search result)

The standard one wins because it matches our intuition better.

Location is shown in two dimensions.

Distance from the city is shown on the map as scaled distance.

Time is shown as motion.

Speed is shown as speed of the motion. (In SparkRadar, speed is shown by the slope of imaginary lines.)

Severity is shown by density and color.

Nonetheless, a panel of the new charts makes great data art.

 

 


Happy new year. Did you have a white Christmas?

Happy 2016.

I spent time with the family in California, wiping out any chance of a white Christmas, although I hear that the probability would have been minuscule even had I stayed.

I did come across a graphic that tried to drive the point home, via NOAA.

WhiteChristmasProbabilityBinned_620_hat

Unfortunately, this reminded me a little of the controversial Florida gun-deaths chart (see here):

  Floridaguns-300wi

In this graphic, the designer played with the up-is-bigger convention, drawing some loud dissent.

Begin with the question addressed by the NOAA graphic: which parts of the country have the highest likelihood of having a white Christmas? My first instinct is to look at the darkest regions, which ironically match the places with the smallest chance of snow.

Surely, the designer's idea is to play with white Christmas. But I am not liking the result.

***

Then, I happened upon an older version (2012) of this map, also done by NOAA. (See this Washington Post blog for example.)

WHITE-XMAS-NEW-NORMALS620p-olderNOAA

There are a number of design choices that make this version more effective.

The use of an unrelated brown color to cordon off the bottom category (0-10%) is a great idea.

Similarly, the play of hue and shade allows readers to see the data at multiple levels, first at the top level of more likely, less likely, and not likely, and then at the more detailed level of 10 categories.

Finally, there is no whiteness inside the US boundary. The top category is the lightest shade of purple, not exactly white. In the 2015 version above, the white of the snowy regions is not differentiated from the white of the Great Lakes.

I am still not convinced about the inversion of the darker-is-larger convention though. How about you?