Who trades with Sweden

It's great that the UN is publishing dataviz but it can do better than this effort:

Untradestats_sweden

Certain problems are obvious. The country names turned sideways. The meaningless use of color. The inexplicable sequencing of the countries/regions.

Some problems are subtler. "Area, nes" - upon research - is a custom term used by UN Trade Statistics, meaning "not elsewhere specified".

The gridlines are debatable. Their function is to help readers figure out the data values if they care. The design omitted the top and bottom gridlines, which makes it hard to judge the values for USA (dark blue), Netherlands (orange), and Germany (gray).

See here, where I added the top gridline.

Redo_untradestats_sweden_gridline

Now, we can see this value is around 3.6, just over the halfway point between gridlines.

***

A central feature of trading statistics is "balance". The following chart makes it clear that the positive numbers outweigh the negative numbers in the above chart.

Redo_untradestats_sweden

At the time I made the chart, I wasn't sure how to interpret the gap of 1.3%. Looking at the chart again, I think it's saying Sweden has a trade surplus equal to that amount.


A German obstacle course

Tagesschau_original

A twitter user sent me this chart from Germany.

It came with a translation:

"Explanation: The chart says how many car drivers plan to purchase a new state-sponsored ticket for public transport. And of those who do, how many plan to use their car less often."

Because visual language should be universal, we shouldn't be deterred by not knowing German.

The structure of the data can be readily understood: we expect three values that add up to 100% from the pie chart. The largest category accounts for 58% of the data, followed by the blue category (40%). The last and smallest category therefore has 2% of the data.

The blue category is of the most interest, and the designer breaks that up into four sub-groups, three of which are roughly similarly popular.

The puzzle is the identities of these categories.

The sub-categories are directly labeled so these are easy for German speakers. From a handy online translator, these labels mean "definitely", "probably", "rather not", "definitely not". Well, that's not too helpful when we don't know what the survey question is.

According to our correspondent, the question should be "of those who plan to buy the new ticket, how many plan to use their car less often?"

I suppose the question is found above the column chart under the car icon. The translator dutifully outputs "Thus rarer (i.e. less) car use". There is no visual cue to let readers know we are supposed to read the right-hand side as a single column. In fact, I was reading horizontally, row by row, from top to bottom.

Now, the two icons on the left and the middle of the top row should map to not buying and buying the ticket. The check mark and cross convey that message. But... what do these icons map to on the chart below? We get no clue.

In fact, the will-buy ticket group is the 40% blue category while the will-not group is the 58% light gray category.

What about the dark gray thin sector? Well, one needs to read the fine print. The footnote says "I don't know/ no response".

Since this group is small and uninformative, it's fine to push it into the footnote. However, the choice of a dark color, and placing it at the 12-o'clock angle of the pie chart run counter to de-emphasizing this category!

Another twitter user visually depicts the journey we take to understand this chart:

Tagesschau_reply

The structure of the data is revealed better with something like this:

Redo_tagesschau_newticket

The chart doesn't need this many colors but why not? It's summer.


Variance is a friend of dataviz

Seven years ago, I wrote a post about "invariance" in data visualization, which is something we should avoid (link). Yesterday, Business Insider published the following chart in an article about rising gas prices (link):

Businessinsider_gasprices_prices

The map shows the average prices at the pump in seven regions of the United States. 

This chart is succeeded by the following map:

Businessinsider_gasprices_pricechange

This second map shows the change in average gas prices in the same seven regions.

This design is invariant to the data! While the data change, the visualization looks identical. That's because the data are not encoded in any visual element - they are just printed as labels.


Superb tile map offering multiple avenues for exploration

Here's a beauty by WSJ Graphics:

Wsj_powerproduction

The article is here.

This data graphic illustrates the power of the visual medium. The underlying dataset is complex: power production by type of source by state by month by year. That's more than 90,000 numbers. They all reside on this graphic.

Readers amazingly make sense of all these numbers without much effort.

It starts with the summary chart on top.

Wsj_powerproduction_us_summary

The designer made decisions. The data are presented in relative terms, as proportion of total power production. Only the first and last years are labeled, thus drawing our attention to the long-term trend. The order of the color blocks is carefully selected so that the cleaner sources are listed at the top and the dirtier sources at the bottom. The order of the legend labels mirrors the color blocks in the area chart.

It takes only a few seconds to learn that U.S. power production has largely shifted away from coal with most of it substituted by natural gas. Other than wind, the green sources of power have not gained much ground during these years - in a relative sense.

This summary chart serves as a reading guide for the rest of the chart, which is a tile map of all fifty states. Embedded in the tile map is a small-multiples arrangement.

***

The map offers multiple avenues for exploration.

Some readers may look at specific states. For example, California.

Wsj_powerproduction_california

Currently, about half of the power production in California comes from natural gas. Notably, there is no coal at all in any of these years. In addition to wind, solar energy has also gained. All of these insights come without the need for any labels or gridlines!

Wsj_powerproduction_westernstates

Browsing around California, readers find different patterns in other Western states like Oregon and Washington.

Hydroelectric energy is the dominant source in those two states, with wind gradually taking share.

At this point, readers realize that the summary chart up top hides remarkable state-level variations.

***

There are other paths through the map.

Some readers may scan the whole map, seeking patterns that pop out.

One such pattern is the cluster of states that use coal. In most of these states, the proportion of coal has declined.

Yet another path exists for those interested in specific sources of power.

For example, the trend in nuclear power usage is easily followed by tracking the purple. South Carolina, Illinois and New Hampshire are three states that rely on nuclear for more than half of their power.

Wsj_powerproduction_vermont

I wonder what happened in Vermont about 8 years ago.

The chart says they renounced nuclear energy. Here is some history. This one-time event caused a disruption in the time series, unique on the entire map.

***

This work is wonderful. Enjoy it!


Funnel is just for fun

This is part 2 of a review of a recent video released by NASA. Part 1 is here.

The NASA video starts with the spiral chart showing changes in average global temperature. It takes a long time (about 1 minute) to run through 14 decades of data, and for those who are patient, the chart then undergoes a dramatic transformation.

With a sleight of hand, the chart goes from a set of circles to a funnel. Here is a look:

Nasa_climatespiral_funnel

What happens is the reintroduction of a time dimension. Imagine pushing the center of the spiral down into the screen to create a third dimension.

Our question as always is - what does this chart tell readers?

***

The chart seems to say that the variability of temperature has increased over time (based on the width of the funnel). The red/blue color says the temperature is getting hotter especially in the last 20-40 years.

When the reader looks beneath the surface, the chart starts to lose sense.

The width of the funnel is really a diameter of the spiral chart in the given year. But, if you recall, the diameter of the spiral (polar) chart differs depending on which pair of opposite months it connects.

Nasa_climatespiral_fullperiod

In the particular rendering of this video, the width of the funnel is the diameter linking the April and October values.

Remember the polar gridlines behind the spiral:

Nasa_spiral_gridlines

Notice the hole in the middle. This hole has an arbitrary diameter. It can be as big or as small as the designer makes it. Thus, the width of the funnel is as big or as small as the designer wants it. But the first thing that caught our attention is the width of the funnel.

***

The entire section between -1 and +1 is, in fact, meaningless. In the following chart, I removed the core of the funnel, adding back the -1 degree line. Doing so exposes an incompatibility between the spiral and funnel views. The middle of the polar grid is negative infinity, a black hole.

Junkcharts_nasafunnel_arbitrarygap

For a moment, the two sides of the funnel look like they are mirror images. That's not correct, either. Each width of the funnel represents a year, and the extreme values represent April and October values. The line between those two values does not signify anything real.

Let's take a pair of values to see what I mean.

Junkcharts_nasafunnel_lines

I selected two values for October 2021 and October 1899 such that the first value appears as a line double the length of the second. The underlying values are +0.99C and -0.04C, roughly speaking, +1 and 0, so the first value is definitely not twice the size of the second.
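
To see why the relative lengths cannot be trusted, here is a minimal sketch. It assumes (my assumption, not something stated in the video) a linear radial scale with the -1 degree line sitting on the edge of the central hole, and it measures each line from the center of the grid. The names `line_length`, `hole_radius` and `units_per_degree` are made up for illustration.

```python
def line_length(value, hole_radius, units_per_degree=1.0, inner_value=-1.0):
    """Distance from the center of the polar grid to a plotted value.

    Assumes a linear radial scale on which `inner_value` sits on the edge
    of the central hole (an illustrative assumption).
    """
    return hole_radius + units_per_degree * (value - inner_value)


# Anomalies quoted in the text, in degrees C
oct_2021, oct_1899 = 0.99, -0.04

# The hole radius is a free design choice, so try several
for hole_radius in (0.0, 0.07, 0.5, 1.0, 2.0):
    ratio = line_length(oct_2021, hole_radius) / line_length(oct_1899, hole_radius)
    print(f"hole radius {hole_radius:4.2f} -> length ratio {ratio:.2f}")
```

Under these assumptions, the ratio of the two lengths shrinks as the hole grows. A doubling appears only at one particular hole size, which is a property of the design, not of the data.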

The funnel chart can be interpreted, in an obtuse way, as a pair of dot plots. As shown below, if we take dot plots for Aprils and Octobers of every year, turn the chart around, and then connect the corresponding dots, we arrive at the funnel chart.

Junkcharts_nasafunnel_fromdotplots

***

This NASA effort illustrates a central problem in visual communications: the tension between attention (what Andrew Gelman calls "grabbiness") and information integrity. On the one hand, what's the point of an accurate chart when no one is paying attention? On the other hand, what's the point of a grabby chart when anyone who pays attention gets the wrong information? It's not easy to find that happy medium.


What do I think about spirals?

A twitter user asked how I feel about this latest effort (from NASA) to illustrate global warming. To see the entire video, go to their website.

Nasa_climatespiral_fullperiod

This video buries the lede so be patient or jump ahead to 0:56 and watch till the end.

Let's first describe what we are seeing.

The dataset consists of monthly average global temperature "anomalies" from 1880 to 2021 - an "anomaly" is the deviation of the average temperature that month from a reference level (seems like this is fixed at the average temperatures by month between 1951 and 1980).
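
As a rough sketch of that definition, the anomaly for a given month is the temperature minus the average for the same calendar month over the reference period. The function name, the input layout and the toy numbers below are all made up for illustration; the 1951-1980 reference period is the one I guessed at above.

```python
from collections import defaultdict


def monthly_anomalies(temps, ref_start=1951, ref_end=1980):
    """Map (year, month) -> deviation from that month's reference average.

    `temps` maps (year, month) -> average temperature in degrees C.
    """
    # Collect the reference-period values separately for each calendar month
    by_month = defaultdict(list)
    for (year, month), temp in temps.items():
        if ref_start <= year <= ref_end:
            by_month[month].append(temp)
    baseline = {m: sum(vals) / len(vals) for m, vals in by_month.items()}

    # Anomaly = observed value minus the baseline for the same calendar month
    return {(y, m): temp - baseline[m] for (y, m), temp in temps.items()}


# Toy numbers, not real measurements
toy = {(1951, 1): 12.0, (1980, 1): 12.4, (2021, 1): 13.2}
anomalies = monthly_anomalies(toy)
print(round(anomalies[(2021, 1)], 2))
```

Because each month has its own baseline, the seasonal cycle is removed before anything is plotted.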

A simple visualization of the dataset is this:

Junkcharts_redo_nasasprials_longline

We see a gradual rise in temperature from the 1980s to today. The front half of this curve is harder to interpret. The negative values suggest that the average temperatures prior to 1951 are generally lower than the temperature in the reference period. Other than 1880-1910, temperatures have generally been rising.

Now imagine chopping up the above chart into yearly increments, 12 months per year. Then wrap each year's line into a circle, and place all these lines onto the following polar grid system.

Junkcharts_redo_nasaspiral_linesandcircles

Close but not quite there. The circles in the NASA video look much smoother. Two possibilities here. First is the aspect ratio. Note that the polar grid stretches the time axis to the full circle while the vertical axis is squashed. Not enough to explain the smoothness, as seen below.

Junkcharts_redo_nasaspirals_unsmoothedwide

The second possibility is additional smoothing between months.

Junkcharts_redo_nasaspirals_smoothedlines
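
The chop-and-wrap construction described above can be sketched as follows, leaving out the smoothing step. This is my reconstruction under stated assumptions, not NASA's code: January sits at 12 o'clock, months run clockwise, the radial scale is linear, and the -1 degree line sits on the edge of a central hole of radius `hole_radius`. The toy series uses constant placeholder values.

```python
import math


def to_xy(month, anomaly, hole_radius=1.0, inner_value=-1.0):
    """Place one monthly value on the polar grid (month is 1..12)."""
    angle = 2 * math.pi * (month - 1) / 12          # January at 12 o'clock
    radius = hole_radius + (anomaly - inner_value)  # linear radial scale
    return radius * math.sin(angle), radius * math.cos(angle)


def wrap_by_year(series):
    """Chop a time-ordered list of (year, month, anomaly) into one ring per year."""
    rings = {}
    for year, month, anomaly in series:
        rings.setdefault(year, []).append(to_xy(month, anomaly))
    return rings


# Placeholder values, constant within each year
series = [(1880, m, -0.2) for m in range(1, 13)]
series += [(2021, m, 0.9) for m in range(1, 13)]
rings = wrap_by_year(series)
print(len(rings), "rings with", len(rings[1880]), "points each")
```

Note that the year survives only as the key of each ring. On the chart itself, nothing encodes it except the order in which the rings are drawn.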

The end result is certainly pretty:

Nasa_climatespiral_fullperiod

***

Is it a good piece of scientific communications?

What is the chart saying?

I see red rings on the outside, white rings in the middle, and blue rings near the center. Red presumably means hotter, blue cooler.

The gridlines are painted over. The 0 degree (green) line is printed over again and again.

The biggest red circles are just beyond the 1 degree line with the excess happening in the January-March months. In making that statement, I'm attributing meaning to the excess above 1 degree. This inference is purely based on where the 1-degree line is placed.

I also see in the months of December and January, there may have been "cooling", as the blue circles edge toward the -1 degree gridline. Drawing this inference actually refutes my previous claim. I had said that the bulge beyond the +1 degree line is informative because the designer placed the +1 degree line there. If I applied the same logic, then the location of the -1 degree line implies that only values more negative than -1 matter, which excludes the blue bulge!

Now what years are represented by these circles? Test your intuition. Are you tempted to think that the red lines are the most recent years, and the blue lines are the oldest years? If you think so, like I do, then we fall into a trap. We have now imputed two meanings to color -- temperature and recency, when the color coding can only hold one.

The only way to find out for sure is to rewind the tape and watch from the start. The year dimension is pushed to the background in this spiral chart. Instead, the month dimension takes precedence. Recall that at the start, the circles are white. The bluer circles appear in the middle of the date range.

This dimensional flip flop is a key difference between the spiral chart and the line chart (shown again for comparison).

Junkcharts_redo_nasasprials_longline

In the line chart, the year dimension is primary while the month dimension is pushed to the background.

Now, we have to decide what the message of the chart should be. For me, the key message is that on a time scale of decades, the world has experienced a significant warming to the tune of about 1.5 degrees Celsius (2.7 F). The warming has been more pronounced in the last 40 years. The warming is observed in all twelve months of the year.

Because the spiral chart hides the year dimension, it does not convey the above messages.
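
A side note on units: the warming of about 1.5 degrees Celsius is a temperature difference, so converting it to Fahrenheit uses only the 9/5 scale factor, without the 32-degree offset that applies to thermometer readings. A small sketch, with function names of my own choosing:

```python
def delta_c_to_f(delta_c):
    """Convert a temperature difference from Celsius to Fahrenheit degrees."""
    return delta_c * 9 / 5


def reading_c_to_f(temp_c):
    """Convert a thermometer reading from Celsius to Fahrenheit."""
    return temp_c * 9 / 5 + 32


warming_c = 1.5
print(round(delta_c_to_f(warming_c), 1))    # size of the change in F
print(round(reading_c_to_f(warming_c), 1))  # a reading of 1.5 C expressed in F
```

The first function is the one that applies to a change in temperature; the second answers a different question.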

The spiral chart shares the same weakness as the energy demand chart discussed recently (link). Our eyes tend to focus on the outer and inner envelopes of these circles, which by definition are extreme values. Those values do not necessarily represent the bulk of the data. The spiral chart in fact tells us that there is not much to learn from grouping the data by month. 

The appeal of a spiral chart for periodic data is similar to that of a map for spatial data. I don't recommend using maps unless the spatial dimension is where the signal lies. Similarly, the spiral chart is appropriate if there are important deviations from a seasonal pattern.


The envelope of one's data

This is the second post in response to a blog post at StackOverflow (link) in which the author discusses the "harm" of "aggregating away the signal" in your dataset. The first post appeared on my book blog earlier this week (link).

One stop in their exploratory data analysis journey was the following chart:

Stackoverflow_variabilitychart

This chart plots all the raw data, all 8,760 values of electricity consumption in California in 2020. Most analysts know this isn't a nice chart, and it's an abuse of ink. This chart is used as a contrast to the 4-week moving average, which was held up as an example of "over-aggregation".

Why is the above chart bad (aside from the waste of ink)? Think about how you consume the information. For me, I notice these features in the following order:

  1. I see the upper "envelope" of the data, i.e. the top values at each hour of each day throughout the year. This gives me the seasonal pattern with a peak in the summer months.
  2. I see the lower "envelope" of the data.
  3. I see the "height" of the data, which is, roughly speaking, the range of values within a day.
  4. If I squint hard enough, I see a darker band within the band, which roughly maps to the most frequently occurring values (this feature becomes more prominent if we select a lighter shade of gray).
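
The features in the list above can be put into numbers. The sketch below is only an illustration: it assumes the data arrive as a flat list of hourly values, 24 per day, and it uses the daily median as a stand-in for the "most frequently occurring values", which is my substitution. The toy data are made up.

```python
from statistics import median


def daily_summary(hourly):
    """Summarize each day of an hourly series (24 values per day)."""
    days = [hourly[i:i + 24] for i in range(0, len(hourly), 24)]
    return [
        {
            "low": min(day),               # lower envelope
            "high": max(day),              # upper envelope
            "height": max(day) - min(day), # range within the day
            "typical": median(day),        # stand-in for the darker band
        }
        for day in days
    ]


# Two made-up days of hourly values
toy = list(range(24)) + list(range(10, 34))
summary = daily_summary(toy)
print(summary[0])
```

Both envelopes are built from a single value per day, while "typical" depends on all 24. That is one way to see why the envelope over-represents extremes.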

The chart may not be as bad as it looks. The "moving average" is sort of visible. The variability of consumption is visible. The primary problem is it draws attention to the outliers, rather than the more common values.

The envelope of any dataset is composed of extreme values, by definition. For most analysis objectives, extreme values are "noise". In the chart above, it's hard to tell how common the maximum values are relative to other possible values but it's the upper envelope that captures my attention - simply because it's the easiest trend to make out.

***

The same problem actually surfaces in the "improved" chart:

Stackoverflow_weekofyearchart

As explained in the preceding post, this chart rearranges the data. Instead of a single line, there are now 52 overlapping lines, one for each week of the year. So each line is much less dense and we can make out the hour of day/day of week pattern.
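
The rearrangement amounts to cutting the hourly series into blocks of 168 hours. Here is a minimal sketch with placeholder data. How the original author aligned weeks to the calendar, or handled the leftover hours, is not something I know, so this version simply drops the incomplete final block.

```python
HOURS_PER_WEEK = 7 * 24


def split_into_weeks(hourly):
    """Cut an hourly series into consecutive full weeks of 168 values."""
    n_weeks = len(hourly) // HOURS_PER_WEEK  # any incomplete final week is dropped
    return [
        hourly[w * HOURS_PER_WEEK:(w + 1) * HOURS_PER_WEEK]
        for w in range(n_weeks)
    ]


# Placeholder series with the same length as the dataset described above
hourly = [float(h % 24) for h in range(8760)]
weeks = split_into_weeks(hourly)
print(len(weeks), "lines of", len(weeks[0]), "points each")
```

Each of the resulting lists becomes one line on the chart, plotted against hour of the week.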

Notice that the author draws attention to the upper envelope of this chart. They notice the line(s) near the top are from the summer, and this further guides their next analysis.

The reason for focusing on the envelope is the same as in the other chart. Where the lines are dense, it's not easy to make out the pattern.

Even the envelope is not as clear as it seems! There is no reason why the highlighted week (August 16 to 23) should have the highest consumption value each hour of each day of the week. It's possible that the line dips into the middle of the range at various points along the line. In the following chart, I highlight two time points in which lines may or may not have crossed:

Junkcharts_stackoverflow_confusingenvelope

In an interactive chart, each line can be highlighted to resolve the confusion.
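
Without interactivity, the same question can be answered with a few lines of code: for each hour of the week, find which line sits on top. This is a sketch on made-up numbers, with three "weeks" of four "hours" each, and the function name is mine.

```python
from collections import Counter


def weeks_on_top(weeks):
    """For each hour position, return the index of the week with the highest value."""
    n_hours = len(weeks[0])
    return [
        max(range(len(weeks)), key=lambda w: weeks[w][h])
        for h in range(n_hours)
    ]


# Made-up values: rows are weeks, columns are hours
toy_weeks = [
    [5, 9, 4, 8],
    [6, 7, 6, 9],
    [3, 8, 5, 7],
]
top = weeks_on_top(toy_weeks)
print(top)
print(Counter(top))
```

If more than one index shows up in the result, the upper envelope is stitched together from different weeks, and no single line traces it.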

Note that the lower envelope is much harder to decipher, given the density of lines.

***

The author then pursues a hypothesis that there are lines (weeks) with one intra-day peak and there are those with two peaks.

I'd propose that those are not discrete states but points on a continuum. The base pattern can be one with two peaks, a higher peak in the evening, and a lower peak in the morning. Now, if you imagine pushing up the evening peak while holding the lower peak at its height, you'd gradually "erase" the lower peak, though it has merely receded into the background.

Possibly the underlying driver is the total demand for energy. The higher the demand, the more likely it's concentrated in the evening, which causes the lower peak to recede. The lower the demand, the more likely we see both peaks.

In either case, the prior chart drives the direction of the next analysis.


Improving simple bar charts

Here's another bar chart I came across recently. The chart - apparently published by Kaggle - appears to present the challenges data scientists face in industry:

Kaggle

This chart is pretty standard, and inoffensive. But we can still make it better.

Version 1

Redo_kaggle_nodecimals

I removed the decimals from the data labels.

Version 2

Redo_kaggle_noaxislabels

Since every bar is labelled, is anyone looking at the axis labels?

Version 3

Redo_kaggle_nodatalabels

You love axis labels. Then, let's drop the data labels.

Version 4

Redo_kaggle_categories

Ahh, so data scientists struggle with data problems, and people issues. They don't need better tools.


Easy breezy bar charts, perhaps

I came across the following bar chart (link), which presents the results of a survey of CMOs (Chief Marketing Officers) on their attitudes toward data analytics.

Big-Data-and-the-CMO_chart5-Hurdle-800_30Apr2013

Responses are tabulated to the question about the most significant hurdle(s) against the increasing use of data and analytics for marketing.

Eleven answers were presented, in addition to the catchall "Other" response. I'm unable to divine the rule used by the designer to sequence the responses.

It's not in order of significance, the most obvious choice. It's not alphabetical, either.

***

I think this indiscretion is partially redeemed by the use of color shades. The darkest blue shade points our eyes to the most significant hurdle - lack of investment in technology (44% of respondents). The second most significant hurdle is "availability of credible tools for measuring effectiveness" (31%), and that too is in dark blue.

Now what? The third most popular answer has 30% of the respondents, but it's shown by the second palest blue! I then realize the colors don't actually convey any information. Five shades of blue were selected, and they are laid out from top to bottom, from palest to darkest, in a repeating cycle.

***

This chart is wild. Notice how the heights of the bars are variable. It seems that some bars have been made taller to accommodate wrapped lines of text. These small edits introduce visual distortion so that the areas of these bars are no longer proportional to the data.

I like a pair of design decisions. Not showing decimal places and appending the % sign on each bar label are both good. They also extend the horizontal axis to 100%. This shows what proportion of the respondents selected any particular answer - we note that a respondent is allowed to select more than one response.

The following is a more standard way of making a bar chart. (The color shading is not necessary.)

Redo_CMOsurveyanalytics

This example proves that the V corner of the Trifecta Checkup is still relevant. After one develops a good question, collects useful data and selects a standard chart form, figuring out how to visually display the information is not as easy breezy as one might think.


Visualizing composite ratings

A twitter reader submitted the following chart from Autoevolution (link):

Google-maps-is-no-longer-the-top-app-for-navigation-and-offline-maps-179196_1

This is not a successful chart for the simple reason that readers want to look away from it. It's too busy. There is so much going on that one doesn't know where to look.

The underlying dataset is quite common in the marketing world. Through surveys, people are asked to rate some product along a number of dimensions (here, seven). Each dimension has a weight, and combined, the weighted sum becomes a composite ranking (shown here in gray).
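
For concreteness, here is a minimal sketch of how such a composite is computed. The weights and ratings are invented, and only two of the seven dimensions are included; the actual weights used for the chart are not known to me.

```python
def composite_rating(ratings, weights):
    """Weighted sum of dimension ratings; weights are assumed to sum to 1."""
    return sum(weights[dim] * ratings[dim] for dim in weights)


# Hypothetical weights and ratings for one product
weights = {"Customers": 0.5, "Developers Ecosystem": 0.5}
ratings = {"Customers": 4.0, "Developers Ecosystem": 2.0}

score = composite_rating(ratings, weights)
print(score)
```

Since the composite is a blend of the dimensions, two products with the same composite can have quite different profiles.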

Nothing in the chart stands out as particularly offensive even though the overall effect is repelling. Adding the overall rating on top of each column is not the best idea as it distorts the perception of the column heights. But with all these ingredients, the food comes out bland.

***

The key is editing. Find the stories you want to tell, and then deconstruct the chart to showcase them.

I start with a simple way to show the composite ranking, without any fuss:

Redo_junkcharts_autoevolution_top

[Since these are mockups, I have not copied all of the data, just the top 11 items.]

Then, I want to know if individual products have particular strengths or weaknesses along specific dimensions. In a ranking like this, one should expect that some component ratings correlate highly with the overall rating while other components deviate from the overall average.

An example of correlated ratings is the Customers dimension.

Redo_junkcharts_autoevolution_customer

The general pattern of the red dots clings closely to that of the gray bars. The gray bars are the overall composite ratings (re-scaled to the rating range for the Customers dimension). This dimension does not tell us more than what we know from the composite rating.
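
One simple way to do such a re-scaling is a linear min-max mapping, sketched below with made-up numbers. Treat this as one possibility under my assumptions, not a record of the exact transformation behind the mockup.

```python
def rescale(values, new_min, new_max):
    """Linearly map values so that their min and max land on new_min and new_max."""
    old_min, old_max = min(values), max(values)
    span = old_max - old_min
    if span == 0:
        # All values identical: place them at the middle of the new range
        return [(new_min + new_max) / 2 for _ in values]
    return [
        new_min + (v - old_min) * (new_max - new_min) / span
        for v in values
    ]


# Hypothetical composite scores mapped onto a 0-5 rating range
composite = [10, 20, 30]
print(rescale(composite, 0, 5))
```

A linear mapping preserves the ordering and the relative gaps of the composite ratings, which is what lets the bars and the dots be compared by eye.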

By contrast, the Developers Ecosystem dimension provides additional information.

Redo_junkcharts_autoevolution_developer

Esri, AzureMaps and Mapbox performed much better on this dimension than on the average dimension. 

***

The following construction puts everything together in one package:

Redo_mapsplatformsratings.002