Simple presentations

In the previous post, I looked at this chart that shows the distributions of four subgroups found in a dataset:

Davidcurran_originenglishwords

This chart takes quite some effort to decipher, as does another version I featured.

The key messages appear to be: (i) most English words are of Germanic origin, (ii) the most popular English words are even more skewed towards Germanic origin, (iii) words of French origin started showing up around rank 50, those of Latin origin around rank 250.

***

If we are making a graphic for presentation, we can cut down the visual clutter tremendously by using - hmmm - a set of pie charts.

Junkcharts_redo_originenglishwords_pies

For those allergic to pies, here's a stacked column chart:

Junkcharts_redo_originenglishwords_columns

Both of these can be thought of as "samples" from the original chart, selected to highlight shifts in the relative proportions.

Davidcurran_originenglishwords_sampled

I also reversed the direction of the horizontal axis, as I think the story is better told starting from the whole dataset and homing in on subsets.

 

P.S. [1/10/2025] A reader who has expertise in this subject also suggested a stacked column chart with reversed axis in a comment, so my recommendation here is confirmed.


Five-value summaries of distributions

BG commented on my previous post, describing her frustration with the “stacked range chart”:

A stacked graph visualizes cubes stacked one on top of the other. So you can't use it for negative numbers, because there's no such thing [as] "negative data". In graphs, a "minus" sign visualizes the opposite direction of one series from another. Doing average plus average plus average plus average doesn't seem logical at all.

***

I have already planned a second post to discuss the problems of using a stacked column chart to show markers of a numeric distribution.

I tried to replicate how the Youtuber generated his “stacked range chart” by appropriating Excel’s stacked column chart, but failed. I think there are some missing steps not mentioned in the video. At around 3:33 of the video, he shows a “hack” involving adding 100 degrees (any large enough value) to all values (already converted to ranges). Then, the next screen displays the resulting chart. Here is the dataset on the left and the chart on the right.

Minutephysics_londontemperature_datachart

Afterwards, he replaces the axis labels with new labels, effectively shifting the axis. But something is missing from the narrative. Since he's using a stacked column chart, the values in the table are encoded in the heights of the respective blocks. The total stacked height of each column should be in the hundreds, since he has added 100 to each cell. But that's not what the chart shows.
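For what it's worth, here is a minimal sketch of what I believe the missing steps are, written in base R rather than Excel: make the bottom block invisible, then relabel the axis to undo the offset. This is my reconstruction, not the Youtuber's actual recipe.

```r
# A guess at the missing steps, sketched in base R (not the Youtuber's
# actual Excel recipe): hide the bottom block and relabel the axis.
vals   <- c(-3, 5, 9, 14, 24)              # the London, Ontario averages used below
offset <- 100                              # "any large enough value"
segs   <- c(vals[1] + offset, diff(vals))  # bottom block, then the four ranges
barplot(matrix(segs, ncol = 1),
        col = c(NA, "grey70", "steelblue", "steelblue", "grey70"),
        border = NA, axes = FALSE)         # col = NA hides the bottom block
axis(2, at = pretty(vals) + offset, labels = pretty(vals))  # shift labels back down
```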

***

In the rest of the post, I’ll skip over how to make such a chart in Excel, and talk about the consequences of inserting “range” values into the heights of the blocks of a stacked column chart.

Let’s focus on London, Ontario; the five temperature values, corresponding to various average temperatures, are -3, 5, 9, 14, 24. Just throwing those numbers into a stacked column chart in Excel results in the following useless chart:

Stackedcolumnchart_londonontario

The temperature averages are cumulatively summed, which makes no sense, as noted by reader BG. [My daily temperature data differ somewhat from those in the Youtube video. My source is here.]

We should ignore the interiors of the blocks, and instead interpret the edges of these blocks. There are five edges corresponding to the five data values. As in:

Junkcharts_redo_londonontariotemperatures_dotplot

The average temperature in London, Ontario (during Spring 2023-Winter 2024) is 9 C. This overall average hides seasonal as well as diurnal variations in temperature.

If we want to acknowledge that night-time temperatures are lower than day-time temperatures, we draw attention to the two values bracketing 9 C, i.e. 5 C and 14 C. The average daytime (max) temperature is 14 C while the average night-time (min) temperature is 5 C. Furthermore, Ontario experiences seasons, so that the average daytime temperature of 14 C is subject to seasonal variability; in the summer, it goes up to 24 C. In the winter, the average night-time temperature goes down to -3 C, compared to 5 C across all seasons. [For those paying closer attention, daytime/max and night-time/min form congruous pairs because the max temperature occurs during daytime while the min temperature occurs during night-time. Thus, the average of maximum temperatures is the same as the average of daytime maximum temperatures.]

The above dotplot illustrates this dataset adequately. The Youtuber explained why he didn’t like it – I couldn’t quite make sense of what he said. It’s possible he thinks the gaps between those averages are more meaningful than the averages themselves, and therefore he prefers a chart form that draws our attention to the ranges, rather than the values.

***

Our basic model of temperature can be thought of as: temperature on a given day = overall average + adjustment for seasonality + adjustment for diurnality.

Take the top three values 9, 14, 24 from the above list. Starting at the overall average of 9 C, the analyst gets to 14 if he homes in on max daily temperatures, and to 24 if he further restricts the analysis to summer months (which have the higher temperatures). The second gap is 10 C, twice as large as the first gap of 5 C. Thus, the seasonal fluctuations are larger in magnitude than the day-night fluctuations. Said differently, the effect of season on temperature is bigger than that of hour of day.

In interpreting the “ranges” or gaps between averages, narrow ranges suggest low variability while wider ranges suggest higher variability.

Here's a set of boxplots for the same data:

Junkcharts_redo_londonontariotemperatures

The boxplot "edges" also demarcate five values; they are not the same five values as defined by the Youtuber but both sets of five values describe the underlying distribution of temperatures.
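As an aside, the five values marked by a boxplot are Tukey's five-number summary, which base R computes directly. A minimal sketch, with daily_temps standing in for a vector of daily readings (the numbers below are made up):

```r
# Tukey's five-number summary: min, lower hinge, median, upper hinge, max.
# daily_temps is a made-up stand-in for a vector of daily readings.
daily_temps <- c(-8, -3, 1, 4, 7, 9, 12, 15, 19, 24, 28)
fivenum(daily_temps)   # the five values a boxplot is built from
boxplot(daily_temps)   # note: whiskers stop at the most extreme points within
                       # 1.5 IQR of the box, so they can fall short of the
                       # min/max when outliers exist
```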

 

P.S. For a different example of something similar, see this old post.


Dot plots with varying dot sizes

In a prior post, I appreciated the effort by the Bloomberg Graphics team to describe the diverging fortunes of Japanese and Chinese car manufacturers in various Asian markets.

The most complex chart used in that feature is the following variant of a dot plot:

Bloomberg_japancars_chinamarket

This chart plots the competitors in the Chinese domestic car market. Each bubble represents a car brand. Following the styling of the entire article, red is associated with Japanese brands while medium gray indicates Chinese brands. Light gray shows brands from the rest of the world. (In my view, adding the pink for U.S. and blue for German brands - seen on the first chart in this series - wouldn't have been too much.)

The dot size represents the current relative market share of the brand. The main concern of the Bloomberg article is the change in market share in the period 2019-2024. This is placed on the horizontal axis, so the bubbles on the right side represent growing brands while the bubbles on the left, weakening brands.

All the Japanese brands are stagnating or declining, from the perspective of market share.

The biggest loser appears to be Volkswagen although it evidently started off at a high level since its bubble size after shrinkage is still among the largest.

***

This chart form is a composite. There are at least two ways to describe it. I prefer to see it as a dot plot with an added dimension of dot size. A dot plot typically plots a single dimension on a single axis, and here, a second dimension is encoded in the sizes of the dots.

An alternative interpretation is that it is a scatter plot with a third dimension in the dot size. Here, the vertical dimension is meaningless, as the dots are arbitrarily spread out to prevent overplotting. This arrangement is also known as a bubble plot, if we adopt the convention that a bubble is a dot of variable size. In a typical bubble plot, both the vertical and horizontal axes carry meaning, but here the vertical axis is arbitrary.
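A minimal base R sketch of this chart form, with a made-up data frame (the column names change, share, origin are mine, not Bloomberg's):

```r
# Dot plot with a size dimension: change in share on x, arbitrary jittered
# y to avoid overplotting, dot area encoding current share.
df <- data.frame(change = c(-4, -2, 1, 3, 8),     # made-up brands
                 share  = c(10, 3, 2, 5, 15),
                 origin = c("Japan", "Japan", "Other", "China", "China"))
pal <- c(Japan = "red", China = "grey40", Other = "grey80")
set.seed(1)
plot(df$change, jitter(rep(0, nrow(df)), amount = 1),
     cex = 4 * sqrt(df$share / max(df$share)),  # sqrt so area, not radius, tracks share
     pch = 19, col = pal[df$origin],
     xlab = "Change in market share, 2019-2024", ylab = "", yaxt = "n", bty = "n")
```

The sqrt is the key design choice: readers judge bubble areas, not radii, so the share should scale with the area.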

The bubble plot draws attention to the variable in the bubble size, the scatter plot emphasizes the two variables encoded in the grid, while the dot plot highlights a single metric. Each form relegates the remaining metrics to secondary status.

***

Another revelation of the graph is the fragmentation of the market. There are many dots, especially medium gray dots. There are quite a few Chinese local manufacturers, most of which experienced moderate growth. Most of these brands are startups - this can be inferred because the size of the dot is about the same as the change in market share.

The only foreign manufacturer to make material gains in the Chinese market is Tesla.

The real story of the chart is BYD. I almost missed its dot on first impression, as it sits on the far right edge of the chart (in the original webpage, the right edge of the chart is aligned with the right edge of the text). BYD is the fastest growing brand in China, and also its top brand. The pedestrian gray color chosen for Chinese brands probably didn't help. Besides, I had a little trouble figuring out whether the BYD bubble is larger than the largest bubble in the size legend, shown at the opposite end of the chart from BYD. (I measured, and indeed the BYD bubble is slightly larger.)

This dot chart (with variable dot sizes) is nice for highlighting individual brands. But it doesn't show aggregates. One of the callouts on the chart reads: "Chinese cars' share rose by 23%, with BYD at the forefront". These words are necessary because it's impossible to figure out that the total share gain by all Chinese brands is 23% from this chart form.

They present this information in the line chart that I included in the last post, repeated here:

Bloomberg_japancars_marketshares

The first chart shows that, cumulatively, Chinese brands have increased their share of the Chinese market by 23 percentage points while Japanese brands have ceded about 9 percentage points of market share.

The individual-brand view offers other insights that can't be found in the aggregate line chart. We can see that in addition to BYD, there are a few local brands that have similar market shares as Tesla.

***

It's tough to find a single chart that brings out insights at several levels of analysis, which is why we like to talk about a "visual story" which typically comprises a sequence of charts.

Small tweaks that make big differences

It's one of those days when a web search led me to an unfamiliar corner, and I found myself poring over a pile of column charts that look like this:

GO-and-KEGG-diagrams-A-Forty-nine-different-GO-term-annotations-of-the-parental-genes

This pair of charts appears to be canonical in a type of genetics analysis. I'll focus on the column chart up top.

The chart plots a variety of gene functions along the horizontal axis. These functions are classified into three broad categories, indicated using axis annotation.

What are some small tweaks that readers will enjoy?

***

First, use colors. Here is an example in which the designer uses color to indicate the function classes:

Fcvm-09-810257-g006-3-colors

The primary design difference between these two column charts is the use of three colors to indicate the three function classes. This little change makes it much easier to recognize where one class ends and the next begins.
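The tweak amounts to one extra argument in most charting tools. A base R sketch, with made-up counts, classes, and labels:

```r
# One color per function class; counts, classes, and labels are made up.
counts  <- c(12, 9, 7, 14, 11, 6, 10, 8)
classes <- c(1, 1, 1, 2, 2, 2, 3, 3)
pal     <- c("#1b9e77", "#d95f02", "#7570b3")
barplot(counts, col = pal[classes], border = NA,
        names.arg = paste("term", 1:8), las = 2)
```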

Color doesn't have to be limited to column areas. The following example extends the colors to the axis labels:

Fcell-09-755670-g004-coloredlabels

Again, it's the smallest of changes, but it makes a big difference.

***

It bugs me a lot that the long axis labels are printed at a slant, forcing every serious reader to read with a tilted head.

Slanting it the other way doesn't help:

Fig7-swayright

Vertical labels are best read...

OR-43-05-1413-g06-vertical

These vertical labels are best read while doing side planks.

Side-Plank

***

I'm surprised that horizontal alignment is rather rare. Here's one example:

Fcell-09-651142-g004-horizontal
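In base R, the horizontal treatment is a one-argument change plus a wider margin. A sketch with the same sort of made-up data as above:

```r
# horiz = TRUE flips the bars; las = 1 keeps the long labels horizontal.
counts  <- c(12, 9, 7, 14, 11, 6)
classes <- c(1, 1, 2, 2, 3, 3)
pal     <- c("#1b9e77", "#d95f02", "#7570b3")
par(mar = c(4, 12, 1, 1))   # widen the left margin to fit the labels
barplot(rev(counts), horiz = TRUE, col = pal[rev(classes)], border = NA,
        names.arg = rev(paste("some long GO term", 1:6)), las = 1)
```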

Adjust, and adjust some more

This Financial Times report illustrates why we should adjust data.

The story explores trends in economic statistics during 14 years of Conservative government. One of those metrics is so-called council funding (councils being local governments). The graphic is interactive: as the reader scrolls the page, the chart transforms.

The first chart shows the "raw" data.

Ft_councilfunding1

The vertical axis shows the cumulative change in funding, expressed as an index relative to the 2010 level. From this line chart, one concludes that council funding decreased from 2010 to around 2016, then grew; by 2020, funding had recovered to the 2010 level, and it expanded rapidly in recent years.

When the reader scrolls down, this chart is replaced by another one:

Ft_councilfunding2

This chart paints a completely different picture. The line dropped from 2010 to 2016 as before. Then it went flat, and after 2021, it started rising, even though by 2024 the value was still 10 percent below the 2010 level.

What happened? The data journalist has taken the data from the first chart and adjusted the values for inflation. Inflation was rampant in recent years, so some of the raw growth has been dampened. In economics, adjusting for inflation is also called expressing values in "real terms". The adjustment is necessary because the same dollar (hmm, pound) is worth less when there is inflation. Therefore, even though on paper council funding in 2024 is more than 25 percent higher than in 2010, inflation has gobbled up all of that and more, to the point where, in real terms, council funding has fallen by about 10 percent.

This is one material adjustment!

Wait, they have a third chart:

Ft_councilfunding3

It's unfortunate they didn't stabilize the vertical scale. Relative to the middle chart, the lowest point in this third chart is about 5 percent lower, while the value in 2024 is about 10 percent lower.

This means they performed a second adjustment - for population change. It is a simple adjustment of dividing by the population. The numbers look worse probably because the population has grown during these years. Thus, even if the amount of funding had stayed the same, the money would have to be split amongst more people. The per-capita adjustment makes this point clear.
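Both adjustments are simple divisions. A sketch with made-up numbers, indexed so that 2010 = 100:

```r
# Deflate nominal funding by the price level, then divide by population.
funding <- c(100, 95, 105, 127)        # nominal funding index (made up)
cpi     <- c(1.00, 1.08, 1.25, 1.55)   # price level relative to 2010 (made up)
pop     <- c(1.00, 1.03, 1.06, 1.10)   # population relative to 2010 (made up)
real    <- funding / cpi               # "real terms"
real_pc <- real / pop                  # real, per capita
round(rbind(nominal = funding, real = real, real_per_capita = real_pc), 1)
```

With these made-up inputs, a 27 percent nominal rise becomes an 18 percent real decline, and a deeper per-capita decline - the same reversal the FT charts show.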

***

The final story is much different from the initial one. Not only is the magnitude of change different, but the direction of change is reversed.

When it comes to adjustments, remember that all adjustments are subjective. In fact, choosing not to adjust is also subjective. Not adjusting is usually much worse.


Chart without an axis

When it comes to global warming, most reports cite a single number such as an average temperature rise of Y degrees by year X. Most reports also claim the existence of a consensus among scientists. The Guardian presented the following chart, which shows the spread of opinions amongst the experts.

Guardian_globalwarming

Experts were asked how many degrees they expect average global temperature to increase by 2100. The estimates ranged from "below 1.5 degrees" to "5 degrees or more". The most popular answer was 2.5 degrees. Roughly three out of four respondents picked a number at 2.5 degrees or above. The distribution is close to symmetric around the middle.

***

What kind of chart is this?

It's a type of histogram, given that the horizontal axis shows binned ranges of temperature change while the vertical axis shows number of respondents (out of 380).

A (count) histogram typically encodes the count data in the vertical axis. Did you notice there isn't a vertical axis?

That's because the chart has an unusual axis. Each of the 380 respondents is shown as a cell. What looks like a "column" is actually two-dimensional: each row of cells has 10 slots. To find out how many respondents chose the 2.5 C category, you count the number of full rows and then the stray cells on top. (It's 132.)
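In other words, decoding a column is an exercise in integer division:

```r
# Reading a 10-wide unit histogram: full rows of 10 plus strays on top.
n <- 132
c(full_rows = n %/% 10, strays = n %% 10)   # 13 full rows + 2 strays
n / 380                                     # about 35% of the 380 respondents
```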

Only the top row of cells can be partially filled so the general shape of the distribution isn't affected much. However, the lack of axis labels makes it hard to learn the count of each column.

It's even harder to know the proportions of respondents, which should be the primary message of the chart. Proportions could have been shown by fixing the maximum number of rows at 38, since 38 rows of 10 cells hold exactly the 380 respondents; on the above chart, the maximum number of rows is 22. Using 38 rows leads to a chart with a lot of white space, as the tallest column (count of 132) accounts for roughly 35% of the total response.

In the end, I'm not sure this variant of the histogram beats the standard histogram.


Losing the plot while stacking up the bars

I came across this chart from an infographic that claims to show which zip codes in the U.S. are the "dirtiest" (link). I won't go into the data analysis in this post - it's the usual "open data" style analysis that takes whatever data they could find (in this case, 311 calls) and makes some hay out of it.

03_Dirtiest-Zip-Codes-in-New-York

It's amazing how such analyses frequently land on the Top N, Bottom N table. Top/Bottom N is euphemistically called "insights". But "insights" should answer at least one of the following questions: Where are these zip codes? Why does 11216 have the highest rate of complaints while 11040 has the lowest? What measures can be taken to make the city cleaner?

***

The basic form chosen for this graphic is the bar chart. The data concern the number of sanitation-related complaints per 100,000 people (they didn't disclose how they classified a complaint as being about sanitation).

To mitigate the "boredom" of bar charts, the designer made the edges of the bars squiggly, and added icons of items found in trash inside the bars. These are thankfully not too intrusive.

Why are all the data printed on the chart? Try mentally wiping the data labels, and you'll understand why the designer did it.

If readers look at the data labels rather than the bars, then the data visualization surely has failed. I'd prefer to use an axis.

If you spend a few more minutes on the chart, you may notice the gray parts. This is not a simple bar chart but a stacked bar chart. In effect, every bar is referenced to the first bar, which shows the maximum number of complaints per 100K people. For example, zip code 10474 has about 90% of the complaints experienced in zip code 11216, the "dirtiest" place in New York.
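Here is how I think the bars were constructed, sketched in base R with made-up rates: the gray part is simply the remainder up to the maximum.

```r
# Each bar = the zip's rate plus a gray filler up to the maximum rate,
# so all bars end at the same length. Rates are made up for illustration.
rates  <- c(32000, 29000, 25000, 21000)   # complaints per 100K
filler <- max(rates) - rates              # gray remainder up to the "dirtiest" zip
barplot(rbind(rates, filler), horiz = TRUE, col = c("orange", "grey85"),
        names.arg = paste("zip", 1:4), las = 1)
```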

***

The infographic then moves on to Los Angeles, and repeats the Top N/Bottom N presentation:

04_Dirtiest-Zip-Codes-in-Los-Angeles

With this, the plot is lost.

For an inexplicable reason, the dirtiest zip code in LA does not occupy the entire length of the bar. The worst zip code here fills 87% of the bar length, implying that the entire bar represents a value of 34,978 complaints per 100K people. How did the designer decide on this number?

As a result, every other value is referenced to 34,978 and not to the rate of complaints in the dirtiest zip code!

***

The infographic eventually covers Houston. Here are the dirtiest two zip codes in Houston:

Housefresh_houston_dirtiest2

How does one interpret the orange section of the second bar? The original intention is for us to see that this zip code is about 80% as dirty as the dirtiest zip code. However, the full length of the bar here does not represent the dirtiest zip code.

***

We also get a hint as to why this entire analysis is problematic. The values in LA are way bigger than those in NY, about 4 times higher at the top of the table. Is LA really that much dirtier than NY? Or perhaps the data have not been properly aligned between the cities?

 

P.S. [8-26-2023] Added link to the infographic.

 


The one thing you're afraid to ask about histograms

In the previous post about a variant of the histogram, I glossed over a few perplexing issues - deliberately. Today's post addresses one of these topics: what is going on in the vertical axis of a histogram?

The real question is: what data are encoded in the histogram, and where?

***

Let's return to the dataset from the last post. I grabbed data from a set of international football (i.e. soccer) matches. Each goal scored has a scoring minute. If the goal is scored in regulation time, the scoring minute is a number between 1 and 90. Specifically, the data collector rounds up: any goal scored within the first 60 seconds is recorded as minute 1, all the way up to a goal scored between the 89th and 90th minutes being recorded as minute 90. In this post, I only consider goals scored in regulation time, so the horizontal axis spans 1 to 90 minutes.

The kneejerk answer to the posed question is: counts in bins. Isn't it the case that in constructing a histogram, we divide the range of values (1-90) into bins, and then plot the counts within bins, i.e. the number of goals scored within each bin of minutes?

The following is what we have in mind:

Junkcharts_counthistogram_1

Let's call this the "count histogram".

Some readers may dislike the scale of the vertical axis, as its interpretation hinges on the total sample size. Hence, another kneejerk answer is: frequencies in bins. Instead of plotting counts directly, plot frequencies, which are standardized counts: divide each count by the sample size. Here's the "frequency histogram":

Junkcharts_freqhistogram_1

The count and frequency histograms are identical except for the scale, and appear intuitively clear. The count and frequency data are encoded in the heights of the columns. The column widths are an afterthought, adhering to a fixed constant. Unlike in a column chart, the gap width in a histogram is typically zero, as we want to partition the horizontal range into adjoining sections.
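Both are easy to produce in base R. The minutes below are simulated stand-ins, not the actual dataset; the frequency version uses the counts-replacement trick described in the postscript at the end of this post.

```r
# Count histogram, then a frequency histogram via the counts-replacement
# trick. The scoring minutes are simulated, not the actual dataset.
set.seed(42)
minutes <- sample(1:90, 23682, replace = TRUE)
h <- hist(minutes, breaks = seq(0, 90, by = 10))   # count histogram, 9 bins
h$counts <- h$counts / sum(h$counts)               # replace counts with frequencies
plot(h, ylab = "frequency", main = "frequency histogram")
```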

Now, if you look carefully at the histogram from the last post, reproduced below, you'd find that it plots neither counts nor frequencies:

Junkcharts_densityhistogram_1

The numbers on the axis are fractions, and suggest that they may be frequencies, but a quick check proves otherwise: with 9 columns, the average column should contain at least 10 percent of the data. The total of the displayed fractions is nowhere near 100%, which is our expectation if the values are relative frequencies. You may have come across this strangeness when creating histograms using R or some other software.

The purpose of this post is to explain what values are being plotted and why.

***

What are the kinds of questions we like to answer about the distribution of data?

At a high level, we want to know: where are my data?

Arguably these two questions are fundamental:

  • what is the probability that the data falls within a given range of values? e.g., what is the probability that a goal is scored in the first 15 minutes of a football match?
  • what is the relative probability of data between two ranges of values? e.g., are teams more likely to score in the last 5 minutes of the first half or the last 5 minutes of the second half of a football match?

In a histogram, the first question is answered by comparing a given column to the entire set of columns while the second question is answered by comparing one column to another column.

Let's see what we can learn from the count histogram.

Junkcharts_counthistograms_questions

In a count histogram, the heights encode the count data. To address the relative probability question, we note that the ratio of heights is the ratio of counts, and the ratio of counts is the same as the ratio of frequencies. Thus, we learn that teams are roughly 3000/1500 = 2 times as likely to score in the last 5 minutes of the second half as during the last 5 minutes of the first half. (See the green columns.)

[For those who follow football, it's clear that the data collector treated goals scored during injury time of either half as scored during the last minute of the half, so this dataset can't be used to analyze timing of goals unless the real minutes were recorded for injury-time goals.]

To address the range probability question, we compare the aggregate height of the three orange columns with the total height of all columns. Note that I said "height", not "area", because the heights directly encode counts. It's actually taxing to figure out the total height!

We resort to reading the total area of all columns. This should yield the correct answer: the area is directly proportional to the height because the column widths are fixed at a constant. Bear in mind, though, that if the column widths vary (the theme of the last post), then areas and heights are not interchangeable concepts.

Estimating the total area is still not easy, especially if the column heights exhibit high variance. What we need is the proportion of the total area that is orange. It's possible to eyeball, but not easy.

You may interject now to point out that the total area should equal the aggregate count (sample size). But that is a fallacy! It's very easy to make this error. The aggregate count is actually the total height, and because of that, the total area is the aggregate count multiplied by the column width! In my example, the total height is 23,682, which is the number of goals in the dataset, while the total area is 23,682 times 5 minutes.

[For those who think in equations, the total area is the sum over all columns of height(i) x width(i). When width is constant, we can take it outside the sum, and the sum of height(i) is just the total count.]

***

The count histogram is hard to use because it requires knowing the sample size. It's the first thing that is produced because the raw data are counts in bins. The frequency histogram is better at delivering answers.

In the frequency histogram, the heights encode frequency data. We can therefore just read off the relative probability of the orange column, bypassing the need to compute the total area.

This workaround actually promotes the fallacy described above for the count histogram. It is easy to fall into the trap of thinking that the total area of all columns is 100%. It isn't.

Similar to before, the total height should be the total frequency (i.e., 100%), but the total area is the total frequency multiplied by the column width; that is to say, the total area equals the bin width. In the football example, using 5-minute intervals, the total area of the frequency histogram is 5 minutes x 100%, in the case of equal bin widths.

How about the relative probability question? On the frequency histogram, the ratio of column heights is the ratio of frequencies, which is exactly what we want. So long as the column width is constant, comparing column heights is easy.

***

One theme in the above discussion is that in the count and frequency histograms, the count and frequency data are encoded in the column heights but not the column areas. This is a source of major confusion. Because of the convention of using equal column widths, one treats areas and heights as interchangeable... but not always. The total column area isn't the same as the total column height.

This observation has some unsettling implications.

As shown above, the total area is affected by the column width. The column width in an equal-width histogram is the range of the x-values divided by the number of bins. Thus, the total area is a function of the number of bins.

Consider the following frequency histograms of the same scoring minutes dataset. The only difference is the number of bins used.

Junkcharts_freqhistogram_differentbins

Increasing the number of bins has a series of effects:

  • the columns become narrower
  • the columns become shorter, because each narrower bin can contain at most the same count as the wider bin that contains it.
  • the total area of the columns becomes smaller.

This last one is unexpected and completely messes up our intuition. When we increase the number of bins, not only do the columns shorten, but the total area covered by all the columns also shrinks. Remember that the total area, whether it is a count or frequency histogram, includes a factor equal to the bin width. A higher number of bins means a smaller bin width, which means a smaller total area.

***

What if we force the total area to be constant regardless of how many bins we use? This setting seems more intuitive: in the 5-bin histogram, we partition the total area into five parts while in the 10-bin histogram, we divide it into 10 parts.

This is the principle used by R and other statistical software when they produce so-called density histograms. The count and frequency data are encoded in the column areas - by implication, the same data could not have been encoded simultaneously in the column heights!

The way to accomplish this is to divide by the bin width. If you look at the total area formulas above, for the count histogram, total area is total count x bin width. If the height is count divided by bin width, then the total area is the total count. Similarly, if the height in the frequency histogram is frequency divided by bin width, then the total area is 100%.

Count divided by the width of a section of the x-range is otherwise known as "density". It captures how tightly the data are packed inside a particular section of the range. Thus, in a count-density histogram, the heights encode densities while the areas encode counts; the total area is the total count. If we want to standardize the total area to be 1, then we should compute densities using frequencies rather than counts. Frequency densities are just count densities divided by the total count.

To summarize, in a frequency-density histogram, the heights encode densities, defined as frequency divided by the bin width. This is not very intuitive; just think of densities as how closely packed the data are in the specified bin. The column areas encode frequencies so that the total area is 100%.
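We can verify these definitions in base R, reusing the simulated minutes from the sketch above; the hist() function returns the density heights directly.

```r
# density = frequency / bin width; the areas, not the heights, sum to 1.
h <- hist(minutes, breaks = seq(0, 90, by = 5), plot = FALSE)
freqs <- h$counts / sum(h$counts)
all.equal(h$density, freqs / diff(h$breaks))   # TRUE
sum(h$density * diff(h$breaks))                # 1, i.e. total area = 100%
```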

The reason why density histograms are confusing is that we are reading off column heights while thinking that the total area should add up to 100%. Column heights and column areas cannot both add up to 100%. We have to pick one or the other.

Comparing relative column heights still works when the density histogram has equal bin widths. In this case, relative height and relative area are the same, because relative densities equal relative frequencies when the bin width is fixed.

The following charts recap the discussion above. They show how the frequency histogram does not preserve the total area when bin sizes are changed, while the density histogram does.

Junkcharts_freqdensityhistograms_differentbins

***

The density histogram is a major pain for solving range probability questions because the frequencies are encoded in the column areas, not the heights. Areas are not marked out in a graph.

The column height gives us densities which are not probabilities. In order to retrieve probabilities, we have to multiply the density by the bin width, that is to say, we must estimate the area of the column. That requires mapping two dimensions (width, height) onto one (area). It is in fact impossible without measurement - unless we make the bin widths constant.

When we make the bin widths constant, we still can't read densities off the vertical axis and treat them as probabilities. If I must use the density histogram to answer the question of how likely a team is to score in the first 15 minutes, I'd sum the heights of the first 3 columns, which is about 0.025, and then multiply by the bin width of 5 minutes, which gives 0.125, or 12.5%.
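Continuing the simulated sketch from above, the computation looks like this (with the uniform simulated minutes the answer is about 1/6 rather than the 12.5% of the real data):

```r
# P(goal in first 15 minutes) = area of the first three columns.
sum(h$density[1:3]) * 5   # heights times the 5-minute bin width
```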

At the end of this exploration, I like the frequency histogram best. The density histogram is useful when we are comparing different histograms, which isn't the most common use case.

***

The histogram is a basic chart in the tool kit. It's more complicated than it seems. I haven't come across any intro dataviz books that explain this clearly.

Most of this post deals with equal-width histograms. If we allow bin widths to vary, it gets even more complicated. Stay tuned.

***

For those using base R graphics, I hope this post helps you interpret what they say in the manual. The default behavior of the "hist" function depends on whether the bins are equal width:

  • if the bin width is constant, then R produces a count histogram. As shown above, in a count histogram, the column heights indicate counts in bins, but the total column area does not equal the total sample size; it equals the total sample size multiplied by the bin width. (Equal width is the default unless the user specifies bin breakpoints.)
  • if the bin width is not constant, then R produces a (frequency-)density histogram. The column heights are densities, defined as frequencies divided by the bin width, while the column areas are frequencies, with the total area summing to 100%.

Unfortunately, R does not generate a frequency histogram. To make one, you'd have to divide the counts in bins by the sum of counts. (In making some of the graphs above, I tricked it.) You also need to trick it to make a frequency-density histogram with equal-width bins, as it's coded to produce a count histogram when bin size is fixed.
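The first trick appeared in the sketch near the top of this post; the second is, presumably, the freq = FALSE override. Again on the simulated minutes from earlier:

```r
# Force a density histogram despite equal-width bins.
hist(minutes, breaks = seq(0, 90, by = 5), freq = FALSE)  # heights are densities
```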

 

P.S. [5-2-2023] As pointed out by a reader, I should clarify that R and I use the word "frequency" differently. Specifically, R uses frequency to mean counts; therefore, what I have been calling the "count histogram", R would call a "frequency histogram", and what I have been describing as a "frequency histogram", the "hist" function simply does not generate unless you trick it. I'm using "frequency" in the everyday sense of the word, as in "the frequency of the bus". In many statistical packages, frequency is used to mean "count", as in the frequency table, which is just a table of counts. The reader suggested "proportion", which I like, or something like "weight".


Deconstructing graphics as an analysis tool in dataviz

One of the useful exercises I like to do with charts is to "deconstruct" them. (This amounts to a deeper version of the self-sufficiency test.)

Here is a chart stripped down to just the main visual elements.

Junkcharts_cbcrevenues_deconstructed1

The game is to guess the structure of the data given these visual elements.

I guessed the following:

  • The data has a top-level split into two groups
  • Within each group, the data is further split into 3 parts, corresponding to the 3 columns
  • Within each part, there are a variable number of subparts, each of which is given a unique color
  • The color legend suggests that each group's data are split into 7 subparts, so I'm guessing that the 7 subparts are aggregated into 3 parts
  • The core chart form is a stacked column chart with absolute values, so the relative proportions within each column (part) are important
  • Comparing across columns is not supported because each column has its own total value
  • Comparing same-color blocks across the two groups is meaningful. It's easier to compare their absolute values but harder to compare the relative values (proportions of total)

If I knew that the two groups are time periods, I'd also guess that the group on the left is the earlier time period, and the one on the right is the later time period. In addition to the usual left-to-right convention for time series, the columns are getting taller going left to right. Many things (not all, obviously) grow over time.

The color choice is a bit confusing because if the subparts are what I think they are, then it makes more sense to use one color and different shades within each column.

***

The above guesses are a mixed bag. What one learns from the exercise is what cues readers are receiving from the visual structure.

Here is the same chart with key contextual information added back:

Junkcharts_cbcrevenues_deconstructed2

Now I see that the chart concerns revenues of a business over two years.

My guess on the direction of time was wrong. The more recent year is placed on the left, counter to convention. This entity therefore suffered a loss of revenues from 2017-8 to 2018-9.

The entity receives substantial government funding. In 2017-8, it had 1 dollar of government funds for every 2 dollars of revenues. In 2018-9, it was roughly 2 dollars of government funds for every 3 dollars of revenues. Thus, the ratio of government funding to revenues has increased.

On closer inspection, the 7 colors do not represent 7 components of this entity's funding. The categories listed in the color legend overlap.

It's rather confusing, but I missed one very important feature of the chart in my first assessment: the three columns within each year group are nested. The second column breaks down revenues into 3 parts, while the third column subdivides advertising revenues into two parts.

What we've found is that this design does not offer any visual cues to help readers understand how the three columns within a year-group relate to each other. Adding guiding lines or changing the color scheme would help.

***

Next, I add back the data labels:

Cbc_revenues_original

The system of labeling can be described as: label everything that is not further broken down into parts on the chart.

Because of the nested structure, this means two of the column segments, which are the sums of subparts, are not labeled. This creates a very strange appearance: usually, the largest parts are split into subparts, so such a labeling system means the largest parts/subparts are not labeled while the smaller, less influential, subparts are labeled!

You may notice another oddity. The pink segment is well above $1 billion but it is roughly the size of the third column, which represents $250 million. Thus, these columns are not drawn to scale. What happened? Keep reading.

***

Here is the whole chart:

Cbc_revenues_original

A twitter follower sent me this chart. Elon Musk has been feuding with the Canadian broadcaster CBC.

Notice the scale of the vertical axis. It has a discontinuity between $700 million and $1.7 billion. In other words, the two pink sections are artificially shortened. The erased section contains $1 billion (!), which is larger than the visible section.

The focus of Musk's feud with CBC is what proportion of the company's funds come from the government. On this chart, the only way to figure that out is to copy out the data and divide. It's roughly 1.2/1.7, or about 70%.

***

The exercise of deconstructing graphics helps us understand what parts are doing what, and it also reveals what cues certain parts send to readers.

In better dataviz, every part of the chart is doing something useful, it's free of redundant parts that take up processing time for no reason, and the cues to readers move them towards the intended message, not away from it.

***

A couple of additional comments:

I'm not sure why old data were cited: in the most recent accounting report, the proportion of government funding was around 65%.

Source of funding is not a useful measure of pro- or anti-government bias, especially in a democracy where different parties lead the government at different times. There are plenty of mouthpiece media that do not apparently receive government funding.


Finding the story in complex datasets

In CT Mirror's feature about Connecticut, which I wrote about in the previous post, there is one graphic that did not rise to the same level as the others.

Ctmirror_highschools

This section deals with graduation rates of the state's high school districts. The above chart focuses on exactly five districts. The line charts are organized in a stack. No year labels are provided. The time window is the 11 years from 2010 to 2021. The column of numbers shows the difference in graduation rates over the entire time window.

The five lines look basically the same, if we ignore what looks to be noisy year-to-year fluctuations. This is due to the weird aspect ratio imposed by stacking.

Why are those five districts chosen? Upon investigation, we learn that these are the five districts with the biggest improvement in graduation rates during the 11-year time window.

The same five schools also had some of the lowest graduation rates at the start of the analysis window (2010). This must be so: if a school graduated 90% of its class in 2010, it would be mathematically impossible for it to attain a 35-percentage-point improvement! This is an unsatisfactory feature of the dataviz.

***

In preparing an alternative version, I start by imagining how readers might want to utilize a visualization of this dataset. I assume that the readers may have certain school(s) they are particularly invested in, and want to see its/their graduation performance over these 11 years.

How does having the entire dataset help? For one thing, it provides context. What kind of context is relevant? As discussed above, it's futile to compare a school at the top of the ranking to one that is near the bottom. So I created groups of schools. Each school is compared to other schools that had comparable graduation rates at the start of the analysis period.

Amistad School District, which takes pole position in the original dataviz, graduated only 58% of its pupils in 2010 but vastly improved its graduation rate by 35 percentage points over the decade. In the chart below (left panel), I plotted all of the schools that had graduation rates between 50 and 74% in 2010. The chart shows that while Amistad is a standout, almost all schools in this group experienced steady improvements. (Whether this phenomenon represents true improvement, or just grade inflation, we can't tell from this dataset alone.)

Redo_junkcharts_ctmirrorhighschoolsgraduation_1

The right panel shows the group of schools with the next higher level of graduation rates in 2010. This group, too, almost always increased their graduation rates. The rate of improvement in this group is lower than in the previous group of schools.

The next set of charts shows school districts that had already achieved excellent graduation rates (over 85%) by 2010. The most interesting group consists of those with 85-89% rates in 2010. Their performance in 2021 is the most unpredictable of all the school groups. The majority of districts did even better, while others regressed.

Redo_junkcharts_ctmirrorhighschoolsgraduation_2

Overall, there is less variability than I'd expect in the top two school groups. They generally appear to have been able to raise or maintain their already-high graduation rates. (Note that the scale of each chart is different, and many of the lines in the second set of charts move within a few percentage points.)

One more note about the charts: the trend lines are "smoothed" to focus on the trends rather than the year-to-year variability. Because of smoothing, there is some awkward-looking imprecision, e.g. the end-to-end differences read from the curves differ from the observed differences in the data. These discrepancies could easily be fixed if these charts were to be published.
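For the record, CT Mirror doesn't say which smoother was used. Here is one plausible way to produce such trend lines in base R, on a made-up district's series:

```r
# A smoothed trend line via lowess; the district's rates are made up.
year <- 2010:2021
grad <- c(58, 60, 59, 63, 66, 65, 70, 73, 72, 78, 82, 93)
plot(year, grad, ylab = "graduation rate (%)")
lines(lowess(year, grad, f = 2/3))   # the smoothing span is a judgment call
```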