One doesn't have to plot raw data

Visual Capitalist chose a treemap to show us where gold is produced (link):

Viscap_gold2023

The treemap is embedded into a brick of gold. Any treemap is difficult to read, mostly because some block are vertical, others horizontal. A rough understanding is nevertheless possible: the entire global production can be roughly divided into four parts: China plus three other Asian producers account for roughly (not quite) a quarter; "rest of the world" (i.e. all countries not individually listed) is a quarter; Russia and Australia together is again a bit less than a quarter.

***

When I look at datasets that rank countries by some metric, I'm hoping to present insights, rather than the raw data. Insights typically involve comparing countries, or sets of countries, or one country against a set of countries. So, I made the following chart that includes some of these insights I found in the gold production dataset:

Junkcharts_redo_viscap_gold2023

For example, the top 4 producers in Asia account for almost a quarter of the world's output; Canada, U.S. and Australia together also roughly produce a quarter; the rest of the world has a similar output. In Asia, China's output is about the sum of the next 3 producers, which is about the same as U.S. and Canada, which is about the same as the top 5 in Africa.

 


An elaborate data vessel

Visualcapitalist_globaloilproductionI recently came across the following dataviz showing global oil production (link).

This is an ambitious graphic that addresses several questions of composition.

The raw data show the amount of production by country adding up to the global total. The countries are then grouped by region. Further, the graph presents an oil-and-gas specific grouping, as indicated by the legend shown just below the chart title. This grouping is indicated by the color of the circumference of the circle containing the flag of the country.

This chart form is popular in modern online graphics programs. It is like an elaborate data vessel. Because the countries are lined up around the barrel, a space has been created on three sides to admit labels and text annotations. This is a strength of this chart form.

***

The chart conveys little information about the underlying data. Each country is given a unique odd shaped polygon, making it impossible to compare sizes. It’s definitely possible to pick out U.S., Russia, Saudi Arabia as the top producers. But in presenting the ranks of the data, this chart form pales in comparison to a straightforward data table, or a bar chart. The less said about presenting values, the better.

Indeed, our self-sufficiency test exposes the inability of these polygons to convey the data. This is precisely why almost all values of the dataset are present on the chart.

***

The dataviz subtly presumes some knowledge on the part of the readers.

The regions are not directly labeled. The readers must know that Saudi Arabia is in the Middle East, U.S. is part of North America, etc. Admittedly this is not a big ask, but it is an ask.

It is also assumed that readers know their flags, especially those of smaller countries. Some of the small polygons have no space left for country names and they are labeled with just flags.

Visualcapitalist_globaloilproduction_nocountrylabels

In addition, knowing country acronyms is required for smaller countries as well. For example, in Africa, we find AGO, COG and GAB.

Visualcapitalist_globaloilproduction_countryacronyms

For this chart form the designer treats each country according to the space it has on the chart (except those countries that found themselves on the edges of the barrel). Font sizes, icons, labels, acronyms, data labels, etc. vary.

The readers are assumed to know the significance of OPEC and OPEC+. This grouping is given second fiddle, and can be found via the color of the circumference of the flag icons.

Visualcapitalist_globaloilproduction_opeclegend

I’d have not assigned a color to the non-OPEC countries, and just use the yellow and blue for OPEC and OPEC+. This is a little edit but makes the search for the edges more efficient.

Visualcapitalist_globaloilproduction_twoopeclabels

***

Let’s now return to the perception of composition.

In exactly the same manner as individual countries, the larger regions are represented by polygons that have arbitrary shapes. One can strain to compile the rank order of regions but it’s impossible to compare the relative values of production across regions. Perhaps this explains the presence of another chart at the bottom that addresses this regional comparison.

The situation is worse for the OPEC/OPEC+ grouping. Now, the readers must find all flag icons with edges of a specific color, then mentally piece together these arbitrarily shaped polygons, then realizing that they won’t fit together nicely, and so must now mentally morph the shapes in an area-preserving manner, in order to complete this puzzle.

This is why I said earlier this is an elaborate data vessel. It’s nice to look at but it doesn’t convey information about composition as readers might expect it to.

Visualcapitalist_globaloilproduction_excerpt


Partition of Europe

A long-time reader sent me the following map via twitter:

Europeelects_map

This map tells how the major political groups divide up the European Parliament. I’ll spare you the counting. There are 27 countries, and nine political groups (including the "unaffiliated").

The key chart type is a box of dots. Each country gets its own box. Each box has its own width. What determines the width? If you ask me, it’s the relative span of the countries on the map. For example, the narrow countries like Ireland and Portugal have three dots across while the wider countries like Spain, Germany and Italy have 7, 10 and 8 dots across respectively.

Each dot represents one seat in the Parliament. Each dot has one of 9 possible colors. Each color shows a political lean e.g. the green dots represent Green parties while the maroon dots display “Left” parties.

The end result is a counting game. If we are interested in counts of seats, we have to literally count each dot. If we are interested in proportion of seats, take your poison: either eyeball it or count each color and count the total.

Who does the underlying map serve? Only readers who know the map of Europe. If you don’t know where Hungary or Latvia is, good luck. The physical constraints of the map work against the small-multiples set up of the data. In a small multiples, you want each chart to be identical, except for the country-specific data. The small-multiples structure requires a panel of equal-sized cells. The map does not offer this feature, as many small countries are cramped into Eastern Europe. Also, Europe has a few tiny states e.g. Luxembourg (population 660K)  and Malta (population 520K). To overcome the map, the designer produces boxes of different sizes, substantially loading up the cognitive burden on readers.

The map also dictates where the boxes are situated. The centroids of each country form the scaffolding, with adjustments required when the charts overlap. This restriction ensures a disorderly appearance. By contrast, the regular panel layout of a small multiples facilitates comparisons.

***

Here is something I sketched using a tile map.

Eu parties print sm

First, I have to create a tile map of European countries. Some parts, e.g. western part, are straightforward. The eastern side becomes very congested.

The tile map encodes location in an imprecise sense. Think about the scaffolding of centroids of countries referred to prior. The tile map imposes an order to the madness - we're shifting these centroids so that they line up in a tidier pattern. What we gain in comparability we concede in location precision.

For the EU tile map, I decided to show the Baltic countries in a row rather than a column; the latter would have been more faithful to the true geography. Malta is shown next to Italy even though it could have been placed below. Similarly, Cyprus in relation to Greece. I also included several key countries that are not part of the EU for context.

Instead of raw seat counts, I'm showing the proportion of seats within each country claimed by each political group. I think this metric is more useful to readers.

The legend is itself a chart that shows the aggregate statistics for all 27 countries.


When words speak louder than pictures

I've been staring at this chart from the Wall Street Journal (link) about U.S. workers working remotely:

Wsj_remotework_byyear

It's one of those offerings I think on which the designer spent a lot of effort, but ultimately didn't realize that the reader would spend equal if not more effort deciphering.

However, the following paragraph lifted straight from the article says exactly what needs to be said:

Workers overall spent an average of 5 hours and 25 minutes a day working from home in 2022. That is about two hours more than in 2019, the year before Covid-19 sent millions of workers scrambling to set up home oces, and down just 12 minutes from 2021, according to the Labor Department’s American Time Use Survey.

***

Why is the chart so hard to read?

_trifectacheckup_imageIt's mostly because the visual is fighting the message. In the Trifecta Checkup (link), this is represented by a disconnect between the Q(uestion) and the V(isual) corners - note the green arrow between these two corners.

The message concentrates on two comparisons: first, the increase in amount of remote work after the pandemic; and second, the mild decrease in 2022 relative to 2021.

On the chart, the elements that grab my attention are (a) the green and orange columns (b) the shading in the bottom part of those green and orange columns (c) the thick black line that runs across the chart (d) the indication on the left side that tells me one unit is an hour.

None of those visual elements directly addresses the comparisons. The first comparison - before and after the pandemic - is found by how much the green column spikes above the thick black line. Our comprehension is retarded by the decision to forego the typical axis labels in favor of chopping columns into one-hour blocks.

The second comparison - between 2022 and 2021 - is found in the white space above the top of the orange column.

So, in reality, the text labels that say exactly what needs to be said are carrying a lot of weight. A slight edit to the pointers helps connect those descriptions to the visual depiction, like this:

Redo_junkcharts_wsj_remotework

I've essentially flipped the tactics used in the various pointers. For the average level of remote work pre-pandemic, I dispense of any pointers while I'm using double-headed arrows to indicate differences across time.

Nevertheless, this modified chart is still too complex.

***

Here is a version that aligns the visual to the message:

Redo_junkcharts_wsj_remotework_2

It's a bit awkward because the 2 hour 48 minutes calculation is the 2021 number minus the average of 2015-19, skipping the 2020 year.

 


The one thing you're afraid to ask about histograms

In the previous post about a variant of the histogram, I glossed over a few perplexing issues - deliberately. Today's post addresses one of these topics: what is going on in the vertical axis of a histogram?

The real question is: what data are encoded in the histogram, and where?

***

Let's return to the dataset from the last post. I grabbed data from a set of international football (i.e. soccer) matches. Each goal scored has a scoring minute. If the goal is scored in regulation time, the scoring minute is a number between 1 and 90 minutes. Specifically, the data collector applies a rounding up: any goal scored between 0 and 60 seconds is recorded as 1, all the way up to a goal scored between 89 and 90th minute being recorded as 90. In this post, I only consider goals scored in regulation time so the horizontal axis is between 1-90 minutes.

The kneejerk answer to the posed question is: counts in bins. Isn't it the case that in constructing a histogram, we divide the range of values (1-90) into bins, and then plot the counts within bins, i.e. the number of goals scored within each bin of minutes?

The following is what we have in mind:

Junkcharts_counthistogram_1

Let's call this the "count histogram".

Some readers may dislike the scale of the vertical axis, as its interpretation hinges on the total sample size. Hence, another kneejerk answer is: frequencies in bins. Instead of plotting counts directly, plot frequencies, which are just standardized counts. Just divide each value by the sample size. Here's the "frequency histogram":

Junkcharts_freqhistogram_1

The count and frequency histograms are identical except for the scale, and appear intuitively clear. The count and frequency data are encoded in the heights of the columns. The column widths are an afterthought, and they adhere to a fixed constant. Unlike a column chart, typically the gap width in a histogram is zero, as we want to partition the horizontal range into adjoining sections.

Now, if you look carefully at the histogram from the last post, reproduced below, you'd find that it plots neither counts nor frequencies:

Junkcharts_densityhistogram_1

The numbers on the axis are fractions, and suggest that they may be frequencies, but a quick check proves otherwise: with 9 columns, the average column should contain at least 10 percent of the data. The total of the displayed fractions is nowhere near 100%, which is our expectation if the values are relative frequencies. You may have come across this strangeness when creating histograms using R or some other software.

The purpose of this post is to explain what values are being plotted and why.

***

What are the kinds of questions we like to answer about the distribution of data?

At a high level, we want to know "where are my data"?

Arguably these two questions are fundamental:

  • what is the probability that the data falls within a given range of values? e.g., what is the probability that a goal is scored in the first 15 minutes of a football match?
  • what is the relative probability of data between two ranges of values? e.g. are teams more likely to score in last 5 minutes of the first half or the last five minutes of the second half of a football match?

In a histogram, the first question is answered by comparing a given column to the entire set of columns while the second question is answered by comparing one column to another column.

Let's see what we can learn from the count histogram.

Junkcharts_counthistograms_questions

In a count histogram, the heights encode the count data. To address the relative probability question, we note that the ratio of heights is the ratio of counts, and the ratio of counts is the same as the ratio of frequencies. Thus, we learn that teams are roughly 3000/1500 = 1.5 times more likely to score in the last 5 minutes of the second half than during the last 5 minutes of the first half. (See the green columns).

[For those who follow football, it's clear that the data collector treated goals scored during injury time of either half as scored during the last minute of the half, so this dataset can't be used to analyze timing of goals unless the real minutes were recorded for injury-time goals.]

To address the range probability question, we compare the aggregate height of the three orange columns with the total heights of all columns. Note that I said "height", not "area," because the heights directly encode counts. It's actually taxing to figure out the total height!

We resort to reading the total area of all columns. This should yield the correct answer: the area is directly proportional to the height because the column widths are fixed as a constant. Bear in mind, though, if the column widths vary (the theme of the last post), then areas and heights are not interchangable concepts.

Estimating the total area is still not easy, especially if the column heights exhibit high variance. What we need is the proportion of the total area that is orange. It's possible to see, not easy.

You may interject now to point out that the total area should equal the aggregate count (sample size). But that is a fallacy! It's very easy to make this error. The aggregate count is actually the total height, and because of that, the total area is the aggregate count multiplied by the column width! In my example, the total height is 23,682, which is the number of goals in the dataset, while the total area is 23,682 times 5 minutes.

[For those who think in equations, the total area is the sum over all columns of height(i) x width(i). When width is constant, we can take it outside the sum, and the sum of height(i) is just the total count.]

***

The count histogram is hard to use because it requires knowing the sample size. It's the first thing that is produced because the raw data are counts in bins. The frequency histogram is better at delivering answers.

In the frequency histogram, the heights encode frequency data. We can therefore just read off the relative probability of the orange column, bypassing the need to compute the total area.

This workaround actually promotes the fallacy described above for the count histogram. It is easy to fall into the trap of thinking that the total area of all columns is 100%. It isn't.

Similar to before, the total height should be the total frequency but the total area is the total frequency multipled by the column width, that is to say, the total area is the reciprocal of the bin width. In the football example, using 5-minute intervals, the total area of the frequency histogram is 1/(5 minutes) in the case of equal bin widths.

How about the relative probability question? On the frequency histogram, the ratio of column heights is the ratio of frequencies, which is exactly what we want. So long as the column width is constant, comparing column heights is easy.

***

One theme in the above discussion is that in the count and frequency histograms, the count and frequency data are encoded in the column heights but not the column areas. This is a source of major confusion. Because of the convention of using equal column widths, one treats areas and heights as interchangable... but not always. The total column area isn't the same as the total column height.

This observation has some unsettling implications.

As shown above, the total area is affected by the column width. The column width in an equal-width histogram is the range of the x-values divided by the number of bins. Thus, the total area is a function of the number of bins.

Consider the following frequency histograms of the same scoring minutes dataset. The only difference is the number of bins used.

Junkcharts_freqhistogram_differentbins

Increasing the number of bins has a series of effects:

  • the columns become narrower
  • the columns become shorter, because each narrower bin can contain at most the same count as the wider bin that contains it.
  • the total area of the columns become smaller.

This last one is unexpected and completely messes up our intuition. When we increase the number of bins, not only are the columns shortening but the total area covered by all the columns is also shrinking. Remember that the total area whether it is a count or frequency histogram has a factor equal to the bin width. Higher number of bins means smaller bin width, which means smaller total area.

***

What if we force the total area to be constant regardless of how many bins we use? This setting seems more intuitive: in the 5-bin histogram, we partition the total area into five parts while in the 10-bin histogram, we divide it into 10 parts.

This is the principle used by R and the other statistical software when they produce so-called density histograms. The count and frequency data are encoded in the column areas - by implication, the same data could not have been encoded simultaneously in the column heights!

The way to accomplish this is to divide by the bin width. If you look at the total area formulas above, for the count histogram, total area is total count x bin width. If the height is count divided by bin width, then the total area is the total count. Similarly, if the height in the frequency histogram is frequency divided by bin width, then the total area is 100%.

Count divided by some section of the x-range is otherwise known as "density". It captures the concept of how tightly the data are packed inside a particular section of the dataset. Thus, in a count-density histogram, the heights encode densities while the areas encode counts. In this case, total area is the total count. If we want to standardize total area to be 1, then we should compute densities using frequencies rather than counts. Frequency densities are just count densities divided by the total count.

To summarize, in a frequency-density histogram, the heights encode densities, defined as frequency divided by the bin width. This is not very intuitive; just think of densities as how closely packed the data are in the specified bin. The column areas encode frequencies so that the total area is 100%.

The reason why density histograms are confusing is that we are reading off column heights while thinking that the total area should add up to 100%. Column heights and column areas cannot both add up to 100%. We have to pick one or the other.

Comparing relative column heights still works when the density histogram has equal bin widths. In this case, the relative height and relative area are the same because relative density equals relative frequencies if the bin width is fixed.

The following charts recap the discussion above. It shows how the frequency histogram does not preserve the total area when bin sizes are changed while the density histogram does.

Junkcharts_freqdensityhistograms_differentbins

***

The density histogram is a major pain for solving range probability questions because the frequencies are encoded in the column areas, not the heights. Areas are not marked out in a graph.

The column height gives us densities which are not probabilities. In order to retrieve probabilities, we have to multiply the density by the bin width, that is to say, we must estimate the area of the column. That requires mapping two dimensions (width, height) onto one (area). It is in fact impossible without measurement - unless we make the bin widths constant.

When we make the bin widths constant, we still can't read densities off the vertical axis, and treat them as probabilities. If I must use the density histogram to answer the question of how likely a team scores in the first 15 minutes, I'd sum the heights of the first 3 columns, which is about 0.025, and then multiply it by the bin width of 5 minutes, which gives 0.125 or 12.5%.

At the end of this exploration, I like the frequency histogram best. The density histogram is useful when we are comparing different histograms, which isn't the most common use case.

***

The histogram is a basic chart in the tool kit. It's more complicated than it seems. I haven't come across any intro dataviz books that explain this clearly.

Most of this post deals with equal-width histograms. If we allow bin widths to vary, it gets even more complicated. Stay tuned.

***

For those using base R graphics, I hope this post helps you interpret what they say in the manual. The default behavior of the "hist" function depends on whether the bins are equal width:

  • if the bin width is constant, then R produces a count histogram. As shown above, in a count histogram, the column heights indicate counts in bins but the total column area does not equal the total sample size, but the total sample size multiplied by the bin width. (Equal width is the default unless the user specifies bin breakpoints.)
  • if the bin width is not constant, then R produces a (frequency-)density histogram. The column heights are densities, defined as frequencies divided by bin width while the column areas are frequencies, with the total area summing to 100%.

Unfortunately, R does not generate a frequency histogram. To make one, you'd have to divide the counts in bins by the sum of counts. (In making some of the graphs above, I tricked it.) You also need to trick it to make a frequency-density histogram with equal-width bins, as it's coded to produce a count histogram when bin size is fixed.

 

P.S. [5-2-2023] As pointed out by a reader, I should clarify that R and I use the word "frequency" differently. Specifically, R uses frequency to mean counts, therefore, what I have been calling the "count histogram", R would have called it a "frequency histogram", and what I have been describing as a "frequency histogram", the "hist" function simply does not generate it unless you trick it to do so. I'm using "frequency" in the everyday sense of the word, such as "the frequency of the bus". In many statistical packages, frequency is used to mean "count", as in the frequency table which is just a table of counts. The reader suggested proportion which I like, or something like weight.

 

 

 

 

 


Showing both absolute and relative values on the same chart 2

In the previous post, I looked at Visual Capitalist's visualization of the amount of uninsured deposits at U.S. banks. Using a stacked bar chart, I placed both absolute and relative values on the same chart.

In making that chart, I made these three tradeoffs.

First, I elevated absolute values (dollar amounts) over relative values (proportions). The original designer decided the opposite.

Second, I elevated the TBTF banks over the smaller banks. The original designer also decided the opposite.

Third, I elevated the total value over the disaggregated values (insured, uninsured). The original designer only visualized the uninsured values in the bars.

Which chart is better depends on what story one wants to tell.

***
For today's post, I'm showing another sketch of the same data, with the same goal of putting both absolute and relative values on the same chart.

Redo_visualcapitalist_uninsureddeposits_2b

The starting point of this sketch is the original chart - the stacked bar chart showing relative proportions. I added the insured portion so that it is on almost equal footing as the uninsured portion of the deposits. This edit is crucial to convey the impression of proportions.

My story hasn't changed; I still want to elevate the TBTF banks.

For this version, I try a different way of elevating TBTF banks. The key step is to encode data into the heights of the bars. I use these bar heights to convey the relative importance of banks, as reflected by total deposits.

The areas of the red blocks represent the uninsured amounts. That said, it's not easy to compare rectangular areas when both dimensions are different.

Comparing the total red area with the total yellow area, we learn that the majority of deposits in these banks are uninsured(!)

 


Showing both absolute and relative values on the same chart 1

Visual Capitalist has a helpful overview on the "uninsured" deposits problem that has become the talking point of the recent banking crisis. Here is a snippet of the chart that you can see in full at this link:

Visualcapitalist_uninsureddeposits_top

This is in infographics style. It's a bar chart that shows the top X banks. Even though the headline says "by uninsured deposits", the sort order is really based on the proportion of deposits that are uninsured, i.e. residing in accounts that exceed $250K.  They used a red color to highlight the two failed banks, both of which have at least 90% of deposits uninsured.

The right column provides further context: the total amounts of deposits, presented both as a list of numbers as well as a column of bubbles. As readers know, bubbles are not self-sufficient, and if the list of numbers were removed, the bubbles lost most of their power of communication. Big, small, but how much smaller?

There are little nuggets of text in various corners that provide other information.

Overall, this is a pretty good one as far as infographics go.

***

I'd prefer to elevate information about the Too Big to Fail banks (which are hiding in plain sight). Addressing this surfaces the usual battle between relative and absolute values. While the smaller banks have some of the highest concentrations of uninsured deposits, each TBTF bank has multiples of the absolute dollars of uninsured deposits as the smaller banks.

Here is a revised version:

Redo_visualcapitalist_uninsuredassets_1

The banks are still ordered in the same way by the proportions of uninsured value. The data being plotted are not the proportions but the actual deposit amounts. Thus, the three TBTF banks (Citibank, Chase and Bank of America) stand out of the crowd. Aside from Citibank, the other two have relatively moderate proportions of uninsured assets but the sizes of the red bars for any of these three dominate those of the smaller banks.

Notice that I added the gray segments, which portray the amount of deposits that are FDIC protected. I did this not just to show the relative sizes of the banks. Having the other part of the deposits allow readers to answer additional questions, such as which banks have the most insured deposits? They also visually present the relative proportions.

***

The most amazing part of this dataset is the amount of uninsured money. I'm trying to think who these account holders are. It would seem like a very small collection of people and/or businesses would be holding these accounts. If they are mostly businesses, is FDIC insurance designed to protect business deposits? If they are mostly personal accounts, then surely only very wealthy individuals hold most of these accounts.

In the above chart, I'm assuming that deposits and assets are referring to the same thing. This may not be the correct interpretation. Deposits may be only a portion of the assets. It would be strange though that the analysts only have the proportions but not the actual deposit amounts at these banks. Nevertheless, until proven otherwise, you should see my revision as a sketch - what you can do if you have both the total deposits and the proportions uninsured.


Bivariate choropleths

A reader submitted a link to Joshua Stephen's post about bivariate choropleths, which is the technical term for the map that FiveThirtyEight printed on abortion bans, discussed here. Joshua advocates greater usage of maps with two-dimensional color scales.

As a reminder, the fundamental building block is expressed in this bivariate color legend:

Fivethirtyeight_abortionmap_colorlegend

Counties are classified into one of these nine groups, based on low/middle/high ratings on two dimensions, distance and congestion.

The nine groups are given nine colors, built from superimposing shades of green and pink. All nine colors are printed on the same map.

Joshuastephens_singlemap

Without a doubt, using these nine related colors are better than nine arbitrary colors. But is this a good data visualization?

Specifically, is the above map better than the pair of maps below?

Joshuastephens_twomaps

The split map is produced by Josh to explain that the bivariate choropleth is just the superposition of two univariate choropleths. I much prefer the split map to the superimposed one.

***

Think about what the reader goes through when comparing two counties.

Junkcharts_bivariatechoropleths

Superimposing the two univariate maps solves one problem: it removes the need to scan back and forth between two maps, looking for the same locations, something that is imprecise. (Unless, the map is interactive, and highlighting one county highlights the same county in the other map.)

For me, that's a small price to pay for quicker translation of color into information.

 

 


Finding the story in complex datasets

In CT Mirror's feature about Connecticut, which I wrote about in the previous post, there is one graphic that did not rise to the same level as the others.

Ctmirror_highschools

This section deals with graduation rates of the state's high school districts. The above chart focuses on exactly five districts. The line charts are organized in a stack. No year labels are provided. The time window is 11 years from 2010 to 2021. The column of numbers show the difference in graduation rates over the entire time window.

The five lines look basically the same, if we ignore what looks to be noisy year-to-year fluctuations. This is due to the weird aspect ratio imposed by stacking.

Why are those five districts chosen? Upon investigation, we learn that these are the five districts with the biggest improvement in graduation rates during the 11-year time window.

The same five schools also had some of the lowest graduation rates at the start of the analysis window (2010). This must be so because if a school graduated 90% of its class in 2010, it would be mathematically impossible for it to attain a 35% percent point improvement! This is a dissatisfactory feature of the dataviz.

***

In preparing an alternative version, I start by imagining how readers might want to utilize a visualization of this dataset. I assume that the readers may have certain school(s) they are particularly invested in, and want to see its/their graduation performance over these 11 years.

How does having the entire dataset help? For one thing, it provides context. What kind of context is relevant? As discussed above, it's futile to compare a school at the top of the ranking to one that is near the bottom. So I created groups of schools. Each school is compared to other schools that had comparable graduation rates at the start of the analysis period.

Amistad School District, which takes pole position in the original dataviz, graduated only 58% of its pupils in 2010 but vastly improved its graduation rate by 35% over the decade. In the chart below (left panel), I plotted all of the schools that had graduation rates between 50 and 74% in 2010. The chart shows that while Amistad is a standout, almost all schools in this group experienced steady improvements. (Whether this phenomenon represents true improvement, or just grade inflation, we can't tell from this dataset alone.)

Redo_junkcharts_ctmirrorhighschoolsgraduation_1

The right panel shows the group of schools with the next higher level of graduation rates in 2010. This group of schools too increased their graduation rates almost always. The rate of improvement in this group is lower than in the previous group of schools.

The next set of charts show school districts that already achieved excellent graduation rates (over 85%) by 2010. The most interesting group of schools consists of those with 85-89% rates in 2010. Their performance in 2021 is the most unpredictable of all the school groups. The majority of districts did even better while others regressed.

Redo_junkcharts_ctmirrorhighschoolsgraduation_2

Overall, there is less variability than I'd expect in the top two school groups. They generally appeared to have been able to raise or maintain their already-high graduation rates. (Note that the scale of each chart is different, and many of the lines in the second set of charts are moving within a few percentages.)

One more note about the charts: The trend lines are "smoothed" to focus on the trends rather than the year to year variability. Because of smoothing, there is some awkward-looking imprecision e.g. the end-to-end differences read from the curves versus the observed differences in the data. These discrepancies can easily be fixed if these charts were to be published.


Thoughts on Daniel's fix for dual-axes charts

I've taken a little time to ponder Daniel Z's proposed "fix" for dual-axes charts (link). The example he used is this:

Danielzvinca_dualaxes_linecolumn

In that long post, Daniel explained why he preferred to mix a line with columns, rather than using the more common dual lines construction: to prevent readers from falsely attributing meaning to crisscrossing lines. There are many issues with dual-axes charts, which I won't repeat in this post; one of their most dissatisfying features is the lack of connection between the two vertical scales, and thus, it's pretty easy to manufacture an image of correlation when it doesn't exist. As shown in this old post, one can expand or restrict one of the vertical axes and shift the line up and down to "match" the other vertical axis.

Daniel's proposed fix retains the dual axes, and he even restores the dual lines construction.

Danielzvinca_dualaxes_estimatedy

How is this chart different from the typical dual-axes chart, like the first graph in this post?

Recall that the problem with using two axes is that the designer could squeeze, expand or shift one of the axes in any number of ways to manufacture many realities. What Daniel effectively did here is selecting one specific way to transform the "New Customers" axis (shown in gray).

His idea is to run a simple linear regression between the two time series. Think of fitting a "trendline" in Excel between Revenues and New Customers. Then, use the resulting regression equation to compute an "estimated" revenues based on the New Customers series. The coefficients of this regression equation then determines the degree of squeezing/expansion and shifting applied to the New Customers axis.

The main advantage of this "fix" is to eliminate the freedom to manufacture multiple realities. There is exactly one way to transform the New Customers axis.

The chart itself takes a bit of time to get used to. The actual values plotted in the gray line are "estimated revenues" from the regression model, thus the blue axis values on the left apply to the gray line as well. The gray axis shows the respective customer values. Because we performed a linear fit, each value of estimated revenues correspond to a particular customer value. The gray line is thus a squeezed/expanded/shifted replica of the New Customers line (shown in orange in the first graph). The gray line can then be interpreted on two connected scales, and both the blue and gray labels are relevant.

***

What are we staring at?

The blue line shows the observed revenues while the gray line displays the estimated revenues (predicted by the regression line). Thus, the vertical gaps between the two lines are the "residuals" of the regression model, i.e. the estimation errors. If you have studied Statistics 101, you may remember that the residuals are the components that make up the R-squared, which measures the quality of fit of the regression model. R-squared is the square of r, which stands for the correlation between Customers and the observed revenues. Thus the higher the (linear) correlation between the two time series, the higher the R-squared, the better the regression fit, the smaller the gaps between the two lines.

***

There is some value to this chart, although it'd be challenging to explain to someone who has not taken Statistics 101.

While I like that this linear regression approach is "principled", I wonder why this transformation should be preferred to all others. I don't have an answer to this question yet.

***

Daniel's fix reminds me of a different, but very common, chart.

Forecastvsactualinflationchart

This chart shows actual vs forecasted inflation rates. This chart has two lines but only needs one axis since both lines represent inflation rates in the same range.

We can think of the "estimated revenues" line above as forecasted or expected revenues, based on the actual number of new customers. In particular, this forecast is based on a specific model: one that assumes that revenues is linearly related to the number of new customers. The "residuals" are forecasting errors.

In this sense, I think Daniel's solution amounts to rephrasing the question of the chart from "how closely are revenues and new customers correlated?" to "given the trend in new customers, are we over- or under-performing on revenues?"

Instead of using the dual-axes chart with two different scales, I'd prefer to answer the question by showing this expected vs actual revenues chart with one scale.

This does not eliminate the question about the "principle" behind the estimated revenues, but it makes clear that the challenge is to justify why revenues is a linear function of new customers, and no other variables.

Unlike the dual-axes chart, the actual vs forecasted chart is independent of the forecasting method. One can produce forecasted revenues based on a complicated function of new customers, existing customers, and any other factors. A different model just changes the shape of the forecasted revenues line. We still have two comparable lines on one scale.