Atypical time order and bubble labeling

This chart appeared in a Charles Schwab magazine in the summer of 2019.


This bubble chart does not print any data labels. The bubbles grab our attention, but the designer realizes that the actual volatility values are not intuitive numbers. The same is true of any standard deviation: if you're told the SD of a data series is 3, that alone tells you little.

I first transformed this chart into the equivalent column chart:


Two problems surface on the axes.

For the time axis, the years are jumbled. Readers experience vertigo as they try to figure out how to read the chart. Our expectation that time moves left to right is thwarted. This ordering also requires every single year label to be present.

For the vertical axis, I could have left out the numbers completely. They are not really meaningful: they represent the areas of the bubbles, but only relative to how I measured them.


In the next version, I sorted time in the conventional manner. Following Tufte's classic advice, only the tops of the columns are plotted.


This ordering is much easier to comprehend. Figuring out that 2018 is an average year in terms of volatility is no harder than in the original. In fact, we can reproduce the order of the previous chart just by letting our eyes sweep from top to bottom.

To make the vertical axis even easier to read, I converted the numbers into an index, with the average volatility set to 100 (assigned to 0% on the chart).
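To illustrate the indexing arithmetic, here is a minimal Python sketch; the volatility values are invented stand-ins for the numbers measured off the bubbles:

```python
import numpy as np

# Invented volatility values by year, standing in for the measured data
vol = {2008: 3.0, 2011: 0.4, 2017: 0.2, 2018: 1.0, 2019: 0.4}

avg = np.mean(list(vol.values()))  # average volatility across the years

# Index each year to the average: 100 = average volatility,
# which the chart relabels as 0%
index = {yr: v / avg * 100 for yr, v in vol.items()}

for yr, idx in sorted(index.items()):
    print(f"{yr}: index {idx:.0f} ({idx - 100:+.0f}% vs average)")
```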


Now, you can see that 2018 is roughly at the average while 2008 is 400% above the average level. (How should we interpret this statement? That's a question I pose to my statistics students. It's not intuitive how one should interpret the statement that the standard deviation is 5 times higher.)



Convincing charts showing containment measures work

The disorganized nature of the U.S. response to the coronavirus pandemic has created a sort of natural experiment that allows data journalists to explore important scientific questions, such as the impact of containment measures on cases and hospitalizations. This New York Times article represents the best of such work.

The key finding of the analysis is beautifully captured by this set of scatter plots:


Each dot is a state. The cases (left plot) and hospitalizations (right plot) are plotted against the severity of containment measures for November. The negative correlation is unmistakable: the more containment measures taken, the lower the counts.

There are a few features worth noting.

The severity index came from a group at Oxford, and is a number between 0 and 100. The journalists decided to leave out the numerical labels, instead simply showing More and Fewer. This significantly reduces processing time. Readers won't be able to understand the index values anyway without reading the manual.

The index values are doubly encoded. They are first encoded by the location on the horizontal axis and redundantly encoded on the blue-red scale. Ordinarily, I do not like redundant encoding because the reader might assume a third dimension exists. In this case, I had no trouble with it.

The easiest way to see the effect is to ignore the muddy middle and focus on the two ends of the severity index. Those states with the fewest measures - South Dakota, North Dakota, Iowa - are the worst in cases and hospitalizations while those states with the most measures - New York, Hawaii - are among the best. This comparison is similar to what is frequently done in scientific studies, e.g. when they say coffee is good for you, they typically compare heavy drinkers (4 or more cups a day) with non-drinkers, ignoring the moderate and light drinkers.

Notably, there is quite a bit of variability for any level of containment measures - roughly 50 cases per 100,000, and 25 hospitalizations per 100,000. This indicates that containment measures are not sufficient to explain the counts. For example, the hospitalization statistic is affected by the stock of hospital beds, which I assume differ by state.

Whenever we use a scatter plot, we run the risk of xyopia. This chart form invites readers to explain an outcome (y-axis values) using one explanatory variable (on x-axis). There is an assumption that all other variables are unimportant, which is usually false.


Because of the variability, the horizontal scale has meaningless precision. The next chart cures this by grouping the states into three categories: low, medium and high levels of measures.


This set of charts extends the time window back to March 1. For the designer, this creates a tricky problem, because states adapted their policies over time. As indicated in the subtitle, the grouping is based on the average severity index since March, rather than just November, as in the scatter plots above.


The interplay between policy and health indicators is captured by connected scatter plots, of which the Times article included a few examples. Here is what happened in New York:


Up until April, the policies were catching up with the cases. The policies tightened even after cases per capita started falling. Then, policies eased a little, and cases started to spike again.

The Note tells us that the containment severity index is time-shifted to reflect a two-week lag in effect. So, the case count on May 1 is paired not with the containment severity index of May 1 but with that of two weeks earlier, April 17.
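Conceptually, the time shift is a simple lag. Here's a minimal pandas sketch with invented data, assuming daily observations and a 14-day lag:

```python
import pandas as pd

# Invented daily data for one state; values are placeholders
dates = pd.date_range("2020-04-01", periods=60, freq="D")
df = pd.DataFrame({
    "severity_index": range(60),          # stand-in for the Oxford index
    "cases_per_100k": range(60, 0, -1),   # stand-in for case counts
}, index=dates)

# Pair each day's case count with the severity index from 14 days earlier,
# reflecting the assumed two-week lag between policy and effect
df["severity_lagged"] = df["severity_index"].shift(14)

# e.g. the May 1 row now carries the April 17 severity index
print(df.loc["2020-05-01", ["cases_per_100k", "severity_lagged"]])
```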


You can find the full article here.




Bloomberg made me digest these graphics slowly

Ask the experts to name the success metric of good data visualization, and you will receive a dozen answers. The field doesn't have an all-encompassing metric. A useful reference is Andrew Gelman and Antony Unwin (2012), in which they discussed the tradeoff between beautiful and informative, which derives from the familiar tension between art and science.

For a while now, I've been intrigued by metrics that measure "effort". Some years ago, I described the concept of a "return on effort" in this post. Such a metric can be constructed like the dominant financial metric of return on investment. The investment here is an investment of time, of attention. I strongly believe that if the consumer judges a data visualization to be compelling, engaging or well constructed, s/he will expend energy to devour it.

Imagine grub you discard after the first bite, compared to the delicious food experienced slowly, savoring every last bit.

I'm writing this post while enjoying the September issue of Bloomberg Businessweek, which focuses on the upcoming U.S. Presidential election. There are various graphics infused into the pages of the magazine. Many of these graphics operate at a level of complexity above what typically shows up in magazines, and yet I spent energy learning to understand them. This response, I believe, is what visual designers should aim for.


Today, I discuss one example of these graphics, shown on the right. You might be shocked by the throwback style of these graphics. They look like they arrived from decades ago!

Grayscale, simple forms, typewriter font, all caps. Have I gone crazy?

The article argues that a town like Ambridge in Beaver County, Pennsylvania may be pivotal in the November election. The set of graphics provides relevant data to understand this argument.

It's evidence that data visualization does not need whiz-bang modern wizardry to excel.

Let me focus on the boxy charts from the top of the column. These:


These charts solve a headache with voting margin data in the U.S. We have two dominant political parties, so in any given election, the vote shares split into three buckets: Democratic, Republican, and a catch-all category that includes third parties, write-ins, and none of the above. The third category rarely exceeds 5 percent. A generic pie chart representation looks like this:


Stacked bars have this look:


Using my Trifecta framework (link), the top point is articulating the question. The primary issue here is the voting margin between the winner and the runner-up, the loser in what is typically a two-horse race. There are two sub-questions: the vote-share difference between the top two finishers, and the share of the vote effectively removed from the pot by the remaining candidates.

Now, take another look at the unusual chart form used by Bloomberg:


The catch-all vote share sits at the bottom while the two major parties split up the top section. This design demonstrates a keen understanding of the context. Consider the typical outcome, in which the top two finishers are from the two major parties. When answering the first sub-question, we can use either the raw vote shares or the normalized vote shares. Normalizing shifts the base from all candidates to the top two candidates.

The Bloomberg chart addresses both scales. The normalized vote shares can be read directly by focusing only on the top section. In an even two-horse race, the top section is split in half; this holds true regardless of the size of the bottom section.
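The arithmetic behind the two scales is easy to see in a short sketch (the vote shares below are invented):

```python
# Invented vote shares summing to 1
dem, rep, other = 0.46, 0.49, 0.05

# Raw shares: the base is all candidates (the pie / stacked-bar view)
raw_margin = rep - dem

# Normalized shares: the base is the top two finishers (the Bloomberg view)
top_two = dem + rep
norm_margin = rep / top_two - dem / top_two

print(f"raw margin {raw_margin:+.1%}, normalized margin {norm_margin:+.1%}")
```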

This is a simple chart that packs a punch.


Election visuals 2: informative and playful

In yesterday's post, I reviewed one section of 538's visualization of its election forecasting model, specifically the probability plot visualization.

The visualization, technically called a pdf (probability density function), is a mainstay of statistical graphics. While every one of the 40,000 scenarios shows up on this chart, it doesn't offer a direct answer to our topline question: what is Nate's call at this point in time? Elsewhere in their post, we learn that the 538 model currently gives Biden a 75% chance of winning, three times Trump's.


In graphical terms, the area to the right of the 270-line is three times the size of the left area (on the bottom chart). That's not apparent in the pdf representation. Addressing this, statisticians may convert the pdf into a cdf, which depicts the cumulative area as we sweep from the left to the right along the horizontal axis.  

The cdf visualization rarely leaves the pages of a scientific journal because it's not easy for a novice to understand, not least because the relevant probability is 1 minus the cumulative probability. The cdf for the bottom chart will show 25% at the 270-line, while the chance of Biden winning is 1 - 25% = 75%.
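The arithmetic is easy to verify on simulated scenarios. The sketch below uses a made-up distribution of electoral-vote counts in place of 538's actual simulation output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the model output: 40,000 simulated Biden electoral-vote counts
biden_ev = rng.normal(loc=310, scale=60, size=40_000).round().clip(0, 538)

# The cdf evaluated at the 270-line: share of scenarios below the threshold
cdf_at_270 = np.mean(biden_ev < 270)

# The winning probability is the complement of the cdf at the threshold
print(f"P(Biden wins) = {1 - cdf_at_270:.0%}")
```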

The cdf presentation is also wasteful for the election scenario. No one cares about any threshold other than the 270 votes needed to win, but the standard cdf shows every possible threshold.

The second graphical concept in the 538 post (link) is an attempt to solve this problem.


If you drop all the dots to an imaginary horizontal baseline, the above dotplot looks like this:


There is a recent trend toward centering dots to produce symmetry, which actually makes it harder to perceive the differences in the heights of the band.

The secret sauce is to put down 100 dots, with a 75-25 blue-red split that conveys the 75% chance of a Biden win. Imposing the pdf line from the other visualization, I find that the density of dots roughly mimics the probability of outcomes.
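One plausible way to construct such a dotplot, sketched below under the assumption that the 100 dots are a random sample of the 40,000 scenarios (the scenario data are again made up):

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up stand-in for the 40,000 simulated electoral-vote counts
biden_ev = rng.normal(loc=310, scale=60, size=40_000).round().clip(0, 538)

# Sample 100 scenarios; the blue-red split of the sample approximates
# the 75-25 winning probabilities, and the dot density mimics the pdf
dots = rng.choice(biden_ev, size=100, replace=False)
colors = np.where(dots >= 270, "blue", "red")

print(f"blue: {(colors == 'blue').sum()}, red: {(colors == 'red').sum()}")
```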


It's easier to estimate the blue vs red areas using those dots than the lines.

The dots are stuffed toys: clicking on each one reveals a map of one of the 40,000 scenarios, showing which candidate wins which state. For example, the most extreme example of a Trump win is:


Here is a scenario of a razor-thin election won by Trump:


This presentation has a weakness as well. It gives the impression that each of the dots is equally important because they are the same size. In reality, the importance of each dot is proportional to the height of the band. Since the band is generally wider near the middle, the dots near the middle are more likely scenarios than the dots shown on the two edges.

On balance, I like this visualization that is both informative and playful.

As before, what strikes me about the simulation result is the flatness of the probability surface. This feature is obscured when we summarize the result as 75% chance of a Biden victory.

Election visuals: three views of FiveThirtyEight's probabilistic forecasts

As anyone who is familiar with Nate Silver's forecasting of U.S. presidential elections knows, he runs a simulation that explores the space of possible scenarios. The polls that provide a baseline forecast make certain assumptions, such as who's a likely voter. Nate's model unshackles these assumptions from the polling data, exploring how the outcomes vary as these assumptions shift.

In the most recent simulation, his computer explores 40,000 scenarios, each of which predicts a split of the electoral vote, from which the winner of the election can be determined. The model's outcome is usually summarized by a winning probability, which is just the proportion of scenarios under which one candidate wins.

This type of forecasting was responsible for the infamous meltdown in 2016, when most of these models - Nate's being an exception - issued extremely confident predictions that Hillary Clinton would win with 95% or higher probability. Essentially, the probability distribution collapses to a point. This is analogous to an extremely narrow confidence band, indicating almost zero uncertainty about the event. It was as if almost all of the 40,000 scenarios predicted Clinton to be the winner.

The 538 data team has come up with various ways of visualizing the outputs of the model (link). The entire post is worth reading. Here, I'll highlight the most scientific and direct visual representation, which is the third display.


We start by looking at the bottom of the two charts, showing the predicted electoral votes won by Democratic challenger Joe Biden in each of the 40,000 scenarios. Our attention is directed to the thick line that gives the relative chance of Biden's electoral-vote tally. This line is a smoothed summary of the columns in the background, which show the number of times the simulation produces each electoral-vote count.

The highlighted right side of the chart covers scenarios in which Biden becomes President, that is to say, he wins at least 270 electoral votes (out of 538, doh). The faded left side represents scenarios in which Biden is defeated and Trump wins a second term.

The reason I focused on the bottom chart is that the top chart is merely a mirror image of this one. Just reflect the bottom chart around the vertical line at 270 electoral votes, change the color scheme to red, and swap annotations related to Trump and Biden, and you get the other chart. This is because the narrative has excluded third-party and write-in candidates, leaving us with a zero-sum situation: in any scenario, Trump's tally is simply 538 minus Biden's.

Alternatively, one can jam both charts into one, while supplying extra labels, like this:


I prefer the denser single chart because my mind wanders away searching for extra meaning when chart elements are mirrored.

One advantage of the mirrored presentation is that the probability profiles of the potential Trump or Biden wins can be directly compared. We learn that Trump's winning margins are smaller, rarely above 150, and never above 250.

This comparison is made easier by flipping the left side of the chart onto the right side:


Those are three different visualizations using the same chart form. I'd have to run a poll to figure out which is the best. What's your opinion?

Deaths as percent neither of cases nor of population. Deaths as percent of normal.

Yesterday, I posted a note about excess deaths on the book blog (link). The post was inspired by a nice data visualization by the New York Times (link). This is a great example of data journalism.


Excess deaths is a superior metric for measuring the effect of Covid-19 on public health. It's better than deaths as a percent of cases, and also better than deaths as a percent of the population. What excess deaths measure is deaths as a percent of normal. Normal is usually defined as the average deaths in the respective week in years past.

The red areas indicate how far the deaths in the Southern states are above normal. The highest peak, registered in Texas in late July, is 60 percent above the normal level.


The best way to appreciate the effort that went into this graphic is to imagine receiving the outputs from the model that computes excess deaths. A three-column spreadsheet with columns "state", "week number" and "estimated excess deaths".

The first issue is unequal population sizes. More populous states of course have higher death tolls. Transforming death tolls to an index pegged to the normal level solves this problem. To produce this index, we divide actual deaths by the normal level of deaths. So the spreadsheet must be augmented by two additional columns, showing the historical average deaths and actual deaths for each state for each week. Then, the excess death index can be computed.
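Here's a minimal pandas sketch of that augmented spreadsheet and the index computation; all numbers are invented:

```python
import pandas as pd

# Invented rows of the augmented spreadsheet
df = pd.DataFrame({
    "region": ["South", "South", "Northeast", "Northeast"],
    "state":  ["TX", "TX", "NY", "NY"],
    "week":   [28, 29, 28, 29],
    "actual_deaths": [4800, 5100, 1100, 1050],
    "normal_deaths": [3000, 3000, 1000, 1000],  # historical weekly averages
})

# Excess death index: actual deaths as a percent of the normal level,
# which removes the effect of unequal state populations
df["excess_index"] = df["actual_deaths"] / df["normal_deaths"] * 100
print(df)
```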

The journalist builds a story around the migration of the coronavirus between different regions as it rages across different states during different weeks. To this end, the designer first divides the dataset into four regions (South, West, Midwest and Northeast). Within each region, the states must be ordered. For each state, the week of peak excess deaths is identified, and the peak index is used to sort the states.
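Continuing the sketch above, the sort order falls out of a groupby:

```python
# Within each region, find each state's peak excess death index;
# the peak determines the state's position in its region's panel
peak = (df.groupby(["region", "state"])["excess_index"]
          .max()
          .sort_values(ascending=False))
print(peak)
```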

The graphic utilizes a small-multiples framework. Time occupies the horizontal axis, by convention. The vertical axis is compressed so that the states are not too distant. For the same reason, the component graphs are allowed to overlap vertically. The benefit of the tight arrangement is clearer for the Northeast as those peaks are particularly tall. The space-saving appearance reminds me of sparklines, championed by Ed Tufte.

There is one small tricky problem. In most of June, Texas suffered at least 50 percent more deaths than normal. The severity of this excess death toll is shortchanged by the low vertical height of each component graph. What forced such congestion is probably the data from the Northeast. For example, New York City:



New York City's death toll was almost 8 times the normal level at the start of the epidemic in the U.S. If the same vertical scale is maintained across the four regions, then the Northeastern states dwarf all else.


One key takeaway from the graphic for the Southern states is the persistence of the red areas. In each state, for almost every week of the entire pandemic period, actual deaths have exceeded the normal level. This is strong indication that the coronavirus is not under control.

In fact, I'd like to see a second set of plots showing the cumulative excess deaths since March. The weekly graphic is better for identifying the ebb and flow while the cumulative graphic takes measure of the total impact of Covid-19.


The above description leaves out a huge chunk of work related to computing excess deaths. I assumed the designer receives these estimates from a data scientist. See the related post in which I explain how excess deaths are estimated from statistical models.


How many details to include in a chart

This graphic by Bloomberg provides the context for understanding the severity of the Atlantic storm season. (link)


At this point of the season, 2020 appears to be one of the most severe in history.

I was momentarily fascinated by a feature of modern browser-based data visualization: the death of the aspect ratio. When the browser window is stretched sufficiently wide, the chart above is transformed to this look:


The chart designer has lost control of the aspect ratio.


This Bloomberg chart is an example of the spaghetti-style plots that convey variability by displaying individual units of data (here, storm years). The envelope of the growth curves gives the range of historical counts while the density of curves roughly offers some sense of the most likely counts at different points of the season.

But these spaghetti-style plots are not precise at conveying the variability because the density is hard to gauge. That's where aggregating the individual units helps.

The following chart does not show individual storm years. It shows the counts for the median season at selected points in time, and also a band of variability (covering, say, 90 or 95% of the seasons).


I don't have the raw data so the aggregating is done by eyeballing the spaghetti.
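With the raw data in hand, the aggregation would take only a few lines. A sketch with invented storm counts:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented spaghetti: cumulative storm counts for 50 seasons,
# observed weekly over a 26-week season
seasons = np.cumsum(rng.poisson(0.7, size=(50, 26)), axis=1)

# Collapse the individual seasons into a median line and a 90% band
median = np.median(seasons, axis=0)
lo, hi = np.percentile(seasons, [5, 95], axis=0)

for w in (9, 17, 25):
    print(f"week {w + 1}: median {median[w]:.0f}, 90% band [{lo[w]:.0f}, {hi[w]:.0f}]")
```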

I prefer this presentation even though it does not plot every single data point one has in the dataset.



On data volume, reliability, uncertainty and confidence bands

This chart from the Economist caught my eye because of the unusual use of color-coded hexagonal tiles.


The basic design of the chart is easy to grasp: It relates people's "happiness" to national wealth. The thick black line shows that the average citizen of wealthier countries tends to rate their current life situation better.

For readers alert to graphical details, things can get a little confusing. The horizontal "wealth" axis is shown in log scale, which means that the data on the right side of the chart have been compressed while the data on the left side of the chart have been stretched out. In other words, the curve in linear scale is much flatter than depicted.


One thing you might notice is how poor the fit of the line is at both ends. Singapore and Afghanistan are clearly not explained by the fitted line. (That said, the line is based on many more dots than those eight we can see.) Moreover, because countries are widely spread out on the high end of the wealth axis, the fit is not impressive. Log scales tend to give a false impression of the tightness of fit, as I explained before when discussing coronavirus case curves.


The hexagonal tiles replace the more typical dot scatter or contour shading. The raw data consist of results from polls conducted in different countries in different years. For each poll, the analyst computes the average life satisfaction score for that country in that year. From national statistics, the analyst pulls out that country's GDP per capita in that year. Thus, each data point is a dot on the canvas. A few data points are shown as black dots. Those are for eight highlighted countries for the year 2018.

The black line is fitted to the underlying dot scatter and summarizes the correlation between average wealth and average life satisfaction. Instead of showing the scatter, this Economist design aggregates nearby dots into hexagons. The deepest red hexagon, sandwiched between Finland and the US, contains about 60-70 dots, according to the color legend.

These details are tough to take in. It's not clear which dots have been collected into that hexagon: are they all Finland or the U.S. in various years, or do they include other countries? Each country is represented by multiple dots, one for each poll year. It's also not clear how much variation exists within a country across years.


The hexagonal tiles presumably serve the same role as a dot scatter or contour shading. They convey the amount of data supporting the fitted curve along its trajectory. More data confers more reliability.

For this chart, the hexagonal tiles do not add any value. The deepest red regions are those closest to the black line so nothing is actually lost by showing just the line and not the tiles.


Using the line chart obviates the need for readers to figure out the hexagons, the polls, the aggregation, and the inevitable unanswered questions.


An alternative concept is to show the "confidence band" or "error bar" around the black line. These bars display the uncertainty of the data. The wider the band, the less certain the analyst is of the estimate. Typically, the band expands near the edges where we have less data.

Here is conceptually what we should see. (I don't have the underlying dataset, so I can't compute the confidence band precisely.)


The confidence band picture is the mirror image of the hexagonal tiles. Where the poll density is high, the confidence band narrows, and where poll density is low, the band expands.

A simple way to interpret the confidence band is to find the country's wealth on the horizontal axis, and look at the range of life satisfaction ratings for that value of wealth. Now pick any number within the range, and imagine that you've just conducted a survey and computed the average rating. That number you picked is a possible survey result, and thus a valid value. (For those who know some probability: you should pick a number not at random within the range but in accordance with a Bell curve, meaning a number closer to the fitted line is picked with much higher probability than a number at either edge.)
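To make that concrete, here is a toy sketch; the fitted value and band width are invented, and the band is read as roughly plus or minus two standard errors:

```python
import numpy as np

rng = np.random.default_rng(3)

# Suppose at some wealth level the fitted rating is 6.0 and the
# confidence band spans 5.4 to 6.6 (both numbers invented)
fit, half_width = 6.0, 0.6
se = half_width / 2  # treating the band as +/- 2 standard errors

# A "possible survey result" is a draw from a Bell curve centered
# on the fitted line, not a uniform pick within the band
possible_results = rng.normal(fit, se, size=5)
print(possible_results.round(2))
```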

Visualizing data involves a series of choices. For this dataset, one such choice is displaying data density or uncertainty or neither.

This chart shows why the PR agency for the UK government deserves a Covid-19 bonus

The Economist illustrated some interesting consumer research with this chart (link):


The survey by Dalia Research asked people about their satisfaction with their country's response to the coronavirus crisis. The results are reduced to the "Top 2 Boxes": the proportion of people who rated their government's response as "very well" or "somewhat well".

This dimension is laid out along the horizontal axis. The chart is a combo dot and bubble chart, arranged in rows by region of the world. Now what does the bubble size indicate?

It took me a while to find the legend as I was expecting it either in the header or the footer of the graphic. A larger bubble depicts a higher cumulative number of deaths up to June 15, 2020.

The key issue is the correlation between a country's death count and the people's evaluation of the government response.

Bivariate correlation is typically shown on a scatter plot. The following chart sets out the scatter plots in a small multiples format with each panel displaying a region of the world.


The death tolls in the Asian countries are low relative to the other regions, and yet the people's ratings vary widely. In particular, the Japanese people are pretty hard on their government.

In Europe, the people of Greece, Netherlands and Germany think highly of their government responses, which have suppressed deaths. The French, Spaniards and Italians are understandably unhappy. The British appear to be the most forgiving of their government, despite suffering a higher death toll than France, Spain or Italy. This speaks well of their PR operation.

Cumulative deaths should be adjusted by population size for a proper comparison across nations. When the same graphic is produced using deaths per million (shown on the right below), the general story is preserved while the pattern is clarified:


The right chart shows deaths per million while the left chart shows total deaths.


In the original Economist chart, what catches our attention first is the bubble size. Eventually, we notice the horizontal positioning of these bubbles. But the star of this chart ought to be the new survey data. I swapped those variables and obtained the following graphic:


Instead of using bubble size, I switched to using color to illustrate the deaths-per-million metric. If ratings of the pandemic response correlate tightly with deaths per million, then we expect the color of these dots to evolve from blue on the left side to red on the right side.

The peculiar loss of correlation in the U.K. stands out. Their PR firm deserves a bonus!

Cornell must remove the logs before it reopens the campus in the fall

Against all logic, Cornell announced last week it would re-open in the fall because a mathematical model under development by several faculty members and grad students predicts that a "full re-opening" would lead to 80 percent fewer infections than a scenario of full virtual instruction. That's what was reported by the media.

The model is complicated, with loads of assumptions, and the report is over 50 pages long. I will put up my notes on how they attained this counterintuitive result in the next few days. The bottom line is - and the research team would agree - it is misleading to describe the analysis as "full re-open" versus "no re-open". The so-called full re-open scenario assumes that the entire community, including students, faculty and staff, submits to a full program of test-trace-isolate: mandatory PCR diagnostic testing once every five days throughout the 16-week semester, immediate quarantine and isolation of new positive cases as well as their contacts, plus full compliance with this program. By contrast, the online instruction scenario assumes students do not get tested at all. In other words, the researchers expect Cornell to get done what U.S. governments at all levels have so far failed to do.

[7/8/2020: The post on the Cornell model is now up on the book blog. Here.]

The report takes us back to the good old days of best-base-worst-case analysis. There is no data for validating such predictions so they performed sensitivity analyses, defined as changing one factor at a time assuming all other factors are fixed at "nominal" (i.e. base case) values. In a large section of the report, they publish a series of charts of the following style:


Each line here represents one of the best-base-worst cases (respectively, orange-blue-green). Every parameter except one is given the "nominal" value (which represents the base case). The parameter that is manipulated is shown on the horizontal axis; for the above chart, it is the assumed average number of daily contacts per person. The vertical axis shows the main outcome variable, which is the percentage of the community infected by the end of term.
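Mechanically, these sensitivity analyses are one-factor-at-a-time sweeps. Here's a minimal sketch, with a hypothetical model() function and made-up nominal values standing in for the Cornell simulation:

```python
# One-factor-at-a-time sensitivity analysis: vary a single parameter
# while holding all others at their nominal (base case) values
def one_factor_sweep(model, nominal, name, values):
    return [(v, model(**{**nominal, name: v})) for v in values]

# Toy stand-in for the Cornell model: percent infected by end of term
def model(daily_contacts, test_interval):
    return 0.5 * daily_contacts ** 1.5 / test_interval  # made-up formula

nominal = {"daily_contacts": 8, "test_interval": 5}
print(one_factor_sweep(model, nominal, "daily_contacts", range(1, 21)))
```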

The flatness of the lines in the above chart appears to say that the outcome is quite insensitive to changes in the average daily contact rate under all three scenarios - until the daily contact rate rises above 10 per person per day. It also appears to show that the blue line is roughly midway between the orange and the green, so the percent infected is slightly less than halved under the optimistic scenario, and a bit more than doubled under the pessimistic scenario, relative to the blue line.

Look again.

The vertical axis is presented in log scale, and only labeled at the values 1% and 10%. About midway between 1 and 10 on the horizontal axis, the outcome value has already risen above 10%. Because of the log transformation, each tick above the 10% line represents an increase of 10 percentage points. So, the top of the vertical axis indicates 80% of the community being infected! Nothing in the description or labeling of the vertical axis prepares the reader for this.

The report assumes a fixed value for average daily contacts of 8 (I rounded the number for discussion), which is invariable across all three scenarios. Drawing a vertical line about eight-tenths of the way towards 10 appears to signal that this baseline daily contact rate places the outcome in the relatively flat part of the curve.

Look again.

The horizontal axis too is presented in log scale. To birth one log scale may be regarded as a misfortune; to birth two log scales looks like carelessness.

Since there exists exactly one tick beyond 10 on the horizontal axis, the right-most value is 20. The model has been run for values of average daily contacts from 1 to 20, with unit increases. I can think of no defensible reason why such a set of numbers should be expressed in a log scale.

For the vertical axis, the outcome is a proportion, which is confined to within 0 percent and 100 percent. It's not a number that can explode.


Every log scale on a chart is birthed by its designer. I know of no software that automatically performs log transforms on data without the user's direction. (I write this line with trepidation wishing that I haven't planted a bad idea in some software developer's head.)

Here is what the shape of the original data looks like - without any transformation. All software (I'm using JMP here) produces something of this type:


At the baseline daily contact rate of 8, the model predicts that 3.5% of the Cornell community will get infected by the end of the semester (again, assuming the strict test-trace-isolate program is fully implemented and complied with). Under the pessimistic scenario, the proportion jumps to 14%, four times the base case. In this worst-case scenario, if the daily contact rate were about twice the assumed value (just over 16), half of the community would be infected in 16 weeks!

I actually do not understand how there could only be 8 contacts per person per day when the entire student body has returned to 100% in-person instruction. (In the report, they even say the 8 contacts could include multiple contacts with the same person.) I imagine an undergrad student in a single classroom with 50 students. This assumption says the average student in this class only comes into contact with at most 8 of those. That's one class. How about other classes? small tutorials? dining halls? dorms? extracurricular activities? sports? parties? bars?

Back to graphics. Something about the canonical chart irked the report writers so they decided to try a log scale. Here is the same chart with the vertical axis in log scale:


The log transform produces a visual distortion. On the right side, where the three lines are diverging rapidly, the log transform pulls them together. On the left side, where the three lines are close together, the log transform pulls them apart.

Recall that on the log scale, a straight line is exponential growth. Look at the green line (worst case). That line is approximately linear so in the pessimistic scenario, despite assuming full compliance to a strict test-trace-isolate regimen, the cases are projected to grow exponentially.
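A quick numerical check of that fact, using an invented exponential series:

```python
import numpy as np

x = np.arange(1, 11)
y = 0.5 * np.exp(0.35 * x)   # exponential growth (invented)

# On a log scale, the series is a straight line: the log-transformed
# values increase by a constant amount at each step
slopes = np.diff(np.log(y))
print(slopes.round(3))       # all equal to 0.35
```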

Something about that last chart still irked the report writers so they decided to birth a second log scale. Here is the chart they ultimately settled on:


As with the other axis, the effect of the log transform is to squeeze the larger values (on the right side) and spread out the smaller values (on the left side). After this cosmetic surgery, the left side looks relatively flat while the right side looks steep.

In the next version of the Cornell report, they should replace all these charts with ones using linear scales.


Upon discovering this graphical mischief, I wonder if the research team received a mandate that includes a desired outcome.


[P.S. 7/8/2020. For more on the Cornell model, see this post.]