On the interpretability of log-scaled charts

A previous post featured the following chart showing stock returns over time:

Gelman_overnightreturns_tsla

Unbeknownst to readers, the chart plots one thing but labels it something else.

The designer of the chart explains how to read the chart in a separate note, which I included in my previous post (link). It's a crucial piece of information. Before reading his explanation, I didn't realize the sleight of hand: he made a chart with one time series, then substituted the y-axis labels with another set of values.

As I explored this design choice further, I realized that it has been widely adopted in a common chart form, without fanfare. I'll get to it in due course.

***

Let's start our journey with as simple a chart as possible. Here is a line chart showing constant growth in the revenues of a small business:

Junkcharts_dollarchart_origvalues

For all the charts in this post, the horizontal axis depicts time (x = 0, 1, 2, ...). To simplify further, I describe discrete time steps although nothing changes if time is treated as continuous.

The vertical scale is in dollars, the original units. It's conventional to modify the scale to units of thousands of dollars, like this:

Junkcharts_dollarchart_thousands

No controversy arises if we treat these two charts as identical. Here I put them onto the same plot, using dual axes, emphasizing the one-to-one correspondence between the two scales.

Junkcharts_dollarchart_dualaxes

We can do the same thing for two time series that are linearly related. The following chart shows constant growth in temperature using both Celsius and Fahrenheit scales:

Junkcharts_tempchart_dualaxes

Here is the chart displaying only the Fahrenheit axis:

Junkcharts_tempchart_fahrenheit

This chart admits two interpretations: (A) it is a chart constructed using F values directly and (B) it is a chart created using C values, after which the axis labels were replaced by F values. Interpretation B implements the sleight of hand of the log-returns plot. The issue I'm wrestling with in this post is the utility of interpretation B.

Before we move to our next stop, let's stipulate that if we are exposed to that Fahrenheit-scaled chart, either interpretation can apply; readers can't tell them apart.

***

Next, we look at the following line chart:

Junkcharts_trendchart_y

Notice the vertical axis uses a log10 scale. We know it's a log scale because the equally-spaced tickmarks represent different jumps in value: the first jump is from 1 to 10, the next jump is from 10, not to 20, but to 100.

Just like before, I make a dual-axes version of the chart, putting the log Y values on the left axis, and the original Y values on the right axis.

Junkcharts_trendchart_dualaxes

By convention, we often print the original values as the axis labels of a log chart. Can you recognize that sleight of hand? We make the chart using the log values, after which we replace the log value labels with the original value labels. We adopt this graphical trick because humans don't think in log units; the log value labels are thus less "interpretable".
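To make the trick concrete, here's a minimal sketch (matplotlib, made-up data, not the chart above): the line is drawn from the log10 values, and the tick labels are then overwritten with the original values.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(1, 6)          # time steps (made up)
y = 10.0 ** t                # original values: 10, 100, ..., 100,000

fig, ax = plt.subplots()
ax.plot(t, np.log10(y))      # the line is drawn from the log values
ax.set_yticks(range(1, 6))
# the sleight of hand: replace the log labels with the original values
ax.set_yticklabels([f"{10**k:,}" for k in range(1, 6)])
ax.set_xlabel("time")
ax.set_ylabel("Y")
plt.show()
```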

As with the temperature chart, we will attempt to interpret the chart two ways. I've already covered interpretation B. For interpretation A, we regard the line chart as a straightforward plot of the values shown on the right axis (i.e., the original values). Alas, this viewpoint fails for the log chart.

If the original data are plotted directly, the chart should look like this:

Junkcharts_trendchart_y_origvalues

It's not a straight line but a curve.

What have I just shown? That, after using the sleight of hand, we cannot interpret the chart as if it were directly plotting the data expressed in the original scale.

To nail down this idea, we ask a basic question of any chart showing trendlines. What's the rate of change of Y?

Using the transformed log scale (left axis), we find that the rate of change is 1 unit per unit time. Using the original scale, the rate of change from t=1 to t=2 is (100-10)/1 = 90 units per unit time; from t=2 to t=3, it is (1000-100)/1 = 900 units per unit time. Even though the rate of change varies by time step, the log chart with original value labels paints the misleading picture that the rate of change is constant over time (thus a straight line). The decision to substitute the log value labels backfires!
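The arithmetic can be checked in a few lines (a sketch using the series y = 10^t implied by the chart above):

```python
import numpy as np

t = np.arange(0, 4)
y = 10.0 ** t                              # 1, 10, 100, 1000

print(np.diff(np.log10(y)) / np.diff(t))   # [1. 1. 1.]      constant slope in log scale
print(np.diff(y) / np.diff(t))             # [9. 90. 900.]   slope grows in the original scale
```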

This is one reason why I use log charts sparingly. (I do like them a lot for exploratory analyses, but I avoid using them as presentation graphics.) This issue of interpretation is why I dislike the sleight of hand used to produce those log stock returns charts, even if the designer offers a note of explanation.

Do we gain or lose "interpretability" when we substitute those axis labels?

***

Let's re-examine the dual-axes temperature chart, building on what we just learned.

Junkcharts_tempchart_dualaxes

The above chart suggests that whichever scale (axis) is chosen, we get the same line, with the same steepness. Thus, the rate of change is the same regardless of scale. This turns out to be an illusion.

Using the left axis, the slope of the line is 10 degrees Celsius per unit time. Using the right axis, the slope is 18 degrees Fahrenheit per unit time. 18 F is different from 10 C, thus, the slopes are not really the same! The rate of change of the temperature is given algebraically by the slope, and visually by the steepness of the line. Since two different slopes result in the same line steepness, the visualization conveys a lie.

The situation here is a bit better than in the log chart. In either scale, the rate of change is constant over time. Differentiating the temperature conversion formula (F = 9/5*C + 32), we find that the slope of the Fahrenheit line is always 9/5 times the slope of the Celsius line. So a rate of 10 Celsius per unit time corresponds to 18 Fahrenheit per unit time.
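A quick check of this claim, with a made-up Celsius series:

```python
import numpy as np

c = np.array([0, 10, 20, 30])   # Celsius series rising 10 degrees per time step (made up)
f = 9 / 5 * c + 32              # the same series converted to Fahrenheit

print(np.diff(c))               # [10 10 10]    -> slope of 10 C per unit time
print(np.diff(f))               # [18. 18. 18.] -> slope of 18 F per unit time
```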

What if the chart is presented with only the Fahrenheit axis labels although it is built using Celsius data? Since readers only see the F labels, the observed slope is in Fahrenheit units. Meanwhile, the chart creator uses Celsius units. This discrepancy is harmless for the temperature chart but it is egregious for the log chart. The underlying reason is the nonlinearity of the log transform: the slope of log Y vs time is not proportional to the slope of Y vs time; the ratio between the two slopes depends on the value of Y.

***

Log charts are a sacred cow of scientists, a symbol of our sophistication. But are they as potent as we think? In particular, when we put original data values on the log chart, are we making it more interpretable, or less?

 

P.S. I want to tie this discussion back to my Trifecta Checkup framework. The design decision to substitute those axis labels is an example of an act that moves the visual (V) away from the data (D). If the log units are printed, the visual makes sense; when the original units are dropped in, the visual no longer conveys features of the data: the reader must ignore what the eyes are seeing, and work out the meaning in the head instead.


Logging a sleight of hand

Andrew puts up an interesting chart submitted by one of his readers (link):

Gelman_overnightreturns_tsla

Bruce Knuteson, who created this chart, is pursuing a theory that something fishy is going on in the stock markets overnight (i.e. between the close of one day and the open of the next day). He split the price data into two interleaving parts: the blue line represents overnight returns and the green line represents intraday returns (from the open of one day to the close of the same day). In this example involving Tesla's stock, the overnight "return" is an eye-popping 36,850% while the intraday "return" is -46%.

This is an example of an average masking interesting details in the data. One typically looks at the entire sequence of values at once, while this analysis breaks it up into two subsequences. I'll write more about the data analysis at a later point. This post will be purely about the visualization.

***

It turns out that while the chart looks like a standard time series, it isn't. Bruce wrote out the following essential explanation:

Gelman_overnightreturns

The chart can't be interpreted without first reading this note.

The left chart (a) is the standard time-series chart we're thinking about. It plots the relative cumulative percentage change in the value of the investment over time. Imagine one buys $1 of Apple stock on day 1. The chart shows the cumulative return on day X, expressed as a percent relative to the initial investment amount. As mentioned above, the data series was split into two: the intraday return series (green) is dwarfed by the overnight return series (blue), and is barely visible hugging the horizontal axis.

Almost without thinking, a graphics designer applies a log transform to the vertical axis. This has the effect of "taming" the extreme values in the blue line. This is the key design change in the middle chart (b). The other change is to switch back to absolute values. The day 1 number is now $1 so the day X number shows the cumulative value of the investment on day X if one started with $1 on day 1.

There's a reason why I emphasized the log transform over the switch to absolute values. That's because the relationship between absolute and relative values here is a linear one. If y(t) is the absolute cumulative value of $1 at time t, then the percent change r(t) = 100*(y(t) - 1). (Note that y(0) = 1 by definition.) The shape of the middle chart is primarily conditioned by the log transform.

In the right chart (c), which is the design that Bruce features in all his work, the visual elements of chart (b) are retained, but the vertical axis labels are replaced with those from chart (a). In other words, the lines show the cumulative absolute values while the labels show the relative cumulative percent returns.

I left this note on Gelman's blog (corrected a mislabeling of the chart indices):

I'm interested in the sleight of hand related to the plots, also tying this back to the recent post about log scales. In plot (b) [middle of the panel], he transformed the data to show the cumulative value of the investment assuming one puts $1 in the stock on day 1. He applied a log scale on the vertical axis. This is fine. Then in plot (c), he retained the chart but changed the vertical axis labels so instead of absolute value of the investment, he shows percent changes relative to the initial value.

Why didn't he just plot the relative percent changes? Let y(t) be the absolute value series; then the percent change r(t) = 100*(y(t) - 1) is a simple linear transformation of y(t). This is where the log transform creates problems! The y(t) series is guaranteed to be positive since hitting y(t) = 0 means the entire investment is lost. However, the r(t) series can hit negative values and also cross over zero many times. Thus, log r(t) is inoperable. The problem is using the log transform for data that are not always positive, and the sleight of hand does not fix it!

Just pick any day on which the absolute value fell below $1, e.g. the last day of the plot, when the value of the investment was down to $0.80. In the middle plot (b), the value depicted is ln(0.8) = -0.22. Note that the plot is in log scale, so what is labeled as $1 is really ln(1) = 0. If we instead try to plot the relative percent changes, then the day 1 number should be ln(0), which is undefined, while the last number should be ln(-20%), which is also undefined.

This is another example of something uncomfortable about using log scales, which I pointed out in this post. It's this idea that when we do log plots, we can freely substitute axis labels which are not directly proportional to the actual labels. It's plotting one thing, and labeling it something else. These labels are then disconnected from the visual encoding. It's against the goal of visualizing data.
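A minimal sketch with made-up numbers illustrates the point: the absolute value series y(t) stays positive, so its log is well defined, while the relative return series r(t) starts at zero and can turn negative, so its log breaks down.

```python
import numpy as np

y = np.array([1.00, 1.20, 0.95, 0.80])   # cumulative value of $1 (made-up series)
r = 100 * (y - 1)                        # percent change relative to day 1: 0, 20, -5, -20

print(np.log(y))   # [ 0.     0.182 -0.051 -0.223]  -- fine, all inputs positive
print(np.log(r))   # [-inf  3.0  nan  nan] with warnings -- log of 0 and of negatives fails
```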

 


Swarmed by ants

Andrew discussed the following chart in a recent blog post:

Agelmanblog_gdpel-logscale

Alert! A swarm of ants has marched onto a bubble chart.

These overlapping long text labels are dominating the chart; the length of these labels encodes the length of country names, which has nothing to do with the data.

We're waiting - hoping - for the ants to march off the page.

***
Andrew's blog post is about something else, the use of log scales. The chart above is a log-log plot. Both axes have log scales.

Andrew's correspondent doesn't like log scales. Andrew does.

One problem we encounter in practice with log scales is that people without a science background can't read them. Andrew's correspondent said as much, while also misinterpreting the log-log chart. He said the log-log chart "visually creates a much stronger correlation than there actually is".

But that's not what happened. It's more appropriate to say that the log transformations allow us to see the correlation that exists. The correlation is not linear, which is why the usual scatter plot does not reveal it.

Nevertheless, I agree with the correspondent on avoiding log scales in data displays because most readers don't get them.

***

Consider the following pair of plots.

Junkcharts_loglog_sample

The underlying data follow the pattern Y = 0.003 * X^2.5 but for what we're talking about, the specific pattern doesn't matter so long as X and Y have a "power" relationship.

The left plot directly shows the relationship between X and Y using regular scales. Readers see that Y is running away from X. The slope of the line increases as X increases. The speed of growth of Y exceeds that of X. This relationship is curved, which can't be described in words succinctly.

The right plot visually shows a linear relationship between X and Y but it's not really between X and Y. It's between log(X) and log(Y). Note that log(Y) = log(0.003*X^2.5) = log(0.003) + 2.5*log(X), which is a straight line with slope 2.5 and intercept log(0.003). The gap between gridlines now represents a 10-fold jump in value (of X or of Y). The linear relationship is between X and Y in log scale; in linear scale, it's a power relationship, not linear.
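Here is a minimal sketch (made-up X values, matplotlib) of the two plots just described: the same power-law series drawn on linear axes and on log-log axes.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(100, 100_000, 200)
y = 0.003 * x ** 2.5                   # the power relationship from the text

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(x, y)                         # linear scales: a runaway curve
ax1.set_title("linear scales")

ax2.plot(x, y)
ax2.set_xscale("log")                  # log-log scales: a straight line with slope 2.5
ax2.set_yscale("log")
ax2.set_title("log-log scales")
plt.show()
```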

The practice of printing axis labels in the original scale, rather than the log scale, adds to the confusion. On the right plot, the points labeled 5,000 and 50,000 are not placed at those values; what actually fall on the line are log(5,000) and log(50,000). The reason for this confusing practice is that humans have trouble understanding data in log scale. For example, if $50,000 is the GDP per capita for some country, then log10(50,000) is about 4.7, a number with no everyday interpretation.

Whether we are talking about the gaps between gridlines or about specific points on the line, what readers see on the log-log chart is only part of the story. Readers must also recognize that, for the log-log chart to work, equal gaps between gridlines do not signify equal gaps in the data, and the linear relationship is between the logs of the axis labels, not the labels themselves.

The X-Y plot can be interpreted visually in a direct way while the log-log plot requires the reader to transcend the visual representation, entering an abstract realm.

 

 


Deliberately obstructing chart elements as a plot point

Bbc_globalwarming_ridgeplot sm

These "ridge plots" have become quite popular in recent times. The following example, from this BBC report (link), shows the change in global air temperatures over time.

***

This chart is in reality a panel of probability density plots, one for each year of the dataset. The years are arranged with the oldest at the top and the most recent at the bottom. You take those plots and squeeze out almost all of the vertical space, so that each chart overlaps heavily with the ones above it.

The plot at the bottom is the only one that can be seen unobstructed.

Overplotting chart elements, deliberately obstructing them, doesn't sound useful. Is there something gained for what's lost?

***

The appeal of the ridge plot is the metaphor of ridges, or crests if you see ocean waves. What do these features signify?

The legend at the bottom of the chart gives a hint.

The main metric used to describe global warming is the amount of excess temperature, defined as the temperature relative to a historical average, set as the average temperature during the pre-industrial age. In recent years, the average global temperature has been about 1.5 degrees Celsius above the reference level.

One might think that the higher the peak in a given plot, the higher the excess temperature. Not so. The heights of those peaks do not indicate temperatures.

What's the scale of the vertical axis? The labels suggest years, but that's also a distraction. If we consider the panel of non-overlapping probability density charts, the vertical axis should show probability density. In such a panel, the year labels would go into the titles of the individual plots. On the ridge plot, the density axes are sacrificed, while the year labels are shifted to the vertical axis.

Admittedly, probability density is not an intuitive concept, so not much is lost by its omission.

The legend appears to suggest that the vertical scale is expressed in number of days, so that in any given year, the peak of the curve occurs at the most likely excess temperature. But the amount of excess is read from the horizontal axis, not the vertical axis - it is encoded as a horizontal displacement away from the historical average. In other words, the height of the peak still doesn't correlate with the magnitude of the excess temperature.

Each curve in the following set of probability density curves (with made-up data) has the same average excess temperature of 1.5 degrees. Going from top to bottom, the variability of the excess temperatures increases. The height of the peak decreases accordingly because in a density plot, the total area under the curve is fixed (at 1). Thus, the higher the peak, the lower the daily variability of the excess temperature.

Kfung_pdf_variances
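Curves like these can be sketched as normal densities sharing a mean of 1.5 C with increasing spread (the spreads are made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

x = np.linspace(-2, 5, 400)
for sd in (0.3, 0.6, 1.0):                          # increasing daily variability (made up)
    plt.plot(x, norm.pdf(x, loc=1.5, scale=sd), label=f"sd = {sd}")
plt.axvline(1.5, linestyle="--", color="grey")      # same average excess temperature
plt.xlabel("excess temperature (C)")
plt.ylabel("density")                               # area under each curve is 1
plt.legend()
plt.show()
```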

A problem with this ridge plot is that it draws our attention to the heights of the peaks, which provide information about a secondary metric.

If we want to find the story that the amount of excess temperature has been increasing over time, we would have to trace a path through the ridges, which, strangely enough, moves from top to bottom, initially almost vertically, then sideways to the right. In a more conventional chart, the line that shows growth over time moves from bottom left to top right.

***

The BBC article (link) features several charts. The first one shows how the average excess temperature trends year to year. This is a simple column chart. I assume that, by supplementing the column chart with the ridge plot, the designer wants to tell readers that the average annual excess temperature masks daily variability. Therefore, each annual average has been disaggregated into 366 daily averages.

In the column chart, the annual average is compared to the historical average of 50 years. In the ridge plot, the daily average is compared to ... the same historical average of 50 years. That's what the reference line labeled pre-industrial average is saying to me.

It makes more sense to compare the 366 daily averages to 366 daily averages from those 50 years.

But now I've ruined the dataviz because in each probability density plot, there are 366 different reference points. But not really. We just have to think a little more abstractly. These 366 different temperatures are all mapped to the number zero, after adjustment. Thus, they all coincide at the same location on the horizontal axis.

(It's possible that they actually used 366 daily averages as references to construct the ridge plot. I'm guessing not but feel free to comment if you know how these values are computed.)


Patiently looking

Voronoi (aka Visual Economist) made this map about service times at emergency rooms around the U.S.

 

Voronoi_EmergencyRoomWaitTImes

This map shows why one shouldn’t just stick state-level data into a state-level map by default.

The data are median service times, defined as the duration of the visit from the moment a patient arrives to the moment they leave. For reasons to be explained below, I don’t like this metric. The data are in terms of hours and minutes, and encoded in the color scale.

As with any choropleth, the dominant features of this map are the shapes and sizes of various pieces but these don’t carry any data. The eastern seaboard contains many states that are small in area but dense in population, and always produces a messy, crowded smorgasbord of labels and guiding lines.

The color scale is progressive (continuous), making it even harder to gain an appreciation of the spatial pattern. For the sake of argument, imagine a truly continuous color scale tuned to the median service times in number of minutes. There would be as many shades as there are unique values on the map. For example, the state with a 2 hr 12 min median time would receive a different shade than the one with 2 hr 11 min. Looking at the dataset, I found 43 unique values of median service time in the 52 states and territories. Thus, almost every state would wear its own shade, making it hard to answer such common questions as: which clusters of states have high/medium/low median service times?

(The underlying software may only be capable of printing a finite number of shades, so in reality there aren’t any truly continuous scales; a continuous scale is just a discrete scale with many levels of shades. For this map, I’d group the states into at most five categories, requiring five shades.)
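For example, a five-way grouping could be produced along these lines (a sketch with hypothetical service times in minutes; the quintile split is just one reasonable choice):

```python
import pandas as pd

# hypothetical median service times, in minutes
minutes = pd.Series([95, 131, 132, 150, 175, 210, 240], name="median_minutes")

# split into five groups of roughly equal size, one shade per group
groups = pd.qcut(minutes, q=5,
                 labels=["lowest", "low", "middle", "high", "highest"])
print(pd.concat([minutes, groups.rename("category")], axis=1))
```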

***

We’re now reaching the D corner of the Trifecta Checkup (link).

_trifectacheckup_image

I’d transform the data to relative values, such as an index against the median or average in the nation. The colors would then indicate how much higher or lower a state’s median service time is than the national value. With this transformed data, it makes more sense to use a bidirectional color scale, with different colors for higher versus lower than average.

Lastly, I’m not sure about the use of median service time, as opposed to average (mean) service time. I suspect that the distribution is heavily skewed toward longer values so that the median service time falls below the mean service time. If, however, the service time distribution is roughly symmetric around the median, then the mean and median service times will be very similar, and thus the metric selection doesn’t matter.

Imagine you're the healthcare provider and your bonus is based on managing median service times. You have an incentive to let a small number of patients wait an extraordinary amount of time, while serving a bunch of patients who require relatively simple procedures. With a mean service time, the extreme outliers would be spread over all the patients; the median service time is affected by the number of such outliers but not their magnitudes.
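A tiny made-up example of this incentive: adding a couple of extreme waits barely moves the median but pulls the mean up sharply.

```python
import numpy as np

ordinary = np.array([60, 70, 80, 90, 100, 110, 120])      # visit durations in minutes (made up)
with_outliers = np.append(ordinary, [600, 900])            # two patients left waiting for hours

print(np.median(ordinary), round(np.mean(ordinary), 1))            # 90.0 90.0
print(np.median(with_outliers), round(np.mean(with_outliers), 1))  # 100.0 236.7
```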

When I pulled down the publicly available data (link), I found additional data fields. The emergency room visits are further broken into four categories (low, medium, high, very high), and a median is reported within each category. Thus, we get some idea of how extreme the top values can be.

The following dotplot shows this:

Junkcharts_redo_voronoi_emergencyrooms

A chart like this is still challenging to read since there are 52 territories, ordered by the value of a metric. If the analyst can specify the interesting questions, e.g. comparing regions, then a grouping can be applied to the above chart to aid comprehension.

 


Five-value summaries of distributions

BG commented on my previous post, describing her frustration with the “stacked range chart”:

A stacked graph visualizes cubes stacked one on top of the other. So you can't use it for negative numbers, because there's no such thing [as] "negative data". In graphs, a "minus" sign visualizes the opposite direction of one series from another. Doing average plus average plus average plus average doesn't seem logical at all.

***

I have already planned a second post to discuss the problems of using a stacked column chart to show markers of a numeric distribution.

I tried to replicate how the YouTuber generated his “stacked range chart” by appropriating Excel’s stacked column chart, but failed. I think there are some missing steps not mentioned in the video. At around 3:33 of the video, he shows a “hack” involving adding 100 degrees (any large enough value) to all values (already converted to ranges). Then, the next screen displays the resulting chart. Here is the dataset on the left and the chart on the right.

Minutephysics_londontemperature_datachart

Afterwards, he replaces the axis labels with new labels, effectively shifting the axis. But something is missing from the narrative. Since he’s using a stacked column chart, the values in the table are encoded in the heights of the respective blocks. The total stacked height of each column should be in the hundreds since he has added 100 to each cell. But that’s not what the chart shows.

***

In the rest of the post, I’ll skip over how to make such a chart in Excel, and talk about the consequences of inserting “range” values into the heights of the blocks of a stacked column chart.

Let’s focus on London, Ontario; the five temperature values, corresponding to various average temperatures, are -3, 5, 9, 14, 24. Just throwing those numbers into a stacked column chart in Excel results in the following useless chart:

Stackedcolumnchart_londonontario

The temperature averages are cumulatively summed, which makes no sense, as noted by reader BG. [My daily temperature data differ somewhat from those in the YouTube video. My source is here.]

We should ignore the interiors of the blocks, and instead interpret the edges of these blocks. There are five edges corresponding to the five data values. As in:

Junkcharts_redo_londonontariotemperatures_dotplot

The average temperature in London, Ontario (during Spring 2023-Winter 2024) is 9 C. This overall average hides seasonal as well as diurnal variations in temperature.

If we want to acknowledge that night-time temperatures are lower than day-time temperatures, we draw attention to the two values bracketing 9 C, i.e. 5 C and 14 C. The average daytime (max) temperature is 14 C while the average night-time (min) temperature is 5 C. Furthermore, Ontario experiences seasons, so that the average daytime temperature of 14 C is subject to seasonal variability; in the summer, it goes up to 24 C. In the winter, the average night-time temperature goes down to -3 C, compared to 5 C across all seasons. [For those paying closer attention, daytime/max and night-time/min form congruous pairs because the max temperature occurs during daytime while the min temperature occurs during night-time. Thus, the average of maximum temperatures is the same as the average of daytime maximum temperatures.]

The above dotplot illustrates this dataset adequately. The YouTuber explained why he didn’t like it – I couldn’t quite make sense of what he said. It’s possible he thinks the gaps between those averages are more meaningful than the averages themselves, and therefore he prefers a chart form that draws our attention to the ranges, rather than the values.
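For reference, here's a sketch of how a dotplot like the one above could be drawn; the five values come from the text, and the labels are mine:

```python
import matplotlib.pyplot as plt

labels = ["winter night-time avg", "night-time avg", "overall avg",
          "daytime avg", "summer daytime avg"]
values = [-3, 5, 9, 14, 24]          # degrees C, from the text

plt.scatter(values, [0] * len(values))
for v, lab in zip(values, labels):
    plt.annotate(f"{lab}\n{v} C", (v, 0), textcoords="offset points",
                 xytext=(0, 10), ha="center", fontsize=8)
plt.yticks([])
plt.xlabel("temperature (C)")
plt.show()
```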

***

Our basic model of temperature can be thought of as: temperature on a given day = overall average + adjustment for seasonality + adjustment for diurnality.

Take the top three values 9, 14, 24 from the above list. Starting at the overall average of 9 C, the analyst gets to 14 if he homes in on max daily temperatures, and to 24 if he further restricts the analysis to summer months (which have the higher temperatures). The second gap is 10 C, twice as large as the first gap of 5 C. Thus, the seasonal fluctuations have larger magnitude than daily fluctuations. Said differently, the effect of seasons on temperature is bigger than that of hour of day.

In interpreting the “ranges” or gaps between averages, narrow ranges suggest low variability while wider ranges suggest higher variability.

Here's a set of boxplots for the same data:

Junkcharts_redo_londonontariotemperatures

The boxplot "edges" also demarcate five values; they are not the same five values as defined by the Youtuber but both sets of five values describe the underlying distribution of temperatures.

 

P.S. For a different example of something similar, see this old post.


Aligning the visual and the message to hot things up

The headline of this NBC News chart (link) tells readers that Phoenix (Arizona) has been very, very hot this year. It has had over 120 days on which the average temperature exceeded 100F (38 C).

Nbcnews_phoenix_tmax

It's not obvious how extreme this situation is. To help readers, it would be useful to add some kind of reference points.

A couple of possibilities come to mind:

First, how many days are depicted in the chart? Since there is one cell for each day of the year, and the day of week is plotted down the vertical axis, we just need to count the number of columns. There are 38 columns, but the first column has one missing cell while the last column has only 3 cells. Thus, the number of days depicted is (36*7)+6+3 = 261. So, the average temperature in Phoenix exceeded 100F on about 46% of the days of the year thus far.

That sounds like a high number. For a better reference point, we'd also like to know the historical average. Is Phoenix just a very hot place? Is 2024 hotter than usual?

***

Let's walk through how one reads the Phoenix "heatmap".

We already figured out that each column represents a week of the year, and each row shows a cross-section of a given day of week throughout the year.

The first column starts on a Monday because the first day of 2024 falls on a Monday. The last column ends on a Tuesday, which corresponds to Sept 17, 2024, the last day of data when this chart was created.

The columns are grouped into months, although such division is complicated by the fact that the number of days in a month (except for a 28-day February) isn't divisible by seven. The designer subtly inserted a thicker border between months. This feature allows readers to comment on the average temperature in a given month. It also lets readers learn quickly that we are two weeks and three days into September.

The color legend explains that the temperature readings are mapped from yellow (lower) to red (higher). The range of average daily temperatures during 2024 was 54-118F (12-48C). The color scale is progressive.

Nbcnews_phoenix_colorlegend

Given that 100F is used as a threshold to define "hot days," it makes sense to accentuate this in the visual presentation. For example:

Junkcharts_redo_nbcnewsphoenixmaxtemp

Here, all days with maximum temperature at 100F or above have a red hue.
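One way to implement such a scheme (a sketch using randomly generated temperatures, not the NBC data) is a binned colormap whose boundaries include 100F, so that every cell at or above the threshold falls into a red band:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import BoundaryNorm, ListedColormap

rng = np.random.default_rng(0)
temps = rng.uniform(54, 118, size=(7, 38))     # fake day-of-week x week grid

cmap = ListedColormap(["#ffffcc", "#fed976", "#feb24c",   # yellows below 100F
                       "#fc4e2a", "#e31a1c", "#800026"])  # reds at 100F and above
norm = BoundaryNorm([54, 70, 85, 100, 106, 112, 118], cmap.N)

plt.imshow(temps, cmap=cmap, norm=norm, aspect="auto")
plt.colorbar(label="max temperature (F)")
plt.show()
```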


Adjust, and adjust some more

This Financial Times report illustrates the reason why we should adjust data.

The story explores the trend in economic statistics during 14 years of government by the Conservatives. One of those metrics is so-called council funding (funding for local governments). The graphic is interactive: as the reader scrolls the page, the chart transforms.

The first chart shows the "raw" data.

Ft_councilfunding1

The vertical axis shows the change in funding since 2010, expressed as an index relative to the 2010 level. From this line chart, one concludes that council funding decreased from 2010 to around 2016, then grew; by 2020, funding had recovered to the 2010 level, and it expanded rapidly in recent years.

When the reader scrolls down, this chart is replaced by another one:

Ft_councilfunding2

This chart paints a completely different picture. The line dropped from 2010 to 2016 as before. Then, it went flat, and after 2021, it started rising, even though by 2024, the value is still 10 percent below the 2010 level.

What happened? The data journalist has taken the data from the first chart, and adjusted the values for inflation. Inflation was rampant in recent years, thus some of the raw growth has been dampened. In economics, adjusting for inflation is also called expressing values in "real terms". The adjustment is necessary because the same dollar (hmm, pound) is worth less when there is inflation. Therefore, even though on paper, council funding in 2024 is more than 25 percent higher than in 2010, inflation has gobbled up all of that and more, to the point where, in real terms, council funding has fallen by 20 percent.

This is one material adjustment!

Wait, they have a third chart:

Ft_councilfunding3

It's unfortunate they didn't stabilize the vertical scale. Relative to the middle chart, the lowest point in this third chart is about 5 percent lower, while the value in 2024 is about 10 percent lower.

This means they performed a second adjustment, for population change. It is a simple adjustment of dividing by the population. The numbers look worse probably because the population has grown during these years. Thus, even if the amount of funding stayed the same, the money would have to be split amongst more people. The per-capita adjustment makes this point clear.
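The two adjustments can be sketched with made-up figures (not the FT's actual numbers): deflate nominal funding by a price index, then divide by population.

```python
import pandas as pd

df = pd.DataFrame({
    "year":       [2010, 2016, 2020, 2024],
    "nominal":    [100,   80,  100,  126],    # cash funding index (made up)
    "cpi":        [100,  112,  125,  158],    # price level (made up)
    "population": [100,  104,  107,  111],    # millions (made up)
})
df["real"] = df["nominal"] / df["cpi"] * 100           # adjustment 1: 2010 prices
df["real_per_capita"] = df["real"] / df["population"]  # adjustment 2: per person
print(df)
```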

***

The final story is much different from the initial one. Not only is the magnitude of change different, but the direction of change is reversed.

When it comes to adjustments, remember that all adjustments are subjective. In fact, choosing not to adjust is also subjective. Not adjusting is usually much worse.

 

 

 

 


Chart without an axis

When it comes to global warming, most reports cite a single number such as an average temperature rise of Y degrees by year X. Most reports also claim the existence of a consensus among scientists. The Guardian presented the following chart showing the spread of opinions amongst the experts.

Guardian_globalwarming

Experts were asked how many degrees they expect average global temperature to increase by 2100. The estimates ranged from "below 1.5 degrees" to "5 degrees or more". The most popular answer was 2.5 degrees. Roughly three out of four respondents picked a number at 2.5 degrees or above. The distribution is close to symmetric around the middle.

***

What kind of chart is this?

It's a type of histogram, given that the horizontal axis shows binned ranges of temperature change while the vertical dimension shows the number of respondents (out of 380).

A (count) histogram typically encodes the count data in the vertical axis. Did you notice there isn't a vertical axis?

That's because the chart has an abnormal axis. Each of the 380 respondents is shown here as a cell. What looks like a "column" is actually two-dimensional. Each row of cells has 10 slots. To find out how many respondents chose the 2.5 degrees category, you count the number of full rows, multiply by 10, and add the stray cells on top. (It's 132.)

Only the top row of cells can be partially filled so the general shape of the distribution isn't affected much. However, the lack of axis labels makes it hard to learn the count of each column.

It's even harder to know the proportions of respondents, which should be the primary message of the chart. The proportions would have been readable if the maximum number of rows were set to 38, since 38 full rows of 10 cells would represent all 380 respondents. The maximum number of rows on the above chart is 22. Using 38 rows leads to a chart with a lot of white space, as the tallest column (count of 132) is roughly 35% of the total responses.

In the end, I'm not sure this variant of the histogram beats the standard histogram.


Aligning V and Q by way of D

In the Trifecta Checkup (link), there is a green arrow between the Q (question) and V (visual) corners, indicating that they should align. This post illustrates what I mean by that.

I saw the following chart in a Washington Post article comparing dairy milk and plant-based "milks".

Vitamins

The article contains a whole series of charts. The one shown here focuses on vitamins.

The red color screams at the reader. At first, it appears to suggest that dairy milk is a standout on all four categories of vitamins. But that's not what the data say.

Let's take a look at the chart form: it's a grid of four plots, each containing one square for each of four types of "milk". The data are encoded in the areas of the squares. The red and green colors represent category labels and do not reflect data values.

Whenever we make bubble plots (the closest relative of these square plots), we have to solve a scale problem. What is the relationship between the scales of the four plots?

I noticed the largest square is the same size across all four plots. So, the size of each square is made relative to the maximum value in each plot, which is assigned a fixed size. In effect, the data encoding scheme is that the areas of the squares show the index values relative to the group maximum of each vitamin category. So, soy milk has 72% as much potassium as dairy milk while oat and almond milks have roughly 45% as much as dairy.

The same encoding scheme is also applied to riboflavin. Oat milk has the most riboflavin, so its square is the largest. Soy milk has 80% of oat's amount, while dairy has 60%.
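To make the two schemes concrete, here's a sketch with made-up riboflavin amounts chosen to roughly match the ratios quoted above:

```python
import pandas as pd

# made-up riboflavin amounts (mg per cup), roughly matching the ratios in the text
riboflavin = pd.Series({"oat": 0.50, "soy": 0.40, "dairy": 0.30, "almond": 0.02})

print((riboflavin / riboflavin.max()).round(2))        # index vs. group maximum (the chart's scheme)
print((riboflavin / riboflavin["dairy"]).round(2))     # index vs. dairy (the scheme proposed below)
```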

***

_trifectacheckup_image

Let's step back to the Trifecta Checkup (link). What's the question being asked in this chart? We're interested in the amount of vitamins found in plant-based milk relative to dairy milk. We're less interested in which type of "milk" has the highest amount of a particular vitamin.

Thus, I'd prefer the indexing tied to the amount found in dairy milk, rather than the maximum value in each category. The following set of column charts shows this encoding:

Junkcharts_redo_msn_dairyplantmilks_2

I changed the color coding so that blue columns represent higher amounts than dairy while yellow columns represent lower amounts.

From the column chart, we find that plant-based "milks" contain significantly less potassium and phosphorus than dairy milk while oat and soy "milks" contain more riboflavin than dairy. Almond "milk" has negligible amounts of riboflavin and phosphorus. There is virtually no difference between the four "milk" types in providing vitamin D.

***

In the above redo, I strengthen the alignment of the Q and V corners. This is accomplished by making a stop at the D corner: I change how the raw data are transformed into index values. 

Just for comparison, if I only change the indexing strategy but retain the square plot chart form, the revised chart looks like this:

Junkcharts_redo_msn_dairyplantmilks_1

The four squares showing dairy on this version have the same size. Readers can evaluate the relative sizes of the other "milk" types.