Partition of Europe

A long-time reader sent me the following map via twitter:

Europeelects_map

This map tells how the major political groups divide up the European Parliament. I’ll spare you the counting. There are 27 countries, and nine political groups (including the "unaffiliated").

The key chart type is a box of dots. Each country gets its own box. Each box has its own width. What determines the width? If you ask me, it’s the relative span of the countries on the map. For example, the narrow countries like Ireland and Portugal have three dots across while the wider countries like Spain, Germany and Italy have 7, 10 and 8 dots across respectively.

Each dot represents one seat in the Parliament. Each dot has one of 9 possible colors. Each color shows a political lean e.g. the green dots represent Green parties while the maroon dots display “Left” parties.

The end result is a counting game. If we are interested in counts of seats, we have to literally count each dot. If we are interested in proportion of seats, take your poison: either eyeball it or count each color and count the total.

Who does the underlying map serve? Only readers who know the map of Europe. If you don’t know where Hungary or Latvia is, good luck. The physical constraints of the map work against the small-multiples set up of the data. In a small multiples, you want each chart to be identical, except for the country-specific data. The small-multiples structure requires a panel of equal-sized cells. The map does not offer this feature, as many small countries are cramped into Eastern Europe. Also, Europe has a few tiny states e.g. Luxembourg (population 660K)  and Malta (population 520K). To overcome the map, the designer produces boxes of different sizes, substantially loading up the cognitive burden on readers.

The map also dictates where the boxes are situated. The centroids of each country form the scaffolding, with adjustments required when the charts overlap. This restriction ensures a disorderly appearance. By contrast, the regular panel layout of a small multiples facilitates comparisons.

***

Here is something I sketched using a tile map.

Eu parties print sm

First, I have to create a tile map of European countries. Some parts, e.g. western part, are straightforward. The eastern side becomes very congested.

The tile map encodes location in an imprecise sense. Think about the scaffolding of centroids of countries referred to prior. The tile map imposes an order to the madness - we're shifting these centroids so that they line up in a tidier pattern. What we gain in comparability we concede in location precision.

For the EU tile map, I decided to show the Baltic countries in a row rather than a column; the latter would have been more faithful to the true geography. Malta is shown next to Italy even though it could have been placed below. Similarly, Cyprus in relation to Greece. I also included several key countries that are not part of the EU for context.

Instead of raw seat counts, I'm showing the proportion of seats within each country claimed by each political group. I think this metric is more useful to readers.

The legend is itself a chart that shows the aggregate statistics for all 27 countries.


Visual story-telling: do you know or do you think?

One of the most important data questions of all time is: do you know? or do you think?

And one of the easiest traps to fall into is: I think, therefore I know.

***

Visual story-telling can be great but it can also mislead. Deception sometimes happens when readers are nudged to "fill in the blanks" with stuff they think they know, but they don't.

A Twitter reader asked me to look at the map in this Los Angeles Times (paywall) opinion column.

Latimes_lifeexpectancy_postcovid

The column promptly announces its premise:

Years of widening economic inequality, compounded by the pandemic and political storm and stress, have given Americans the impression that the country is on the wrong track. Now there’s empirical data to show just how far the country has run off the rails: Life expectancies have been falling.

The writer creates the expectation that he will reveal evidence in the form of data to show that life expectancies have been driven down by economic inequality, pandemic, and politics. Does he succeed?

***

The map portrays average life expectancy (at birth) for some mysterious, presumably very recent, year for every county in the United States. From the color legend, we learn that the bottom-to-top range is about 20 years. There is a clear spatial pattern, with the worst results in the south (excepting south Florida).

The choice of colors is telling. Red and blue on a U.S. map has heavy baggage, as they signify the two main political parties in the country. Given that the author believes politics to be a key driver of health outcomes, the usage of red and blue here is deliberate. Throughout the article, the columnist connects the lower life expectancies in southern states to its politics.

For example, he said "these geographical disparities aren't artifacts of pure geography or demographics; they're the consequences of policy decisions at the state level... Of the 20 states with the worst life expectancies, eight are among the 12 that have not implemented Medicaid expansion under the Affordable Care Act..."

Casual readers may fall into a trap here. There is nothing on the map itself that draws the connection between politics and life expectancies; the idea is evoked purely through the red-blue color scheme. So, as readers, we are filling in the blanks with our own politics.

What could have been done instead? Let's look at the life expectancy map side by side with the map of the U.S. 2020 Presidential election.

Junkcharts_lifeexpectancy_elections

Because of how close recent elections have been, we may think the political map has a nice balance of red and blue but it isn't. The Democrats' votes are heavily concentrated in densely-populated cities so most of the Presidential election map is red. When placed next to each other, it's obvious that politics don't explain the variance in life expectancy well. The Midwest is deep red and yet they have above average life expectancies. I have circled out various regions that contradict the claim that Republican politics drove life expectancies down.

It's not sufficient to point to the South, in which Republican votes and life expectancy are indeed inversely correlated. A good theory has to explain most of the country.

***

The columnist also suggests that poverty is the cause of low life expectancy. That too cannot be gleaned from the published map. Again, readers are nudged to use their wild imagination to fill in the blank.

Data come to the rescue. Here is a side-by-side comparison of the map of life expectancies and the map of median incomes.

Junkcharts_lifeexpectancy_income

A similar conundrum. While the story feels right in the South, it fails to explain the northwest, Florida, and various other parts of the country. Take a look again at the circled areas. Lower income brackets are also sometimes associated with high life expectancies.

***

The author supplies a third cause of lower life expectancies: Covid-19 response. Because Covid-19 was the "most obvious and convenient" explanation for the loss of life expectancy during the pandemic, this theory suggests that the red areas on the life expectancy map should correspond to the regions most ravaged by Covid-19.

Let's see the data.

Junkcharts_lifeexpectancy_covidcases

The map on the right shows the number of confirmed cases until June 2021. As before, the correlation holds somewhat in the South but there are notable exceptions, e.g. the Midwest. We also have states with low Covid-19 cases but below-average life expectancy.

***

What caused the decline of life expectancy in the U.S. - which began before the pandemic, and has continued beyond - is highly complex, beyond what a single map or a pair of maps or a few pairs of maps could convey. Showing a red-blue map presents a trap for readers to fall into, in which they start thinking, without knowing.

 


Graph workflow and defaults wreak havoc

For the past week or 10 days, every time I visited one news site, it insisted on showing me an article about precipitation in North Platte. It's baiting me to write a post about this lamentable bar chart (link):

Northplatte_rainfall

***

This chart got problems, and the problems start with the tooling, which dictates a workflow.

I imagine what the chart designer had to deal with.

For a bar chart, the tool requires one data series to be numeric, and the other to be categorical. A four-digit year is a number, which can be treated either as numeric or categorical. In most cases, and by default, numbers are considered numeric. To make this chart, the user asked the tool to treat years as categorical.

Junkcharts_northplattedry_datatypes

Many tools treat categories as distinct entities ("nominal"), mapping each category to a distinct color. So they have 11 colors for 11 years, which is surely excessive.

This happens because the year data is not truly categorical. These eleven years were picked based on the amount of rainfall. There isn't a single year with two values, it's not even possible. The years are just irregularly spaced indices. Nevertheless, the tool misbehaves if the year data are regarded as numeric. (It automatically selects a time-series line chart, because someone's data visualization flowchart says so.) Mis-specification in order to trick the tool has consequences.

The designer's intention is to compare the current year 2023 to the driest years in history. This is obvious from the subtitle in which 2023 is isolated and its purple color is foregrounded.

Junkcharts_northplattedry_titles

How unfortunate then that among the 11 colors, this tool grabbed 4 variations of purple! I like to think that the designer wanted to keep 2023 purple, and turn the other bars gray -- but the tool thwarted this effort.

Junkcharts_northplattedry_purples

The tool does other offensive things. By default, it makes a legend for categorical data. I like the placement of the legend right beneath the title, a recognition that on most charts, the reader must look at the legend first to comprehend what's on the chart.

Not so in this case. The legend is entirely redundant. Removing the legend does not affect our cognition one bit. That's because the colors encode nothing.

Worse, the legend sows confusion because it presents the same set of years in chronological order while the bars below are sorted by amount of precipitation: thus, the order of colors in the legend differs from that in the bar chart.

Junkcharts_northplattedry_legend

I can imagine the frustration of the designer who finds out that the tool offers no option to delete the legend. (I don't know this particular tool but I have encountered tools that are rigid in this manner.)

***

Something else went wrong. What's the variable being plotted on the numeric (horizontal) axis?

The answer is inches of rainfall but the answer is actually not found anywhere on the chart. How is it possible that a graphing tool does not indicate the variables being plotted?

I imagine the workflow like this: the tool by default puts an axis label which uses the name of the column that holds the data. That column may have a name that is not reader-friendly, e.g. PRECIP. The designer edits the name to "Rainfall in inches". Being a fan of the Economist graphics style, they move the axis label to the chart title area.

The designer now works the chart title. The title is made to spell out the story, which is that North Platte is experiencing a historically dry year. Instead of mentioning rainfall, the new title emphasizes the lack thereof.

The individual steps of this workflow make a lot of sense. It's great that the title is informative, and tells the story. It's great that the axis label was fixed to describe rainfall in words not database-speak. But the end result is a confusing mess.

The reader must now infer that the values being plotted are inches of rainfall.

Further, the tool also imposes a default sorting of the bars. The bars run from longest to shortest, in this case, the longest bar has the most rainfall. After reading the title, our expectation is to find data on the Top 11 driest years, from the driest of the driest to the least dry of the driest. But what we encounter is the opposite order.

Junkcharts_northplattedry_sorting

Most graphics software behaves like this as they are plotting the ranks of the categories with the driest being rank 1, counting up. Because the vertical axis moves upwards from zero, the top-ranked item ends up at the bottom of the chart.

***

_trifectacheckup_imageMoving now from the V corner to the D corner of the Trifecta checkup (link), I can't end this post without pointing out that the comparisons shown on the chart don't work. It's the first few months of 2023 versus the full years of the others.

The fix is to plot the same number of months for all years. This can be done in two ways: find the partial year data for the historical years, or project the 2023 data for the full year.

(If the rainy season is already over, then the chart will look exactly the same at the end of 2023 as it is now. Then, I'd just add a note to explain this.)

***

Here is a version of the chart after doing away with unhelpful default settings:


Redo_junkcharts_northplattedry


Deconstructing graphics as an analysis tool in dataviz

One of the useful exercises I like to do with charts is to "deconstruct" them. (This amounts to a deeper version of the self-sufficiency test.)

Here is a chart stripped down to just the main visual elements.

Junkcharts_cbcrevenues_deconstructed1

The game is to guess what is the structure of the data given these visual elements.

I guessed the following:

  • The data has a top-level split into two groups
  • Within each group, the data is further split into 3 parts, corresponding to the 3 columns
  • With each part, there are a variable number of subparts, each of which is given a unique color
  • The color legend suggests that each group's data are split into 7 subparts, so I'm guessing that the 7 subparts are aggregated into 3 parts
  • The core chart form is a stacked column chart with absolute values so relative proportions within each column (part) is important
  • Comparing across columns is not supported because each column has its own total value
  • Comparing same-color blocks across the two groups is meaningful. It's easier to compare their absolute values but harder to compare the relative values (proportions of total)

If I knew that the two groups are time periods, I'd also guess that the group on the left is the earlier time period, and the one on the right is the later time period. In addition to the usual left-to-right convention for time series, the columns are getting taller going left to right. Many things (not all, obviously) grow over time.

The color choice is a bit confusing because if the subparts are what I think they are, then it makes more sense to use one color and different shades within each column.

***

The above guesses are a mixed bag. What one learns from the exercise is what cues readers are receiving from the visual structure.

Here is the same chart with key contextual information added back:

Junkcharts_cbcrevenues_deconstructed2

Now I see that the chart concerns revenues of a business over two years.

My guess on the direction of time was wrong. The more recent year is placed on the left, counter to convention. This entity therefore suffered a loss of revenues from 2017-8 to 2018-9.

The entity receives substantial government funding. In 2017-8, it has 1 dollar of government funds for every 2 dollars of revenues. In 2018-9, it's roughly 2 dollars of government funds per every 3 dollars of revenues. Thus, the ratio of government funding to revenues has increased.

On closer inspection, the 7 colors do not represent 7 components of this entity's funding. The categories listed in the color legend overlap.

It's rather confusing but I missed one very important feature of the chart in my first assessment: the three columns within each year group are nested. The second column breaks down revenues into 3 parts while the third column subdivides advertising revenues into two parts.

What we've found is that this design does not offer any visual cues to help readers understand how the three columns within a year-group relates to each other. Adding guiding lines or changing the color scheme helps.

***

Next, I add back the data labels:

Cbc_revenues_original

The system of labeling can be described as: label everything that is not further broken down into parts on the chart.

Because of the nested structure, this means two of the column segments, which are the sums of subparts, are not labeled. This creates a very strange appearance: usually, the largest parts are split into subparts, so such a labeling system means the largest parts/subparts are not labeled while the smaller, less influential, subparts are labeled!

You may notice another oddity. The pink segment is well above $1 billion but it is roughly the size of the third column, which represents $250 million. Thus, these columns are not drawn to scale. What happened? Keep reading.

***

Here is the whole chart:

Cbc_revenues_original

A twitter follower sent me this chart. Elon Musk has been feuding with the Canadian broadcaster CBC.

Notice the scale of the vertical axis. It has a discontinuity between $700 million and $1.7 billion. In other words, the two pink sections are artificially shortened. The erased section contains $1 billion (!) Notice that the erased section is larger than the visible section.

The focus of Musk's feud with CBC is on what proportion of the company's funds come from the government. On this chart, the only way to figure that out is to copy out the data and divide. It's roughly 1.2/1.7 = 70% approx.

***

The exercise of deconstructing graphics helps us understand what parts are doing what, and it also reveals what cues certain parts send to readers.

In better dataviz, every part of the chart is doing something useful, it's free of redundant parts that take up processing time for no reason, and the cues to readers move them towards the intended message, not away from it.

***

A couple of additional comments:

I'm not sure why old data was cited because in the most recent accounting report, the proportion of government funding was around 65%.

Source of funding is not a useful measure of pro- or anti-government bias, especially in a democracy where different parties lead the government at different times. There are plenty of mouthpiece media that do not apparently receive government funding.


Showing both absolute and relative values on the same chart 1

Visual Capitalist has a helpful overview on the "uninsured" deposits problem that has become the talking point of the recent banking crisis. Here is a snippet of the chart that you can see in full at this link:

Visualcapitalist_uninsureddeposits_top

This is in infographics style. It's a bar chart that shows the top X banks. Even though the headline says "by uninsured deposits", the sort order is really based on the proportion of deposits that are uninsured, i.e. residing in accounts that exceed $250K.  They used a red color to highlight the two failed banks, both of which have at least 90% of deposits uninsured.

The right column provides further context: the total amounts of deposits, presented both as a list of numbers as well as a column of bubbles. As readers know, bubbles are not self-sufficient, and if the list of numbers were removed, the bubbles lost most of their power of communication. Big, small, but how much smaller?

There are little nuggets of text in various corners that provide other information.

Overall, this is a pretty good one as far as infographics go.

***

I'd prefer to elevate information about the Too Big to Fail banks (which are hiding in plain sight). Addressing this surfaces the usual battle between relative and absolute values. While the smaller banks have some of the highest concentrations of uninsured deposits, each TBTF bank has multiples of the absolute dollars of uninsured deposits as the smaller banks.

Here is a revised version:

Redo_visualcapitalist_uninsuredassets_1

The banks are still ordered in the same way by the proportions of uninsured value. The data being plotted are not the proportions but the actual deposit amounts. Thus, the three TBTF banks (Citibank, Chase and Bank of America) stand out of the crowd. Aside from Citibank, the other two have relatively moderate proportions of uninsured assets but the sizes of the red bars for any of these three dominate those of the smaller banks.

Notice that I added the gray segments, which portray the amount of deposits that are FDIC protected. I did this not just to show the relative sizes of the banks. Having the other part of the deposits allow readers to answer additional questions, such as which banks have the most insured deposits? They also visually present the relative proportions.

***

The most amazing part of this dataset is the amount of uninsured money. I'm trying to think who these account holders are. It would seem like a very small collection of people and/or businesses would be holding these accounts. If they are mostly businesses, is FDIC insurance designed to protect business deposits? If they are mostly personal accounts, then surely only very wealthy individuals hold most of these accounts.

In the above chart, I'm assuming that deposits and assets are referring to the same thing. This may not be the correct interpretation. Deposits may be only a portion of the assets. It would be strange though that the analysts only have the proportions but not the actual deposit amounts at these banks. Nevertheless, until proven otherwise, you should see my revision as a sketch - what you can do if you have both the total deposits and the proportions uninsured.


Bivariate choropleths

A reader submitted a link to Joshua Stephen's post about bivariate choropleths, which is the technical term for the map that FiveThirtyEight printed on abortion bans, discussed here. Joshua advocates greater usage of maps with two-dimensional color scales.

As a reminder, the fundamental building block is expressed in this bivariate color legend:

Fivethirtyeight_abortionmap_colorlegend

Counties are classified into one of these nine groups, based on low/middle/high ratings on two dimensions, distance and congestion.

The nine groups are given nine colors, built from superimposing shades of green and pink. All nine colors are printed on the same map.

Joshuastephens_singlemap

Without a doubt, using these nine related colors are better than nine arbitrary colors. But is this a good data visualization?

Specifically, is the above map better than the pair of maps below?

Joshuastephens_twomaps

The split map is produced by Josh to explain that the bivariate choropleth is just the superposition of two univariate choropleths. I much prefer the split map to the superimposed one.

***

Think about what the reader goes through when comparing two counties.

Junkcharts_bivariatechoropleths

Superimposing the two univariate maps solves one problem: it removes the need to scan back and forth between two maps, looking for the same locations, something that is imprecise. (Unless, the map is interactive, and highlighting one county highlights the same county in the other map.)

For me, that's a small price to pay for quicker translation of color into information.

 

 


Yet another off radar plot

Bloomberg compares people's lives in retirement in this interesting dataviz project (link, paywall). The "showcase" chart is a radar plot that looks like this:

Bloomberg_retirementages_radar_male

The radar plot may count as the single chart type that has the most number of lives. I'm afraid this one does not go into the hall of fame, either.

The setup leading to this plot is excellent, though. The analytical framework is to divide the retirement period into two parts: healthy and not so healthy. The countries in the radar plot are in fact ordered by the duration of the "healthy retirement period", with France leading the pack. The reference levels used throughout the article is the OECD average. On average, the OECD resident retires at age 64, and dies at age 82, so they spend 18 years in retirement, and 13 of them while "healthy".

In the radar plot, the three key dates are plotted as yellow, green and purple dots. The yellow represents the retirement age, the green, the end of the healthy period, and the purple, the end of life.

Now, take 10, 20, 30 seconds, and try to come up with a message for the above chart.

Not easy at all.

***

Notice the control panel up top. The male and female data are plotted separately. I place the two segments next to each other:

Bloomberg_retirementages_radar_malefemale

It's again hard to find any insight - other than the most obvious, which is that female life expectancy is higher.

But note that the order for the countries is different for each chart, and so even the above statement takes a bit of time to verify.

***

There are many structural challenges to using radar charts. I'll cover one of these here - the amount of non data-ink baggage that comes with using this chart form.

In the Bloomberg example, the baggage includes radial gridlines for countries, concentric gridlines for the years dimension, the country labels around the circle, the age labels in the middle, the color legend, the set of arrows that map to the healthy retirement period, and the country ranks (and little arrow) that indicate the direction of reading. That's a lot of information to process.

In the next post, I'll try a different visual form.

 

 


Area chart is not the solution

A reader left a link to a Wiki chart, which is ghastly:

House_Seats_by_State_1789-2020_Census

This chart concerns the trend of relative proportions of House representatives in the U.S. Congress by state, and can be found at this Wikipedia entry. The U.S. House is composed of Representatives, and the number of representatives is roughly proportional to each state's population. This scheme actually gives small states disporportional representation, since the lowest number of representatives is 1 while the total number of representatives is fixed at 435.

We can do a quick calculation: 1/435 = 0.23% so any state that has less than 0.23% of the population is over-represented in the House. Alaska, Vermont and Wyoming are all close to that level. The primary way in which small states get larger representation is via the Senate, which sits two senators per state no matter the size. (If you've wondered about Nate Silver's website: 435 Representatives + 100 Senators + 3 for DC = 538 electoral votes for U.S. Presidental elections.)

***

So many things have gone wrong with this chart. There are 50 colors for 50 states. The legend arranges the states by the appropriate metric (good) but in ascending order (bad). This is a stacked area chart, which makes it very hard to figure out the values other than the few at the bottom of the chart.

A nice way to plot this data is a tile map with line charts. I found a nice example that my friend Xan put together in 2018:

Xang_cdcflu_tilemap_lines

A tile map is a conceptual representation of the U.S. map in which each state is represented by equal-sized squares. The coordinates of the states are distorted in order to line up the tiles. A tile map is a small-multiples setup in which each square contains a chart of the same design to faciliate inter-state comparisons.

In the above map, Xan also takes advantage of the foregrounding concept. Each chart actually contains all 50 lines for every state, all shown in gray while the line for the specific state is bolded and shown in red.

***

A chart with 50 lines looks very different from one with 50 areas stacked on each other. California, the most populous state, has 12% of the total population so the line chart has 50 lines that will look like spaghetti. Thus, the fore/backgrounding is important to make sure it's readable.

I suspect that the designer chose a stacked area chart because the line chart looked like spaghetti. But that's the wrong solution. While the lines no longer overlap each other, it is a real challenge to figure out the state-level trends - one has to focus on the heights of the areas, rather than the boundary lines.

[P.S. 2/27/2023] As we like to say, a picture is worth a thousand words. Twitter reader with the handle LHZGJG made the tile map I described above. It looks like this:

Lhzgjg_redo_houseapportionment

You can pick out the states with the key changes really fast. California, Texas, Florida on the upswing, and New York, Pennsylvania going down. I like the fact that the state names are spelled out. Little tweaks are possible but this is a great starting point. Thanks LHZGJG! ]

 


If you blink, you'd miss this axis trick

When I set out to write this post, I was intending to make a quick point about the following chart found in the current issue of Harvard Magazine (link):

Harvardmag_humanities

This chart concerns the "tectonic shift" of undergraduates to STEM majors at the expense of humanities in the last 10 years.

I like the chart. The dot plot is great for showing this data. They placed the long text horizontally. The use of color is crucial, allowing us to visually separate the STEM majors from the humanities majors.

My intended post is to suggest dividing the chart into four horizontal slices, each showing one of the general fields. It's a small change that makes the chart even more readable. (It has the added benefit of not needing a legend box.)

***

Then, the axis announced itself.

I was baffled, then disgusted.

Here is a magnified view of the axis:

Harvardmag_humanitiesmajors_axis

It's not a linear scale, as one would have expected. What kind of transformation did they use? It's baffling.

Notice the following features of this transformed scale:

  • It can't be a log scale because many of the growth values are negative.
  • The interval for 0%-25% is longer than for 25%-50%. The interval for 0%-50% is also longer than for 50%-100%. On the positive side, the larger values are pulled in and the smaller values are pushed out.
  • The interval for -20%-0% is the same length as that for 0%-25%. So, the transformation is not symmetric around 0

I have no idea what transformation was applied. I took the growth values, measured the locations of the dots, and asked Excel to fit a polynomial function, and it gave me a quadratic fit, R square > 99%.

Redo_harvardmaghumanitiesmajors_scale2

This formula fits the values within the range extremely well. I hope this isn't the actual transformation. That would be disgusting. Regardless, they ought to have advised readers of their unusual scale.

***

Without having the fitted formula, there is no way to retrieve the actual growth values except for those that happen to fall on the vertical gridlines. Using the inverse of the quadratic formula, I deduced what the actual values were. The hardest one is for Computer Science, since the dot sits to the right of the last gridline. I checked that value against IPEDS data.

The growth values are not extreme, falling between -50% and 125%. There is nothing to be gained by transforming the scale.

The following chart undoes the transformation, and groups the majors by field as indicated above:

Redo_harvardmagazine_humanitiesmajors

***

Yesterday, I published a version of this post at Andrew's blog. Several readers there figured out that the scale is the log of the relative ratio of the number of degrees granted. In the above notation, it is log10(100%+x), where x is the percent change in number of degrees between 2011 and 2021.

Here is a side-by-side view of the two scales:

Redo_harvardmaghumanitiesmajors_twoscales

The chart on the right spreads the negative growth values further apart while slightly compressing the large positive values. I still don't think there is much benefit to transforming this set of data.

 

P.S. [1/31/2023]

(1) A reader on Andrew's blog asked what's wrong with using the log relative ratio scale. What's wrong is exactly what this post is about. For any non-linear scale, the reader can't make out the values between gridlines. In the original chart, there are four points that exist between 0% and 25%. What values are those? That chart is even harder because now that we know what the transform is, we'd need to first think in terms of relative ratios, so 1.25 instead of 25%, then think in terms of log, if we want to know what those values are.

(2) The log scale used for change values is often said to have the advantage that equal distances on either side represent counterbalancing values. For example, (1.5) (0.66) = (3/2) (2/3)  = 1. But this is a very specific scenario that doesn't actually apply to our dataset.  Consider these scenarios:

History: # degrees went from 1000 to 666 i.e. Relative ratio = 2/3
Psychology: # degrees went from 2000 to 3000 i.e. Relative ratio = 3/2

The # of History degrees dropped by 334 while the number of Psychology degrees grew by 1000 (Psychology I think is the more popular major)

History: # degrees went from 1000 to 666 i.e. Relative ratio = 2/3
Psychology: from 1000 to 1500, i.e. Relative ratio = 3/2

The # of History degrees dropped by 334 while # of Psychology degrees grew by 500
(Assume same starting values)

History: # degrees went from 1000 to 666 i.e. Relative ratio = 2/3
Psychology: from 666 to 666*3/2 = 999 i.e. Relative ratio = 3/2

The # of History degrees dropped by 334 while # of Psychology degrees grew by 333
(Assume Psychology's starting value to be History's ending value)

Psychology: # degrees went from 1000 to 1500 i.e. Relative ratio = 3/2
History: # degrees went from 1500 to 1000 i.e. Relative ratio = 2/3

The # of Psychology degrees grew by 500 while the # of History degrees dropped by 500
(Assume History's starting value to be Psychology's ending value)

 

 


Longest life, shortest length

Racetrack charts refuse to die. For old time's sake, here is a blog post from 2005 in which I explain why they don't make good dataviz.

Our latest example comes from Visual Capitalist (link), which publishes a fair share of nice dataviz. In this infographics, they feature a racetrack chart, just because the topic is the lifespan of cars.

Visualcapitalist_lifespan_cars_top

The whole infographic has four parts, each a racetrack chart. I'll focus on the first racetrack chart (shown above), which deals with the product category of sedans and hatchbacks.

The first thing I noticed is the reference value of 100,000 miles, which is described as the expected lifespan of a typical car made in the 1970s. This is of dubious value since the top of the page informs us the current relevant reference value is 200,000 miles, which is unlabeled. We surmise that 200,000 miles is indicated by the end of the grey sections of the racetrack. (This is eventually confirmed in the next racettrack chart for SUVs in the second sectiotn of the infographic.)

Now let's zoom in on the brown section of the track. Each of the four sections illustrates the same datum = 100,000 miles and yet they exhibit different lengths. From this, we learn that the data are not encoded in the lengths of these tracks -- but rather the data are to be found in the angle sustained at the centre of the concentric circles. The problem with racetrack charts is that readers are drawn to the lengths of the tracks rather than the angles at the center, which are not explicitly represented.

The Avalon model has the longest life span on this chart, and yet it is shown as the shortest curve.

***

The most baffling part of this chart is not the visual but the analysis methodology.

I quote:

iSeeCars analyzed over 2M used cars on the road between Jan. and Oct. 2022. Rankings are based on the mileage that the top 1% of cars within each model obtained.

According to this blurb, the 245,710 miles number for Avalon is the average mileage found in the top 1% of Avalons within the iSeeCars sample of 2M used cars.

The word "lifespan" strikes me as incorporating a date of death, and yet nothing in the above text indicates that any of the sampled cars are at end of life. The cars they really need are not found in their sample at all.

I suppose taking the top 1% is meant to exclude younger cars but why 1%? Also, this sample completely misses the cars that prematurely died, e.g. the cars that failed after 100,000 miles but before 200,000 miles. This filtering also ensures that newer models are excluded from the sample.

_trifectacheckup_imageIn the Trifecta Checkup, this qualifies as Type DV. The dataset does not answer the question of concern while the visual form distorts the data.