Finding the story in complex datasets

In CT Mirror's feature about Connecticut, which I wrote about in the previous post, there is one graphic that did not rise to the same level as the others.

Ctmirror_highschools

This section deals with graduation rates of the state's high school districts. The above chart focuses on exactly five districts. The line charts are organized in a stack. No year labels are provided. The time window is 11 years from 2010 to 2021. The column of numbers shows the difference in graduation rates over the entire time window.

The five lines look basically the same, if we ignore what looks to be noisy year-to-year fluctuations. This is due to the weird aspect ratio imposed by stacking.

Why are those five districts chosen? Upon investigation, we learn that these are the five districts with the biggest improvement in graduation rates during the 11-year time window.

The same five schools also had some of the lowest graduation rates at the start of the analysis window (2010). This must be so because if a school graduated 90% of its class in 2010, it would be mathematically impossible for it to attain a 35 percentage point improvement! This is an unsatisfactory feature of the dataviz.

***

In preparing an alternative version, I start by imagining how readers might want to utilize a visualization of this dataset. I assume that the readers may have certain school(s) they are particularly invested in, and want to see its/their graduation performance over these 11 years.

How does having the entire dataset help? For one thing, it provides context. What kind of context is relevant? As discussed above, it's futile to compare a school at the top of the ranking to one that is near the bottom. So I created groups of schools. Each school is compared to other schools that had comparable graduation rates at the start of the analysis period.
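The grouping idea can be sketched in a few lines of code. This is only an illustration: the 50-74% and 85-89% bands correspond to groups shown in the charts in this post, while the remaining band edges and the function names are assumptions. The second function restates why a district's starting rate caps the gain it can possibly show.

```python
def starting_group(rate_2010: float) -> str:
    """Assign a district to a peer group based on its 2010 graduation rate (in %)."""
    if rate_2010 >= 90:
        return "90%+"       # assumed band
    if rate_2010 >= 85:
        return "85-89%"
    if rate_2010 >= 75:
        return "75-84%"     # assumed band
    if rate_2010 >= 50:
        return "50-74%"
    return "below 50%"      # assumed band


def max_possible_gain(rate_2010: float) -> float:
    """Largest improvement (in percentage points) still available to a district."""
    return 100 - rate_2010


# A district starting at 58% can still gain up to 42 points;
# one starting at 90% can gain at most 10.
print(starting_group(58), max_possible_gain(58))
print(starting_group(90), max_possible_gain(90))
```

Comparing districts only within the same starting group keeps this ceiling from driving the ranking.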

Amistad School District, which takes pole position in the original dataviz, graduated only 58% of its pupils in 2010 but vastly improved its graduation rate by 35 percentage points over the decade. In the chart below (left panel), I plotted all of the schools that had graduation rates between 50 and 74% in 2010. The chart shows that while Amistad is a standout, almost all schools in this group experienced steady improvements. (Whether this phenomenon represents true improvement, or just grade inflation, we can't tell from this dataset alone.)

Redo_junkcharts_ctmirrorhighschoolsgraduation_1

The right panel shows the group of schools with the next higher level of graduation rates in 2010. Almost all schools in this group also increased their graduation rates, although the rate of improvement is lower than in the previous group of schools.

The next set of charts shows school districts that already achieved excellent graduation rates (over 85%) by 2010. The most interesting group of schools consists of those with 85-89% rates in 2010. Their performance in 2021 is the most unpredictable of all the school groups. The majority of districts did even better while others regressed.

Redo_junkcharts_ctmirrorhighschoolsgraduation_2

Overall, there is less variability than I'd expect in the top two school groups. They generally appeared to have been able to raise or maintain their already-high graduation rates. (Note that the scale of each chart is different, and many of the lines in the second set of charts are moving within a few percentage points.)

One more note about the charts: The trend lines are "smoothed" to focus on the trends rather than the year-to-year variability. Because of smoothing, there is some awkward-looking imprecision, e.g. the end-to-end differences read from the curves versus the observed differences in the data. These discrepancies could easily be fixed if these charts were to be published.


Thoughts on Daniel's fix for dual-axes charts

I've taken a little time to ponder Daniel Z's proposed "fix" for dual-axes charts (link). The example he used is this:

Danielzvinca_dualaxes_linecolumn

In that long post, Daniel explained why he preferred to mix a line with columns, rather than using the more common dual lines construction: to prevent readers from falsely attributing meaning to crisscrossing lines. There are many issues with dual-axes charts, which I won't repeat in this post; one of their most dissatisfying features is the lack of connection between the two vertical scales, and thus, it's pretty easy to manufacture an image of correlation when it doesn't exist. As shown in this old post, one can expand or restrict one of the vertical axes and shift the line up and down to "match" the other vertical axis.

Daniel's proposed fix retains the dual axes, and he even restores the dual lines construction.

Danielzvinca_dualaxes_estimatedy

How is this chart different from the typical dual-axes chart, like the first graph in this post?

Recall that the problem with using two axes is that the designer could squeeze, expand or shift one of the axes in any number of ways to manufacture many realities. What Daniel effectively did here is to select one specific way to transform the "New Customers" axis (shown in gray).

His idea is to run a simple linear regression between the two time series. Think of fitting a "trendline" in Excel between Revenues and New Customers. Then, use the resulting regression equation to compute "estimated" revenues based on the New Customers series. The coefficients of this regression equation then determine the degree of squeezing/expansion and shifting applied to the New Customers axis.

The main advantage of this "fix" is to eliminate the freedom to manufacture multiple realities. There is exactly one way to transform the New Customers axis.

The chart itself takes a bit of time to get used to. The actual values plotted in the gray line are "estimated revenues" from the regression model, thus the blue axis values on the left apply to the gray line as well. The gray axis shows the respective customer values. Because we performed a linear fit, each value of estimated revenues corresponds to a particular customer value. The gray line is thus a squeezed/expanded/shifted replica of the New Customers line (shown in orange in the first graph). The gray line can then be interpreted on two connected scales, and both the blue and gray labels are relevant.
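To make the mechanics concrete, here is a minimal sketch of the regression-based axis transform. The data values are hypothetical (they are not the numbers in Daniel's chart), and the function names are mine; the fit is an ordinary least squares line, which is what an Excel "trendline" produces.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical series for illustration only
new_customers = [100, 120, 90, 150, 130]
revenues = [50.0, 58.0, 47.0, 70.0, 62.0]

intercept, slope = fit_line(new_customers, revenues)

# The gray line: estimated revenues, plotted against the revenue axis
estimated_revenues = [intercept + slope * c for c in new_customers]


def customer_tick_position(customers):
    """Where a customer-axis label sits, expressed in revenue-axis units."""
    return intercept + slope * customers


def customers_at(revenue_level):
    """Inverse mapping: the customer value aligned with a given revenue level."""
    return (revenue_level - intercept) / slope
```

The slope controls the squeezing or expansion of the customer axis and the intercept controls the shift, so once the regression is fitted there is nothing left for the designer to adjust.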

***

What are we staring at?

The blue line shows the observed revenues while the gray line displays the estimated revenues (predicted by the regression line). Thus, the vertical gaps between the two lines are the "residuals" of the regression model, i.e. the estimation errors. If you have studied Statistics 101, you may remember that the residuals are the components that make up the R-squared, which measures the quality of fit of the regression model. R-squared is the square of r, which stands for the correlation between Customers and the observed revenues. Thus the higher the (linear) correlation between the two time series, the higher the R-squared, the better the regression fit, the smaller the gaps between the two lines.
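The link between the gaps and the correlation can be checked numerically. The sketch below uses hypothetical data (not the series in the chart) and computes R-squared from the residuals, then compares it with the square of the Pearson correlation; for a simple linear regression with an intercept, the two agree.

```python
from math import sqrt

# Hypothetical series for illustration only
new_customers = [100, 120, 90, 150, 130]
revenues = [50.0, 58.0, 47.0, 70.0, 62.0]

n = len(revenues)
mean_x = sum(new_customers) / n
mean_y = sum(revenues) / n
sxx = sum((x - mean_x) ** 2 for x in new_customers)
syy = sum((y - mean_y) ** 2 for y in revenues)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(new_customers, revenues))

# Least squares fit and the estimated revenues (the gray line)
slope = sxy / sxx
intercept = mean_y - slope * mean_x
estimated = [intercept + slope * x for x in new_customers]

# Residuals are the vertical gaps between observed and estimated revenues
residuals = [y - e for y, e in zip(revenues, estimated)]
ss_res = sum(r ** 2 for r in residuals)

r_squared = 1 - ss_res / syy     # quality of fit from the gaps
r = sxy / sqrt(sxx * syy)        # Pearson correlation of the two series

print(round(r_squared, 4), round(r ** 2, 4))
```

Smaller gaps mean a smaller residual sum of squares, which pushes R-squared toward 1.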

***

There is some value to this chart, although it'd be challenging to explain to someone who has not taken Statistics 101.

While I like that this linear regression approach is "principled", I wonder why this transformation should be preferred to all others. I don't have an answer to this question yet.

***

Daniel's fix reminds me of a different, but very common, chart.

Forecastvsactualinflationchart

This chart shows actual vs forecasted inflation rates. This chart has two lines but only needs one axis since both lines represent inflation rates in the same range.

We can think of the "estimated revenues" line above as forecasted or expected revenues, based on the actual number of new customers. In particular, this forecast is based on a specific model: one that assumes that revenues is linearly related to the number of new customers. The "residuals" are forecasting errors.

In this sense, I think Daniel's solution amounts to rephrasing the question of the chart from "how closely are revenues and new customers correlated?" to "given the trend in new customers, are we over- or under-performing on revenues?"

Instead of using the dual-axes chart with two different scales, I'd prefer to answer the question by showing this expected vs actual revenues chart with one scale.

This does not eliminate the question about the "principle" behind the estimated revenues, but it makes clear that the challenge is to justify why revenues is a linear function of new customers, and no other variables.

Unlike the dual-axes chart, the actual vs forecasted chart is independent of the forecasting method. One can produce forecasted revenues based on a complicated function of new customers, existing customers, and any other factors. A different model just changes the shape of the forecasted revenues line. We still have two comparable lines on one scale.


All about Connecticut

This dataviz project by CT Mirror is excellent. The project walks through key statistics of the state of Connecticut.

Here are a few charts I enjoyed.

The first one shows the industries employing the most CT residents. The left and right arrows are perfect, much better than the usual dot plots.

Ctmirror_growingindustries

The industries are sorted by decreasing size from top to bottom, based on employment in 2019. The chosen scale is absolute, showing the number of employees. The relative change is shown next to the arrow heads in percentages.

The inclusion of both absolute and relative scales may be a source of confusion as the lengths of the arrows encode the absolute differences, not the relative differences indicated by the data labels. This type of decision is always difficult for the designer. Selecting one of the two scales may improve clarity but induce loss aversion.

***

The next example is a bumps chart showing the growth in residents with at least a bachelor's degree.

Ctmirror_highered

This is more like a slopegraph as it appears to draw straight lines between two time points 9 years apart, omitting the intervening years. Each line represents a state. Connecticut's line is shown in red. The message is clear. Connecticut is among the most highly educated out of the 50 states. It maintained this advantage throughout the period.

I'd prefer to use solid lines for the background states, and the axis labels can be sparser.

It's a little odd that pretty much every line has the same slope. I suspect that the numbers came out of a regression model, with varying slopes by state, but the inter-state variance is low.

In the online presentation, one can click on each line to see the values.

***

The final example is a two-sided bar chart:

Ctmirror_migration

This shows migration in and out of the state. The red bars represent the number of people who moved out, while the green bars represent those who moved into the state. The states are arranged from the largest number of in-migrants to the smallest.

I have clipped the bottom of the chart as it extends to 50 states, and the bottom half is barely visible since the absolute numbers are so small.

I'd suggest showing the top 10 states. Then group the rest of the states by region, and plot them as regions. This change makes the chart more compact, as well as more useful.

***

There are many other charts, and I encourage you to visit and support this data journalism.


Yet another off radar plot 2

In the last post, I described my experience reading the radar plot, by Bloomberg Graphics, that compares countries in terms of their citizens' post-retirement lives.

Bloomberg_retirementages_radar_male

I used a different approach:

Redo_bloomberg_retirementages_radar_male

Instead of focusing on the actual time points (ages), my chart highlights the variance from the OECD averages.

The chart compares countries along three metrics: total life expectancy (including healthy and unhealthy periods), effective retirement age, and the number of healthy years in retirement, which is the issue of greatest interest.

From the above chart, France and Luxembourg have the same profiles. Their citizens live a year or two above the average life expectancy. They retire about 5 years earlier than average, and enjoy about 5 more years of healthy retirement.

Meanwhile, the life expectancy of Americans is about the same as the average OECD resident. Retirement also occurs around the same age as the OECD average. Nevertheless, Americans end up with fewer years of healthy retirement than the OECD average.


The blue mist

The New York Times printed several charts about Twitter "blue checks," and they aren't among its best efforts (link).

Blue checks used to be credentials given to legitimate accounts, typically associated with media outlets, celebrities, brands, professors, etc. They were free but had to be approved by Twitter. After Elon Musk acquired Twitter, he turned blue checks into a revenue generator. Yet another subscription service (but you're buying "freedom"!). Anyone can get a blue check for US$8 per month.

[The charts shown here are scanned from the printed edition.]

Nyt_twitterblue_chart1

The first chart is a scatter plot showing the day of joining Twitter and the total number of followers the account has as of early November, 2022. Those are very strange things to pair up on a scatter plot but I get it: the designer could only work with the data that can be pulled down from Twitter's API.

What's wrong with the data? It would seem the interesting question is whether blue checks are associated with number of followers. The chart shows only Twitter Blue users so there is nothing to compare to. The day of joining Twitter is not the day of becoming "Twitter Blue", almost surely not for any user. (Nevertheless, the latter is not a standard data element released by Twitter.) The chart has a built-in time bias since the longer an account exists, the higher one would assume the number of followers to be (all else equal). Some kind of follower rate (e.g. number of followers per year of existence) might be more informative.
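A follower rate of this kind is simple to compute. The sketch below is illustrative only: the function name, the as-of date, and the two example accounts are hypothetical, and it uses 365.25 days as the length of a year.

```python
from datetime import date


def followers_per_year(followers: int, joined: date, as_of: date) -> float:
    """Followers gained per year of the account's existence."""
    years = (as_of - joined).days / 365.25
    if years <= 0:
        raise ValueError("as_of must be later than the join date")
    return followers / years


AS_OF = date(2022, 11, 1)  # assumed snapshot date

# Two hypothetical accounts: an old one and a recent one
old_account = followers_per_year(50_000, date(2012, 11, 1), AS_OF)
new_account = followers_per_year(5_000, date(2022, 8, 1), AS_OF)

print(round(old_account), round(new_account))
```

On raw counts the older account looks ten times bigger, but on this rate the newer account is growing faster, which is the time bias at work.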

Still, it's hard to know what the chart is saying. That most Blue accounts have fewer than 5,000 followers? I also suspect that they chopped off the top of the chart (outliers) and forgot to mention it. Surely, some of the celebrity accounts have way over 150,000 followers. Another sign that the top of the chart was removed is that an expected funnel effect is not seen. Given the follower count is cumulative from the day of registration, we'd expect accounts that started in the last few months to have markedly lower counts than those created years ago. (This is even more true if there is a survivorship bias - less successful accounts are more likely to be deleted over time.)

The designer arbitrarily labelled six specific accounts ("Crypto influencer", "HBO fan", etc.) but this feature risks sending readers the wrong message. There might be one HBO fan account that quickly grew to 150,000 followers in just a few months but does the data label suggest to readers that HBO fan accounts as a group tend to quickly attain high numbers of followers?

***

The second chart, which is an inset of the first, attempts to quantify the effect of the Musk acquisition on the number of "registrations and subscriptions". In the first chart, the story was described as "Elon Musk buys Twitter sparking waves of new users who later sign up for Twitter Blue".

Nyt_twitterblue_chart2

The second chart confuses me. I was trying to figure out what is counted in the vertical axis. This was before I noticed the inset in the first chart, easy to miss as it is tucked into the lower right corner. I had presumed that the axis would be the same as in the first chart since there weren't any specific labels. In that case, I am looking at accounts with 0 to 500 followers, pretty inconsequential accounts. Then, the chart title uses the words "registrations and subscriptions." If the blue dots on this chart also refer to blue-check accounts as in the first chart, then I fail to see how this chart conveys any information about registrations (which presumably would include free accounts). As before, new accounts that aren't blue checks won't appear.

Further, to the extent that this chart shows a surge in subscriptions, we are restricted to accounts with fewer than 500 followers, and it's really unclear what proportion of total subscribers is depicted. Nor is it possible to estimate the magnitude of this surge.

Besides, I'm seeing similar densities of the dots across the entire time window between October 2021 and 2022. Perhaps the entire surge is hidden behind the black lines indicating the specific days when Musk announced and completed the acquisition, respectively. If the surge is hiding behind the black vertical lines, then this design manages to block the precise spots readers are supposed to notice.

Here is where we can use the self-sufficiency test. Imagine the same chart without the text. What story would you have learned from the graphical elements themselves? Not much, in my view.

***

The third chart isn't more insightful. This chart purportedly shows suspended accounts, only among blue-check accounts.

Nyt_twitterblue_chart3

From what I could gather (and what I know about Twitter's API), the chart shows any Twitter Blue account that got suspended at any time. For example, all the black open circles occurring prior to October 27, 2022 represent suspensions by the previous management, and presumably have nothing to do with Elon Musk, or his decision to turn blue checks into a subscription product.

There appears to be a cluster of suspensions since Musk took over. I am not sure what that means. Certainly, it says he's not about "total freedom". Most of these suspended accounts have fewer than 50 followers, and have only been around for a few weeks. And as before, I'm not sure why the analyst decided to focus on accounts with fewer than 500 followers.

What could have been? Given the number of suspended accounts is relatively small, an interesting analysis would be to form clusters of suspended accounts, and report on the change in what types of accounts got suspended before and after the change of management.

***

The online article (link) is longer, filling in some details missing from the printed edition.

There is one view that shows the larger accounts:

Nyt_twitterblue_largestaccounts

While more complete, this view isn't very helpful as the biggest accounts are located in the sparsest area of the chart. The data labels again pick out strange accounts like those of adult film stars and an Arabic news site. It's not clear if the designer is trying to tell us that most Twitter Blue accounts belong to those categories.

***
See here for commentary on other New York Times graphics.


Finding the right context to interpret household energy data

Bloomberg_energybill

Bloomberg's recent article on surging UK household energy costs, projected over this winter, contains data about which I have long been intrigued: how much energy do different household items consume?

A twitter follower alerted me to this chart, and she found it informative.

***
If the goal is to pick out the appliances and estimate the cost of running them, the chart serves its purpose. Because the entire set of data is printed, a data table would have done equally well.

I learned that the mobile phone costs almost nothing to charge: 1 penny for six hours of charging, which is deemed a "single use", although that seems double what a full charge requires. The games console costs 14 pence for a "single use" of two hours. That might be an underestimate of how much time gamers spend gaming each day.

***

Understanding the design of the chart needs a bit more effort. Each appliance is measured by two metrics: the number of hours considered to be "single use", and a currency value.

It took me a while to figure out how to interpret these currency values. Each cost is associated with a single use, and the duration of a single use increases as we move down the list of appliances. Since the designer assumes a fixed cost of electricity (shown in the footnote as 34p per kWh), at first, it seems like the costs should just increase from top to bottom. That's not the case, though.

Something else is driving these numbers behind the scenes, namely, the intensity of energy use by appliance. The wifi router listed at the bottom is turned on 24 hours a day, and the daily cost of running it is just 6p. Meanwhile, running the fridge and freezer the whole day costs 41p. Thus, the fridge & freezer consumes electricity at a rate that is almost 7 times higher than the router.
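The intensity can be backed out from the numbers printed on the chart. The sketch below divides each cost by the 34p per kWh price from the footnote and by the hours of use, giving an implied average power draw; the function name is mine, and the results are only as precise as the rounded pence values on the chart.

```python
PRICE_PENCE_PER_KWH = 34  # fixed electricity price from the chart's footnote


def implied_watts(cost_pence: float, hours: float) -> float:
    """Average power draw implied by the cost of running an appliance for `hours`."""
    kwh = cost_pence / PRICE_PENCE_PER_KWH
    return kwh / hours * 1000


router = implied_watts(6, 24)           # wifi router: 6p per 24 hours
fridge_freezer = implied_watts(41, 24)  # fridge & freezer: 41p per 24 hours

print(round(router), round(fridge_freezer), round(fridge_freezer / router, 1))
# prints: 7 50 6.8
```

So the router averages roughly 7 watts while the fridge and freezer averages roughly 50 watts, which is where the "almost 7 times" comes from.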

The chart uses a split axis, which artificially reduces the gap between 8 hours and 24 hours. Here is another look at the bottom of the chart:

Bloomberg_energycost_bottom

***

Let's examine the choice of "single use" as a common basis for comparing appliances. Consider this:

  • Continuous appliances (wifi router, refrigerator, etc.) are denoted as 24 hours, so a daily time window is also implied
  • Repeated-use appliances (e.g. coffee maker, kettle) may be run multiple times a day
  • Infrequent use appliances may be used less than once a day

I prefer standardizing to a "per day" metric. If I use the microwave three times a day, the daily cost is 3 x 3p = 9 p, which is more than I'd spend on the wifi router, run 24 hours. On the other hand, I use the washing machine once a week, so the frequency is 1/7, and the effective daily cost is 1/7 x 36 p = 5p, notably lower than using the microwave.
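The "per day" conversion is just the cost of a single use multiplied by how often the appliance is used per day. A minimal sketch, using the per-use costs quoted above and my own usage frequencies (the function name is an assumption):

```python
def daily_cost_pence(cost_per_use: float, uses_per_day: float) -> float:
    """Effective daily cost of an appliance, in pence."""
    return cost_per_use * uses_per_day


microwave = daily_cost_pence(3, 3)             # three uses a day
washing_machine = daily_cost_pence(36, 1 / 7)  # once a week
wifi_router = daily_cost_pence(6, 1)           # one 24-hour "use" per day

for name, cost in [("microwave", microwave),
                   ("wifi router", wifi_router),
                   ("washing machine", washing_machine)]:
    print(f"{name}: {cost:.0f}p per day")
```

Under this metric the ranking depends on each household's habits, so the frequencies would need to be stated alongside the chart.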

The choice of metric has key implications for the appearance of the chart. The bubble size encodes the relative energy costs. The biggest bubbles are in the heating category, which is no surprise. The next largest bubbles are tumble dryer, dishwasher, and electric oven. These are generally not used every day so the "per day" calculation would push them lower in rank.

***

Another noteworthy feature of the Bloomberg chart is the split legend. The colors divide appliances into five groups based on usage category (e.g. cleaning, food, utility). Instead of the usual color legend printed on a corner or side of the chart, the designer spreads the category labels around the chart. Each label is shown the first time a specific usage category appears on the chart. There is a presumption that the reader scans from top to bottom, which is probably true on average.

I like this arrangement as it delivers information to the reader when it's needed.


People flooded this chart presented without comment with lots of comments

The recent election in Italy has resulted in some dubious visual analytics. A reader sent me this Excel chart:

Italy_elections_RDC-M5S

In brief, an Italian politician (trained as a PhD economist) used the graph above to make a point that support of the populist Five Star party (M5S) is highly correlated with poverty - the number of people on RDC (basic income). "Senza commento" - no comment needed.

Except a lot of people noticed the idiocy of the chart, and ridiculed it.

The chart appeals to those readers who don't spend time understanding what's being plotted. They notice two lines that show similar "trends" which is a signal for high correlation.

It turns out the signal in the chart isn't found in the peaks and valleys of the "trends". It is tempting to observe that when the blue line peaks (Campania, Sicilia, Lazio, Piemonte, Lombardia), the orange line also pops.

But look at the vertical axis. He's plotting the number of people, rather than the proportion of people. Population varies widely between Italian provinces. The five mentioned above all have over 4 million residents, while the smaller ones such as Umbria, Molise, and Basilicata have under 1 million. Thus, so long as the number of people, not the proportion, is plotted, no matter what demographic metric is highlighted, we will see peaks in the most populous provinces.
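A small numerical sketch shows how population size alone can manufacture this pattern. All numbers below are made up for illustration (they are not the Italian data): the two rates are constructed to move in exactly opposite directions, yet once each is multiplied by population, the resulting counts are positively correlated.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical regions: population in millions and two unrelated-looking rates
population = [10.0, 5.0, 4.0, 1.0, 0.5]
rate_a = [0.02, 0.04, 0.06, 0.08, 0.10]   # rises as regions get smaller
rate_b = [0.30, 0.25, 0.20, 0.15, 0.10]   # falls as regions get smaller

# What a chart of raw head counts would plot
count_a = [p * r for p, r in zip(population, rate_a)]
count_b = [p * r for p, r in zip(population, rate_b)]

print("correlation of rates: ", round(pearson(rate_a, rate_b), 2))
print("correlation of counts:", round(pearson(count_a, count_b), 2))
```

The rates are perfectly negatively correlated, while the counts show a positive correlation driven by population, which is why the proportion is the right quantity to plot.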

***

The other issue with this line chart is that the "peaks" are completely contrived. That's because the items on the horizontal axis do not admit a natural order. This is NOT a time-series chart, for which there is a canonical order. The horizontal axis contains a set of provinces, which can be ordered in whatever way the designer wants.

The following shows how the appearance of the lines changes as I select different metrics by which to sort the provinces:

Redo_italianelections_m5srdc_1

This is the reason why many chart purists frown on people who use connected lines with categorical data. I don't like this hard rule, as my readers know. In this case, I have to agree the line chart is not appropriate.

***

So, where is the signal on the line chart? It's in the ratio of the heights of the two values for each province.

Redo_italianelections_m5srdc_2

Here, we find something counter-intuitive. I've highlighted two of the peaks. In Sicilia, about the same number of people voted for Five Star as there are people who receive basic income. In Lombardia, more than twice the number of people voted for Five Star as there are people who receive basic income. 

Now, Lombardy is where Milan is, essentially the richest province in Italy while Sicily is one of the poorest. Could it be that Five Star actually outperformed their demographics in the richer provinces?

***

Let's approach the politician's question systematically. He's trying to say that the Five Star movement appeals especially to poorer people. He's chosen basic income as a proxy for poverty (this is like people on welfare in the U.S.). Thus, he's divided the population into two groups: those on welfare, and those not.

What he needs is the relative proportions of votes for Five Star among these two subgroups. Say, Five Star garnered 30% of the votes among people on welfare, and 15% of the votes among people not on welfare, then we have a piece of evidence that Five Star differentially appeals to people on welfare. If the vote share is the same among these two subgroups, then Five Star's appeal does not vary with welfare.

The following diagram shows the analytical framework:

Redo_italianelections_m5srdc_3

What's the problem? He doesn't have the data needed to establish his thesis. He has the total number of Five Star voters (which is the sum of the two yellow boxes) and he has the total number of people on RDC (which is the dark orange box).
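To see why these two totals cannot settle the question, consider the sketch below. The numbers are invented for a single hypothetical region, and for simplicity it assumes everyone votes. Both scenarios produce exactly the same totals the politician has, yet they imply opposite conclusions about who supports Five Star.

```python
ON_RDC, NOT_ON_RDC = 100, 900  # hypothetical region: people on and not on RDC

# Five Star votes coming from each subgroup under two hypothetical scenarios
scenarios = {
    "A": {"rdc_votes": 90, "other_votes": 210},
    "B": {"rdc_votes": 10, "other_votes": 290},
}


def summarize(s):
    """Return (total Five Star votes, share among RDC, share among non-RDC)."""
    total = s["rdc_votes"] + s["other_votes"]
    return total, s["rdc_votes"] / ON_RDC, s["other_votes"] / NOT_ON_RDC


for name, s in scenarios.items():
    total, share_rdc, share_other = summarize(s)
    print(f"Scenario {name}: total={total}, on RDC={ON_RDC}, "
          f"share among RDC={share_rdc:.0%}, share among others={share_other:.0%}")
```

In scenario A, Five Star wins 90% of RDC recipients; in scenario B, only 10%. The observable totals (300 votes, 100 people on RDC) are identical in both.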

Redo_italianelections_m5srdc_4

As shown above, another intervening factor is the proportion of people who voted. It is conceivable that the propensity to vote also depends on one's wealth.

So, in this case, fixing the visual will not fix the problem. Finding better data is key.


Another reminder that aggregate trends hide information

The last time I looked at the U.S. employment situation, it was during the pandemic. The data revealed the deep flaws of the so-called "not in labor force" classification. This classification is used to dehumanize unemployed people who are declared "not in labor force," in which case they are neither employed nor unemployed -- just not counted at all in the official unemployment (or employment) statistics.

The reason given for such a designation was that some people just have no interest in working, or even looking for a job. Now they are not merely discouraged - as there is a category of those people. In theory, these people haven't been looking for a job for so long that they are no longer visible to the bean counters at the Bureau of Labor Statistics.

What happened when the pandemic precipitated a shutdown in many major cities across America? The number of "not in labor force" shot up instantly, literally within a few weeks. That makes a mockery of the reason for such a designation. See this post for more.

***

The data we saw last time was up to April, 2020. That's more than two years old.

So I have updated the charts to show what has happened in the last couple of years.

Here is the overall picture.

Junkcharts_unemployment_notinLFparttime_all_2

In this new version, I centered the chart at the 1990 data. The chart features two key drivers of the headline unemployment rate - the proportion of people designated "invisible", and the proportion of those who are considered "employed" who are "part-time" workers.

The last two recessions have caused structural changes to the labor market. From 1990 to late 2000s, which included the dot-com bust, these two metrics circulated within a small area of the chart. The Great Recession of late 2000s led to a huge jump in the proportion called "invisible". It also pushed the proportion of part-timers to all-time highs. The proportion of part-timers has fallen although it is hard to interpret from this chart alone - because if the newly invisible were previously part-time employed, then the same cause can be responsible for either trend.

_numbersense_bookcover

Readers of Numbersense (link) might be reminded of a trick used by school deans to pump up their US News rankings. Some schools accept lots of transfer students. This subpopulation is invisible to the US News statisticians since they do not factor into the rankings. The recent scandal at Columbia University also involves reclassifying students (see this post).

Zooming in on the last two years, it appears that the pandemic-related unemployment situation has reversed.

***

Let's split the data by gender.

American men have been stuck in a negative spiral since the 1990s. With each recession, a higher proportion of men are designated BLS invisibles.

Junkcharts_unemployment_notinLFparttime_men_2

In the grid system set up in this scatter plot, the top right corner is the worst of all worlds - the work force has shrunk and there are more part-timers among those counted as employed. The U.S. men are not exiting this quadrant any time soon.

***
What about the women?

Junkcharts_unemployment_notinLFparttime_women_2

If we compare 1990 with 2022, the story is not bad. The female work force is gradually reaching the same scale as in 1990 while the proportion of part-time workers has declined.

However, to celebrate the above is to ignore the tremendous gains American women made in the 1990s and 2000s. In 1990, only 58% of women were considered part of the work force - the other 42% were not working but they were not counted as unemployed. By 2000, the female work force had expanded to include about 60% with similar proportions counted as part-time employed as in 1990. That's great news.

The Great Recession of the late 2000s changed that picture. Just like men, many women became invisible to BLS. The invisible proportion reached 44% in 2015 and has not returned to anywhere near the 2000 level. Fewer women are counted as part-time employed; as I said above, it's hard to tell whether this is because the women exiting the work force previously worked part-time.

***

The color of the dots in all charts is determined by the headline unemployment number. Blue represents low unemployment. During the 1990-2022 period, there are three moments in which unemployment is reported as 4 percent or lower. These charts are intended to show that an aggregate statistic hides a lot of information. The three times at which the unemployment rate reached historic lows represent three very different situations, if one were to consider the sizes of the work force and the number of part-time workers.

P.S. [8-15-2022] Some more background about the visualization can be found in prior posts on the blog: here is the introduction, and here's one that breaks it down by race. Chapter 6 of Numbersense (link) gets into the details of how unemployment rate is computed, and the implications of the choices BLS made.

P.S. [8-16-2022] Corrected the axis title on the charts (see comment below). Also, added source of data label.


Visualizing the impossible

Note [July 6, 2022]: Typepad's image loader is broken yet again. There is no way for me to fix the images right now. They are not showing despite being loaded properly yesterday. I also cannot load new images. Apologies!

Note 2: Manually worked around the automated image loader.

Note 3: Thanks Glenn for letting me know about the image loading problem. It turns out the comment approval function is also broken, so I am not able to approve the comment.

***

A Twitter user sent me this chart:

twitter_greatreplacement

It's, hmm, mystifying. It performs magic, as I explain below.

What's the purpose of the gridlines and axis labels? Even if there is a rationale for printing those numbers, they make it harder, not easier, for readers to understand the chart!

I think the following chart shows the main message of this poll result. Democrats are much more likely to think of immigration as a positive compared to Republicans, with Independents situated in between.

Redo_greatreplacement

***

The axis title gives a hint as to what the chart designer was aiming for with the unconventional axis. It reads "Overall Percentage for All Participants". It appears that the total length of the stacked bar is the weighted aggregate response rate. Roughly 17% of Americans thought this development to be "very positive", which includes 8% of Republicans, 27% of Democrats and 12% of Independents. Since the three segments are not equal in size, 17% is a weighted average of the three proportions.

Within each of the three political affiliations, the data labels add to 100%. These numbers therefore are unweighted response rates for each segment. (If weighted, they should add up to the proportion of each segment.)

This sets up an impossible math problem. The three segments within each bar then represent the sum of three proportions, each unweighted within its segment. Adding these unweighted proportions does not yield the desired weighted average response rate. To get the weighted average response rate, we need to sum the weighted segment response rates instead.
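
To make the arithmetic concrete, here is a minimal sketch in Python. The "very positive" rates (8%, 27%, 12%) are the data labels from the chart. The segment shares (30% Republican, 40% Democrat, 30% Independent) are hypothetical, made up purely for illustration, since the chart does not disclose the composition of the poll sample.

```python
# "Very positive" response rates within each segment, as printed on the chart.
very_positive = {"Republican": 0.08, "Democrat": 0.27, "Independent": 0.12}

# HYPOTHETICAL segment shares of all participants (not given on the chart).
share = {"Republican": 0.30, "Democrat": 0.40, "Independent": 0.30}

# What the stacked bar does: lay the unweighted rates end to end.
stacked_length = sum(very_positive.values())

# What the axis title promises: the weighted average across segments.
weighted_rate = sum(very_positive[s] * share[s] for s in very_positive)

print(f"stacked length of unweighted rates: {stacked_length:.0%}")
print(f"weighted aggregate response rate:   {weighted_rate:.1%}")
```

Under these assumed shares, the weighted rate comes out near 17% while the stacked length is 47%. Any other plausible set of shares leads to the same conclusion: the two quantities cannot sit on the same axis.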

This impossible math problem somehow got resolved visually. We can see that each bar segment faithfully represents the unweighted response rate shown in the respective data label. Summing them would not yield the aggregate response rates as described by the axis title. The difference is not a simple multiplicative constant because each segment must be weighted by a different multiplier. So, your guess is as good as mine: what is the magic that makes the impossible possible?

[P.S. Another way to see this inconsistency. The sum of all the data labels is 300% because the proportions of each segment add up to 100%. At the same time, the axis title implies that the sum of the lengths of all five bars should be 100%. So, the chart asserts that 300% = 100%.]

***

This poll question is perfect classroom fodder for discussing how the wording of poll questions affects responses (something called "response bias"). Look at the following variants of the same question. Are we likely to get answers consistent with the above question?

As you know, the demographic makeup of America is changing and becoming more diverse, while the U.S. Census estimates that white people will still be the largest race in approximately 25 years. Generally speaking, do you find these changes to be very positive, somewhat positive, somewhat negative or very negative?

***

As you know, the demographic makeup of America is changing and becoming more diverse, with the U.S. Census estimating that black people will still be a minority in approximately 25 years. Generally speaking, do you find these changes to be very positive, somewhat positive, somewhat negative or very negative?

***

As you know, the demographic makeup of America is changing and becoming more diverse, with the U.S. Census estimating that Hispanic, black, Asian and other non-white people together will be a majority in approximately 25 years. Generally speaking, do you find these changes to be very positive, somewhat positive, somewhat negative or very negative?

What is also amusing is that in the world described by the pollster in 25 years, every race will qualify as a "minority". There will no longer be a majority since no race will constitute at least 50% of the U.S. population. So at that time, the word "minority" will have lost meaning.


Selecting the right analysis plan is the first step to good dataviz

It's a new term, and my friend Ray Vella shared some student projects from his NYU class on infographics. There's always something to learn from these projects.

The starting point is a chart published in the Economist a few years ago.

Economist_richgetricher

This is a challenging chart to read. To save you the time, the following key points are pertinent:

a) income inequality is measured by the disparity between regional averages

b) the incomes are given in a double index, a relative measure. For each country and year combination, the average national GDP is set to 100. A value of 150 means the richest region of Spain has an average income that is 50% higher than Spain's national average in the year 2015.

The original chart - as well as most of the student work - is based on a specific analysis plan. The difference in the index values between the richest and poorest regions is used as a measure of the degree of income inequality, and the change in the difference in the index values over time, as a measure of change in the degree of income inequality over time. That's every bit as big a mouthful as it sounds.

This analysis plan can be summarized as:

1) all incomes -> relative indices, at each region-year combination
2) inequality = rich - poor region gap, at each country-year combination
3) inequality over time = inequality in 2015 - inequality in 2000, for each country
4) country difference = inequality in country A - inequality in country B, for each year
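
The four steps above can be sketched in a few lines of Python. All the GDP figures below are hypothetical, invented to keep the arithmetic easy to follow; they are not read off the Economist chart, and the country names "A" and "B" are placeholders.

```python
# HYPOTHETICAL GDP per capita (in thousands); not taken from the Economist chart.
# (country, year) -> (national average, richest region, poorest region)
gdp = {
    ("A", 2000): (20.0, 28.0, 14.0),
    ("A", 2015): (30.0, 45.0, 18.0),
    ("B", 2000): (25.0, 30.0, 20.0),
    ("B", 2015): (40.0, 50.0, 30.0),
}

def to_index(value, national):
    """Step 1: express an income relative to the national average (= 100)."""
    return 100.0 * value / national

# Step 2: inequality = index gap between the richest and poorest regions.
gap = {
    key: to_index(rich, nat) - to_index(poor, nat)
    for key, (nat, rich, poor) in gdp.items()
}

# Step 3: change in inequality over time, for each country.
change = {c: gap[(c, 2015)] - gap[(c, 2000)] for c in ("A", "B")}

# Step 4: difference in inequality between countries, for each year.
country_diff = {y: gap[("A", y)] - gap[("B", y)] for y in (2000, 2015)}

print(gap)
print(change)
print(country_diff)
```

By the time the reader reaches step 3 or 4, every number is a difference of differences of ratios, which goes some way to explaining why the chart takes so much effort to decode.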

***

One student, J. Harrington, looks at the data through an alternative lens that brings clarity to the underlying data. Harrington starts with change in income within the richest regions (then the poorest regions), so that a worsening income inequality should imply that the richest region is growing incomes at a faster clip than the poorest region.

This alternative analysis plan can be summarized as:
1) change in income over time for richest regions for each country
2) change in income over time for poorest regions for each country
3) inequality = change in income over time: rich - poor, for each country
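
Here is a minimal sketch of these three steps in Python. The GDP figures are hypothetical (not taken from the Economist data), the country names are placeholders, and I am assuming that "change in income" is measured as percent growth between 2000 and 2015; a different measure of change would follow the same structure.

```python
# HYPOTHETICAL GDP per capita (in thousands); not taken from the Economist chart.
# country -> {region type: (value in 2000, value in 2015)}
gdp = {
    "A": {"richest": (28.0, 45.0), "poorest": (14.0, 18.0)},
    "B": {"richest": (30.0, 50.0), "poorest": (20.0, 30.0)},
}

def pct_change(start, end):
    """Growth from the start year to the end year, as a percentage (assumed measure)."""
    return 100.0 * (end - start) / start

results = {}
for country, regions in gdp.items():
    rich = pct_change(*regions["richest"])  # step 1: growth in the richest region
    poor = pct_change(*regions["poorest"])  # step 2: growth in the poorest region
    results[country] = {"rich": rich, "poor": poor, "gap": rich - poor}  # step 3

for country, r in results.items():
    print(f"{country}: richest {r['rich']:+.1f}%, poorest {r['poor']:+.1f}%, "
          f"gap {r['gap']:+.1f} points")
```

A positive gap means the richest region grew faster than the poorest one, so inequality worsened. The reader only has to compare two growth rates per country, rather than unpack an index of an index.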

The restructuring of the analysis plan makes a big difference!

Here is one way to show this alternative analysis:

Junkcharts_kfung_sixeurocountries_gdppercapita

The underlying data have not changed but the reader's experience is transformed.