Gazing at petals

Reader Murphy pointed me to the following infographic developed by Altmetric to explain their analytics of citations of journal papers. These metrics are "alternative" in that they arise from non-academic media sources, such as news outlets, blogs, Twitter, and Reddit.

The key graphic is the petal diagram with a number in the middle.

Altmetric_tetanus

I have a hard time thinking of this object as “data visualization”. Data visualization should visualize the data. Here, the connection between the data and the visual design is tenuous.

There are eight petals arranged around the circle. The legend below the diagram maps the color of each petal to a source of data. Red, for example, represents mentions in news outlets, and green represents mentions in videos.

Each petal is the same size, even though the counts given below differ. So the petals act as a duplicative legend.

The order of the colors around the circle does not match their order in the table below, for reasons unknown.

Then comes another puzzle. The bluish-gray petal appears three times in the diagram. This color is mapped to tweets. Does the number of petals represent the much higher counts of tweets compared to other mentions?

To confirm, I pulled up the graphic for a different paper.

Altmetric_worldwidedeclineofentomofauna

Here, each petal has a different color. Eight petals, eight colors. The count of tweets is still much larger than the frequencies of the other sources. So the rule of construction appears to be: one petal for each relevant data source, and if the total number of data sources falls below eight, let Twitter claim all the unclaimed petals.

A third sample paper confirms this rule:

Altmetric_dnananodevices

None of the places where we hoped to find data – the size of the petals, the color of the petals, the number of petals – actually contains any data. Anything the reader wants to learn must be read directly from the text. The "score" that reflects the aggregate "importance" of the corresponding paper is printed at the center of the circle. The legend provides the raw data.

***

Some years ago, one of my NYU students worked on a project relating to paper citations. He eventually presented the work at a conference. I featured it previously.

Michaelbales_citationimpact

Notice how the visual design provides context for interpretation – by placing each paper/researcher among its peers, and by using a relative scale (percentiles).

***

I’m ignoring the D corner of the Trifecta Checkup in this post. For any visualization to be meaningful, the data must be meaningful. The type of counting used by Altmetric treats every tweet, every mention, etc. as a tally, making everything worth the same. A mention on CNN counts as much as a mention by a pseudonymous redditor. A pan is the same as a rave. Let’s not forget the fake data menace (link), which affects all performance metrics.


Graph literacy, in a sense

Ben Jones tweeted out this chart, which has an unusual feature:

Malefemaleliteracyrates

What's unusual is that time runs in both directions. Usually, the rule is that time runs left to right (except, of course, in right-to-left cultures). Here, the purple area chart follows that convention while the yellow area chart inverts it.

On the one hand, this is quite cute. Lines meeting in the middle. Converging. I get it.

On the other hand, every time a designer defies convention, the reader has to recognize the deviation, and then rationalize it.

In this particular graphic, I'm not convinced. There are only four numbers. The trend on either side looks linear, so the story is simple. Why complicate it with an unusual visual design?

Here is an entirely conventional bumps-like chart that tells the story:

Redo_literacyratebygender

I've done a couple of things here that might be considered controversial.

First, I completely straightened out the lines. I don't see what the additional precision brings to the chart.

Second, despite having just four numbers, I added the year 1996 and vertical gridlines indicating decades. A Tufte purist will surely object.

***

Related blog post: "The Return on Effort in Data Graphics" (link)


The rule governing which variable to put on which axis, served a la mode

When making a scatter plot, the two variables should not be placed arbitrarily. There is a rule governing this: the outcome variable should be shown on the vertical axis (also called y-axis), and the explanatory variable on the horizontal (or x-) axis.

This chart from the archives of the Economist has this reversed:

20160402_WOC883_icecream_PISA

The title of the accompanying article is "Ice Cream and IQ"...

In a Trifecta Checkup (link), it's a Type DV chart. It's preposterous to claim eating ice cream makes one smarter without more careful studies. The chart also carries the xyopia fallacy: by showing just two variables, readers are unwittingly led to explain differences in "IQ" using differences in per-capita ice-cream consumption when lots of other stronger variables will explain any gaps in IQ.

In this post, I put aside my objections to the analysis, and focus on the issue of assigning variables to axes. Notice that this chart reverses the convention: the outcome variable (IQ) is shown on the horizontal, and the explanatory variable (ice cream) is shown on the vertical.

Here is a reconstruction of the above chart, showing only the dots that were labeled with country names. I fitted a straight regression line instead of a curve. (I don't understand why the red line in the original chart bends upwards when the data for Japan, South Korea, Singapore and Hong Kong should be dragging it down.)

Redo_econ_icecreamIQ_1A

Note that the interpretation of the regression line raises eyebrows because the presumed causality is reversed. For each 50-point increase in PISA score (IQ), this line says to expect ice cream consumption to rise by about 1-2 liters per person per year. In other words, higher IQ makes people eat more ice cream.

***

If the convention is respected, then the following scatter plot results:

Redo_econ_icecreamIQ_2

The first thing to note is that the regression analysis here is different from the one shown in the previous chart. The blue regression line is not equivalent to the black regression line from before. You cannot reverse the roles of the x and y variables in a regression analysis, and so neither should you reverse them in a scatter plot.
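
For readers who want to see this non-equivalence concretely, here is a minimal sketch with simulated data (my own made-up numbers, not the Economist's). The slope from regressing y on x and the slope from regressing x on y multiply to r-squared, not to 1, so the two fitted lines differ unless the correlation is perfect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-ins for (PISA score, ice cream consumption)
x = rng.normal(500, 50, 30)             # explanatory: PISA score
y = 0.02 * x + rng.normal(0, 1.5, 30)   # outcome: liters per person per year

b_yx = np.polyfit(x, y, 1)[0]   # slope from regressing y on x
b_xy = np.polyfit(y, x, 1)[0]   # slope from regressing x on y

# If the two regressions were interchangeable, b_xy would equal 1 / b_yx.
# In fact, b_yx * b_xy = r^2, which is below 1 unless correlation is perfect.
r = np.corrcoef(x, y)[0, 1]
print(abs(b_yx * b_xy - r**2) < 1e-9)   # True
```

This is why swapping the axes is not a cosmetic choice: it silently swaps which regression you are implying.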

The blue regression line can be interpreted as having two sections, roughly, for countries consuming more than or less than 6 liters of ice cream per person per year. In the less-ice-cream countries, the correlation between ice cream and IQ is stronger (I don't endorse the causal interpretation of this statement).

***

When you make a scatter plot, you have two variables whose correlation you want to analyze. In most cases, you are exploring a cause-effect relationship.

Higher-income households care more about politics.
Less educated citizens are more likely not to register to vote.
Companies with more diverse workforces have better business performance.

Frequently, the reverse correlation does not admit a causal interpretation:

Caring more about politics does not make one richer.
Not registering to vote does not make one less educated.
Making more profits does not lead to more diversity in hiring.

In each of these examples, it's clear that one variable is the outcome and the other is the explanatory factor. Always put the outcome on the vertical axis, and the explanation on the horizontal axis.

The justification is scientific. If you are going to add a regression line (what Excel calls a "trendline"), you must follow this convention; otherwise, your regression analysis will yield the wrong result, with an absurd interpretation!

 

[PS. 11/3/2019: The comments below contain different theories that link the two variables, including theories that treat PISA score ("IQ") as the explanatory variable and ice cream consumption as the outcome. Also, I elaborated that the rule does not dictate which variable is the outcome - the designer effectively signals to the reader which variable is regarded as the outcome by placing it on the vertical axis.]


Announcement: Advancing your data skills, Fall 2019

Interrupting the flow of dataviz with the following announcement.

If you're looking to shore up your data skills, modernize your skill set, or know someone looking for hands-on, high-touch instruction in Machine Learning, R, Cloud Computing, Data Quality, Digital Analytics, A/B Testing and Financial Analysis, Principal Analytics Prep is offering evening classes this Fall. Click here to learn about our courses.

Our instructors are industry veterans with 10+ years of practical experience. And class size is capped at 10, ensuring a high-touch learning environment.

Facebook_pap_parttimeimmersive_tree

 


Tightening the bond between the message and the visual: hello stats-cats

The editors of ASA's Amstat News certainly got my attention with a recent article on school counseling. A research team asked two questions. The first was HOW ARE YOU FELINE?

Stats and cats. The pun got my attention and presumably also made others stop and wonder. The second question was HOW DO YOU REMEMBER FEELING while you were taking a college statistics course? Well, it's hard to imagine the average response to that question would be positive.

What also drew me to the article was this pair of charts:

Counselors_Figure1small

Surely, ASA can do better. (I'm happy to volunteer my time!)

Rotate the chart, clean up the colors, remove the decimals, put the chart titles up top, etc.

***

The above remedies fall into the V corner of my Trifecta checkup.

Trifectacheckup_junkcharts_image

The key to fixing this chart is to tighten the bond between the message and the visual. This means working that green link between the Q and V corners.

This much became clear after reading the article. The following paragraphs are central to the research (bolding is mine):

Responses indicated the majority of school counselors recalled experiences of studying statistics in college that they described with words associated with more unpleasant affect (i.e., alarm, anger, distress, fear, misery, gloom, depression, sadness, and tiredness; n = 93; 66%). By contrast, a majority of counselors reported same-day (i.e., current) emotions that appeared to be associated with more pleasant affect (i.e., pleasure, happiness, excitement, astonishment, sleepiness, satisfaction, and calm; n = 123; 88%).

Both recalled emotive experiences and current emotional states appeared approximately balanced on dimensions of arousal: recalled experiences associated with lower arousal (i.e., pleasure, misery, gloom, depression, sadness, tiredness, sleepiness, satisfaction, and calm, n = 65, 46%); recalled experiences associated with higher arousal (i.e., happiness, excitement, astonishment, alarm, anger, distress, fear, n = 70, 50%); current emotions associated with lower arousal (n = 60, 43%); current experiences associated with higher arousal (i.e., n = 79, 56%).

These paragraphs convey two crucial pieces of information: the structure of the analysis, and its insights.

The two survey questions measure two states of experiences, described as current versus recalled. Then the individual affects (of which there were 16 plus an option of "other") are scored on two dimensions, pleasure and arousal. Each affect maps to high or low pleasure, and separately to high or low arousal.
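
The coding scheme described above can be sketched as a simple lookup table. The affect lists below are taken from the quoted paragraphs; the function is only my illustration of the scheme, not the researchers' actual code:

```python
# Affects associated with high pleasure and high arousal, per the quoted
# paragraphs; any affect not in a set is coded "low" on that dimension.
HIGH_PLEASURE = {"pleasure", "happiness", "excitement", "astonishment",
                 "sleepiness", "satisfaction", "calm"}
HIGH_AROUSAL = {"happiness", "excitement", "astonishment",
                "alarm", "anger", "distress", "fear"}

def score(affect):
    """Map an individual affect to its (pleasure, arousal) coding."""
    return ("high" if affect in HIGH_PLEASURE else "low",
            "high" if affect in HIGH_AROUSAL else "low")

print(score("distress"))      # ('low', 'high')
print(score("satisfaction"))  # ('high', 'low')
```

Aggregating these per-affect codes over respondents yields the high/low pleasure and arousal percentages reported in the article.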

The research insight is that current experience scored noticeably higher than recalled experience on the pleasure dimension, while the two were similar on the arousal dimension.

Any visualization of this research must bring out this insight.

***

Here is an attempt to illustrate those paragraphs:

Redo_junkcharts_amstat_feline

The primary conclusion can be read from the four simple pie charts in the middle of the page. The color scheme shines light on which affects are coded as high or low for each dimension. For example, "distressed" is scored as showing low pleasure and high arousal.

A successful data visualization for this situation has to bring out the conclusion drawn at the aggregated level, while explaining the connection between individual affects and their aggregates.


A chart makes an appearance in my new video

Been experimenting with short videos recently. My latest is a short explainer on why some parents are willing to spend over a million dollars to open back doors to college admissions. I even inserted a chart showing some statistics. Click here to see the video.

 

Also, subscribe to my channel to see future episodes of Inside the Black Box.

***

Here are a couple of recent posts related to college admissions.

  • About those so-called adversity scores (link)
  • A more detailed post on various college admissions statistics (link)

Visually exploring the relationship between college applicants and enrollment

In a previous post, we learned that top U.S. colleges have become even more selective over the last 15 years, driven by a doubling of the number of applicants while class sizes have nudged up by just 10 to 20 percent. 

Redo_pewcollegeadmissions

The top 25 most selective colleges are included in the first group. Between 2002 and 2017, their average rate of admission dropped from about 20% to about 10%, almost entirely explained by applicants per student doubling from 10 to almost 20. A similar upward movement in selectivity is found in the first four groups of colleges, which on average accept at least half of their applicants.

Most high school graduates, however, are not enrolling in colleges in the first four groups. In fact, the majority of college enrollment belongs to the bottom two groups of colleges. These groups also attracted twice as many applicants in 2017 as in 2002, but their selectivity did not change: they accepted 75% to 80% of applicants in 2002, as they did in 2017.

***

In this post, we look at a different view of the same data. The following charts focus on the growth rates, indexed to 2002. 

Collegeadmissions_5

To my surprise, the number of college-age Americans grew by about 10% initially but by 2017 had dropped back to the level of 2002. Meanwhile, the number of applications continued to climb across all eight groups of colleges.

The jump in applications made selectivity surge at the most selective colleges. But at the less selective colleges, where the vast majority of students enroll, the admission rate stayed put because they gave out many more offers as applications mounted. As the Pew headline asserted, "the rich get richer."

Enrollment has not kept up. Class sizes expanded about 10 to 30 percent in those 15 years, lagging way behind applications and admissions.

How do we explain the incremental applications?

  • Applicants increasing the number of schools they apply to
  • The untapped market: applicants who in the past would not have applied to college
  • Non-U.S. applicants: this is part of the untapped market, but much larger

An exercise in decluttering

My friend Xan found the following chart by Pew hard to understand. Why is the chart so taxing to look at? 

Pew_collegeadmissions

It's packing too much.

I first notice the shaded areas. Shading usually signifies "look here". On this chart, the shading is highlighting the least important part of the data. Since the top line shows applicants and the bottom line admitted students, the shaded gap displays the rejections.

The numbers printed on the chart are growth rates but they confusingly do not sync with the slopes of the lines because the vertical axis plots absolute numbers, not rates. 

Pew_collegeadmissions_growth

The vertical axis presents the total number of applicants, and the total number of admitted students, in each "bucket" of colleges, grouped by their admission rate in 2017. On the right, I drew in two lines, both representing growth rates of 100%: from 500K to 1 million, and from 1 to 2 million. The slopes are not the same even though the rates of growth are.

Therefore, the growth rates printed on the chart must be read as extraneous data unrelated to other parts of the chart. Attempts to connect those rates to the slopes of the corresponding lines are frustrated.
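
The slope/growth-rate mismatch is simple arithmetic. A quick sketch using the two 100% growth lines drawn above:

```python
# Two series, each growing 100% over the same 15-year span, plotted on
# an absolute axis: equal growth rates, unequal slopes.
start_a, end_a = 500_000, 1_000_000
start_b, end_b = 1_000_000, 2_000_000
years = 15

growth_a = end_a / start_a - 1    # 1.0, i.e. 100%
growth_b = end_b / start_b - 1    # 1.0, i.e. 100%

slope_a = (end_a - start_a) / years   # about 33,000 per year
slope_b = (end_b - start_b) / years   # about 67,000 per year

print(growth_a == growth_b)   # True: same growth rate
print(slope_b / slope_a)      # 2.0: but twice the slope
```

A reader eyeballing slopes on an absolute scale will therefore misjudge relative growth whenever the starting levels differ.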

Another lurking factor is the unequal sizes of the buckets of colleges. There are fewer than 10 colleges in the most selective bucket, and over 300 in the largest. We cannot properly interpret the total number of applicants (or admissions): the quantity of applications in a bucket depends not just on the popularity of the colleges but also on the number of colleges in the bucket.

The solution isn't to resize the buckets but to select a more appropriate metric: the number of applicants per enrolled student. In 2017, the most selective colleges attracted about 20 applicants per enrolled student while the least selective colleges (those that accept almost everyone) got about 4.

As the following chart shows, the number of applicants has doubled across the board in 15 years. This raises an intriguing question: why would a college that accepts pretty much all applicants need more applicants than enrolled students?

Redo_pewcollegeadmissions

Depending on whether you are a school administrator or a student, a virtuous (or vicious) cycle has been set in motion. The top four most selective groups of colleges have been able to attract progressively more applicants. Since class size did not expand appreciably, more applicants result in an ever-lower admit rate. A lower admit rate reduces the chance of getting admitted, which causes prospective students to apply to even more colleges, which further suppresses the admit rate.

No Latin honors for graphic design

Paw_honors_2018

This chart appeared in a recent issue of Princeton Alumni Weekly.

If you read the sister blog, you'll be aware that at most universities in the United States, every student is above average! At Princeton, 47% of the graduating class earned "Latin" honors. The median student just missed graduating with honors, so the honors graduate is just above average! The 47% number is actually lower than at some peer schools - at one point, Harvard was giving 90% of its graduates Latin honors.

Side note: In researching this post, I also learned that in the Senior Survey for Harvard's Class of 2018, two-thirds of the respondents (the response rate was about 50%) reported a GPA of 3.71 or above, and half reported 3.80 or above, which means their grade average is higher than A-. Since Harvard does not give out A+, half of the graduates received As in almost every course they took, assuming no non-response bias.

***

Back to the chart. It's a simple chart but it's not getting a Latin honor.

Most readers of the magazine will not care about the decimal point. Just write 18.9% as 19%. Or even 20%.

The sequencing of the honor levels is backwards. Summa should be on top.

***

Warning: the remainder of this post is written for graphics die-hards. I go through a bunch of different charts, exploring some fine points.

People often complain that bar charts are boring. A trendy alternative when it comes to count or percentage data is the "pictogram."

Here are two versions of the pictogram. On the left, each percentage point is shown as a dot. Then imagine each dot turned into a square; remove all padding and lines, and you get the chart on the right, which is basically an area chart.

Redo_paw_honors_2018

The area chart is actually worse than the original column chart. It's now much harder to judge the areas of irregularly-shaped pieces. You'd have to add data labels to assist the reader.

The 100-dot version is appealing because the reader can count out the number of each type of honors. But I don't like visual designs that turn readers into bean-counters.

So I experimented with ways to simplify the counting. If counting is easier, then making comparisons is also easier.

Start with this observation: When asked to count a large number of objects, we group by 10s and 5s.

So, on the left chart below, I made connectors to form groups of 5 or 10 dots. I wonder if I should use different line widths to differentiate groups of five and groups of ten. But the human brain is very powerful: even when I use the same connector style, it's easy to see which is a 5 and which is a 10.

Redo_paw_honors_2

On the left chart, the organizing principles are to keep each connector to its own row, and within each category, to start with 10-group, then 5-group, then singletons. The anti-principle is to allow same-color dots to be separated. The reader should be able to figure out Summa = 10+3, Magna = 10+5+1, Cum Laude = 10+5+4.
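
The organizing principle on the left chart amounts to a greedy decomposition of each count into 10-groups, then 5-groups, then singletons. A small sketch of the idea (my own illustration, not the chart's actual construction code):

```python
def connector_groups(n):
    """Greedily decompose a dot count into 10-groups, 5-groups, singletons."""
    tens, rest = divmod(n, 10)
    fives, singles = divmod(rest, 5)
    return [10] * tens + [5] * fives + [1] * singles

# Honor percentages implied by the text: Summa 13, Magna 16, Cum Laude 19
print(connector_groups(13))   # [10, 1, 1, 1]       -> Summa = 10 + 3
print(connector_groups(16))   # [10, 5, 1]          -> Magna = 10 + 5 + 1
print(connector_groups(19))   # [10, 5, 1, 1, 1, 1] -> Cum Laude = 10 + 5 + 4
```

The decomposition is what lets the reader count groups instead of individual dots.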

The right chart is even more experimental. The anti-principle is to allow bending of the connectors. I also give up on using both 5- and 10-groups. By only using 5-groups, readers can rely on their instinct that anything connected (whether straight or bent) is a 5-group. This is powerful. It relieves the effort of counting while permitting the dots to be packed more tightly by respective color.

I also exploited symmetry to further reduce the counting effort. Symmetry is powerful because it removes duplicate effort. In the above chart, once the reader has figured out how to read Magna, reading Cum Laude is simplified: the two categories share two straight connectors, and two bent connectors that are mirror images, so it's clear that Cum Laude exceeds Magna by exactly three dots (percentage points).

***

Of course, if the message you want to convey is that roughly half the graduates earn honors, and those honors are split almost evenly into thirds, then the column chart is sufficient. If you do want to use a pictogram, spend some time thinking about how to reduce the counting effort!

Crazy rich Asians inspire some rich graphics

On the occasion of the hit movie Crazy Rich Asians, the New York Times did a very nice report on Asian immigration in the U.S.

The first two graphics will be of great interest to those who have attended my free dataviz seminar (coming to Lyon, France in October, by the way - register here), as they deal with a related issue.

The first chart shows an income gap widening between 1970 and 2016.

Nyt_crazyrichasians_incomegap1

This uses a two-lines design in a small-multiples setting. The distance between the two lines is labeled the "income gap". The clear story is that the income gap is widening over time across the board, especially rapidly among Asians, followed by whites.

The second graphic is a bumps chart (slopegraph) comparing the endpoints of 1970 and 2016, using an "income ratio" metric, that is, the ratio of the 90th-percentile income to the 10th-percentile income.
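
The "income ratio" metric is straightforward to compute from a set of incomes. A minimal sketch with simulated incomes (made-up data, not the NYT's):

```python
import numpy as np

# Simulated household incomes; lognormal is a common rough model for income
rng = np.random.default_rng(1)
incomes = rng.lognormal(mean=11, sigma=0.8, size=10_000)

p90, p10 = np.percentile(incomes, [90, 10])
income_ratio = p90 / p10   # the metric plotted in the bumps chart
print(income_ratio > 1)    # True: the 90th percentile exceeds the 10th
```

A bigger ratio means the top of the distribution is pulling further away from the bottom, which is exactly what the bumps chart tracks between 1970 and 2016.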

Nyt_crazyrichasians_incomeratio2

Asians are still a key story on this chart, as income inequality has ballooned from 6.1 to 10.7. That is where the similarity ends.

Notice how whites now appear at the bottom of the list while blacks show up with the second-"worst" income inequality. Even though the underlying data are the same, what can be seen in the bumps chart is hidden in the two-lines design!

In short, the scale of the two-lines design squashes the small numbers. The bottom 10 percent did see an increase in income over time, but because those increases pale in comparison to the large incomes, they do not show up.

What else does not show up in the two-lines design? Notice that in 1970, the income ratio for blacks was 9.1, way above the other racial groups.

Kudos to the NYT team for realizing that the two-lines design provides an incomplete, potentially misleading picture.

***

The third chart in the series is a marvellous scatter plot (with one small snafu, which I'll get to).

Nyt_crazyrichasians_byethnicity

What are all the things one can learn from this chart?

  • There is, as expected, a strong correlation between having college degrees and earning higher salaries.
  • The Asian immigrant population is diverse, from the perspectives of both education attainment and median household income.
  • The largest source countries are China, India and the Philippines, followed by Korea and Vietnam.
  • The Indian immigrants are on average professionals with college degrees and high salaries, and form an outlier group among the subgroups.

Through careful design decisions, those points are clearly conveyed.

Here's the snafu. The designer forgot to say which year is being depicted. I suspect it is 2016.

Dating the data is very important here because of the following excerpt from the article:

Asian immigrants make up a less monolithic group than they once did. In 1970, Asian immigrants came mostly from East Asia, but South Asian immigrants are fueling the growth that makes Asian-Americans the fastest-expanding group in the country.

This means that a key driver of the rapid increase in income inequality among Asian-Americans is the shift in composition of the ethnicities. More and more South Asian (most of whom are Indians) arrivals push up the education attainment and household income of the average Asian-American. Not only are Indians becoming more numerous, but they are also richer.

An alternative design is to show two bubbles per ethnicity (one for 1970, one for 2016). To reduce clutter, the smaller ethnicities can be aggregated into Other or South Asian Other. This chart may help explain the driver behind the jump in income inequality.