## An exercise in decluttering

##### Apr 18, 2019

My friend Xan found the following chart by Pew hard to understand. Why is the chart so taxing to look at?

It's packing too much.

I first notice the shaded areas. Shading usually signifies "look here". On this chart, the shading is highlighting the least important part of the data. Since the top line shows applicants and the bottom line admitted students, the shaded gap displays the rejections.

The numbers printed on the chart are growth rates but they confusingly do not sync with the slopes of the lines because the vertical axis plots absolute numbers, not rates.

The vertical axis presents the total number of applicants, and the total number of admitted students, in each "bucket" of colleges, grouped by their admission rate in 2017. On the right, I drew in two lines, both growth rates of 100%, from 500K to 1 million, and from 1 to 2 million. The slopes are not the same even though the rates of growth are.
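The arithmetic behind those two hand-drawn lines can be checked with a minimal sketch. The 15-year span is an assumption made only for illustration, and the function names are hypothetical:

```python
def growth_rate(start, end):
    """Relative change: 1.0 means +100%."""
    return (end - start) / start


def slope(start, end, years):
    """Absolute change per year, which is what a linear vertical axis draws."""
    return (end - start) / years


YEARS = 15  # assumed time span, for illustration only

# The two lines described above: 500K to 1 million, and 1 to 2 million.
for start, end in [(500_000, 1_000_000), (1_000_000, 2_000_000)]:
    print(
        f"{start:>9,} -> {end:>9,}: "
        f"growth {growth_rate(start, end):.0%}, "
        f"slope {slope(start, end, YEARS):,.0f} per year"
    )
```

Both lines have the same growth rate, but the second covers twice the absolute change over the same period, so it is drawn twice as steep.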

Therefore, the growth rates printed on the chart must be read as extraneous data unrelated to other parts of the chart. Attempts to connect those rates to the slopes of the corresponding lines are frustrated.

Another lurking factor is the unequal sizes of the buckets of colleges. There are fewer than 10 colleges in the most selective bucket, and over 300 colleges in the largest bucket. We are unable to interpret properly the total number of applicants (or admissions). The quantity of applications in a bucket depends not just on the popularity of the colleges but also the number of colleges in each bucket.

The solution isn't to resize the buckets but to select a more appropriate metric: the number of applicants per enrolled student. The most selective colleges are attracting about 20 applicants per enrolled student while the least selective colleges (those that accept almost everyone) are getting 4 applicants per enrolled student, in 2017.
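Here is a rough sketch of why the per-student metric is not distorted by bucket size. The bucket totals below are made up for illustration (they are not Pew's figures), chosen only so that the ratios land on the 20 and 4 quoted above:

```python
# Hypothetical bucket totals, for illustration only (not Pew's data).
buckets = {
    "most selective": {"colleges": 8, "applicants": 400_000, "enrolled": 20_000},
    "least selective": {"colleges": 300, "applicants": 1_200_000, "enrolled": 300_000},
}


def applicants_per_enrolled(bucket):
    """Applicants divided by enrolled students; the number of colleges drops out."""
    return bucket["applicants"] / bucket["enrolled"]


for name, bucket in buckets.items():
    print(
        f"{name}: {bucket['applicants']:,} applicants in total, "
        f"{applicants_per_enrolled(bucket):.0f} per enrolled student"
    )
```

In this made-up example, the least selective bucket has three times the total applicants simply because it contains many more colleges, yet its ratio is one-fifth that of the most selective bucket.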

As the following chart shows, the number of applicants has doubled across the board in 15 years. This raises an intriguing question: why would a college that accepts pretty much all applicants need more applicants than enrolled students?

Depending on whether you are a school administrator or a student, a virtuous (or vicious) cycle has been realized. The four most selective groups of colleges have been able to attract progressively more applicants. Since class sizes did not expand appreciably, more applicants result in ever-lower admit rates. A lower admit rate reduces the chance of getting admitted, which causes prospective students to apply to even more colleges, which further suppresses the admit rate.

## No Latin honors for graphic design

##### Oct 03, 2018

This chart appeared in a recent issue of Princeton Alumni Weekly.

If you read the sister blog, you'll be aware that at most universities in the United States, every student is above average! At Princeton, 47% of the graduating class earned "Latin" honors. The median student just missed graduating with honors so the honors graduate is just above average! The 47% number is actually lower than at some other peer schools - at one point, Harvard was giving 90% of its graduates Latin honors.

Side note: In researching this post, I also learned that in the Senior Survey for Harvard's Class of 2018, two-thirds of the respondents (response rate was about 50%) reported a GPA of 3.71 or above, and half reported 3.80 or above, which means their grade average is higher than A-. Since Harvard does not give out A+, half of the graduates received As in almost every course they took, assuming no non-response bias.

***

Back to the chart. It's a simple chart but it's not getting a Latin honor.

Most readers of the magazine will not care about the decimal point. Just write 18.9% as 19%. Or even 20%.

The sequencing of the honor levels is backwards. Summa should be on top.

***

Warning: the remainder of this post is written for graphics die-hards. I go through a bunch of different charts, exploring some fine points.

People often complain that bar charts are boring. A trendy alternative when it comes to count or percentage data is the "pictogram."

Here are two versions of the pictogram. On the left, each percent point is shown as a dot. Then imagine each dot turned into a square, then remove all padding and lines, and you get the chart on the right, which is basically an area chart.

The area chart is actually worse than the original column chart. It's now much harder to judge the areas of irregularly-shaped pieces. You'd have to add data labels to assist the reader.

The 100 dots are appealing because the reader can count out the number of each type of honors. But I don't like visual designs that turn readers into bean-counters.

So I experimented with ways to simplify the counting. If counting is easier, then making comparisons is also easier.

Start with this observation: When asked to count a large number of objects, we group by 10s and 5s.

So, on the left chart below, I made connectors to form groups of 5 or 10 dots. I wonder if I should use different line widths to differentiate groups of five and groups of ten. But the human brain is very powerful: even when I use the same connector style, it's easy to see which is a 5 and which is a 10.

On the left chart, the organizing principles are to keep each connector to its own row, and within each category, to start with 10-group, then 5-group, then singletons. The anti-principle is to allow same-color dots to be separated. The reader should be able to figure out Summa = 10+3, Magna = 10+5+1, Cum Laude = 10+5+4.
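The grouping rule for the left chart (10-groups first, then 5-groups, then singletons) can be written down as a small sketch. The function name is hypothetical, and the counts are the rounded percentages implied by the breakdowns above:

```python
def group_dots(count):
    """Split a dot count into (tens, fives, singletons), largest groups first."""
    tens, rest = divmod(count, 10)
    fives, ones = divmod(rest, 5)
    return tens, fives, ones


# Rounded percentage points per honor level, as read off the breakdowns above.
honors = {"Summa": 13, "Magna": 16, "Cum Laude": 19}

for level, count in honors.items():
    tens, fives, ones = group_dots(count)
    parts = ["10"] * tens + ["5"] * fives + ([str(ones)] if ones else [])
    print(f"{level} = {'+'.join(parts)}")
```

This reproduces the three breakdowns a reader is expected to recover from the connectors.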

The right chart is even more experimental. The anti-principle is to allow bending of the connectors. I also give up on using both 5- and 10-groups. By only using 5-groups, readers can rely on their instinct that anything connected (whether straight or bent) is a 5-group. This is powerful. It relieves the effort of counting while permitting the dots to be packed more tightly by respective color.

I also exploited symmetry to further reduce the counting effort. Symmetry is powerful as it removes duplicate effort. In the above chart, once the reader has figured out how to read Magna, reading Cum Laude is simplified because the two categories share two straight connectors, and two bent connectors that are mirror images, so it's clear that Cum Laude is more than Magna by exactly three dots (percentage points).

***

Of course, if the message you want to convey is that roughly half the graduates earn honors, and those honors are split almost evenly into thirds, then the column chart is sufficient. If you do want to use a pictogram, spend some time thinking about how you can reduce the effort of the counting!

## Crazy rich Asians inspire some rich graphics

##### Sep 18, 2018

On the occasion of the hit movie Crazy Rich Asians, the New York Times did a very nice report on Asian immigration in the U.S.

The first two graphics will be of great interest to those who have attended my free dataviz seminar (coming to Lyon, France in October, by the way; register here), as they deal with a related issue.

The first chart shows an income gap widening between 1970 and 2016.

This uses a two-lines design in a small-multiples setting. The distance between the two lines is labeled the "income gap". The clear story here is that the income gap is widening over time across the board, but especially rapidly among Asians, and then followed by whites.

The second graphic is a bumps chart (slopegraph) that compares the endpoints of 1970 and 2016, but using an "income ratio" metric, that is to say, the ratio of the 90th-percentile income to the 10th-percentile income.

Asians are still a key story on this chart, as income inequality has ballooned from 6.1 to 10.7. That is where the similarity ends.

Notice how whites now appear at the bottom of the list while blacks show up as the second "worst" in terms of income inequality. Even though the underlying data are the same, what can be seen in the bumps chart is hidden in the two-lines design!

In short, the reason is that the scale of the two-lines design is such that the small numbers are squashed. The bottom 10 percent did see an increase in income over time but because those increases pale in comparison to the large incomes, they do not show up.
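To see how the same data can tell different stories under the two metrics, consider a minimal sketch with made-up incomes (these are not the NYT's figures, and the group names are placeholders):

```python
# Hypothetical 10th- and 90th-percentile incomes, for illustration only.
groups = {
    "Group A": {"p10": 20_000, "p90": 120_000},
    "Group B": {"p10": 8_000, "p90": 80_000},
}


def income_gap(g):
    """Absolute difference, which the two-lines design shows as vertical distance."""
    return g["p90"] - g["p10"]


def income_ratio(g):
    """90th-percentile income divided by 10th-percentile income."""
    return g["p90"] / g["p10"]


for name, g in groups.items():
    print(f"{name}: gap {income_gap(g):,}, ratio {income_ratio(g):.1f}")
```

In this made-up case, Group A has the larger gap but Group B has the larger ratio, because the ratio is sensitive to the small numbers at the bottom of the scale that the two-lines design squashes.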

What else does not show up in the two-lines design? Notice that in 1970, the income ratio for blacks was 9.1, way above other racial groups.

Kudos to the NYT team for realizing that the two-lines design provides an incomplete, potentially misleading picture.

***

The third chart in the series is a marvellous scatter plot (with one small snafu, which I'll get to).

What are all the things one can learn from this chart?

• There is, as expected, a strong correlation between having college degrees and earning higher salaries.
• The Asian immigrant population is diverse, from the perspectives of both education attainment and median household income.
• The largest source countries are China, India and the Philippines, followed by Korea and Vietnam.
• The Indian immigrants are on average professionals with college degrees and high salaries, and form an outlier group among the subgroups.

Through careful design decisions, those points are clearly conveyed.

Here's the snafu. The designer forgot to say which year is being depicted. I suspect it is 2016.

Dating the data is very important here because of the following excerpt from the article:

Asian immigrants make up a less monolithic group than they once did. In 1970, Asian immigrants came mostly from East Asia, but South Asian immigrants are fueling the growth that makes Asian-Americans the fastest-expanding group in the country.

This means that a key driver of the rapid increase in income inequality among Asian-Americans is the shift in composition of the ethnicities. More and more South Asian arrivals (most of whom are Indians) push up the education attainment and household income of the average Asian-American. Not only are Indians becoming more numerous, but they are also richer.

An alternative design is to show two bubbles per ethnicity (one for 1970, one for 2016). To reduce clutter, the smaller ethnicities can be aggregated into Other or South Asian Other. This chart may help explain the driver behind the jump in income inequality.

## Education deserts: places without schools still serve pies and story time

##### Aug 22, 2018

I very much enjoyed reading The Chronicle's article on "education deserts" in the U.S., defined as places where there are no public colleges within reach of potential students.

In particular, the data visualization deployed to illustrate the story is superb. For example, this map shows 1,500 colleges and their "catchment areas" defined as places within 60 minutes' drive.

It does a great job walking through the logic of the analysis (even if the logic may not totally convince - more below). The areas not within reach of these 1,500 colleges are labeled "deserts". They then take Census data and look at the adult population in those deserts:

This leads to an analysis of the racial composition of the people living in these "deserts". We now arrive at the only chart in the sequence that disappoints. It is a pair of pie charts:

The color scheme makes it hard to pair up the pie slices. The focus of the chart should be on the over or under representation of races in education deserts relative to the U.S. average. The challenge of this dataset is the coexistence of one large number, and many small numbers.
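One way to put a large share and several small shares on a common footing is a representation ratio: the share of a group in the deserts divided by its share nationally. The sketch below uses made-up shares for a few groups (not The Chronicle's numbers), and the metric is only one possible choice:

```python
# Hypothetical shares (percent of adults) for a subset of groups,
# for illustration only; they are not meant to sum to 100.
us_share = {"White": 60.0, "Hispanic": 18.0, "Black": 12.0, "Native American": 1.0}
desert_share = {"White": 70.0, "Hispanic": 15.0, "Black": 9.0, "Native American": 4.0}


def representation_ratio(group):
    """Above 1 means over-represented in deserts; below 1 means under-represented."""
    return desert_share[group] / us_share[group]


for group in us_share:
    ratio = representation_ratio(group)
    label = "over" if ratio > 1 else "under"
    print(f"{group}: {ratio:.2f}x ({label}-represented)")
```

A ratio treats a small group and a large group on the same scale, which is what a pair of pie charts cannot do.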

Here is one solution:

***

The Chronicle made a commendable effort to describe this social issue. But the analysis has a lot of built-in assumptions. Readers should look at the following list and see if they agree with the assumptions:

• Only public colleges are considered. This restriction requires the assumption that the private colleges pretty much serve the same areas as public colleges.
• Only non-competitive colleges are included. Precisely, the acceptance rate must be higher than 30 percent. The underlying assumption is that the "local students" won't be interested in selective colleges. It's not clear how the 30 percent threshold was decided.
• Colleges that are more than 60 minutes' driving distance away are considered unreachable. So the assumption is that "local students" are unwilling to drive more than 60 minutes to attend college. This raises a couple other questions: are we only looking at commuter colleges with no dormitories? Is the 60 minutes driving distance based on actual roads and traffic speeds, or some kind of simple model with stylized geometries and fixed speeds?
• The demographic analysis is based on all adults living in the Census "blocks" that are not within 60 minutes' drive of one of those colleges. But if we are calling them "education deserts" focusing on the availability of colleges, why consider all adults, and not just adults in the college age group? One further hidden assumption here is that the lack of colleges in those regions has not caused young generations to move to areas closer to colleges. I think a map of the age distribution in the "education deserts" will be quite telling.
• Not surprisingly, the areas classified as "education deserts" lag the rest of the nation on several key socio-economic metrics, like median income, and proportion living under the poverty line. This means those same areas could be labeled income deserts, or job deserts.

At the end of the piece, the author creates a "story time" moment. Story time is when you are served a bunch of data or analyses, and then when you are about to doze off, the analyst calls story time, and starts making conclusions that stray from the data just served!

Story time starts with the following sentence: "What would it take to make sure that distance doesn’t prevent students from obtaining a college degree?"

The analysis provided has nowhere shown that distance has prevented students from obtaining a college degree. We haven't seen anything that says that people living in the "education deserts" have fewer college degrees. We don't know that distance is the reason why people in those areas don't go to college (if true) - what about poverty? We don't know if 60 minutes is the hurdle that causes people not to go to college (if true). We know the number of adults living in those neighborhoods but not the number of potential students.

The data only showed two things: 1) which areas of the country are not within 60 minutes' driving of the subset of public colleges under consideration, 2) the number of adults living in those Census blocks.

***

So we have a case where the analysis is incomplete but the visualization of the analysis is superb. In our Trifecta analysis, this chart poses a nice question and has nice graphics but the use of data can be improved. (Type QV)

## Steel tariffs, and my new dataviz seminar

##### Mar 12, 2018

I am developing a new seminar aimed at business professionals who want to improve their ability to communicate using charts. I want any guidance to be tool-agnostic, so that attendees can implement them using Excel if that’s their main charting software. Over the 12+ years that I’ve been blogging, certain ideas keep popping up; and I have collected these motifs and organized them for the seminar. This post is about a recent chart that brings up a few of these motifs.

This chart has been making the rounds in articles about the steel tariffs.

The chart shows the Top 10 nations that sell steel to the U.S., which together account for 78% of all imports.

The chart shows a few signs of design. These things caught my eye:

1. the pie chart on the left delivers the top-line message that 10 countries account for almost 80% of all U.S. steel imports
2. the callout gives further information about which 10 countries and how much each nation sells to the U.S. This is a nice use of layering
3. on the right side, progressive tints of blue indicate the respective volumes of imports

On the negative side of the ledger, the chart is marred by three small problems. Each of these problems concerns inconsistency, which creates confusion for readers.

1. Inconsistent use of color: on the left side, the darker blue indicates lower volume while on the right side, the darker blue indicates higher volume
2. Inconsistent coding of pie slices: on the right side, the percentages add up to 78% while the total area of the pie is 100%
3. Inconsistent scales: the left chart carrying the top-line message is notably smaller than the right chart depicting the secondary message. Readers’ first impression is drawn to the right chart.

Easy fixes lead to the following chart:

***

The central idea of the new dataviz seminar is that there are many easy fixes that are often missed by the vast majority of people making Excel charts. I will present a stack of these motifs. If you're in the St. Louis area, you get to experience the seminar first. Register for a spot here.

Send this message to your friends and coworkers in the area. Also, contact me if you'd like to bring this seminar to your area.

***

I also tried the following design, which brings out some other interesting tidbits, such as that Canada and Brazil together sell the U.S. about 30% of its imported steel, the top 4 importers account for about 50% of all steel imports, etc. Color is introduced on the chart via a stylized flag coloring.

## Saying no thanks to a box of donuts

##### Feb 26, 2018

As I reported last week, the Department of Education for Delaware is running a survey on dashboard design. The survey link is here.

One of the charts being evaluated is a box of donuts, as shown below:

I have written before about the problem with donut charts (see here). A box of donuts is worse than one donut. Here, each donut references a school year. The composition by race/ethnicity of the student body is depicted. In aggregate, the composition has not changed drastically although there are small changes from year to year.

In the following alternative, I use side-by-side line charts, sometimes called slopegraphs, to illustrate the change by race/ethnicity.

The key decisions are:

• using slopes to encode the year-to-year changes, as opposed to having readers compute those changes by measuring and dividing
• using color to show insights (whether the race/ethnicity has expanded, contracted or remained stable across the three years) as opposed to definitions of the data
• not showing that the percentages within each year sum to 100% as opposed to explicitly presenting this fact in a circular arrangement
• placing annual data side by side on the same plot region as opposed to separating them in three charts

***

There is still a further question of how big a change from year to year is considered material.

This is a good example of why there is never "complete data." In theory, the numbers on this chart are "complete," and come from administrative records. Even when ignoring the possibility that some of the records are missing or incorrect, you still have the issue that the students in the system vary from year to year, so a 1 percent increase in the proportion of Hispanic students may or may not indicate a real demographic trend.

##### Feb 16, 2018

Shane C. asked me to fill out a survey hosted by the Delaware Department of Education. This is a survey about designing their dashboard. And I'm very happy to see that they are doing this. In the survey, you are asked to comment on different ways of presenting certain data, and they want to know which version is "easier to understand". It takes about 5-10 minutes to complete it.

The link to the survey is here, and some background information is here (although you don't really need it if you are just interested in the dataviz side).

I'd highly encourage you to leave text comments at the end if you think - for example - that there are even better ways to show the data.

## Looking above the waist, dataviz style

##### Feb 12, 2018

I came across this chart on NYU's twitter feed.

Growth has indeed been impressive; the dataviz less so. Here's the problem with not starting the vertical scale of a column chart at zero:

In a column chart, the heights of the columns should be proportional to the data. Here they are misaligned because an equal amount has been chopped off below 30,000 from all columns. The light purple that I layered on top of the chart presents the correct heights of the columns, assuming that the first column for 2007 indeed properly encoded the data.

The dark purple top of each column represents the "lie factor." It is the amount of exaggeration created by chopping off those legs. The lie factor is of Ed Tufte coinage.
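The size of the exaggeration can be worked out with a short sketch. The 30,000 baseline comes from the chart as described above; the application counts are made up for illustration. The post uses "lie factor" for the visual excess itself, while the function below computes Tufte's ratio of the effect shown in the graphic to the effect in the data:

```python
def drawn_height(value, baseline):
    """Height of a column when the axis starts at `baseline` instead of zero."""
    return value - baseline


def lie_factor(first, last, baseline):
    """Tufte's ratio: relative change as drawn / relative change in the data."""
    shown = (
        drawn_height(last, baseline) - drawn_height(first, baseline)
    ) / drawn_height(first, baseline)
    actual = (last - first) / first
    return shown / actual


BASELINE = 30_000              # where the chart's vertical axis starts
FIRST, LAST = 35_000, 70_000   # hypothetical application counts (a doubling)

print("true ratio of last to first:", LAST / FIRST)
print(
    "drawn ratio of last to first:",
    drawn_height(LAST, BASELINE) / drawn_height(FIRST, BASELINE),
)
print("lie factor:", lie_factor(FIRST, LAST, BASELINE))
```

With these made-up numbers, a doubling in the data is drawn as an eightfold difference in column height. Setting the baseline to zero brings the lie factor back to 1.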

***

The designer probably wanted to show the year-to-year trend more starkly. Doubling the number of applications in 10 years is pretty impressive. The solution is not to chop off the legs but to look above the waist. You can't fix the column chart but you can switch to a line chart, as follows:

In a line chart, we are mostly concerned with the changing slope of the line segments going from year to year. The slopes encode the year-on-year changes.

## A chart Hans Rosling would have loved

##### Jan 17, 2018

I came across this chart from the OurWorldinData website, and this one would make the late Hans Rosling very happy.

If you went to Professor Rosling's talk, he was bitter that the amazing gains in public health, worldwide (but particularly in less developed nations) during the last few decades have been little noticed. This chart makes it clear: note especially the dramatic plunge in extreme poverty, rise in vaccinations, drop in child mortality, and improvement in education and literacy, mostly achieved in the last few decades.

This set of charts has a simple but powerful message. It's the simplicity of execution that really helps readers get that powerful message.

The text labels on the left and right side of the charts are just perfect.

***

Little things that irk me:

I am not convinced by the liberal use of colors - I would make the "other" category of each chart consistently gray so 6 colors total. Having different colors does make the chart more interesting to look at.

Even though the gridlines are muted, I still find them excessive.

There is a coding bug in the Vaccination chart right around 1960.

## Announcing a new venture

##### Jun 01, 2017

This is a great time for people in the data business. If you go on LinkedIn and look for data jobs, there are several thousand open positions, just in the New York area. Every department within any business is accumulating data, and they need people to help them get value out of the data.

There are also lots of people I meet who would like to transition their careers to take advantage of these open positions but too many of them are being turned away. Many of these people have great backgrounds in other fields (economics, chemistry, psychology, engineering, IT, etc.), and have the analytical smarts to excel in these new data jobs. They are not getting hired. That's because as hiring managers, we prefer hiring the experienced person who doesn't need additional training. We also poach experienced people from other employers, instead of training new talent, creating a vicious cycle.

This is the problem that I am trying to solve by launching my new venture - Principal Analytics Prep.

The single biggest complaint about the talent pool by hiring managers is that people's skills are too narrow, sometimes too technical, sometimes too "soft". Hiring managers in the business units outside engineering/software development, for example, marketing, operations, finance, customer service, want to hire people who can analyze and interpret data in the business context, communicate findings to non-technical audiences, as well as contribute to inter-departmental working teams to solve business problems.

For Principal Analytics Prep, I have assembled a group of passionate instructors - who are in director or above positions in industry, and hiring managers for their teams - to design a broad-based curriculum that helps people upgrade their skills to meet industry needs. Our courses range from coding to statistical reasoning to business skills. The faculty have worked at places such as American Express, Cisco, Goldman Sachs, HBO, McKinsey, Mount Sinai, SiriusXM Radio, and Vimeo, with an average of 10 years in industry.

We are not a pure coding academy; therefore, we want to assemble people from all disciplines.

We will be launching the first class of students this summer in NYC.

***

Blog readers, you can help me in the following ways:

• If you know anyone who's looking to upgrade their skills and get into the business analytics/data science field, tell them about the program
• If you are interested in teaching a course, contact me
• I am also looking for part-time help with administration and operations, so if you believe in my vision, contact me