Announcing a new venture

This is a great time for people in the data business. If you go on LinkedIn and search for data jobs, there are several thousand open positions in the New York area alone. Every department within any business is accumulating data, and they need people to help them get value out of that data.

There are also lots of people I meet who would like to transition their careers to take advantage of these openings, but too many of them are being turned away. Many of these people have strong backgrounds in other fields (economics, chemistry, psychology, engineering, IT, etc.) and the analytical smarts to excel in these new data jobs. Yet they are not getting hired. That's because as hiring managers, we prefer the experienced candidate who needs no additional training. We also poach experienced people from other employers instead of training new talent, creating a vicious cycle.

This is the problem that I am trying to solve by launching my new venture - Principal Analytics Prep.




Hiring managers' single biggest complaint about the talent pool is that candidates' skills are too narrow - sometimes too technical, sometimes too "soft". Hiring managers in business units outside engineering/software development - for example, marketing, operations, finance, and customer service - want to hire people who can analyze and interpret data in the business context, communicate findings to non-technical audiences, and contribute to inter-departmental teams solving business problems.

For Principal Analytics Prep, I have assembled a group of passionate instructors - who are in director or above positions in industry, and hiring managers for their teams - to design a broad-based curriculum that helps people upgrade their skills to meet industry needs. Our courses range from coding to statistical reasoning to business skills. The faculty have worked at places such as American Express, Cisco, Goldman Sachs, HBO, McKinsey, Mount Sinai, SiriusXM Radio, and Vimeo, with an average of 10 years in industry.

We are not a pure coding academy, so we welcome people from all disciplines.

We will be launching the first class of students this summer in NYC.


Blog readers, you can help me in the following ways:

  • If you know anyone who's looking to upgrade their skills and get into the business analytics/data science field, tell them about the program
  • If you are interested in teaching a course, contact me
  • I am also looking for part-time help with administration and operations, so if you believe in my vision, contact me

If you have suggestions, please leave a comment. Thank you.



Sorting out what's meaningful and what's not

A few weeks ago, the New York Times Upshot team published a set of charts exploring the relationship between school quality, home prices and commute times in different regions of the country. The following is the chart for the New York/New Jersey region. (The article and complete data visualization is here.)


This chart is primarily a scatter plot of home prices against school quality, which is represented by average test scores. The designer wants to explore the decision to live in the so-called central city versus the decision to live in the suburbs, hence the centering of the chart on New York City. Further, the colors of the dots represent average commute times, divided into two broad categories (under/over 30 minutes). The dots also have different sizes, which I presume measure the populations of each district (but there is no legend for this).

This data visualization has generated some negative reviews, and so has the underlying analysis. In a related post on the sister blog, I discuss the underlying statistical issues. For this post, I focus on the data visualization.


One positive of this chart is that the designer has a very focused question in mind - the choice between living in the central city and living in the suburbs. The line scatter has the effect of highlighting this particular question.

Boy, those lines are puzzling.

Each line connects New York City to a specific school district. The slope of each line is, nominally, the trade-off between home price and school quality: the change in home price for each unit shift in school quality. But these lines don't really measure that trade-off because the slopes span too wide a range.

The average person should have a relatively fixed home-price-to-school-quality trade-off. If we could estimate this average trade-off, it should be represented by a single slope (with a small cone of error around it). The wide range of slopes actually undermines this chart, as it demonstrates that there are many other variables that factor into the decision. Other factors are causing the average trade-off coefficient to vary so widely.
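The slope calculation is simple enough to sketch directly. The numbers below are invented for illustration (the NYT data are not reproduced here), but they show how the implied trade-off varies district by district:

```python
# Hypothetical data: (school quality in grade levels vs. average,
# median home price in dollars). Not the actual NYT figures.
NYC = (0.0, 500_000)
DISTRICTS = {
    "District A": (1.5, 350_000),
    "District B": (0.5, 650_000),
}

def tradeoff_slope(district, city=NYC):
    """Implied change in home price ($) per unit change in school quality,
    i.e., the slope of the line connecting the city to the district."""
    d_quality = district[0] - city[0]
    d_price = district[1] - city[1]
    return d_price / d_quality

for name, d in DISTRICTS.items():
    print(name, tradeoff_slope(d))
```

Even with two made-up districts, the slopes differ wildly in both magnitude and sign, which is exactly the problem: a coherent average trade-off would produce a narrow band of slopes, not this spread.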


The line scatter is confusing for a different reason. It reminds readers of a flight route map. For example:


The first instinct may be to interpret the locations on the home-price-school-quality plot as geographical. Such misinterpretation is reinforced by the third factor being commute time.

Additionally, on an interactive chart, it is typical to hide the data labels behind mouseovers or clicks. I like the fact that the designer identifies some interesting locales by name without requiring a click. However, one slight oversight is the absence of data labels for NYC. There is nothing to click on to reveal the commute/population/etc. data for central cities.


In the sister blog post, I mentioned another difficulty - most of the neighborhoods are situated to the right and below New York City, challenging the notion of a "trade-off" between home price and school quality. It appears as if most people can spend less on housing and also send kids to better schools by moving out of NYC.

In the New York region, commute times may be the stronger factor relative to school quality. Perhaps families chose NYC because they value shorter commute times more than better school quality. Or, perhaps the improvement in school quality is not sufficient to overcome the negative of a much longer commute. The effect of commute times is hard to discern on the scatter plot as it is coded into the colors.


A more subtle issue can be seen when comparing San Francisco and Boston regions:


One key insight is that San Francisco homes are on average twice as expensive as Boston homes. Also, the variability of home prices is much higher in San Francisco. By using the same vertical scale on both charts, the designer makes this insight clear.

But what about the horizontal scale? There isn't any explanation of this grade-level scale. It appears that the central cities have close to average grade level in each chart, so it seems that each region is individually centered. Otherwise, I'd expect to see more variability in the horizontal positions of the dots across regions.

If one scale is fixed across regions, and the other scale is adapted to each region, then we shouldn't compare the slopes across regions. The fact that the lines are generally steeper in the San Francisco chart may be an artifact of the way the scales are treated.


Finally, I'd recommend aggregating the data rather than plotting individual school districts. The obsession with magnifying little details is a Big Data disease. On a chart like this, users are encouraged to click on individual districts and make inferences. However, as I discussed in the sister blog (link), most of the differences in school quality shown on these charts are not statistically meaningful (whereas the differences on the home-price scale are definitely notable).


If you haven't already, see this related post on my sister blog for a discussion of the data analysis.





Visualizing citation impact

Michael Bales and his associates at Cornell are working on a new visual tool for citations data. This is an area that is ripe for innovation. There is a lot of data available, but it seems difficult to gain insights from it. The prototypical question is how authoritative a particular researcher or research group is, judging from their publications.

A proxy for "quality" is the number of times a paper is cited by others. More sophisticated metrics take into account the quality of the researchers who cite one's work. There are various summary statistics, e.g. the h-index, that attempt to capture the data distribution, but reducing it to a single number may remove too much context.
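To see how much context a single number discards, here is a minimal sketch of the h-index computation, with made-up citation counts. Two researchers with very different publication profiles can share the same h-index:

```python
def h_index(citations):
    """h-index: the largest h such that the researcher has at least
    h papers, each cited at least h times."""
    cites = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(cites, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# A few influential papers vs. many modestly cited papers:
print(h_index([100, 90, 3, 2]))   # prints 3
print(h_index([3, 3, 3, 3, 3]))   # also prints 3
```

Both hypothetical researchers score an h-index of 3, even though one has two highly influential papers and the other has none, which is the loss-of-context problem described above.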

Contextual information is very important for interpretation: certain disciplines might enjoy higher average numbers of citations because researchers tend to list more references, or that papers typically have large numbers of co-authors; individual researchers may have a few influential papers, or a lot of rarely-cited papers or anything in between.

A good tool should be able to address a number of such problems.

Michael is a former student who attended the Data Visualization workshop at NYU (syllabus here), and the class spent some time discussing his citation impact tool. He contacted me to let me know that what we did during the workshop has now reached the research conferences.

Here is a wireframe of the visual form we developed:


This particular chart shows the evolution of citations data over three time periods for a specific sub-field of study. The vertical scale is a percentile rank based on a standard used in the citations industry. We grouped the data into deciles (and within each decile, into thirds) to facilitate understanding. The median rank is highlighted - we can see that in this sub-field, the publications have increased in both quantity and quality, with the median rank improving over the three time periods. Because "review articles" are interpreted differently by some, those are highlighted in purple.
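The decile-and-thirds grouping can be sketched as a simple mapping from percentile rank to a coarse bin. This is a hypothetical illustration; the actual tool's binning rules may differ:

```python
def decile_third(pctile):
    """Map a percentile rank (0-100) to (decile, third-within-decile).
    Decile 1 covers ranks [0, 10), decile 10 covers [90, 100];
    each decile is split into three equal-width thirds."""
    pctile = min(pctile, 99.999)            # keep rank 100 in the top decile
    decile = int(pctile // 10) + 1          # 1..10
    within = pctile - (decile - 1) * 10     # position inside the decile
    third = int(within // (10 / 3)) + 1     # 1..3
    return decile, third

print(decile_third(95))   # prints (10, 2)
```

Coarsening percentile ranks this way trades precision for legibility, which suits the chart's goal of showing broad movement across the three time periods rather than individual paper ranks.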

One of the key strengths of this design is the filter mechanism shown on the right. The citations researcher can customize comparisons. This is really important because the citations data are meaningless by themselves; they only attain meaning when compared to peer groups.

Here is an even rougher sketch of the design:


For a single researcher, this view will list all of his or her papers, ordered by each paper's percentile rank, with review papers given a purple color.

The entire VIVO dashboard project by Weill Cornell Medicine has a github page, but the citation impact tool does not seem to be there at the moment.
Michael tells me the citation impact tool is found here.



Race to the top, Erasmus edition

(This is a submission from reader Lawrence Mayes. Thank you Lawrence!)

I came across this unusual graphical representation of the destinations of scholarship students:


[Kaiser here: The charts are hidden inside an annoying Flash app and it seems that the bottom half of the chart is cropped out.]

(The original can be seen here.)

The question is: which parameter is used to illustrate the figures - line length or angle?

The answer is line length. But the eye is likely to use the angle as the measure, and this is where an error arises. It's almost an optical illusion - the smaller numbers of students lie on the circumferences of smaller circles, and the same length goes further around a smaller circle. Thus, for example, Turkey attracts about one-fifth of the students attracted by Germany, but it looks nearer to half (45 degrees vs 90 degrees).

(The case of Spain is really bizarre - it looks like it's gone round the circle by over 280 degrees but actually what they've done is to break off the line at 90 degrees and stick the bit they broke off back on the diagram at the left.)
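The geometry behind the illusion is easy to verify: the angle swept by an arc equals its length divided by the circle's radius, so equal line lengths cover very different angles on different circles. A quick sketch with made-up radii:

```python
import math

def arc_angle_deg(arc_length, radius):
    """Angle in degrees subtended by an arc of a given length
    on a circle of a given radius (angle = length / radius, in radians)."""
    return math.degrees(arc_length / radius)

# The same arc length on concentric circles: the inner circle's arc
# sweeps a much larger angle, which is the illusion at work.
print(arc_angle_deg(10, 20))   # outer circle: about 28.6 degrees
print(arc_angle_deg(10, 5))    # inner circle: about 114.6 degrees
```

Halving the radius doubles the apparent angle, so a country with far fewer students can still appear to sweep a comparable share of its (smaller) circle.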

I have never seen this type of 'bar chart' before but it is really misleading.


Long-time readers may remember my discussion of the "race-track graph." (here) The "optical illusion" Lawrence mentions above is well known to any track runner. The inside lanes are shorter than outside lanes, so you stagger the starting positions.

Sorting out the data, and creating the head-shake manual

Yesterday's post attracted a few good comments.

Several readers don't like the data used in the NAEP score chart. The authors labeled the metric "gain in NAEP scale scores," which I interpreted to mean "gain scores," a popular way of evaluating educational outcomes. A gain score is the change in test score between (typically consecutive) years. I also interpreted the label "2000-2009" as the average of the year-on-year gain scores during that period.

After thinking about what reader mankoff wrote, which prompted me to download the raw data, I realized that the designer did not compute gain scores. "2000-2009" really means the difference between the 2009 score and the 2000 score, ignoring all values between those end points. So mankoff is correct in saying that the 2009 number was used in both "2000-2009" and "2009-2015" computations.
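The distinction matters. With made-up scores for illustration (not the actual NAEP numbers), the two computations give different quantities, and the endpoint difference silently discards any spike-and-fall between the end points:

```python
# Illustrative scores only; not the actual NAEP data.
scores = {2000: 226, 2003: 234, 2005: 237, 2007: 239, 2009: 239}

# Endpoint difference, which the designer computed:
endpoint_diff = scores[2009] - scores[2000]

# Average gain score, using every assessment in between.
# Consecutive gains telescope: their sum equals the endpoint difference,
# so the average gain is the endpoint difference per interval.
years = sorted(scores)
gains = [scores[b] - scores[a] for a, b in zip(years, years[1:])]
avg_gain = sum(gains) / len(gains)

print(endpoint_diff)   # prints 13
print(avg_gain)        # prints 3.25
```

In this toy series, most of the improvement happened early and progress then stalled, yet the endpoint difference alone cannot distinguish that path from steady improvement.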

This treatment immediately raises concerns. Why is a 10-year period compared to a 7-year period?

Andrew prefers to see the raw scores ("scale scores") instead of relative values. Here is the corresponding chart:


I placed a line at 2009, just to see if there is a reason for that year to be special. (I don't think so.) The advantage of plotting raw scores is that they are easier to interpret. As Andrew said, less abstraction. It also soothes the nerves of those who are startled that the lines for white students appear at the bottom of the gain-score chart.

I suppose the reason why the original designer chose to use score differentials is to highlight their message concerning change in scores. One can nitpick that their message isn't particularly cogent because if you look at 8th grade math or reading scores, comparing 2009 and 2015, there appeared to be negligible change, and yet between those end-points, the scores did spike and then drop back to the 2009 level.

One way to mitigate the confusion that mankoff encountered in interpreting my gain-score graphic is to use "informative" labels, rather than "uninformative" labels.


Instead of saying the vertical axis plots "gain scores" or "change in scores," directly label one end as "no progress" and the other end as "more progress."

Everything on this chart is progress over time, and the stalling of progress is their message. This chart requires more upfront learning, after which the message jumps out. The chart of raw scores shown above has almost no perceptive overhead but the message has to be teased out. I prefer the chart of raw scores in this case.


Let me now address another objection, which pops up every time I convert a bar chart to a line chart (a type of Bumps chart, which has been called slope graphs by Tufte followers). The objection is that the line chart causes readers to see a trend when there isn't one.

So let me make the case one more time.

Start with the original column chart. If you want to know that Hispanic students have seen progress in their 4th grade math scores grind to a halt, you have to shake your head involuntarily in the following manner:


(Notice how the legend interferes with your line of sight.)

By the time you finish interpreting this graphic, you would have shaken your head in all of the following directions:


Now, I am a scavenger. I collect all these lines and rearrange them into four panels of charts. That becomes the chart I showed in yesterday's post. All I have done is to bring to the surface the involuntary motions readers were undertaking. I didn't invent any trends.

Involuntary head-shaking is probably not an intended consequence of data visualization

This chart is in the Sept/Oct edition of Harvard Magazine:


Pretty standard fare. It is even Tufte-esque in its sparing use of axes, labels, and other non-data ink.

Does it bug you how much work you need to do to understand this chart?

Here is the junkchart version:


In the accompanying article, the journalist declared that student progress on NAEP tests came to a virtual standstill, and this version highlights the drop in performance between the two periods, as measured by these "gain scores."

The clarity is achieved through proximity as well as slopes.

The column chart form has a number of deficiencies when used to illustrate this data. It requires too many colors. It induces involuntary head-shaking.

Most unforgivably, it leaves us with a puzzle: does the absence of a column mean no progress, or unknown?


PS. The inclusion of 2009 on both time periods is probably an editorial oversight.



Statistics report raises mixed emotions

It's gratifying to live through the incredible rise of statistics as a discipline. In a recent report by the American Statistical Association (ASA), we learned that enrollment at all levels (bachelor's, master's, and doctorate) has exploded in the last 5-10 years, as "Big Data" gathers momentum.

But my sense of pride takes a hit while looking at the charts that appear in the report. These graphs demonstrate again the hegemony of Excel defaults in the world of data visualization.

Here are all five charts organized in a panel:


Chart #5 (bottom right) catches the eye because it is the only chart with two lines instead of three. You then flip to the prior page to find the legend. The legend tells you the red line is Bachelor's and the green line is PhD. That seems wrong, unless biostats departments do not give out master's degrees.

This is confirmed by chart #2, where we find the blue line (Master) hugging zero.

Presumably the designer removed the blue line from chart #5 because the low counts mean that it fluctuates wildly between 0 and 100 percent and so disrupts the visual design. But the designer forgets to tell readers why the blue line is missing.


It turns out the article itself contradicts all of the above:

For biostatistics degrees, for which NCES started providing data specifically in 1992, master’s degrees track the overall increase from 2010–2014 at 47%...The number of undergraduate degrees in biostatistics remains below 30.

In other words, the legend is mislabeled. The blue line represents Bachelor's and the red line, Master's. (The error was noticed after the print edition went out, because the online version has the correct legend.)


There is another mystery. Charts #2, #3, and #5, all dealing with biostats, have time starting from 1992, while Charts #1 and #4 start from 1987. The charts aren't lined up in a way that would allow comparisons across time.

Similarly, the vertical scale of each chart is different (aside from Charts #3 and #4). This design choice impairs comparison across charts.

In the article, it is explained that 1992 was when the agency started collecting data about biostatistics degrees. Between 1987 and 1992, were there no biostatistics majors? Were biostatistics majors lumped into the counts of statistics majors? It's hard to tell.


While Excel is a powerful tool that has served our community well, its flexibility is often a source of errors. The remedy is to invest ample time in overriding pretty much every default decision in the system.

For example:


This chart, a reproduction of Chart #1 above, was entirely produced in Excel.







A data visualization that is invariant to the data

This map appeared in Princeton Alumni Weekly:


Here is another map I created:


If you think they look basically the same, you got the point. Now look at the data on the maps. The original map displays the proportion of graduates who ended up in different regions of the country. The second map displays the proportion of land mass in different regions of the country.

The point is that this visual design is not self-sufficient. If you cover up the data printed on the map, there is nothing else to see. Further, if you swap in other data series (anything at all), nothing on the map changes. Yes, this map is invariant to the data!

This means the only way to read this map is to read the data directly.


This map also has another issue common to maps: the largest land areas draw the most attention. Here, however, the sizes of the regions are in inverse proportion to the data being depicted - the smaller the values, the larger the areas on the map. This is the scatter plot of the proportion of graduates (the data) against the proportion of land mass:



One quick fix is to use a continuous color scale. In this way, the colors encode the data. For example:


The dark color now draws attention to itself.

Of course, one should think twice before using a map.


One note of curiosity: Given the proximity to NYC, it is not surprising that NYC is the most popular destination for Princeton graduates. Strangely enough, a move from Princeton to New York is considered out of region, by the way the regions are defined. New Jersey is lumped with Pennsylvania, Maryland, Virginia, etc. into the Mid-Atlantic region while New York is considered Northeast.


Visualizing survey results excellently

Surveys generate a lot of data. And, if you have used a survey vendor, you know they generate a ton of charts.

I was in Germany to attend the Data Meets Viz workshop organized by Antony Unwin. Paul and Sascha from Zeit Online presented some of their work at the German publication, and I was highly impressed by this effort to visualize survey results. (I hope the link works for you. I found that the "scroll" fails on some platforms.)

The survey questions attempted to assess the gap between West and East Germans 25 years after reunification.

The best feature of this presentation is the maintenance of one chart form throughout. This is the general format:



This survey question asks whether it is a good thing for mothers to work. The designers plot how the percentage agreeing changes over time. The blue line represents the East German average and the yellow line the West German average. There is a big gap in attitudes between the two sides on this issue, although both regions have grown more accepting of working mothers over time.

All the other lines in the background indicate different subgroups of interest. These subgroups are accessible via the tabs on top. They include gender, education level, and age.

The little red "i" conceals some text explaining the insight from this chart.

Hovering over the "Men" tab leads to the following visual:


Both lines for men sit below the respective averages, but the shapes are roughly the same. (Clicking on the tab highlights the two lines for men while moving the aggregate lines to the background.)

The Zeit team really does an amazing job keeping this chart clean while still answering a variety of questions.

They did make an important choice: not to put every number on this chart. We don't see the percent disagreeing or those who are ambivalent or chose not to answer the question.


Like I said before, what makes this set of charts work is the seamless transition from one question to the next. Every question is given the same graphical treatment, which eliminates the learning time in going from one chart to the next.

Here is one using a Likert scale, and accordingly, the vertical axis goes from 1 to 7. They plotted the average score within each subgroup and the overall average:


Here is one where they combined the top categories into a "Top 2 Box" type metric:



Finally, I appreciate the nice touch of adding tooltips to the series of dots used to aid navigation.


The theme of the workshop was interactive graphics. This effort by the Zeit team is one of the best I have seen. Market researchers take note!


Summer dataviz workshop to start July 1

Registration is open for my dataviz workshop at NYU. (link)

This is a workshop in the sense of a creative writing workshop. Your "writing" consists of dataviz sketches based on datasets you select. In class, we critique everyone's work and produce revisions. You will learn to appreciate good dataviz, to offer constructive and insightful commentary on visualizations, and to be discriminating in receiving feedback.

Last term, half the class worked on datasets related to their jobs. The data sources were diverse: scholarly citation data, World Bank data, commercial sales and market share data, mountaineering accident data, standardized-test item data, speeches by death row inmates, juvenile convicts, and more.

Students pick their own tools. They used Excel, PowerPoint, Tableau, d3, etc.

Here is a past syllabus.

The course runs from July 1 to Aug 5. Register here.