Scott Klein's team at ProPublica published a worthy news application called "Hell and High Water" (link). I took some time to take in the experience. It's a project that needs room to breathe.
The setting is Houston, Texas, and the subject is what happens when the next big hurricane hits the region. The reference point is Hurricane Ike, which hit Galveston in 2008.
This image shows the depth of flooding at the height of the disaster in 2008.
The app takes readers through multiple scenarios. This next image depicts what would happen (according to simulations) if a storm similar to Ike, but with 15 percent stronger winds, hits Galveston.
One can also speculate about what might happen if the so-called "Mid Bay" solution is implemented:
This solution is estimated to cost about $3 billion.
I am drawn to this project because the designers liberally use some things I praised in my summer talk at the Data Meets Viz conference in Germany.
Here is an example of hover-overs used to annotate text. (My mouse is on the words "Nassau Bay" at the bottom of the paragraph. Much of the Bay would be submerged at the height of this scenario.)
The design has a keen awareness of foreground/background issues. The map uses sparse static labels, indicating the most important landmarks. All other labels are hidden unless the reader hovers over specific words in the text.
I think plotting population density would have been more impactful. With the current set of labels, the perspective is focused on business and institutional impact. I think there is a missed opportunity to highlight the human impact. This can be achieved by coding population density into the map colors. I believe the colors on the map currently represent terrain.
This is a successful interactive project. The technical feats are impressive (read more about them here). A lot of research went into the articles; huge amounts of details are included in the maps. A narrative flow was carefully constructed, and the linkage between the text and the graphics is among the best I've seen.
Here is a map that attracted my attention on the NY Times (link):
The counties are given shades of blue with darker shades meaning more economic distress. According to the label (Newark), the 10 red dots are the top 10 most distressed large cities in the U.S. It appears that almost all of these cities are in regions of light blue shade. That's the puzzle.
A separate issue with this map is that it presents a static image of distress. The first paragraph of the article states: "As the most prosperous communities in the United States have gotten richer since the end of the Great Recession in 2009, economic conditions in many distressed areas have deteriorated even further." It would be great if we could see a comparison of before and after 2009.
Between teaching two classes, and a seminar, and logging two coast-to-coast flights, I was able to find time to rethink the following chart from the Wall Street Journal: (link to article)
I like the right side of this chart, which helps readers interpret what the alcohol consumption guidelines really mean. When we go out and drink, we order beers, or wine, or drinks - we don't think in terms of grams of alcohol.
The left side is a bit clumsy. The biggest message is that the U.K. has tightened its guidelines. This message is delivered by having the U.K. appear twice in the chart, the only country to repeat. To make this clear, the designer highlights the U.K. rows. But the style of highlighting differs between the two rows, because the current U.K. row has to point to the right side while the previous U.K. row does not. This creates a bit of confusion.
In addition, since the U.K. rows are far apart, figuring out how much the guidelines have changed is more work than desired.
The placement of the bars by gender also doesn't help. A side message is that most countries allow men to drink more than women, but the U.K., in revising its guidelines, has followed the Netherlands and Guyana in having the same level for both genders.
After trying a few ideas, I think the scatter plot works out pretty well. One advantage is that it does not arbitrarily order the data (men first, women second) as in the original chart. Another advantage is that it shows the male-female balance more clearly.
An afterthought: I should have added the words "Stricter" and "Laxer" on the two corners of the chart. This chart shows both that the U.K. is getting stricter and that it joins Guyana and the Netherlands as countries that treat men and women equally when it comes to drinking.
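To make the reading of the scatter plot concrete, here is a minimal Python sketch. The country names and gram values are made-up placeholders, not the figures from the Wall Street Journal chart, and the function name is my own. Each country is a point (men's guideline, women's guideline); points on the diagonal treat both genders equally, and points nearer the origin are stricter.

```python
# Placeholder guidelines in grams of alcohol per day: (men, women).
# These values are invented for illustration; they are NOT the WSJ data.
guidelines = {
    "Country A": (40, 20),
    "Country B": (16, 16),
    "Country C": (30, 20),
}


def gender_balance(men, women):
    """Describe where a point sits relative to the men == women diagonal."""
    if men == women:
        return "equal"
    return "men higher" if men > women else "women higher"


for country, (men, women) in guidelines.items():
    print(country, gender_balance(men, women))
```

A bar chart has to pick an order for the two genders; the diagonal does that comparison without privileging either one.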
It's awfully quiet here lately as I am trying to manage a tight schedule. The problem with a tight schedule is the absence of "slack." Without slack, just one little unexpected event ruins your schedule. Like dominoes, everything gets pushed back. That event arrived in dribs and drabs a couple of weeks ago as a major water leak broke out two floors above my apartment. I am still picking up the pieces.
Last week, I crossed the pond and gave a talk about visual story-telling at the SAS headquarters in the U.K. The audience was wonderful and the organizers assembled a great crowd. The event was streamed live to over a thousand viewers all across Europe. Thanks for attending!
Here's me pointing to one of the charts in my presentation:
In the next few weeks, people in the U.S. have a chance to hear a similar presentation. Please come meet me and let me know you read my blog!
New York City, 3/24, 9 a.m. Free registration here
In addition, I will be speaking about the ethics of data science at the INFORMS Analytics Conference, in April, in Orlando. The talk will be followed by a panel discussion.
On a related note, rSQUAREedge is hosting a webinar next week by Augustine Fou, who is a digital advertising fraud investigator. This is also free. Fou will talk about the techniques he uses to uncover "bad" data. In this case, "bad" data are data inserted by adversaries to inflate statistics. This is one of the unspoken and worrisome issues in modern data analysis. One can be very naive in assuming that observational, "found" data are free from manipulation.
Long-time reader Daniel L. isn't a fan of this chart, especially when it is made to spin, as you can see at this link:
Like other 3D charts, this one is hard to read. The vertical lines are both good and bad: they make one dimension very easy to read, but their very existence makes one realize how challenging it is to read the other dimensions without guidelines.
This dataset allows me to show a ternary plot. The ternary plot is an ingenious way of putting three dimensions onto a flat surface. I have found few good uses of this chart type, though.
Let's get to the core of the issue: the analyst started with 25 skills that are frequently required by data science and analytics jobs, and his goal is to classify these skills into three groups. The underlying method used to create these groups is factor analysis.
Each dot above is a skill. The HQ of each grouping of skills (known as a factor) is a corner of the plot. The closer the dot is to the corner, the more relevant that skill is to the skill group.
In the above chart, I highlighted four skills that are not clearly in one or another skill group. For example, Communication straddles the Math/Stats and Business dimensions but scores low on the Technology/Programming dimension.
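For readers curious about how three numbers end up on a flat surface, here is a minimal Python sketch of the usual ternary mapping. It assumes the three scores for a skill are non-negative and can be rescaled to sum to one, which may not be exactly how the original analysis handled its factor loadings. The corner placement and the function name are my own choices, and the sample scores are hypothetical.

```python
import math


def ternary_to_xy(math_stats, business, tech):
    """Map three non-negative scores to a point inside an equilateral triangle.

    Assumed corner placement: Math/Stats at (0, 0), Business at (1, 0),
    Technology/Programming at (0.5, sqrt(3)/2).
    """
    total = math_stats + business + tech
    if total <= 0:
        raise ValueError("scores must sum to a positive number")
    # Rescale so the three shares sum to one.
    b = business / total
    t = tech / total
    # Weighted average of the three corner coordinates.
    x = b + 0.5 * t
    y = (math.sqrt(3) / 2) * t
    return x, y


# Hypothetical skill scoring high on Math/Stats and Business, low on Tech:
# it lands near the bottom edge, midway between the two bottom corners.
print(ternary_to_xy(0.45, 0.45, 0.10))
```

A skill that loads entirely on one factor lands exactly on that corner, and a skill split evenly three ways lands at the center of the triangle.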
The ternary plot has a few problems. Like any scatter plot, once you have 10 or more dots, it is hard to fit all the data labels. Further, the axis labels must be carefully done to help readers understand the plot.
Before long, the chart looks very cluttered. There just isn't enough room to get all your words in. Here is another version of the same chart -- with a different set of annotations.
Instead of drawing attention to those skills that have no clear home, this version of the chart focuses on the dots close to each corner.
I classified two of the skills differently from the original. For example, the Machine Learning skill is part of Math/Stats on my charts but part of Technology/Programming on the original.
The ternary plot is interesting and unusual but is only useful in selected problems.
Twitter user @glennrice called out a "journalist" for producing the following chart:
You can't say the Columbia Heartbeat site doesn't deserve a beating over this graph. I don't recognize the software but my guess is that it is one of those business intelligence (BI) tools that produce canned reports with a button click.
Until I read the article, I kept thinking that there are several overlapping lines being plotted. But it's really a 3D plus color effect!
Wait, there's more. This software treats years as categories rather than as a continuous number. So it drew intervals of 2 years, 1 year, 2 years, and 8 years at equal widths. I am still not sure how this happened because the data set given at the bottom of the article contains annual data.
The y-axis labels, the gridlines, the acronym in the chart title, the unnecessary invocation of start-at-zero, etc. almost make this feel like a parody.
Aside from visual design issues, I am not liking the analysis either. The claim is that taxes have been increasing every year in Columbia, Missouri, and that the additional revenue ended up sitting in banks as cash.
We need to see a number of other data series in order to accept this conclusion. What was the growth in tax revenues relative to the increase in cash? What was the growth in population in Columbia during this period? Did the cash holding per capita increase or decrease? What were the changes in expenditure on schools, public works, etc.?
This is a Type DV chart. There is an interesting question being asked but the analysis must be sharpened and the graphing software must be upgraded asap.
PS. On second thought, I think the time axis might be deliberately distorted. Judging from the slope of the line, the cumulative increase over the last 8 years is about equal to the increase in each of the earlier two-year increments, so if the proper scale were used, the line would flatten out significantly, demolishing the thesis of the article. Thus, it is a case of printing cash, graphically.
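The effect of the distorted axis can be sketched in a few lines of Python. The years and cash values below are hypothetical, chosen only to reproduce the 2-, 1-, 2- and 8-year gaps; they are not the figures from the article, and the function names are mine.

```python
def categorical_positions(years):
    """Where a tool puts the points when it treats years as categories."""
    return list(range(len(years)))


def continuous_positions(years):
    """Where the points belong when years are treated as numbers."""
    return [year - years[0] for year in years]


def change_per_year(years, values):
    """Average change per year between consecutive observations."""
    return [
        (v1 - v0) / (y1 - y0)
        for y0, y1, v0, v1 in zip(years, years[1:], values, values[1:])
    ]


# Hypothetical observations with gaps of 2, 1, 2 and 8 years.
years = [2002, 2004, 2005, 2007, 2015]
cash = [100, 120, 130, 150, 170]  # made-up values

print(categorical_positions(years))
print(continuous_positions(years))
print(change_per_year(years, cash))
```

On the categorical axis, the final segment takes up the same width as the others, so a rise spread over eight years looks as steep as one that took two. Dividing by the actual number of years shows how much flatter that last stretch is.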
On Twitter, someone pointed me to the following map of journalists who were killed between 1993 and 2015.
I wasn't sure if the person who posted this liked or disliked this graphic. We see a clear metaphor of gunshots and bloodshed. But in delivering the metaphor, a number of things are sacrificed:
the number of deaths is hard to read
the location of deaths is distorted, both in large countries (Russia) where the deaths are too concentrated, and in small countries (Philippines) where the deaths are too dispersed
despite the use of a country-level map, it is hard to learn the deaths by country
The Committee to Protect Journalists (CPJ), which publishes the data, used a more conventional choropleth map, which was reproduced and enhanced by Global Post:
They added country names and death counts via a list at the bottom. There is also now a color scale. (Note the different sets of dates.)
In a Trifecta Checkup, I would give this effort a Type DV. While the map is competently produced, it doesn't get at the meat of the data. In addition, these raw counts of deaths do not reveal much about the level of risk experienced by journalists working in different countries.
The limitation of the map can be seen in the following heatmap:
While this is not a definitive visualization of the dataset, I use this heatmap to highlight the trouble with hiding the time dimension. Deaths are correlated with particular events that occurred at particular times.
Iraq is far and away the most dangerous country, but only after the start of the Iraq War, and primarily during the war and its immediate aftermath. Similarly, it was perfectly safe to work in Syria until the last few years.
A journalist can use this heatmap as a blueprint, and start annotating it with various events that are causes of heightened deaths.
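The grid behind a heatmap like this is just a table of counts by country and year. Here is a minimal Python sketch, assuming the data arrive as one (country, year) record per death; that layout is my assumption, not necessarily how CPJ distributes its data, and the sample records are invented placeholders.

```python
from collections import Counter


def count_matrix(records):
    """Turn (country, year) records into a country-by-year grid of counts."""
    if not records:
        return [], [], []
    counts = Counter(records)
    countries = sorted({country for country, _ in records})
    first = min(year for _, year in records)
    last = max(year for _, year in records)
    # Keep every year in the range so quiet years show up as zeros.
    years = list(range(first, last + 1))
    matrix = [[counts[(country, year)] for year in years] for country in countries]
    return countries, years, matrix


# Invented placeholder records, not real data.
records = [("Country X", 2003), ("Country X", 2003), ("Country Y", 2005)]
countries, years, matrix = count_matrix(records)
for country, row in zip(countries, matrix):
    print(country, row)
```

Keeping the zero cells matters: the empty stretches before and after a conflict are what make the timing visible.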
Now the real question in this dataset is the risk faced by journalists in different countries. The death counts give a rather obvious and thus not so interesting answer: more journalists are killed in war zones.
A denominator is missing. How many journalists are working in the respective countries? How many non-journalists died in the same countries?
Also, separating out the causes of death can be insightful.
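The point about the missing denominator can be made concrete with a small Python sketch. All the numbers here are hypothetical, since the dataset does not say how many journalists work in each country, and the function name is mine.

```python
def deaths_per_thousand(deaths, journalists):
    """Deaths per 1,000 working journalists."""
    if journalists <= 0:
        raise ValueError("number of journalists must be positive")
    return 1000 * deaths / journalists


# Hypothetical countries: (deaths, journalists working there).
countries = {
    "Country X": (50, 10000),
    "Country Y": (10, 500),
}

for name, (deaths, journalists) in countries.items():
    print(name, deaths, deaths_per_thousand(deaths, journalists))
```

In this made-up example, Country X has five times the deaths of Country Y but a quarter of the risk per journalist, which is the kind of reversal raw counts cannot show.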