I discussed the rose chart used in the Environmental Performance Index (EPI) report last week. This type of data is always challenging to visualize.
One should start with an objective. If the goal is a data dump, that is to say, all you want is to deliver the raw data in its full glory to the user, then you should just print a set of data tables. This has traditionally been the delivery mechanism of choice.
If, on the other hand, your interest is communicating insights, then you need to ask some interesting questions. One such question is how do different regions and/or countries compare with each other, not just in the overall index but also in the major sub-indices?
Learning to ask such a question requires first understanding the structure of the data. As described in the previous post, the EPI is a weighted average of a bunch of sub-indices. Each sub-index measures "distance to a target," which is then converted into a scale from 0 to 100. This formula guarantees that at the aggregate level, the EPI is not going to be 0 or 100: a country would have to score 100 on all sub-indices to attain EPI perfection!
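To make the aggregation concrete, here is a minimal sketch of a weighted average of sub-indices. The sub-index names, weights, and scores are invented for illustration and are not the actual EPI weights; the point is only that a single shortfall on any sub-index pulls the aggregate below 100.

```python
def weighted_index(scores, weights):
    """Weighted average of sub-index scores, each on a 0-100 scale."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weights[name] for name in weights) / total_weight

# Hypothetical weights and scores, for illustration only (not real EPI data).
weights = {"air": 0.25, "water": 0.25, "forests": 0.5}
perfect = {"air": 100, "water": 100, "forests": 100}
nearly = {"air": 100, "water": 100, "forests": 90}

epi_perfect = weighted_index(perfect, weights)  # reaches 100 only with all 100s
epi_nearly = weighted_index(nearly, weights)    # one shortfall lowers the total
```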
Here is a design sketch to address the question posed above:
For a print version, I chose several reference countries listed at the bottom that span the range of common values. In the final product, hovering over a stripe should disclose a country and its EPI. Then the reader can construct comparisons of the type: "Thailand has a value of 53, which places it between Brazil and China."
The chart reveals a number of insights. Each region stakes out its territory within the EPI scale. There are no European countries with EPI lower than 45 while there are no South Asian countries with EPI higher than 50 or so. Within each region, the distribution is very wide, and particularly so in the East Asia and Pacific region. Europe is clearly the leading region, followed by North America.
The same format can be replicated for every sub-index.
This type of graph addresses a subset of the set of all possible questions and it does so in a clear way. Modesty in your goals often helps.
At the conference in Bavaria, Jay Emerson asked participants to provide comments on the data visualization of the 2014 Environmental Performance Index (link). We looked at the country profiles in particular. Here is one for Singapore:
The main object of interest here is the "rose chart." To understand it, we need to know the methodology behind the index. The index is a weighted average of nine sub-indices, as shown in the table at the bottom. In many cases, the sub-index is itself an average of sub-sub-indices. These lower-level indices measure the distance between a country's performance and some target performance, typically set at the international level. But those distances are converted into a scale between 0 and 100 so the country with a score of zero did the worst in terms of meeting the target while the country with 100 did the best.
In the rose chart, the circle is divided evenly into nine sectors, each representing a sub-index. The data are encoded in the radius of the sectors. Colors map to the sub-index, and the legend is provided in two ways: a hover-over on the Web, and the table below.
Here is the equation that connects the data (EPI) to the area of the sectors:
There are a number of issues with this representation. First, because of the squaring of the EPI, the area is distorted. If one country's EPI is twice another's, its sector area is four times as large. Another way to see this is to notice that as the EPI increases, the curved edge of the sector moves outwards, tracing a larger circumference.
Another issue is the one-ninth factor, which implies that each of those nine sub-indices is equally important. The diagram below shows that interpretation to be incorrect. (The nine sub-indices are shown in the second layer from the outside in.)
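The distortion can be checked with a short sketch. It assumes the radius of a sector equals the score and that the circle is split into nine equal sectors, as described above; the function names and the scores of 40 and 80 are hypothetical. Taking the square root of the score before using it as the radius is a common remedy for this kind of distortion, shown here only for comparison.

```python
import math

def sector_area(radius, n_sectors=9):
    """Area of one of n equal sectors of a circle with the given radius."""
    return math.pi * radius ** 2 / n_sectors

def sqrt_radius(score):
    """Radius that makes the sector area proportional to the score."""
    return math.sqrt(score)

# Radius encodes the score directly: doubling the score quadruples the area.
ratio_radius = sector_area(80) / sector_area(40)

# Radius encodes the square root of the score: doubling the score doubles the area.
ratio_sqrt = sector_area(sqrt_radius(80)) / sector_area(sqrt_radius(40))
```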
A third issue is illustrated in the Singapore rose. Notice from the table below that Singapore scored zero on Fisheries. But in the rose, Fisheries has a non-zero area. Think of this practice as coring an apple. The middle circle of radius k should be ignored. If the sector that has the color of Fisheries has zero area, then the entire red circle shown below should have zero area.
With these three adjustments, the encoding formula becomes rather more complicated:
where x depends on the weight of the sub-index, and k is the radius of the sector that represents value zero.
***
The rose/radar/spider type charts are more useful when placed side by side to compare countries. But even then, this chart form doesn't work well for this dataset. This is because the spacing of countries within each sub-index is not uniform.
The site has a visualization of the distribution of sub-index scores by issue:
We can see that in the case of water resources, most countries are not doing very well at all. In terms of air quality, most countries except for those in the right tail have performed quite well. It is hard to interpret the indices unless one has an idea of the full distribution.
Finally, one thing the EPI people did makes me happy. They have created PDFs and images of their data visualizations, so it is quite easy to save and keep some of this work. All too often, browser-based technologies create visualizations that can't be saved.
I found this chart in a Munich publication called Süddeutsche Zeitung. It appeared during the most recent Greek/Euro crisis.
The bags of money were financial obligations that were coming due from June 2015 to December 2015. There were three creditors, indicated by red, blue and gray.
This graphic answers one question well: individual debt obligations for a given month and given creditor. However, by privileging these details, the chart fails to convey cumulative totals well - readers have to make calculations in their heads.
In the revision, I wanted to convey two key messages: the total amount of debt that was coming due in those seven months, and the relative proportion of debt owed to the three creditors. An area chart brings this out better.
Conversely, it is much harder to figure out individual debt obligations by month and creditor from this version.
This points to the importance of determining your key message(s) before choosing a form.
The past week in Seattle, I was blessed with amazing weather. The city has great coffee and restaurants, so I was pleased all right.
But Seattle-ites, please tell your government to burn your transit map presto! I tried looking at the map three or four times, and each time, my eyes were burning so much from the colors, the details, the lack of labels, the general confusion that I gave up. Yes, that's the worst thing an information graphics designer wants to hear - the reader waves the white flag.
How do you make sense of that? In the excerpt below, I labeled with black boxes my desired origin and destination.
There are many obstacles to figuring out a route. Firstly, the precise locations of bus stops are not indicated on the map. From the black box up top, if I wanted to catch a bus, I wasn't even sure which corner to go to! Seattle, by the way, is full of one-way streets. Eventually, you realize that different lines have different operators, and they don't use a common ticket.
I ended up at the Westlake Station wanting to take public transit to the International District. I purchased a ticket from the machine. Then I boarded a bus seemingly heading in the right direction. The bus driver stared me down as if I just stepped into disputed territory. She told me my ticket was for a train. I asked her how I'd catch a train. Her eyes told me to get off quickly or else...
I too thought I bought a train ticket but it turns out the train and buses share the same platform.
Back to the map, it would appear that the green line labeled 40 would be useful to me. I tried to trace the green line but it started looping around and I gave up.
The reason for the infrequent posting is my travel schedule. I spent the past week in Seattle at JSM. This is an annual meeting of statisticians. I presented some work on fantasy football data that I started while writing Numbersense.
For my talk, I wanted to present the ubiquitous league table in a more useful way. The league table is a table of results and relevant statistics, at the team level, in a given sports league, usually ordered by the current winning percentage. Here is an example of ESPN's presentation of the NFL end-of-season league table from 2014.
If you want to know weekly results, you have to scroll to each team's section, and look at this format:
For the graph that I envisioned for the talk, I wanted to show the correlation between Points Scored and winning/losing. Needless to say, the existing format is not satisfactory. This format is especially poor if I want my readers to be able to compare across teams.
The graph that I ended up using is this one:
The teams are sorted by winning percentage. One thing should be pretty clear... the raw Points Scored are only weakly associated with winning percentage. Especially in the middle of the Points distribution, other factors are at play determining if the team wins or loses.
The overlapping dots present a bit of a challenge. I went through a few other drafts before settling on this.
The same chart but with colored dots, and a legend:
Only one line of dots per team instead of two, and also requiring a legend:
Jittering is a popular solution to separating co-located dots but the effect isn't very pleasing to my eye:
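For readers unfamiliar with the technique, here is a minimal sketch of jittering: each dot is moved by a small random amount along the axis that carries no data, so that identical values no longer sit exactly on top of each other. The function name, the jitter width, and the points values are hypothetical, and this is not the tool used to produce the charts in this post.

```python
import random

def jitter(values, width=0.2, seed=0):
    """Offset each value by uniform random noise in [-width, +width]."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [v + rng.uniform(-width, width) for v in values]

# Hypothetical weekly points for one team; the two 24s would overlap exactly.
points = [24, 24, 17, 31]

# All dots start on the same team row; only the row position is jittered,
# so the Points Scored values themselves are left untouched.
row = [1.0] * len(points)
jittered_row = jitter(row)
```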
Small multiples is another frequently prescribed solution. Here I separated the Wins and Losses in side-by-side panels. The legend can be removed.
As usual, sketching is one of the most important skills in data visualization; and you'd want to have a tool that makes sketching painless and quick.
Via Twitter, Andrew B. (link) asked if I could comment on the following chart, published by PC Magazine as part of their ISP study. (link)
This chart is decent, although it can certainly be improved. Here is a better version:
A couple of little things are worth pointing out. The choice of red and green to indicate down and up speed respectively is baffling. Red and green are loaded colors which I often avoid. A red dot unfortunately signifies STOP, but ISP users would definitely not want to stop on their broadband superhighway!
In terms of plot symbols, up and down arrows are natural for this data.
Using the Trifecta checkup (link), I am most concerned about the D(ata) corner.
The first sign of trouble is the arbitrary construction of an "Index". This index isn't really an index because there is no reference level. The so-called index is really a weighted average of the download and upload speeds, with 80% weight given to the former. In reality, the download speeds are weighted even higher because download speeds are multiples of the upload speeds, in their original units.
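A small sketch shows why the effective weight on downloads exceeds 80%. The speeds below are hypothetical (a download speed five times the upload speed), and the function is my reading of the weighted average described above, not PC Magazine's published code.

```python
def speed_index(download, upload, w_download=0.8):
    """Weighted average of download and upload speeds in their original units."""
    return w_download * download + (1 - w_download) * upload

# Hypothetical speeds in Mbps: download is five times the upload speed.
download_mbps, upload_mbps = 30.0, 6.0

index = speed_index(download_mbps, upload_mbps)

# Share of the index value that comes from the download term.
download_share = 0.8 * download_mbps / index
```

With these made-up numbers, the download term supplies roughly 95% of the index value, not 80%, because the two speeds are averaged without first being put on a common scale.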
Besides, putting these ISPs side by side gives an impression that they are comparable things. But direct comparison here is an invitation to trouble. For example, Verizon is represented only by its FIOS division (fiber optics). We have Comcast and Cox which are cable providers. The geographical footprints of these providers are also different.
This is not a trivial matter. Midcontinent operates primarily in North and South Dakota. Some other provider may do better than Midcontinent on average but within those two states, the other provider may perform much worse.
Note that the data came from the Speedtest website (over 150,000 speed tests). In my OCCAM framework (link), this dataset is Observational, without Controls, seemingly Complete, and Adapted (from speed testing for technical support).
Here is the author's disclosure, which should cause concern:
We require at least 50 tests from unique IP addresses for any vendor to receive inclusion. That's why, despite a couple of years of operation, we still don't have information on Google Fiber (to name one such vendor). It simply doesn't have enough users who took our test in the past year.
So, the selection of providers is based on the frequency of Speedtest queries. Is that really a good way to select samples? The author presents one possible explanation for why Google Fiber is absent - that it has too few users (without any evidence). In general, there are many reasons for such an absence. One might be that a provider is so good that few customers complain about speeds and therefore they don't do speed tests. Another might be that a provider has a homegrown tool for measuring speeds. Or any number of other reasons. These reasons create biases in various directions, which makes the analysis confusing.
Think about your own behavior. When was the last time you did a speed test? Did you use Speedtest.com? How did you hear about them? For me, I was pointed to the site by the tech support person at my ISP. Of course, the reason why I called them was that I was experiencing speed issues with my connection.
Given the above, do you think the set of speed measurements used in this study gives us accurate estimates of the speeds delivered by ISPs?
While the research question is well worth answering, and the visual form is passable, it is hard to take the chart seriously because of how this data was collected.
This post is primarily intended for those who are planning a visit.
One of the smartest design decisions is to line everything up along one street (the Decumano). It will take some genius to get lost even though there are many dozens of buildings. Once you get to the far end of the Decumano, there is a smaller road that runs perpendicular to it, which houses the buildings that showcase individual regions of Italy. This smaller road leads to the Tree of Life structure, where I found those delightful, swirling chairs. Here they are again:
The EXPO site is in the Milan suburbs. It is easily accessible by the Metro (subway) or by train. Either means of transportation takes about 20 minutes. The train takes riders right to the entrance, saving 10 minutes of walking from the subway stop, but depending on your origin, the train may be inconvenient. I later discovered that there are two subway exits: one exit links to an overpass while the other one to an underpass. Choose carefully if under/over makes a difference for you.
You need to carry a printed copy of your ticket. Your bags will be scanned. Liquids are allowed and are also scanned. This process is painless unless you fight with the crowds that appear at 7 pm because of reduced-price entry. Most pavilions close by 9 pm, leaving only restaurants open.
The food is great if you bring realistic expectations. You’re at a fair, not a gourmet food market. I was very happy with what I ate, and here are some highlights.
Eataly is there in a big way. They have 10 or 12 restaurants, representing different regions in Italy. Eataly is a high-end supermarket / restaurant chain that started in Italy and now also has stores in New York, Boston and Chicago. The food is not spectacular but way better than your average meal. If you want Italian food, you won’t go wrong here. I particularly like the Tuscany (Toscana) menu, serving two of my favorites: panzanella (bread salad), and pici (an extra-thick spaghetti) with duck ragu. You have to walk all the way to the back of the Eataly row to find the Toscana section.
Inside the Pavilions. You can fill yourself by sampling snacks as you run around the pavilions. I recommend this strategy because your schedule will be dominated by trying to get into certain pavilions (or more pavilions). The food is going to be hit or miss. Austria (left) has great stuff. France looks good. Belgium serves pub grub and beer. Holland has food trucks, mostly fast food. I liked the summer rolls in the Vietnam pavilion (right).
Vietnam and Belgium
Russia was giving away caviar on toast, which attracted a mob. Heard Chile has good food. Mexico has a food line. If you like cannoli, go to the “Civil Society” building and visit the Sicilian vendor.
You can always go to McDonald’s for American fast food. There are also various places where you can get Italian fast food, such as simple pastas and pizzas.
Several pavilions have proper sit-down restaurants. I can’t vouch for them as I didn’t try them. The French pavilion for example has a restaurant upstairs. I think Russia also has a restaurant.
Gelato. When I am in Italy, I eat gelato every day. Gelato is a godsend on these hot summer days. There are many places to get gelato at the EXPO. My favorite is Pernigotti, which has a booth in the chocolate area. I also got gelato behind the Israel pavilion. There is a small stand outside the Italy Pavilion. Also across from the Italy Pavilion, the Love It food store serves gelato on the far side. Granita (slushed ice drinks) would have been even better but I didn’t find any worth mentioning here.
Espresso. The safe and great options include Lavazza and Illy. Lavazza is in the Italian regions street, which runs perpendicular to the Decumano. Lavazza has some great-looking tarts and cakes, in addition to coffee. Illy is in the coffee exhibition area.
I also enjoyed France (most on-subject), Morocco, Slow Food, and especially the chocolate area.
I didn’t make it to Japan, Kazakhstan, China or Italy. Those attracted excellent reviews but the lines were too long. Several countries (Japan, Kazakhstan, etc.) produce staged experiences, which means once you are inside, you have to spend at least 30-45 minutes.
One of the smart things Noah (at WNYC) showed to my class was his NFL fan map, based on Facebook data.
This is the "home" of the visualization:
The fun starts by clicking around. Here are the Green Bay fans on Facebook:
Also, you can see these fans relative to other teams in the same division:
A team like Jacksonville has a tiny footprint:
What makes this visualization work?
Notice the "home" image and those straight black lines. They are the "natural" regions of influence, if you assume that all fans root for the team that they are physically closest to.
To appreciate this, you have to look at a more generic NFL fan map (this is one from Deadspin):
This map is informative but not as informative as it ought to be. The reference points provided here are the state boundaries but we don't have one NFL team per state. Those "Voronoi" boundaries Noah added are more reasonable reference points to compare to the Facebook fan data.
When looking at the fan map, the most important question you have is what each team's region of influence is. This work reminds me of what I wrote before about the Beer Map (link). Putting all beer labels (or NFL teams) onto the same map makes it hard to get quick answers to that question. A small-multiples presentation is more direct, as the reader can see the brands/teams one at a time.
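The idea behind those boundaries can be sketched in a few lines: a location belongs to the Voronoi region of whichever team is closest to it. The team labels and coordinates below are invented, and plain Euclidean distance stands in for the geographic distance a real map would need; this is an illustration of the concept, not Noah's implementation.

```python
import math

# Hypothetical planar coordinates; not real stadium locations.
teams = {"A": (0.0, 0.0), "B": (10.0, 0.0), "C": (5.0, 8.0)}

def nearest_team(point, teams):
    """Return the team whose location is closest to the point (Euclidean)."""
    return min(teams, key=lambda name: math.dist(point, teams[name]))

# Each location falls in the region of its closest team.
fan_locations = [(1.0, 1.0), (9.0, 1.0), (5.0, 7.0)]
regions = [nearest_team(p, teams) for p in fan_locations]
```

Comparing each county's most-liked team against its nearest team is then a direct way to see where fandom departs from pure geography.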
Here, Noah makes use of interactivity to present these small multiples on the same surface. It's harder to compare multiple teams but that is a secondary question. He does have two additions in case readers want to compare multiple teams. If you click instead of mousing over a team, the team's area of influence sticks around. Also, he created tabs so you can compare teams within each division.
I usually hate hover-over effects. They often hide things that readers want (creating what Noah calls "scavenger hunts"). The hover-over effect is used masterfully here to organize the reader's consumption of the data.
Moving to the D corner of the Trifecta checkup. Here is Noah's comment on the data:
Facebook likes are far from a perfect method for measuring NFL fandom. In sparsely-populated areas of the country, counties are likely to have a very small sample size. People who like things on Facebook are also not a perfect cross-section of football fans (they probably skew younger, for example). Other data sources that could be used as proxies for fan interest (but are subject to their own biases) are things like: home game attendance, merchandise sales, TV ratings, or volume of tweets about a team.
The #WSD2015 challenge is to build an info-graphic or dynamic visualization featuring the latest data from the 2015 Millennium Development Goals report. This challenge is particularly geared towards designers, programmers and data scientists who are passionate about contributing open source software and data visualizations to promote peace, development, human rights and environmental sustainability. The winners will be announced during the United Nations World Statistics Day 2015 and featured on some UN websites. The deadline for submissions is Sunday, 20 September 2015.
In addition, please note that the Unite Ideas website will feature other challenges in the months to come. I will personally send you an e-mail to let you know when these new challenges are posted.
Feel free to share this invitation with your students, colleagues and friends, and to contact me with any questions or ideas related to data science and visualization at the United Nations.