
Losing count of money bags

I found this chart in the Munich-based newspaper Süddeutsche Zeitung. It appeared during the most recent Greek/Euro crisis.


The bags of money were financial obligations that were coming due from June 2015 to December 2015. There were three creditors, indicated by red, blue and gray.

This graphic answers one question well: what is the debt obligation for a given month and a given creditor? However, by privileging these details, the chart fails to convey cumulative totals - readers have to do the additions in their heads.

In the revision, I wanted to convey two key messages: the total amount of debt that was coming due in those seven months, and the relative proportion of debt owed to the three creditors. An area chart brings this out better.
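To make the idea concrete, here is a minimal sketch of the data prep behind such an area chart. The monthly amounts below are hypothetical placeholders, not the actual Greek debt figures, and the creditor labels are illustrative only:

```python
# Sketch of the computation underlying a cumulative area chart.
# All amounts below are made up for illustration (in billions of euros, Jun-Dec).
creditors = {
    "Creditor A": [1.5, 0.3, 0.2, 0.3, 0.0, 0.3, 0.2],
    "Creditor B": [0.0, 3.5, 3.2, 0.0, 0.0, 0.0, 0.0],
    "Creditor C": [0.5, 0.4, 0.3, 0.2, 0.3, 0.2, 0.1],
}

# Running total per creditor: this is what each band of the area chart traces.
cumulative = {
    name: [sum(amts[: i + 1]) for i in range(len(amts))]
    for name, amts in creditors.items()
}

# The two key messages: total debt coming due, and each creditor's share.
total_due = sum(amts[-1] for amts in cumulative.values())
shares = {name: amts[-1] / total_due for name, amts in cumulative.items()}
```

Stacking these running totals is exactly what lets readers see the overall amount and the relative proportions at a glance, which the bags-of-money chart obscured.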


Conversely, it is much harder to figure out individual debt obligations by month and creditor from this version.

This points to the importance of determining your key message(s) before choosing a form.



When in Seattle, don't look for the bus map

I spent the past week in Seattle, blessed with amazing weather. The city has great coffee and restaurants, so the trip pleased me just fine.

But Seattle-ites, please tell your government to burn your transit map presto!  I tried looking at the map three or four times, and each time, my eyes were burning so much from the colors, the details, the lack of labels, and the general confusion that I gave up. Yes, that's the worst thing an information graphics designer wants to hear - the reader waves the white flag.


How do you make sense of that? In the excerpt below, I marked my desired origin and destination with black boxes.


There are many obstacles to figuring out a route. Firstly, the precise locations of bus stops are not indicated on the map. From the black box up top, if I wanted to catch a bus, I wasn't even sure which corner to go to! Seattle, by the way, is full of one-way streets. Eventually, you realize that different lines have different operators, and they don't use a common ticket.

I ended up at the Westlake Station wanting to take public transit to the International District. I purchased a ticket from the machine. Then I boarded a bus seemingly heading in the right direction. The bus driver stared me down as if I had just stepped into disputed territory. She told me my ticket was for a train. I asked her how I'd catch a train. Her eyes told me to get off quickly or else...

I, too, thought I had bought a train ticket, but it turns out the trains and the buses share the same platform.

Back to the map: it would appear that the green line labeled 40 would be useful to me. I tried to trace the green line, but it started looping around and I gave up.


Reimagining the league table

The reason for the infrequent posting is my travel schedule. I spent the past week in Seattle at JSM. This is an annual meeting of statisticians. I presented some work on fantasy football data that I started while writing Numbersense.

For my talk, I wanted to present the ubiquitous league table in a more useful way. The league table is a table of results and relevant statistics, at the team level, in a given sports league, usually ordered by the current winning percentage. Here is an example of ESPN's presentation of the NFL end-of-season league table from 2014.
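For concreteness, here is a tiny sketch of how such a table is assembled and ordered. The team names and records are made up, not the actual 2014 NFL results:

```python
# Hypothetical (team, wins, losses) records -- illustrative only.
records = [
    ("Team A", 12, 4),
    ("Team B", 7, 9),
    ("Team C", 10, 6),
]

# Winning percentage = wins / games played; league tables are typically
# ordered by this value, highest first.
table = sorted(
    ((team, w, l, w / (w + l)) for team, w, l in records),
    key=lambda row: row[3],
    reverse=True,
)
```

The same ordering by winning percentage carries over to the graphical versions discussed below.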


If you want to know weekly results, you have to scroll to each team's section, and look at this format:


For the graph that I envisioned for the talk, I wanted to show the correlation between Points Scored and winning/losing. Needless to say, the existing format is not satisfactory. This format is especially poor if I want my readers to be able to compare across teams.


The graph that I ended up using is this one:


The teams are sorted by winning percentage. One thing should be pretty clear... the raw Points Scored are only weakly associated with winning percentage. Especially in the middle of the Points distribution, other factors are at play in determining whether a team wins or loses.

The overlapping dots present a bit of a challenge. I went through a few other drafts before settling on this.

The same chart but with colored dots, and a legend:


Only one line of dots per team instead of two, and also requiring a legend:


Jittering is a popular solution for separating co-located dots, but the effect isn't very pleasing to my eye:
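For readers curious about the mechanics, jittering simply nudges each dot by a small random amount so that co-located dots no longer overlap exactly. A minimal sketch, with hypothetical scores rather than the NFL data:

```python
import random

random.seed(0)  # fix the seed so the jitter is reproducible

# Hypothetical points-scored values for one team; several games share a
# score, so those dots would sit exactly on top of one another.
points = [17, 20, 20, 24, 24, 24, 31]

# Jittering: add a small random offset to each value. The offset must be
# small relative to the axis scale, or the data itself gets distorted.
jitter_width = 0.5
jittered = [p + random.uniform(-jitter_width, jitter_width) for p in points]
```

The aesthetic complaint stands, though: the randomness adds visual noise that carries no information.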


Small multiples is another frequently prescribed solution. Here I separated the Wins and Losses in side-by-side panels. The legend can be removed.



As usual, sketching is one of the most important skills in data visualization; and you'd want to have a tool that makes sketching painless and quick.

Is something rotten behind there?

Via Twitter, Andrew B. (link) asked if I could comment on the following chart, published by PC Magazine as part of their ISP study. (link)



This chart is decent, although it can certainly be improved. Here is a better version:


A couple of little things are worth pointing out. The choice of red and green to indicate down and up speed respectively is baffling. Red and green are loaded colors that I often avoid. A red dot unfortunately signifies STOP, but ISP users definitely do not want to stop on their broadband superhighway!

In terms of plot symbols, up and down arrows are natural for this data.


Using the Trifecta checkup (link), I am most concerned about the D(ata) corner.

The first sign of trouble is the arbitrary construction of an "Index". This index isn't really an index because there is no reference level. The so-called index is really a weighted average of the download and upload speeds, with 80% weight given to the former. In reality, download speeds carry even more weight because, in their original units, download speeds are multiples of upload speeds.

Besides, putting these ISPs side by side gives the impression that they are comparable things. But direct comparison here is an invitation to trouble. For example, Verizon is represented only by its FIOS division (fiber optics), while Comcast and Cox are cable providers. The geographical footprints of these providers also differ.

This is not a trivial matter. Midcontinent operates primarily in North and South Dakota. Some other provider may do better than Midcontinent on average but within those two states, the other provider may perform much worse.


Note that the data came from the Speedtest website (over 150,000 speed tests). In my OCCAM framework (link), this dataset is Observational, without Controls, seemingly Complete, and Adapted (from speed testing for technical support).

Here is the author's disclosure, which should cause concern:

We require at least 50 tests from unique IP addresses for any vendor to receive inclusion. That's why, despite a couple of years of operation, we still don't have information on Google Fiber (to name one such vendor). It simply doesn't have enough users who took our test in the past year.

So, the selection of providers is based on the frequency of Speedtest queries. Is that really a good way to select a sample? The author presents one possible explanation for Google Fiber's absence - that it has too few users (without offering any evidence). In general, there are many reasons for such an absence. One might be that a provider is so good that few customers complain about speeds, and therefore they don't run speed tests. Another might be that a provider has a homegrown tool for measuring speeds. Or any number of other reasons. These reasons create biases in various directions, which makes the analysis hard to interpret.

Think about your own behavior. When was the last time you did a speed test? Did you use Speedtest.com? How did you hear about them? For me, I was pointed to the site by the tech support person at my ISP. Of course, the reason why I called them was that I was experiencing speed issues with my connection.

Given the above, do you think the set of speed measurements used in this study gives us accurate estimates of the speeds delivered by ISPs?

While the research question is well worth answering, and the visual form is passable, it is hard to take the chart seriously because of how this data was collected.