Ridings, polls, elections, O Canada

Stephen Taylor reached out to me about his work to visualize Canadian elections data. I took a look. I appreciate the labor of love behind this project.

He led with a streamgraph, which presents a quick overview of relative party strengths over time.

Stephentaylor_canadianelections_streamgraph

I am no Canadian election expert, and I did a bare minimum of research in writing this blog. From this chart, I learn that:

  • The Canadians have an irregular election schedule
  • Canada has a two-party-plus-breadcrumbs system
  • The two dominant parties are Liberals and Conservatives. The Liberals currently hold just less than half of the seats. The Conservatives have more than half of the seats not held by Liberals
  • The Conservative party (maybe) rebranded as "progressive conservative" for several decades. The Reform/Alliance party was (maybe) a splinter movement within the Conservatives as well.
  • Since the "width" of the entire stream increased over time, I'm guessing the number of seats has expanded

That's quite a bit of information obtained at a glance. This shows the power of data visualization. Notice Stephen didn't even have to include a "how to read this" box.

The streamgraph form has its limitations.

The feature that makes it more attractive than an area chart is its middle anchoring, resulting in a form of symmetry. The same feature produces erroneous intuition - the red patch draws out a declining trend; the reader must fight the urge to interpret the lines and focus on the areas.

The breadcrumbs are well hidden. The legend below discloses that the Green Party holds 3 seats currently. The party has never held enough seats to appear on the streamgraph though.

The bars showing proportions in the legend are a very nice touch. (The numbers appear messed up - I have to ask Stephen whether the seats shown are current values, or some kind of historical average.) I am a big fan of informative legends.

***

The next featured chart is a dot plot of polling results since 2020.

Stephentaylor_canadianelections_streamgraph_polls_dotplot

One can see a three-tier system: the two main parties, then the NDP (yellow) is the clear majority of the minority, and finally you have a host of parties that don't poll over 10%.

It looks like the polls are favoring the Conservatives over the Liberals in this election but it may be an election-day toss-up.

The purple dots represent "PPC" which is a party not found elsewhere on the page.

This chart is clear as crystal because of the structure of the underlying data. It just amazes me that the polls are so highly correlated. For example, across all these polls, the NDP has never once polled better than either the Liberals or the Conservatives, and in addition, it has never polled worse than any of the small parties.

What I'd like to see is a chart that merges the two datasets, addressing the question of how well these polls predicted the actual election outcomes.

***

The project goes very deep as Stephen provides charts for individual "ridings" (electoral districts, roughly comparable to U.S. congressional districts).

Here we see population pyramids for Vancouver Centre, versus British Columbia (Province), versus Canada.

Stephentaylor_canadianelections_riding_populationpyramids

This riding has a large surplus of younger people in their twenties and thirties. Be careful about the changing scales though. The relative differences in proportions are more drastic than visually displayed because the maximum values (5%) on the Province and Canada charts are half that on the Riding chart (10%). Imagine squashing the Province and Canada charts to half their widths.
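To make the scale caveat concrete, here is a minimal Python sketch. The two population shares are invented for illustration and are not read off Stephen's charts; only the axis maxima (10% and 5%) come from the charts themselves.

```python
# Hypothetical shares of the population in one age band (illustrative only).
riding_share = 8.0   # percent, drawn on an axis that tops out at 10%
canada_share = 3.5   # percent, drawn on an axis that tops out at 5%

RIDING_AXIS_MAX = 10.0   # maximum on the Riding chart
CANADA_AXIS_MAX = 5.0    # maximum on the Province and Canada charts


def bar_fraction(share, axis_max):
    """Fraction of the panel width occupied by a bar."""
    return share / axis_max


# What the eye compares: bar lengths relative to each panel's width.
visual_ratio = bar_fraction(riding_share, RIDING_AXIS_MAX) / bar_fraction(
    canada_share, CANADA_AXIS_MAX
)
# What the data say: the ratio of the actual shares.
true_ratio = riding_share / canada_share

print(f"ratio of bar lengths as drawn: {visual_ratio:.2f}")
print(f"ratio of actual shares:        {true_ratio:.2f}")
```

Whatever shares are plugged in, the true ratio is twice the ratio of the drawn bar lengths, which is the "squash to half width" correction.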

Analyses of income and rent/own status are also provided.

This part of the dashboard exhibits a problem common in most dashboards - they present each dimension of the data separately and miss out on the more interesting stuff: the correlation between dimensions. Do people in their twenties and thirties favor specific parties? Do richer people vote for certain parties?

***

The riding-level maps are the least polished part of the site. This is where I'm looking for a "how to read it" box.

Stephentaylor_canadianelections_ridingmaps_pollwinner

It took me a while to realize that the colors represent the parties. If I hadn't come in from the front page, I'd have been totally lost.

Next, I got confused by the use of the word "poll". Clicking on any of the subdivisions brings up details of an actual race, with party colors, candidates and a donut chart showing proportions. The title gives a "poll id" and the name of the riding in parentheses. Since the poll id changes as I mouse over different subdivisions, I'm wondering whether a "poll" is the term for a subdivision of a riding. A quick wiki search indicates otherwise.

Stephentaylor_canadianelections_ridingmaps_donut

My best guess is the subdivisions are indicated by the numbers.

Back to the donut charts, I prefer a different sorting of the candidates. For this chart, the two most logical orderings are (a) order by overall popularity of the parties, fixed for all ridings and (b) order by popularity of the candidate, variable for each riding.

The map shown above gives the winner in each subdivision. This type of visualization dumps a lot of information. Stephen tackles this issue by offering a small multiples view of each party. Here is the Liberals in Vancouver.

Stephentaylor_canadianelections_ridingmaps_partystrength

Again, we encounter ambiguity about the color scheme. Liberals have been associated with a red color but we are faced with abundant yellow. After clicking on the other parties, you get the idea that he has switched to a divergent continuous color scale (red - yellow - green). Is red or green the higher value? (The answer is red.)

I'd suggest using a gray scale for these charts. The hardest decision is going to be the encoding between values and shading. Should each gray scale be different for each riding and each party?

If I were to take a guess, Stephen must have spent weeks if not months creating these maps (depending on whether he's full-time or part-time). What he has published here is a great start. Fine-tuning the issues I've mentioned may take weeks or months more.

****

Stephen is brave and smart to send this project for review. For one thing, he's got some free consulting. More importantly, we should always send work around for feedback; other readers can tell us where our blind spots are.

To read more, start with this post by Stephen in which he introduces his project.


Ranking data provide context but can also confuse

This dataviz from the Economist had me spending a lot of time clicking around - which means it is a success.

Econ_usaexcept_hispanic

The graphic presents four measures of wellbeing in society - life expectancy, infant mortality rate, murder rate and prison population. The primary goal is to compare nations across those metrics. The focus is on comparing how certain nations (or subgroups) rank against each other, as indicated by the relative vertical position.

The Economist staff has a particular story to tell about racial division in the US. The dotted bars represent the U.S. average. The colored bars are the averages for Hispanic, white and black Americans. The wider the gap between the colored bars, the more varied the experiences of the American racial groups.

The chart shows that the racial gap in life expectancy is the widest. For prison population, the U.S. and its racial subgroups occupy many of the lowest (i.e. least desirable) ranks, with the smallest gap in ranking.

***

The primary element of interactivity is hovering on a bar, which then highlights the four bars corresponding to the particular nation selected. Here is the picture for Thailand:

Econ_usaexcept_thailand

According to this view of the world, Thailand is a close cousin of the U.S. On each metric, the Thai value clings pretty near the U.S. average and sits within the range by racial groups. I'm surprised to learn that the prison population in Thailand is among the highest in the world.

Unfortunately, this chart form doesn't facilitate comparing Thailand to a country other than the U.S., as one can highlight only one country at a time.

***

While the main focus of the chart is on relative comparison through ranking, the reader can extract absolute difference by reading the lengths of the bars.

This is a close-up of the bottom of the prison population metric:

Econ_useexcept_prisonpop_bottom

The length of each bar displays the numeric data. The red line is an outlier in this dataset. Black Americans suffer an incarceration rate that is almost three times the national average. Even white Americans (blue line) are imprisoned at a rate higher than most countries around the world.

As noted above, the prison population metric exhibits the smallest gap between racial subgroups. This chart is a great example of why ranking data frequently hide important information. The small gap in ranking masks the extraordinary absolute difference in incarceration rates between white and black America.

The difference between rank #1 and rank #2 is enormous.

Econ_useexcept_lifeexpect_top

The opposite situation appears for life expectancy. The life expectancy values are bunched up especially at the top of the scale. The absolute difference between Hispanic and black America is 82 - 75 = 7 years, which looks small because the axis starts at zero. On a ranking scale, Hispanic is roughly in the top 15% while black America is just above the median. The relative difference is huge.

For life expectancy, ranking conveys the view that even a 7-year difference is a big deal because the countries are tightly bunched together. For prison population, ranking shows the view that a multiple fold difference is "unimportant" because a 20-0 blowout and a 10-0 blowout are both heavy defeats.

***

Whenever you transform numeric data to ranks, remember that you are artificially treating the gap between each value and the next value as a constant, even when the underlying numeric gaps show wide variance.
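Here is a minimal Python sketch of that warning. The five values are made up for illustration and are not the Economist's data; the point is only that the numeric gaps between neighbors vary wildly while the rank gaps are all the same.

```python
# Made-up values for five hypothetical entities (not the Economist's data).
rates = {"A": 2300, "B": 750, "C": 700, "D": 690, "E": 120}

# Rank from the highest value (rank 1) to the lowest (rank 5).
ordered = sorted(rates, key=rates.get, reverse=True)
ranks = {name: position + 1 for position, name in enumerate(ordered)}

# Gap between each entity and the next one down, in numeric and rank terms.
neighbors = list(zip(ordered, ordered[1:]))
numeric_gaps = [rates[a] - rates[b] for a, b in neighbors]
rank_gaps = [ranks[b] - ranks[a] for a, b in neighbors]

print("numeric gaps:", numeric_gaps)
print("rank gaps:   ", rank_gaps)
```

The ranking treats the jump from A to B exactly like the jump from C to D, even though the first is more than a hundred times larger in the underlying numbers.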


Stumped by the ATM

The neighborhood bank recently installed brand new ATMs, with tablet monitors and all that jazz. Then, I found myself staring at this screen:

Banknote_picker_us

I wanted to withdraw $100. I ordinarily love this banknote picker because I can get the $5, $10, $20 notes, instead of $50 and $100 that come out the slot when I don't specify my preference.

Something changed this time. I find myself wondering which row represents which note. For my non-U.S. readers, you may not know that all our notes are the same size and color. The screen resolution wasn't great and I had to squint really hard to see the numbers of those banknote images.

I suppose if I grew up here, I might be able to tell the note values from the figureheads. This is an example of a visualization that makes my life harder!

***

I imagine that the software developer might be a foreigner, perhaps living in Europe. In this case, the developer might have this image in his/her head:

Banknote_picker_euro

Euro banknotes are heavily differentiated - by color, by image, by height and by width. The numeric value also occupies a larger proportion of the area. This makes a lot of sense.

I like designs to be adaptable. Switching data from one country to another should not alter the design. Switching data at different time scales should not affect the design. This banknote picker UI is not adaptable across countries.

***

Once I figured out the note values, I learned another reason why I couldn't tell which row is which note. It's because one note is absent.

Banknote_us_2

Where is the $10 note? That and the twenty are probably the most frequently used. I am also surprised people want $1 notes from an ATM. But I assume the bank knows something I don't.


Dreamy Hawaii

I really enjoyed this visual story by ProPublica and Honolulu Star-Advertiser about the plight of beaches in Hawaii (link).

The story begins with a beautiful invitation:

Propublica_hawaiibeachesfrontimage

This design reminds me of Vimeo's old home page. (It no longer looks like this today but this screenshot came from when I was the data guy there.) In both cases, the images are not static but moving.

Vimeo-homepage

The tour de force of this visual story is an annotated walk along Lanikai Beach. Here is a snapshot at one of the stops:

Propublica_hawaiibeaches_1368MokuluaDr_small

This shows a particular homeowner who, according to documents, was permitted to rebuild a destroyed seawall even though officials were supposed to disallow reconstruction in order to protect beaches from eroding. The property is marked on the map above. The image inside the box is a gif showing waves smashing the seawall.

As the reader scrolls down, the image window runs through a carousel of gifs of houses along the beach. The images are synchronized to the reader's progress along the shore. The narrative makes stops at specific houses at which point a text box pops up to provide color commentary.

***

The erosion crisis is shown in this pair of maps.

Propublica_hawaiibeaches_oldnewshoreline-sm

There's some fancy work behind the scenes to patch together images, and estimate the boundaries of the beaches.

***

The following map is notable for its simplicity. There are no unnecessary details and labels. We don't need to know the name of every street or a specific restaurant. Removing excess details makes readers focus on the informative parts. 

Propublica_hawaiibeaches_simplemap-sm

Clicking on the dots brings up more details.

***

Enjoy the entire story here.


Election visuals 4: the snake pit is the best election graphic ever

This is the final post in the series on the data visualizations deployed by FiveThirtyEight to explain their election forecasting model. The previous posts are here, here and here.

I'm saving the best for last.

538_snakepit

This snake-pit chart brings me great joy - I wish I came up with it!

This chart wins by focusing on a limited set of questions, and doing so excellently. Like many election observers, we understand that the U.S. presidential election will turn on so-called "swing states," and the candidates' strength in these swing states is variable, as the name suggests. Thus, we like to know which states are in play, and within these states, which ones are most unpredictable.

This chart lines up all the states from the reddest of red up top to the bluest of blue at the bottom. Each state is ranked by the voting margin predicted by 538's election forecasting model. The swing states are found in the middle.

Since each state confers a fixed number of electoral votes, and a candidate must amass 270 to win, there is a "tipping" state. In the diagram above, it's Pennsylvania. This pivotal state is neatly foregrounded as the one crossing the line in the middle.
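The tipping-state logic can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not 538's code: the five states, their margins and their electoral votes are invented, and the function name `tipping_state` is hypothetical.

```python
def tipping_state(states, votes_to_win):
    """Return the state that pushes the running total to votes_to_win.

    states: list of (name, margin, electoral_votes), where margin is the
    forecast margin for one candidate (positive means that candidate leads).
    """
    running = 0
    # Walk from that candidate's safest state down to the least safe one.
    for name, margin, votes in sorted(states, key=lambda s: s[1], reverse=True):
        running += votes
        if running >= votes_to_win:
            return name
    return None  # the threshold cannot be reached with these states


# Toy map: five invented states, 30 electoral votes in total, 16 needed to win.
toy_map = [
    ("Alpha", 22.0, 8),
    ("Bravo", 9.5, 5),
    ("Charlie", 1.2, 6),
    ("Delta", -4.0, 4),
    ("Echo", -18.0, 7),
]

print(tipping_state(toy_map, votes_to_win=16))
```

In the toy map, Alpha and Bravo supply 13 votes, so Charlie is the state that crosses the line, playing the role Pennsylvania plays in the chart.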

The lengths of the segments correspond to the number of electoral votes and so do not change with the data. What changes is the sequencing of the segments, and the color shading.

This data visualization is a gem of visual story-telling. The form lends itself to a story.

***

The snake-pit chart succeeds by not doing too much. There are many items that the chart does not directly communicate.

The exact number of electoral votes by state is not explicit, nor is it easy to compare the lengths of bending segments. The color scale for conveying the predicted voting margins is crude, and it's not clear what the difference is between a deep color and a light color. It's also challenging to learn the electoral vote split; the actual winning margin is not even stated.

The reality is the average reader doesn't care. I got everything I wanted from the chart, and I ain't got the time to explore every state.

There is a hover-over effect that reveals some of the additional information:

538_snakepitchart_detail

One could go on. I have no idea how the 40,000 scenarios presented in the other graphics in this series have been reduced to the forecast shown in the inset. But again, those omissions did not lessen my enjoyment. The point is: let your graphics breathe.

***

I'm thinking of potential variations even though I'm fully satisfied with this effort.

I wonder if the color shading should be reversed. The light shading encodes a smaller voting margin, which indicates a tighter race. But our attention is typically drawn first to the darker shades. If the shading scheme is reversed, the color should be described as how tight the race is.

I also wonder if a third color (purple) should be introduced. Doing so would require the editors to make judgment calls on which set of states are swing states.

One strange thing about election day is the specific sequence of when TV stations (!) call the state results, which not only correlates with voting margin but also with time zones. I wonder if the time zone information can be worked into the sequencing of segments.

Let me know what you think of these ideas, or leave your own ideas, in the comments below.

***

I already praised this graphic when it first came out in 2016. (link)

A key improvement is tilting the chart, which avoids vertical state labels.

The previous post was written around election day 2016. The snake pit further cements its status as a story-telling device. As states are called, they are taken out of the picture. So it works very well as a dynamic chart on election day.

I'm nominating this snake-pit chart as the best election graphic ever. Kudos to the FiveThirtyEight team.


Election visuals 2: informative and playful

In yesterday's post, I reviewed one section of 538's visualization of its election forecasting model, focusing on the probability plot.

The visualization, technically called a pdf (probability density function), is a mainstay of statistical graphics. While every one of 40,000 scenarios shows up on this chart, it doesn't offer a direct answer to our topline question. What is Nate's call at this point in time? Elsewhere in their post, we learn that the 538 model currently gives Biden a 75% chance of winning, three times Trump's.

538_pdf_pair

In graphical terms, the area to the right of the 270-line is three times the size of the left area (on the bottom chart). That's not apparent in the pdf representation. Addressing this, statisticians may convert the pdf into a cdf, which depicts the cumulative area as we sweep from the left to the right along the horizontal axis.  

The cdf visualization rarely leaves the pages of a scientific journal because it's not easy for a novice to understand, not least because the relevant probability is 1 minus the cumulative probability. The cdf for the bottom chart will show 25% at the 270-line while the chance of Biden winning is 1 - 25% = 75%.

The cdf presentation is also wasteful for the election scenario. No one cares about any threshold other than the 270 votes needed to win, but the standard cdf shows every possible threshold.
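To show the "1 minus" step, here is a small Python sketch of an empirical cdf. The twenty electoral-vote totals are invented so that the split comes out to 25/75; they are not 538's scenarios, and the helper name `cdf` is my own.

```python
from bisect import bisect_right

# Twenty invented scenarios: one candidate's electoral-vote total in each.
scenarios = sorted([
    180, 215, 240, 255, 268,            # five losses (fewer than 270 votes)
    270, 279, 290, 305, 310, 318, 325,  # fifteen wins (270 or more)
    333, 340, 348, 352, 360, 375, 390, 413,
])


def cdf(threshold):
    """Share of scenarios with a total at or below the threshold."""
    return bisect_right(scenarios, threshold) / len(scenarios)


# Winning needs at least 270, so losing means finishing at 269 or below.
p_lose = cdf(269)
p_win = 1 - p_lose

print(f"cumulative probability just below the 270-line: {p_lose:.0%}")
print(f"chance of winning: {p_win:.0%}")
```

The function answers the question for every possible threshold, but only the one value at the 270-line is ever used, which is the wastefulness noted above.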

The second graphical concept in the 538 post (link) is an attempt to solve this problem.

538_dotplot

If you drop all the dots to an imaginary horizontal baseline, the above dotplot looks like this:

Redo_junkcharts_538electionforecast_dotplot_1

There is a recent trend toward centering dots to produce symmetry. It's actually harder to perceive the differences in heights of the band.

The secret sauce is to put down 100 dots, with a 75-25 blue-red split that conveys the 75% chance of a Biden win. Imposing the pdf line from the other visualization, I find that the density of dots roughly mimics the probability of outcomes.
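I don't know how 538 picks its 100 dots. One plausible approach, sketched below in Python purely as an assumption, is to sort the scenarios by electoral-vote total and keep every 400th one, so that the dots inherit the 75-25 split. The 40,000 totals here are generated artificially (10,000 losses, 30,000 wins) and are not 538's simulations.

```python
# Artificial stand-in for the 40,000 scenarios: 25% losses, 75% wins.
losing = [120 + (149 * i) // 9_999 for i in range(10_000)]    # totals 120..269
winning = [270 + (150 * i) // 29_999 for i in range(30_000)]  # totals 270..420
scenarios = sorted(losing + winning)

# Keep one scenario from the middle of each block of 400 consecutive scenarios.
step = len(scenarios) // 100
dots = scenarios[step // 2 :: step]

blue = sum(1 for total in dots if total >= 270)  # wins for the leading candidate
red = len(dots) - blue

print(len(dots), "dots:", blue, "blue,", red, "red")
```

Because each dot stands for an equal-sized slice of the sorted scenarios, the dots pile up wherever the scenarios are dense, which is why the dot density roughly follows the pdf line.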

Redo_junkcharts_538electionforecast_dotplot_2

It's easier to estimate the blue vs red areas using those dots than the lines.

The dots are stuffed toys. Clicking on each dot reveals a map showing one of the 40,000 scenarios. It displays which candidate wins which state. For example, the most extreme example of a Trump win is:

538_dotplot_redextreme

Here is a scenario of a razor-tight election won by Trump:

538_dotplot_redmiddle

This presentation has a weakness as well. It gives the impression that each of the dots is equally important because they are the same size. In reality, the importance of each dot is proportional to the height of the band. Since the band is generally taller near the middle, the dots near the middle are more likely scenarios than the dots shown on the two edges.

On balance, I like this visualization that is both informative and playful.

As before, what strikes me about the simulation result is the flatness of the probability surface. This feature is obscured when we summarize the result as 75% chance of a Biden victory.


Conceptualizing a chart using Trifecta: a practical example

In response to the reader who left a comment asking for ideas for improving the "marginal abatements chart" that was discussed here, I thought it might be helpful to lay out the process I go through when conceptualizing a chart. (Just a reminder, here is the chart we're dealing with.)

Ar_submit_Fig-3-2-The-policy-cost-curve-525

First, I'm very concerned about the long program names. I see their proper placement in a horizontal orientation as a hard constraint on the design. I'd reject every design that displays the text vertically, at an angle, or hides it behind some hover effect, or abbreviates or abridges the text.

Second, I strongly suggest re-thinking the "cost-effectiveness" metric on the vertical axis. Flipping the sign of this metric makes a return-on-investment-type metric, which is much more intuitive. Just to reiterate a prior point, it feels odd to be selecting more negative projects before more positive projects.

Third, I'd like to decide what metrics to place on the two axes. There are three main possibilities: a) benefits (that is, the average annual emissions abatement shown on the horizontal axis currently), b) costs, and c) some function that ties together costs and benefits (currently, this design uses cost per unit benefit, and calls it cost effectiveness, but there are a variety of similar metrics that can be defined).

For each of these metrics, there is a secondary choice. I can use the by-project value or the cumulative value. The cumulative value is dependent on a selection order, in this case, determined by the criterion of selecting from the most cost-effective program to the least (regardless of project size or any other criteria).
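A short Python sketch may help separate the by-project values from the cumulative ones. The three programs and their numbers are invented for illustration and do not come from the chart; cost effectiveness is taken to be cost per unit of abatement, as in the current design.

```python
from itertools import accumulate

# Invented programs: (name, annual cost, annual abatement), arbitrary units.
programs = [
    ("Program A", 40.0, 10.0),
    ("Program B", -30.0, 15.0),  # a negative cost means a net saving
    ("Program C", 90.0, 5.0),
]


def cost_per_unit(program):
    """By-project metric: cost per unit of abatement."""
    _, cost, abatement = program
    return cost / abatement


# Selection order: from the most cost-effective program to the least.
ranked = sorted(programs, key=cost_per_unit)

# Cumulative metrics only make sense once the selection order is fixed.
cumulative_abatement = list(accumulate(p[2] for p in ranked))
cumulative_cost = list(accumulate(p[1] for p in ranked))

for program, abatement, cost in zip(ranked, cumulative_abatement, cumulative_cost):
    print(program[0], cost_per_unit(program), abatement, cost)
```

Changing the selection criterion (say, to project size) reorders `ranked` and therefore changes every cumulative value, while the by-project values stay put.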

This is where I'd bring in the Trifecta Checkup framework (see here for a guide).

Trifectacheckup_junkcharts_image

The decision of which metrics to use on the axes means I'm operating in the "D" corner. But this decision must be made with respect to the "Q" corner, thus the green arrow between the two. Which two metrics are the most relevant depends on what we want the chart to accomplish. That in turn depends on the audience and what specific question we are addressing for them.

Fourth, if the purpose of the chart is exploratory - that is to say, we use it to guide decision-makers in choosing a subset of programs, then I would want to introduce an element of interactivity. Imagine an interface that allows the user to move programs in and out of the chart, while the chart updates itself to compute the total costs and total benefits.

This last point ties together the entire Trifecta Checkup framework (link). The Question being exploratory in nature suggests a certain way of organizing and analyzing the Data as well as a Visual form that facilitates interacting with the information.


SCMP's fantastic infographic on Hong Kong protests

In the past month, there have been several large-scale protests in Hong Kong. The largest one featured up to two million residents taking to the streets on June 16 to oppose an extradition act that was working its way through the legislature. If the count was accurate, about 25 percent of the city’s population joined in the protest. Another large demonstration occurred on July 1, the anniversary of Hong Kong’s return to Chinese rule.

South China Morning Post, which can be considered the New York Times of Hong Kong, is well known for its award-winning infographics, and they rose to the occasion with this effort.

This is one of the rare infographics that you’d not regret spending time reading. After reading it, you have learned a few new things about protesting in Hong Kong.

In particular, you’ll learn that the recent demonstrations are part of a larger pattern in which Hong Kong residents express their dissatisfaction with the city’s governing class, frequently accused of acting as puppets of the Chinese state. Under the “one country, two systems” arrangement, the city’s officials occupy an unenviable position of mediating the various contradictions of the two systems.

This bar chart shows the growth in the protest movement. The recent massive protests didn't come out of nowhere. 

Scmp_protestsovertime

This line chart offers a possible explanation for burgeoning protests. Residents perceived their freedoms eroding in the last decade.

Scmp_freedomsurvey

If you have seen videos of the protests, you’ll have noticed the peculiar protest costumes. Umbrellas are used to block pepper sprays, for example. The following lovely graphic shows how the costumes have evolved:

Scmp_protestcostume

The scale of these protests captures the imagination. The last part in the infographic places the number of protestors in context, by expressing it in terms of football pitches (as soccer fields are known outside the U.S.). This is a sort of universal measure due to the popularity of football almost everywhere. (Nevertheless, according to Wikipedia, the fields do not have one fixed dimension even though fields used for international matches are standardized to 105 m by 68 m.)

Scmp_protestscale_pitches

This chart could be presented as a bar chart. It’s just that the data have been re-scaled – from counting individuals to counting football pitches-ful of individuals. 

***

Here is the entire infographic.


Pay levels in the U.S.

The Wall Street Journal published a graphic showing the median pay levels at "most" public companies in the U.S. here.

Wsj_mediancompanypay

People who attended my dataviz seminar might recognize the similarity with the graphic showing internet download speeds by different broadband technologies. It's a clean, clear way of showing multiple comparisons on the same chart.

You can see the distribution of pay levels of companies within each industry grouping, and the vertical lines showing the sector medians allow comparison across sectors. The median pay levels are quite similar with the energy sector leaning higher, and consumer sector leaning lower.

The consumer sector is extremely heavy on the low side of the pay range. Companies like Universal, Abercrombie, Skechers, Mattel, Gap, etc. all pay at least half their employees less than $6,000. The data is sourced to MyLogIQ. I have no knowledge of how reliable or valid the data are. It's curious to me that Dunkin Brands showed a median of $110K while Starbucks showed $13K.

Wsj_medianpay_dunkinstarbucks

***

I like the interactive features.

The window control lets the user zoom in to different parts of the pay range. This is necessary because of the extremely high salaries. The control doubles as a presentation of the overall distribution of median salaries.

The text box can be used to add data labels to specific companies.

***

See previous discussion of WSJ Graphics.


Fantastic visual, but the Google data need some pre-processing

Another entry in the Google Newslab data visualization project that caught my eye is the "How to Fix It" project, illustrating search queries across the world that ask "how." The project web page is here.

The centerpiece of the project is an interactive graphic showing queries related to how to fix home appliances. Here is what it looks like in France. (It's always instructive to think about how they would count "France" queries. Is it queries from google.fr? queries written in French? queries from an IP address in France? A combination of the above?)

Howtofixit_france_appliances

I particularly appreciate the lack of labels. When we see the pictures, we don't need to be told this is a window and that is a door. The search data are encoded in the relative sizes of the appliances. The red dotted lines show the relative popularity of searches for the respective appliances in aggregate.

By comparison, the Russian picture looks very different:

Howtofixit_russia_appliances

Are the Russians more sensible? Their searches are far and away about the washing machine, which is the most complicated piece of equipment on the graphic.

At the bottom of the page, the project looks at other queries, such as those related to cooking. I find it fascinating to learn what people need help making:

Howtofixit_world_cooking

I have to confess that I searched for "how to make soft boiled eggs". That led me to a lot of different webpages, mostly created for people who search for how to make a soft boiled egg. All of them contain lots of advertising, and the answer boils down to cook it for 6 minutes.

***

The Russia versus France comparison brings out a perplexing problem with the "Data" in this visualization. For competitive reasons, Google does not provide data on search volume. The so-called Search Index is what is being depicted. The Search Index uses the top-ranked item as the reference point (100). In the Russian diagram, the washing machine has Search Index of 100 and everything else pales in comparison.

In the France example, the window is the search item with the greatest number of searches, so it has Search Index of 100; the door has Index 96, which means it has 96% of the search volume of the window; the washing machine with Index 49 has about half the searches of the window.

The numbers cannot be interpreted as proportions. The Index of 49 does not mean that washing machines account for 49% of all France queries about fixing home appliances. That is really the meaning of popularity we want to have but we don't have. We can obtain true popularity measures by "normalizing" the Search Index: just sum up the Index Values of all the appliances and divide the Search Index by the sum of the Indices. After normalizing, the numbers can be interpreted as proportions and they add up to 100% for each country. When not normalized, the indices do not add to 100%.

Take the case in which we have five appliances, and let's say all five appliances are equally popular, comprising 20% of searches each. The five Search Indices will all be 100 because the top-ranked item is given the value of 100. Those indices add to 500!

By contrast, in the case of Russia (or a more extreme case), the top-ranked query is almost 100% of all the searches, so the sum of the indices will be only slightly larger than 100.
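The normalization takes only a few lines of Python. This is a minimal sketch: the function name `normalize` and the two sets of index values are my own illustrations of the equal-popularity and lopsided cases described above, not figures from the Google project.

```python
def normalize(indices):
    """Convert Search Index values into shares that add up to 1."""
    total = sum(indices.values())
    return {item: value / total for item, value in indices.items()}


# Case 1: five equally popular appliances, so every index is 100.
equal = {f"appliance {n}": 100 for n in range(1, 6)}

# Case 2: one dominant appliance; the rest barely register (invented values).
lopsided = {"top appliance": 100, "second": 1, "third": 1, "fourth": 1, "fifth": 1}

for label, indices in [("equal", equal), ("lopsided", lopsided)]:
    shares = normalize(indices)
    top_share = max(shares.values())
    print(f"{label}: indices sum to {sum(indices.values())}, top share = {top_share:.1%}")
```

An index of 100 corresponds to a 20% share in the first case and roughly a 96% share in the second, which is why the raw indices should not be compared across countries.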

If you realize this, then you'd understand that it is risky to compare Search Indices across countries. The interpretation is clouded by how much of the total queries is accounted for by the top query.

In our Trifecta Checkup, this is a chart that does well in the Question and Visual corners, but there is a problem with the Data.