Approaching the Paris Olympics

If you're looking for dataviz about the upcoming Paris Olympics, I recommend this one by the great SCMP team.

Scmp_parisianolympics100years

The impact of this piece starts with picking an engaging topic: how have the disciplines changed over the last 100 years? It capitalizes on the fact that the Games are returning to Paris after a century.

Most of the infographics contain illustrations, with the interactive device of a slider that makes it easier to compare two graphics, one for each year. Without the slider, the graphics would have to be stacked top and bottom or placed side by side, both of which require a lot of eye movement.

Here are some bits that I particularly enjoyed:

Scmp_olympics_medaldesign

Not surprisingly, the 2024 medal is much larger and heavier than the 1924 one. The old one emphasizes sportsmanship while the new medal frontlines victory.

Scmp_olympics_polevault

Having only seen pole vaulting on modern equipment, I find it fascinating to imagine athletes using rigid wooden poles, and then having to land on their feet in the sawdust pit. Moving the slider to the left reveals the current setup, with fiberglass poles that bend, and landing mattresses. Cheekily, they also tell us where the cameras are placed. Quite a bit of the performance gain (from 3.95 to 6.22 m) can be attributed to equipment improvements.

These illustrations convince me that a lot of the performance gains over time can be attributed to better technologies, better equipment, and rule changes (that accommodate these modern innovations). For example, swimmers now start off a jumping block rather than from the side of the pool.

Scmp_olympics_roadrace

Yes, and they have some statistical graphics. This one about the cycling road race is really nice. It shows that the total distance of the 2024 race is about 1/3 longer than the 1924 race. It also shows that the new route features a lot more ups and downs than the original route. The highest point of the 1924 route is higher than the new route, though. This is a great example of the conciseness of visual language.

Scmp_olympics_womenfencing

I chuckled at this one. This was the gear worn by women fencers back at the 1924 Olympics.

***

There's a lot more at SCMP. Go take a look!


Losing the plot while stacking up the bars

I came across this chart from an infographic that claims to show which zip codes in the U.S. are the "dirtiest" (link). I won't go into the data analysis in this post - it's the usual "open data" style analysis that takes whatever data it can find (in this case, 311 calls) and makes some hay out of it.

03_Dirtiest-Zip-Codes-in-New-York

It's amazing how such analyses frequently land on the Top N, Bottom N table. Top/Bottom N is euphemistically called "insights". But "insights" should answer at least one of the following questions: Where are these zip codes? What's the reason why 11216 has the highest rate of complaints while 11040 has the lowest? What measures can be taken to make the city cleaner?

***

The basic form chosen for this graphic is the bar chart. The data concerns the number of complaints per 100,000 people (about sanitation - they didn't disclose how they classified a complaint as about sanitation).

To mitigate the "boredom" of bar charts, the designer made the edges of the bars squiggly, and added icons of items found in trash inside the bars. These are thankfully not too intrusive.

Why are all the data printed on the chart? Try mentally wiping the data labels, and you'll understand why the designer did it.

If readers look at data labels rather than the bars, then the data visualization surely has failed. I'd prefer to use an axis.

If you spend a few more minutes on the chart, you may notice the gray parts. This is not the simple bar chart but a stacked bar chart. In effect, every bar is referenced to the first bar, which shows the maximum number of complaints per 100K people. For example, zip code 10474 has about 90% of the complaints experienced in zip code 11216, the "dirtiest" place in New York.

***

The infographic then moves on to Los Angeles, and repeats the Top N/Bottom N presentation:

04_Dirtiest-Zip-Codes-in-Los-Angeles

With this, the plot is lost.

For an inexplicable reason, the dirtiest zip code in LA does not occupy the entire length of the bar. The worst zip code here fills out 87% of the bar length, implying that the entire bar represents the value of 34,978 complaints per 100K people. How did the designer decide on this number?

As a result, every other value is referenced to 34,978 and not to the rate of complaints in the dirtiest zip code!
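
To see why the choice of reference matters, here is a minimal Python sketch. The zip code names and rates are invented for illustration (the top rate is picked so that it fills about 87% of a bar worth 34,978, as in the LA chart), and `fill_fractions` is a hypothetical helper, not anything the designer is known to have used.

```python
def fill_fractions(rates, reference=None):
    """Express each rate as a fraction of a reference value.

    If no reference is given, use the largest rate, so that the
    top bar fills its full length.
    """
    if reference is None:
        reference = max(rates.values())
    return {name: rate / reference for name, rate in rates.items()}


# Invented complaint rates per 100K people (not the infographic's data).
rates = {"zip A": 30431, "zip B": 24345, "zip C": 15216}

# New York style: every bar is read against the dirtiest zip code.
by_max = fill_fractions(rates)

# Los Angeles style: every bar is read against an unexplained 34,978.
by_arbitrary = fill_fractions(rates, reference=34978)

for name in rates:
    print(f"{name}: {by_max[name]:.0%} of max, {by_arbitrary[name]:.0%} of 34,978")
```

With the first reference, zip B reads as about 80% as dirty as zip A. With the second, the same bar fills only about 70% of its length, and the comparison to the dirtiest zip code is lost.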

***

The infographic eventually covers Houston. Here are the dirtiest two zip codes in Houston:

Housefresh_houston_dirtiest2

How does one interpret the orange section of the second bar? The original intention is for us to see that this zip code is about 80% as dirty as the dirtiest zip code. However, the full length of the bar here does not represent the dirtiest zip code.

***

We also got a hint as to why this entire analysis is problematic. The values in LA are way bigger than those in NY, about 4 times higher at the top of the table. Is LA really that much dirtier than NY? Or perhaps the data have not been properly aligned between cities?

 

P.S. [8-26-2023] Added link to the infographic.

 


Parsons Student Projects

I had the pleasure of attending the final presentations of this year's graduates from Parsons's MS in Data Visualization program. You can see the projects here.

***

A few of the projects caught my eye.

A project called "Authentic Food in NYC" explores where to find "authentic" cuisine in New York restaurants. The project is notable for plowing through millions of Yelp reviews, and organizing the information within. Reviews mentioning "authentic" or "original" were extracted.

During the live presentation, the student clicked on Authentic Chinese, and the name that popped up was Nom Wah Tea Parlor, a dim sum restaurant in Chinatown that often has lines out the door.

Shuyaoxiao_authenticfood_parsons

Curiously, the ranking is created from raw counts of authentic reviews, which favors restaurants with more reviews, such as restaurants that have been operating for a longer time. It's unclear what rule is used to transfer authenticity from reviews to restaurants: does a single review mentioning "authentic" qualify a restaurant as "authentic", or some proportion of reviews?
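
A small Python sketch shows how the two ranking rules can disagree. The restaurant names and review counts below are invented, and ranking by share is only one possible alternative, not the rule the student used.

```python
# (name, total reviews, reviews mentioning "authentic"); all numbers invented.
restaurants = [
    ("Long-running favorite", 4000, 200),
    ("Mid-size spot", 900, 90),
    ("Newcomer", 150, 45),
]


def authentic_share(row):
    """Proportion of a restaurant's reviews that mention "authentic"."""
    _, total, authentic = row
    return authentic / total


# Rank by raw count of "authentic" reviews, then by share of reviews.
by_count = sorted(restaurants, key=lambda row: row[2], reverse=True)
by_share = sorted(restaurants, key=authentic_share, reverse=True)

print([name for name, _, _ in by_count])
print([name for name, _, _ in by_share])
```

In this made-up example the two orderings are exact reverses of each other. A share-based ranking has its own weakness, since a restaurant with only a handful of reviews can top the list, so some minimum review count would probably be needed.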

Later, we see a visualization of the key words found inside "authentic" reviews for each cuisine. Below are words for Chinese and Italian cuisines:

Shuyaoxiao_authenticcuisines_parsons_words

These are word clouds with a twist. Instead of encoding the word counts in the font sizes, she places each word inside a bubble, and uses bubble sizes to indicate relative frequency.

Curiously, almost all the words displayed come from menu items. There aren't any subjective words to be found. Algorithms that extract keywords frequently fail in the sense that they surface the most obvious, uninteresting facts. Take the word cloud for Taiwanese restaurants as an example:

Shuyaoxiao_authenticcuisines_parsons_taiwan

The overwhelming keyword found among reviews of Taiwanese restaurants is... "taiwanese". The next most important word is "taiwan". Among the remaining words, "886" is the name of a specific restaurant, "bento" is usually associated with Japanese cuisine, and everything else is a menu item.

Getting this right is time-consuming, and understandably not a requirement for a typical data visualization course.

The most interesting insight is found in this data table.

Shuyaoxiao_authenticcuisines_ratios

It appears that few reviewers care about authenticity when they go to French, Italian, and Japanese restaurants but the people who dine at various Asian restaurants, German restaurants, and Eastern European restaurants want "authentic" food. The student concludes: "since most Yelp reviewers are Americans, their pursuit of authenticity creates its own trap: Food authenticity becomes an americanized view of what non-American food is."

This hits home hard because I know what authentic dim sum is, and Nom Wah Tea Parlor it ain't. Let me check out what Yelpers are saying about Nom Wah:

  1. Everything was so authentic and delicious - and cheap!!!
  2. Your best bet is to go around the corner and find something more authentic.
  3. Their dumplings are amazing everything is very authentic and tasty!
  4. The food was delicious and so authentic, and the staff were helpful and efficient.
  5. Overall, this place has good authentic dim sum but it could be better.
  6. Not an authentic experience at all.
  7. this dim sum establishment is totally authentic
  8. The onions, bean sprouts and scallion did taste very authentic and appreciated that.
  9. I would skip this and try another spot less hyped and more authentic.
  10. I would have to take my parents here the next time I visit NYC because this is authentic dim sum.

These are the ten most recent reviews containing the word "authentic". Seven out of ten really do mean authentic; the other three are false friends. Text mining is a tough business! The student removed "not authentic", which helps. As seen above, "more authentic" may be negative, and there may be words between "not" and "authentic". Also, think "not inauthentic", "people say it's authentic, and it's not", etc.
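
As a rough illustration of why exact-phrase filtering falls short, here is a minimal Python sketch of a slightly wider heuristic. It is not the student's method: the `NEGATORS` list, the three-word window, and the rule that "more authentic" counts as negative are all assumptions made for this example.

```python
import re

# Assumed list of negating words; a real project would need a longer one.
NEGATORS = {"not", "no", "never", "isn't", "wasn't", "ain't"}


def looks_negative(review, window=3):
    """Crude check: is "authentic" negated or part of "more authentic"?"""
    tokens = re.findall(r"[a-z']+", review.lower())
    for i, token in enumerate(tokens):
        if token != "authentic":
            continue
        # Look at up to `window` words just before "authentic".
        before = tokens[max(0, i - window):i]
        if any(word in NEGATORS for word in before):
            return True
        if before and before[-1] == "more":
            return True
    return False


reviews = [
    "Everything was so authentic and delicious - and cheap!!!",
    "Your best bet is to go around the corner and find something more authentic.",
    "Their dumplings are amazing everything is very authentic and tasty!",
    "The food was delicious and so authentic, and the staff were helpful and efficient.",
    "Overall, this place has good authentic dim sum but it could be better.",
    "Not an authentic experience at all.",
    "this dim sum establishment is totally authentic",
    "The onions, bean sprouts and scallion did taste very authentic and appreciated that.",
    "I would skip this and try another spot less hyped and more authentic.",
    "I would have to take my parents here the next time I visit NYC because this is authentic dim sum.",
]

flagged = [n for n, text in enumerate(reviews, start=1) if looks_negative(text)]
print(flagged)
```

On these ten reviews the heuristic flags numbers 2, 6 and 9. It would still miss "people say it's authentic, and it's not", where the negation comes after the word, and it ignores "not inauthentic" entirely because the token is not "authentic".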

One thing I learned from this project is that "authentic" may be a synonym for "I like it" when these diners enjoy the food at an ethnic restaurant. I'm most curious about what inauthentic onions, bean sprouts and scallion taste like.

I love the concept and execution of this project. Nice job!

***

Another project I like is about tourism in Venezuela. The back story is significant. Since a dictatorship took over the country, the government stopped reporting tourism statistics. It's known that tourism collapsed, and that it may be gradually coming back in recent years.

This student does not have access to ready-made datasets. But she imaginatively found data to pursue this story. Specifically, she mentioned grabbing schedules of flights into the country from abroad.

The flow chart is a great way to explore this data:

Ibonnet_parsons_dataviz_flightcities

A map gives a different perspective:

Ibonnet_parsons_dataviz_flightmap

I'm glad to hear the student recite some of the limitations of the data. It's easy to look at these visuals and assume that the data are entirely reliable. They aren't. We don't know what proportion of the people traveling on those flights are tourists, how full those planes are, or the nationalities of those on board. The fact that a flight originated from Panama does not mean that everyone on board is Panamanian.

***

The third project is interesting in its uniqueness. This student wants to highlight the effect of lead in paint on children's health. She used the weight of lead marbles to symbolize the impact of lead paint. She made a dress with two big pockets to hold these marbles.

Scherer_parsons_dataviz_leaddress sm

It's not your standard visualization. One can quibble that dividing the marbles into two pockets doesn't serve a visualization purpose, and so on. But in the end, it's a memorable performance.


Showing both absolute and relative values on the same chart 1

Visual Capitalist has a helpful overview on the "uninsured" deposits problem that has become the talking point of the recent banking crisis. Here is a snippet of the chart that you can see in full at this link:

Visualcapitalist_uninsureddeposits_top

This is in infographics style. It's a bar chart that shows the top X banks. Even though the headline says "by uninsured deposits", the sort order is really based on the proportion of deposits that are uninsured, i.e. residing in accounts that exceed $250K.  They used a red color to highlight the two failed banks, both of which have at least 90% of deposits uninsured.

The right column provides further context: the total amounts of deposits, presented both as a list of numbers as well as a column of bubbles. As readers know, bubbles are not self-sufficient, and if the list of numbers were removed, the bubbles would lose most of their power of communication. Big, small, but how much smaller?

There are little nuggets of text in various corners that provide other information.

Overall, this is a pretty good one as far as infographics go.

***

I'd prefer to elevate information about the Too Big to Fail banks (which are hiding in plain sight). Addressing this surfaces the usual battle between relative and absolute values. While the smaller banks have some of the highest concentrations of uninsured deposits, each TBTF bank holds several times as many uninsured dollars as the smaller banks.

Here is a revised version:

Redo_visualcapitalist_uninsuredassets_1

The banks are still ordered in the same way by the proportions of uninsured value. The data being plotted are not the proportions but the actual deposit amounts. Thus, the three TBTF banks (Citibank, Chase and Bank of America) stand out from the crowd. Aside from Citibank, the other two have relatively moderate proportions of uninsured assets but the sizes of the red bars for any of these three dominate those of the smaller banks.

Notice that I added the gray segments, which portray the amount of deposits that are FDIC protected. I did this not just to show the relative sizes of the banks. Having the other part of the deposits allows readers to answer additional questions, such as which banks have the most insured deposits? They also visually present the relative proportions.

***

The most amazing part of this dataset is the amount of uninsured money. I'm trying to think who these account holders are. It would seem like a very small collection of people and/or businesses would be holding these accounts. If they are mostly businesses, is FDIC insurance designed to protect business deposits? If they are mostly personal accounts, then surely only very wealthy individuals hold most of these accounts.

In the above chart, I'm assuming that deposits and assets are referring to the same thing. This may not be the correct interpretation. Deposits may be only a portion of the assets. It would be strange though that the analysts only have the proportions but not the actual deposit amounts at these banks. Nevertheless, until proven otherwise, you should see my revision as a sketch - what you can do if you have both the total deposits and the proportions uninsured.
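
The arithmetic behind such a sketch is simple, and a few lines of Python make it concrete. The bank names and figures below are invented, and `split_deposits` is a hypothetical helper; the only inputs assumed are a total deposit amount and the proportion uninsured for each bank.

```python
# (bank, total deposits in $ billions, share uninsured); invented figures.
banks = [
    ("Small Bank A", 100.0, 0.90),
    ("Mid Bank B", 300.0, 0.60),
    ("Big Bank C", 1500.0, 0.50),
]


def split_deposits(total, share_uninsured):
    """Return (uninsured, insured) amounts from a total and an uninsured share."""
    uninsured = total * share_uninsured
    return uninsured, total - uninsured


# Keep the original sort order (by share uninsured) but report dollar amounts.
rows = []
for name, total, share in sorted(banks, key=lambda b: b[2], reverse=True):
    uninsured, insured = split_deposits(total, share)
    rows.append((name, uninsured, insured))
    print(f"{name}: uninsured {uninsured:.0f}, insured {insured:.0f}")
```

The bank with the highest share of uninsured deposits comes first in the list, yet the largest bank carries far more uninsured dollars. That is the relative versus absolute tension in miniature.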


Longest life, shortest length

Racetrack charts refuse to die. For old times' sake, here is a blog post from 2005 in which I explain why they don't make good dataviz.

Our latest example comes from Visual Capitalist (link), which publishes its fair share of nice dataviz. In this infographic, they feature a racetrack chart, just because the topic is the lifespan of cars.

Visualcapitalist_lifespan_cars_top

The whole infographic has four parts, each a racetrack chart. I'll focus on the first racetrack chart (shown above), which deals with the product category of sedans and hatchbacks.

The first thing I noticed is the reference value of 100,000 miles, which is described as the expected lifespan of a typical car made in the 1970s. This is of dubious value since the top of the page informs us the current relevant reference value is 200,000 miles, which is unlabeled. We surmise that 200,000 miles is indicated by the end of the grey sections of the racetrack. (This is eventually confirmed in the next racetrack chart for SUVs in the second section of the infographic.)

Now let's zoom in on the brown section of the track. Each of the four sections illustrates the same datum (100,000 miles) and yet they exhibit different lengths. From this, we learn that the data are not encoded in the lengths of these tracks -- but rather the data are to be found in the angle subtended at the centre of the concentric circles. The problem with racetrack charts is that readers are drawn to the lengths of the tracks rather than the angles at the center, which are not explicitly represented.

The Avalon model has the longest life span on this chart, and yet it is shown as the shortest curve.
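
The geometry can be checked with a short Python sketch. The radii and the angle below are invented, since the chart's actual dimensions are not given; the only fact relied on is that an arc's length equals its radius times the angle in radians.

```python
import math


def arc_length(radius, angle_degrees):
    """Length of a circular arc: radius times the angle in radians."""
    return radius * math.radians(angle_degrees)


# Hypothetical radii for four concentric tracks, innermost first.
radii = [1.0, 2.0, 3.0, 4.0]
# Hypothetical angle standing for the same datum (100,000 miles) on every track.
angle = 120

lengths = [arc_length(r, angle) for r in radii]
for r, length in zip(radii, lengths):
    print(f"radius {r}: arc length {length:.2f}")
```

The same angle produces arcs that grow in proportion to the radius, so a value drawn on an inner track looks shorter than the identical value on an outer track. This is how the longest-lived model can end up with the shortest curve if it sits on an inner track.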

***

The most baffling part of this chart is not the visual but the analysis methodology.

I quote:

iSeeCars analyzed over 2M used cars on the road between Jan. and Oct. 2022. Rankings are based on the mileage that the top 1% of cars within each model obtained.

According to this blurb, the 245,710 miles number for Avalon is the average mileage found in the top 1% of Avalons within the iSeeCars sample of 2M used cars.
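
Under that reading of the blurb, the computation might look like the following Python sketch. The odometer readings are invented, and `top_share_mean` is a hypothetical helper; iSeeCars does not disclose its exact procedure in the quoted text.

```python
def top_share_mean(mileages, share=0.01):
    """Average of the highest `share` of the values (at least one value)."""
    ranked = sorted(mileages, reverse=True)
    n_top = max(1, round(len(ranked) * share))
    top = ranked[:n_top]
    return sum(top) / len(top)


# Invented odometer readings for 200 used cars of one model, all still on the road.
odometers = list(range(1_000, 201_000, 1_000))

print(top_share_mean(odometers))
```

Every input here is the current mileage of a car that is still running. None of them is an end-of-life mileage, so the result describes the highest odometers in the sample rather than a lifespan.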

The word "lifespan" strikes me as incorporating a date of death, and yet nothing in the above text indicates that any of the sampled cars are at end of life. The cars they really need are not found in their sample at all.

I suppose taking the top 1% is meant to exclude younger cars but why 1%? Also, this sample completely misses the cars that prematurely died, e.g. the cars that failed after 100,000 miles but before 200,000 miles. This filtering also ensures that newer models are excluded from the sample.

_trifectacheckup_image

In the Trifecta Checkup, this qualifies as Type DV. The dataset does not answer the question of concern while the visual form distorts the data.


Modern design meets dataviz

This chart was submitted via Twitter (thanks John G.).

OptimisticEstimatingHomeValue

Perhaps the designer is inspired by this:

Royalontariomuseum

That's the Royal Ontario Museum, one of the beautiful landmarks in Toronto.

***

The chart addresses an interesting question - how much do home buyers over- or under-estimate home value? That said, gathering data to answer this question is challenging. I won't delve into this issue in this post.

Let's ask where readers are looking for data on the chart. It appears that we should use the right edge of each triangle. While the left edge of the red triangle might be useful, the left edges of the other triangles definitely would not contain data.

Note that, like modern architecture, the designer is playing with edges. None of the four right edges is properly vertical - none of the lines cuts the horizontal axis at a right angle. So the data actually reside in the imaginary vertical lines from the apexes to the horizontal baseline.

Where is the horizontal baseline? It's not where it is drawn either. The last number in the series is a negative number and so the real baseline is in the middle of the plot area, where the 0% value is.

The following chart shows (left side) the misleading signals sent to readers and (right side) the proper way to consume the data.

Redo_rockethomes_priceprojection

The degree of distortion is quite extreme. Only the fourth value is somewhat accurate, albeit by accident.

The design does not merely perturb the chart; it causes a severe adverse reaction.

 

P.S. [9/19/2022] Added submitter name.


A German obstacle course

Tagesschau_original

A Twitter user sent me this chart from Germany.

It came with a translation:

"Explanation: The chart says how many car drivers plan to purchase a new state-sponsored ticket for public transport. And of those who do, how many plan to use their car less often."

Because visual language should be universal, we shouldn't be deterred by not knowing German.

The structure of the data can be readily understood: we expect three values that add up to 100% from the pie chart. The largest category accounts for 58% of the data, followed by the blue category (40%). The last and smallest category therefore has 2% of the data.

The blue category is of the most interest, and the designer breaks that up into four sub-groups, three of which are roughly equally popular.

The puzzle is the identities of these categories.

The sub-categories are directly labeled so these are easy for German speakers. From a handy online translator, these labels mean "definitely", "probably", "rather not", "definitely not". Well, that's not too helpful when we don't know what the survey question is.

According to our correspondent, the question should be "of those who plan to buy the new ticket, how many plan to use their car less often?"

I suppose the question is found above the column chart under the car icon. The translator dutifully outputs "Thus rarer (i.e. less) car use". There is no visual cue to let readers know we are supposed to read the right hand side as a single column. In fact, I was reading horizontally, from top to bottom.

Now, the two icons on the left and the middle of the top row should map to not buying and buying the ticket. The check mark and cross convey that message. But... what do these icons map to on the chart below? We get no clue.

In fact, the will-buy ticket group is the 40% blue category while the will-not group is the 58% light gray category.

What about the dark gray thin sector? Well, one needs to read the fine print. The footnote says "I don't know/ no response".

Since this group is small and uninformative, it's fine to push it into the footnote. However, the choice of a dark color, and placing it at the 12-o'clock angle of the pie chart run counter to de-emphasizing this category!

Another Twitter user visually depicts the journey we take to understand this chart:

Tagesschau_reply

The structure of the data is revealed better with something like this:

Redo_tagesschau_newticket

The chart doesn't need this many colors but why not? It's summer.


Multicultural, multicolor, manufactured outrage

Twitter users were incensed by this chart:

Twitter_worstpiechart

It's being slammed as one of the most outrageous charts ever.

Mollywhite_twitter_outrageous

***

An image search reveals this chart form has international appeal.

In Kazakh:

Eurasianbank_piechart_kazakh

In Turkish:

Medirevogrupperformans_piechart_turkey

In Arabic, but the image source is a Spanish company:

Socialpubli_piechart_spain

In English, from an Indian source:

Panipatinstitute_piechart_india

In Russian:

Russian_piechart

***

Some people are calling this a pie chart.

But it isn't a pie chart since the slices clearly add up to more than one full circle.

It may be a graph template from an infographics website. You see people are applying data labels without changing the sizes or orientation or even colors of the slices. So the chart form is used as a container for data, rather than an encoder.

***

The Twitter user who called this "outrageous" appears to want to protect the designer, as the words have been deliberately snipped from the chart.

Mollywhite_twitter_outrageous_tweet

Nevertheless, Molly White coughed up the source in a subsequent tweet.

Mollywhite_twitter_outrageous_source

A bit strange, if you stop and think a little. Why would Molly shame the designer 20 hours after she decided not to?

 

 

According to Molly, the chart appeared on the website of an NFT company. [P.S. See note below]

Here's the top of the page that Molly White linked to:

Mollywhite_twitter_outrageous_web3isgoinggreat

Notice the author of this page. That's "Molly White",  who is the owner of this NFT company! [See note below: she's the owner of a satire website who was calling out the owner of this company.]

Who's more outrageous?

Someone creating the most outrageous chart in order to get clout from outraged Twitter users and drive traffic to her new NFT venture? Or someone creating the template for the outrageous chart form, spawning an international collection?

 

[P.S. 3/17/2022 The answer is provided by other Twitter users, and the commenters. The people spreading this chart form are more outrageous. I now realize that Molly runs a sarcastic site. When she linked to the "source", she linked to her own website, which I interpreted as the source of the image. The page did contain that image, which added to the confusion. I must also add her work looks valuable, as it assesses some of the wild claims in Web3 land.

Mollywhite_site
]

[P.S. 3/17/2022 Molly also pointed out that her second tweet about the source came around 45 minutes after the first tweet. Twitter showed "20 hours" because it was 20 hours from the time I read the tweet.]


How does the U.K. vote in the U.N.?

Through my Twitter feed, I found my way to this chart, made by jamie_bio.

Jamie_bio_un_votes25032021

This is produced using R code even though it looks like a slide.

The underlying dataset concerns votes at the United Nations on various topics. Someone has already classified these topics. Jamie looked at voting blocs, specifically, countries whose votes agree most often or least often with the U.K.

If you look at his Github, this is one in a series of works he produced to hone his dataviz skills. Ultimately, I think this effort can benefit from some re-thinking. However, I also appreciate the work he has put into this.

Let's start with the things I enjoyed.

Given the dataset, I imagine the first visual one might come up with is a heatmap that shows countries in rows and topics in columns. That would work ok, as any standard chart form would, but it would be a data dump that doesn't tell a story. There are almost 200 countries in the entire dataset. The countries can only be ordered in one way, so if it's ordered for All Votes, it's not ordered for any of the other columns.

What Jamie attempts here is story-telling. The design leads the reader through a narrative. We start by reading the how-to-read-this box on the top left. This tells us that he's using a lunar eclipse metaphor. A full circle in blue indicates 0% agreement while a full circle in white indicates 100% agreement. The five circles signal that he's binning the agreement percentages into five discrete buckets, which helps simplify our understanding of the data.
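
For readers who want to see the mechanics, here is a minimal Python sketch of an agreement percentage and a five-bucket binning. It is not Jamie's R code: the vote lists are invented, and the equal-width bucket boundaries are an assumption, since the chart does not state where its cut-offs fall.

```python
def agreement_rate(votes_a, votes_b):
    """Share of roll calls on which two countries cast the same vote."""
    pairs = list(zip(votes_a, votes_b))
    return sum(a == b for a, b in pairs) / len(pairs)


def bucket(rate, n_buckets=5):
    """Map a rate in [0, 1] to a bucket index from 0 to n_buckets - 1."""
    return min(int(rate * n_buckets), n_buckets - 1)


# Invented votes on four resolutions.
uk = ["yes", "no", "yes", "abstain"]
other = ["yes", "yes", "yes", "no"]

rate = agreement_rate(uk, other)
print(rate, bucket(rate))
```

Here the two countries agree on two of four votes, which lands in the middle bucket. Binning trades precision for legibility: countries at 41% and 59% agreement would be drawn with the same circle.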

Then, our eyes go to the circle of circles, labelled "All votes". This is roughly split in half, with the left side showing mostly blue and the right showing mostly white. That's because he's extracting the top 5 and bottom 5 countries, measured by their vote alignment with the U.K. The country names are clearly labelled.

Next, we see the votes broken up by topics. I'm assuming not all topics are covered but six key topics are highlighted on the right half of the page.

What I appreciate about this effort is the thought process behind how to deliver a message to the audience. Selecting a specific subset that addresses a specific question. Thinning the materials in a way that doesn't throw the kitchen sink at the reader. Concocting the circular layout that presents a pleasing way of consuming the data.

***

Now, let me talk about the things that need more work.

I'm not convinced that he got his message across. What is the visual telling us? Half of the circle is aligned with the U.K. while half isn't, so the U.K. sits on the fence on every issue? But this isn't the message. It's a bit of a mirage because the designer picked out the top 5 and bottom 5 countries. The top 5 are surely going to be voting almost 100% with the U.K. while the bottom 5 are surely going to be disagreeing with the U.K. a lot.

I did a quick sketch to understand the whole distribution:

Redo_junkcharts_ukvotes_overview_2

This is not intended as a show-and-tell graphic, just a useful way of exploring the dataset. You can see that Arms Race/Disarmament and Economic Development are "average" issues that have the same form as the "All issues" line. There are a small number of countries that are extremely aligned with the UK, and then about 50 countries that are aligned over 50% of the time, and then the other 150 countries that are aligned 30 to 50% of the time. On human rights, there is less alignment. On Palestine, there is more alignment.

What the above chart shows is that the top 5 and bottom 5 countries both represent thin slivers of this distribution, which is why in the circular diagrams, there is little differentiation. The two subgroups are very far apart but within each subgroup, there is almost no variation.

Another issue is the lunar eclipse metaphor. It's hard to wrap my head around a full white circle indicating 100% agreement while a full blue circle shows 0% agreement.

In the diagrams for individual topics, the two-letter acronyms for countries are used instead of the country names. A decoder needs to be provided, or just print the full names.


To explain or to eliminate, that is the question

Today, I take a look at another project from Ray Vella's class at NYU.

Rich Get Richer Assigment 2 top

(The above image is a honeypot for "smart" algorithms that don't know how to handle image dimensions which don't fit their shadow "requirement". Human beings should proceed to the full image below.)

As explained in this post, the students visualized data about regional average incomes in a selection of countries. It turns out that remarkable differences persist in regional income disparity between countries, almost all of which are more advanced economies.

Rich Get Richer Assigment 2 Danielle Curran_1

The graphic is by Danielle Curran.

I noticed two smart decisions.

First, she came up with a different main metric for gauging regional disparity, landing on a metric that is simple to grasp.

Based on hints given on the chart, I surmised that Danielle computed the change in per-capita income in the richest and poorest regions separately for each country between 2000 and 2015. These regional income growth values are expressed in currency, not indexed. Then, she computed the ratio of these growth rates, for each country. The end result is a simple metric for each country that describes how fast income has been growing in the richest region relative to the poorest region.
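
Taking that surmise at face value, the metric could be computed as in this Python sketch. The incomes are invented, and `growth_ratio` is a hypothetical helper that follows my reading of the chart rather than a documented method.

```python
def growth_ratio(rich_start, rich_end, poor_start, poor_end):
    """Ratio of income growth (in currency) in the richest vs. poorest region."""
    rich_growth = rich_end - rich_start
    poor_growth = poor_end - poor_start
    if poor_growth == 0:
        raise ValueError("ratio is undefined when the poorest region shows no change")
    return rich_growth / poor_growth


# Invented per-capita incomes (2000 and 2015) for one country.
ratio = growth_ratio(rich_start=40_000, rich_end=52_000,
                     poor_start=20_000, poor_end=24_000)
print(ratio)
```

In this made-up case, income in the richest region grew three times as much as in the poorest region. A ratio of 1 would mean both regions gained the same amount.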

One of the challenges of this dataset is the complex indexing scheme (discussed here). Carlos' solution keeps the indices but uses design to facilitate comparisons. Danielle avoids the indices altogether.

The reader is relieved of the need to make comparisons, and so can focus on differences in magnitude. We see clearly that regional disparity is by far the highest in the U.K.

***

The second smart decision Danielle made is organizing the countries into clusters. She took advantage of the horizontal axis which does not encode any data. The branching structure places different clusters of countries along the axis, making it simple to navigate. The locations of these clusters are cleverly aligned to the map below.

***

Danielle's effort is stronger on communications while Carlos' effort provides more information. The key is to understand who your readers are. What proportion of your readers would want to know the values for each country, each region and each year?

***

A couple of suggestions

a) The reference line should be set at 1, not 0, for a ratio scale. The value of 1 occurs when per-capita income grows by the same amount in the richest region and the poorest region.

b) The vertical scale should be fixed.