Through Twitter, Danny H. submitted the following chart that shows a tiny 0.3 percent of Youtube creators generate almost 40 percent of all viewing on the platform. He asks for ideas about how to present lop-sided data that follow the "80/20" rule.
In the classic 80/20 rule, 20 percent of the units account for 80 percent of the data. The percentages vary, so long as the first number is small relative to the second. In the Youtube example, 0.3 percent is compared to 40 percent. The underlying reason for such lop-sidedness is the differential importance of the units. The top units are much more important than the bottom units, as measured by their contribution to the data.
I sense a bit of "loss aversion" on this chart (explained here). The designer color-coded the views data into blue, brown and gray but didn't have it in him/her to throw out the sub-categories, which slow down cognition while adding little to our understanding.
I like the chart title that explains what it is about.
Turning to the D corner of the Trifecta Checkup for a moment, I suspect that this chart only counts videos that have at least one play. (Zero-play videos do not show up in a play log.) For a site like Youtube, a large proportion of uploaded videos have no views and thus, many creators also have no views.
My initial reaction on Twitter is to use a mirrored bar chart, like this:
I ended up spending quite a bit of time exploring other concepts. In particular, I like to find an integrated way to present this information. Most charts, such as the mirrored bar chart, a Bumps chart (slopegraph), and Lorenz chart, keep the two series of percentages separate.
Also, the biggest bar (the gray bar showing 97% of all creators) highlights the least important Youtubers, while the top creators ("super-creators") are cramped inside a sliver of a bar, which is invisible in the original chart.
What I came up with is a bar-density plot, where I use density to encode the importance of creators, and bar lengths to encode the distribution of views.
Each bar is divided into pieces, with the number of pieces proportional to the number of creators in each segment. This has the happy result that the super-creators are represented by large (red) pieces while the least important creators by little (gray) pieces.
The embedded tessellation shows the structure of the data: the bottom third of the views are generated by a huge number of creators, producing a few views each - resulting in a high density. The top 38% of the views correspond to a small number of super-creators - appropriately shown by a bar of low density.
For those interested in technicalities, I embed a Voronoi diagram inside each bar, with randomly placed points. (There will be a companion post later this week with some more details, and R code.)
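Until that companion post arrives, the piece-allocation logic behind the bar-density plot can be sketched in Python (the promised code is in R; this is only an illustrative analogue). The segment shares and the total piece count below are hypothetical, loosely echoing the chart's numbers:

```python
import random

def seed_points(bar_start, bar_length, n_pieces, rng):
    """Scatter n_pieces random seed points inside a bar spanning
    [bar_start, bar_start + bar_length] x [0, 1]; a Voronoi tessellation
    of these seeds would cut the bar into n_pieces cells."""
    return [(rng.uniform(bar_start, bar_start + bar_length), rng.random())
            for _ in range(n_pieces)]

# Hypothetical segments: (share of views, share of creators)
segments = [(0.38, 0.003), (0.32, 0.030), (0.30, 0.967)]
total_pieces = 200            # total tessellation cells in the whole bar
rng = random.Random(0)

start, summary = 0.0, []
for views, creators in segments:
    n = max(1, round(total_pieces * creators))   # pieces proportional to creators
    pts = seed_points(start, views, n, rng)
    summary.append((views, n, n / views))        # density = pieces per unit length
    start += views

for views, n, density in summary:
    print(f"views {views:.0%}: {n} cells, density {density:.0f}")
```

The super-creator segment gets one large cell while the long tail is shredded into nearly two hundred tiny ones, which is exactly the density contrast the plot encodes.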
Here is what the bar-density plot looks like when the distribution is essentially uniform:
The density inside each bar is roughly the same, indicating that the creators are roughly equally important.
1) The next post on the bar-density plot, with some experimental R code, will be available here.
2) Check out my first Linkedin "article" on this topic.
Alberto Cairo introduces another one of his collaborations with Google, visualizing Google search data. We previously looked at other projects here.
The latest project, designed by Schema, Axios, and Google News Initiative, tracks the trending of popular news stories over time and space, and it's a great example of making sense of a huge pile of data.
The design team produced a sequence of graphics to illustrate the data. The top news stories are grouped by category, such as Politics & Elections, Violence & War, and Environment & Science, each given a distinct color maintained throughout the project.
The first chart is an area chart that looks at individual stories, and tracks the volume over time.
To read this chart, you have to notice that the vertical axis measuring volume is a log scale, meaning that each tick mark up represents a 10-fold increase. Log scale is frequently used to draw far-away data closer to the middle, making it possible to see both ends of a wide distribution on the same chart. The log transformation introduces distortion deliberately. The smaller data look disproportionately large because of it.
The time scrolls automatically so that you feel a rise and fall of various news stories. It's a great way to experience the news cycle in the past year. The overlapping areas show competing news stories that shared the limelight at that point in time.
Just bear in mind that you have to mentally reverse the distortion introduced by the log scale.
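To see how much mental reversal is needed, consider a hypothetical pair of stories, one drawing a thousand times more searches than the other. A quick Python check shows how a log10 axis shrinks that gap (treating the axis baseline as a single search, a simplification):

```python
import math

big, small = 1_000_000, 1_000     # hypothetical search volumes

linear_ratio = small / big                          # height ratio on a linear axis
log_ratio = math.log10(small) / math.log10(big)     # height ratio on a log10 axis

print(f"linear axis: small bar is {linear_ratio:.1%} the height of big")
print(f"log axis:    small bar is {log_ratio:.1%} the height of big")
```

A story with one-thousandth the volume still reaches half the height of the larger one, which is why the smaller stories look disproportionately large.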
In the second part of the project, they tackle regional patterns. Now you see a map with proportional symbols. The top story in each locality is highlighted with the color of the topic. As time flows by, the sizes of the bubbles expand and contract.
Sometimes, the entire nation was consumed by the same story, e.g. certain obituaries. At other times, people in different regions focused on different topics.
In the last part of the project, they describe general shapes of the popularity curves. Most stories have one peak, although certain stories, like the U.S. government shutdown, have multiple peaks. There is also variation in terms of how fast a story rises to the peak and how quickly it fades away.
The most interesting aspect of the project can be learned from the footnote. The data are not direct hits to the Google News stories but searches on Google. For each story, one (or more) unique search terms are matched, and only those stories are counted. A "control" is established, which is an excellent idea. The control gives meaning to those counts. The control used here is the number of searches for the generic term "Google News." Presumably this is a relatively stable number that is a proxy for general search activity. Thus, the "volume" metric is really a relative measure against this control.
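The control-based metric amounts to a simple ratio. A minimal sketch (the counts here are made up; only the "Google News" control comes from the project's footnote):

```python
def relative_volume(story_searches, control_searches):
    """Searches matching a story's terms, expressed relative to a stable
    control query ("Google News" in this project) over the same period."""
    return story_searches / control_searches

# Hypothetical daily search counts
story, control = 84_000, 120_000
print(f"relative volume: {relative_volume(story, control):.2f}")   # 0.70
```

Because both counts are subject to the same overall fluctuations in search activity, dividing by the control strips out much of that background noise.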
A reader sent me the following chart. In addition to the graphical glitch, I was asked about the study's methodology.
I was able to trace the study back to this page. The study itself uses a line chart, not this bar chart with its axis not starting at zero. The line shows that web pages ranked higher on Google's first page tend to have more words, i.e. longer content may help with Google ranking.
On the bar chart, Position 1 is more than 6 times as big as Position 10, if one compares the bar areas. But it's really only 20% larger in the data.
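The distortion comes from starting the axis above zero. A small sketch with hypothetical word counts shows how a truncated axis turns a 20% difference into a six-fold one:

```python
def apparent_ratio(a, b, axis_start):
    """Ratio of bar lengths when the axis starts at axis_start rather than 0.
    (With constant bar width, the area ratio equals this length ratio.)"""
    return (a - axis_start) / (b - axis_start)

pos1, pos10 = 1_200, 1_000    # hypothetical average word counts

print(apparent_ratio(pos1, pos10, 0))      # honest axis: 1.2
print(apparent_ratio(pos1, pos10, 960))    # axis starting at 960: 6.0
```

The closer the axis start creeps toward the smaller value, the larger the exaggeration, without bound.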
In this case, even the line chart is misleading. If we extend the Google Position to 20, the line would quickly dip below the horizontal axis if the same trend applies.
The line chart has too many gridlines, one of Tufte's favorite complaints. The Google position is an integer, and yet the chart's gridlines imply that a rank of 0.5 is possible.
Any chart of this data should supply information about the variance around these average word counts. I'd like to see a side-by-side box plot, for example.
Another piece of context is the word counts for results on the second or third pages of Google results. Where are the short pages?
Turning to methodology, we learn that the research team analyzed 1 million pages of Google search results, and they also "removed outliers from our data (pages that contained fewer than 51 words and more than 9999 words)."
When you read a line like this, you have to ask some questions:
How do they define "outlier"? Why do they choose 51 and 9,999 as the cut-offs?
What proportion of the data was removed at either end of the distribution?
If these proportions are small, then the outliers are not going to affect that average word count by much, and thus there is no point to their removal. If they are large, we'd like to see what impact removing them might have.
In any case, the median is a better number to use here, or just show us the distribution, not just the average number.
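The point about trimming and averages is easy to demonstrate. With a made-up, right-skewed sample of word counts, the study's cut-offs remove a large share of pages and move the mean substantially, while the median barely budges:

```python
import statistics

# Hypothetical, right-skewed word counts for pages at one Google position
counts = [30, 45, 120, 400, 800, 1_100, 1_500, 2_300, 9_500, 48_000]

trimmed = [c for c in counts if 51 <= c <= 9_999]    # the study's cut-offs
removed = 1 - len(trimmed) / len(counts)

print(f"removed: {removed:.0%}")                     # 30%
print(f"mean:   {statistics.mean(counts):.0f} -> {statistics.mean(trimmed):.0f}")
print(f"median: {statistics.median(counts):.0f} -> {statistics.median(trimmed):.0f}")
```

Of course, whether the real data behave this way depends on the proportions actually removed, which is precisely the information the study does not report.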
It could well be true that Google's algorithm favors longer content, but we need to see more of the data to judge.
Vox featured the following chart when discussing the rise of resistance to President Trump within the GOP.
The chart is composed of mirrored bar charts. On the left side, with thicker pink bars that draw more attention, the design depicts the share of a particular GOP demographic segment that said they'd likely vote for a Trump challenger, according to a Morning Consult poll.
This is the primary metric of interest, and the entire chart is ordered by descending values from African Americans who are most likely (67%) to turn to a challenger to those who strongly support Trump and are the least likely (17%) to turn to someone else.
The right side shows the importance of each demographic, measured by the share of GOP. The relationship between importance and likelihood to defect from Trump is by and large negative but that fact takes a bit of effort to extract from this mirrored bar chart arrangement.
The subgroups are not complete. For example, the only ethnicity featured is African Americans. Age groups are somewhat more complete with under 18 being the only missing category.
The design makes it easy to pick off the most disaffected demographic segments (and the least, from the bottom) but these are disparate segments, possibly overlapping.
One challenge of this data is differentiating the two series of proportions. This design distinguishes them with visual cues: the height and width of the bars, colors, stacked versus unstacked, and data labels. Visual variety comes to the rescue.
Also note that the designer compensated for the lack of stacking on the left chart by printing data labels.
When reading this chart, I'm well aware that segments like urban residents, those earning more than $100K, and the at-least-college-educated overlap, which makes the data hard to interpret as presented.
I wanted to place the different demographics into their natural groups, such as age, income, urbanicity, etc. Such a structure also surfaces demographic patterns, e.g. men are slightly more disaffected than women (not significant), people earning $100K+ are more unhappy than those earning $50K-.
Further, I'd like to make it easier to understand the importance factor - the share of GOP. Because the original form orders the demographics according to the left side, the proportions on the right side are jumbled.
Here is a draft of what I have in mind:
The widths of the line segments show the importance of each demographic segment. The longest line segments are toward the bottom of the chart (< 40% likely to vote for Trump challenger).
Reader LG found the following chart, tweeted by @EU_Justice.
This chart is a part of a larger infographic, which is found here.
Here are a few issues with this effort:
The time axis is quite embarrassing. The first six months or so are squeezed into less than half the axis, while the distance between Nov and Dec is not the same as that between Dec and Jan. So the slope of each line segment is whatever the designer wants it to be!
The straight edges of the area chart imply that there were only three data points, with straight lines drawn between each measurement. Sadly, the month labels are not aligned to the data on the line.
Lastly, the dots between May and November, intended to facilitate reading this chart, backfire. There are six dots dividing the May-Nov segment when there should only be five.
I'm delivering a quick-fire Webinar this Wednesday on how to make impactful data graphics for communication and persuasion. Registration is free, at this link.
In the meantime, I'm preparing a guest lecture for the Data Visualization class at Yeshiva University Sims School of Management. The goal of the lecture is to emphasize the importance of incorporating analytics into the data visualization process.
Here is the lesson plan:
Introduce the Trifecta Checkup (link), the general framework for effective data visualizations
Provide examples of Type D data visualizations, i.e. graphics that have good production values but fail due to issues with the data or the analysis
Hands-on demo of an end-to-end data visualization process
Lessons from the demo including the iterative nature of analytics and visualization; and sketching
Overview of basic statistics concepts useful to visual designers
The following map accompanied an article in the Economist about China's drive to create a "digital silkroad," roughly defined as building its own Silicon Valley.
The two variables plotted are the wealth of each province (measured by GDP per capita) and the level of Internet penetration. The designer made the following choices:
GDP per capita is presented with less precision than Internet penetration. The former is grouped into five large categories while the latter is given as a percentage to one decimal place.
The visual design favors GDP per capita, which is encoded as the shade of color of each province. The Internet penetration data appear to have been added as an afterthought.
If we apply the self-sufficiency test (i.e. by removing the printed data from the chart), it's immediately clear that the visual elements convey zero information about Internet penetration. This is a serious problem for a chart about the "digital silkroad"!
If those two variables are chosen, it would seem appropriate to convey to readers the correlation between the two variables. The following sketch is focused on surfacing the correlation.
(Click on the image to see it in full.) Here is the top of the graphic:
The individual maps are not strictly necessary. Just placing provincial names onto the grid is enough, because the regional pattern isn't salient here.
The Internet penetration data were also grouped into five categories, putting them on an equal footing with GDP per capita.
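Binning a continuous measure into five ordered categories is straightforward. A minimal sketch, with hypothetical province names and penetration rates (the Economist's actual figures are not reproduced here):

```python
# Hypothetical provinces and Internet penetration rates (percent)
penetration = {"A": 22.1, "B": 35.4, "C": 41.0, "D": 48.7, "E": 53.2,
               "F": 58.9, "G": 63.5, "H": 70.2, "I": 74.8, "J": 81.3}

values = sorted(penetration.values())
n = len(values)
# Quintile cut points (20th, 40th, 60th, 80th percentiles, by position)
cuts = [values[n * k // 5] for k in range(1, 5)]

def category(v, cuts):
    """Map value v to an ordered category 1 (lowest) .. 5 (highest)."""
    return sum(v >= c for c in cuts) + 1

for prov, v in sorted(penetration.items()):
    print(prov, category(v, cuts))
```

Equal-count quintiles are one reasonable choice; round-number cut-offs, as typically used for GDP per capita bands, are another.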
The Newslab project takes aggregate data from Google's various services and finds imaginative ways to enliven the data. The Beautiful in English project makes a strong case for adding playfulness to your data visualization.
The data came from Google Translate. The authors look at 10 languages, and the top 10 words users ask to translate from those languages into English.
The first chart focuses on the most popular word for each language. The crawling snake presents the "worldwide" top words.
The crawling motion and the curvature are not required by the data, but they insert a dimension of playfulness that engages the reader's attention.
The alternative of presenting a data table loses this virtue without gaining much in return.
Readers are asked to click on the top word in each country to reveal further statistics on the word.
For example, the word "good" leads to the following:
The second chart presents the top 10 words by language in a lollipop style:
The above diagram shows the top 10 Japanese words translated into English. This design sacrifices concision in order to achieve playfulness.
The standard format is a data table with one column for each country, and 10 words listed below each country header in order of decreasing frequency.
The creative lollipop display generates more extreme emotions - positive or negative, depending on the reader. The data table is the safer choice, precisely because it does not engage the reader as deeply.