Announcement: Advancing your data skills, Fall 2019

Interrupting the flow of dataviz with the following announcement.

If you're looking to shore up your data skills or modernize your skill set, or if you know someone looking for hands-on, high-touch instruction in Machine Learning, R, Cloud Computing, Data Quality, Digital Analytics, A/B Testing and Financial Analysis, Principal Analytics Prep is offering evening classes this Fall. Click here to learn about our courses.

Our instructors are industry veterans with 10+ years of practical experience, and class size is capped at 10, ensuring a high-touch learning environment.

[Image: Facebook_pap_parttimeimmersive_tree]

Choosing between individuals and aggregates

Friend/reader Thomas B. alerted me to this paper that describes some of the key chart forms used by cancer researchers.

It strikes me that many of the "new" charts plot granular data at the individual level. This heatmap of gene expressions shows one column per patient:

[Image: Jnci_genemap - heatmap of gene expressions, one column per patient]

This so-called swimmer plot shows one bar per patient:

[Image: Jnci_swimlanes - swimmer plot, one bar per patient]

This spider plot shows the progression of individual patients over time. Key events are marked with symbols.

[Image: Jnci_spaghetti - spider plot of individual patients over time]

These chart forms are distinguished from other ones that plot aggregated statistics: statistical averages, medians, subgroup averages, and so on.

One obvious limitation of such charts is their lack of scalability. The number of patients, the variability of the metric, and the timing of trends all add to the messiness.

I am left wondering what Question is being addressed by these plots. If we are concerned about treatment of an individual patient, then showing each line by itself would be clearer. If we are interested in the average trends of patients, then a chart that plots the overall average, or subgroup averages would be more accurate. If the interpretation of the individual's trend requires comparing with similar patients, then showing that individual's line against the subgroup average would be preferred.

When shown these charts of individual lines, readers are tempted to play the statistician - without using appropriate tools! Readers draw aggregate conclusions, performing the aggregation in their heads.
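
To make the concern concrete, here is a minimal sketch in R (on simulated data, since the paper's data are not reproduced here) that draws the individual lines but anchors them with subgroup averages, per the last option above:

```r
library(ggplot2)

# Simulated tumor-metric trajectories for 20 hypothetical patients in two subgroups
set.seed(1)
df <- expand.grid(patient = 1:20, month = 0:12)
df$group <- ifelse(df$patient <= 10, "Treatment A", "Treatment B")
df$size  <- 100 + ifelse(df$group == "Treatment A", -3, -1) * df$month +
  rnorm(nrow(df), sd = 8)

# Individual lines (the spider-plot view) with the subgroup average overlaid,
# so one patient's trend is read against similar patients, not eyeballed alone
ggplot(df, aes(month, size, group = patient)) +
  geom_line(alpha = 0.3) +
  stat_summary(aes(group = group), fun = mean, geom = "line", linewidth = 1.2) +
  facet_wrap(~ group)
```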

The authors of the paper note: "Spider plots only provide good visual qualitative assessment but do not allow for formal statistical inference." I agree with the second part. The first part is a fallacy - if the visual qualitative assessment is good enough, then no formal inference is necessary! The same argument is often made when people say they don't need advanced analysis because their simple analysis is "directionally accurate". When is something "directionally inaccurate"? How would one know?

Reference: Chia, Gedye, et al., "Current and Evolving Methods to Visualize Biological Data in Cancer Research," JNCI, 2016, 108(8). (link)

***

Meteorologists, whom I featured in the previous post, also have their own spider-like chart for hurricanes. They call it a spaghetti map:

[Image: Dorian_spaghetti - spaghetti map of Hurricane Dorian forecasts]

Compare this to the "cone of uncertainty" map that was featured in the prior post:

[Image: AL052019_5day_cone_with_line_and_wind - cone of uncertainty map for Dorian]

These two charts build upon the same dataset. The cone map, as we discussed, shows the range of probable paths of the storm center, based on all simulations of all acceptable models for projection. The spaghetti map shows selected individual simulations. Each line is the most likely trajectory of the storm center as predicted by a single simulation from a single model.

The problem is that each predictive model type has its own historical accuracy (known as "skill"), and so the lines carry different levels of importance. Further, it's not immediately clear whether all possible lines are drawn, so any reader drawing conclusions about, say, the envelope containing x percent of these lines is likely to be fooled. Eyeballing the "cone" that contains x percent of the lines is not trivial either. We tend to drift naturally toward aggregate statistical conclusions without the benefit of appropriate tools.
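
That aggregation is a real computation, which a spaghetti map asks readers to perform by eye. A minimal sketch in R, on invented tracks, and with the caveat flagged above that it weights every line equally, ignoring model skill:

```r
set.seed(2)
# 40 hypothetical simulated tracks: storm-center latitude over 10 forecast steps
steps  <- 0:10
tracks <- t(replicate(40, 25 + cumsum(rnorm(length(steps), mean = 0.5, sd = 0.6))))

# Pointwise band containing the central two-thirds of tracks at each step:
# the kind of "cone" a reader tries to eyeball off a spaghetti map
lower <- apply(tracks, 2, quantile, probs = 1/6)
upper <- apply(tracks, 2, quantile, probs = 5/6)

matplot(steps, t(tracks), type = "l", lty = 1, col = "grey70",
        xlab = "forecast step", ylab = "latitude")
lines(steps, lower, lwd = 2)
lines(steps, upper, lwd = 2)
```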

Plots of individuals should be used to address the specific problem of assessing individuals.


As Dorian confounds meteorologists, we keep our minds clear on hurricane graphics, and discover correlation as our friend

As Hurricane Dorian threatens the southeastern coast of the U.S., forecasters are fretting about the lack of consensus among the various models used to predict the storm’s trajectory. The uncertainty of these models, as reflected in graphical displays, has been a controversial issue in the visualization community for some time.

Let’s start by reviewing a visual design that has captivated meteorologists in recent years, something known as the cone map.

[Image: Charley_oldconemap - cone map for Hurricane Charley, legend text removed]

If asked to explain this map, most of us would read the line through the middle of the cone as the path of the center of the storm, the “cone” as the areas near the storm center that will be affected, and the warmer colors (red, orange) as indicating higher levels of impact. [Note: This is the design of this type of map circa the 2000s; the re-design is discussed further down.]

The above interpretation is coherent, and feasible. Nevertheless, the data used to make the map are forward-looking, not historical. It is still possible to stick to the same interpretation by substituting historical measurements of impact with projections. As such, the “warmer” regions are projected to suffer worse damage from the storm than the “cooler” regions (yellow).

After I replace the text that was removed from the map (see below), you may notice the color legend, which discloses that the colors on the map encode probabilities, not storm intensity. The text further explains that the chart shows the most probable path of the center of the storm – while the coloring shows the probability that the storm center will reach specific areas.

[Image: Charley_oldconemap - the same cone map with the legend text restored]

***

When reading a data graphic, we rarely first look for text about how to read the chart. In the case of the cone map, those who didn’t seek out the instructions may form one of these misunderstandings:

  1. For someone living in the yellow-shaded areas, the map does not say that the impact of the storm is projected to be lighter; it’s that the center of the storm has a lower chance of passing right through. If, however, the storm does pay a visit, the intensity of the winds will reach hurricane grade.
  2. For someone living outside the cone, the map does not say that the storm will definitely bypass you; it’s that the chance of a direct hit is below the threshold needed to show up on the cone map. The threshold is set to attain 66% accuracy: the actual paths of storms are expected to stay inside the cone two out of three times.

Adding to the confusion, other designers have produced cone maps in which color encodes projected wind speeds. Here is the one for Dorian.

[Image: AL052019_wind_probs_64_F120 - wind-speed probability map for Dorian]

This map displays essentially what we thought the first cone map was showing.

One way to differentiate the two maps is to roll time forward, and imagine what the maps should look like after the storm has passed through. In the wind-speed map (shown below right), we will see a cone of damage, with warmer colors indicating regions that experienced stronger winds.

[Image: Projectedactualwinds_irma - projected vs. actual winds, Hurricane Irma]

In the storm-center map (below right), we should see a single curve, showing the exact trajectory of the center of the storm. In other words, the cone of uncertainty dissipates over time, just like the storm itself.

[Image: Projectedactualstormcenter_irma - projected vs. actual storm-center path, Hurricane Irma]

After scientists learned that readers were misinterpreting the cone maps, they started to issue warnings, and also re-designed the cone map. The cone map now comes with a black-box health warning right up top. Also, in the storm-center cone map, color is no longer used. The National Hurricane Center even made a YouTube video pointing out the dos and don’ts of reading the cone map.

[Image: AL052019_5day_cone_with_line_and_wind - re-designed cone map for Dorian]

***

The conclusion drawn from misreading the cone map isn’t as devastating as it’s made out to be. This is because the two issues are correlated. Since wind speeds are likely to be stronger nearer to the center of the storm, if one lives in a region that has a low chance of being a direct hit, then that region is also likely to experience lower average wind speeds than those nearer to the projected center of the storm’s path.

Alberto Cairo has written often about these maps, and in his upcoming book, How Charts Lie, there is a nice section addressing his work with colleagues at the University of Miami on improving public understanding of these hurricane graphics. I highly recommend Cairo’s book (previewed further down this page).

P.S. [9/5/2019] Alberto also put out a post about the hurricane cone map.


Book review: Visualizing Baseball

I requested a copy of Jim Albert’s Visualizing Baseball book, which is part of the ASA-CRC series on Statistical Reasoning in Science and Society that has the explicit goal of reaching a mass audience.

[Image: Visualizingbaseball_cover]

The best feature of Albert’s new volume is its brevity. For someone with a decent background in statistics (and a grasp of basic baseball jargon), it’s a book that can be consumed within one week, after which one comes away with a good overview of baseball analytics, otherwise known as sabermetrics.

Within fewer than 200 pages, Albert outlines approaches to a variety of problems, including:

  • Comparing baseball players by key hitting (or pitching) metrics
  • Tracking a player’s career
  • Estimating the value of different plays, such as a single, a triple or a walk
  • Predicting expected runs in an inning from the current state of play
  • Analyzing pitches and swings using PitchFX data
  • Describing the effect of ballparks on home runs
  • Estimating the effect of particular plays on the outcome of a game
  • Simulating “fake” games and seasons in order to produce probabilistic forecasts such as X% chance that team Y will win the World Series
  • Examining whether a hitter is “streaky” or not

Most of the analyses are descriptive in nature, e.g. describing the number and types of pitches thrown by a particular pitcher, or the change in on-base percentage over the career of a particular hitter. Fewer pages are devoted to predictive analytics. This structure is acceptable in a short introductory book. In practice, decision-makers require more sophisticated work on top of these descriptive analyses. For example, what’s the value of telling a coach that the home run was the pivotal moment in a 1-0 game that has already been played out?

To appreciate the practical implications of the analyses included in this volume, I’d recommend reading Moneyball by Michael Lewis, or the more recent Astroball by Ben Reiter.

For the more serious student of sabermetrics, key omitted details will need to be gleaned from other sources, including other books by the same author – for years, I have recommended Curve Ball by Albert and Bennett to my students.

***

In the final chapters, Albert introduces the simulation of “fake” seasons that underlies predictions. An inquiring reader should investigate how the process is tied back to the reality of what actually happened; otherwise, the simulation takes on a life of its own. Further, if one simulates 1,000 seasons of 2018 baseball, a large number of these fake seasons would crown some team other than the Red Sox as the 2018 World Series winner. Think about it: that’s how it is possible to predict that the Red Sox have, say, a 60 percent chance of winning the World Series in 2018! A key to understanding the statistical way of thinking is to accept the logic of this fake simulated world. It is not Albert’s stated goal to convince readers of the statistical way of thinking – but you’re not going to be convinced unless you think about why we do it this way.
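
The logic is easy to demonstrate with a toy example. This is not Albert's simulator, and the single-game win probability below is made up; it only shows how a "60 percent chance" is read off a pile of fake worlds:

```r
set.seed(3)
# A best-of-seven series goes to the first team with 4 wins; playing out all
# 7 games and counting wins identifies the same winner
sim_series <- function(p_game) sum(runif(7) < p_game) >= 4

# 1,000 fake World Series at a hypothetical 55% single-game win probability
mean(replicate(1000, sim_series(0.55)))
# ~0.61: the share of fake worlds in which the focal team is crowned champion
```

A full simulator replaces the fixed 0.55 with modeled team strengths and plays out entire seasons and brackets, but the forecast is read the same way: as the share of simulated worlds in which the event happens.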

***

While there are plenty of charts included in the book, a more appropriate title for “Visualizing Baseball” would have been “Fast Intro to Baseball Analytics”. With several exceptions, the charts are not essential to understanding the analyses. The dominant form of exposition is to first describe the analytical conclusion, and then introduce a chart to illustrate it. The inverse would be to start with the chart, and use it to explain the analysis.

The visualizations are generally of good quality, emphasizing clarity over prettiness. The choice of sticking to one software, ggplot2 in R, without post-production, constrains the visual designer to the preferences of the software designer. Such limitations are evident in chart elements like legends and titles. Here is one example (Chapter 5, Figure 5.8):

[Image: Albert_visualizingbaseball_chart - Figure 5.8 from Visualizing Baseball]

By default, the software prints the names of data columns in the titles. Imagine if the plot titles were Changeup, Fastball and Slider instead of CU, FF and SL, or if the axis labels were “horizontal location” and “vertical location” instead of px and pz. [Note: The chart above was taken from the book’s github site; in Figure 5.8 of the printed book, the chart titles were edited as suggested.]

The chart analyzes the location relative to the strike zone of pitches that were missed versus pitches that were hit (not missed). By default, the software takes the name of the binary variable (“Miss”) as the legend title, and lists the values of the variable (“True” and “False”) as the labels of the two colors. Imagine if True appeared as “Miss” and False as “Hit”.

Finally, the chart exhibits over-plotting, making it tough to know how many blue or gray dots are present. Smaller dot size might help, or else some form of aggregation.
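
For readers curious what such edits look like, here is a hedged ggplot2 sketch on made-up data that mimics the column names mentioned above (px, pz, Miss); the book's actual code may differ:

```r
library(ggplot2)
set.seed(4)

# Hypothetical stand-in for the pitch data: location (px, pz), pitch type, and
# a logical Miss indicating whether the batter missed the pitch
swings <- data.frame(
  px = rnorm(600, 0, 0.8),
  pz = rnorm(600, 2.5, 0.6),
  pitch_type = sample(c("CU", "FF", "SL"), 600, replace = TRUE),
  Miss = runif(600) < 0.3
)

ggplot(swings, aes(px, pz, color = Miss)) +
  geom_point(alpha = 0.4, size = 0.8) +   # smaller, translucent dots ease over-plotting
  facet_wrap(~ pitch_type,
             labeller = as_labeller(c(CU = "Changeup", FF = "Fastball", SL = "Slider"))) +
  scale_color_manual(name = NULL,
                     values = c(`FALSE` = "grey60", `TRUE` = "steelblue"),
                     labels = c(`FALSE` = "Hit", `TRUE` = "Miss")) +
  labs(x = "horizontal location", y = "vertical location")
```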

***

Visualizing Baseball is not the book for readers who learn by running code, as no code is included in the book. A github page by the author hosts the code, but only the R/ggplot2 code for generating the data visualizations. Each script begins after the analysis or modeling has been completed. If you already know R and ggplot2, the github is worth a visit. In any case, I don’t recommend learning to code by copying and pasting clean code.

All in all, I can recommend this short book to any baseball enthusiast who’s beginning to look at baseball data. It may expand your appreciation of what can be done. For details, and practical implications, look elsewhere.


Seeking simplicity in complex data: Bloomberg's dataviz on UK gender pay gap

Bloomberg featured a thought-provoking dataviz that illustrates the pay gap by gender in the U.K. The dataset underlying this effort is complex, and the designers did a good job simplifying the data for ease of comprehension.

U.K. companies are required to submit data on salaries and bonuses by gender, and by pay quartiles. The dataset is incomplete, since some companies are slow to report, and the analyst decided not to merge companies that changed names.

Companies are classified into industry groups. Readers of Chapter 3 of Numbers Rule Your World (link) should ask whether these group differences are meaningful by themselves, without controlling for seniority, job titles, etc. The chapter features a method used by the educational testing industry to conduct a more nuanced analysis of group differences.

***

The Bloomberg visualization has two sections. In the top section, each company is represented by the percent difference between average female pay and average male pay. The companies within a given industry are then shown in a histogram. The histograms provide a view of the disparity between companies within each industry. The black line represents the relative proportion of companies in a given industry that have no gender pay gap, but it’s the weight of the histogram on either side of the black line that carries the graphic’s message.

This is the histogram for arts, entertainment and recreation.

[Image: Bloomberg_genderpaygap_arts - histogram for arts, entertainment and recreation]

The spread within this industry is very wide, especially on the left side of the black line. A large proportion of these companies pay women less on average than men, and how much less is highly variable. There is one extreme positive value: Chelsea FC Foundation, which pays the average female about 40% more than the average male.

This is the histogram for the public sector.

[Image: Bloomberg_genderpaygap_public - histogram for the public sector]

It is a much tighter distribution, meaning that the pay gaps vary less from organization to organization (this statement ignores the possibility that there are outliers not visible on this graphic). Again, the vast majority of entities in this sector pay women less than men on average.

***

The second part of the visualization looks at the quartile data. The employees of each company are divided into four equal-sized groups, based on their wages. Think of these groups as the Top 25% Earners, the Second 25%, etc. Within each group, the analyst looks at the proportion of women. If gender were independent of pay, we would expect the proportion of women to be about the same in all four quartiles. (This analysis considers gender to be the only explainer of pay gaps. This is a problem I've called xyopia, which frames a complex multivariate issue as a bivariate problem involving one outcome and one explanatory variable. Chapter 3 of Numbers Rule Your World (link) discusses how statisticians approach this issue.)

[Image: Bloomberg_genderpaygap_public_pie]

On the right is the chart for the public sector. This is a pie chart used as a container: every pie has four equal-sized slices representing the four quartiles of pay.

The female proportion is encoded in both the size and the color of the pie slices. The size encoding is more precise, while the color encoding, with only four levels, provides a “binned” summary view of the same data.

For the public sector, the lighter-colored slice shows the top 25% earners, and its light color means the proportion of women in the top 25% earners group is between 30 and 50 percent. As we move clockwise around the pie, the slices represent the 2nd, 3rd and bottom 25% earners, and women form 50 to 70 percent of each of those three quartiles.

To read this chart properly, the reader must first do one calculation. Women represent about 60% of the top 25% earners in the public sector. Is that good or bad? It depends on the overall representation of women in the public sector. If the sector employs 75 percent women overall, then 60 percent does not look good; but if it employs 40 percent women, then the same value of 60% tells us that female employees are disproportionately found among the top 25% earners.

That means the reader must compare each value in the pie chart against the overall proportion of women, which is learned from the average of the four quartiles.

***

In the chart below, I make this relative comparison explicit. The overall proportion of women in each industry is shown using an open dot. Then the graphic displays two bars, one for the Top 25% earners, and one for the Bottom 25% earners. The bars show the gap between those quartiles and the overall female proportion. For the top earners, the size of the red bars shows the degree of under-representation of women while for the bottom earners, the size of the gray bars shows the degree of over-representation of women.

[Image: Redo_junkcharts_bloombergukgendergap - Junk Charts redo of the Bloomberg chart]

The net sum of the bar lengths is a plausible measure of gender inequality.
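
The arithmetic behind the bars is simple. A sketch with hypothetical proportions (the Bloomberg figures are not reproduced here):

```r
# Female share in the top and bottom pay quartiles, compared against the
# overall share, taken as the average of the four quartiles
q <- data.frame(industry = c("Public sector", "Arts & entertainment"),
                q_top = c(0.45, 0.30), q2 = c(0.55, 0.40),
                q3 = c(0.60, 0.45), q_bottom = c(0.65, 0.50))

q$overall    <- rowMeans(q[, c("q_top", "q2", "q3", "q_bottom")])
q$top_gap    <- q$q_top - q$overall      # negative: women under-represented among top earners
q$bottom_gap <- q$q_bottom - q$overall   # positive: women over-represented among bottom earners
q$inequality <- abs(q$top_gap) + abs(q$bottom_gap)   # net bar length, as above
q
```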

The industries are sorted from the ones employing the fewest women (at the top) to the ones employing the most women (at the bottom). An alternative is to sort by total bar length. In the original Bloomberg chart (the small multiples of pie charts), the industries are sorted by the proportion of women in the bottom 25% pay quartile, from smallest to largest.

In making this dataviz, I elected to ignore the middle 50%. This is not a problem, since any quartile above the average must be compensated by another quartile below the average.

***

The challenge of complex datasets is discovering simple ways to convey the underlying message. This usually requires quite a bit of upfront analytics, data transformation, and lots of sketching.


Form and function: when academia takes on weed

I have a longer article on the sister blog about the research design of a study claiming that "420 Day" (April 20, the cannabis holiday) caused more road accident fatalities (link). The sister blog also has a discussion of the graphics used to present the analysis, which I'm excerpting here for dataviz fans.

The original chart looks like this:

[Image: Harperpalayew-new-420-fig2 - original chart from Harper & Palayew]

The question being asked is whether April 20 is a special day when viewed against the backdrop of every day of the year. The answer is pretty clear. From this chart, the reader can see:

  • that April 20 is part of the background "noise"; it does not stand out from the pack;
  • that other days, such as July 4, Labor Day and Christmas, stand out more than April 20.

It doesn't even matter what the vertical axis is measuring. The visual elements did their job. 

***

If you look closely, you can even assess the "magnitude" of the evidence, not just the "direction." While April 20 isn't special, it nonetheless is somewhat noteworthy. The vertical line associated with April 20 sits on the positive side of the range of possibilities, and appears to sit above most other days.

The chart form shown above is better at conveying the direction of the evidence than its strength. If the strength of the evidence is required, we use a different chart form.

I produced the following histogram, using the same data:

[Image: Redo_420day_2 - histogram of fatal crash ratios]

The histogram is produced by first sorting the midpoints# of the vertical lines into buckets, and then counting the number of days that fall into each bucket. (# Strictly speaking, I used the point estimates.)

The midpoints# are estimates of the fatal crash ratio, which is defined as the excess crash fatalities reported on the "analysis day" relative to the "reference days," situated one week before and one week after the analysis day. So April 20 is compared to April 13 and April 27. A ratio of 1 indicates no excess fatalities on the analysis day, and the further the ratio sits above 1, the more special the analysis day.
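
To make the construction concrete, here is a sketch on simulated daily counts. The ratio formula is one plausible reading of the definition above; the authors' exact estimator may differ:

```r
set.seed(6)
# Hypothetical daily fatal-crash counts for one year
deaths <- rpois(365, lambda = 120)

# Fatalities on each analysis day relative to the average of the days one
# week before and one week after
ratio <- sapply(8:358, function(d) deaths[d] / mean(deaths[c(d - 7, d + 7)]))

# The histogram then buckets these point estimates
hist(ratio, breaks = 30, xlab = "fatal crash ratio", main = "")
abline(v = 1, lty = 2)   # ratio of 1: no excess fatalities
```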

If we were to pick a random day from the histogram above, we would likely land somewhere in the middle, which is to say, a day of the year in which no excess car crash fatalities could be confirmed in the data.

As shown above, the ratio for April 20 (about 1.12) is located on the right tail, at roughly the 94th percentile, meaning that about 6 percent of analysis days had more extreme ratios.

This is in line with our reading above, that April 20 is noteworthy but not extraordinary.

P.S. [4/27/2019] Replaced the first chart with a newer version from Harper's site. The newer version contains the point estimates inside the vertical lines, which are used to generate the histogram.


The Bumps come to the NBA, courtesy of 538

The team at 538 did a post-mortem of their in-season forecasts of the NBA playoffs, using Bumps charts. These charts have a long history and can be traced back to Cambridge rowing. I featured them in these posts from a long time ago (link 1, link 2).

Here is the Bumps chart for the NBA Western Conference, showing all 15 teams and their ranking by the 538 model throughout the season.

[Image: Fivethirtyeight_nbawest_bumps - Bumps chart, NBA West]

The highlighted team is the Kings. Theirs is a story of ascent, especially in the second half of the season. It's also a story of close but no cigar: the team knocked at the door for the last five weeks but failed to grab the last playoff spot. The beauty of the Bumps chart is how easy it is to see this story.

Now focus on the dotted line labeled "Makes playoffs," and note that beyond the half-way point (1/31), there are no further crossings. This means that by that point, the 538 model had correctly picked the eight playoff teams.

***

Now what about NBA East?

[Image: Fivethirtyeight_nbaeast_bumps - Bumps chart, NBA East]

This chart highlights the two top teams. This conference is pretty easy to predict at the top. 

What is interesting is the spaghetti around the playoff line. The playoff race was heart-stopping and it wasn't until the last couple of weeks that the teams were settled. 

Also worthy of attention are the bottom-dwellers. Note that the chart is disconnected in the last four rows (ranks 12 to 15). These four teams never left the cellar, and the model had figured out their final rankings by around February.

Using a similar analysis, you can see that the model found the top five teams in this conference by mid-December, as there are no further crossings beyond that point.
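
This "no further crossings" check can be made mechanical: given each team's weekly model rank, find the last week at which the membership of the top-k group changed. A sketch on made-up rankings (with 538's actual West data, k = 8 would return the week of 1/31, per the reading above):

```r
set.seed(7)
ranks <- replicate(26, sample(1:15))      # hypothetical ranks: 15 teams x 26 weeks
rownames(ranks) <- paste0("team", 1:15)

last_change <- function(ranks, k) {
  # Signature of the top-k set in each week, then the last week it changed
  top_sets <- apply(ranks, 2, function(r) paste(sort(rownames(ranks)[r <= k]), collapse = ","))
  changes  <- which(top_sets[-1] != top_sets[-length(top_sets)]) + 1
  if (length(changes) == 0) NA else max(changes)
}

last_change(ranks, k = 8)   # after this week, the playoff picks were settled
```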

***
Go check out the FiveThirtyEight article for their interpretation of these charts. 

While you're there, read the article about when to leave the stadium if you'd like to leave a baseball game early, work that came out of my collaboration with Pravin and Sriram.


How to describe really small chances

Reader Aleksander B. sent me to the following chart in the Daily Mail, with the note that "the usage of area/bubble chart in combination with bar alignment is not very useful." (link)

[Image: Dailymail-image-a-35_1431545452562 - Daily Mail travel-risk chart]

One can't argue with that statement. This chart fails the self-sufficiency test: anyone reading the chart is reading the data printed in the right column, and gains nothing from the visual elements (thus, the visual representation is not self-sufficient). As a quick check, the size of the risk for "motorcycle" should be about 30 times larger than that for "car"; the size of the risk for "car" should be 100 times larger than that for "airplane". The risk of riding motorcycles is then roughly 3,000 times that of flying in an airplane.

The chart does not appear to be sized properly as a bubble chart:

[Image: Dailymail_travelrisk_bubble - the chart checked against correct bubble sizing]

You'll notice that the visible proportion of the "car" bubble is much larger than that of the "motorcycle" bubble, which is one part of the problem.
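
A quick way to audit bubble sizing: if area encodes the value, the radius must scale with the square root of the value. Using the rates quoted later in this post (7.3 deaths per billion passenger-miles for cars, 0.07 for planes), and taking the motorcycle rate to be about 30 times the car's, per the check above:

```r
# Rates in deaths per billion passenger-miles (motorcycle figure assumed ~30x car)
rates <- c(motorcycle = 30 * 7.3, car = 7.3, airplane = 0.07)

radius <- sqrt(rates / max(rates))   # radii relative to the largest bubble
round(radius, 3)
# car ~0.18: a correctly sized "car" bubble has about 18% of the radius
# (3.3% of the area) of the "motorcycle" bubble; "airplane" is barely visible
```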

Nor is it sized as a bar chart:

[Image: Dailymail_travelrisk_bar - the chart checked against correct bar sizing]

As a bar chart, both the widths and the heights of the bars vary; and the last row presents a further challenge, as the bubble for the airplane does not touch the baseline.

***

Besides the Visual, the Data issues are also quite hard. This is how Aleksander describes it: "as a reader I don't want to calculate all my travel distances and then do more math to compare different ways of traveling."

The reader wants to make smarter decisions about travel based on the data provided here. Aleksander proposes one such problem:

In terms of probability it is also easier to understand: "I am sitting in my car in strong traffic. At the end in 1 hour I will make only 10 miles so what's the probability that I will die? Is it higher or lower than 1 hour in Amtrak train?"

The underlying choice is between driving and taking Amtrak for a particular trip. This comparison is relevant because those two modes of transport are substitutes for this trip. 

One Data issue with the chart is that riding a motorcycle and flying in a plane are rarely substitutes. 

***

A way out is to do the math on behalf of the reader. The metric of deaths per 1 billion passenger-miles is not intuitive for a casual reader. A more relevant question is: what is my chance of dying from the amount of driving (or flying) I do in a year? Because that chance is very tiny, it is easier to express the risk as the number of years of travel before one death is expected.

Let's assume someone drives 300 days per year, and 100 miles per day so that each year, this driver contributes 30,000 passenger-miles to the U.S. total (which is 3.2 trillion). We convert 7.3 deaths per 1 billion passenger-miles to 1 death per 137 million passenger-miles. Since this driver does 30K per year, it will take (137 million / 30K) = about 4,500 years to see one death on average. This calculation assumes that the driver drives alone. It's straightforward to adjust the estimate if the average occupancy is higher than 1. 

Now, let's consider someone who flies once a month (one outbound trip plus one return trip). We assume that each plane takes on average 100 passengers (including our protagonist), and each trip covers on average 1,000 miles. Then each of these flights contributes 100,000 passenger-miles. In a year, the 24 trips contribute 2.4 million passenger-miles. The risk of flying is listed at 0.07 deaths per 1 billion, which we convert to 1 death per 14 billion passenger-miles. On this flight schedule, it will take (14 billion / 2.4 million) = almost 6,000 years to see one death on average.
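
The two calculations, wrapped as a function so readers can plug in their own schedules (mirroring the assumptions above, including counting the whole plane's passenger-miles):

```r
# Years of a given travel schedule before one death is expected, for a rate
# expressed in deaths per billion passenger-miles
years_per_death <- function(deaths_per_billion_pm, passenger_miles_per_year) {
  (1e9 / deaths_per_billion_pm) / passenger_miles_per_year
}

years_per_death(7.3, 300 * 100)         # driving alone: ~4,500 years
years_per_death(0.07, 24 * 100 * 1000)  # flying, whole plane's miles: ~6,000 years
```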

For the average person on those travel schedules, there is nothing to worry about. 

***

Comparing driving and flying is only valid for those trips in which you have a choice. So a proper comparison requires breaking down the average risks into components (e.g. focusing on shorter trips). 

The above calculation also suggests that the risk is not evenly spread throughout the population, despite the use of an overall average. A trucker who is on the road every work day is clearly subject to higher risk than an occasional driver who makes a few trips in rental cars each year.

There is a further important point to note about flight risk, due to MIT professor Arnold Barnett. He has long criticized the use of deaths per billion passenger-miles as a risk metric for flights. (In Chapter 5 of Numbers Rule Your World (link), I explain some of Arnie's research on flight risk.) The problem is that almost all fatal crashes involving planes happen soon after take-off or not long before landing. 


Book Preview: How Charts Lie, by Alberto Cairo

[Image: Howchartslie_cover]

If you’re like me, your first exposure to data visualization was as a consumer. You may have run across a pie chart, or a bar chart, perhaps in a newspaper or a textbook. Thanks to the power of the visual language, you got the message quickly, and moved on. Few of us learned how to create charts from first principles. No one taught us about axes, tick marks, gridlines, or color coding in science or math class. There is a famous book in our field called The Grammar of Graphics, by Leland Wilkinson, but it’s not a For Dummies book. This void is now filled by Alberto Cairo’s soon-to-appear new book, titled How Charts Lie: Getting Smarter about Visual Information.

As a long-time fan of Cairo’s work, I was given a preview of the book, and I thoroughly enjoyed it and recommend it as an entry point to our vibrant discipline.

In the first few chapters of the book, Cairo describes how to read a chart. Some may feel that there is not much to it, but if you’re here at Junk Charts, you probably agree with Cairo’s goal. Indeed, it is easy to mis-read a chart. It’s also easy to miss the subtle and brilliant design decisions when one doesn’t pay close attention. These early chapters cover all the fundamentals to become a wiser consumer of data graphics.

***

How Charts Lie will open your eyes to how everyone uses visuals to push agendas. The book is an offshoot of a lecture tour Cairo took during the last year or so, which has drawn large crowds. He collected plenty of examples of politicians and others playing fast and loose with their visual designs. After reading this book, you can’t look at charts with a straight face!

***

In the second half of his book, Cairo moves beyond purely visual matters into analytical substance. In particular, I like the example on movie box office from Chapter 4, titled “How Charts Lie by Displaying Insufficient Data”. Visual analytics of box office receipts seems to be a perennial favorite of job-seekers in data-related fields.

The movie data is a great demonstration of why one needs to statistically adjust data. Cairo explains why Marvel’s Black Panther is not the third highest-grossing film of all time in the U.S., as reported in the media. That is because gross receipts should be inflation-adjusted. A ticket worth $15 today cost $5 some time ago.

This discussion features a nice-looking graphic: a staircase chart showing how long a #1 movie stayed in the top position until it was replaced by the next highest-grossing film.

[Image: Cairo_howchartslie_movies - staircase chart of #1 movies]

Cairo’s discussion goes further, exploring the number of theaters as a “lurking” variable. For example, Jaws opened in about 400 theaters while Star Wars: The Force Awakens debuted in 10 times as many. A chart showing per-screen inflation-adjusted gross receipts looks much different from the original chart shown above.

***

Another highlight is Cairo’s analysis of the “cone of uncertainty” chart frequently referenced in anticipation of impending hurricanes in Florida.

[Image: Cairo_howchartslie_hurricanes - cone of uncertainty map]

Cairo and his colleagues have found that “nearly everybody who sees this map reads it wrongly.” The casual reader interprets the “cone” as a sphere of influence, showing which parts of the country will suffer damage from the impending hurricane. In other words, every part of the shaded cone will be impacted to a larger or smaller extent.

That isn’t the designer’s intention! The cone embodies uncertainty, showing which parts of the country have what chance of being hit by the impending hurricane. In the aftermath, the hurricane would have traced one specific path, and that path would have run through the cone if the predictive models were accurate. Most of the shaded cone would have escaped damage.

Even experienced data analysts are likely to mis-read this chart: as Cairo explains, the cone has a “confidence level” of 68%, not the more conventional 95%. Areas outside the cone still have a chance of being hit.

This map clinches the case for why you need to learn how to read charts. And Alberto Cairo, who is a master visual designer himself, is a sure-handed guide for the start of this rewarding journey.

***

Here is Alberto introducing his book.


This chart advises webpages to add more words

A reader sent me the following chart. In addition to the graphical glitch, I was asked about the study's methodology.

[Image: Serp-iq-content-length - bar chart of word count by Google position]

I was able to trace the study back to this page. The study itself uses a line chart, not the bar chart shown above; the line chart's axis also does not start at zero. The line shows that web pages ranked higher by Google on the first page tend to have more words, i.e. longer content may help with Google ranking.

[Image: Backlinko_02_Content-Total-Word-Count_line - line chart from the study]

On the bar chart, Position 1 is more than 6 times as big as Position 10, if one compares the bar areas. But in the data, it is really only 20% larger.

In this case, even the line chart is misleading. If we extended the Google position to 20, the line would quickly dip below the horizontal axis if the same trend applied.
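
To see this, fit a line to positions 1 through 10 and extend it. The word counts below are invented, but keep Position 1 about 20% above Position 10, as noted above:

```r
pos   <- 1:10
words <- seq(2400, 2000, length.out = 10)   # hypothetical average word counts

fit <- lm(words ~ pos)
predict(fit, data.frame(pos = 20))
# ~1,556 words: far below the truncated vertical axis of the original chart,
# so the plotted line would exit the frame well before position 20
```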

The line chart includes too much grid, one of Tufte's favorite complaints. The Google position is an integer, and yet the chart's gridlines imply that a rank of 0.5 is possible.

Any chart of this data should supply information about the variance around these average word counts. I would like to see a side-by-side box plot, for example.

Another piece of context is the word counts for results on the second or third pages of Google results. Where are the short pages?

***

Turning to methodology, we learn that the research team analyzed 1 million pages of Google search results, and they also "removed outliers from our data (pages that contained fewer than 51 words and more than 9999 words)."

When you read a line like this, you have to ask some questions:

How do they define "outlier"? Why do they choose 51 and 9,999 as the cut-offs?

What proportion of the data was removed at either end of the distribution?

If these proportions are small, then the outliers are not going to affect that average word count by much, and thus there is no point to their removal. If they are large, we'd like to see what impact removing them might have.

In any case, the median is a better number to use here; better yet, show us the distribution, not just the average.

It could well be true that Google's algorithm favors longer content, but we need to see more of the data to judge.