Numbersense, in Chinese and Japanese

This is a cross-post on my two blogs.

The new year brings news that my second book, Numbersense: How to Use Big Data to Your Advantage, has been translated into Chinese (simplified) and Japanese. Here are the book covers:

Chinese_edition_cover

In Chinese, the title reads: "Say No to Fake Big Data". Captures the sentiment of the book pretty well, I must say.

Numbersense_japanese_cover

I have no idea what the Japanese title means. Perhaps a reader can help me out here.

***

The Japanese version is available here or here.

The Chinese version is here.

The English version is here.


Circular but insufficient

One of my students analyzed the following Economist chart for her homework.

Economist_book_sales_printversion

I was looking for it online, and found an interactive version that is a bit different (link). Here are three screen shots from the online version for years 2009, 2013 and 2018. The first and last snapshots correspond to the years depicted in the print version.

  Economist_booksales_all

The online version serves as the self-sufficiency test for the print version. In testing self-sufficiency, we want to see if the visual elements (i.e. the circular sectors on the print version) pull their own weight. The quick answer is no. The reader can't tell how much in sales each sector represents, nor can they reliably estimate the relative scale of print versus ebook sales (pink/red vs. yellow/orange) or the year-to-year growth rates.

As usual, when we see the entire data set printed on the chart itself, it is a giveaway that the visual elements are mere ornaments.

The online version does not have labels unless you hover over the hemispheres. But again it is a challenge to learn anything from the picture.

In the Trifecta checkup, this is a Type V chart.

***

This particular dataset is made for the bumps-style chart:

Redo_economistbooksales

Book review: The Functional Art

Cairo_book_cover

Reading Alberto Cairo's fabulous book, The Functional Art, feels like reading my own work. It's staggering how closely aligned our sensibilities are, notwithstanding our disparate backgrounds, he a data journalist by training, and I a statistician. We probably can finish each other's sentences, and did at this recent Analytically Speaking webcast (link to clip).

Cairo currently teaches data visualization at the University of Miami; this is after a distinguished career as a data/visual journalist, having won many awards.

The Functional Art is divided into halves, which can be read independently.

The front part is a terrific overview of data visualization concepts. Cairo’s interest is in principles, rather than recipes. The field of data visualization has developed separately under three academic disciplines: design, computer science, and statistics. Inevitably, the work products contain contradictions and much re-invention. Cairo achieves a synthesis of these schools of thought, and this book is the clarion call for more work on unifying the key intellectual threads of the field.

The second half contains a series of interviews with industry luminaries. This section is a unique contribution to the literature, offering a glance behind the scenes of the craft. Practitioners will find these short pieces illuminating and profitable. It is often a long journey to arrive at the graphic in print. The selection of designers emphasizes mainstream media outlets, although the interviewees have wide-ranging views.

Included in these pages are plenty of published data graphics, frequently work that Cairo produced while working for the Brazilian publication, Epoca. These graphics are elaborate and ambitious, and nicely reproduced in color images. They reward detailed study, with attention to composition, narrative structure, chart types, selection of statistics, etc.

There are plenty of books on the market about how to do graphics (Dona Wong, Naomi Robbins, and Nathan Yau come to mind). Cairo's book is not about doing, but about thinking about charts. Trust me, time spent thinking about charts will much improve your charts.

***

I will now describe some sections of the book that particularly held my interest:

In Chapter 3, Cairo explains the “visualization wheel,” a nice way to visualize the decisions that designers make when creating charts. Each decision is presented as a trade-off between two extremes. For example, a chart can be “light” or “dense.” This axis evokes Tufte’s data-ink ratio. Devices such as this wheel are useful for integrating the diverse viewpoints that coexist in our field. Frequently, these trade-off decisions are made implicitly—but they can really benefit from explicit consideration.

Figure 4.11 is one of the Epoca charts narrating a Brazilian election. Just recently, I linked to Cairo’s blog post about a similar chart. In both, a spider (radar) plot features prominently. On the same chart, you’ll find a nice demonstration of the small-multiples principle. I applaud the publisher of Epoca for supporting such deep data graphics.

Chapter 8 is invaluable in documenting the chart-making process. Trial and error is a key element of this process. Here, Cairo shows some of the earlier drafts of projects that eventually went to publication. This material is similar to what Kevin Quealy shows at his ChartNThings blog about New York Times graphics.

Chapter 9 is one of the more mature discussions of interactive graphics I have seen. Too often, interactivity is reduced to a feature that is layered onto any dataset. It should rightfully be seen as a problem of design.

Figure 10.1 is not strictly speaking a “data” graphic but I love John Grimwade’s visual explanation of the “transatlantic superhighway”.

Cairo also writes a blog.


Update on Dataviz Workshop 1

Happy to report on the dataviz workshop, a first-time offering at NYU. I previously posted the syllabus here.

I made minor changes to the syllabus, adding Alberto Cairo's book, The Functional Art (link), as optional reading, some articles from the recent debate in the book review circle about the utility of "negative reviews" (start here), and some blog posts by Stephen Few.

The Cairo and Few readings, together with Tufte, are closest to what I want to accomplish in the first two classes, before we start discussing individual projects: encouraging students to adopt the mentality of the course, that is to say, to think of dataviz as an art form. An art form implies many things, one of which is a seriousness about the output, and another is the recognition that the work has an audience.

The field of data visualization is sorely lacking high-level theory, immersed as so many of us are in tools, data, and rules of thumb. It is my hope that these workshop discussions will lead to a crystallization of the core principles of the field.

We went on a tour of many dataviz blogs, and documented various styles of criticism. In the next class, we will discuss what style we'd adopt in the course.

***

The composition of the class excites me. There are 12 enrolled students, which is probably the maximum for a class of this type. One student subsequently dropped out, after learning that the workshop is really not for true beginners.

The workshop participants come from all three schools of dataviz: computer science, statistics, and design. Amongst us are an academic economist trained in statistical methods, several IT professionals, and an art director. This should make for rewarding conversation, as inevitably there will be differences in perspective.

***

REQUEST FOR HELP: A variety of projects have been proposed; several are using this opportunity to explore data sets from their work. That said, some participants are hoping to find certain datasets. If you know of good sources for the following, please write a comment below and link to them:

  • Opening-day ratings from sites like Rotten Tomatoes
  • New York City water quality measures by county (or other geographical unit), probably from an environmental agency
  • Data about donors/donations to public media companies

***

Since this is a dataviz blog, I want to include a chart with this post. I did a poll of the enrolled students, and one of the questions was about what dataviz tools they use to generate charts. I present here two views of the same data.

The first is a standard column chart, plotting the number of students who include a particular tool in their toolset (each student was allowed to name more than one tool). This presents a simple piece of information simply: Excel is the most popular, although the long tail indicates the variety of tools people use in practice.

Class_tools_2

What the first option doesn't bring out is the correlation between tools, namely that several tools may be used by the same participant. The second option makes this clear, with each column representing a student. This chart is richer: it also shows how many tools the average student uses, and the relationships between different tools.

Class_tools_1

The tradeoff is that the reader has to work a little more to understand the relative importance of the different tools, a message that is very clear in the first option. 

This second option is also not scalable. If there are thousands of students, the chart will lose its punch (although it will undoubtedly be called beautiful).
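
For readers who want to play with the underlying tallies, here is a minimal sketch of how the data behind the two views could be computed. The survey responses below are made up for illustration; they are not the actual class data:

```python
from collections import Counter
from itertools import combinations

# Hypothetical survey: each student lists the tools he or she uses.
students = [
    ["Excel"],
    ["Excel", "R"],
    ["Excel", "Tableau"],
    ["R", "Python"],
    ["Excel", "R", "Python"],
]

# View 1: number of students naming each tool (the column chart).
tool_counts = Counter(t for tools in students for t in tools)

# View 2's extra information: which tools co-occur in one student's toolset.
pair_counts = Counter(
    pair for tools in students for pair in combinations(sorted(tools), 2)
)

print(tool_counts.most_common(1))   # the most popular tool (Excel here)
print(pair_counts[("Excel", "R")])  # students using both Excel and R
```

The first view throws away the pairing information that the second view preserves, which is exactly the trade-off discussed above.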

Which version do you like? Are there even better ways to present this information?

 


Book review: Data Points by Nathan Yau

DataPoints

One of my summer projects is to develop the curriculum for a new Certificate in Analytics and Data Visualization, offered at NYU (link). (If you are interested in teaching these courses, please contact me.) The program aims to give students a balanced training, covering datavis from the perspectives of statistics, graphical design and computer science.

Nathan Yau's new book, Data Points, landed on my desk at just the right time. It is a nice overview of the subject of data visualization, and it can serve nicely in our introductory course. The book sits closer to the statistical and design perspectives. Instructors will need to supplement the computer science topics such as interactivity, networks, and online graphics. It is of course difficult to teach interactive graphics from a static textbook. (Yau's previous book, Visualize This, has detailed tutorials of most of these techniques. My issue with that book is that it tries to be too many things at once.)

Data Points is a concepts and examples book. It's not a how-to book. There are figures on almost every page, and unlike Visualize This, most figures are actual published data visualization projects.

Just for fun, I classified the figures and plotted the result. (Some purely instructive figures are skipped.)

Datapoint_fig

Running from left to right is the order of appearance of the chart within the book. I classified a total of 135 charts. For each chart, I considered whether one or more of 12 adjectives apply. I labeled about 40 charts "useful", "banal", "silly", and/or "engaging".

You can see from this graph that I enjoy the charts in the initial chapters. Up till chart number 50 or so, I find few "banal" charts, and many "engaging" or "amusing" or "artistic" charts. In the second part of the book, there are not many "surprising" or "amusing" charts.

As for "silly" and "baffling" charts, they appear at an even clip throughout. But that represents just my own bias. I also find "useful" charts throughout the book.

***

PS. I received a review copy of Data Points. Nathan's blog is Flowing Data.


Can information be beautiful when information doesn't exist?

Reader Steve S. sent in this article that displays nominations for the "Information is Beautiful" award (link). I see "beauty" in many of these charts but no "information". Several of these charts have appeared on our blog before.

Junkcharts_trifecta_checkup

Let's use the Trifecta checkup on these charts. (More about the Trifecta checkup here.)

Info_beaut_plot_lines

The topic of this chart is both tangible and interesting. As someone who loves books, I do want to know what genres of books typically win awards.

However, both the data collection and graphical design make no sense.

The data collection problem presents a huge challenge, and it's easy to get wrong. The problem is how narrow a theme should be. If it's too narrow, you can imagine every book having its own set of themes. If it's too wide, each theme maps to lots of books. The challenge is to select themes of similar "width". For example, "death" is a very wide theme, and lots of books contain it, as indicated by the black lines. "Nanny trust issues" is a very narrow theme, and only one of these books deals with it. When we encounter such a theme, is its lack of popularity due to its narrow definition, or to writers not being interested in it?

***

Info_beaut_covers

The caption of this chart said "Cover stars: Charting 50 years up until 2010, this graphic shows The Beatles to be the most covered act in living memory." If that is the message, a much simpler chart would work a lot better.

Since the height of the chart indicates the number of covers sold in that year, the real information being shown is the boom and bust cycles of the worldwide economy. So, a lot more records were sold in 2005, and then the market tanked in 2008, for example.

That's why the data analyst should think twice before plotting raw data. Most data like these should be adjusted. In this case, you could either compare artists against one another within each year (by using proportions), or perform a seasonal and trend adjustment. I also don't see the point of highlighting year-to-year fluctuations. Nor do I understand why the top-rated cover is identified by name and laurel wreath only in certain years.
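
The first adjustment mentioned above, comparing artists within each year by using proportions, can be sketched in a few lines. The counts below are invented purely for illustration:

```python
# Hypothetical covers-sold counts by (year, artist); all values invented.
raw = {
    (2005, "Beatles"): 120, (2005, "Elvis"): 80,
    (2008, "Beatles"): 30,  (2008, "Elvis"): 30,
}

# Total sales per year, to divide out the boom-and-bust of the market.
year_totals = {}
for (year, _), n in raw.items():
    year_totals[year] = year_totals.get(year, 0) + n

# Within-year shares compare artists fairly across fat and lean years.
share = {key: n / year_totals[key[0]] for key, n in raw.items()}

print(share[(2005, "Beatles")])  # 0.6: Beatles led in the boom year
print(share[(2008, "Beatles")])  # 0.5: a dead heat in the bust year
```

Raw counts would say the Beatles "collapsed" from 120 to 30, when most of that drop belongs to the shrinking market, not the artist.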

 

***

I talked about this stream graph of 311 calls back in 2010. See the post here.

Info_beaut_311calls

***

I featured this set of infographics/pie charts back in 2011. See the post here.

Info_beaut_refugees

***

This chart is a variant of the one from New York Times that I discussed here. I like the proper orientation on the NYT's version. The color scheme here may be slightly more attractive.

Info_beaut_trackfield

 


Reading: WSJ Guide to Information Graphics

Dona Wong, who had stints on the graphics teams at both the Wall Street Journal and the New York Times, has contributed a how-to book on statistical graphics. It is called "The Wall Street Journal Guide to Information Graphics".

The biggest strength of this book is the material on data collection and selection, which is an overlooked aspect of statistical graphics. The content of p.103, for example, is not typically found in similar books: on this page, Wong works through how to determine the scales for two stock-price charts in such a way that the distances represent relative changes in stock prices (rather than absolute changes). Chapter 3 ("Ready Reference"), which covers this type of material, is almost as big as Chapter 2, which runs through basic rules of making graphs that should be familiar to our readers. Her philosophy, then, leans toward Tukey's as espoused in his seminal book EDA, although Wong keeps to the most basic elements (percentages, indices, log scales, etc.), obviously aiming for a different audience than Tukey.
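
One standard way to achieve what the p.103 example describes, making distances represent relative rather than absolute changes, is to index each series to a common base (a log scale is another). A minimal sketch, with invented prices:

```python
# Two hypothetical stock series at very different price levels
# (all numbers invented for illustration).
stock_a = [10.0, 11.0, 12.0]      # rises 10% then ~9%
stock_b = [200.0, 220.0, 240.0]   # same relative moves, 20x the price

def index_to_100(prices):
    """Rebase a price series so its first value is 100; equal vertical
    distances then represent equal relative changes."""
    base = prices[0]
    return [100.0 * p / base for p in prices]

a_idx = index_to_100(stock_a)
b_idx = index_to_100(stock_b)

# Despite the 20x gap in price levels, the indexed series coincide,
# because both stocks moved by the same percentages.
print(a_idx)  # [100.0, 110.0, 120.0]
print(b_idx)  # [100.0, 110.0, 120.0]
```

Plotting the raw prices on one chart would flatten stock A into a line hugging the axis; the indexed (or log-scaled) version puts the two on equal footing.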

The guidelines relating to making charts are prescriptive and concise. The following snippet (pp.72-73) is typical of the style:

Dona_1  

Wong focuses on saying what to do, but (usually) not why. Perhaps for this reason, the book has no references or notes, except for mentioning Ed Tufte as Wong's thesis adviser. Almost all the best practices described in the book would meet with our approval. One that has not been featured much on this blog is the preference for shades of the same color over many different colors of the same shade.

Despite the title, the book actually discusses statistical graphics (same as Junk Charts), not "infographics" (as covered by Information Aesthetics, for example). Almost all the graphical examples are conceptual, and not based on real-life examples. This editorial decision has the advantage of sharpening the educational message but the disadvantage of being less engaging.

A unique feature of Wong's book is Chapter 5 ("Charting Your Course"), which covers business charts used to organize operational data rather than present insights -- things like Gantt charts (which she calls work plans), org charts, flow charts, 2-by-2 matrices, and so on: the toolkit of management consultants. This is an under-studied area, and it deserves more attention. I am reminded of Tufte's re-design of bus schedules. This type of chart is different in its need to print every piece of data onto the chart, in the prevalence of text data (and the difficulty of incorporating text into charts), and in having efficient search as a primary goal. And it is in this chapter that the decision to stay conceptual diminishes the impact: it would be very valuable for readers to see a complete Gantt chart based on a real project, and how it evolves over the course of the project. I have always found such charts to start out nicely but gradually sink as details and detours pile up.

***

There is one chart on p.59 I would like to discuss.

Dona2

Here, Wong allows the use of double axes in certain cases, basically when the two data series have linearly-related scales. She appends the advice: "Adhere to the correct chart type for each series -- lines for continuous data and bars for discrete quantities... The only exception is when both data series call for a chart with vertical bars. In such instances, convert one to a line." (Regular readers know I don't think much of this rule.)

Based on the chart above, Wong either considers both revenue and market share to be discrete quantities, or considers revenue to be discrete and market share to be continuous. In my mind, both series are continuous data and a chart with two lines is appropriate here.


Serving donuts

David Leonhardt's article on the graduation rates of public universities caught my attention for both graphical and statistical reasons.


Nyt_gradrate

David gave a partial review of a new book, "Crossing The Finish Line", focusing on the authors' conclusion that public universities must improve their 4-year graduation rates in order for education in the U.S. to achieve progress.  This conclusion was arrived at through statistical analysis of detailed longitudinal data (collected since 1999).

This chart is used to illustrate this conclusion.  We will come to the graphical offering later but first I want to fill in some details omitted from David's article by walking through how a statistician would look at this matter, what it means by "controlling for" something.

The question at hand is whether public universities, especially less selective ones, have "caused" students to lag behind in graduation rate.  A first-order analysis would immediately find the overall graduation rate at less selective public universities to be lower, about 20% lower, than at more selective public universities.

A doubter appears, and suggests that less selective schools are saddled with lower-ability students, and that this would be the "cause" of lower graduation rates, as opposed to anything the schools actually do to students.  Not so fast: the statistician now disaggregates the data and looks at the graduation rates within subgroups of students of comparable ability (in this instance, the researchers used GPA and SAT scores as indicators of ability).  This is known as "controlling for the ability level".  The data now show that the same gap of about 20% exists at every ability level: about 20% fewer students graduate at the less selective colleges than at the more selective ones.  This eliminates the mix of abilities as a viable "cause" of lower graduation rates.
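
The disaggregation step described above can be sketched in a few lines: group students by ability level, then compare graduation rates within each group. All records below are invented for illustration (the real study used GPA and SAT bands):

```python
# Hypothetical records: (ability_level, school_selectivity, graduated).
records = [
    ("high", "more", True),  ("high", "more", True),
    ("high", "less", True),  ("high", "less", False),
    ("low",  "more", True),  ("low",  "more", False),
    ("low",  "less", False), ("low",  "less", False),
]

def grad_rate(ability, selectivity):
    """Graduation rate within one (ability, selectivity) subgroup."""
    cell = [g for a, s, g in records if a == ability and s == selectivity]
    return sum(cell) / len(cell)

# Controlling for ability: compare rates within each ability level.
# In this toy data, a 50-point gap persists in every subgroup, so the
# mix of abilities cannot explain the difference between school types.
for ability in ("high", "low"):
    gap = grad_rate(ability, "more") - grad_rate(ability, "less")
    print(ability, gap)
```

If the gap had vanished within subgroups, the doubter would have been right: the overall difference would have been an artifact of which students enroll where.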

The researchers now conclude that conditions of the schools (I think they blame the administrators) "caused" the lower graduation rates.  Note, however, that this does not preclude factors other than mix of abilities and school conditions from being the real "cause" of lower graduation rates.  But as far as this analysis goes, it sounds pretty convincing to me.

That is, if I ignore the fact that graduation rates are really artifacts of how much the administrators want to graduate students.  As the book review article pointed out, the less selective colleges may want to reduce graduation rates in order to save money, since juniors and seniors are more expensive to support due to smaller class sizes and so on.  On the other hand, the most selective colleges have an incentive to maintain a near-perfect graduation rate since US News and other organizations typically use this metric in their rankings -- if you were the administrator, what would you do?  (You didn't hear it from here.)

Back to the chart, or shall we say the delivery of 16 donuts?

First, it fails the self-sufficiency principle.  If we remove the graphical bits, nothing much is lost from the chart: it is equally impenetrable with or without them.

A far better alternative is shown below, using a type of profile chart.

Redo_gradrate

Finally, I must mention that in this particular case, there is no need to draw all four lines.  Since the finding of a 20% gap essentially holds for all subgroups, no information is lost by collapsing the subgroups and reporting the average line instead (with a note explaining that the same gap held for every subgroup).

By the way, that is the difference between the statistical grapher - who is always looking to simplify the data - and the information grapher - who is aiming for fidelity. 




Reference: "Colleges are lagging in graduation rates", New York Times, Sept 9, 2009; "Book review: (Not) Crossing the Finish Line", Inside Higher Education, Sept 9, 2009.

A book

We bring attention to a book on graphics written by Bernard Lebelle, a frequent contributor to this blog.  The book came out in France earlier this year.  The title is "Convaincre avec des graphiques efficaces sous Excel, PowerPoint ...", published by Eyrolles.  Thankfully, because much of the book is visual, I don't need to know French to understand much of it.  Here, I discuss a few interesting things:

On page 13, he discussed flow diagrams using the energy flow example that led to a long discussion on this blog.  He proposed using a Marimekko chart instead.

Energy

On page 89, he showed a concentric circle chart (see below).  This is a relatively simple train schedule showing the frequency of trains at each hour on each day of the week.  It looks interesting because of the allusion to the clock, except that typical clocks have twelve hours rather than 24.  I'd create a set of two charts, one for the first twelve hours, one for the second twelve.

Concentric

This sort of chart is very limited in utility but it works well here because the data is entirely categorical - one or two trains per hour, hour of day, day of week - and in addition, the relationships are very simple.  In fact, the reader/user does not need to read any trends, general patterns or estimate the size or shape of anything.  The user is performing a simple search operation, that's it.

(The innermost circle is unlabelled so it is unclear what that signifies.)

Lebelle provided an alternative on page 90, which is essentially a data table, with time on the vertical dimension and calendar date on the horizontal, and the frequency inside the cells.  This is more straightforward, less interesting.


On page 151, he mentioned the self-sufficiency test that we discussed often here.  A graph should do more than just print all the data in the data set.

Lebelle is currently a Senior Manager at Deloitte, the management consulting firm, and he focuses on graphical construction in Excel.  This is both a limitation and an advantage.  Excel, of course, has many imperfections (don't get me started on the new and horrid Excel).  However, Excel is still the most widely used graphing application, by far.


The book takes a perspective on charting that fits our philosophy very well.  Here is a rough summary of the contents of the book (any mistakes are mine):

chapter 1: a summary of the key features of good charts... issues such as clarity and efficiency of the message are addressed

chapter 2: historical perspective, with examples from Playfair, Minard, Nightingale, etc.  page 38 has an interesting table comparing the contributions of Bertin, Tukey, Tufte, Ware and Cleveland.

chapter 3: constructs of a chart such as axes, legends, etc.  page 43 explains the difference between "information design", "infographics", "charts" and "information visualization".  introduces chartjunk, data-ink ratio.

chapter 4: "decoding" of a chart.  Discusses optical illusions, which I also consider to be fundamental to understanding the effect of charts on the audience.  Talks about how different ways of displaying the same data is perceived differently.  Interesting section (starting p.101) considering some quantitative theories about perception, citing Ernst Weber and Stanley Smith Stevens.

chapter 5: process of making a chart.  The nitty-gritty things like transforming the data, picking a scale, etc.

chapter 6: examples.  Also introduces a classification system for charts.  It has one of those flowcharts that is supposed to let someone pick a chart type based on whether the data is numeric or categorical, etc.  I know these are very popular in engineering and scientific textbooks but I have never found any use for such flowcharts.  There are 30 to 40 pages of charts here, a great resource for ideas.

chapter 7: exercises

chapter 8: resources



How to read a graph

Via Gelman, here is a nifty book-buying map from Amazon, displaying the split between "red books" and "blue books" bought by Amazon users in each state in the months leading up to the 2004 and 2008 presidential elections.

Last60days

Gelman noted the similarity between the Amazon map and the red-blue split of rich voters.


This post is about how to read a graph.  Here are some things that come to mind looking at the map:

  • Sampling bias: how does Amazon's customer base compare with the U.S. population, or rich voters?  It would be prudent to check this before making generalizations.  Gelman's point may be that Amazon customers behave like rich voters.
  • Sampling period: is the period long enough to capture the average inclination of the book buyers?  As is well known, book sales follow a long-tail distribution (Chris Anderson wrote an entire book based on this observation).  Best-sellers have a disproportionate influence on average values.  If the time period is too short, the data may only represent the best-sellers.  Consider the following two maps from successive periods in 2004:

Unfitforcommandaug004

Worsethanwatergateapr04

Much of the red in the first map was due to John O'Neill's "Unfit for Command", published in August 2004, and much of the blue in the second map was due to John Dean's "Worse Than Watergate", published in April 2004.  If one of these two-month periods was used to draw conclusions, we would make big mistakes!

  • Classification: The long-tailed nature of book sales has wide-reaching implications on interpreting the data.  The most essential feature is that single books (bestsellers) have a disproportionate impact on average sales.  Since the key metric here is proportion of red (or blue) books, it follows that whether a best-seller is classified as red or blue makes a huge difference. 
Thus, one of the first things to look at is Amazon's helpful explanation of how they classified books as "red" or "blue".  We learn that they also have "purple" books, which are those they could not decide whether to call red or blue.  Each red or blue book is given equal weight, but it appears that purple books are not tallied.  Glancing at the list of purple books, I see some hugely important books, e.g. Ron Paul's "The Revolution: A Manifesto" (Amazon rank #56 among all books), and Tom Friedman's "Hot, Flat and Crowded" (#15).

If the purple books include best-sellers, then the decision to call it purple rather than red or blue causes an influential book to be excluded from the calculation.  We often forget that the decision to exclude is not a neutral decision; it is an active decision that says the excluded data contains no useful information.
 
This is not to say that excluding those books is the wrong decision.  We must make these decisions with considerable care, and realize that excluding best-sellers when book sales have a long-tailed distribution must not be taken lightly.

  • Causality: Let's say we are sufficiently satisfied that we can make a statement about book buying habits and voting behavior.  Then we need to think about the direction of causality.  Is the map saying that red book buyers are likely to vote red?  Or that red voters are likely to buy red books?  No amount of staring at this data set will resolve this issue; other data would be needed to address it.
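
The classification point above, that excluding a "purple" best-seller is not a neutral decision, is easy to demonstrate with a toy calculation. The titles and sales figures below are invented for illustration:

```python
# Hypothetical sales in one state: (title, color, copies). Invented data.
books = [
    ("Red Book A",   "red",    500),
    ("Blue Book B",  "blue",   400),
    ("Bestseller C", "purple", 2000),  # could arguably be red or blue
]

def red_share(data):
    """Proportion of red books among the red and blue tallies;
    purple books are simply dropped, as on the Amazon map."""
    red = sum(n for _, c, n in data if c == "red")
    blue = sum(n for _, c, n in data if c == "blue")
    return red / (red + blue)

print(red_share(books))  # about 0.56: the state looks red

# Reclassify the bestseller as blue and the state flips decisively.
reclassified = [(t, "blue" if c == "purple" else c, n)
                for t, c, n in books]
print(red_share(reclassified))  # about 0.17: the state looks blue
```

With a long-tailed distribution, the classification of a single best-seller can swing a state from one color to the other, which is why the exclusion decision deserves so much care.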

The more data is used to create a graph, the harder it is to interpret.  But the pay-off for spending the time is all the sweeter.  Happy graph-reading!


One final note: there is no doubt that this interactive map feature is a brilliant marketing move by Amazon.  This is a great and fun way for readers to find interesting books.


Reference: "Amazon, U.S.A.", Gelman blog, Oct 5 2008.