Imagine Magazine, a youth-focused journal published by the Johns Hopkins Center for Talented Youth, invited me to contribute an article in celebration of statistics. In it, I try to convey the fun and joy of working with numbers and charts.
You can read it here, starting on p. 22. While you're at it, the rest of the magazine is really great too, and I hope we're able to influence at least a few youngsters to take up math-stats as a career.
I will be the luncheon speaker at INFORMS NYC on Wednesday. The talk will provide some context for my new book Numbersense (link), and discuss a few examples from the book. You can pre-register here.
INFORMS is the professional society for Operations Research and Management Science people. For some years, I have attended its luncheons regularly and learned a lot from other industry speakers.
If you decide at the last minute, you can pay the extra $5 fee on the day of the talk. Or register now.
I will be at Book Expo this Friday signing books at the McGraw-Hill
booth. If you're in NYC, drop by and say hi between 11 and 12.
Yes, it's a new book! The title is Numbersense: How to Use Big Data to Your Advantage (link).
If you read my blogs, you already know where I'm going with this. How
can we be smart consumers of data analyses in a world overflowing with
data? It will be in stores in July. Between now and then, you can come
back here to learn more.
Also, at 12:30, I'll be interviewed at the Shindig event by Peggy Sanservieri, who blogs at Huffington Post on book marketing. This is a live online chat event. Go to their site to register, and you'll have the opportunity to ask me questions.
JMP is giving away signed copies of Numbers Rule Your World. See details here.
JMP is a great piece of software for those who like to point and click, drag things around, and build models interactively. People I hire who are analytical but don't have proper statistical training seem to enjoy using it and produce good work with it. There are similar software packages on the market; I haven't tried them, so I don't know if they are better or worse, but I can say I have had a pleasant time with JMP.
Speaking of which, if you haven't already, do subscribe to my sister blog, where I discuss the statistical thinking behind everything that's happening around us.
The RSS feed: here. The Twitter feed combines the two blogs.
I am happy to provide the following review of this interesting book by Martin and Simon, who are readers of Junk Charts. Martin also publishes a blog, and he's the one who has created bumps charts for the Tour de France races (which also appear in the book).
Interactive Graphics for Data Analysis is an advanced book written by two researchers who have deep experience developing graphics software. People who like to go beyond the basics will find it a useful addition to the literature.
To give you an idea of the level of sophistication, just in Chapter 1 (titled Interactivity), the two authors utilize set operations, SQL statements, and parallel coordinate plots. They assume you have some sense of what those are. That said, those sections can be skipped without interrupting the flow of the book.
The following key messages from these authors are worth repeating:
There is a distinction between statistical graphics and data graphics. Underlying trends and patterns in the data are often made clear by performing statistical analyses on the data, with the results added to charts (e.g. loess lines). When dealing with very large data sets, statistical charts (such as box plots) are found to be much more scalable, precisely because they do not attempt to put every data point onto the page.
The authors stress the need to look at a variety of charts when doing exploratory data analysis. This is because most chart types do certain things well but not others.
Throughout the book, they make much of the problem of "over-plotting", that is, overlapping data. This happens when data is abundant, or when values are concentrated in a narrow range. A great illustration of this problem is the parallel coordinates plot, which can look entirely different depending on which lines are plotted on top of which other lines. (The charts on the right are identical except for the order in which the lines are plotted.) Common strategies for dealing with over-plotting include "jittering" and varying transparency. Many of these strategies have issues of their own.
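To make "jittering" concrete, here is a minimal sketch in Python. The scores, the `jitter` helper, and the noise width are all made up for illustration and are not taken from the book. The idea is simply to nudge each value by a small random amount so that identical values no longer land on exactly the same spot.

```python
import random

def jitter(values, width=0.4, seed=0):
    """Add uniform noise in [-width/2, +width/2] to each value.

    The width and the fixed seed are illustrative choices, not
    recommendations from the book.
    """
    rng = random.Random(seed)
    half = width / 2
    return [v + rng.uniform(-half, half) for v in values]

# Hypothetical data with heavy overlap: only three distinct values.
scores = [10, 10, 10, 12, 12, 13, 13, 13, 13]
jittered = jitter(scores)

print(len(set(scores)), "distinct positions before jittering")
print(len(set(jittered)), "distinct positions after jittering")
```

This also hints at why the strategy has issues of its own: the plotted positions are no longer the true values, and a different random seed produces a different-looking chart.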
They also point out that the look of many multivariate charts (such as mosaic charts) depends on the sorting of the data. This is a key weakness of many such plots. Just think about this the next time you create a stacked column chart.
The book is divided into two sections: Principles and Examples. The Examples section, which makes up the second half, consists of case studies in which the authors show how to investigate the structure of a given data set.
The example of using the fatty-acid contents of Italian olive oils to deduce their regional origin is a good visualization of how the statistical technique of classification trees works. Here is the telling diagram:
Notice that data with the same color are oils from the same region, the rectangular sections are results of the statistical classification procedure, and we would like to see most (if not all) of the data within each section having the same color.
Without a doubt, graphics designers should be aware of the issues raised by these authors. The book appears to be written for students who are creating statistical software (complete with end-of-chapter exercises). I'm left wondering what users of graphics software can do with this information, because much of this material relates to the design of graphics software. Knowing these issues makes you want to do things the software may not be designed to do efficiently. For example, most software packages I have used do not have a simple toggle to sort categorical variables by various means (alphabetical, increasing or decreasing frequency, increasing or decreasing value of another variable, etc.).
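The kind of sorting toggle I have in mind can be sketched in a few lines of Python. Everything here is hypothetical: the `category_order` function, its `by` options, and the sample records are mine, not from the book or from any particular package. Using the group mean to stand in for "value of another variable" is also just one possible choice.

```python
from collections import Counter
from statistics import mean

def category_order(records, by="alphabetical", descending=False):
    """Return the distinct categories in the requested order.

    records: list of (category, value) pairs.
    by: "alphabetical", "frequency", or "value" (mean of the value
        within each category).
    """
    categories = [c for c, _ in records]
    if by == "alphabetical":
        key = lambda c: c
    elif by == "frequency":
        counts = Counter(categories)
        key = lambda c: (counts[c], c)
    elif by == "value":
        groups = {}
        for c, v in records:
            groups.setdefault(c, []).append(v)
        means = {c: mean(vs) for c, vs in groups.items()}
        key = lambda c: (means[c], c)
    else:
        raise ValueError(f"unknown sort option: {by}")
    return sorted(set(categories), key=key, reverse=descending)

# Hypothetical records: (category, value of another variable).
records = [
    ("north", 7.0), ("south", 4.0), ("east", 10.0),
    ("north", 9.0), ("north", 8.0), ("south", 6.0),
]

print(category_order(records, by="alphabetical"))
print(category_order(records, by="frequency", descending=True))
print(category_order(records, by="value"))
```

The resulting list can then be handed to whatever charting tool you use as the order of the bars or mosaic tiles.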
Dan at Eye Heart New York has a fantastic post relating to the recent release of restaurant health inspection data by New York City. This has caused a furor among the restaurant owners because they are now required to wear their A/B/C badges front and center. Dan collected some data (which he also posted), made some charts, and reported some interesting insights.
Here is an overview chart that shows the distribution of scores (the higher the score, the lower the grade). He called it a "scatter plot" but it is really a histogram where the bucket size is 1 except for the rightmost bucket.
I like the use of green, yellow and red colors to indicate (without words) the conversion scale from scores (violation points) to grades (A/B/C). The legend "Count" is an Excel monstrosity. I'd have used a bucket size of at least 5, which would smooth out the gyrations in the green zone.
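Here is a rough sketch in Python of what widening the buckets does. The scores below are invented for illustration and are not Dan's data, and `bucket_counts` is just a helper name I made up. Each score is assigned to the bucket that starts at the nearest multiple of the width at or below it.

```python
from collections import Counter

def bucket_counts(scores, width):
    """Count scores in buckets [0, width), [width, 2*width), and so on.

    Returns a dict mapping each bucket's lower edge to its count.
    """
    counts = Counter((s // width) * width for s in scores)
    return dict(sorted(counts.items()))

# Hypothetical violation-point scores (not the real inspection data).
scores = [0, 2, 2, 3, 4, 7, 7, 9, 10, 11, 12, 12, 13, 13, 13, 14,
          17, 21, 27, 30]

print("bucket size 1:", bucket_counts(scores, 1))
print("bucket size 5:", bucket_counts(scores, 5))
```

With a bucket size of 1, the counts bounce between small numbers from one score to the next; with a bucket size of 5, those neighboring scores are pooled and the overall shape is easier to read.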
A more typical way to summarize numeric data in groups is Tukey's boxplot, as shown below.
I use Dan's raw data on this chart. 1 = A, 2 = B, 3 = C. What is group 4?
It turns out Dan has removed this group from all of his analysis. A little research shows that group 4 consists of restaurants that have been closed by the Dept of Health. Interestingly, the scores of these restaurants are spread widely, so the DOH appears to be closing restaurants not just for health violations. (In the rest of this post, I have removed group 4.)
For those not familiar with box plots, the box contains the middle 50% of the data (in this case, the scores of the middle half of the restaurants in the respective group); the line inside the box is the median score; the dots above (or below, though nonexistent here) the vertical lines are outliers. As Dan pointed out, group C has lots of outliers on the high end of the score.
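For readers who want to see the arithmetic behind a box plot, here is a minimal sketch in Python. The scores are made up (apart from the 111 mentioned in this post), and two things are assumptions on my part: the common 1.5 × IQR rule for flagging outliers, and the "inclusive" quartile method from Python's `statistics` module. Charting packages differ on both, so their boxes and dots may not match these numbers exactly.

```python
from statistics import quantiles

def box_summary(scores, fence=1.5):
    """Quartiles, median, and outliers as a box plot would show them.

    fence=1.5 is the conventional multiplier; it is an assumption
    here, not something stated in the post.
    """
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box (middle 50% of the data)
    low, high = q1 - fence * iqr, q3 + fence * iqr
    outliers = [s for s in scores if s < low or s > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

# Hypothetical scores for a single grade group.
scores = [28, 29, 30, 31, 32, 33, 35, 38, 40, 111]
summary = box_summary(scores)
print(summary)
```

The box runs from `q1` to `q3`, the line inside it sits at `median`, and anything in `outliers` would be drawn as a separate dot.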
Just for fun, I pulled the violations of the highest scoring restaurant (111 violation points). What I find intriguing is the huge fluctuation in scores over the last 5 inspections. Does this happen to other restaurants too? What does that say about the grading system?
Next, Dan attempted to address two questions: did scores vary across the 5 boroughs, and did scores vary across cuisine groups? This is the concept covered in Chapter 1 of my book: always look at the variation around averages, because that's where the most interesting stuff is.
He calculated the means and standard deviations of different subgroups. It is simpler to visualize the data, again using boxplots.
Here's one dealing with boroughs, and it is clear that there is not much to pick between them. You could possibly say Staten Island is better than the other 4 boroughs.
Here's one dealing with cuisine groups, using Dan's definitions.
The cuisine groups are ordered by median score, from lowest on the left to highest on the right. Again, there is no drastic difference. It is certainly not the case that Asian/Latin American restaurants are worse than, say, European or American ones.
About half of the restaurants under desserts, drinks, misc., African, and others received As, while a bit less than half of the restaurants in the other cuisine groups got As. Some of the cuisine groups had few egregious violators (African, Middle East) - but this data is perhaps skewed by the removal of the "closed" restaurants.
One shortcoming of the traditional boxplot is the omission of how large each group is. For groups that are too small, it is difficult to draw any statistical conclusions. We know from Dan's table, for instance, that there were only 17 restaurants classified as "African".
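One way around this is to tabulate group sizes next to the medians before drawing anything. The sketch below is in Python; the rows are invented rather than taken from Dan's data, and `group_summary` and its `min_size` cutoff are hypothetical names and values chosen only for illustration. It sorts the groups by median and flags the ones that are too small to say much about.

```python
from statistics import median

def group_summary(rows, min_size=30):
    """Per-group size and median score, ordered by median.

    rows: list of (group, score) pairs.
    min_size: groups smaller than this are flagged as "small";
              the default of 30 is an arbitrary illustrative cutoff.
    """
    groups = {}
    for group, score in rows:
        groups.setdefault(group, []).append(score)
    summary = [
        {"group": g, "n": len(v), "median": median(v),
         "small": len(v) < min_size}
        for g, v in groups.items()
    ]
    return sorted(summary, key=lambda r: r["median"])

# Hypothetical (cuisine group, score) rows.
rows = [
    ("African", 12), ("American", 14), ("African", 10),
    ("American", 13), ("American", 20), ("Asian", 15), ("Asian", 13),
]

for r in group_summary(rows, min_size=3):
    note = " (small group)" if r["small"] else ""
    print(f'{r["group"]}: n={r["n"]}, median={r["median"]}{note}')
```

The same counts could be printed under each box on the chart, or used to scale the width of the boxes.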
(Unfortunately, Excel does not have built-in capability for generating boxplots.)
For those in the New York area, I will be giving a talk tomorrow (Aug 11, Wed) at noon at Columbia's EdLab. The talk will cover a topic from the book, and the aspects of it that are not typically discussed in statistics courses. See here for an abstract.