## More on equal-area histograms

##### May 31, 2023

Today, I'm returning to those "equal-area histograms" that Andrew wrote about last month. I have two previous posts about this. The first post introduces the concept: in a traditional histogram, the columns have the same bin width while the column heights can represent a variety of metrics, such as counts, relative frequencies (i.e. proportion of the data) and densities; in the equal-area histogram, the columns have varying widths while the area of each column is constant, and determined by the number of bins (columns).

Here is a comparison of the two types of histograms.

In a second post, I explained the differences between using counts, frequencies and densities in the vertical axis. The underlying issue is that the histogram is not merely a column chart, in which the width of the columns is arbitrary and data-free - in the histogram, both the heights and widths of columns carry meaning. One feature of the histogram that almost everyone expects is that the areas of the columns sum to 1. This aligns with a desired interpretation of probabilities of data falling into specified ranges, as we'd like the amount of data in the entire range to add up to 100%. Unfortunately, this expectation is usually incompatible with reading the column heights as probabilities.

If the height of the columns represents the probability of data falling into the range as indicated by its width, then the sum of the column heights is 1, which implies that the sum of the column areas cannot be 1. On the other hand, if the column areas add up to 1, then the column heights will not add up to 1, and thus, in this scenario, we cannot interpret the column heights to be probabilities. As explained in the second post, the column heights in this situation are densities, which can be defined as the proportion of data divided by the bin width. Intuitively, it gives information on how dense or sparse the data are within the specified range.
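To make the incompatibility concrete, here is a minimal sketch using only Python's standard library. The sample (10,000 normal values), the seed and the choice of 10 bins are my own assumptions for illustration; the point is only that heights and areas cannot both sum to 1 unless the bin width happens to be 1.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility
data = [random.gauss(4, 1) for _ in range(10_000)]  # assumed toy sample

n_bins = 10  # assumed number of equal-width bins
lo, hi = min(data), max(data)
width = (hi - lo) / n_bins

# Count the values in each equal-width bin; the maximum goes into the last bin.
counts = [0] * n_bins
for x in data:
    i = min(int((x - lo) / width), n_bins - 1)
    counts[i] += 1

proportions = [c / len(data) for c in counts]  # heights read as probabilities
densities = [p / width for p in proportions]   # heights read as densities

sum_heights_prop = sum(proportions)                    # heights sum to 1
sum_areas_prop = sum(p * width for p in proportions)   # areas sum to the bin width
sum_areas_dens = sum(d * width for d in densities)     # areas sum to 1
sum_heights_dens = sum(densities)                      # heights sum to 1 / width

print(sum_heights_prop, sum_areas_prop, sum_areas_dens, sum_heights_dens)
```

With proportions on the vertical axis, the heights sum to 1 but the total area equals the bin width; with densities, the total area is 1 but the heights sum to 1 divided by the bin width.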

***

Today's post starts with a toy dataset, containing randomly generated values from a normal distribution (bell curve) centered at 4 and with standard deviation 1.

Here is the traditional histogram of the dataset, using 100 equal-width bins. (I generated 10,000 values.)

Next, I created a panel of four equal-area histograms, with an increasing number of bins. Each is built from the same underlying dataset.

The first histogram divides the data into 4 bins; then 10 bins, 20 bins and 100 bins.

In the 4-bin case, each column contains 1/4 = 25% of the data. The middle two columns contain 50% of the data, and they have high densities, as the widths of these columns are low. It's a crude approximation of the familiar bell curve.

As we increase the number of bins, the columns in the middle of the distribution, where most of the data are concentrated, become narrower. In the sparse regions, the column width doesn't necessarily grow because each column must contain 1/n of the data, where n is the number of columns. As the number of columns increases, each column contains less of the data.
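For readers who want to build one of these, here is one possible construction, sketched with the standard library. It assumes the bin edges are placed at sample quantiles, which is my reading of "each column contains 1/n of the data"; the helper name `equal_area_bins`, the seed and the use of the "inclusive" quantile method are all my own choices.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, only for reproducibility
data = [random.gauss(4, 1) for _ in range(10_000)]  # assumed toy sample


def equal_area_bins(values, n_bins):
    """Return (edges, densities) for an equal-area histogram (hypothetical helper)."""
    # Interior edges are sample quantiles; the outer edges are the min and max.
    cuts = statistics.quantiles(values, n=n_bins, method="inclusive")
    edges = [min(values)] + cuts + [max(values)]
    widths = [b - a for a, b in zip(edges, edges[1:])]
    # Each bin holds about 1/n of the data, so its density is (1/n) / width.
    densities = [(1 / n_bins) / w for w in widths]
    return edges, densities


edges, densities = equal_area_bins(data, 4)
widths = [b - a for a, b in zip(edges, edges[1:])]
areas = [d * w for d, w in zip(densities, widths)]

print(widths)  # the two middle bins are the narrowest
print(areas)   # every column has area 1/4
```

Calling `equal_area_bins(data, 100)` gives the edges for a percentogram. With tied values, two quantiles can coincide and produce a zero-width bin, which this sketch does not handle.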

The bottom chart is the "percentogram", which is what Andrew's correspondent proposed. The number of bins is set to 100, so each column contains exactly 1 percent of the data. For a normal distribution, the columns in the middle are very tall and thin.

The reason why the middle of the percentogram looks faded is that I asked for a white border around each column. But when the columns are so thin, even if one sets the border width very small, what readers see is a mixture of orange and white.

With a high number of bins, we notice a few things: a) the outline of the histogram becomes "ragged", and more so as the number of bins grows; b) the middle columns become razor-thin; c) the width conceded by the middle columns is absorbed not by the columns at the edges but by those between the peak and the edge.

I'm struggling a bit to justify this percentogram versus the typical, equal-width histogram.

Let me go down a different path.

***

In "principled" histograms, the column heights represent data densities, while the total area of the columns adds up to 1. This leads us to a new understanding of the relationship between the equal-width histogram and the equal-area histogram.

We start with data density defined by (proportion of data) / (bin width). Those two values are not independent - one is fully determined by the other, given the underlying dataset. In a traditional equal-width histogram, the question is: how much of the data is found in a column of fixed width? In the new equal-area histogram, the question is: how wide is the bin that contains a fixed amount of data? In the former, the denominator is fixed while the numerator varies; the opposite occurs in the latter.

***

We also recognize that given the range of the data, there is a relationship between the sets of bin widths in the two types of histograms. In the traditional histogram, all bin widths have the same value, equal to the range of the data divided by the number of bins. Think of this as the average bin width. In an equal-area histogram, the set of bin widths varies: however, the sum of the bin widths must still add up to the range of the data. For two comparable histograms with the same number of bins, the average of the bin widths must be the same for both sets. (I'm ignoring any rounding situations in which the range of the histogram is larger than the range of the data.)
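The bookkeeping in this paragraph is easy to check numerically. The sketch below assumes the same kind of toy sample (10,000 normal values, arbitrary seed) and quantile-based edges for the equal-area bins; those choices are mine.

```python
import random
import statistics

random.seed(0)  # arbitrary seed, only for reproducibility
data = [random.gauss(4, 1) for _ in range(10_000)]  # assumed toy sample
n_bins = 100
data_range = max(data) - min(data)

# Traditional histogram: every bin has the same width.
equal_width = [data_range / n_bins] * n_bins

# Equal-area histogram: edges at the sample percentiles.
cuts = statistics.quantiles(data, n=n_bins, method="inclusive")
edges = [min(data)] + cuts + [max(data)]
equal_area = [b - a for a, b in zip(edges, edges[1:])]

mean_equal_width = sum(equal_width) / n_bins
mean_equal_area = sum(equal_area) / n_bins

# How many equal-area bins are narrower than the common equal-width bin?
narrower = sum(w < mean_equal_width for w in equal_area)

print(sum(equal_area), data_range)        # both sets of widths span the range
print(mean_equal_area, mean_equal_width)  # so the average widths agree
print(narrower)
```

For a bell-shaped sample, most of the equal-area bins come out narrower than the average, and a handful of wide bins in the tails make up the difference.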

Now, consider the middle of the normal distribution where the data are dense. In the traditional histogram, the column in the middle still has width equal to the average bin width. In the equal-area histogram, the middle column has width much smaller than the average bin width. In other words, we can think of the column in the traditional histogram being broken up into many thin and slim columns in the equal-area histogram, each containing 1% of the data in the case of the percentogram.

The height of the column is the data density. In the traditional histogram, the middle column is the pooled sample of larger size; in the equal-area histogram, each of those thin and slim columns is a partition of the sample. This explains observation (a) above in which the outline of the equal-area histogram is more ragged - it's because each column contains fewer data from which to estimate the data density.

But this raggedness is artificial, sampling noise.
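A rough way to see where the noise comes from is to count how many data points feed each density estimate near the peak. The sketch below uses an assumed toy sample and a crude rule of thumb (relative noise of roughly one over the square root of the count); it is an illustration, not a formal variance calculation.

```python
import math
import random

random.seed(0)  # arbitrary seed, only for reproducibility
data = [random.gauss(4, 1) for _ in range(10_000)]  # assumed toy sample
n_bins = 100

lo, hi = min(data), max(data)
width = (hi - lo) / n_bins

# The equal-width bin that contains the centre of the distribution (x = 4).
i = int((4 - lo) / width)
left, right = lo + i * width, lo + (i + 1) * width
n_equal_width = sum(left <= x < right for x in data)

# Every percentogram bin holds the same number of points.
n_equal_area = len(data) // n_bins

# Crude rule of thumb: relative noise of a density estimate ~ 1 / sqrt(count).
noise_equal_width = 1 / math.sqrt(n_equal_width)
noise_equal_area = 1 / math.sqrt(n_equal_area)

print(n_equal_width, n_equal_area)
print(noise_equal_width, noise_equal_area)
```

Near the peak, the equal-width bin pools several percentogram bins' worth of data, so its height is the steadier estimate.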

***

The sparse areas are more complicated still. It's also the reverse of the above. On the edges of the normal distribution, the columns of the new histogram are wider than those of the traditional histogram. So, we can think of breaking up the edge column of the new histogram into multiple columns of the traditional histogram.

The interpretation is more complicated because the data are sparse in this region. Obviously, the estimates of density on the traditional histogram in sparse regions are poor because not enough data reside in there. The density estimate on the new histogram is based on a larger sample size.

However.

Yes, however, whether the new histogram's density estimate is better depends on the shape of the tail of the distribution. A normal distribution has tails that decay even faster than an exponential, which means that the data density declines quite drastically the further we go into the tail. Therefore, the new histogram averages the data densities across a large part of the tail, wiping out that steep decline while the traditional histogram preserves the shape - at the expense of greater sampling variability due to smaller sample sizes.
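To get a feel for how much shape a single wide tail column can hide, here is a sketch using the theoretical normal distribution rather than a sample. The right edge of the last percentogram bin is an assumption: I place the largest of 10,000 values near the 99.99th percentile, which is plausible but not exact.

```python
from statistics import NormalDist

dist = NormalDist(mu=4, sigma=1)

# Last percentogram bin: from the 99th percentile up to the largest value.
# Assumption: the sample maximum sits near the 99.99th percentile.
left = dist.inv_cdf(0.99)
right = dist.inv_cdf(0.9999)

density_left = dist.pdf(left)    # true density at the bin's left edge
density_right = dist.pdf(right)  # true density at the bin's right edge

# Height of the single equal-area column: probability mass divided by width.
flat_height = (dist.cdf(right) - dist.cdf(left)) / (right - left)

ratio = density_left / density_right

print(density_left, flat_height, density_right)
print(ratio)
```

Under these assumptions, the true density falls by more than an order of magnitude across the bin, while the column shows one flat height somewhere in between.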

***

For what it's worth, let's look at some histograms for an exponential random variable.

The data are extremely dense on the left side while the distribution has a long tail on the right side.

Here are the four equal-area histograms for 4, 10, 20 and 100 bins.

The four-bin version gives a nice summary of the shape. As the number of bins goes up, as before, the denser regions now have tall, thin spikes. Again, because of the white borders, the last histogram with 100 bins is faded where the data are densest. (So obviously, don't follow my lead, and eliminate borders if you want to use it.)

The 100-bin version looks almost the same as the traditional histogram.

***

At this stage of the exploration, I still haven't found a compelling reason to switch to equal-area histograms. In the denser regions, it's adding sampling noise. If I don't care about the sparser areas, specifically, the shape of the tails, maybe they provide a cleaner presentation.

## Visual story-telling: do you know or do you think?

##### May 22, 2023

One of the most important data questions of all time is: do you know? or do you think?

And one of the easiest traps to fall into is: I think, therefore I know.

***

Visual story-telling can be great but it can also mislead. Deception sometimes happens when readers are nudged to "fill in the blanks" with stuff they think they know, but they don't.

A Twitter reader asked me to look at the map in this Los Angeles Times (paywall) opinion column.

The column promptly announces its premise:

Years of widening economic inequality, compounded by the pandemic and political storm and stress, have given Americans the impression that the country is on the wrong track. Now there’s empirical data to show just how far the country has run off the rails: Life expectancies have been falling.

The writer creates the expectation that he will reveal evidence in the form of data to show that life expectancies have been driven down by economic inequality, pandemic, and politics. Does he succeed?

***

The map portrays average life expectancy (at birth) for some mysterious, presumably very recent, year for every county in the United States. From the color legend, we learn that the bottom-to-top range is about 20 years. There is a clear spatial pattern, with the worst results in the south (excepting south Florida).

The choice of colors is telling. Red and blue on a U.S. map carry heavy baggage, as they signify the two main political parties in the country. Given that the author believes politics to be a key driver of health outcomes, the usage of red and blue here is deliberate. Throughout the article, the columnist connects the lower life expectancies in southern states to their politics.

For example, he said "these geographical disparities aren't artifacts of pure geography or demographics; they're the consequences of policy decisions at the state level... Of the 20 states with the worst life expectancies, eight are among the 12 that have not implemented Medicaid expansion under the Affordable Care Act..."

Casual readers may fall into a trap here. There is nothing on the map itself that draws the connection between politics and life expectancies; the idea is evoked purely through the red-blue color scheme. So, as readers, we are filling in the blanks with our own politics.

What could have been done instead? Let's look at the life expectancy map side by side with the map of the U.S. 2020 Presidential election.

Because of how close recent elections have been, we may think the political map has a nice balance of red and blue but it doesn't. The Democrats' votes are heavily concentrated in densely-populated cities so most of the Presidential election map is red. When placed next to each other, it's obvious that politics don't explain the variance in life expectancy well. The Midwest is deep red and yet it has above-average life expectancies. I have circled various regions that contradict the claim that Republican politics drove life expectancies down.

It's not sufficient to point to the South, in which Republican votes and life expectancy are indeed inversely correlated. A good theory has to explain most of the country.

***

The columnist also suggests that poverty is the cause of low life expectancy. That too cannot be gleaned from the published map. Again, readers are nudged to use their wild imagination to fill in the blank.

Data come to the rescue. Here is a side-by-side comparison of the map of life expectancies and the map of median incomes.

A similar conundrum. While the story feels right in the South, it fails to explain the northwest, Florida, and various other parts of the country. Take a look again at the circled areas. Lower income brackets are also sometimes associated with high life expectancies.

***

The author supplies a third cause of lower life expectancies: Covid-19 response. Because Covid-19 was the "most obvious and convenient" explanation for the loss of life expectancy during the pandemic, this theory suggests that the red areas on the life expectancy map should correspond to the regions most ravaged by Covid-19.

Let's see the data.

The map on the right shows the number of confirmed cases until June 2021. As before, the correlation holds somewhat in the South but there are notable exceptions, e.g. the Midwest. We also have states with low Covid-19 cases but below-average life expectancy.

***

What caused the decline of life expectancy in the U.S. - which began before the pandemic, and has continued beyond - is highly complex, beyond what a single map or a pair of maps or a few pairs of maps could convey. Showing a red-blue map presents a trap for readers to fall into, in which they start thinking, without knowing.

## Parsons Student Projects

##### May 19, 2023

I had the pleasure of attending the final presentations of this year's graduates from Parsons's MS in Data Visualization program. You can see the projects here.

***

A few of the projects caught my eye.

A project called "Authentic Food in NYC" explores where to find "authentic" cuisine in New York restaurants. The project is notable for plowing through millions of Yelp reviews, and organizing the information within. Reviews mentioning "authentic" or "original" were extracted.

During the live presentation, the student clicked on Authentic Chinese, and the name that popped up was Nom Wah Tea Parlor, which serves dim sum in Chinatown that often has lines out the door.

Curiously, the ranking is created from raw counts of authentic reviews, which favors restaurants with more reviews, such as restaurants that have been operating for a longer time. It's unclear what rule is used to transfer authenticity from reviews to restaurants: does a single review mentioning "authentic" qualify a restaurant as "authentic", or some proportion of reviews?

Later, we see a visualization of the key words found inside "authentic" reviews for each cuisine. Below are words for Chinese and Italian cuisines:

These are word clouds with a twist. Instead of encoding the word counts in the font sizes, she places each word inside a bubble, and uses bubble sizes to indicate relative frequency.

Curiously, almost all the words displayed come from menu items. There aren't any subjective words to be found. Algorithms that extract keywords frequently fail in the sense that they surface the most obvious, uninteresting facts. Take the word cloud for Taiwanese restaurants as an example:

The overwhelming keyword found among reviews of Taiwanese restaurants is... "taiwanese". The next most important word is "taiwan". Among the remaining words, "886" is the name of a specific restaurant, "bento" is usually associated with Japanese cuisine, and everything else is a menu item.

Getting this right is time-consuming, and understandably not a requirement for a typical data visualization course.

The most interesting insight is found in this data table.

It appears that few reviewers care about authenticity when they go to French, Italian, and Japanese restaurants but the people who dine at various Asian restaurants, German restaurants, and Eastern European restaurants want "authentic" food. The student concludes: "since most Yelp reviewers are Americans, their pursuit of authenticity creates its own trap: Food authenticity becomes an americanized view of what non-American food is."

This hits home hard because I know what authentic dim sum is, and Nom Wah Tea Parlor it ain't. Let me check out what Yelpers are saying about Nom Wah:

1. Everything was so authentic and delicious - and cheap!!!
2. Your best bet is to go around the corner and find something more authentic.
3. Their dumplings are amazing everything is very authentic and tasty!
4. The food was delicious and so authentic, and the staff were helpful and efficient.
5. Overall, this place has good authentic dim sum but it could be better.
6. Not an authentic experience at all.
7. this dim sum establishment is totally authentic
8. The onions, bean sprouts and scallion did taste very authentic and appreciated that.
9. I would skip this and try another spot less hyped and more authentic.
10. I would have to take my parents here the next time I visit NYC because this is authentic dim sum.

These are the most recent ten reviews containing the word "authentic". Seven out of ten really do mean authentic, the other three are false friends. Text mining is tough business! The student removed "not authentic" which helps. As seen from above, "more authentic" may be negative, and there may be words between "not" and "authentic". Also, think "not inauthentic", "people say it's authentic, and it's not", etc.
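The pitfalls are easy to demonstrate on the ten snippets above. The sketch below first looks for the exact phrase "not authentic", then tries a slightly looser pair of regular expressions. These patterns are my own guesses at a next step, not the student's actual rules, and the function name `is_suspect` is made up for this example.

```python
import re

# The ten Yelp review snippets quoted above, in the same order.
reviews = [
    "Everything was so authentic and delicious - and cheap!!!",
    "Your best bet is to go around the corner and find something more authentic.",
    "Their dumplings are amazing everything is very authentic and tasty!",
    "The food was delicious and so authentic, and the staff were helpful and efficient.",
    "Overall, this place has good authentic dim sum but it could be better.",
    "Not an authentic experience at all.",
    "this dim sum establishment is totally authentic",
    "The onions, bean sprouts and scallion did taste very authentic and appreciated that.",
    "I would skip this and try another spot less hyped and more authentic.",
    "I would have to take my parents here the next time I visit NYC because this is authentic dim sum.",
]

# Heuristic patterns (assumptions): "not" followed by up to two words and then
# "authentic", or the comparative "more authentic".
NEGATION = re.compile(r"\bnot\b(?:\s+\w+){0,2}\s+authentic", re.IGNORECASE)
COMPARATIVE = re.compile(r"\bmore\s+authentic\b", re.IGNORECASE)


def is_suspect(text):
    """Flag a review whose use of "authentic" may not be an endorsement."""
    return bool(NEGATION.search(text) or COMPARATIVE.search(text))


# Review numbers (1-based) caught by the exact phrase "not authentic".
exact_hits = [i for i, r in enumerate(reviews, start=1) if "not authentic" in r.lower()]

# Review numbers (1-based) caught by the looser heuristic.
flagged = [i for i, r in enumerate(reviews, start=1) if is_suspect(r)]

print(exact_hits)
print(flagged)
```

On this small sample the exact phrase catches nothing, while the looser patterns pick out reviews 2, 6 and 9. That is no proof of a good rule: "more authentic" can be praise in another sentence, and phrasing such as "people say it's authentic, and it's not" still slips through.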

One thing I learned from this project is that "authentic" may be a synonym for "I like it" when these diners enjoy the food at an ethnic restaurant. I'm most curious about what inauthentic onions, bean sprouts and scallion taste like.

I love the concept and execution of this project. Nice job!

***

Another project I like is about tourism in Venezuela. The back story is significant. Since a dictatorship took over the country, the government stopped reporting tourism statistics. It's known that tourism collapsed, and that it may be gradually coming back in recent years.

This student does not have access to ready-made datasets. But she imaginatively found data to pursue this story. Specifically, she mentioned grabbing flight schedules into the country from the outside.

The flow chart is a great way to explore this data:

A map gives a different perspective:

I'm glad to hear the student recite some of the limitations of the data. It's easy to look at these visuals and assume that the data are entirely reliable. They aren't. We don't know what proportion of the people traveling on those flights are tourists, how full those planes are, or the nationalities of those on board. The fact that a flight originated from Panama does not mean that everyone on board is Panamanian.

***

The third project is interesting in its uniqueness. This student wants to highlight the effect of lead in paint on children's health. She used the weight of lead marbles to symbolize the impact of lead paint. She made a dress with two big pockets to hold these marbles.

It's not your standard visualization. One can quibble that dividing the marbles into two pockets doesn't serve a visualization purpose, and so on. But at the end, it's a memorable performance.

## Graph workflow and defaults wreak havoc

##### May 12, 2023

For the past week or 10 days, every time I visited one news site, it insisted on showing me an article about precipitation in North Platte. It's baiting me to write a post about this lamentable bar chart (link):

***

This chart got problems, and the problems start with the tooling, which dictates a workflow.

I imagine what the chart designer had to deal with.

For a bar chart, the tool requires one data series to be numeric, and the other to be categorical. A four-digit year is a number, which can be treated either as numeric or categorical. In most cases, and by default, numbers are considered numeric. To make this chart, the user asked the tool to treat years as categorical.

Many tools treat categories as distinct entities ("nominal"), mapping each category to a distinct color. So they have 11 colors for 11 years, which is surely excessive.

This happens because the year data is not truly categorical. These eleven years were picked based on the amount of rainfall. There isn't a single year with two values, it's not even possible. The years are just irregularly spaced indices. Nevertheless, the tool misbehaves if the year data are regarded as numeric. (It automatically selects a time-series line chart, because someone's data visualization flowchart says so.) Mis-specification in order to trick the tool has consequences.

The designer's intention is to compare the current year 2023 to the driest years in history. This is obvious from the subtitle in which 2023 is isolated and its purple color is foregrounded.

How unfortunate then that among the 11 colors, this tool grabbed 4 variations of purple! I like to think that the designer wanted to keep 2023 purple, and turn the other bars gray -- but the tool thwarted this effort.

The tool does other offensive things. By default, it makes a legend for categorical data. I like the placement of the legend right beneath the title, a recognition that on most charts, the reader must look at the legend first to comprehend what's on the chart.

Not so in this case. The legend is entirely redundant. Removing the legend does not affect our cognition one bit. That's because the colors encode nothing.

Worse, the legend sows confusion because it presents the same set of years in chronological order while the bars below are sorted by amount of precipitation: thus, the order of colors in the legend differs from that in the bar chart.

I can imagine the frustration of the designer who finds out that the tool offers no option to delete the legend. (I don't know this particular tool but I have encountered tools that are rigid in this manner.)

***

Something else went wrong. What's the variable being plotted on the numeric (horizontal) axis?

The answer is inches of rainfall but the answer is actually not found anywhere on the chart. How is it possible that a graphing tool does not indicate the variables being plotted?

I imagine the workflow like this: the tool by default puts an axis label which uses the name of the column that holds the data. That column may have a name that is not reader-friendly, e.g. PRECIP. The designer edits the name to "Rainfall in inches". Being a fan of the Economist graphics style, they move the axis label to the chart title area.

The designer now works on the chart title. The title is made to spell out the story, which is that North Platte is experiencing a historically dry year. Instead of mentioning rainfall, the new title emphasizes the lack thereof.

The individual steps of this workflow make a lot of sense. It's great that the title is informative, and tells the story. It's great that the axis label was fixed to describe rainfall in words not database-speak. But the end result is a confusing mess.

The reader must now infer that the values being plotted are inches of rainfall.

Further, the tool also imposes a default sorting of the bars. The bars run from longest to shortest, in this case, the longest bar has the most rainfall. After reading the title, our expectation is to find data on the Top 11 driest years, from the driest of the driest to the least dry of the driest. But what we encounter is the opposite order.

Most graphics software behaves like this because it plots the ranks of the categories, with the driest being rank 1, counting up. Because the vertical axis moves upwards from zero, the top-ranked item ends up at the bottom of the chart.
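Overriding these defaults usually takes only a little data preparation before the chart is drawn. The sketch below is tool-agnostic, and the years and rainfall amounts in it are placeholders I invented for illustration; they are not the real North Platte figures.

```python
# Placeholder records: (year, inches of precipitation). These numbers are
# invented for illustration; they are NOT the real North Platte data.
records = [(2023, 1.1), (1956, 2.4), (2002, 1.9), (1934, 2.0), (1989, 2.2)]

current_year = 2023

# Sort so the driest year comes first, matching a "driest years" title.
ranked = sorted(records, key=lambda r: r[1])

# Treat years as text labels so that no tool reads them as a time axis.
labels = [str(year) for year, _ in ranked]
values = [inches for _, inches in ranked]

# One highlight color for the current year and a neutral gray for the rest;
# with only this contrast, a legend is unnecessary.
colors = ["purple" if year == current_year else "lightgray" for year, _ in ranked]

# State the plotted variable explicitly instead of leaving it to inference.
axis_label = "Precipitation (inches)"

for label, value, color in zip(labels, values, colors):
    print(label, value, color)
```

The three lists can then be passed to whatever horizontal bar chart function the tool provides, with the legend switched off and the category axis ordered so that the first label sits at the top.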

***

Moving now from the V corner to the D corner of the Trifecta checkup (link), I can't end this post without pointing out that the comparisons shown on the chart don't work. It's the first few months of 2023 versus the full years of the others.

The fix is to plot the same number of months for all years. This can be done in two ways: find the partial year data for the historical years, or project the 2023 data for the full year.

(If the rainy season is already over, then the chart will look exactly the same at the end of 2023 as it is now. Then, I'd just add a note to explain this.)

***

Here is a version of the chart after doing away with unhelpful default settings: