Decluttering charts

Enrico posted the following chart, which addresses the current assault on scientific research funding, and he's worried that poor communication skills are hurting the cause.

Bertini_tiretracks

He's right. You need half an hour to figure out what's going on here.

Let me write down what I have learned so far.

The designer only cares about eight research areas - all within the IT field - listed across the bottom.

Paired with each named research area are those bolded blue labels that run across the top (but not quite). I think they represent the crowning achievement within each field but I'm just guessing here.

It appears that each field experiences a sequence of development stages. Typically, universities get things going, then industry R&D teams enter the game, and eventually, products appear in the market. The orange, blue and black lines show this progression. The black line morphs into green, and may even expand in thickness - indicating progressive market adoption and growth.

For example, the first field from the left, digital communications, is shown to have begun in 1965 at universities. Then, in the early 1980s, industry started investing in this area. It was not until the 1990s that products became available, and not until the mid-2000s that the market exceeded $10 billion.

Even now, I haven't resolved all its mysteries. The difference between a solid black line and a dotted black line is not explained. Further, it appears possible to bypass $1 billion and hit $10 billion right away.

***

Next, we must decipher the strange web of little gray arrows.

It appears that the arrows can go from orange to blue, blue to orange, blue to black, orange to black. Under digital communications, I don't see black or green back to blue or orange. However, under computer architecture, I see green to orange; under parallel & distributed systems, I see green to blue. I don't see any black to orange or black to blue, so black is a kind of trapping state (things go in but don't come out). Sometimes, it's better to say which directions are not possible - in this case, nothing comes out of black, but every other direction appears possible.

It remains unclear what sort of entity each arrow depicts. Each arrow has a specific start and end time. I'm guessing it has to do with a specific research item. Taking the bottom-most arrow for digital communications, I suppose something began in academia in 1980 and then attracted industry investment around 1982. An arrow that points backwards from industry to academia indicates that universities picked up new research ideas from industry. Digital communications arrows tend to be short, suggesting that it takes only a few years to bring a product to market.

To add to this mess, some arrows cross research areas. These are shown as curved arrows, rather than straight arrows. For these curved arrows, the "slope" of the arrow no longer holds any meaning.

The gray arrows are trying too hard. They are overstuffed with purposes. On the one hand, the web of arrows - and I'm referring to those between research areas - portrays the synergies between different research areas. On the other hand, the arrows within each research area show the development trajectories of anonymized subjects. The arrows going back and forth between the orange and blue bars show the interplay between universities and industry research groups.

***

Lastly, we look at those gray text labels at the very top of the page. That's a grab-bag of corporate names (Motorola, Intel, ...) and product names (iPhone, iRobot, ...). Some companies span several research areas. I'm amused and impressed that apparently a linear sequence can be found for the eight research areas such that every single company has investments in only contiguous areas, precluding the need to "leapfrog" certain research areas!

Actually, no, that's wrong. I do notice Nvidia and HP appearing twice. But why is Google not part of digital communications next to iPhone?

Given that no universities are listed, the company and product labels presumably relate only to the blue, black or green lines below. They might relate only to the black and/or green lines. I'm not sure.

***

So far, I've expended energy only to tease out the structure of the underlying dataset. I haven't actually learned anything about the data!

***

The designer has to make some decisions because the different potential questions that the dataset can address impose conflicting graphical requirements.

If the goal is to surface a general development process that repeats for every research area, then the chart should highlight commonality, rather than difference. By contrast, if one's objective is to illustrate how certain research areas have experiences unique to themselves, one should choose a graphical form that brings out the differences.

If the focus is on larger research areas, then the relevant key dates are really the front ends of each vertical line; nothing else matters. By contrast, if one wants to show individual research items, then many more dates become pertinent.

A linear arrangement of the research areas will not perform well if one's goal is to uncover connections between research areas. By contrast, if one attempts to minimize crossovers in a network design, it would be impossible to keep all elements belonging to each research area in close proximity.

A layering approach that involves multiple charts to tell the whole story may be the solution. See, for example, Gelman's post on the ladder of abstraction.


Hammock plots

Prof. Matthias Schonlau gave a presentation about "hammock plots" in New York recently.

Here is an example of a hammock plot that shows the progression of different rounds of voting during the 1903 papal conclave. (These photos were taken at the event, and are thus a little askew.)

Hammockplot_conclave

The chart shows how Cardinal Sarto overtook the early favorite Rampolla during later rounds of voting, tracing the movement of votes from one round to the next. The Vatican normally destroys voting records, but apparently the records for this particular conclave were unexpectedly retained.

The dataset has several features that bring out the strengths of such a plot.

There is a fixed number of votes, and a fixed number of candidates. At each stage, the votes are distributed across a subset of candidates. From stage to stage, the support levels for the candidates shift. The chart brings out the evolution of the vote.

From the "marginals", i.e. the stacked columns shown at each time point, we learn the relative strengths of the candidates, as they evolve from vote to vote.

The links between the column blocks display the evolution of support from one vote to the next. We can see which candidate received more votes, as well as where the additional votes came from (or, to whom some voters have drifted).

The data are neatly arranged in successive stages, resulting in discrete time steps.

Because the total number of votes is fixed, the relative sizes of the marginals are nicely constrained.

The chart is made much more readable because of binning. Only the top three candidates are shown individually with all the others combined into a single category. This chart would have been quite a mess if it showed, say, 10 candidates.

How precisely we can show the intra-stage movement depends on how the data records were kept. If we have the votes for each person in each round, then it should be simple to execute the above! If we only have the marginals (the vote distribution by candidate) at each round, then we are forced to make some assumptions about which voters switched their votes. We'd likely have to rule out unlikely scenarios, such as one in which all of candidate X's previous voters switched to other candidates while an entirely different set of voters switched to candidate X.
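For readers who want to see what such an assumption looks like in practice, here's a minimal sketch of a "sticky voter" rule: keep as many voters as possible with their previous candidate, and reallocate only the surplus. The candidate names echo the conclave, but the vote counts are hypothetical.

```python
def estimate_flows(prev, curr):
    """Estimate vote flows between two rounds given only the marginals.

    Assumes voters are 'sticky': as many as possible stay with their
    previous candidate; the remainder is reallocated greedily to the
    candidates who gained. prev and curr map candidate -> vote count,
    with equal totals. Returns a dict of (from, to) -> votes moved.
    """
    candidates = list(prev) + [c for c in curr if c not in prev]
    flows, leftover_out, leftover_in = {}, {}, {}
    for c in candidates:
        stay = min(prev.get(c, 0), curr.get(c, 0))
        if stay:
            flows[(c, c)] = stay          # voters who stayed put
        leftover_out[c] = prev.get(c, 0) - stay   # must leave c
        leftover_in[c] = curr.get(c, 0) - stay    # must arrive at c
    for src in candidates:
        for dst in candidates:
            if leftover_out[src] == 0:
                break
            moved = min(leftover_out[src], leftover_in[dst])
            if moved:
                flows[(src, dst)] = moved
                leftover_in[dst] -= moved
                leftover_out[src] -= moved
    return flows

# Hypothetical marginals for two successive ballots (62 electors)
round1 = {"Rampolla": 29, "Sarto": 21, "Gotti": 9, "Others": 3}
round2 = {"Rampolla": 24, "Sarto": 27, "Gotti": 8, "Others": 3}
flows = estimate_flows(round1, round2)
```

Under this rule, Sarto's gain of six votes is explained by the smallest possible amount of switching; the real flows could of course have been messier.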

***

Matthias also showed examples of hammock plots applied to different types of datasets.

The following chart displays data from course evaluations. Unlike the conclave example, the variables tied to questions on the survey are neither ordered nor sequential. Therefore, there is no natural sorting available for the vertical axes.

Hammockplot_evals

Time is a highly useful organizing element for this type of chart. Without such an organizing element, the designer must manually choose an order.

The vertical axes correspond to specific questions on the course evaluation. Students are aggregated into groups based on the "profile" of grades given for the whole set of questions. It's quite easy to see that opinions are most aligned on the "workload" question while most of the scores are skewed high.

Missing values are handled by plotting them as a new category at the bottom of each vertical axis.

This example is similar to the conclave example in that each survey response is categorical, one of five values (plus missing). Matthias also showed examples of hammock plots in which some or all of the variables are numeric data.

***

Some of you will see a resemblance between the hammock plot and various similar charts, such as the profile chart, the alluvial chart, the parallel coordinates plot, and the Sankey diagram. Matthias discussed all of those as well.

Matthias has a book out called "Applied Statistical Learning" (link).

Also, there is a Python package for the hammock plot on GitHub.


Aligning the visual and the message

Today's post is about work by Diane Barnhart, who is a product manager at Bloomberg, and is taking Ray Vella's infographics class at NYU. The class is given a chart from the Economist, as well as some data on GDP per capita in selected countries at the regional level. The students are asked to produce a data visualization that explores the change in income inequality (as indicated by GDP per capita).

Here is Diane's work:

Diane Barnhart_Rich Get Richer

In this chart, the key measure is the GDP per capita of different regions in Germany relative to the national average. Hamburg, for example, had a GDP per capita 80% above the national average in 2000, while Leipzig's was 30% below. (This metric is a bit of a head-scratcher, and forms the basis of the Economist chart.)

***

Diane made several insightful design choices.

The key insight of this graph is also one of the easiest to see: the narrowing of the range of values. In 2000, the top value was about 90% while the bottom was under -40%, a range of 130 percentage points. By 2020, the range had narrowed to 90 points, with values falling between 60% and -30%. In other words, the gap between rich and poor regions in Germany shrank over these two decades.

The chosen chart form makes this message come alive.

Diane divided the regions into three groups, mapped to the black, red and yellow colors of the German flag. Black is for regions with GDP per capita above the national average; yellow is for regions with GDP per capita more than 25% below the average; red is for those in between.

Instead of applying color to individual lines that trace the GDP metric over time for each region, she divided the area between the lines into three, and painted them. This necessitates a definition of the boundary line between colored areas over time. I gathered that she classified the regions using the latest GDP data (2020) and then traced the GDP trend lines back in time. Other definitions are also possible.
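To make the classification concrete, here is a small sketch; the region values are invented for illustration, and the thresholds follow the definitions above (above average; more than 25% below; in between), applied to the latest (2020) data.

```python
# Hypothetical relative GDP per capita (% above/below the national
# average) in 2020; the numbers are made up for illustration.
regions_2020 = {"Hamburg": 60, "Bavaria": 20, "Saarland": -5,
                "Leipzig": -27, "Mecklenburg": -32}

def classify(rel_gdp):
    """Map a region's 2020 relative GDP to a flag color:
    above average -> black; more than 25% below average -> yellow;
    in between -> red."""
    if rel_gdp > 0:
        return "black"
    if rel_gdp >= -25:
        return "red"
    return "yellow"

groups = {region: classify(v) for region, v in regions_2020.items()}
```

With the classification fixed at 2020, each region's full trend line inherits one color, which is what produces the clean colored bands when traced back in time.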

The two-column data table shown on the right provides further details that aren't found in the data visualization. The table is nicely enhanced with colors, which augment the information in the main chart rather than repeat it.

All in all, this is a delightful project, and worthy of a top grade!


Anti-encoding

Howie H., sometime contributor to our blog, found this chart in a doctor's office:

WhenToExpectAReturnCall_sm

Howie writes:

Among the multitude of data visualization sins here, I think the worst is that the chart *anti*-encodes the data; the longest wait time has the shortest arc!

While I waited I thought about a redesign.  Obviously a simple bar chart would work.  A properly encoded radial bar could work, or small multiple pie charts.  But I think the design brief here probably calls for a bit of responsible data art, as this is supposed to be an eye-catching poster.

I came up with a sort of bar superimposed on a calendar for reference.  To quickly draft the design it was easier to do small multiples, but maybe all three arrows could be placed on a two-week grid and the labels could be inside the arrows, or something like that.  It’s a very rough draft but I think it points toward a win-win of encoding the actual data while retaining the eye-catching poster-ness that I’m guessing was a design goal.

Here is his sketch:

JunkCharts-redo_howardh_WhenToExpectAReturnCall redesign sm

***

I found a couple of interesting ideas from Howie's re-design.

First, he tried to embody the concept of a week's wait by visual reference to a weekly calendar.

Second, in the third section, he wanted readers to experience "hardship" by making their eyes wrap around to a second row.

He wanted the chart to be both accurate and eye-catching.

It's a nice attempt that will improve as he fiddles more with it.

***

Based on Howie's ideas, I came up with two sketches myself.

In the first sketch, instead of the arrows, I put numbers into the cells.

Junkcharts_redo_whentoexpectareturncall_1

In the second sketch, I emphasized being eye-catching while sacrificing accuracy. It uses spiral imagery, and I think it does a good job showing the extra pain of a week-long wait. Each trip around the circle represents 24 hours.

Junkcharts_redo_whentoexpectacall_2

The wait time is actually encoded in the traversal of angles, rather than the length of the spiral. I call this creation less than accurate because most readers will assume the spiral length to be the wait time, and thus misread the data.
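Here's a quick numerical check of that distortion, assuming the spiral is roughly Archimedean (the radius grows linearly with the angle, an assumption of mine, not a property of the actual sketch): each successive 24-hour loop is longer than the last, even though every loop represents the same amount of time.

```python
import math

def arc_length(theta0, theta1, b=1.0, steps=10000):
    """Numerical arc length of the Archimedean spiral r = b*theta
    between two angles, via midpoint integration."""
    total, dt = 0.0, (theta1 - theta0) / steps
    for i in range(steps):
        t = theta0 + (i + 0.5) * dt
        # ds = sqrt(r^2 + (dr/dtheta)^2) dtheta, with r = b*t, dr/dtheta = b
        total += math.sqrt((b * t) ** 2 + b ** 2) * dt
    return total

# Each full turn stands for the same 24 hours, but the arc keeps growing
turns = [arc_length(2 * math.pi * k, 2 * math.pi * (k + 1)) for k in range(3)]
```

So a reader who judges wait time by spiral length will systematically overweight the later days, which is exactly the misreading described above.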

Which one(s) do you like?


Coffee in different shapes and sizes: a test of self-sufficiency

Take a look at the following graphic showing the top producers of coffee in 2024:

Junkcharts_voronoicoffeeproduction

Then, try the following tasks:

  • Which country is the top producer?
  • What proportion of the world's production does the top country make?
  • Which countries form the top three?
  • How much is the "Rest of the World" compared to Brazil?
  • How many countries account for the top 50% of the world's production?
  • Does Indonesia or Colombia produce more coffee?
  • Compare India and Uganda
  • How about Honduras vs Peru?

I finished two cups of coffee and still couldn't answer most of these questions. How about you?

***

Now, let's look at the original chart, published by Voronoi, and sent to me by a long-time reader:

Visualcapitalist_coffee

Try those questions again, and the answers seem much more available.

How so?

What we've just demonstrated is that when the reader takes information from this graphic, the reader is consuming the data labels, while the visual encoding of data to shapes has offered zero help.

Given this finding, replacing the chart with a data table would have achieved the same result, perhaps even faster.

***

I'm using this graphic to illustrate my "self-sufficiency" test: by removing all data labels from the chart, we reveal how much work the visual elements are doing to enable understanding of the message and the underlying data.

***

Now, our long-time reader has a few comments, with which I agree:

  • what they did right: avoided the "let's just use a choropleth trap"
  • what went wrong? a) using shapes you can't compare at a glance
  • what went wrong? b) no color difference between the shapes
  • what went wrong? c) it looks like larger values are on top, except for Mexico which is squeezed up top for some reason


Election coverage prompts good graphics

The election broadcasts in the U.S. are full-day affairs, and they make a great showcase for interactive graphics.

The election setting is optimal as it demands clear graphics that are instantly digestible. Anything else would have left viewers confused or frustrated.

The analytical concepts conveyed by the talking heads during these broadcasts are quite sophisticated, and they do a wonderful job of conveying them.

***

One such concept is the value of comparing statistics against a benchmark (or even multiple benchmarks). This analytics tactic came in especially handy in the 2024 election, because both leading candidates were in some sense incumbents: Kamala was part of the Biden ticket in 2020, while Trump competed in both the 2016 and 2020 elections.

Msnbc_2024_ga_douglas

In the above screenshot, taken around 11 pm on election night, the MSNBC host (who looks like Steve K.) was searching for Kamala votes because it appeared that she was losing the state of Georgia. The question of the moment: were there enough votes left for her to close the gap?

In the graphic (first numeric column), we were seeing Kamala winning 65% of the votes, against Trump's 34%, in Douglas county in Georgia. At first sight, one would conclude that Kamala did spectacularly well here.

But, is 65% good enough? One can't answer this question without knowing past results. How did Biden-Harris do in the 2020 election when they won the presidency?

The host touched the interactive screen to reveal the second column of numbers, which allows viewers to compare the results directly. At the time of the screenshot, with 94% of the votes counted, Kamala was performing better in this county than the Biden-Harris ticket did in 2020 (65% vs 62%). This should help her narrow the gap.

If in 2020 the ticket had also won 65% of the Douglas county votes, then we should not expect the vote margin to shrink after counting the remaining 6% of votes. This is why the benchmark from 2020 is crucial. (Of course, there is still the possibility that the remaining votes were severely biased in Kamala's favor, but that would not be enough, as I'll explain further below.)

All stations used this benchmark; some did not show the two columns side by side, making it harder to do the comparison.

Interesting side note: Douglas county has been rapidly shifting blue in the last two decades. The proportion of whites in the county dropped from 76% to 35% since 2000 (link).

***

Though Douglas county was encouraging for Kamala supporters, the vote gap in the state of Georgia at the time was over 130,000 in favor of Trump. The 6% in Douglas represented only about 4,500 votes (= 70,000*0.06/0.94). Even if she won all of them (extremely unlikely), it would be far from enough.
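The arithmetic in that parenthetical can be spelled out:

```python
# Reconstructing the Douglas county arithmetic from the post: about
# 70,000 votes had been counted, which was 94% of the total, so the
# remaining 6% works out to roughly 4,500 votes.
counted = 70_000
pct_counted = 0.94
total_expected = counted / pct_counted
remaining = total_expected - counted   # = 70,000 * 0.06 / 0.94

state_gap = 130_000   # Trump's statewide lead at the time
```

Even a clean sweep of those remaining votes barely dents a six-figure statewide gap, which is why the host moved on.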

So, the host flipped to Fulton county, the most populous county in Georgia, and also a Democratic stronghold. This is where the battle should be decided.

Msnbc_2024_ga_fulton

Using the same format - an interactive version of a small-multiples arrangement - the host looked at the situation in Fulton. The encouraging sign was that 22% of the votes there had not yet been counted. Moreover, she had captured 73% of the votes already tallied, 10 percentage points better than her performance in Douglas county. So we knew that many more votes were coming in from Fulton, with the vast majority being Democratic.

But that wasn't the full story. We have to compare these statistics to our 2020 benchmark. This comparison revealed that she faced a tough road ahead. That's because Biden-Harris also won 73% of the Fulton votes in 2020. She might not earn additional votes here that could be used to close the state-wide gap.

If the 73% margin held to the end of the count, she would win 90,000 additional votes in Fulton but Trump would win 33,000, so that the state-wide gap should narrow by 57,000 votes. Let's round that up, and say Fulton halved Trump's lead in Georgia. But where else could she claw back the other half?
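Here's that calculation spelled out; the count of remaining Fulton votes is my own back-of-envelope figure, implied by the post's numbers (90,000 votes at a 73% share suggests roughly 123,000 outstanding).

```python
# Hedged reconstruction of the Fulton arithmetic. The remaining-vote
# count is an assumption, not a published figure.
remaining = 123_000
harris_share = 0.73

harris_votes = harris_share * remaining          # roughly 90,000
trump_votes = (1 - harris_share) * remaining     # roughly 33,000
narrowing = harris_votes - trump_votes           # roughly 57,000
```

The point of the subtraction is that only the *margin* on the remaining votes closes the statewide gap, not her raw vote haul.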

***

From this point, the analytics can follow one of two paths, which should lead to the same conclusion. The first path runs down the list of Georgia counties. The second path goes up a level to a state-wide analysis, similar to what was done in my post on the book blog (link).

Cnn_2024_ga

Around this time, Georgia had counted 4.8 million votes, with another 12% outstanding. So, about 650,000 votes had not been assigned to any candidate. The margin was about 135,000 in Trump's favor, which amounted to 20% of the outstanding votes. But that was 20% on top of her base value of 48% share, meaning she had to claim 68% of all remaining votes. (If in the outstanding votes, she got the same share of 48% as in the already-counted, then she would lose the state with the same vote margin as currently seen, and would lose by even more absolute votes.)
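To spell out that arithmetic, following the post's own logic (the gap as a share of outstanding votes, stacked on top of her base share):

```python
# Statewide arithmetic per the post: the 135,000-vote gap is about 20%
# of the 650,000 outstanding votes, and that 20% comes on top of her
# 48% base share among the votes already counted.
outstanding = 650_000
gap = 135_000
base_share = 0.48

gap_share = gap / outstanding            # roughly 0.21
required_share = base_share + gap_share  # roughly 0.68
```

Needing two-thirds of all remaining votes, statewide, is a far taller order than any single friendly county can deliver.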

The reason why the situation was more hopeless than it even sounded here is that the 48% base value came from the 2024 votes that had been counted; thus, for example, it included her better-than-benchmark performance in Douglas county. She would have to do even better to close the gap! In Fulton, which has the biggest potential, she was unable to push the vote share above the 2020 level.

That's why in my book blog (link), I suggested that the networks could have called Georgia (and several other swing states) earlier, if they used "numbersense" rather than mathematical impossibility as the criterion.

***

Before ending, let's praise the unsung heroes - the data analysts who worked behind the scenes to make these interactive graphics possible.

The graphics require data feeds covering a broad scope, from real-time vote tallies to total votes cast, both at the county level and the state level. While the focus is on the two leading candidates, any votes going to other candidates have to be tabulated, even if not displayed. The talking heads don't just want raw vote counts; in order to tell the story of the election, they need some understanding of how many votes are still to be counted, where they are coming from, the partisan lean of those votes, how likely the result is to deviate from past elections, and so on.

All those computations must be automated, but manually checked. The graphics software has to be reliable; the hosts can touch any part of the map to reveal details, and it's not possible to predict all of the user interactions in advance.

Most importantly, things go wrong unexpectedly on election night, so many data analysts were on standby, scrambling to fix issues like the breakage of a data feed from some county in some state.


Small tweaks that make big differences

It's one of those days when a web search led me to an unfamiliar corner, and I found myself poring over a pile of column charts that look like this:

GO-and-KEGG-diagrams-A-Forty-nine-different-GO-term-annotations-of-the-parental-genes

This pair of charts appears to be canonical in a type of genetics analysis. I'll focus on the column chart up top.

The chart plots a variety of gene functions along the horizontal axis. These functions are classified into three broad categories, indicated using axis annotation.

What are some small tweaks that readers will enjoy?

***

First, use colors. Here is an example in which the designer uses color to indicate the function classes:

Fcvm-09-810257-g006-3-colors

The primary design difference between these two column charts is the use of three colors to indicate the three function classes. This little change makes it much easier to recognize the end of one class and the start of the next.

Color doesn't have to be limited to column areas. The following example extends the colors to the axis labels:

Fcell-09-755670-g004-coloredlabels

Again, it's just the smallest of changes, but it makes a big difference.

***

It bugs me a lot that the long axis labels are printed at a slant, forcing every serious reader to read with a slanted head.

Slanting it the other way doesn't help:

Fig7-swayright

Vertical labels are best read...

OR-43-05-1413-g06-vertical

These vertical labels are best read while doing side planks.

Side-Plank

***

I'm surprised that horizontal alignment is rather rare. Here's one example:

Fcell-09-651142-g004-horizontal


Reading log: HBR's specialty bar charts

Today, I want to talk about a type of analysis that I used to ask students to do. I'm calling it a reading log analysis – a reading report that traces how one consumes a dataviz work, from where the eyes first land to the moment of full comprehension (or abandonment, if that is the outcome). Usually, we do this orally during a live session, but it's difficult to arrive at a full report within the limited class time. A written report overcomes this problem. A stack of reading logs should be a gift to any chart designer.

My report below is very detailed, reflecting the amount of attention I pay to the craft. Most readers won't spend as much time consuming a graphic. The value of the report is not only in what it covers but also in what it does not mention.

***

The chart being analyzed showed up in a Harvard Business Review article (link), and it was submitted by longtime reader Howie H.

Hbr_specialbarcharts

First and foremost, I recognized the chart form as a bar chart. It's an advanced bar chart in which each bar has stacked sections and a vertical line in the middle. Now, I wanted to figure out how data enter the picture.

My eyes went to the top legend, which tells me the author was comparing the proportion of respondents who said "business should take responsibility" to the proportion who rated "business is doing well". The difference in proportions is called the "performance gap". I glanced quickly at the first row label to discover that the underlying survey addresses social issues such as environmental concerns.

Next, I looked at the first bar, trying to figure out its data encoding scheme. The bold, blue vertical line in the middle of the bar caused me to think each bar is split into left and right sections. The right section is shaded and labeled with the performance gap numbers so I focused on the segment to the left of the blue line.

My head started to hurt a little. The green number (76%) is associated with the left edge of the left section of the bar. And if the blue line represents the other number (29%), then the width of the left section should map to the performance gap. This interpretation was obviously incorrect since the right section already showed the gap, and the width of the left section was not equal to that of the right shaded section.

I jumped to the next row. My head hurt a bit more. The only difference between the two rows is the green number: 74%, 2 percentage points smaller. I couldn't explain why the left sections of both bars have the same width, which confirms that the left section doesn't display the performance gap (assuming that no graphical mistakes have been made). It also appeared that the left edge of the bar was unrelated to the green number. So I retreated to square one. Let's start over. How were the data encoded in this bar chart?

I scrolled down to the next figure, which applies the same chart form to other data.

Hbr_specialbarcharts_2

I became even more confused. The first row showed these labels: green number 60%, blue number 44%, performance gap -16%. This bar is much longer than the one in the previous figure, even though 60% is less than 76%. Besides, the left section, bracketed by the green number on the left and the blue number on the right, appeared much wider than the 16-point difference would have merited. I again lapsed into thinking that the left section represents the performance gap.

Then I noticed that the vertical blue lines were roughly in proportion. Soon, I realized that the total bar width (both sections) maps to the green number. Now back to the first figure. The proportion of respondents who believe business should take responsibility (green number) is encoded in the full bar. In other words, the left edges of all the bars represent 0%. Meanwhile the proportion saying business is doing well is encoded in the left section. Thus, the difference between the full width and the left-section width is both the right-section width and the performance gap.
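To pin down the decoded scheme, here is a tiny sketch that computes the section widths from the two published numbers, using the first row of the second figure (green 60%, blue 44%):

```python
def decode_bar(should_pct, doing_pct):
    """Decode one bar per the scheme worked out above: the full bar
    spans from 0 to the 'should take responsibility' share (green
    number); the left section runs up to the 'doing well' share (blue
    number); the shaded right section is the difference, i.e. the
    performance gap."""
    return {
        "full_width": should_pct,
        "left_section": doing_pct,
        "right_section": should_pct - doing_pct,
        "gap_label": doing_pct - should_pct,  # negative = underperforming
    }

row = decode_bar(60, 44)   # first row of the second figure
```

Once decoded this way, the bar widths finally reconcile with all the printed labels.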

Here is an edited version that clarifies the encoding scheme:

Hbr_specialbarcharts_2

***

That's my reading log. Howie gave me his take:

I had to interrupt my reading of the article for quite a while to puzzle this one out. It's sorted by performance gap, and I'm sure there's a better way to display that. Maybe a dot plot, similar to here - https://junkcharts.typepad.com/junk_charts/2023/12/the-efficiency-of-visual-communications.html.

A dot plot might look something like this:

Junkcharts_redo_hbr_specialcharts_2
Howie also said:

I interpret the authors' gist to be something like "Companies underperform public expectations on a wide range of social challenges," so I think I'd want to focus on the uniform direction and breadth of the performance gap more than the specifics of each line item.

And I agree.


Do you want a taste of the new hurricane cone?

The National Hurricane Center (NHC) put out a press release (link to PDF) to announce upcoming changes (in August 2024) to their "hurricane cone" map. The news was picked up by the Miami Herald (link).

New_hurricane_map_2024

The above example is what the map looks like. (The data are probably fake since the new map is not yet implemented.)

The cone map has been a focus of research because experts like Alberto Cairo have been highly critical of its potential to mislead. Unfortunately, the more attention paid to it, the more complicated the map has become.

The latest version of this map comprises three layers.

The bottom layer is the so-called "cone". This is the white patch labeled below as the "potential track area (day 1-5)". Researchers dislike this element because they say readers tend to misinterpret the cone as predicting which areas would be damaged by hurricane winds, when the cone is intended to depict the uncertainty about the path of the hurricane. Prior criticism has led the NHC to add the text at the top of the chart, saying "The cone contains the probable path of the storm center but does not show the size of the storm. Hazardous conditions can occur outside of the cone."

The middle layer consists of the multi-colored bits. Two of these show the areas for which the NHC has issued "watches" and "warnings". All of the color categories represent wind speeds at different times: watches and warnings are forecasts, while the other colors indicate "current" wind speeds.

The top layer consists of black dots. These provide a single forecast of the most likely position of the storm, with the S, H, M labels indicating the most likely range of wind speeds at forecast times.

***

Let's compare the new cone map to a real hurricane map from 2020. (This older map came from a prior piece also by NHC.)

Old_hurricane_map_2020

Can we spot the differences?

To my surprise, the differences were minor, in spite of the pre-announced changes.

The first difference is a simplification. Instead of dividing the white cone (the bottom layer) into two patches -- a white patch for days 1-3, and a dotted transparent patch for days 4-5 -- the new map aggregates the two periods. Visually, simplifying makes the map less busy, but it loses the old map's implicit acknowledgment that forecasts further out are not as reliable.

The second point of departure is the addition of "inland" warnings and watches. Notice how the red and blue areas on the old map hugged the coastline while the red and blue areas on the new map reach inland.

Both changes push the bottom layer, i.e. the cone, deeper into the background. It's like a shrink-flation ice cream cone that has a tiny bit of ice cream stuffed deep in its base.

***

How might one improve the cone map? I'd start by dismantling the layers. The three layers present answers to different, albeit connected, problems.

Let's begin with the hurricane forecasting problem. We have the current location of the storm, and current measurements of wind speeds around its center. As a first requirement, a forecasting model predicts the path of the storm in the near future. At any time, the storm isn't a point in space but a "cloud" around a center. The path of the storm traces how that cloud will move, including any expansion or contraction of its radius.

That's saying a lot. To start with, a forecasting model issues the predicted average path -- the expected path of the storm's center. This path is (not completely) indicated by the black dots in the top layer of the cone map. These dots offer only a sampled view of the average path.

Not surprisingly, there is quite a bit of uncertainty about the future path of any storm. Many models simulate future worlds, generating many predictions of the average paths. The envelope of the most probable set of paths is the "cone". The expanding width of the cone over time reflects the higher uncertainty of our predictions further into the future. Confusingly, this cone expansion does not depict spatial expansion of either the storm's size or the potential areas that may suffer the greatest damage. Both of those tend to shrink as hurricanes move inland.
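To make the "envelope of simulated paths" idea concrete, here is a toy sketch of how such a cone could be derived. This is my own simplified model, not the NHC's: storm positions are one-dimensional, the path is a drifting random walk, and all numbers are made up for illustration.

```python
import random

def simulate_paths(n_paths=1000, n_hours=72, drift=1.0, noise=0.5, seed=42):
    """Simulate storm-center positions as a drifting 1-D random walk.

    Real models track latitude/longitude and much more; this toy
    version only illustrates how uncertainty accumulates over time.
    """
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        pos, path = 0.0, []
        for _ in range(n_hours):
            pos += drift + rng.gauss(0, noise)
            path.append(pos)
        paths.append(path)
    return paths

def cone_envelope(paths, coverage=0.67):
    """At each hour, take the central `coverage` band of simulated
    positions -- the analogue of a probability cone around the
    expected path of the storm's center."""
    lo_q, hi_q = (1 - coverage) / 2, 1 - (1 - coverage) / 2
    envelope = []
    for hour_positions in zip(*paths):
        xs = sorted(hour_positions)
        lo = xs[int(lo_q * (len(xs) - 1))]
        hi = xs[int(hi_q * (len(xs) - 1))]
        envelope.append((lo, hi))
    return envelope

paths = simulate_paths()
env = cone_envelope(paths)
width_6h = env[5][1] - env[5][0]
width_72h = env[71][1] - env[71][0]
# The cone widens with lead time: uncertainty accumulates.
print(width_6h, width_72h)
```

The widening of the band over time is purely a statement about path uncertainty, which is exactly why reading the cone as the storm's physical size is a misreading.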

Nevertheless, the cone and the black dots are connected. The path drawn out by the black dots should be the average path of the center of the storm.

The forecasting model also generates estimates of wind speeds. Those are given as labels inside the black dots. The cone itself offers no information about wind speeds. The map portrays the uncertainty of the position of the storm's center but omits the uncertainty of the projected wind speeds.

The middle layer of colored patches also informs readers about model projections - but in an interpreted manner. The colors portray hurricane warnings and watches for specific areas, which are based on projected wind speeds from the same forecasting models described above. The colors represent NHC's interpretation of these model outputs. Each warning or watch simultaneously uses information on location, wind speed and time. The uncertainty of the projected values is suppressed.

I think it's better to use two focused maps instead of one map that captures a bit of this and a bit of that.

One map can present the interpreted data, and show the areas that have current warnings and watches. This map is about projected wind strength in the next 1-3 days. It isn't about the center of the storm, or its projected path. Uncertainty can be added by varying the tint of the colors, reflecting the confidence of the model's prediction.

Another map can show the projected path of the center of the storm, plus the cone of uncertainty around that expected path. I'd like to bring more attention to the times of forecasting, perhaps shading the cone day by day, if the underlying model has this level of precision.

***

Back in 2019, I wrote a pretty long post about these cone maps. Well worth revisiting today!


To a new year of pleasant surprises

Happy new year!

This year promises to be the year of AI. Already last year, we pretty much couldn't lift an eyebrow without someone making an AI claim. This year will be even noisier. Visual Capitalist acknowledged this by making the noisiest map of 2023:

Visualcapitalist_01_Generative_AI_World_map sm

I kept thinking they have a geography teacher on the team, who really, really wants to give us a lesson on where each country is on the world map.

All our attention is drawn to the guiding lines and the random scatter of numbers. We have to squint to find the country names. All this noise drowns out the attempt to make sense of the data, namely, the inset of the top 10 countries in the lower left corner, and the classification of countries into five colored groups.

A small dose of editing helps. Remove most data labels except for the countries for which they have a story. Provide a data table below for those who want details.

***

In the Methodology section, the data analysts (possibly from a third party called ElectronicsHub) indicated that they used Google search volume of "over 90 of the most popular generative AI tools", calculating the "overall volume across all tools per 100k population". Then came a baffling line: "all search volumes were scaled up according to the search engine market share in each country, using figures from statscounter.com." (Note: in the following, I'm calling the data "AI-related search" for simplicity even though their measurement is restricted to the terms described above.)

It took me a while to work out what they could have meant by that line. I believe this is the idea: Google is not the only search engine out there, so by measuring only Google search volume, they undercount the true search volume. How did they deal with this missing data problem? They "scaled up": if Google has 80% of the search volume in a country, they divide the Google volume by 80% to "scale up" to 100%.
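As I read their methodology, the scaling-up step is a one-line calculation. The function name and numbers below are mine, for illustration:

```python
def scale_up(google_ai_volume, google_share):
    """The analysts' scaling-up heuristic (as I read their methodology):
    divide the observed Google AI-related search volume by Google's
    market share to estimate the all-engine volume."""
    return google_ai_volume / google_share

# If Google is 80% of a country's search volume and we observe
# 8 units of AI-related searches on Google, the estimate is 10 units.
print(scale_up(8, 0.80))  # -> 10.0
```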

Whenever we use a heuristic like this, we should investigate its foundations. What is the implicit assumption behind this scaling-up procedure? It is that all search engines are effectively the same. The users of non-Google search engines behave exactly as the Google search engine users do. If the analysts somehow could get their hands on the data of other search engines, they would discover that the proportion of search volume that is AI-related is effectively the same as seen on Google.

This is one of those convenient, and obviously wrong, assumptions: it treats each search engine's audience as a random sample from the population of all users. If that were true, the market would have no need for more than one search engine.

Let's make up some numbers. Say Google has an 80% share of search volume in Country A, and AI-related search makes up 10% of the overall Google search volume. The remaining search engines have a 20% share. Scaling up here means taking Google's AI-related search volume -- 8% of the country's total -- and dividing by 80%, which yields 10%. Since Google accounts for 8% of that 10%, the other search engines are credited with the remaining 2% of overall search volume as AI searches in Country A. Thus, the proportion of AI-related searches on those other search engines is 2%/20% = 10%.

Now, in certain countries, Google is not quite as dominant. Let's say Google only has a 20% share of Country B's search volume. AI-related search on Google is 2% of the country's total, which is 10% of Google's own volume. Using the same scaling-up procedure, the analysts have effectively assumed that the proportion of AI-related search volume on the dominant search engines in Country B is also 10%.
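The Country A and Country B arithmetic can be checked in a few lines. This is my own verification of the worked numbers above (the function name and inputs are mine):

```python
def implied_other_engine_rate(google_share, google_ai_rate):
    """Proportion of AI-related searches that the scaling-up heuristic
    implicitly attributes to the non-Google engines. Inputs: Google's
    share of a country's total search volume, and the AI-related share
    of Google's own volume. Illustrative numbers only."""
    google_ai_volume = google_share * google_ai_rate       # e.g. 80% * 10% = 8%
    total_ai_estimate = google_ai_volume / google_share    # the scale-up step
    other_ai_volume = total_ai_estimate - google_ai_volume
    return other_ai_volume / (1 - google_share)

# Country A: Google has 80% share; 10% of Google searches are AI-related.
# Country B: Google has only 20% share, same 10% AI-related rate.
# The heuristic forces the other engines to match Google's rate exactly,
# while the inflation factor (1/share) is 1.25 for A but 5 for B.
print(round(implied_other_engine_rate(0.80, 0.10), 4))  # -> 0.1
print(round(implied_other_engine_rate(0.20, 0.10), 4))  # -> 0.1
```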

I'm using the above calculations to illustrate a shortcoming of this heuristic. Using this procedure inflates the search volume in countries in which Google is less dominant because the inflation factor is the reciprocal of Google's market share. The less dominant Google is, the larger the inflation factor.

What's also true? The less dominant Google is, the smaller the proportion of the total data the analysts can see, and the lower the quality of the available information. So the heuristic is most influential exactly where the uncertainty is greatest.

***

Hope your new year is full of uncertainty, and your heuristics shall lead you to pleasant surprises.

If you like the blog's content, please spread the word. I'm looking forward to sharing more content as the world of data continues to evolve at an amazing pace.

Disclosure: This blog post is not written by AI.