## To a new year of pleasant surprises

##### Jan 01, 2024

Happy new year!

This year promises to be the year of AI. Already last year, we pretty much couldn't lift an eyebrow without someone making an AI claim. This year will be even noisier. Visual Capitalist acknowledged this by making the noisiest map of 2023:

I kept thinking they have a geography teacher on the team, who really, really wants to give us a lesson of where each country is on the world map.

All our attention is drawn to the guiding lines and the random scatter of numbers. We have to squint to find the country names. All this noise drowns out the attempt to make sense of the data, namely, the inset of the top 10 countries in the lower left corner, and the classification of countries into five colored groups.

A small dose of editing helps. Remove most data labels except for the countries for which they have a story. Provide a data table below for those who want details.

***

In the Methodology section, the data analysts (possibly from a third party called ElectronicsHub) indicated that they used Google search volume of "over 90 of the most popular generative AI tools", calculating the "overall volume across all tools per 100k population". Then came a baffling line: "all search volumes were scaled up according to the search engine market share in each country, using figures from statscounter.com." (Note: in the following, I'm calling the data "AI-related search" for simplicity even though their measurement is restricted to the terms described above.)

It took me a while to comprehend what they could have meant by that line. I believe this is what that sentence means: Google is not the only search engine out there so by only researching Google search volume, they undercount the true search volume. How did they deal with the missing data problem? They "scaled up" so if Google is 80% of the search volume in a country, then they divide the Google volume by 80% to "scale up" to 100%.

Whenever we use heuristics like this, we should investigate its foundations. What is the implicit assumption behind this scaling-up procedure? It is that all search engines are effectively the same. The users of non-Google search engines behave exactly as the Google search engine users. If the analysts somehow could get their hands on the data of other search engines, they would discover that the proportion of search volume that is AI-related is effectively the same as seen on Google.

This is one of those convenient, and obviously wrong assumptions – if true, the market would have no need for more than one search engine. Each search engine's audience is just a random sample from the population of all users.

Let's make up some numbers. Let's say Google has 80% share of search volume in Country A, and AI-related search 10% of the overall Google search volume. The remaining search engines have 20% share. Scaling up here means taking the 8% of Google AI-related search volume, divide by 80%, which yields 10%. Since Google owns 8% of the 10%, the other search engines see 2% of overall search volume attributed to AI searches in Country A. Thus, the proportion of AI-related searches on those other search engines is 2%/20% = 10%.

Now, in certain countries, Google is not quite as dominant. Let's say Google only has 20% share of Country B's search volume. AI-related search on Google is 2%, which is 10% of its total. Using the same scaling-up procedure, the analysts have effectively assumed that the proportion of AI-related search volume in the dominant search engines in Country B to be also 10%.

I'm using the above calculations to illustrate a shortcoming of this heuristic. Using this procedure inflates the search volume in countries in which Google is less dominant because the inflation factor is the reciprocal of Google's market share. The less dominant Google is, the larger the inflation factor.

What's also true? The less dominant Google is, the smaller proportion of the total data the analysts are able to see, the lower the quality of the available information. So the heuristic is the most influential where it has the greatest uncertainty.

***

Hope your new year is full of uncertainty, and your heuristics shall lead you to pleasant surprises.

If you like the blog's content, please spread the word. I'm looking forward to sharing more content as the world of data continues to evolve at an amazing pace.

Disclosure: This blog post is not written by AI.

I’m not sure I follow your criticism of the scaling up procedure? Is the assumption that users of other search engines search for AI terms at the same rate as Google users wrong? Of course it is. But failing to adjust for google’s market share seems by far the greater of two evils here, as then you’re definitely introducing artefacts into your metric (lower Google market share would always reduce search volume per capita).

Perhaps a better way to think of it is that what you’re really doing is scaling *down* the denominator in the searches per capita ratio to match Google’s share of users. And then your metric is *just* Google search volume / # Google users. It’s exactly the same number, but it’s slightly more transparent what the number is.

Posted by: conchis | Jan 05, 2024 at 10:47 PM

C: Thanks for your comment. Just to complete the circle, when they are "scaling down" proportionally, and not doing anything to correct for biases, they also assume that Google is a random sample of all search engines.

I know this is standard practice for lots of analysts and there are whole books written in which they implicitly make this assumption. It doesn't make it right.

For example, since tech-savvy people are more likely to use Google than other search engines, and tech-savvy people are more likely to be interested in AI tooling than other people, it's quite likely that Google takes a larger share of AI searches than non-Google search engines. If one takes a Bayesian view of things, one can come up with priors to model the potential biases and adjust the numbers further.

Posted by: Kaiser | Jan 06, 2024 at 03:38 PM