As a commercial entity, Google, like many companies, is keen on protecting what it perceives to be corporate secrets (the irony: Google and many other companies tell users they should have no secrets). I understand that, but the act of hiding creates problems for analysts: measurement problems.

Measurement concerns the use of a metric to represent a quantity. You can't just define a metric and then claim it's a valid construct for measuring some quantity. Even if that metric is calculated perfectly, the measurement may still be inaccurate because the metric doesn't properly measure the quantity!

Google doesn't want us to know the real number of searches, so when it publishes things like Google Trends, it wants to have it both ways. It wants to generate interest in the search data by providing relative comparisons, but it wants to hide the true volumes, as in the however-many billions of searches for a particular search term.

So it came up with an "indexing" strategy. A Search Index is the relative volume of searches, with a reference level of 100 assigned to the top-ranked search term in any given list of search terms. I recently discussed an example application of this on the sister blog Junk Charts (link). This is an important topic that I want to surface on this blog as well.

In the example shown, the graphic designer homes in on a set of search terms, say all keywords related to how to fix something in the house (like walls, doors, windows, washing machines, etc.) in a specific country (say, France). Instead of publishing how many searches occurred, Google prints a Search Index for each search term in France. So, a value of 49 for washing machine means that the volume of searches for washing machine in France is 49% of the volume of searches for windows in France. (You can check this out on the interactive graphic here.)
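As a sketch, the indexing arithmetic might look like the following. The raw volumes below are invented for illustration, since Google does not publish true search counts; only the relative ratios matter.

```python
# Hypothetical raw search volumes (made up; Google hides the real numbers).
volumes = {"windows": 1_000_000, "washing machine": 490_000, "doors": 300_000}

# The Search Index divides each volume by the top-ranked item's volume
# and scales so the top item gets exactly 100.
top = max(volumes.values())
index = {term: round(100 * v / top) for term, v in volumes.items()}
print(index)  # {'windows': 100, 'washing machine': 49, 'doors': 30}
```

Note that the indexed values reveal the ratios between terms while disclosing nothing about the absolute volumes.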

***

What are the characteristics of this Search Index metric used to represent the quantity of **popularity of search**?

Firstly, it is not a direct measurement of popularity. The direct metric is the proportion of search volume for each search term relative to the aggregate.

Secondly, it is a relative scale. Each number can only be interpreted in relation to another number. If no other number is available, then the comparison is with the reference level, which is the top-ranked item in the relevant list of items, arbitrarily set to 100. If two numbers are available, say indices of 50 and 25, we learn that the second item has half the popularity of the first item, which in turn has half the popularity of the top-ranked item.

Thirdly, this indexing strategy preserves the distribution; it is merely a re-scaled version of the underlying search volumes. You divide the number of searches for each item by a fixed number - the search volume of the most popular item. In other words, if the top-ranked item dominates the list, then it gets a value of 100 and the other items sit close to 0. This transformation is unlike taking logarithms, which changes the distribution by pulling the small values towards the middle.

The more easily interpreted popularity metric is also a re-scaled version of the underlying search volumes. Instead of dividing by the search volume of the most popular item, you divide by the aggregate search volume of all items. So the Search Index is really just a re-scaled version of the direct metric of popularity. You can go back and forth between the two scales if you know the proportion of searches accounted for by the top-ranked item.
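The back-and-forth conversion can be sketched as follows. The top item's 40% share of total searches is an assumed number for illustration; in practice, this is exactly the quantity Google withholds.

```python
# Assumed: the top-ranked item accounts for 40% of all searches in the list.
# This share is hypothetical; knowing it is what makes conversion possible.
TOP_SHARE = 0.40

def index_to_share(index_value, top_share):
    """Search Index (0-100) -> proportion of total searches."""
    return index_value / 100 * top_share

def share_to_index(share, top_share):
    """Proportion of total searches -> Search Index (0-100)."""
    return 100 * share / top_share

print(index_to_share(49, TOP_SHARE))   # 0.196, i.e. 19.6% of all searches
print(share_to_index(0.196, TOP_SHARE))  # 49.0
```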

Fourthly, it's actually simpler than that. The relative popularity of the items is already baked into the Search Index. The easiest way to transform the Search Index to the direct popularity metric is to "normalize" the scale. **Normalization** means dividing each value by the sum of all values. This makes each normalized value interpretable as a proportion of the total.
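A minimal sketch of that normalization, using made-up index values:

```python
# Normalization: divide each Search Index value by the sum of all values.
# The resulting numbers are proportions of total searches, summing to 1.
indices = [100, 50, 25]  # hypothetical Search Index values

total = sum(indices)
shares = [i / total for i in indices]
print([round(s, 3) for s in shares])  # [0.571, 0.286, 0.143]
```

Because the Search Index is itself a re-scaling of the raw volumes, this recovers each item's true share of searches (up to rounding) without ever knowing the volumes.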

*The problem with the Search Index is that you can't interpret it as a proportion. It feels like a proportion because the maximum value is 100, the minimum is 0, and every Index value falls between them. What breaks the interpretation is that the Index values do not sum to 100.*

If you're analyzing 5 items, and every item has an equal number of searches, then every item has an index value of 100. The sum of the indices is 500.

If you have 10 items, all equally popular, every item still has index 100, but the sum of the indices is now 1000.

If you have 5 items, and the search volume drops by half at each lower rank, then the indices would be 100, 50, 25, 13, and 6 (with rounding). For 10 items, the indices would be 100, 50, 25, 13, 6, 3, 2, 1, 0, 0 (again with rounding). The sums of the indices are, respectively, 194 and 200.
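These sums can be reproduced with a short sketch. The halving distribution is the one described above; the rounding is done half-up to match the figures in the text (Python's built-in `round` rounds halves to even, which would give 12 instead of 13 at rank four).

```python
import math

def indices(n):
    # Search volume halves at each rank down the list.
    vols = [0.5 ** k for k in range(n)]
    # Index against the top item, rounding half-up (12.5 -> 13).
    return [math.floor(100 * v / vols[0] + 0.5) for v in vols]

five, ten = indices(5), indices(10)
print(five, sum(five))  # [100, 50, 25, 13, 6] 194
print(ten, sum(ten))    # [100, 50, 25, 13, 6, 3, 2, 1, 0, 0] 200
```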

If you now try to compare two sets of indices, in our example, one for how to fix something in the house in France and one for how to fix something in the house in Russia, then you have a problem. The sum of the indices in France is a lot larger than the sum of the indices in Russia. So a 49 in France is not the same as a 49 in Russia!

One last example. Say we have three items A, B, C. If these are evenly distributed, then each item should have popularity 33%, and each Search Index will be 100. The Search Indices sum to 300. Normalized, we get back to 33% for each item.

Now, if the items have popularities of 50%, 33%, and 17%, then the Search Indices are 100, 66, 34. Note that item B has 33% popularity in both cases, but in the even distribution case, the Search Index of item B is 100, while in the uneven case, item B has index 66. This is what we mean by the metric not properly capturing the quantity we want to measure.
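The three-item example can be sketched as:

```python
def search_index(shares):
    # Index each item's share against the largest share, scaled to 100.
    top = max(shares.values())
    return {k: round(100 * v / top) for k, v in shares.items()}

even = {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}
uneven = {"A": 0.50, "B": 0.33, "C": 0.17}

print(search_index(even))    # {'A': 100, 'B': 100, 'C': 100}
print(search_index(uneven))  # {'A': 100, 'B': 66, 'C': 34}
```

Item B's true popularity is 33% in both scenarios, yet its index is 100 in one and 66 in the other: the index depends on the rest of the list, not just on the item itself.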
