Facebook puts out a press release proclaiming that it removed "583 million fake accounts, 836 million spam posts in the first 3 months of 2018."

The "laugh track" was supposed to be: "Wow, Facebook has really done so much to combat fake accounts and spam posts."

If you have numbersense, you realize this is a classic example of Roslingese: a pile of numerators without denominators.

All that line tells me is that there are lots of accounts created and lots of posts on Facebook. Those numbers seem too BIG to be credible. For example, Mashable reported in 2017 that "Facebook now reaches 1.86 billion monthly active users, having added 70 million users in the quarter." Putting these together, **Facebook is claiming that in one quarter, it is adding 70 million net accounts but removed 583 million fake accounts.**

What we need are the denominators: what proportion of all fake accounts, and what proportion of all spam posts, were removed.

Typing "health benefits" into the search box on Google News gives me a long list of activities and foods that are supposedly good for me. Scanning the headlines, I see fasting, running, garlic, honey, turmeric, amchur, drinking hydrogen peroxide, gardening, low-carb diets, green tea, hummus, etc.

I am a skeptic when it comes to these claims. Here's how I think about them.

Hypothetically, someone – hopefully a nutrition scientist – is making the claim that a substance X will lead to an improvement in health metric Y. The effect of X on Y has a direction and a magnitude. The direction of the effect can be positive or negative while its magnitude is either large or small. So, any effect X on Y is captured by one of the four squares as shown.

Most reported effects X on Y tend to be small and positive. As such:

1) We are talking about a small *average* effect. This is wrongly interpreted as meaning everyone who takes substance X will accrue a small benefit on Y. That's usually not how we obtain a small average effect. A better interpretation is that **a small proportion of people who take substance X will accrue a benefit**. Most people who take X won't.

You might still argue with me in this way: I understand I am just buying a lottery ticket; the chance of winning is low but what's the harm? This takes me to:

2) When the effect size is small, it's possible that the direction of the effect is wrongly measured. The difference between a small positive effect and a small negative effect is not much! Instead of a benefit Y, you might end up with a harm Z.

The kicker: if the small magnitude of the benefit is enough for you to take seriously, then you should also worry about an effect of equal but opposite magnitude.

Or, you just decide that you don't care about these little effects, which is why I don't pay much attention to those news stories.
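To make point 1 concrete, here is a minimal sketch with made-up numbers (the group sizes and benefit values are purely hypothetical, not taken from any study). Two groups have exactly the same small average benefit, but in one of them only a tenth of the people benefit at all.

```python
# Hypothetical numbers: two groups of 100 people taking substance X.
everyone_small = [1.0] * 100            # every person gains 1 unit of Y
few_large = [10.0] * 10 + [0.0] * 90    # 10 people gain 10 units, 90 gain nothing


def average(values):
    """Average effect across the group."""
    return sum(values) / len(values)


def share_benefiting(values):
    """Proportion of people with any positive effect."""
    return sum(1 for v in values if v > 0) / len(values)


avg_everyone = average(everyone_small)
avg_few = average(few_large)

print(avg_everyone, share_benefiting(everyone_small))
print(avg_few, share_benefiting(few_large))
```

The reported average cannot tell these two situations apart, which is why a small average effect should not be read as a small benefit for everyone.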

Older readers might not have heard of the company MoviePass but millennials likely have, as they account for half of the 2 million subscribers (link).

Moviepass is basically Groupon, which we debunked on this blog years ago. Scratch that. Moviepass is worse than Groupon. At the start, Groupon had at least some cheerleaders but Moviepass sounds dead on arrival.

Recall: Groupon is the digital coupon startup that offered an incredible deal to consumers - a typical coupon for a restaurant will let you dine there at a discount of 50% or more. So for a $100 meal, the restaurant gets paid $50, but Groupon also claims it is bringing diners to the restaurant and takes a revenue share (half, in this example), meaning the restaurant takes in $25 before paying rent, food costs, people costs, etc.

Groupon definitely has a great deal for consumers but the restaurants got the short end of the stick. The deal only makes sense if the coupon users would otherwise not have dined at the restaurant. If the coupon users are regular diners, then the restaurant loses $75 for every such table! You'd need three new diners to cover the loss of one regular diner (that only gets the restaurant to break even on the offer cost). In **Numbersense**, I use this example to illustrate counterfactual thinking.
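The arithmetic behind those numbers can be sketched as follows. The 50% revenue share is my assumption here; it is the split implied by the $25 figure. The variable names are just for illustration.

```python
meal_price = 100.0      # full menu price in the example
coupon_discount = 0.5   # the diner pays half price
groupon_share = 0.5     # assumed revenue share, implied by the $25 figure

paid_by_diner = meal_price * (1 - coupon_discount)      # what the coupon user pays
restaurant_take = paid_by_diner * (1 - groupon_share)   # what the restaurant keeps

# A regular diner would have paid full price without the coupon,
# so the restaurant gives up the difference.
loss_per_regular = meal_price - restaurant_take

# A new diner brings in revenue the restaurant would not otherwise have had.
new_diners_needed = loss_per_regular / restaurant_take

print(restaurant_take, loss_per_regular, new_diners_needed)
```

The counterfactual is the key input: the same coupon is a $25 gain or a $75 loss depending on what the diner would have done without it.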

Worse still, the regular diner has a higher chance of wanting that Groupon, just because s/he already likes the product.

***

What is Moviepass? It is a monthly subscription plan for movie-going. It made the news in 2017 when the company dropped the monthly subscription price to $9.95, and grew subscribers 100-fold to 2 million. For $9.95 a month, roughly the average price of one movie ticket, subscribers are allowed to see one movie a day. Yes, this is insane.

It is even more insane. Moviepass pays the movie theaters full price for the tickets so the only way it makes any profit is if the subscriber does not watch even one movie a month. They would need subscribers who pay them monthly but don't watch movies.

Exactly the opposite should happen. The more movies one watches, the more lucrative this subscription plan is, and the more likely one is to sign up. If movies are not a big part of your life, you're just not going to buy a subscription to Moviepass. If a subscriber watches one movie a day (the maximum allowed), then Moviepass loses about $290 each month for that customer, month after month. I also know a subscriber who has a habit of just booking tickets and not using them. Just for giggles, he says.
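Here is a rough sketch of the per-subscriber math. It assumes a $10 ticket (close to the "roughly the average price of one movie ticket" above) and a 30-day month; both are my assumptions.

```python
subscription_price = 9.95   # monthly fee paid by the subscriber
ticket_price = 10.00        # assumed full price Moviepass pays the theater
days_in_month = 30          # assumed; one movie a day is the maximum


def monthly_margin(movies_watched):
    """What Moviepass earns (or loses, if negative) on one subscriber."""
    return subscription_price - movies_watched * ticket_price


for movies in (0, 1, 4, days_in_month):
    print(movies, round(monthly_margin(movies), 2))
```

Under these assumptions the margin turns negative at the very first ticket, and the heaviest user costs about $290 a month.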

***

There is speculation that maybe the cinemas eventually will strike a deal with Moviepass to sell them tickets at a discount. If cinemas were to do that, they become like the restaurant owners who bought into the Groupon scheme. The cinemas might think this brings them new customers. The reality is that it's their most loyal customers who will sign up most enthusiastically.

For each loyal customer who signs up, the cinemas would lose revenue on the discounting. If they gave Moviepass a 50% discount, then instead of getting $10 a ticket from these loyal movie-goers, they get $5 a ticket. To pay for that discount, Moviepass has to deliver one additional ticket from a newbie who would otherwise not have gone to the movie. But the loyal customer may watch, say, one movie a week while the newbie might watch one movie a month. So the cinemas might require four newbies to pay for the discounting of one loyal customer!
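The same break-even logic, sketched from the cinema's side. I am treating "one movie a week" as four tickets a month; that rounding is my assumption.

```python
full_price = 10.0
discount = 0.5                  # hypothetical discount given to Moviepass
loyal_tickets_per_month = 4     # assumed: about one movie a week
newbie_tickets_per_month = 1    # one movie a month

discounted_price = full_price * (1 - discount)

# Loyal customers would have paid full price anyway, so each of
# their tickets now brings in less than before.
loss_per_loyal = loyal_tickets_per_month * (full_price - discounted_price)

# Newbie tickets are incremental revenue at the discounted price.
gain_per_newbie = newbie_tickets_per_month * discounted_price

newbies_needed = loss_per_loyal / gain_per_newbie
print(loss_per_loyal, gain_per_newbie, newbies_needed)
```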

Moviepass is no doubt a great deal for movie lovers. But how long can this last?

As a commercial entity, Google, like many companies, is keen on protecting what it perceives to be corporate secrets (the irony: Google and many other companies tell users they should have no secrets). I understand that but the act of hiding creates problems for analysts. Measurement problems.

Measurement concerns the use of a metric to represent a quantity. You don't just define a metric and then claim it's a valid construct for measuring some quantity. Even if that metric is calculated perfectly, the measurement may still be inaccurate because the metric doesn't properly measure the quantity!

Google doesn't want us to know the real amount of searches so when it publishes things like Google Trends, it wants to have it both ways. It wants to generate interest in the search data by providing relative comparisons but it wants to hide the true volumes, as in the however many billions of searches for a particular search term.

So it came up with an "indexing" strategy. A search index is the relative amount of searches with the reference level of 100 given to the top-ranked search term in any given list of search terms. I recently discussed an example application of this on the sister blog Junk Charts (link). This is an important topic that I want to surface on this blog as well.

In the example shown, the graphic designer homes in on a set of search terms, say all keywords related to how to fix something in the house (like walls, doors, windows, washing machines, etc.) in a specific country (say, France). Instead of publishing how many searches occurred, Google prints a Search Index for each search term in France. So, a value of 49 for washing machine means that the volume of searches for washing machine in France is 49% of the volume of searches for windows, the top-ranked term, in France. (You can check this out on the interactive graphic here.)

***

What are the characteristics of this Search Index metric used to represent the quantity of **popularity of search**?

Firstly, it is not a direct measurement of popularity. The direct metric is the proportion of search volume for each search term relative to the aggregate.

Secondly, it is a relative scale. Each number can only be interpreted in relation to another number. If no other number is available, then the comparison is with the reference level, which is the top-ranked item in the relevant list of items, arbitrarily set to 100. If two numbers are available, say index of 50 and 25, we learn that the second item is half the popularity of the first item, and the first item is half the popularity of the top-ranked item.

Thirdly, this indexing strategy preserves the distribution; it is merely a re-scaled version of the underlying search volumes. You divide the number of searches of each item by a fixed number - the search volume of the most popular item. In other words, if the top-ranked item dominates the list, then it will get a value of 100 and the other items close to 0. This transformation is unlike taking logarithms, which changes the distribution by pulling the small values towards the middle.

The more easily interpreted popularity metric is also a re-scaled version of the underlying search volumes. Instead of dividing by the search volume of the most popular item, you divide by the aggregate search volume of all items. So the Search Index is really just a re-scaled version of the direct metric of popularity. You can go back and forth between the two scales if you know the proportion of searches accounted for by the top-ranked item.
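In symbols, the relationship between the two scales looks like this. The notation is mine, not Google's: $v_i$ is the search volume of item $i$, $v_{\text{top}}$ the volume of the top-ranked item, $I_i$ the Search Index, and $p_i$ the direct popularity (share of all searches in the list).

```latex
I_i = 100 \cdot \frac{v_i}{v_{\text{top}}},
\qquad
p_i = \frac{v_i}{\sum_j v_j},
\qquad
p_{\text{top}} = \frac{v_{\text{top}}}{\sum_j v_j}

% Going back and forth requires only p_top:
p_i = \frac{I_i}{100} \cdot p_{\text{top}},
\qquad
I_i = 100 \cdot \frac{p_i}{p_{\text{top}}}
```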

Fourthly, it's actually simpler than that. The relative popularity of the items is already baked into the Search Index. The easiest way to transform the Search Index to the direct popularity metric is to "normalize" the scale. **Normalization** means dividing each value by the sum of all values. This makes each normalized value interpretable as a proportion of the total.
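A small sketch of normalization, using made-up search volumes (Google does not publish the real ones, which is the whole point). The volumes are chosen so that washing machine gets an index of 49, as in the France example.

```python
# Hypothetical search volumes; the true numbers are not published.
volumes = {"windows": 2000, "washing machine": 980, "doors": 500}

top = max(volumes.values())
total = sum(volumes.values())

# Search Index: each volume relative to the top-ranked item, scaled to 100.
index = {term: 100 * v / top for term, v in volumes.items()}

# Direct popularity: each volume as a share of all searches in the list.
share = {term: v / total for term, v in volumes.items()}

# Normalizing the index (dividing by the sum of all index values)
# recovers the shares without ever seeing the raw volumes.
index_sum = sum(index.values())
normalized = {term: i / index_sum for term, i in index.items()}

for term in volumes:
    print(term, index[term], round(share[term], 3), round(normalized[term], 3))
```

The normalized values sum to 1, so they can be read as proportions; the raw index values sum to 174 here, which has no natural interpretation.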

*The problem with the Search Index is that you can't interpret it as a proportion. It feels like a proportion because the maximum value is 100, the minimum is 0, and every Index value falls between 0 and 100. But what fails is that the sum of the Index values does not add up to 100.*

If you're analyzing 5 items, and every item has equal numbers of searches, then every item has index value of 100. The sum of the indices is 500.

If you have 10 items, all equally popular, every item still has index 100, but the sum of the indices is now 1000.

If you have 5 items, and the search volume drops by half for each lower rank, then the indices would be 100, 50, 25, 13, and 6 (with rounding). For 10 items, the indices would be 100, 50, 25, 13, 6, 3, 2, 1, 0, 0 (again with rounding). The sums of the indices are 194 and 200, respectively.
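These sums are easy to check. The sketch below builds the index from hypothetical volumes (only the ratios matter) and rounds halves up, so that 12.5 becomes 13 as in the text; Python's built-in `round` would give 12 because it rounds halves to the nearest even number.

```python
import math


def search_index(volumes):
    """Index each volume to the largest one (= 100), rounding halves up."""
    top = max(volumes)
    return [math.floor(100 * v / top + 0.5) for v in volumes]


def halving_volumes(n):
    """Hypothetical volumes where each rank has half the searches of the one above."""
    return [1024 / 2**rank for rank in range(n)]


equal_5 = search_index([1000] * 5)
equal_10 = search_index([1000] * 10)
halving_5 = search_index(halving_volumes(5))
halving_10 = search_index(halving_volumes(10))

print(sum(equal_5), sum(equal_10))      # 500 1000
print(halving_5, sum(halving_5))        # [100, 50, 25, 13, 6] 194
print(halving_10, sum(halving_10))      # [100, 50, 25, 13, 6, 3, 2, 1, 0, 0] 200
```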

If you now try to compare two sets of indices, in our example, one for how to fix something in the house in France and one for how to fix something in the house in Russia, then you have a problem. The sum of the indices in France is a lot larger than the sum of the indices in Russia. So a 49 in France is not the same as a 49 in Russia!

One last example. Say we have three items A, B, C. If these are evenly distributed, then each item should have popularity 33%, and Search Index will be 100. The Search Indices sum to 300. Normalized, we get back to 33% for each item.

Now, if the items have popularity 50%, 33%, and 17%, then the Search Indices are 100, 66, and 34. Note item B has 33% popularity in both cases but in the even distribution case, the Search Index of item B is 100 while in the uneven case, item B has index 66. This is what we mean by the metric not properly capturing the quantity we want to measure.
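The three-item example can be reproduced in a few lines. I feed in the rounded percentages directly as stand-ins for search volumes, since only the ratios matter.

```python
def search_index(values):
    """Index each value to the largest one, which is set to 100."""
    top = max(values)
    return [round(100 * v / top) for v in values]


even = {"A": 33, "B": 33, "C": 33}      # popularity in percent, evenly spread
uneven = {"A": 50, "B": 33, "C": 17}    # popularity in percent, A dominates

even_index = dict(zip(even, search_index(list(even.values()))))
uneven_index = dict(zip(uneven, search_index(list(uneven.values()))))

print(even_index)
print(uneven_index)

# Item B has the same popularity in both lists but a different index.
print(even["B"] == uneven["B"], even_index["B"], uneven_index["B"])
```

Same popularity, different index: the number attached to item B depends on what else happens to be in the list.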