(Photo credit: melystu, Flickr)

Overheard on the news: the University of California system may drop the SAT requirement for college admissions, which could be a body blow to the testing industry because UC is gigantic and the crown jewel of U.S. public universities. (There is an LA Times article about this news, although it's behind a paywall.)

I'll discuss both sides of this argument.

To the supporters of the SAT, test scores are "objective" measures of ability. That perked up my ears because I believe **there is no such thing as objective statistics**. What they mean when they say "objective" is *standardized*: **test scores provide a standard of comparison**, a systematic way of comparing applications. As someone who has evaluated applications, I heartily appreciate the value of having such standards.

You might think that transcripts, which list what courses were taken and what grades were achieved, would constitute objective data. Alas! Two classes with the same title, "Introduction to Algebra," can be completely different. And if we are talking humanities, classes may have unique titles (and contents). The same class with the same title taught by different instructors can also differ. **Properly interpreting transcripts would require a lot of contextual information not found on the applications.**

Is high school GPA better than standardized tests? **Because of grade compression, it has become practically impossible to use grades for evaluation**. The issue here is less subjectivity than lack of variance: how can you differentiate between one A and another A?

The best argument supporting standardized testing is that it offers a benchmark for comparisons - over time, and between students.

***

Now turning to the opponents of the SAT. They like to argue that **the SAT is not objective enough**. They point to studies that show that test items may be biased against certain subgroups, such as women and African Americans. (Chapter 3 of my book **Numbers Rule Your World** (link) covers the fascinating world of how the ETS, the group that runs the SAT, uses statistical techniques to reduce subgroup bias in standardized testing.)

One problem with the opposition is that they have nothing to offer as a substitute. What are they suggesting as the replacement for the SAT?

Perhaps the substitutes are teacher recommendations, high school GPAs, etc. The Achilles heel is that those instruments are no more objective - and, to my mind, clearly more subjective - than the SAT. So there is an inconsistency in this line of reasoning: **they complain that the SAT is not objective enough but they don't have an alternative that is more objective.**

(On a related note, I'm eagerly waiting for the climate skeptics to deliver their own alternative set of models that issue predictions about the future, which can be verified by other scientists.)

For readers who know the material in Chapter 3 of **Numbers Rule Your World** (link), you know how hard it is to remove all subgroup bias from test items. I showed some examples of **test items that look fine but were found to be wildly biased** against certain groups when measured in an experiment. If the primary concern is about subgroup bias, one should also be worried about such bias emanating from human reviews of transcripts, teacher recommendations, application essays, etc.

***

If and when the SAT gets knocked out of the college admissions process, it will just clear the way for another testing company to rebuild this industry.

The data science profession will likely be leading this charge. Many data scientists are already promising predictive models that find the best applicants, best customers, best employees, etc. These models are data-hungry: they require huge mounds of structured data, preferably numeric. Subjective data like teacher recommendations do not lend themselves to such models. But standardized tests, especially multiple-choice questions, effortlessly generate huge mounds of such data. So, what do you think is hot in this startup arena?

Many data science startups administer all kinds of tests, the goal of which is to generate training data for their predictive algorithms. That's why they may kill the SAT but they won't kill standardized testing.

A co-worker pointed me to a Reddit thread that someone put up to showcase a small data-science side project - to establish whether NBA star James Harden's performance is affected by his "affinity for ... strip clubs" (hereafter, SC).

The **analytical plan** used in this project is **typical of many data science projects**. In this post (and possibly a few more), I raise issues that analysts should ponder when conceptualizing this type of analysis. These comments are designed to help those who are interested in advancing the sophistication of their analytical work. I am not re-analyzing the data, or debunking what's been done. I am impressed by what "AngryCentrist" has put together - starting with how AC gathered the data to how AC executed an analytical plan resulting in some conclusions.

***

Let me summarize the analytical game plan as follows (you can read the gory details in this Google doc):

- The outcome variable is James Harden's on-court performance, expressed in an inverse scale (i.e. the larger the metric, the worse his performance). The explanatory variable is the quality rating of SCs in the cities where Harden played away games. The analysis looks at the correlation between these two variables.
- Only the last four seasons of away games in which Harden played are used. In each season, there are usually one or two away games against any given team. Team and city are synonymous except for Los Angeles which fields two teams. (Brooklyn and New York are treated as separate cities.)
- **On-court performance is measured as the number of metrics on which Harden's in-city performance is worse than his season average.** Only six metrics (e.g. turnovers) are considered over four seasons, so this performance metric is a number between 0 and 24. Zero means Harden out-performed his season average on each of the 6 metrics in each of the 4 seasons. By contrast, if Harden did worse than his season average for every metric over those four seasons, the performance metric attains the maximum value of 24. The redditor, AngryCentrist, called this the "number of sub-par performances".
- SC ratings are based on the first 10 results with customer ratings when doing a Google search of the type "[city] strip clubs".
- The analyst concluded that the two variables were positively correlated based on fitting a linear regression model. Higher SC quality rating is associated with worse on-court performance. The r-squared was 20 percent (which is not impressive).
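The mechanics of the plan above can be sketched in a few lines. The city ratings and sub-par counts below are hypothetical stand-ins invented for illustration (the real numbers live in the redditor's Google doc); only the correlation-and-regression step is the point:

```python
import numpy as np

# Hypothetical inputs, one row per away city: average Google rating of the
# "top 10" SCs, and the count of sub-par performances (0 to 24: six metrics
# across four seasons). These numbers are made up for illustration.
sc_rating = np.array([3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5])
subpar_count = np.array([8, 9, 9, 10, 11, 12, 13, 14, 15, 16])

# Fit a simple linear regression: subpar_count ~ slope * sc_rating + intercept
slope, intercept = np.polyfit(sc_rating, subpar_count, 1)

# For a one-variable linear fit, r-squared is the squared correlation.
# (The actual analysis reported an r-squared of about 20 percent.)
r = np.corrcoef(sc_rating, subpar_count)[0, 1]
print(f"slope = {slope:.1f}, r-squared = {r**2:.2f}")
```

A positive slope is what the analyst read as "higher SC quality rating, worse on-court performance."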

***

For this post, I focus on the explanatory variable, the SC average rating, and the concept of **"noise" in our data**.

The SC average rating is an example of an **indirect metric**, a "noisy" proxy for what we want to measure. **Notice that the outcome variable is about a particular person (Harden) while the explanatory variable measures a large set of people**, of which Harden is allegedly a member.

The "noise" comes from several sources. If Harden were not a member of the set, then the explanatory variable has no direct connection with the outcome. So we hope that Harden not only visited SCs in all those cities, but also reviewed his experiences on Google, thus contributing to the average ratings.

If Harden did enjoy himself at the SCs and submitted reviews to Google, his ratings would still be immaterial - just one vote in each average. Since the analysis only extracted the "top 10" SCs from a Google search, the selected SCs are presumably the more popular ones, which makes Harden's ratings even less significant.

If we don't require Harden's ratings to be included in these SC averages, we would at least require that his ratings - if revealed - be highly correlated with the average ratings among Google reviewers. This assumption has two components: Harden would have to patronize the popular SCs, rather than ones that are more exclusive or discreet; and Harden's opinion of his experiences there would have to be similar to that of the average patron.

Let's pretend that Harden visited one club prior to playing an away game in Miami. The average rating of the top 10 clubs in Miami is then not quite up to the task. That specific club might not even be part of the ten. This gap is also part of the "noise" of indirect measurement.

Timing is yet another source of noise. What if Harden engaged in the adult activity after the game rather than before the game?

Timing enters the picture in another way. Adult entertainment venues come and go, just like any other businesses. If Harden had visited a club before a match early in the NBA season, and that club subsequently shut down, then this club would not feature in the Google search, and would be absent from the average rating. Conversely, new clubs that opened after Harden's visit would enter the average rating if they became popular enough. Either of these situations causes the indirect metric to stray away from its proper value.

***

Here's the big picture. **The indirect measure of average quality rating of SCs is a good proxy only under a large set of presumptions.** None of the above conditions is directly observable, so each is presumed.

In general, the further away the measurement is from what is supposed to be measured, the noisier the measurement.

A workaround is to change our theory of the link between SC rating and Harden's performance. Maybe the link does not require physical visits to these establishments and subsequent reviews. Perhaps it is Harden's knowledge of a vibrant SC scene, regardless of active participation, that is doing the trick. This link though cannot be proven or disproven by the data, and thus is also presumed.

**Assumptions are not avoidable. Verifiable assumptions are better than unverifiable ones. A good analysis plan should make the fewest, and weakest assumptions.**

There are lots of myths in data science which are repeated endlessly on social media and the internet. One popular myth says that standardizing variables makes them normal. This is not true!

This myth is found explicitly in the documentation for the StandardScaler class in the popular scikit-learn package (link):

> Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).

Let me first show you a quick **counter-example**, then provide some intuition around the standardization procedure. (Warning: this is a rather long read that is more technical than usual.)

***

(a) I generate 10,000 random numbers between -5 and 7 from a *uniform distribution*. A uniform distribution means that each value between -5 and 7 has an equal chance of appearing. This is clearly NOT a normal distribution (also known as the Bell curve), in which values cluster around the average - in a uniform distribution, the average value appears just as frequently as any other value in the range.

The following histogram shows the count of values for our 10,000 uniform random numbers, and as expected, we see an almost equal representation of every value in the range.

(b) The average of the 10,000 numbers is 1.0 (rounded to 1 decimal place). This falls unsurprisingly on the midpoint of the range -5 to 7. The standard deviation, which measures the spread of the data, is 3.5 (rounded to 1 decimal place).

(c) Now, I standardize the data, which means subtracting the average value and dividing by the standard deviation. This is what the histogram looks like after standardizing the data. You can immediately see that **the standardized data do not look anything like a Bell curve**.

(d) The average of the 10,000 standardized numbers is 0, and the standard deviation is 1. So the standardization formula did change the values of the average and standard deviation but it did not change the shape of the distribution. To see this more clearly, I made the horizontal axis range the same on both charts.

You can see that the middle of the distribution is shifted from 1 to 0. Also, the standardized numbers are squeezed into a smaller range than the original numbers. This reflects dividing each number by the standard deviation (which is dividing by 3.5).
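The four steps above can be reproduced in a few lines of numpy (a sketch; the exact random draws will differ from the ones in my charts):

```python
import numpy as np

rng = np.random.default_rng(0)

# (a) 10,000 random numbers from a uniform distribution between -5 and 7
x = rng.uniform(-5, 7, size=10_000)

# (b) average ~ 1.0 (the midpoint of -5 and 7);
#     sd ~ 3.5 (the range divided by sqrt(12): 12/sqrt(12) = 3.46)
print(round(x.mean(), 1), round(x.std(), 1))

# (c) standardize: subtract the average, divide by the standard deviation
z = (x - x.mean()) / x.std()

# (d) the standardized data have mean 0 and sd 1, but a histogram of z is
#     still flat, not a Bell curve - the shape is unchanged
print(round(z.mean(), 1), round(z.std(), 1))
```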

***

Now, let's develop some intuition about this business of standardization, normal distributions, normal probability plots, and so on.

Here is a generic normal distribution, or a Bell curve:

One beautiful thing about a normal distribution is that it is completely determined by its average value and its standard deviation. The average pins the midpoint of the distribution, where the familiar bulge of the distribution sits. The standard deviation controls how widely the data are spread out along the range of values shown on the horizontal axis.

The above normal distribution has average 5.1 and standard deviation 3.0.

We can standardize this distribution, which means subtracting the average and dividing by the standard deviation. As we learned earlier, the effect is to recenter the midpoint to zero, and to fix the standard deviation to 1. Here are the histograms of the original normal variables, and the standardized normal variables.

In this case, the standardized variables do look like a Bell curve. Well, that is because the original data have the shape of a Bell curve to begin with.

***

Let's introduce a tool that tests whether a given set of data has the shape of the Bell curve. This is called the **normal probability plot**. (Technically, it's a type of **Q-Q plot** in which the reference distribution is the normal.)

This plot exhibits a straight line if the test distribution is a normal distribution. Deviation from the normal distribution is revealed by bending on the left and right edges of the sequence of dots. In the following chart, I test the set of random uniform numbers from before. You can see the bending which tells us that those numbers are not shaped like a normal distribution.

The next normal probability plot asks whether the *standardized* uniform numbers we generated above are normally distributed, and the answer is a resounding no! You can see the bending on both ends of the curve.

Can we salvage this situation by generating more data? No! Here is what the plot looks like when I generate 1 million numbers instead of 10,000. The bending does not go away.

As before, the average value of the 1 million standardized numbers is 0 and the standard deviation is 1 but the shape after standardization is not normal. **Standardizing variables does not make them normal.**
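Normal probability plots are easy to produce with scipy (a sketch; when `fit=True`, which is the default, `scipy.stats.probplot` also returns the correlation `r` between the dots and the straight reference line, which serves as a crude "straightness" score):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 10_000

# Standardized uniform data (flat shape) vs. genuinely normal data
uniform_z = stats.zscore(rng.uniform(-5, 7, size=n))
normal_z = stats.zscore(rng.normal(size=n))

# probplot returns ((osm, osr), (slope, intercept, r));
# r close to 1 means the dots hug the straight reference line
_, (slope_u, intercept_u, r_u) = stats.probplot(uniform_z)
_, (slope_n, intercept_n, r_n) = stats.probplot(normal_z)
print(f"straightness: uniform {r_u:.4f}, normal {r_n:.4f}")
```

The standardized uniform data score visibly below the normal data, no matter how many numbers we generate: standardization did not make them normal.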

***

The **standard normal distribution** has mean 0 and standard deviation 1. A normal distribution is completely determined by those two statistics. However, the reverse is not generally true! **If a distribution has mean 0 and standard deviation 1, it does not mean we have a standard normal distribution, or any normal distribution.**

The following uniform distribution in the range [-sqrt(3), sqrt(3)] has mean 0 and standard deviation 1, by construction. Here is what the histogram looks like:

This does not look anything like "standard normally distributed data", even though it has mean 0 and sd 1. Every value, including the average value, has similar representation.

(Standardizing this distribution does nothing at all since we'd be subtracting the mean of zero, and dividing by 1.)
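Why this construction works: a uniform distribution on [a, b] has mean (a+b)/2 and standard deviation (b-a)/sqrt(12). Plugging in a = -sqrt(3) and b = sqrt(3) gives mean 0 and standard deviation 2*sqrt(3)/sqrt(12) = 1 exactly. A quick numerical check (a sketch):

```python
import math
import numpy as np

a, b = -math.sqrt(3), math.sqrt(3)

# Analytic moments of Uniform(a, b)
mean = (a + b) / 2               # 0 by symmetry
sd = (b - a) / math.sqrt(12)     # 2*sqrt(3)/sqrt(12) = 1

# Empirical check on a large sample
rng = np.random.default_rng(0)
sample = rng.uniform(a, b, size=100_000)
print(mean, sd, round(sample.mean(), 2), round(sample.std(), 2))
```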

***

Why then do we standardize data? First, the average of the standardized data is fixed at zero. Next, the standard deviation is fixed at 1. The standard deviation is the square root of the average squared distance of the data from the average, so fixing it at 1 constrains the dispersion of the standardized data around zero: if one value strays far from zero, it inflates the average squared distance, which must be offset by other values huddling closer to zero.

Standardization places different data sets on the same scale so that they can be compared systematically. It does not turn non-normal data into normal data.
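This legitimate use is easy to demonstrate: put two features with wildly different scales on a common footing. A sketch in plain numpy (the column-wise arithmetic below is exactly what scikit-learn's StandardScaler does; the income and GPA numbers are made up for illustration):

```python
import numpy as np

# Two features on very different scales:
# annual income (tens of thousands) and GPA (0 to 4)
X = np.array([[52_000.0, 3.1],
              [87_000.0, 3.9],
              [61_000.0, 2.8],
              [45_000.0, 3.5]])

# Column-wise standardization: subtract each column's mean,
# divide by each column's standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Each column now has mean 0 and sd 1, so the two features are comparable,
# but each column keeps the shape of its original distribution
print(Z.mean(axis=0).round(6), Z.std(axis=0).round(6))
```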

Over lunch, my friend complained that mobile app developers are silently using his mobile data when he's not even using the app. That's a pet peeve of mine!

As a quick review of what's happening in the mobile ether, see this Washington Post article, appropriately titled "While you're asleep, your iPhone stays busy". These apps are sending data about you to home base continuously. In a single week, the reporter found over 5,000 active trackers on his iPhone. What was not mentioned: each tracker was using his mobile data while he was asleep.

My mobile subscription plan is tiered. If I go over 2 gigabytes a month, the mobile operator charges me by the pound for incremental usage, and those charges are designed to be exorbitant - in order to create an incentive to buy into the next higher tier.

Notice how the app developer made decisions that can cause users to send more money to mobile operators. The flow of money through the app ecosystem is complex: the users typically do not pay app developers directly. (Some apps, like Netflix, are exceptions as they charge users subscription fees.) Most apps make money through advertising, and it is believed that ad revenues can be lifted by omnipresent data collection about users. The telcos who run the networks sit between app developers and users, collecting broadband access fees from users, an indirect way in which we pay for apps. The telcos also charge app developers bandwidth fees since they need access to the pipes to reach users.

This three-way relationship has become troubled in recent years, as manifested in the "net neutrality" controversy, which pits pipe operators (cable companies, phone operators, etc.) against content providers (Google, Facebook, Netflix, etc.). In 2015, the FCC under Obama sided with big tech, which advocates net neutrality. Then, in 2018, the Trump administration repealed the Obama-era regulation, to the delight of telcos. In the U.S., this issue is still very much alive.

***

I'm staying neutral because net neutrality is a battle between two groups of powerful companies - tech and telecoms. To understand what's going on, let's flash back to the beginnings of the commercial Internet.

At the start, both groups had their interests aligned. They understood that the fixed-fee plan for Internet access was the key to unlocking the potential of the World Wide Web. When the benefits of the Internet were not yet clear to everyone, charging by the pound of data transferred would have stifled adoption. One should be clear about this - the fixed-fee plan has never been egalitarian; it is well known that a small percentage of "hogs" account for 90% of total bandwidth usage. In effect, most users are subsidizing the few bandwidth hogs.

Eventually, the pipe providers discovered that they got the short end of the stick. As Internet technologies advanced, applications started gobbling up more and more bandwidth. Streaming video, as popularized by Netflix and YouTube, uses massively more bandwidth than emails and simple webpages. Internet phone calls require synchronizing the traffic, which adds to the complexity of traffic management. To maintain service levels, the telcos must invest heavily in new pipes, new equipment, new software, etc.

The adoption of fixed-fee plans constrains the telcos from receiving additional revenues to cover those costs. Thus, they are mad. They want the app and content developers to share the growing costs.

***

The telcos demanded that Internet traffic be differentiated. The cost of provisioning a video stream is much higher than the cost of sending an email. The email is a tiny file, maybe a few kilobytes, while a video stream can exceed 1 gigabyte (1 GB = 1 million KB). We can generally tolerate waits of seconds or minutes for emails to arrive, but when video data are delayed, we complain about buffering. Emails can arrive out of sequence, but the next section of video can't be shown until the previous section has arrived. The telcos thus want to charge more money to deliver video streams compared to emails.

Notably, the telcos made a calculation here. They chose not to raise this money from their subscribers (not yet, more to come). Instead, they turned to their other source of income: big tech - those businesses that fill the pipes with content and apps. These developers reach their users through the telcos' pipes, and pay telcos for bandwidth.

***

Tech companies are understandably pushing back against the telecoms. Consumers love Internet phone calls and video streaming that is "free" or much cheaper than cable TV. The push to differentiate Internet traffic directly impacts big tech's profit margins. To fight back, the big players banded together and coined the principle of "net neutrality".

The "net neutrality" principle demands that all types of Internet traffic be treated the same. The principle appeals to our notion of equal and fair treatment, but note that the fairness is applied to inanimate, manufactured objects ("bits" of data), which cannot sense unfairness, rather than to people with feelings. The parties injured by the alleged lack of fairness are the tech giants, not users (not yet, more to come).

Net neutrality also appeals to nostalgia. In the beginning, the Internet revolution succeeded because of its single-minded pursuit of turning everything into "bits". Theoretically, emails, videos, images, voice, etc. can all be converted into streams of 1s and 0s and pushed through networks of pipes regardless of their aggregate identities. This tactic vastly simplified communications technologies, and dominated in applications that are low-bandwidth and asynchronous.

Over time, Internet technologies encroached on voice and video applications. It is no longer true that all traffic can be treated the same - that is, anonymously, without knowing its type. So if "neutrality" means equal treatment, big tech is demanding equal treatment when it comes to bandwidth pricing but not when it comes to service level - the tech giants are not saying they'd accept the same quality of service for video and voice as for emails!

***

The reason why I'm staying neutral in this debate is that, as users, we are in a lose-lose situation. It doesn't matter who wins this net neutrality debate between two gigantic, highly profitable industries. One side will end up losing revenues to the other side. The losing side will then try to recoup its lost revenues by making users pay more. If the cable companies lose, then cable broadband rates will go up even more. If net neutrality loses, then the Netflix subscription fee will eventually go up.

Net neutrality will not be neutral to consumer pocketbooks.