If you follow sports, you could not avoid the Novak Djokovic saga at the Australian Open, which is scheduled to start this week. In brief, Australia, having pursued a zero Covid policy for most of the pandemic, only allows vaccinated visitors to enter. Djokovic, who's the world #1 male tennis player, is also a prominent anti-vaxxer. Much earlier in the pandemic, he infamously organized a tennis tournament, which had to be aborted when several players, including himself, caught Covid-19 (link). He is still unvaccinated, and yet he was allowed into Australia to play the Open. People are upset. Some players who got themselves vaccinated in order to play in the tournament are not happy. Spectators, who also must be vaccinated in order to watch the matches in person, are not amused.
When the public learned that Djokovic received a special exemption, the Australian government decided to cancel his visa. Djokovic's camp, however, proceeded to fight his case in court. This then became messier and messier, as the superstar told his side of the story. His parents, his fans, and the Serbian government aggressively supported the player. [Djokovic lost in court for the second time this Sunday, and was deported and no longer could play in the tournament.]
In the midst of it all, some enterprising data journalists uncovered tantalizing clues that demonstrate that Djokovic's story used to obtain the exemption is full of holes. It's a great example of the sleuthing work that data analysts undertake to understand the data.
***
A central plank of the tennis player's story is that he tested positive for Covid-19 on December 16. This test result provided grounds for an exemption from vaccination, although the Australian government tightened entry requirements due to the Omicron surge. The timing of the test result was convenient, raising the question of whether it was faked. Intriguingly, Djokovic attended a children's event the day after he said he tested positive, and also gave an in-person interview to a French reporter two days after. His team maintained that the test was authentic, and offered evolving explanations and apologies for his not isolating after testing positive.
Digital breadcrumbs caught up with Djokovic. As everyone should know by now, with every email receipt, every online transaction, and every use of a mobile app, you are leaving a long trail for investigators. It turns out that test results from Serbia include a QR code. A QR code is nothing but a fancy bar code. It's not an encrypted message that can only be opened by authorized people. Since Djokovic's lawyers submitted the test result in court documents, data journalists from the German newspaper Spiegel, partnering with the consultancy Zerforschung, scanned the QR code, and landed on the Serbian government's webpage that informs citizens of their test results.
The information displayed on screen was limited and not very informative. It just showed whether the test result was positive or negative, and a confirmation code. What caught the journalists' eyes was that when they scanned the QR code multiple times during the investigation, they saw Djokovic's test result flip-flop. At 1 pm, on January 10, the test was shown as negative (!) but about an hour later, it appeared as positive. That's the first red flag.
Since statistical sleuthing inevitably involves guesswork, we typically want multiple red flags before we sound the alarm.
The next item of interest is the confirmation code which consists of two numbers separated by a dash. The investigators were able to show that the first number is a serial number. This is an index number used by databases to keep track of the millions of test results. In many systems, this is just a running count. If it is a running count, data sleuths can learn some things from it. (This is why even so-called metadata can reveal more than you think. Djokovic may have become the latest victim.)
Djokovic's supposedly positive test result on December 16 has serial number 7371999. If someone else's test has a smaller number, we can surmise that the person took the test prior to Dec 16, 1 pm. Similarly, if someone took a test after Dec 16, 1 pm, it should have a serial number larger than 7371999. There's more. The gap between two serial numbers provides information about the duration between the two tests. Further, this type of index is hard to manipulate. If you want to fake a test in the past, there is no index number available for insertion if the count increments by one for each new test! (One can of course insert a fake test right now before the next real test result arrives.)
The researchers compared the gaps in these serial numbers with the official tally of tests conducted within a time window, and felt satisfied that the first part of the confirmation code is an index that effectively counts the number of tests conducted in Serbia. Why is this important?
It turns out that Djokovic's lawyers submitted another test result to prove that he had recovered. The negative test was supposedly conducted on December 22. What's odd is that this test result has a smaller serial number than the initial positive test result, suggesting that the first (positive) test may have come after the second (negative) test. That's red flag #2!
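The ordering check behind red flag #2 can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the journalists' code. The positive test's serial number (7371999) comes from the text above; the serial number given to the negative test is a made-up placeholder, since the real value is not quoted here, and the names `LabResult` and `out_of_order_pairs` are invented for this sketch.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations


@dataclass(frozen=True)
class LabResult:
    label: str
    claimed_date: date  # the date presented to the court
    serial: int         # first part of the confirmation code


def out_of_order_pairs(results):
    """Return (earlier, later) label pairs where the test with the earlier
    claimed date carries the larger serial number."""
    ordered = sorted(results, key=lambda r: r.claimed_date)
    flagged = []
    for a, b in combinations(ordered, 2):
        if a.claimed_date < b.claimed_date and a.serial > b.serial:
            flagged.append((a.label, b.label))
    return flagged


results = [
    LabResult("positive", date(2021, 12, 16), 7371999),  # serial quoted in the text
    LabResult("negative", date(2021, 12, 22), 7300000),  # PLACEHOLDER serial, not the real value
]
print(out_of_order_pairs(results))
```

If the serial number really is a running count, a clean set of records should return an empty list; any pair that comes back is worth a second look.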
To get to this point, the detectives performed some delicious work. The landing page from the QR code does not actually include a time stamp, and the lack of one would have been a huge blocker to the investigation. But... digital breadcrumbs.
While human beings don't need index numbers, machines almost always do. The URL of the landing page actually contains a disguised date. For the December 22 test result, the date was shown as 1640187792. Engineers will immediately recognize this as a "Unix date". A simple decoder returns a human-readable date: December 22, 16:43:12 CET 2021. So this second test was indeed performed on the day the lawyers had presented to the court.
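For readers who want to try the decoding themselves, here is a minimal Python sketch. It assumes the value is a count of seconds since January 1, 1970 (UTC), which is the usual meaning of a Unix timestamp, and it treats CET as a fixed UTC+1 offset, which is adequate for a December date. The function name `decode_unix` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# CET is UTC+1; a fixed offset is enough in December (no daylight saving).
CET = timezone(timedelta(hours=1), name="CET")


def decode_unix(ts: int) -> datetime:
    """Convert seconds since the Unix epoch into a CET-aware datetime."""
    return datetime.fromtimestamp(ts, tz=CET)


stamp = decode_unix(1640187792)  # value from the December 22 landing-page URL
print(stamp.strftime("%B %d, %H:%M:%S %Z %Y"))
# prints: December 22, 16:43:12 CET 2021
```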
Dates are also a type of index, which can only increment. Surprisingly, the Unix date on the earlier positive test translates to December 26, 13:21:20 CET 2021. If our interpretation of the date values is correct, then the positive test appeared 4 days after the negative test in the system. That's red flag #3.
To build confidence that they interpreted dates correctly, the investigators examined the two possible intervals: December 16 and 22 (Djokovic's lawyers), and December 22 and 26 (apparent online data). Remember the jump in serial numbers in each period should correspond to the number of tests performed during that period. It turned out that the Dec 22-26 time frame fits the data better than Dec 16-22!
***
The stuff of this project is fun - if you're into data analysis. The analysts offer quite strong evidence that there may be something smelly about the test results, and they have a working theory about how the tests were faked.
That said, statistics do not nail fraudsters. We can show plausibility or even high probability but we cannot use statistics alone to rule out any outliers. Typically, statistical evidence needs to be corroborated by physical evidence. That's one of the key takeaways in Chapter 5 of Numbers Rule Your World (link).
***
Some of the reaction to the Spiegel article demonstrates what happens with suggestive data that nonetheless are not infallible.
Some aspects of the story were immediately confirmed by Serbians who have taken Covid-19 tests. The first part of the confirmation number appears to change with each test, and the more recent serial number is larger than the older ones. The second part of the confirmation number, we learned, is a kind of person ID, as it does not vary between successive test results.
One part of the story did not hold up. The date found on the landing page URL does not seem to be the date of the test, but the date on which someone requests a PDF download of the result. This behavior can easily be verified by anyone who has test results in the system.
Because of this one misinterpretation, the data journalists seem to have lost a portion of readers, who now consider the entire data investigation debunked. Unfortunately, this reaction is typical. It's even natural in some circles. It's related to the use of "counterexamples" to invalidate hypotheses. Since someone found the one thing that isn't consistent with the hypothesis, the entire argument is thought to have collapsed.
However, this type of reasoning should be avoided in statistics, which is not like pure mathematics. One counterexample does not spell doom to a statistical argument. A counterexample may well be an outlier. The preponderance of evidence may still point in the same direction. Remember there were multiple red flags. Misinterpreting the dates does not invalidate the other red flags. In fact, the new interpretation of the dates cannot explain the jumbled serial numbers, which do not change when someone requests a PDF.
***
Statistical investigations can be very powerful, and have gained strength in the Big Data era due to digital breadcrumbs. Nevertheless, statistical arguments suggest plausibility or probability, never certainty. Short of a confession or whistle-blowing or leaking, those who are inclined to disbelieve can always find reasons to disbelieve. Similarly, interested investigators can easily fool themselves.