The UK has joined California, Texas, and others on the list of jurisdictions that couldn't keep their testing data straight. It has just admitted that over the last week or so, it under-counted Covid-19 cases by 30 percent.
(Graphic from BBC.)
The official reason is an Excel limitation, which is a brilliant stroke of PR. (I have previously praised the PR firm behind the UK's coronavirus response; see here.) This message is aimed at those who do not use Excel, which, I guess, is the majority of the UK population. Among those who have been exposed to Excel, there exists a set of vocal dissenters, people who hyperventilate at the mere mention of Microsoft's software, and they will help propagate the message.
Anyone who uses Excel a lot will not fall for this.
Unlike California and Texas, where officials remain mum about the details of their supposed faux pas, the UK government explained to the Daily Mail what this Excel error is. We are to believe that there is a master spreadsheet recording all cases, that this spreadsheet ran out of space, that nobody noticed until a backlog of 16,000 cases had stacked up, and that they fixed the issue by splitting the data across multiple spreadsheets.
One word: implausible.
***
The limit on the number of rows in a single Excel worksheet is above 1 million (1,048,576, to be exact). The cumulative number of cases in the UK as of October 2 was fewer than 500,000. Assuming that they collect data for each individual case, so that there is one row of data per case, the spreadsheet should be at least 50 percent empty.
I'm not convinced that's the workflow. It would seem more likely that there is one line per reporting unit, not one line per individual case. If that is so, then the spreadsheet is even smaller.
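To put rough numbers on this back-of-envelope argument (assuming one row per case and the modern .xlsx row limit; the 500,000 figure is the approximate cumulative case count cited above):

```python
# Back-of-envelope check: how full would a one-row-per-case master sheet be?
XLSX_MAX_ROWS = 1_048_576  # row limit of a single worksheet in modern Excel (.xlsx)

cumulative_cases = 500_000  # approximate UK cumulative case count as of October 2
fill_fraction = cumulative_cases / XLSX_MAX_ROWS

print(f"Sheet is {fill_fraction:.0%} full")  # prints: Sheet is 48% full
```

In other words, even under the most data-hungry assumption, the master sheet would be less than half full.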
Let's play along and assume somehow someone created a process filled with duplicates and errors or other horrors such that the spreadsheet manages to overrun 1 million rows.
We now have to believe that the Excel spreadsheet is maintained by a machine without a single human ever opening it. If a human is transferring the data onto this master spreadsheet, as I expect to be the scenario, how is it that this live person is not alerted when the sheet runs out of space? One simply cannot add any more rows to the bottom of the sheet!
If a machine is programmed to add rows to this spreadsheet, we now have to believe that neither Excel nor the program returns any error messages or warnings when the spreadsheet runs out of room!
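Any competent automated writer would guard against overflowing silently. A minimal sketch of such a guard (a hypothetical function, not the actual UK pipeline):

```python
# Worksheet row limits for the two Excel formats
XLS_MAX_ROWS = 65_536       # legacy .xls worksheet limit
XLSX_MAX_ROWS = 1_048_576   # modern .xlsx worksheet limit

def append_rows(sheet_rows, new_rows, max_rows=XLSX_MAX_ROWS):
    """Append new_rows to sheet_rows, refusing to overflow the sheet silently."""
    overflow = len(sheet_rows) + len(new_rows) - max_rows
    if overflow > 0:
        raise OverflowError(
            f"sheet would exceed {max_rows} rows; {overflow} rows would be lost"
        )
    sheet_rows.extend(new_rows)
    return sheet_rows
```

A loader that never raises (or silently swallows) such an error is precisely the "black hole" scenario we are being asked to believe.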
Let's keep playing along. Against all odds, the spreadsheet is somehow completely full and nothing more can be added. Neither humans nor machines notice anything odd. The new data simply fell into a black hole.
Now, how do we explain that the case counts kept increasing each day for a week before anyone noticed the missing cases?
From September 25 to October 2, the period in question, the UK reported on average 4,368 cases per day before admitting the mistake. The revised data bumped the daily average to 6,348 cases, which means the public was given a number about 30 percent below the real tally.
According to the Daily Mail, "some 16,000 confirmed infections had to be added to the daily totals running back more than a week." If we buy the Excel story, what should have happened is that one day, the spreadsheet ran out of space, and from that point on, the case count plateaued with no more additions.
What should have happened is suddenly no more cases are recorded, and the error is immediately obvious.
P.S. Let's keep playing. So they keep one gigantic spreadsheet not just of positive cases but of all test results, from which they extract the cases. As of October 2, the UK said it had "processed" 22.7 million PCR tests. Remember, the limit of a single spreadsheet is 1 million rows, which would have been blown past long ago.
***
Dr. Backlog sure chose a curious time to jump the pond. The undercounting occurred right as cases were spiking in the UK, France, Spain, and all over Europe, and governments had to make the difficult decision of whether or not to re-impose restrictions to ward off a second wave.
Yes, I will be the first to admit the data world is messy, and there can be innocent explanations for data discrepancies. However, any experienced data analyst knows one cannot lazily cling to the most convenient explanation; all possible explanations must be investigated, including politically incorrect ones.
What seems to have happened is that the data was being converted from .csv to .xls (not .xlsx), and the older format has a limit of 65,536 rows. The complicating factor is that the upload used 'several' rows for each case, which is why nobody noticed the limitation: the ceiling on each upload, and hence the maximum number of cases that could be recorded, was 65,536 divided by 'several', so it wasn't even as if anyone could spot that the case count was (e.g.) 8,192 several days in a row.
Posted by: Phil | 10/05/2020 at 03:36 PM
Phil: Thanks for the additional info. If they really want to blame it on software, they should publish their entire workflow. I still don't think the pieces fit together. How would we explain the day-by-day discrepancy in the graphic, including the day with zero error?
Are they saying that this process has no basic auditing functionality? How about a simple report that lists (a) the total number of rows in the csv (input) and (b) the total number of rows in the xls (output)?
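Such an audit is a few lines of code. A sketch (in Python, with a made-up function name; not anyone's actual pipeline):

```python
import csv

def audit_row_counts(csv_path, rows_written):
    """Compare data rows in the input csv against the row count reported for the output."""
    with open(csv_path, newline="") as f:
        rows_read = sum(1 for _ in csv.reader(f)) - 1  # subtract the header row
    if rows_read != rows_written:
        raise ValueError(f"audit failed: read {rows_read} rows, wrote {rows_written}")
    return rows_read
```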
Posted by: Kaiser | 10/05/2020 at 04:17 PM
One sheet per reporting unit? Here in Japan they are entering data by hand from reports sent by fax, so nothing surprises me.
Posted by: Tore | 10/05/2020 at 07:39 PM
I saw one reference to cases being stored in columns, which would mean hitting the case limit much earlier.
Posted by: KL | 10/05/2020 at 08:39 PM
Keep the ideas flowing. The problem is that all these pieces so far don't fit together. If it's not one master sheet but one sheet per reporting unit, then the scale of the data is much smaller. If the limit is much lower, like 65,000, as opposed to 1 million, then they should have run into problems much earlier.
When data analysts investigate data problems, just like coders debugging code, we are striving to draw the entire line from the start to the end, and so far, we've been fed some breadcrumbs, and they don't seem to have come from the same piece of bread.
Posted by: Kaiser | 10/05/2020 at 09:10 PM
Yikes.
XLSX is thirteen years old; it supplanted XLS as the default Excel format the moment it was released.
What aging VB hacker is being allowed to place XLS at the core of a new, critical, automated workflow in 2020?
Posted by: Matt VE | 10/06/2020 at 12:12 AM
This article from the BBC points to the limit being around 1400 cases per template (https://www.bbc.co.uk/news/technology-54423988).
The UK Civil Service isn't famous for its up-to-date IT systems or for going with expensive vendors. I could see a plausible scenario where a group of data-entry staff spend hours a day getting CSV files from various private companies and copy-pasting them into a template file for upload. The CSVs will be different sizes at different times, and a straight copy-paste will trigger a warning about lost formatting between file types, as well as the data-loss message. You're behind on your copy-pasting, so you keep dismissing the prompts and uploading. The numbers vary, so nothing smells off to begin with, until your fourth day of near-identical case numbers has someone asking a question.
I'd guess they've used XLS because all of their systems use Windows, but some are possibly still on XP or running old macros that haven't been ported. So the message from up high could be "Everyone supports XLS. Use that."
Sadly, if they had just stuck with CSV into a central system, they wouldn't have this issue...
Posted by: Paul | 10/06/2020 at 04:19 AM
Theories about the limit being very low still have a missing piece: why now? Why not earlier?
Be careful about offering a purely technical solution. Almost every explanation - including the many thoughtful comments above - includes some "dumb" or "clueless" humans, and if that is to be part of the story, those types of mistakes cannot be prevented merely by switching to a different technology.
Take, for example, a purely CSV solution. They're going to tell us that they belatedly realized that some of their files got truncated because they didn't expect to find a comma inside one of the fields.
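For the record, that particular failure is avoided by any proper CSV parser, which quotes fields containing commas. A quick illustration with a made-up record:

```python
import csv, io

record = ["E08000035", "Leeds, City of", "1432"]  # one field contains a comma

buf = io.StringIO()
csv.writer(buf).writerow(record)   # the writer quotes the offending field
line = buf.getvalue()
print(line)                        # E08000035,"Leeds, City of",1432

naive = line.strip().split(",")              # naive split: field gets broken in two
proper = next(csv.reader(io.StringIO(line))) # proper parse: all three fields intact

print(len(naive), len(proper))     # prints: 4 3
```

The bugs creep in when someone splits on commas by hand instead of using the parser, which is a human mistake, not a file-format one.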
Posted by: Kaiser | 10/06/2020 at 12:15 PM