Comments

Phil

What seems to have happened is that the data was being converted from .csv to .xls (not .xlsx), and the older format has a limit of 65536 rows. The complicating factor is that the upload used 'several' rows for each case. Which is why nobody noticed the limitation - the ceiling on each upload, and hence the maximum number of cases which could be recorded, was 65536 / several, so it wasn't even as if anyone could spot that the case count was (e.g.) 8192 several days in a row.
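
To make the arithmetic concrete, here is a minimal Python sketch. The 65,536-row limit is the one quoted above; the rows-per-case values are purely hypothetical, since the actual number behind "several" isn't known, and the function names are made up for illustration.

```python
XLS_MAX_ROWS = 65_536  # row limit of the legacy .xls format


def max_cases_per_file(rows_per_case: int, row_limit: int = XLS_MAX_ROWS) -> int:
    """Largest number of complete cases that fit under the row limit."""
    if rows_per_case <= 0:
        raise ValueError("rows_per_case must be positive")
    return row_limit // rows_per_case


def cases_lost(total_cases: int, rows_per_case: int,
               row_limit: int = XLS_MAX_ROWS) -> int:
    """Cases silently dropped if everything past the row limit is truncated."""
    return max(0, total_cases - max_cases_per_file(rows_per_case, row_limit))


# Hypothetical rows-per-case values; the real figure is not stated.
for rows_per_case in (1, 4, 8):
    print(rows_per_case, max_cases_per_file(rows_per_case))
```

With 8 rows per case the ceiling comes out at 8192 cases, which is the example figure used above. The ceiling is never a round-looking 65536, so it would be easy to miss.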

Kaiser

Phil: Thanks for the additional info. If they really want to blame it on software, they should publish their entire workflow. I still don't think the pieces fit together. How would we explain the discrepancy by day in the graphic, including the day with zero error?

Are they saying that this process has no basic auditing functionality? How about a simple report that lists (a) the total number of rows in the csv (input) and (b) the total number of rows in the xls (output)?
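
A row-count audit of this kind could be as small as the sketch below. It is only an illustration under assumptions: the class and function names are invented, the sample data is made up, and the output row count is passed in as a plain number because how it would be read from the spreadsheet depends on tooling not described here.

```python
import csv
import io
from dataclasses import dataclass
from typing import TextIO


@dataclass(frozen=True)
class RowAudit:
    input_rows: int   # (a) records found in the csv input
    output_rows: int  # (b) rows found in the converted output

    @property
    def missing_rows(self) -> int:
        return self.input_rows - self.output_rows

    @property
    def ok(self) -> bool:
        return self.missing_rows == 0


def count_csv_records(handle: TextIO) -> int:
    """Count records with the csv module, so quoted line breaks count once."""
    return sum(1 for _ in csv.reader(handle))


def audit_conversion(csv_handle: TextIO, output_rows: int) -> RowAudit:
    """Compare the input record count with the reported output row count."""
    return RowAudit(count_csv_records(csv_handle), output_rows)


# Made-up sample: 4 records in, but the (hypothetical) converter reports 3 out.
sample = io.StringIO("id,result\n1,positive\n2,negative\n3,positive\n")
report = audit_conversion(sample, output_rows=3)
status = "OK" if report.ok else f"MISMATCH: {report.missing_rows} row(s) lost"
print(report.input_rows, report.output_rows, status)
```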

Tore

One sheet per reporting unit? Here in Japan they are entering data by hand from reports sent by fax, so nothing surprises me.

KL

I saw one reference to cases being stored in columns, which would mean hitting the case limit much earlier.

Kaiser

Keep the ideas flowing. The problem is that all these pieces so far don't fit together. If it's not one master sheet but one sheet per reporting unit, then the scale of the data is much smaller. If the limit is much lower, like 65,000, as opposed to 1 million, then they should have run into problems much earlier.

When data analysts investigate data problems, just like coders debugging code, we are striving to draw the entire line from the start to the end, and so far, we've been fed some breadcrumbs, and they don't seem to have come from the same piece of bread.

Matt VE

Yikes.

XLSX is thirteen years old, and from the moment it was introduced it supplanted XLS as the default Excel format.

What aging VB hacker is being allowed to place XLS at the core of a new, critical, automated workflow in 2020?

Paul

This article from the BBC points to the limit being around 1400 cases per template (https://www.bbc.co.uk/news/technology-54423988).

The UK Civil Service isn't famous for its up-to-date IT systems or for going with expensive vendors. I could see a plausible scenario where there is a group of data entryists spending hours a day getting CSV files from various private companies and copy-pasting them into a template file for upload. The CSVs will be different sizes at different times, and a straight copy-paste will raise a warning message about lost formats between file types, as well as the data loss message. You're behind on your copy-pasting, so you keep dismissing the prompts and uploading. The numbers vary, so nothing smells off to begin with, until your fourth day of near-identical case numbers has someone asking a question.

I'd guess they've used XLS because all of their systems use Windows, but some are possibly still on XP or running old macros that haven't been ported. So the message from up high could be "Everyone supports XLS. Use that."

Sadly, if they had just stuck with CSV into a central system, they wouldn't have this issue...

Kaiser

The theories about the limit being very low are missing one piece: why now? Why not earlier?

Be careful about offering a purely technical solution. Almost every explanation - including the many thoughtful comments above - includes some "dumb" or "clueless" humans, and if that is to be part of the story, those types of mistakes cannot be prevented merely by switching to a different technology.

Take, for example, a purely CSV solution. They're going to tell us that they belatedly realized some of their files got truncated because they didn't expect to find a comma inside one of the fields.
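
The embedded-comma problem is easy to reproduce. The sketch below uses a made-up record (not real data) and contrasts a naive split on commas with Python's standard csv module, which honors quoted fields.

```python
import csv
import io

# Made-up record: the middle field contains a comma and is therefore quoted.
line = '1042,"Smith, John",positive'

# Naive parsing splits inside the quoted field and yields one field too many.
naive = line.split(",")

# The csv module respects the quotes and recovers the three intended fields.
parsed = next(csv.reader(io.StringIO(line)))

print(len(naive), naive)
print(len(parsed), parsed)

# Writing with csv.writer re-quotes the field, so a round trip is lossless.
buffer = io.StringIO()
csv.writer(buffer).writerow(parsed)
print(buffer.getvalue().strip())
```

A hand-rolled parser that splits on commas would either shift the remaining columns or reject the row, and neither failure is prevented by the choice of file format alone.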

Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.