There is one mode of data analysis that rarely generates anything of real value. Here, the analyst asks the question: what can I do with this dataset? Unfortunately, a dataset can be very large and still not contain the information needed to address a meaningful question.
Let's take the Citibike dataset as an example. It contains transactional data recording each time a rider checks out or checks in a bicycle.
This leads to a pile of predictable analyses and data graphics.
For example:
This dynamic map shows the popularity of different bike stations throughout the day.
Another example:
This one shows the top 10 bike stations by popularity.
Those are the typical outputs of analyzing Citibike data. The problem: why would Citibike management care?
***
Instead, we should ask Citibike management what problems they are concerned with.
Someone I recently interviewed gave a good answer: maybe the managers want to know whether there are riders who could not get bikes or there are bicycles that are rarely checked out. In other words, excess demand or excess supply. If the analyst comes back able to tell management whether they are facing excess demand or excess supply, that would be a useful outcome.
We'll ignore whether we want the answer at the local or global scale. Let's take one particular neighborhood (say, East Village) for argument's sake. Can we say whether there is excess demand or excess supply of Citibikes in the East Village?
You can now filter the data to only East Village bike stations, and count the number of unique bike identification numbers across the transactional dataset. That gives you a number.
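Here is a minimal sketch of that filter-and-count step, in plain Python. The field names (`bike_id`, `start_station`, `end_station`), the station names, and the handful of trip records are all made up for illustration; they are not Citibike's actual schema or data. I am also assuming that a trip "belongs" to the East Village if it starts or ends at one of its stations, which is only one of several reasonable definitions.

```python
# Toy trip records. Field names and values are hypothetical,
# not taken from the real Citibike files.
trips = [
    {"bike_id": 101, "start_station": "EV Station A", "end_station": "EV Station B"},
    {"bike_id": 101, "start_station": "EV Station B", "end_station": "Midtown Station"},
    {"bike_id": 202, "start_station": "Midtown Station", "end_station": "EV Station A"},
    {"bike_id": 303, "start_station": "Midtown Station", "end_station": "Uptown Station"},
    {"bike_id": 202, "start_station": "EV Station A", "end_station": "EV Station A"},
]

# Hypothetical set of stations we treat as "East Village".
EAST_VILLAGE_STATIONS = {"EV Station A", "EV Station B"}


def unique_bikes(trips, stations):
    """Count distinct bike IDs among trips that start or end at `stations`."""
    bike_ids = {
        trip["bike_id"]
        for trip in trips
        if trip["start_station"] in stations or trip["end_station"] in stations
    }
    return len(bike_ids)


print(unique_bikes(trips, EAST_VILLAGE_STATIONS))
```

Notice that every record in the input is a completed trip. Nothing in the computation can see a rider who wanted a bike and did not get one.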
But what does that number mean?
It turns out the count of unique bicycles is measuring neither supply nor demand.
If there is excess demand, meaning some riders are turned away, that information is not in the transactional dataset, which only records bikes that were actually checked out or checked in. It's unlikely to be in any dataset owned by Citibike. They would have to find creative ways to collect the data.
***
This is just one of many examples in which today's datasets, despite their size, still don't contain the data that hold the key to answering meaningful questions.
Because the Citibike dataset is available, we are bombarded with analyses that focus on imaginary problems of little value.