
Adam Schwartz

Wouldn't it be simpler to take the two possible routes, (A->B->C) and (X->B->C), choosing randomly between them for, say, 4 weeks, and just record the start and end times of each trip?

To my thinking, this muddies the waters a bit in terms of the difference between the A and X segments. But if all the variability is in the B and C segments, then whether or not you save 5 minutes getting to the next platform, it could all get wiped out by an unpredictable next train.

What I encourage folks to look at while at work is the big outcome we care about. That is, if you do something and are able to measure a micro-level outcome but the effect doesn't show up at the macro level, then you probably didn't really do anything at all, no? At work we tend to refer to it as squeezing the balloon: reducing the effort on one part of a project just to have it reappear somewhere else.
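A minimal sketch of the design suggested above, assuming each trip is logged as a (route, total minutes) pair. The function names, the route labels "A" and "X", and the choice of a permutation test are illustrative assumptions, not anything specified in the thread.

```python
import random
from statistics import mean


def assign_routes(n_days, seed=0):
    """Randomly pick a route label for each commuting day."""
    rng = random.Random(seed)
    return [rng.choice(["A", "X"]) for _ in range(n_days)]


def mean_difference(trips):
    """trips: list of (route, minutes) pairs. Returns mean(A) - mean(X)."""
    a = [minutes for route, minutes in trips if route == "A"]
    x = [minutes for route, minutes in trips if route == "X"]
    return mean(a) - mean(x)


def permutation_p_value(trips, n_perm=10_000, seed=0):
    """Share of random relabelings whose absolute mean difference is at
    least as large as the observed one (two-sided permutation test)."""
    rng = random.Random(seed)
    observed = abs(mean_difference(trips))
    routes = [route for route, _ in trips]
    minutes = [m for _, m in trips]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(minutes)  # reshuffle trip times across the fixed labels
        if abs(mean_difference(list(zip(routes, minutes)))) >= observed:
            hits += 1
    return hits / n_perm
```

Only the door-to-door time enters the comparison, which is the macro-level outcome described above. Both routes need at least one recorded trip for the means to be defined.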

Adam Schwartz

As a more practical example of my point, I once shopped my "primary" grocery store for 6 weeks, carefully recording everything I bought, its regular price and the price I paid for it. Then I went and shopped two competing stores for several weeks, again recording all the details of every item I bought, what I would normally pay for it and what I actually paid for it. I assumed I'd need all that detail to figure out which store was better to shop at from a price perspective.

In the end, I learned that there was just too much variability (what I wanted to buy that week vs. what was on sale that week) for the detail to make a big difference. I could've just recorded the macro level ("what did I spend on groceries this week?"), compared, and gotten the same answer. :)

Gene Miller

Did you consider the distribution of wait times for B after arriving from A vs. arriving from X? E.g., sometimes a train on the Express track waits for the next train on the Local track to arrive, and leaves immediately after passengers have had time to transfer. Other sorts of correlations may occur in your situation. The MTA publishes schedules for all its lines, and they are reflected in Google's trip suggestions. I find that they are usually right +/- 60 seconds for trains on the local Seventh Ave line. My guess is that unusual delays occur less than 20% of the time.


Adam/Gene: In my case, I still don't need the last legs because physically I will be on the same train going uptown regardless of how I got to Times Square. This means that the equivalence holds at the level of each sample unit, which is much stronger than equivalence in a probabilistic sense.

Gene: yes, I think I should be able to look at some published MTA statistics to find the wait times for A versus X.
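One way to write down the difference between the two kinds of equivalence, as a sketch under assumed notation: let $F_i^{A}$ and $F_i^{X}$ be the times to reach Times Square on trip $i$ by each route, and $L_i$ the time of the shared last leg. None of these symbols appear in the post, and the second line assumes the last-leg draws are independent of each other and of the first legs.

```latex
% Unit-level equivalence: the same last leg L_i applies to both routes,
% so it cancels exactly in the per-trip difference.
(F_i^{A} + L_i) - (F_i^{X} + L_i) = F_i^{A} - F_i^{X}

% Probabilistic equivalence only: L^{A} and L^{X} share a distribution
% but are separate, independent draws, so their noise remains.
\operatorname{Var}\!\big[(F^{A} + L^{A}) - (F^{X} + L^{X})\big]
  = \operatorname{Var}\!\big[F^{A} - F^{X}\big] + 2\,\operatorname{Var}[L]
```

Under the first line, measuring the last leg adds nothing to the comparison; under the second, it would add variance without changing the expected difference.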


The correct answer is that you move to Tokyo, where the trains and subways run on time. And the interesting statistical question in NY is: why can't MTA do the same?

Anyway, if you can estimate the variance and its distribution, then run a Monte Carlo simulation because the analytical solution is too complicated.
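A minimal sketch of such a Monte Carlo comparison. Every number and distributional choice here is a hypothetical placeholder (normal leg times, a uniform wait over the headway, independent draws for each route); real inputs would come from recorded trips or published schedules.

```python
import random


def simulate_trip(first_mean, first_sd, headway, last_mean, last_sd, rng):
    """One simulated trip, in minutes: first leg + transfer wait + last leg."""
    first = max(0.0, rng.gauss(first_mean, first_sd))  # clip negative draws
    wait = rng.uniform(0.0, headway)  # arrive at a random point in the headway
    last = max(0.0, rng.gauss(last_mean, last_sd))
    return first + wait + last


def compare_routes(n_sims=100_000, seed=0):
    """Return (mean of T_A - T_X, share of simulations where A is faster)."""
    rng = random.Random(seed)
    total_diff = 0.0
    a_faster = 0
    for _ in range(n_sims):
        # Hypothetical parameters: route A has a slower but steadier first leg.
        t_a = simulate_trip(12.0, 2.0, 6.0, 15.0, 3.0, rng)
        t_x = simulate_trip(8.0, 4.0, 6.0, 15.0, 3.0, rng)
        total_diff += t_a - t_x
        if t_a < t_x:
            a_faster += 1
    return total_diff / n_sims, a_faster / n_sims


if __name__ == "__main__":
    mean_diff, share = compare_routes()
    print(f"mean(T_A - T_X) = {mean_diff:.2f} min, A faster in {share:.1%} of runs")
```

Reporting the share of runs in which one route wins, alongside the mean difference, shows how often the nominally slower route still comes out ahead once the variability is taken into account.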


PeterL: I wonder how other cities deal with unexpected delays such as a passenger "falling ill" on the train, which seems to occur with some frequency. Also, passengers aggressively hold doors open because the five minutes they save matter more to them than the many thousands of minutes lost by the others already on board. There could also be structural reasons.
Someone should also help us with the cleanliness issue that seems to be unique to NYC.


Kaiser Fung. Business analytics and data visualization expert. Author and Speaker.