So I recently moved and needed to find the optimal subway ride up to Columbia. I have been going back and forth between my two choices to collect some data to help make up my mind. Both routes require two train exchanges but only the first leg differs. In other words:
Route 1 : A -> B -> C
Route 2 : X -> B -> C
Here, the "nodes" (A, X, B, C) are train lines and the first arrow is the Times Square subway station. There are two ways to get from my apartment to Times Square, after which the two routes are identical.
This means the problem reduces to comparing:
Route 1s : A
Route 2s : X
How long does it take to get to Times Square using line A versus line X? Based on my experience so far, A > X by 5 minutes. Under normal circumstances, X is the choice as I get to Times Square 5 minutes earlier. The entire trip takes 35-40 minutes - the 5 minutes don't seem like a world of difference.
So far, I have ignored perhaps the most important piece of data: how variable are the travel times on line A versus line X? Each of those legs consists of part walking, part waiting at the platform, and part riding on the train. The waiting is the key source of variability.
The 5 minutes' difference was based on smooth transitions on both lines. If you have used NY subways, you'll know that wait times of 5-15 minutes are very common. So if line X tends to require longer waits at the platform than line A, then the 5-minute advantage can easily be wiped out!
So my next data collection task is to figure out the distribution of wait times at the respective platforms, that is, how likely I am to suffer a long wait on each line.
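To make the "wiped out" idea concrete, here is a minimal simulation sketch. All the numbers in it are made up for illustration, not data I have collected: I assume the walk-plus-ride time is 20 minutes on line A and 15 minutes on line X, and that platform waits are uniformly distributed, up to 5 minutes on A and up to 15 minutes on X. The function name and its parameters are likewise hypothetical.

```python
import random


def prob_x_beats_a(base_a=20.0, base_x=15.0,
                   max_wait_a=5.0, max_wait_x=15.0,
                   n_trials=100_000, seed=0):
    """Estimate how often line X reaches Times Square before line A.

    Each leg is modeled as a fixed walk-plus-ride time (base_*) plus a
    platform wait drawn uniformly from 0 to max_wait_* minutes.
    Every default value is an assumption chosen for illustration only.
    """
    rng = random.Random(seed)
    x_wins = 0
    for _ in range(n_trials):
        time_a = base_a + rng.uniform(0, max_wait_a)
        time_x = base_x + rng.uniform(0, max_wait_x)
        if time_x < time_a:
            x_wins += 1
    return x_wins / n_trials


if __name__ == "__main__":
    # X is 5 minutes faster when there is no waiting at all,
    # but under the assumed wait ranges the advantage is not guaranteed.
    print("P(X faster than A):", prob_x_beats_a())
```

Under these assumed wait ranges the two legs have the same average time (22.5 minutes), so the 5-minute head start on X is fully offset by its longer waits. With real wait-time data, the uniform draws could be replaced by resampling from the observed waits.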
***
I cover the average versus variability concept in Chapter 1 of Numbers Rule Your World (link). This concept is related to the signal and noise concept that Nate Silver made famous. The normal difference of 5 minutes is the "signal". The "noise", that is to say, the variability of wait times on the platform, may be so strong that you cannot "see" the signal. This is what I mean by "wiping out the difference."
Wouldn't it be simpler to use the two possible routes (A->B->C) and (X->B->C), say, selecting randomly between them for 4 weeks and just recording the start and end times of the trips?
In my thinking, even though this muddies the waters a bit in terms of the difference between the A and X segments, if all the variability is in the B and C segments, then whether or not you save 5 minutes getting to the next platform, it could all get wiped out by an unpredictable next train.
What I encourage folks to look at while at work is the big outcome we care about. That is, if you do something and are able to measure a micro-level outcome but the effect doesn't show up at the macro level, then you probably didn't really do anything at all, no? We tend to refer to it as squeezing the balloon at work: reducing the effort on one part of a project just to have it reappear somewhere else.
Posted by: Adam Schwartz | 08/16/2016 at 03:18 PM
In a more practical example of my point, I once shopped my "primary" grocery store for 6 weeks, carefully recording everything I bought, its regular price and the price I paid for it. Then I went and shopped two competing stores for several weeks, again recording all the details of every item I bought, what I would normally pay for it and what I actually paid for it. I assumed I'd need all that detail to figure out which store was better to shop at from a price perspective.
In the end, I learned that there was just too much variability (what I wanted to buy that week vs. what was on sale that week) to make a big difference, and I could've just recorded the macro level ("what did I spend on groceries this week?"), compared, and gotten the same answer. :)
Posted by: Adam Schwartz | 08/16/2016 at 03:24 PM
Did you consider the distribution of wait times for B after arriving from A, vs arriving from X? E.g. sometimes a train on the Express track waits for the next train on the Local track to arrive, and leaves immediately after passengers have had time to transfer. Other sorts of correlations may occur for your situation. The MTA publishes schedules for all its lines, and they are reflected in Google's trip suggestions. I find that they are usually right +/- 60 seconds for trains on the local Seventh Ave line. My guess is that unusual delays occur less than 20% of the time.
Posted by: Gene Miller | 08/17/2016 at 01:10 PM
Adam/Gene: In my case, I still don't need the last legs because physically I will be on the same train going up town regardless of how I got to Times Square. This means that the equivalence is at the level of each sample unit, which is much stronger than equivalence in a probabilistic sense.
Gene: yes, I think I should be able to look at some published MTA statistics to learn the wait times for A versus X.
Posted by: Kaiser | 08/17/2016 at 03:41 PM
The correct answer is that you move to Tokyo, where the trains and subways run on time. And the interesting statistical question in NY is: why can't MTA do the same?
Anyway, if you can estimate the variance and its distribution, then run a Monte Carlo simulation because the analytical solution is too complicated.
Posted by: PeterL | 08/21/2016 at 05:25 PM
PeterL: I wonder how other cities deal with unexpected delays such as a passenger "falling ill" on the train, which seems to occur with some frequency. Also, passengers aggressively hold doors open because the five minutes they save are more important than many thousand minutes lost by the others already on board. There could also be structural reasons.
Someone also help us with the cleanliness issue that seems to be uniquely NYC.
Posted by: Kaiser | 08/22/2016 at 11:52 AM