I have an article in Slate, coauthored with Shannon Palus, that just came out, on the recent controversy over pre-sales of Taylor Swift concert tickets (link). A Slate editor contacted me about a dataset that was collected by a Swiftie who was irate that her special status as a "verified fan" apparently did not bump up her chance of getting a ticket but might have even backfired. So the fan set up a survey to expand the sample size. A very good idea!
In the article, I presented my findings. There are a number of factors that affect a fan's chance of getting a ticket. The main takeaway is that it's too simplistic to say having verified fan status is always bad. In statistical terms, interactions matter (keep reading if that term is unfamiliar).
The following chart summarizes the data analysis:
In my post, I'll briefly describe the analytical process.
First, I think about what tool I'd use for the job. In this case, I chose JMP, which is my go-to tool for quick, exploratory, one-time-only analyses.
I opened the raw data file and inspected the rows, which correspond to individual survey responses. I removed a small number of rows that could not be analyzed, e.g. rows with a time stamp but null responses to every question. The response patterns also showed that if someone said they did not have verified fan status, no other responses were recorded, so those rows were removed as well.
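I did this cleaning interactively in JMP, but for readers who prefer code, here is a minimal pandas sketch of the same two steps. The file name and the question wording are placeholders, not the actual survey fields.

```python
import pandas as pd

# Placeholder file name and question wording; the real survey export differs.
df = pd.read_csv("swift_presale_survey.csv")

question_cols = [c for c in df.columns if c != "Timestamp"]

# Drop rows that carry a time stamp but no answers to any question.
df = df.dropna(subset=question_cols, how="all")

# Respondents who said they were not verified fans left everything else blank,
# so those rows are dropped as well.
df = df[df["Are you a verified fan?"].str.strip().str.lower() != "no"]
```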
Next, I selected the columns relevant to predicting success rate from factors such as having a verified-fan boost, time zone, and ticketing platform. For each factor, I made sure all answers were valid and coded missing values the way I like. A small number of discretionary decisions had to be made. For example, the key outcome variable, whether someone was able to purchase tickets, has a third possible answer: "Yes I got tickets but only through a friend who got through the line". I decided to count this as a "No" rather than a "Yes". Also, for reasons that I don't need to explain to modelers, I created duplicate columns, e.g. one column showing the responses as Yes/No, treated as text, and a duplicate showing them as 0/1, treated as numbers or factors.
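Continuing the pandas sketch above, the recoding might look roughly like this. The outcome question's column name is a placeholder; the "through a friend" answer is quoted from the survey.

```python
outcome = df["Were you able to purchase tickets?"]  # placeholder column name

# Count tickets obtained only through a friend as a "No".
df["got_tickets"] = outcome.replace(
    {"Yes I got tickets but only through a friend who got through the line": "No"}
)

# Keep a text version (Yes/No) and a numeric duplicate (0/1) for modeling.
df["got_tickets_num"] = (df["got_tickets"] == "Yes").astype(int)
```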
For the time being, I resisted renaming the columns, whose names were the full text of the survey questions. I eventually succumbed and replaced those long strings with simple names like "time zone". Extremely long column names just make analytical results really hard to read.
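In code, that step is just a rename with a mapping from question text to short names. The question wordings below are placeholders.

```python
# Placeholder question wordings mapped to short, readable names.
rename_map = {
    "Which time zone were you in when tickets went on sale?": "time_zone",
    "Which ticketing platform did you use?": "platform",
    "Which boost(s) did you have, if any?": "boost",
}
df = df.rename(columns=rename_map)
```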
I ran some standard analyses, including multiple regression and a decision tree. For the multiple regression, I screened for interaction effects; in particular, JMP has a really useful "profiler" feature that lets me visualize the "response surface". I first ran the decision tree using the automated procedure. This methodology is very useful for assessing potential interaction effects and the relative importance of the various factors. The regression equation provides similar information: these analyses are complementary.
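There is no direct open-source equivalent of JMP's profiler, but if you wanted to reproduce the general idea in Python, a logistic regression with interaction terms plus an off-the-shelf decision tree is a reasonable starting point. This sketch assumes the placeholder short names introduced above; it is not the exact model behind the chart.

```python
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.tree import DecisionTreeClassifier, export_text

# Logistic regression screening for an interaction between platform and time zone.
logit = smf.logit(
    "got_tickets_num ~ C(platform) * C(time_zone) + C(boost)", data=df
).fit()
print(logit.summary())

# An automated decision tree on the same factors, as a cross-check.
X = pd.get_dummies(df[["platform", "time_zone", "boost"]], drop_first=True)
y = df["got_tickets_num"]
tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=20).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))
```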
One of the reasons I chose JMP is that its decision tree procedure has an interactive tree-building feature. Instead of using the automated rules specified by the inventors, I can iteratively build my own decision tree. The chart you saw above is the tree I built manually. It is not identical to what the automated procedure created, because I made decisions about how to structure the tree, based on what I learned from both analyses. The automated procedure always picks the single most discriminating split at each stage, but frequently there are multiple splitting variables with similar degrees of discrimination. For example, the automated procedure would not have split the responses by ticketing platform first, but not doing so vastly complicates the story.
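Most open-source tree implementations do not offer interactive tree building, but you can approximate a hand-picked first split by subsetting the data yourself and letting the algorithm grow a small tree within each subset. A rough sketch, under the same placeholder column names:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Impose the first split by hand: subset on ticketing platform, then grow
# a small automated tree within each platform group.
for platform, sub in df.groupby("platform"):
    X = pd.get_dummies(sub[["time_zone", "boost"]], drop_first=True)
    y = sub["got_tickets_num"]
    small_tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=20).fit(X, y)
    print(f"--- {platform} ---")
    print(export_text(small_tree, feature_names=list(X.columns)))
```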
I did one final transformation. The original question on boosting allowed four choices: no boost, only the Lover Fest boost, only the Taylor Nation boost, and both the Lover Fest and Taylor Nation boosts. Using this factor directly obscures the main and interaction effects, so I split it into two variables: Lover Fest boost (yes/no) and Taylor Nation boost (yes/no). This rearrangement provides a more nuanced analysis. For example, for Ticketmaster users on the East Coast, having the Taylor Nation boost reduces the success rate a little, but having the Lover Fest boost reduces it a lot. Having both boosts is about as bad as having just the Lover Fest boost, so we can't simply add up the two individual effects to get the aggregate effect. (This is what statisticians call an "interaction" between two factors.) For this reason, the final tree diagram does not mention the Taylor Nation boost for Ticketmaster users in the East time zone.
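In code, the rearrangement amounts to deriving two indicator columns from the original factor and letting them interact in the model. Again a sketch with placeholder answer wordings, not the model behind the chart:

```python
import statsmodels.formula.api as smf

# Split the four-level boost factor into two yes/no indicators.
df["lover_fest"] = df["boost"].str.contains("Lover Fest", na=False).astype(int)
df["taylor_nation"] = df["boost"].str.contains("Taylor Nation", na=False).astype(int)

# The interaction term lets the effect of one boost depend on whether the
# other boost is also present, which is the pattern described above.
interaction_model = smf.logit(
    "got_tickets_num ~ lover_fest * taylor_nation", data=df
).fit()
print(interaction_model.summary())
```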
***
How does my analytical process deviate from what's presented in a textbook? I treat the results from standard analytical methods as references, not gospel. These standard methods spit out the best-fitting models according to some criteria of "goodness", and such criteria almost never include maximizing human understanding of the (unknown) processes that generated the data. Chasing tiny improvements in "R-squared" frequently over-complicates the model, which reduces its clarity for human readers.
***
If you have any questions about the analytical process, just ask below.