In the leadup to today’s hearing, U.S. Supreme Court judge nominee Brett Kavanaugh produced what he claimed to be calendar entries from the summer of 1982 as evidence that he did not attend a specific party, and therefore he could not have committed the unsavory acts alleged of him.
This story piqued my interest from a data perspective. What kind of information is contained in calendar datasets? And how can such data be used to support or invalidate hypotheses?
This discussion is relevant well beyond the Kavanaugh case because most of the data being collected today - surveillance data, transaction data, clickstream, etc. - is event data: it records when someone did something.
***
When I started this post, no actual pictures of the most famous calendar right now had been published. USA Today did publish a few pictures yesterday. Here is one page to give us a visual:
This post is not about the specifics of Kavanaugh's youthful activities but about the nature of calendar datasets.
A calendar dataset consists of calendar entries. When converting the above image into a spreadsheet, one would create a row for each event (so multiple rows for multiple events on the same day). Each row should contain the following data, laid out in columns:
- The date and day of week of an event
- The time of the event
- The location of the event, address and/or directions to it
- Other people involved in the event
- Duration of the event
- Miscellaneous notes relevant and specific to the event
- Other notes unrelated to the event
In industry parlance, the above list is called a “data dictionary.” It describes the structure of the data. Any data analyst who has used such things realizes that these definitions are imprecise. If one randomly selects any two calendar datasets, there will be variance in what one actually gets.
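To make the data dictionary concrete, here is a minimal sketch of one row as a Python record. The class name, field names and the two sample rows are my own illustrative assumptions, not anything taken from a real calendar. Every field except the date is optional, which is exactly where the imprecision creeps in.

```python
from dataclasses import dataclass
from datetime import date, time, timedelta
from typing import List, Optional

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


@dataclass
class CalendarEntry:
    """One row per event. Field names are hypothetical, for illustration only."""
    event_date: date                           # the date of the event
    event_time: Optional[time] = None          # often left blank by the owner
    location: Optional[str] = None             # address and/or directions
    other_people: Optional[List[str]] = None   # None means "not recorded"
    duration: Optional[timedelta] = None
    event_notes: Optional[str] = None          # notes specific to the event
    unrelated_notes: Optional[str] = None      # anything else written in the box

    @property
    def day_of_week(self) -> str:
        # Derived from the date, so it can never disagree with it.
        return DAY_NAMES[self.event_date.weekday()]


# Two made-up events on the same day become two rows.
rows = [
    CalendarEntry(date(2024, 3, 4), event_notes="made-up entry with no time"),
    CalendarEntry(date(2024, 3, 4), event_time=time(19, 30),
                  location="made-up venue"),
]
```

Deriving the day of week from the date, rather than storing it as a separate column, is a design choice: it removes one way for the two to contradict each other.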
A lot depends on how the calendar owner uses the calendar! Here are some considerations:
Planner or diary. Most people I know use calendars for planning future events. Kavanaugh appeared to use it also as a diary of sorts. Some entries record past events, e.g. a basketball score. Other entries pertain to future events, e.g. going to a movie. This is a key issue in interpretation: if it’s used as a planner, then the data become frozen in time once the event is over but if it’s a diary, the analyst should assume that edits or appends may change the data even after the event date.
Planner entries are tricky because the information may not hold true beyond the date of the event – an event might get cancelled, the calendar owner might decide to skip the event, times or venues or attendees may shift, etc. To prevent misinterpreting the data, the analyst should strive to annotate each calendar entry as either a planning entry or a diary entry.
How changes are made. If the calendar is used as a diary, the owner may revise the information after the event happened, e.g. correct the list of attendees. Even if the calendar is purely used for planning, edits will be inevitable, e.g. the venue of an event might change before the event happens. How are such changes enacted? Some owners may erase and overwrite; other owners may black out and append; still others may strike out and append. In the first case, we have no trace of the change at all; in the second case, we know something has changed but not the specifics; in the third, we have both the old and the new versions of the information.
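The three editing habits can be summarized as a small lookup of what trace each one leaves behind. This is only a sketch; the names `EditStyle`, `Trace` and `TRACE_LEFT` are hypothetical labels I made up for the illustration.

```python
from enum import Enum
from typing import NamedTuple


class EditStyle(Enum):
    ERASE_AND_OVERWRITE = "erase and overwrite"
    BLACK_OUT_AND_APPEND = "black out and append"
    STRIKE_OUT_AND_APPEND = "strike out and append"


class Trace(NamedTuple):
    change_visible: bool      # can the analyst tell that something changed?
    old_value_readable: bool  # can the analyst still read the old version?


# What each editing habit leaves for the analyst to find.
TRACE_LEFT = {
    EditStyle.ERASE_AND_OVERWRITE: Trace(change_visible=False, old_value_readable=False),
    EditStyle.BLACK_OUT_AND_APPEND: Trace(change_visible=True, old_value_readable=False),
    EditStyle.STRIKE_OUT_AND_APPEND: Trace(change_visible=True, old_value_readable=True),
}
```

Only the strike-out habit preserves a full edit history; with erasing, the dataset looks as if nothing ever changed.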
How complete is the data? A fallacy is to assume WYSIWYG. There is no guarantee that every event in someone’s life is recorded on the calendar. We might assume that every event the calendar owner wants to record is found on the calendar, but even that assumption is not safe. A few times a month, I forget to put meetings down on my calendar (most of these I do attend, and then there are the ones that I miss because they were never recorded). Missing events appear as missing rows in the dataset.
Kavanaugh did not seem to care as much about the time of events. Many entries do not say when he was supposed to do something. So his calendar dataset contains a large number of missing values in the “time of event” column.
How consistent is the data? An analyst can’t even assume that a specific person would follow strict guidelines in terms of filling out a specific item on the data dictionary. Take Kavanaugh’s calendar as an example. On many entries, he did not record the attendees of the event. But on some, he did. When the dataset shows no attendee names for a particular event, does that mean he was alone, or does that mean he chose not to list the attendees? Even if there is a list of attendees, the analyst is hard-pressed to know if the list is complete.
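One way to keep this ambiguity visible in the data is to distinguish "nothing recorded" from "explicitly nobody" from "some names listed". The sketch below assumes a convention of my own (`None` versus an empty list versus a list of names); the function name is hypothetical, and a paper calendar rarely gives you the middle case.

```python
from typing import List, Optional


def interpret_attendees(attendees: Optional[List[str]]) -> str:
    """Describe what an attendee field can and cannot tell us."""
    if attendees is None:
        # Nothing was written down: alone, or simply not listed? Unknown.
        return "unknown"
    if not attendees:
        # Only valid if the entry explicitly says no one else was there.
        return "alone"
    # A list of names is a lower bound; it may still be incomplete.
    return "at least: " + ", ".join(attendees)
```

For example, `interpret_attendees(None)` returns `"unknown"`, while `interpret_attendees(["A", "B"])` returns `"at least: A, B"`. Collapsing `None` into an empty list would quietly turn "not recorded" into "alone".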
How reliable is the data? Some people are sloppy about details, others are meticulous. Some people correct the entries, others don’t. Some data elements might be deemed not important enough to correct.
How the data was generated. Most of the data will be transcribed (by hand or by optical character recognition software) from the paper calendar to a database. Note, however, that some of the data must be inferred. A good example is the first item in the data dictionary, the date and day of week of an event. When we read the calendar, we know from the location of the text which date the event is supposed to occur on, but the date itself is never handwritten! So, in order to generate that column, the analyst must extract (using my own calendar as an example) the year from the cover, the month from the page title and the day from the column name.
Wait – it gets more complicated. Imagine the space for a given day fills up, so additional entries are written in the margins with a guiding line pointing back to that day. (Ouch!)
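The inference step can be sketched in a few lines. The function and parameter names are hypothetical, and the sketch assumes the page title is an English month name and the day has already been read off the grid as a number.

```python
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["January", "February", "March", "April", "May", "June", "July",
         "August", "September", "October", "November", "December"],
        start=1,
    )
}


def infer_date(cover_year: int, page_title: str, day_cell: int) -> date:
    """Build the event date from three places on a paper calendar."""
    # Normalize the transcribed title, e.g. " march " or "MARCH" -> "March".
    month = MONTHS[page_title.strip().capitalize()]
    # date() raises ValueError for impossible combinations such as 30 February,
    # which is a useful check on transcription errors.
    return date(cover_year, month, day_cell)
```

A call such as `infer_date(2024, " march ", 4)` yields `date(2024, 3, 4)`. None of the three inputs sits next to the handwriting itself, so an error in any one of them (say, pages scanned out of order) silently misdates every entry on the page.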
***
I can keep expanding the list of issues but you get the idea. Data is a dirty business. The worst thing that an analyst can do is to presume that the data in front of him/her is (a) complete and (b) accurate.
Now, let’s get to the logic of the Kavanaugh defense. The basic premise is that the calendar did not contain an entry showing that he was at the same party as the accuser, and therefore he was not at such a party, and therefore he could not have committed those alleged acts.
For that logic to hold, we have to believe a further set of assumptions that are not proven by the calendar dataset:
- That the calendar contained every party that Kavanaugh attended during those months
- That Kavanaugh primarily used the calendar as a diary recording past events, rather than as a planner for future events, that is to say, information related to the event is true or corrected after the event
- That for the specific event in question, he was in “diary” mode rather than “planning” mode so that the absence of the party can be inferred to mean he wasn't there.
- That when he revised entries in his calendar, he never erased past writing.
- That none of the blacked-out entries contained relevant information to the current situation.
- That he always listed all attendees of parties he went to.
It’s really hard to use data to prove “absence,” and this is no exception.
***
Be careful next time you interpret your event dataset!