Book review: Visualizing Baseball

I requested a copy of Jim Albert’s Visualizing Baseball book, which is part of the ASA-CRC series on Statistical Reasoning in Science and Society that has the explicit goal of reaching a mass audience.

Visualizingbaseball_cover

The best feature of Albert’s new volume is its brevity. For someone with a decent background in statistics (and a grasp of basic baseball jargon), it’s a book that can be consumed within one week, after which one receives a good overview of baseball analytics, otherwise known as sabermetrics.

Within fewer than 200 pages, Albert outlines approaches to a variety of problems, including:

  • Comparing baseball players by key hitting (or pitching) metrics
  • Tracking a player’s career
  • Estimating the value of different plays, such as a single, a triple or a walk
  • Predicting expected runs in an inning from the current state of play
  • Analyzing pitches and swings using PitchFX data
  • Describing the effect of ballparks on home runs
  • Estimating the effect of particular plays on the outcome of a game
  • Simulating “fake” games and seasons in order to produce probabilistic forecasts such as X% chance that team Y will win the World Series
  • Examining whether a hitter is “streaky” or not

Most of the analyses are descriptive in nature, e.g. describing the number and types of pitches thrown by a particular pitcher, or the change in on-base percentage over the career of a particular hitter. Fewer pages are devoted to predictive analytics. This structure is acceptable in a short introductory book. In practice, decision-makers require more sophisticated work on top of these descriptive analyses. For example, what’s the value of telling a coach that the home run was the pivotal moment in a 1-0 game that has already been played?

To appreciate the practical implications of the analyses included in this volume, I’d recommend reading Moneyball by Michael Lewis, or the more recent Astroball by Ben Reiter.

For the more serious student of sabermetrics, key omitted details will need to be gleaned from other sources, including other books by the same author – for years, I have recommended Curve Ball by Albert and Bennett to my students.

***

In the final chapters, Albert introduces the simulation of “fake” seasons that underlies predictions. An inquiring reader should investigate how the process is tied back to the reality of what actually happened; otherwise, the simulation will have a life of its own. Further, if one simulates 1,000 seasons of 2018 baseball, a large number of these fake seasons would crown some team other than the Red Sox as the 2018 World Series winner. Think about it: that’s how it is possible to make the prediction that the Red Sox have, say, a 60 percent chance of winning the World Series in 2018! A key to understanding the statistical way of thinking is to accept the logic of this fake simulated world. It is not the stated goal of Albert to convince readers of the statistical way of thinking – but you’re not going to be convinced unless you think about why we do it this way.
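To make the logic of the fake simulated world concrete, here is a minimal Python sketch. It is not Albert's method: each "season" is reduced to a single random draw of the champion, and the team strengths are numbers I invented so that the Red Sox come out near 60 percent.

```python
import random
from collections import Counter

# Hypothetical title "strengths", invented for illustration only;
# they are not taken from the book or from any real forecast.
STRENGTHS = {"Red Sox": 6.0, "Astros": 2.0, "Dodgers": 1.5, "Brewers": 0.5}

def simulate_champion(strengths, rng):
    """Simulate one fake season: draw a champion with probability proportional to strength."""
    teams = list(strengths)
    weights = [strengths[team] for team in teams]
    return rng.choices(teams, weights=weights, k=1)[0]

def title_chances(strengths, n_seasons=1000, seed=2018):
    """Share of simulated seasons won by each team."""
    rng = random.Random(seed)
    counts = Counter(simulate_champion(strengths, rng) for _ in range(n_seasons))
    return {team: counts[team] / n_seasons for team in strengths}

chances = title_chances(STRENGTHS)
for team, share in chances.items():
    print(f"{team}: champion in {share:.1%} of simulated seasons")
```

The forecast is just the share of fake seasons in which a team wins, which is why a 60 percent favorite still loses in several hundred of the 1,000 simulated seasons.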

***

While there are plenty of charts included in the book, a more appropriate title for “Visualizing Baseball” would have been “Fast Intro to Baseball Analytics”. With several exceptions, the charts are not essential to understanding the analyses. The dominant form of exposition is to first describe the analytical conclusion, then introduce a chart to illustrate that conclusion. The inverse would be: Start with the chart, and use the chart to explain the analysis.

The visualizations are generally of good quality, emphasizing clarity over prettiness. The choice of sticking to one software, ggplot2 in R, without post-production, constrains the visual designer to the preferences of the software designer. Such limitations are evident in chart elements like legends and titles. Here is one example (Chapter 5, Figure 5.8):

Albert_visualizingbaseball_chart

By default, the software prints the names of data columns in the titles. Imagine if the plot titles were Changeup, Fastball and Slider instead of CU, FF and SL. Or that the axis labels were “horizontal location” and “vertical location” (check) instead of px and pz. [Note: The chart above was taken from the book's GitHub site; in Figure 5.8 of the printed book, the chart titles were edited as suggested.]

The chart analyzes the location relative to the strike zone of pitches that were missed versus pitches that were hit (not missed). By default, the software takes the name of the binary variable (“Miss”) as the legend title, and lists the values of the variable (“True” and “False”) as the labels of the two colors. Imagine if True appeared as “Miss” and False as “Hit”.

Finally, the chart exhibits over-plotting, making it tough to know how many blue or gray dots are present. Smaller dot size might help, or else some form of aggregation.

***

Visualizing Baseball is not the book for readers who learn by running code as no code is included in the book. A github page by the author hosts the code, but only the R/ggplot2 code for generating the data visualization. Each script begins after the analysis or modeling has been completed. If you already know R and ggplot2, the github is worth a visit. In any case, I don’t recommend learning coding from copying and pasting clean code.

All in all, I can recommend this short book to any baseball enthusiast who’s beginning to look at baseball data. It may expand your appreciation of what can be done. For details, and practical implications, look elsewhere.


Seeking simplicity in complex data: Bloomberg's dataviz on UK gender pay gap

Bloomberg featured a thought-provoking dataviz that illustrates the pay gap by gender in the U.K. The dataset underlying this effort is complex, and the designers did a good job simplifying the data for ease of comprehension.

U.K. companies are required to submit data on salaries and bonuses by gender, and by pay quartiles. The dataset is incomplete, since some companies are slow to report, and the analyst decided not to merge companies that changed names.

Companies are classified into industry groups. Readers who read Chapter 3 of Numbers Rule Your World (link) should ask whether these group differences are meaningful by themselves, without controlling for seniority, job titles, etc. The chapter features one method used by the educational testing industry to take a more nuanced analysis of group differences.

***

The Bloomberg visualization has two sections. In the top section, each company is represented by the percent difference between average female pay and average male pay. Then the companies within a given industry are shown in a histogram. The histograms provide a view of the disparity between companies within a given industry. The black line represents the relative proportion of companies in a given industry that have no gender pay gap, but it’s the weight of the histogram on either side of the black line that carries the graphic’s message.

This is the histogram for arts, entertainment and recreation.

Bloomberg_genderpaygap_arts

The spread within this industry is very wide, especially on the left side of the black line. A large proportion of these companies pay women less on average than men, and how much less is highly variable. There is one extreme positive value: Chelsea FC Foundation, which pays the average female about 40% more than the average male.

This is the histogram for the public sector.

Bloomberg_genderpaygap_public
It is a much tighter distribution, meaning that the pay gaps vary less from organization to organization (this statement ignores the possibility that there are outliers not visible on this graphic). Again, the vast majority of entities in this sector pay women less than men on average.

***

The second part of the visualization looks at the quartile data. The employees of each company are divided into four equal-sized groups, based on their wages. Think of these groups as the Top 25% Earners, the Second 25%, etc. Within each group, the analyst looks at the proportion of women. If gender is independent of pay, then we should expect the proportions of women to be about the same for all four quartiles. (This analysis considers gender to be the only explainer for pay gaps. This is a problem I've called xyopia, which frames a complex multivariate issue as a bivariate problem involving one outcome and one explanatory variable. Chapter 3 of Numbers Rule Your World (link) discusses how statisticians approach this issue.)

Bloomberg_genderpaygap_public_pie

On the right is the chart for the public sector. This is a pie chart used as a container. Every pie has four equal-sized slices representing the four quartiles of pay.

The female proportion is encoded in both the size and color of the pie slices. The size encoding is more precise while the color encoding has only 4 levels so it provides a “binned” summary view of the same data.

For the public sector, the lighter-colored slice shows the top 25% earners, and its light color means the proportion of women in the top 25% earners group is between 30 and 50 percent. As we move clockwise around the pie, the slices represent the 2nd, 3rd and bottom 25% earners, and women form 50 to 70 percent of each of those three quartiles.

To read this chart properly, the reader must first do one calculation. Women represent about 60% of the top 25% earners in the public sector. Is that good or bad? This depends on the overall representation of women in the public sector. If the sector employs 75 percent women overall, then the 60 percent does not look good but if it employs 40 percent women, then the same value of 60% tells us that the female employees are disproportionately found in the top 25% earners.

That means the reader must compare each value in the pie chart against the overall proportion of women, which can be computed as the average of the four quartile proportions, since the quartiles are equal-sized.
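The calculation the reader has to do can be sketched in a few lines of Python. The four shares below are hypothetical values I made up to match the color bins described for the public sector; they are not Bloomberg's data.

```python
def overall_share(quartile_shares):
    """Overall proportion of women, given the share in each of four equal-sized pay quartiles."""
    if len(quartile_shares) != 4:
        raise ValueError("expected four quartile shares")
    return sum(quartile_shares) / 4

def gaps_from_overall(quartile_shares):
    """Difference between each quartile's share of women and the overall share."""
    overall = overall_share(quartile_shares)
    return [share - overall for share in quartile_shares]

# Hypothetical shares of women, ordered from the top 25% earners to the bottom 25%
shares = [0.45, 0.60, 0.65, 0.70]
overall = overall_share(shares)
gaps = gaps_from_overall(shares)

print(f"overall share of women: {overall:.0%}")
for label, gap in zip(["Top 25%", "2nd 25%", "3rd 25%", "Bottom 25%"], gaps):
    print(f"{label}: {gap:+.0%} relative to overall")
```

Because the quartiles are equal-sized, the four gaps always sum to zero: any quartile above the overall share is offset by others below it.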

***

In the chart below, I make this relative comparison explicit. The overall proportion of women in each industry is shown using an open dot. Then the graphic displays two bars, one for the Top 25% earners, and one for the Bottom 25% earners. The bars show the gap between those quartiles and the overall female proportion. For the top earners, the size of the red bars shows the degree of under-representation of women while for the bottom earners, the size of the gray bars shows the degree of over-representation of women.

Redo_junkcharts_bloombergukgendergap

The net sum of the bar lengths is a plausible measure of gender inequality.

The industries are sorted from the ones employing fewer women (at the top) to the ones employing the most women (at the bottom). An alternative is to sort by total bar lengths. In the original Bloomberg chart (the small multiples of pie charts), the industries are sorted by the proportion of women in the bottom 25% pay quartile, from smallest to largest.

In making this dataviz, I elected to ignore the middle 50%. This is not a problem since any quartile above the average must be compensated by a different quartile below the average.

***

The challenge of complex datasets is discovering simple ways to convey the underlying message. This usually requires quite a bit of upfront analytics, data transformation, and lots of sketching.



Form and function: when academia takes on weed

I have a longer article on the sister blog about the research design of a study claiming 420 "cannabis" Day caused more road accident fatalities (link). The blog also has a discussion of the graphics used to present the analysis, which I'm excerpting here for dataviz fans.

The original chart looks like this:

Harperpalayew-new-420-fig2

The question being asked is whether April 20 is a special day when viewed against the backdrop of every day of the year. The answer is pretty clear. From this chart, the reader can see:

  • that April 20 is part of the background "noise". It's not standing out from the pack;
  • that there are other days like July 4, Labor Day, Christmas, etc. that stand out more than April 20

It doesn't even matter what the vertical axis is measuring. The visual elements did their job. 

***

If you look closely, you can even assess the "magnitude" of the evidence, not just the "direction." While April 20 isn't special, it nonetheless is somewhat noteworthy. The vertical line associated with April 20 sits on the positive side of the range of possibilities, and appears to sit above most other days.

The chart form shown above is better at conveying the direction of the evidence than its strength. If the strength of the evidence is required, we use a different chart form.

I produced the following histogram, using the same data:

Redo_420day_2

The histogram is produced by first sorting the midpoints# of the vertical lines into buckets, and then counting the number of days that fall into each bucket. (# Strictly speaking, I use the point estimates.)

The midpoints# are estimates of the fatal crash ratio, which is defined as the excess crash fatalities reported on the "analysis day" relative to the "reference days," which are situated one week before and one week after the analysis day. So April 20 is compared to April 13 and 27. Therefore, a ratio of 1 indicates no excess fatalities on the analysis day. And the further the ratio is above 1, the more special the analysis day is.

If we were to pick a random day from the histogram above, we would likely land somewhere in the middle, which is to say, a day of the year in which no excess car crash fatalities could be confirmed in the data.

As shown above, the ratio for April 20 (about 1.12) is located on the right tail, at roughly the 94th percentile, meaning that about 6 percent of analysis days had more extreme ratios.
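For readers who want to see the mechanics, here is a small Python sketch of the bucketing and the percentile reading. The list of ratios is hypothetical and much shorter than the real one, so the percentile it prints is not the 94th; only the procedure carries over.

```python
import math
from collections import Counter

def bucket_counts(values, width):
    """Histogram counts keyed by bucket index (bucket i covers [i*width, (i+1)*width))."""
    # Rounding before flooring guards against floating-point noise at bucket edges.
    return Counter(math.floor(round(v / width, 9)) for v in values)

def percentile_rank(values, x):
    """Percentage of values that are less than or equal to x."""
    return 100 * sum(v <= x for v in values) / len(values)

# Hypothetical fatal crash ratios for 20 analysis days (invented for illustration)
ratios = [0.85, 0.90, 0.92, 0.95, 0.96, 0.98, 0.99, 1.00, 1.00, 1.01,
          1.02, 1.03, 1.04, 1.05, 1.06, 1.08, 1.10, 1.12, 1.20, 1.30]

counts = bucket_counts(ratios, width=0.05)
rank = percentile_rank(ratios, 1.12)
print(f"a ratio of 1.12 sits at percentile {rank:.0f} of these hypothetical days")
```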

This is in line with our reading above, that April 20 is noteworthy but not extraordinary.

 

P.S. [4/27/2019] Replaced the first chart with a newer version from Harper's site. The newer version contains the point estimates inside the vertical lines, which are used to generate the histogram.



The Bumps come to the NBA, courtesy of 538

The team at 538 did a post-mortem of their in-season forecasts of NBA playoffs, using Bumps charts. These charts have a long history and can be traced back to Cambridge rowing. I featured them in these posts from a long time ago (link 1, link 2). 

Here is the Bumps chart for the NBA Western Conference showing all 15 teams, and their ranking by the 538 model throughout the season.

Fivethirtyeight_nbawest_bumps

The highlighted team is the Kings. It's a story of ascent especially in the second half of the season. It's also a story of close but no cigar. It knocked at the door for the last five weeks but failed to grab the last spot. The beauty of the Bumps chart is how easy it is to see this story.

Now focus on the dotted line labeled "Makes playoffs," and note that beyond the halfway point (1/31), there are no further crossings. This means that by that point the 538 model had selected the eight playoff teams accurately.

***

Now what about NBA East?

Fivethirtyeight_nbaeast_bumps

This chart highlights the two top teams. This conference is pretty easy to predict at the top. 

What is interesting is the spaghetti around the playoff line. The playoff race was heart-stopping and it wasn't until the last couple of weeks that the teams were settled. 

Also worthy of attention are the bottom-dwellers. Note that the chart is disconnected in the last four rows (ranks 12 to 15). These four teams did not ever leave the cellar, and the model figured out the final rankings around February.

Using a similar analysis, you can see that the model found the top 5 teams by mid December in this Conference, as there are no further crossings beyond that point. 

***
Go check out the FiveThirtyEight article for their interpretation of these charts. 

While you're there, read the article about when to leave the stadium if you'd like to leave a baseball game early, work that came out of my collaboration with Pravin and Sriram.


How to describe really small chances

Reader Aleksander B. sent me to the following chart in the Daily Mail, with the note that "the usage of area/bubble chart in combination with bar alignment is not very useful." (link)

Dailymail-image-a-35_1431545452562

One can't argue with that statement. This chart fails the self-sufficiency test: anyone reading the chart is reading the data printed on the right column, and does not gain anything from the visual elements (thus, the visual representation is not self-sufficient). As a quick check, the size of the risk for "motorcycle" should be about 30 times larger than that of "car"; the size of the risk for "car" should be 100 times larger than that of "airplane". The risk of riding motorcycles then is roughly 3,000 times that of flying in an airplane. 

The chart does not appear to be sized properly as a bubble chart:

Dailymail_travelrisk_bubble

You'll notice that the visible proportion of the "car" bubble is much larger than that of the "motorcycle" bubble, which is one part of the problem.

Nor is it sized as a bar chart:

Dailymail_travelrisk_bar

As a bar chart, both the widths and the heights of the bars vary; and the last row presents a further challenge as the bubble for the airplane does not touch the baseline.

***

Besides the Visual, the Data issues are also quite hard. This is how Aleksander describes it: "as a reader I don't want to calculate all my travel distances and then do more math to compare different ways of traveling."

The reader wants to make smarter decisions about travel based on the data provided here. Aleksander proposes one such problem:

In terms of probability it is also easier to understand: "I am sitting in my car in strong traffic. At the end in 1 hour I will make only 10 miles so what's the probability that I will die? Is it higher or lower than 1 hour in Amtrak train?"

The underlying choice is between driving and taking Amtrak for a particular trip. This comparison is relevant because those two modes of transport are substitutes for this trip. 

One Data issue with the chart is that riding a motorcycle and flying in a plane are rarely substitutes. 

***

A way out is to do the math on behalf of your reader. The metric of deaths per 1 billion passenger-miles is not intuitive for a casual reader. A more relevant question is what's the chance of dying from the time I spend per year of driving (or riding a plane). Because the chance will be very tiny, it is easier to express the risk as the number of years of travel before I expect to see one death.

Let's assume someone drives 300 days per year, and 100 miles per day so that each year, this driver contributes 30,000 passenger-miles to the U.S. total (which is 3.2 trillion). We convert 7.3 deaths per 1 billion passenger-miles to 1 death per 137 million passenger-miles. Since this driver does 30K per year, it will take (137 million / 30K) = about 4,500 years to see one death on average. This calculation assumes that the driver drives alone. It's straightforward to adjust the estimate if the average occupancy is higher than 1. 

Now, let's consider someone who flies once a month (one outbound trip plus one return trip). We assume that each plane takes on average 100 passengers (including our protagonist), and each trip covers on average 1,000 miles. Then each of these flights contributes 100,000 passenger-miles. In a year, the 24 trips contribute 2.4 million passenger-miles. The risk of flying is listed at 0.07 deaths per 1 billion, which we convert to 1 death per 14 billion passenger-miles. On this flight schedule, it will take (14 billion / 2.4 million) = almost 6,000 years to see one death on average.
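The arithmetic in the two paragraphs above can be reproduced with a short Python sketch. It uses only the numbers and assumptions stated in the text; the function name is mine. Because the code does not round the intermediate values, its results differ slightly from the rounded figures quoted above.

```python
def years_per_death(deaths_per_billion_pm, passenger_miles_per_year):
    """Years of travel before one death is expected, at a constant risk per passenger-mile."""
    passenger_miles_per_death = 1e9 / deaths_per_billion_pm
    return passenger_miles_per_death / passenger_miles_per_year

# Driving: 300 days a year, 100 miles a day, driving alone
driving_pm = 300 * 100                      # 30,000 passenger-miles per year
driving_years = years_per_death(7.3, driving_pm)

# Flying: 24 flights a year, 1,000 miles each; as in the text, all 100
# passengers on each plane are counted toward the passenger-miles
flying_pm = 24 * 100 * 1000                 # 2.4 million passenger-miles per year
flying_years = years_per_death(0.07, flying_pm)

print(f"driving: about {driving_years:,.0f} years per death")
print(f"flying:  about {flying_years:,.0f} years per death")
```

Changing the assumed travel schedule or occupancy only requires changing the passenger-miles per year passed to the function.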

For the average person on those travel schedules, there is nothing to worry about. 

***

Comparing driving and flying is only valid for those trips in which you have a choice. So a proper comparison requires breaking down the average risks into components (e.g. focusing on shorter trips). 

The above calculation also suggests that the risk is not evenly spread out throughout the population, despite the use of an overall average. A trucker who is on the road every work day is clearly subject to higher risk than an occasional driver who makes a few trips on rental cars each year.

There is a further important point to note about flight risk, due to MIT professor Arnold Barnett. He has long criticized the use of deaths per billion passenger-miles as a risk metric for flights. (In Chapter 5 of Numbers Rule Your World (link), I explain some of Arnie's research on flight risk.) The problem is that almost all fatal crashes involving planes happen soon after take-off or not long before landing. 

 


Book Preview: How Charts Lie, by Alberto Cairo

Howchartslie_cover

If you’re like me, your first exposure to data visualization was as a consumer. You may have run across a pie chart, or a bar chart, perhaps in a newspaper or a textbook. Thanks to the power of the visual language, you got the message quickly, and moved on. Few of us learned how to create charts from first principles. No one taught us about axes, tick marks, gridlines, or color coding in science or math class. There is a famous book in our field called The Grammar of Graphics, by Leland Wilkinson, but it’s not a For Dummies book. This void is now filled by Alberto Cairo’s soon-to-appear new book, titled How Charts Lie: Getting Smarter about Visual Information.

As a long-time fan of Cairo’s work, I was given a preview of the book, and I thoroughly enjoyed it and recommend it as an entry point to our vibrant discipline.

In the first few chapters of the book, Cairo describes how to read a chart. Some may feel that there is not much to it but if you’re here at Junk Charts, you probably agree with Cairo’s goal. Indeed, it is easy to mis-read a chart. It’s also easy to miss the subtle and brilliant design decisions when one doesn’t pay close attention. These early chapters cover all the fundamentals to become a wiser consumer of data graphics.

***

How Charts Lie will open your eyes to how everyone uses visuals to push agendas. The book is an offshoot of a lecture tour Cairo took during the last year or so, which has drawn large crowds. He collected plenty of examples of politicians and others playing fast and loose with their visual designs. After reading this book, you can’t look at charts with a straight face!

***

In the second half of his book, Cairo moves beyond purely visual matters into analytical substance. In particular, I like the example on movie box office from Chapter 4, titled “How Charts Lie by Displaying Insufficient Data”. Visual analytics of box office receipts seems to be a perennial favorite of job-seekers in data-related fields.

The movie data is a great demonstration of why one needs to statistically adjust data. Cairo explains why Marvel’s Black Panther is not the third highest-grossing film of all time in the U.S., as reported in the media. That is because gross receipts should be inflation-adjusted. A ticket worth $15 today cost $5 some time ago.

This discussion features a nice-looking graphic, which is a staircase chart showing how much time a #1 movie has stayed in the top position until it is replaced by the next higher grossing film.

Cairo_howchartslie_movies

Cairo’s discussion goes further, exploring the number of theaters as a “lurking” variable. For example, Jaws opened in about 400 theaters while Star Wars: The Force Awakens debuted in 10 times as many. A chart showing per-screen inflation-adjusted gross receipts looks much different from the original chart shown above.

***

Another highlight is Cairo’s analysis of the “cone of uncertainty” chart frequently referenced in anticipation of impending hurricanes in Florida.

Cairo_howchartslie_hurricanes

Cairo and his colleagues have found that “nearly everybody who sees this map reads it wrongly.” The casual reader interprets the “cone” as a sphere of influence, showing which parts of the country will suffer damage from the impending hurricane. In other words, every part of the shaded cone will be impacted to a larger or smaller extent.

That isn’t the designer’s intention! The cone embodies uncertainty, showing which parts of the country have what chance of being hit by the impending hurricane. In the aftermath, the hurricane would have traced one specific path, and that path would have run through the cone if the predictive models were accurate. Most of the shaded cone would have escaped damage.

Even experienced data analysts are likely to mis-read this chart: as Cairo explains, the cone has a “confidence level” of 68%, not the more conventional 95%. Areas outside the cone still have a chance of being hit.

This map clinches the case for why you need to learn how to read charts. And Alberto Cairo, who is a master visual designer himself, is a sure-handed guide for the start of this rewarding journey.

***

Here is Alberto introducing his book.


This chart advises webpages to add more words

A reader sent me the following chart. In addition to the graphical glitch, I was asked about the study's methodology.

Serp-iq-content-length

I was able to trace the study back to this page. The study itself uses a line chart, rather than the bar chart with an axis not starting at zero. The line shows that web pages ranked higher by Google on the first page tend to have more words, i.e. longer content may help with Google ranking.

Backlinko_02_Content-Total-Word-Count_line

On the bar chart, Position 1 is more than 6 times as big as Position 10, if one compares the bar areas. But it's really only 20% larger in the data.

In this case, even the line chart is misleading. If we extend the Google Position to 20, the line would quickly dip below the horizontal axis if the same trend applies.
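Here is a rough Python sketch of both distortions. The word counts and the axis baseline are hypothetical numbers I picked so that the drawn ratio comes out at 6 while the data ratio is 1.2; the study's actual values are not reproduced here. The straight-line extrapolation is likewise only an assumption about "the same trend".

```python
def data_ratio(a, b):
    """How much larger a is than b in the data."""
    return a / b

def drawn_ratio(a, b, baseline):
    """How much longer the bar for a looks than the bar for b when the axis starts at baseline."""
    return (a - baseline) / (b - baseline)

def extrapolate(pos1_value, pos10_value, position):
    """Extend the straight line through positions 1 and 10 to another position."""
    slope = (pos10_value - pos1_value) / 9
    return pos1_value + slope * (position - 1)

# Hypothetical average word counts and axis start (invented for illustration)
position_1, position_10, axis_start = 2400, 2000, 1920

in_data = data_ratio(position_1, position_10)
on_chart = drawn_ratio(position_1, position_10, axis_start)
at_20 = extrapolate(position_1, position_10, 20)

print(f"in the data: {in_data:.1f}x, as drawn: {on_chart:.1f}x")
print(f"extrapolated word count at position 20: {at_20:.0f} (axis starts at {axis_start})")
```

With a baseline of zero, the drawn ratio equals the data ratio, which is the whole argument for starting bars at zero.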

The line chart includes too much grid, one of Tufte's favorite complaints. The Google position is an integer and yet the chart's gridlines imply that 0.5 rank is possible.

Any chart of this data should supply information about the variance around these average word counts. I would like to see a side-by-side box plot, for example.

Another piece of context is the word counts for results on the second or third pages of Google results. Where are the short pages?

***

Turning to methodology, we learn that the research team analyzed 1 million pages of Google search results, and they also "removed outliers from our data (pages that contained fewer than 51 words and more than 9999 words)."

When you read a line like this, you have to ask some questions:

How do they define "outlier"? Why do they choose 51 and 9,999 as the cut-offs?

What proportion of the data was removed at either end of the distribution?

If these proportions are small, then the outliers are not going to affect the average word count by much, and thus there is no point to their removal. If they are large, we'd like to see what impact removing them might have.

In any case, the median is a better number to use here, or just show us the distribution, not just the average number.
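A small Python sketch shows the kind of check I have in mind. The word counts are hypothetical; only the cut-offs of 51 and 9,999 come from the study's description.

```python
from statistics import mean, median

def trim(word_counts, low=51, high=9999):
    """Drop pages with fewer than 51 or more than 9,999 words, as the study describes."""
    return [c for c in word_counts if low <= c <= high]

# Hypothetical word counts for ten pages (invented for illustration)
word_counts = [30, 400, 800, 1200, 1500, 1800, 2000, 2500, 3000, 25000]
kept = trim(word_counts)

share_removed = 1 - len(kept) / len(word_counts)
print(f"share of pages removed: {share_removed:.0%}")
print(f"mean before: {mean(word_counts):.0f}, after: {mean(kept):.0f}")
print(f"median before: {median(word_counts):.0f}, after: {median(kept):.0f}")
```

In this made-up sample, removing two pages cuts the mean by more than half while the median does not move, which is why the median is the safer summary when the tails are long.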

It could well be true that Google's algorithm favors longer content, but we need to see more of the data to judge.



Labels, scales, controls, aggregation all in play

JB @barclaysdevries sent me the following BBC production over Twitter.

Johnbennett_barclaysdevries_bbc_chinagrowth

He was not amused.

This chart pushes a number of my hot buttons.

First, I like to assume that readers don't need to be taught that 2007 and 2018 are examples of "Year".

Second, starting an area chart away from zero is just as bad as starting a bar chart away from zero! The area is distorted and does not reflect the relative values of the data.

Third, I suspect the 2007 high point is a local peak, which they chose in order to advance a sky-is-falling narrative related to China's growth.

So I went to a search engine and looked up China's growth rate, and it helpfully automatically generated the following chart:

Google_chinagrowth

Just wow! This chart does a number of things right.

First, it confirms my hunch above. 2007 is a clear local peak and it is concerning that the designer chose that as a starting point.

Second, this chart understands that the zero-growth line has special meaning.

Third, there are more year labels.

Fourth, and very importantly, the chart offers two "controls". We can look at China's growth relative to India's and relative to the U.S.'s. Those two other lines bring context.

JB's biggest complaint is that the downward-sloping line confuses the issue, which is that slowing growth is still growth. The following chart conveys a completely different message but the underlying raw data are the same:

Redo_chinagdpgrowth

 


Men and women faced different experiences in the labor market

Last week, I showed how an aggregate statistic, the unemployment rate, masked some unusual trends in the labor market in the U.S. Despite the unemployment rate in 2018 being equal to, or even a little below, that in 2000, the peak of the last tech boom, there are now significantly more people "not in the labor force," and these people are not counted in the unemployment rate statistic.

That analysis focuses on two factors that are not visible in the unemployment rate aggregate: the proportion of people considered not in the labor force, and the proportion of employees who have part-time positions. But the analysis itself masks a difference across genders.

It turns out that men and women had very different experiences in the labor market.

For men, things have looked progressively worse with each recession and recovery since 1990. After each recovery, more men exit the labor force, and more men become part-timers. The Great Recession, however, hit men even worse than previous recessions, as seen below:

Jc_unemployment_rate_explained_men

For women, it's a story of impressive gains in the 1990s, and a sad reversal since 2008.

Jc_unemployment_rate_explained_women

P.S. See here for Part 1 of this series. In particular, the color scheme is explained there. Also, the entire collection can be viewed here.


What to make of the historically low unemployment rate

One of the amazing economic stories of the moment is the unemployment rate, which at around 4% has returned to the level last reached during the peak of the tech boom in 2000. The story is much more complex than it seems.

I devoted a chapter of Numbersense (link) to explain how the government computes unemployment rates. The most important thing to realize is that an unemployment rate of 4 percent does NOT mean that four out of 100 people in the U.S. are unemployed, and 96 out of 100 are employed.

It doesn't even mean that four out of 100 people of working age are unemployed, and 96 out of 100 of working age are employed.

What it means is that, of the people the government decides are "employable", 96 out of 100 are employed. Officially, this employability is known as being "in labor force." There are many ways to be disqualified from the labor force; one example is if the government decides that the person is not looking for a job.

On the flip side, who the government counts as "employed" also matters! Part-timers are considered employed. They are counted just like a full-time employee in the unemployment metric. Part-time, according to the government, is one to 34 hours worked during the week the survey is administered.

***

So two factors can affect the unemployment rate a lot - the proportion of the population considered "not in labor force" (thus not counted at all); and the proportion of those considered employed who are part-timers. (Those are two disjoint groups.)
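A few lines of Python make the point. The head counts are hypothetical, chosen only so that two very different labor markets produce the same 4 percent rate; the function names are mine.

```python
def unemployment_rate(employed, unemployed):
    """Unemployed as a share of the labor force; people not in the labor force do not enter."""
    return unemployed / (employed + unemployed)

def employment_to_population(employed, unemployed, not_in_labor_force):
    """Employed (full-time or part-time) as a share of everyone of working age."""
    return employed / (employed + unemployed + not_in_labor_force)

# Hypothetical working-age populations of 1,000 people each (invented for illustration)
market_a = {"employed": 960, "unemployed": 40, "not_in_labor_force": 0}
market_b = {"employed": 600, "unemployed": 25, "not_in_labor_force": 375}

for name, m in [("A", market_a), ("B", market_b)]:
    rate = unemployment_rate(m["employed"], m["unemployed"])
    share = employment_to_population(**m)
    print(f"market {name}: unemployment rate {rate:.0%}, employed share of population {share:.0%}")
```

Both markets report 4 percent unemployment, yet only 60 percent of the working-age population is employed in market B. Part-timers would hide inside the employed count in the same way.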

The following chart then shows that despite the unemployment rate looking great, the U.S. labor market in 2018 looks nothing like what it looked like from 1990 to 2008.

Jc_unemployment_rate_explained

Technical notes: all the data are seasonally adjusted by the Bureau of Labor Statistics. I used a spline to smooth the data first - the top chart shows the smoothed version of the unemployment rates. Smoothing removes month-to-month sharp edges from the second chart. The color scale is based on standardized values of the smoothed data.

 

P.S. Part 2 of this series explores the different experiences of male and female workers. Also, the entire collection can be viewed here.