## Convenience charting

##### Jan 24, 2007

Statisticians have long railed against "convenience sampling", that is, the practice of selecting samples based on what's easily available rather than at random.  Picking your friends, say.

Dustin J sent in this example of what can only be called "convenience charting".  Dustin said he had no clue what this chart is saying, and I am not surprised.

The chart plots a statistical object known as the "survival function".  It is likely that "survival analysis" was done, after which the chart creator picked up the resulting statistical object and dumped it onto this "convenience chart".

If we take the top line on the "child survival" graph, it shows the probability of one child surviving up to a certain age, if the child belonged to a family with 1-3 kids.  The chance is about 92.5% that the child will survive through age 2, and 88% that the child will survive through age 18.  The difference between those percentages is due to the chance that the child may die between ages 2 and 18.

A slight transformation of the data will make this point much clearer.  What is the probability of a child dying by a certain age?  Using the example, a child has a 12% chance of dying by age 18, and a 7.5% chance of dying between ages 0 and 2.
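To make the transformation concrete, here is a minimal Python sketch. It uses only the two survival values quoted above for the 1-3 kids curve; the starting value of 1.0 at age 0 and the function names are my own assumptions, not anything taken from the paper.

```python
# Survival probabilities read off the top curve (families with 1-3 kids).
# The 2- and 18-year values are the ones quoted in the post;
# S(0) = 1.0 is an assumption (every child is alive at birth).
survival = {0: 1.0, 2: 0.925, 18: 0.88}


def death_by(age):
    """Cumulative probability of dying by `age`, i.e. 1 - S(age)."""
    return 1.0 - survival[age]


def death_between(start, end):
    """Unconditional probability of dying between two ages: S(start) - S(end)."""
    return survival[start] - survival[end]


def death_between_given_alive(start, end):
    """Probability of dying between two ages, given survival to `start`."""
    return death_between(start, end) / survival[start]


if __name__ == "__main__":
    print(f"dying by age 2:   {death_by(2):.3f}")
    print(f"dying by age 18:  {death_by(18):.3f}")
    print(f"dying between 2 and 18: {death_between(2, 18):.3f}")
    print(f"same, given alive at 2: {death_between_given_alive(2, 18):.3f}")
```

The gap between the two survival percentages is 4.5 points; for a child who has already reached age 2, that works out to a chance of a little under 5% of dying before 18.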

The junkart chart depicts this probability.  (I reverse-engineered the data which explains why the distances between the line segments look strange.)

What this chart doesn't address is how we are to interpret the probability of "a child dying" in a family with more than one child.  Is it a random child dying?  At least one child dying?  Exactly one child dying (the other X-1 surviving)?
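The three readings give quite different numbers. Here is a rough sketch, assuming (unrealistically) that siblings die independently of one another and that each faces the same 12% chance of dying by age 18. The family size of 3 is purely illustrative, and the function names are hypothetical.

```python
def random_child_dies(p):
    """Chance that one randomly chosen child dies: just p."""
    return p


def at_least_one_dies(p, n_children):
    """Chance that at least one of n children dies, assuming independence."""
    return 1.0 - (1.0 - p) ** n_children


def exactly_one_dies(p, n_children):
    """Chance that exactly one child dies and the other n - 1 survive,
    assuming independence (binomial probability with k = 1)."""
    return n_children * p * (1.0 - p) ** (n_children - 1)


if __name__ == "__main__":
    p = 0.12  # chance of dying by age 18, from the 1-3 kids curve
    n = 3     # illustrative family size (an assumption)
    print(f"a random child dies:  {random_child_dies(p):.3f}")
    print(f"at least one dies:    {at_least_one_dies(p, n):.3f}")
    print(f"exactly one dies:     {exactly_one_dies(p, n):.3f}")
```

Under these assumptions the answers are 12%, about 32% and about 28%, so the choice of definition matters a great deal. Real siblings share households and risks, so independence is at best a first approximation.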

The original chart also committed a number of standard errors.  The child survival function represents probabilities, not percentages.  The third category should be 8-11 kids, not 7-11.  If we are picky, then we would also like to see "confidence intervals" because there must have been many fewer families in the 12+ sample than the 1-3 sample.  In the second chart (which I don't have space to discuss), some data labels are missing, which indicates a presumption that all readers have seen the first chart.

Reference:  "Child, Parents Drive Each Other to Early Graves", Washington Post, Jan 14, 2007.


I just noticed a really bizarre feature of the Washington Post graph: a little pair of break lines below "75" on the "percentage survival" scale, below which was a "0" next to the "childs age" scale.

I wonder what the graph's designer thought he was achieving by that? I suppose he had vaguely remembered that you're supposed to start a scale from zero, but hadn't quite thought it through...

The other problem is if the kid dies before the other kids are born... so a family may have had 12 kids over time because they kept losing them. Looking at huge early age mortality, this seems likely to me. Little Mary died, so Ma and Pa try for another. =That= one died as a baby, so they go for it again.

Other than the differing scales, poor curve labels, weird shading, and impossibly vague reference, these are perfectly good--if idealized--survival curves. More sophisticated versions of these charts are the norm in many medical journals, as well as engineering journals, where they're called reliability functions. The transformed chart is nice, BUT NOBODY USES THAT KIND OF CHART.

Confidence intervals are nice, and the charter could have started off with ONE curve plus confidence interval to let the reader "calibrate his eye." But the story of these charts is how the survival function changes with number of children, and confidence bands would have cluttered a static chart to the point of unreadability.

Definitely NOT a junk or convenience chart, just an interesting chart poorly executed.

Derek: I find the break to help me know without thinking about the numbers that the axis doesn't really start at zero. So I don't think it's a defect.

Mike: I agree. As a former engineer, I'm familiar with survival curves. The reworked chart does nothing for me. I don't think it's depicting standard errors or confidence intervals, it just shows the distribution in a different way, a way I find harder to internalize quickly.

What's lacking in the original charts is the analysis. Meep took a good first cut at it ("a family may have had 12 kids over time because they kept losing them"). Likewise, maternal mortality was relatively high in the first month or so post partum, and the more kids she had, the more worn out a mother was, so the less likely she would be to survive yet another delivery. In addition, both mothers and fathers would be older for each subsequent birth, and the survival curves would be expected to be lower for older subjects.

Continuing the analysis of why the curves change as they do: If a kid had more siblings, it would enjoy a lower proportion of a parent's care. The parents would also be older and maybe less capable of providing adequate care.

By the way, Kaiser, I'm liking these "multiple one-dimensional scatter graphs" more and more as I use them and see them used; I produced a labelled one at work yesterday that really clarified the discussion we were having.

Is there a catchy name for them? If there isn't I think we should coin one and spread it around. How about "dash graph"?

I have to agree with Mike: I found the original graphs easier to understand than the new graphs. The lines do an effective job of showing me decreasing survival over time, and the 4 lines on the same chart help show the relative differences. I am not saying they are great charts but I think they show the information better than the new set...

Jon, I know it's boring of me to say "Tufte" all the time, but I'm with Tufte on this one: if the fact that the scale doesn't go down to the origin needs emphasising, then don't let the vertical and horizontal scales touch. This designer went to the trouble of taking the vertical scale all the way down until it touched the horizontal scale, and then added an artificial gap, when a natural gap would have done the same job.

Actually, I see on closer examination that he also went to the trouble of making the horizontal scale extend past the zero mark just so that it could touch the vertical!

Survivorship curves are very common, but graphs of age-specific death probabilities are not at all rare--it just depends on what the researcher is focusing on. I've read the original PNAS paper; the WashPost figure on age-specific child survival is for sibship size, but there is also a figure in the paper for birth order.

I'll just add that we should feel for people like Dustin (and I'm sure many other readers) who do not have engineering or statistics training. Concepts of survival and censoring mean nothing to them.

On confidence intervals, how wide those bands would be depends on the sample sizes. I'm suggesting that the bands for 8-11, 12+ would be very wide given what I think is the low incidence of such families, but I could be wrong.
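As a back-of-envelope illustration of that dependence, the sketch below treats survival to age 18 as a simple binomial proportion and uses the usual normal-approximation interval. This is only a rough guide, not the method used in the study, and the two sample sizes are made up for illustration.

```python
from math import sqrt


def wald_half_width(p, n, z=1.96):
    """Half-width of the normal-approximation (Wald) interval for a
    proportion p estimated from n observations; z = 1.96 gives ~95%."""
    return z * sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    p = 0.88  # survival through age 18, from the 1-3 kids curve
    # Hypothetical sample sizes, NOT taken from the study.
    for label, n in [("large group", 10_000), ("small group", 400)]:
        hw = wald_half_width(p, n)
        print(f"{label:12s} n={n:6d}  band = {p:.3f} +/- {hw:.4f}")
```

With these invented numbers the band is roughly 0.6 percentage points either side for the large group and roughly 3.2 points for the small one. The width shrinks with the square root of the sample size, so a group 25 times smaller gets a band 5 times wider.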

One subtle thing that I changed was the data series on the horizontal axis (family size instead of child's age).

Wait a sec -- and that mortality of mother by number of kids in family.... might it also be that women who have 12+ kids are =older= when they give birth to the last one? Even back in ye olden days women could be having children into their 40s (not likely, but it's possible).

I don't see that they're controlling for factors that would bias this data, such as having more children because of a factor such as having a farm to run (as opposed to a less labor-intensive store), or because for some reason the children keep dying at a young age and thus you go for more kids. And yes, people in ye olden days could control having or not having kids in the obvious manner.

Kaiser:

Mean children ever born in this population is 8.04, so the samples for 8-11 and 12+ aren't as small as you might think.

Meep:

The kid survivorship estimates were from Cox regressions with controls that included whether the parents were alive 5 years after the birth of the last kid, sibship size, and birth order. And this population is one known to be without much parity-specific birth control.

Kaiser:

I thought someone else had already mentioned this but in re-reading the comments I see that it wasn't specifically pointed out. In your original post, you said that the original WashPost figure had mis-labeled the child survivorship vertical axis with percentages rather than probabilities. Actually, survivorship curves are for the proportion surviving. (These are, of course, equivalent to the cumulative probability of surviving to a particular age)

Wow, we are getting technical here.

Robert: especially since the survival curves are not empirical but fitted, the vertical axis represents the "ideal" probability of survival. Hiding in there is the frequentist view of probabilities as the limiting case of proportions.

Robert: thanks for pointing out the mean family size; I hadn't noticed that this data was from the 19th century and also from a Salt Lake City-based database.

I've re-read the article and noticed that the way they described the curve was something like "probability of a child dying by age 18" rather than "probability of a child surviving through age 18". This is one of those subjective things about graphing. Some of us will prefer the former concept; others the latter.

Kaiser wrote: "Wow, we are getting technical here."

Yes, and that's actually a remarkable thing. The study was pretty technical but the graphics in the WashPost, flawed though they are, make the topic quite accessible.

Derek -

"[I]f the fact that the scale doesn't go down to the origin needs emphasising, then don't let the vertical and horizontal scales touch."

Good point, but then you'd have to add a vertical line along the axis, so it was clear that it didn't extend all the way down, thus adding chart-ink. (That's a bit tongue-in-cheek: Unlike Tufte, I don't mind a few extra black pixels if it helps clarify, as does an interrupted axis, or one that doesn't touch the other.)
