Unlearning descriptive statistics

If you've ever used an arithmetic mean, a Pearson correlation or a standard deviation to describe a dataset, I'm writing this for you. Better numbers exist to summarize location, association and spread: numbers that are easier to interpret and that don't act up with wonky data and outliers.

Statistics professors tend to gloss over basic descriptive statistics because they want to spend as much time as possible on margins of error and t-tests and regression. Fair enough, but the result is that it's easier to find a machine learning expert than someone who can talk about numbers. Forget what you think you know about descriptives and let me give you a whirlwind tour of the real stuff.

The average

The arithmetic mean is one of many measures of central tendency. One particularly useful feature of the mean is that, whenever we lack outside information like a scientific theory, it is our best possible guess for what to expect in the future. Sum up all of the rainy days in your area for as many years as you have data for, divide by the number of years, and that's your best bet for how many rainy days to expect this year. Multiply by 10, and that's how many rainy days you can expect in a decade.

Because the mean is so closely tied to the expected value, it's a great number to use if you're an economist, a gambler, or an economist gambler.

Often, however, we're interested not in what we can expect but rather in what is typical, and these are two very different concepts. For example, some statistics about Homo sapiens: a mode of 2 legs, a median of 2 legs, a mean of 1.9 legs and an ordinal center of 1 leg. We can't expect every person on the planet to have two legs, but having two legs sure seems typical to me.

The problem with the arithmetic mean is that it does not correspond to anything or anyone, it just blends everything together. The median, on the other hand, can be interpreted as a typical sort of value. It's the point where half of the values are lower and half of the values are higher.
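
To see how far apart "expected" and "typical" can drift, here is a minimal Python sketch. The income figures are invented for illustration, and the last household is a deliberately extreme value.

```python
from statistics import mean, median

# Hypothetical yearly incomes (in thousands) for seven households.
# The last one is an invented extreme value.
incomes = [28, 31, 33, 35, 38, 41, 900]

average = mean(incomes)    # blends every value together
typical = median(incomes)  # half of the values are lower, half are higher

print(f"mean: {average:.1f}, median: {typical}")
```

Nobody on this made-up street earns anything close to the mean, while the median sits right in the middle of the six ordinary households.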

For colors, months, countries, brand names and any other kind of data that is not quantitative and has no order to it, there is no median, and instead the most common values (including the mode) and least common values are a good way to indicate what's typical and what's not.

First caveat: multimodal distributions have more than one central or typical value, and they are trickier to describe. If you look at how tall the average adult human is, you will find a bump around 165 cm and another around 175 cm. These local maxima (modes) are useful statistics, but often it pays to prod a little deeper and see why there's more than one peak in the first place. In the case of human height, the answer is obvious: a typical adult woman is roughly 165 cm tall, a typical adult man roughly 175 cm. Once you split the data by gender, the bimodal distribution disappears.

Second caveat: continuing with our dataset of adult human height, note that there are more women than men on this planet. However, the typical adult human is not therefore a 170 cm tall woman. The median of a dataset with two or more dimensions is not accurately represented by the median of each individual dimension. What you want is the centerpoint or the half-space median (also known as the Tukey median). With very many dimensions, however, the concept of a central value becomes less and less useful. Interestingly, this is true not just for humans but for machines as well: off-the-shelf nearest-neighbor algorithms don't work well in high-dimensional spaces.
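
A tiny sketch of why per-dimension medians can mislead, using three invented points in three dimensions. Every point has coordinates that sum to 1, yet the componentwise median lands somewhere that looks like none of them.

```python
from statistics import median

# Three hypothetical observations; each has coordinates that sum to 1.
points = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

# Take the median of each dimension separately.
componentwise = tuple(median(axis) for axis in zip(*points))

print(componentwise)       # the "median" of each column
print(sum(componentwise))  # no longer sums to 1 like the real points do
```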

The spread

The standard deviation measures how spread out different values are. Here's how it works:

you subtract the mean from each value
you square each deviation from the mean
you sum up the squared deviations and divide the sum by n
you take the square root of the average squared deviation.
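
The four steps above can be sketched in a few lines of Python. The temperatures are made up, and the function name is just an illustration; the standard library's statistics.pstdev does the same job.

```python
from math import sqrt
from statistics import pstdev

def standard_deviation(values):
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]  # step 1: subtract the mean
    squared = [d ** 2 for d in deviations]   # step 2: square each deviation
    mean_squared = sum(squared) / n          # step 3: sum and divide by n
    return sqrt(mean_squared)                # step 4: take the square root

# Hypothetical daily temperatures in °C.
temperatures = [9, 13, 11, 10, 12]
sd = standard_deviation(temperatures)

print(sd, pstdev(temperatures))  # the two should agree
```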

Why would you square something only to take its square root a couple steps later? Well, it's because we're not interested in whether a value is above or below the mean, but rather we wish to know how far away it is from the mean in either direction.

The average daily temperature on the Faroe Islands in August is 11°C. 9°C and 13°C are an equal distance from that mean, but if you put together -2°C and 2°C, you get an average deviation from the mean of precisely zero and that's not very informative. Solution: squaring turns negative values into positive values.

We square the distances to the mean to make them positive... but why not just remove all signs and take the absolute values? Squaring is a mathematical hack: computing the derivative of a squared difference is easy but computing the derivative of an absolute difference is a pain in the neck, and we need that derivative for maximum likelihood estimation of statistical models.

Easy differentiation is nice, but not terribly relevant when all you want to do is describe the spread of your data.

The standard deviation lacks an easy interpretation. People who are new to statistics often seem to think it represents the average distance of a value from the mean, but it doesn't. In normally distributed data, the standard deviation is about 25% larger than the mean absolute deviation.
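
For the curious, the 25% figure follows from a standard result about the normal distribution. Writing X for a normally distributed variable with mean μ and standard deviation σ (symbols introduced here for illustration):

```latex
\mathbb{E}\,|X - \mu| = \sigma \sqrt{\tfrac{2}{\pi}}
\quad\Longrightarrow\quad
\frac{\sigma}{\mathbb{E}\,|X - \mu|} = \sqrt{\tfrac{\pi}{2}} \approx 1.2533
```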

When communicating how far apart values are, use the mean absolute deviation or the median absolute deviation (MAD). These statistics have the distinct advantage that they stand for what your audience will think they stand for.
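
Both statistics are short enough to write by hand. A minimal sketch, with invented temperatures and illustrative function names:

```python
from statistics import mean, median

def mean_absolute_deviation(values):
    center = mean(values)
    return mean([abs(v - center) for v in values])

def median_absolute_deviation(values):
    center = median(values)
    return median([abs(v - center) for v in values])

# Hypothetical daily temperatures in °C.
temperatures = [9, 13, 11, 10, 12]

mean_ad = mean_absolute_deviation(temperatures)
mad = median_absolute_deviation(temperatures)

print(f"on average {mean_ad} °C from the mean, typically {mad} °C from the median")
```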

An acceptable substitute, also quite easy to interpret, is the interquartile range. Sort the data, put it into four bins of equal size, and return the lower and upper bound of the two bins in the middle, otherwise known as the first and third quartile. Half of your data is in between these goal posts. The interquartile range is the measure of spread you will usually see pictured as the box in a boxplot.

(The interquartile range is also sometimes communicated as a single number, the difference between the third and first quartiles.)
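
A sketch using statistics.quantiles from Python's standard library. The data is invented, and be aware that there are several conventions for computing quartiles, so other software may give slightly different goal posts for the same data.

```python
from statistics import quantiles

# Hypothetical measurements, including one extreme value.
values = [2, 4, 4, 5, 7, 8, 9, 11, 12, 15, 40]

# n=4 splits the data into four bins and returns the three cut points.
q1, q2, q3 = quantiles(values, n=4)
iqr = q3 - q1

print(f"half of the data lies between {q1} and {q3} (a range of {iqr})")
```

Note that the extreme value of 40 has no effect on either quartile.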

The location

Statisticians and mathematicians are lazy, so instead of devising one statistical method that works for data with a mean of 2 and a variance of 5, and another statistical method for data with a mean of 23 and a variance of 8.7, we will shrink and squeeze and stretch the data until it fits the method we already have. These standardized numbers are called pivotal quantities, quantities that make no reference to the mean or variance or any other parameter of a statistical distribution, and they are used a lot in statistics.

One such pivotal quantity is the z-score. To convert a dataset into z-scores, subtract the mean from each value and then divide each result by the standard deviation. This rescales the data to a mean of 0 and a standard deviation of 1, and if the data was normally distributed to begin with, the z-scores follow the standard normal distribution. Once in that standardized format, you can run all kinds of statistical tests, in particular Wald tests.
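
The z-transformation in a minimal Python sketch. The test scores are hypothetical, and the population standard deviation (dividing by n) is used.

```python
from statistics import mean, pstdev

def z_scores(values):
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (divides by n)
    return [(v - mu) / sigma for v in values]

# Hypothetical test scores out of 20.
scores = [11, 13, 15, 17, 19]
z = z_scores(scores)

print([round(value, 2) for value in z])
```

After the transformation the values have a mean of 0 and a standard deviation of 1, whatever scale they started on.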

Normalized data is also useful when comparing things. If you took a test and got 15 out of 20 questions right, is that above or below average, and exactly how far above or below?

Z-scores are great for statistical tests. As a basis for comparisons, they are flawed:

you have to know a lot about statistics to interpret a z-score: that the normal distribution is symmetrical, that ±1 standard deviation corresponds to 68% of the data and that ±2 standard deviations corresponds to 95% of the data;
z-scores do not magically turn any data into a normal distribution - if you z-transform data that is skewed, the transformed data will still be skewed.

A more easily interpretable number is the percentile rank. The 25th percentile is higher than 25% of the data but lower than 75% of the data. If you're in the 99th percentile, you're a part of the top 1 percent. The 50th percentile is our good friend the median. Percentile refers to the actual value, percentile rank is the fraction of the data it corresponds to. You can calculate the percentile rank for any value in a dataset.
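
A percentile rank takes only a couple of lines. This sketch counts the share of values strictly below the one you're interested in; other definitions handle ties differently, so treat this as one convention among several. The scores are invented.

```python
def percentile_rank(values, x):
    """Percentage of values that are strictly lower than x."""
    below = sum(1 for v in values if v < x)
    return 100 * below / len(values)

# Hypothetical test scores out of 20 for a class of ten.
scores = [8, 10, 11, 12, 13, 14, 15, 16, 17, 19]
rank = percentile_rank(scores, 15)

print(f"a score of 15 beats {rank:.0f}% of the class")
```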

As with the median, percentile ranks are immune to skew and kurtosis: regardless of whether most of your data is at the top or the bottom or the middle, the rank will give you a good idea of where in the data a value is located, while the z-score turns to gibberish when the underlying distribution of the data is not normal.

Strangely, I almost never hear statisticians talk about z-scores, but they pop up from time to time in news articles. I'm guessing they're part of an antiquated but influential syllabus at a bunch of journalism schools, but who knows. It's weird. Don't do it. Use percentile ranks.

The skew

Data is skewed when it contains a disproportionate number of small or large values, rather than the data being nicely spread out in both directions around the mean. If you graph the distribution, it will look lopsided, with the bulk of the data on one side and a long tail on the other. Negative skewness means the data is skewed to the left, which means it has a fat left tail, and positive skewness shows up as a fat right tail on a histogram.

Skewness is another statistic where I sometimes see non-statisticians trying to outdo statisticians. Skewness is a number that is used so little in statistics that even an experienced data scientist would have a hard time drawing a distribution of approximately the right shape if you gave them a skew statistic. Many wouldn't even be able to tell you how to calculate it. Uummm, it's the third moment, right?

How can we convey skewness if not through a statistic? For a technical audience, a QQ-plot can communicate how two distributions differ in shape. In every other situation, use a histogram. A histogram organizes the data into an arbitrary number of equal intervals, counts how many points fit in each interval, and plots those counts as a bar chart. It takes up more space than a number but you get to see the true shape of the data.
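
The binning behind a histogram is simple enough to sketch by hand. This is a bare-bones text version with invented data; it assumes the values are not all identical, and a plotting library will do a nicer job.

```python
def histogram(values, bins):
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes hi > lo
    counts = [0] * bins
    for v in values:
        # The maximum sits exactly on the last edge, so clamp it into the last bin.
        index = min(int((v - lo) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical right-skewed data.
values = [1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9, 13]
counts = histogram(values, bins=4)

for count in counts:
    print("#" * count)
```

Even in this crude form, the bulk on one side and the long tail on the other are visible at a glance.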

The outliers

One of the most basic statistical laws is that crazy things will happen, and more often than you'd think. As a result, there's almost never a reason to pay particular attention to outliers or to consider them a nuisance that must be dealt with, unless they are obvious errors.

In fact, analysis of just the bottom or top of a dataset opens you up to regression toward the mean, which will invalidate your conclusions.

But there are moments when you do need a way to spot anomalies, perhaps to detect fraud or malfunctioning machines.

It is common to look for outliers by identifying values that are more than 3 standard deviations from the mean. Intuitively this makes a lot of sense, because the standard deviation and the mean were probably the first things you calculated when you got the data, and a normal distribution has very little density at 3 standard deviations beyond the mean. However, x deviations from the mean is a self-defeating heuristic: it relies on exactly those measures that are inflated or skewed by outliers. Instead, use the median and median absolute deviation as your basic metrics, and pick whatever multiplier is of practical relevance to you.
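
Here is a sketch of the self-defeating heuristic in action. The sensor readings are invented, the function names are illustrative, and the multiplier of 3 is an arbitrary choice. One caveat on the median-based version: if more than half of the values are identical, the MAD is zero and everything else gets flagged.

```python
from statistics import mean, median, pstdev

def sd_outliers(values, k=3):
    """Values more than k standard deviations from the mean."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) > k * sigma]

def mad_outliers(values, k=3):
    """Values more than k median absolute deviations from the median."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    return [v for v in values if abs(v - center) > k * mad]

# Hypothetical sensor readings; the last two are clearly off.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 100, 120]

flagged_by_sd = sd_outliers(readings)
flagged_by_mad = mad_outliers(readings)

print("mean and SD flag:", flagged_by_sd)
print("median and MAD flag:", flagged_by_mad)
```

The two wild readings inflate the standard deviation so much that they hide themselves, while the median-based rule catches both.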

Caveat: be warned that just as a geometric median is not the same as the componentwise medians, which we discussed earlier, an observation can be an outlier even when none of its individual facets are outlying. To stick to anatomical examples: it's not uncommon to have a Y chromosome, and it's not uncommon to be a woman, but it would be rather special if a woman had a Y chromosome.

Cook's distance is one of a number of similar metrics that can detect outliers in multidimensional datasets. It's a useful technique but does require building a model of the data. Instead of hunting for outliers per se, we leave out one observation from the model at a time and check whether this single observation affects the model parameters one way or the other, the idea being that something can only count as outlandish if it has an outlandish impact on how we see the world. Because it adds or removes the entire observation, with all of its component variables, this technique can detect highly unusual observations that at first sight look perfectly normal.
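
A minimal leave-one-out sketch of Cook's distance for a straight-line model with two parameters, slope and intercept. The data is invented and the helper names are mine; in practice you would lean on a statistics package rather than roll your own.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def cooks_distances(xs, ys):
    n, p = len(xs), 2  # p = number of model parameters
    slope, intercept = fit_line(xs, ys)
    fitted = [slope * x + intercept for x in xs]
    mse = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - p)
    distances = []
    for i in range(n):
        # Refit the model without observation i.
        slope_i, intercept_i = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        # How much do all the fitted values move as a result?
        shift = sum((f - (slope_i * x + intercept_i)) ** 2
                    for x, f in zip(xs, fitted))
        distances.append(shift / (p * mse))
    return distances

# Hypothetical data: x = 7 and y = 2 are each unremarkable, the pair is not.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 2, 3, 4, 5, 6, 2, 8]

distances = cooks_distances(xs, ys)
most_influential = distances.index(max(distances))

print(xs[most_influential], ys[most_influential])
```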

The correlation

Take a daily aspirin and you are less likely to succumb to a heart attack. Higher temperatures, maintained for longer, kill more bacteria. Machinery subjected to heavier loads will break down sooner. A relationship between any two variables is an association; an association between two quantities (not gender or color but distance or weight) is a correlation.

Correlations are typically between -1 and 1, or ±100% if you prefer. Negative correlations simply mean that as one thing goes up, the other goes down. The longer folks have to wait for the bus, the worse their mood will be. 1 and -1 are both perfect correlations, meaning you can predict one variable from the other with absolute certainty, whereas a correlation between two variables of 0 means that knowing one variable provides you with absolutely no information about the other.

Digging a little deeper, we see that the Pearson product-moment correlation is a measure of linear association. The most popular flavor of statistical regression builds on Pearson's correlation by means of the variance-covariance matrix and is known as first-order linear regression. It works by drawing lines. Lines are really simple mathematical objects, y = ax + b, which is why we like them so much. Statisticians can do all sorts of crazy things with lines that make them not lines anymore while they get to pretend that they still are. The squiggly curves of polynomial regression still count as linear regression, for one.

Fundamentally, though, a correlation is still just a line, and not every relationship between two variables can be captured by a linear relationship that states "for each additional x, increase y by this amount". Toxins are generally harmless below a certain threshold and then very quickly become dangerous. Cheaper goods sell more, but below a certain price point other factors weigh more heavily on our purchasing decisions. It's not unusual for a strong nonlinear relationship to have a correlation close to zero.
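
A sketch of a perfectly predictable relationship that Pearson's correlation cannot see. The data is invented: y is simply x squared, over values of x that are symmetric around zero.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]  # y is completely determined by x

r = pearson(xs, ys)
print(r)
```

Knowing x tells you y exactly, yet the correlation comes out at zero, because the downward half of the curve cancels out the upward half.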

So you might think, okay, easy fix, we just need a number that reflects nonlinear correlation, like the Spearman correlation, which, sure, is better as long as you're dealing with a monotonic relationship. Or you might remember from an intro to stats that taking the square root or the logarithm of the dependent variable in a hockey stick graph will straighten it out, and Pearson may live to fight another day. (Not really though, Karl Pearson died in 1936.)
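
The Spearman correlation is just a Pearson correlation computed on ranks instead of raw values. A minimal sketch with invented data in which y doubles with every step of x; the helper names are illustrative, and tied values get the average of their ranks.

```python
from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def ranks(values):
    """Rank from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

xs = [1, 2, 3, 4, 5, 6]
ys = [2 ** x for x in xs]  # monotonic but far from a straight line

linear = pearson(xs, ys)
monotonic = spearman(xs, ys)
print(round(linear, 3), round(monotonic, 3))
```

Spearman reports a perfect relationship here while Pearson does not, but hand it a U-shaped curve and it will be just as lost as Pearson.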

But why do you want a number at all? When describing a dataset, as opposed to running statistical tests, there really isn't the need to condense data down to a number because you don't need that number for anything, it's not the basis for any additional mathematics. Instead, just draw a scatterplot, which shows the relationship in all its messy glory, no matter how bendy or how straight.

Anscombe's quartet is a famous example of four two-variable datasets that look very different when graphed but that nevertheless have an identical x-y correlation, as well as identical means and standard deviations.

If you do need to emphasize the underlying pattern and don't care for all of the little dots of a scatterplot, use software to draw a spline over the conditional median or mean of y at every value of x.

Still not happy and absolutely want a number? You would do well to shun correlations even so. While statisticians are generally quite good at estimating a correlation from a picture and vice versa, most people are not. There's a little game called Guess the Correlation; give it a try. Communicate the linear relationship between two variables through its slope instead, the "for each additional x, y will increase by this amount" thing we mentioned earlier. To calculate the slope of a linear association, multiply the correlation by the standard deviation of the variable on the y-axis and divide it by the standard deviation of the variable on the x-axis. Do this for many variables simultaneously, each while holding the others constant, and lo and behold, you've invented regression analysis!
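
The slope recipe as a short Python sketch. The data points are invented and the function names are illustrative.

```python
from math import sqrt
from statistics import pstdev

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def slope(xs, ys):
    # Correlation times the spread of y, divided by the spread of x.
    return pearson(xs, ys) * pstdev(ys) / pstdev(xs)

# Hypothetical measurements.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b = slope(xs, ys)
print(f"for each additional x, y goes up by about {b:.1f}")
```

This gives the same slope as an ordinary least squares fit of y on x, which is why it serves as the stepping stone to regression.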

Postscript: why did nobody tell me this?

The discipline we call statistics is a two-headed beast. Descriptive statistics is the attempt to make sense of large amounts of data. Each observation brings its own idiosyncrasies, so we must distill the data down to easier-to-read summaries, charts and comparisons between groups. Inferential statistics then takes these summaries and judges whether they are likely to hold true in general or whether they contain quirks, patterns that are particular to just your data.

Descriptive statistics is when you ask five people and they all tell you coffee makes them sleepy. Inferential statistics is the realization that a survey of five people isn't much information to go on and that actually, no, coffee is not a great sleeping aid. Means and medians are descriptive, hypotheses and margins of error are inferential.

Statistics attracts people from many different backgrounds but above all it attracts mathematicians. Descriptive statistics is a matter of communication, cognition, numeracy, even user experience. It's quite the challenge to do right but for those with a mathematical bent, descriptive statistics can be, well, a snoozefest, a discipline that barely rates above elementary arithmetic. Inferential statistics, on the other hand, is a theoretical delight.

(Similarly, much of Bayesian statistics relies on brute force simulation in lieu of the elegant little theorems of frequentist statistics. This sheds some light on the psychology of the statistician who turns sour at the first mention of a posterior probability.)

The disdain of statisticians for descriptive work has contributed to a peculiar situation where innovations in visualization are generally the work of outsiders and fringe figures like John Tukey, William Cleveland and Edward Tufte. Another consequence is that the descriptive statistics we use so much – the mean, the standard deviation, the correlation – are our go-to numbers not because they are the nicest way to describe a dataset, but because they are useful building blocks for statistical inference.

It would be nice to have numbers that can do double duty, statistics that work equally well for description and inference. But those numbers do not exist. As a result, everything you know about descriptive statistics is biased towards inference. If you want to become truly great at communicating quantitative information, these not-quite-descriptives have got to go.

Stijn Debrouwere writes about statistics, computer code and the future of journalism. Used to work at the Guardian, Fusion and the Tow Center for Digital Journalism, now a data scientist for hire. Stijn is @stdbrouw on Twitter.