# Glossary

These definitions refer to this set of numbers: 5, 5, 5, 8, 12, 14, 21, 33, 38.

Mean (arithmetic mean)

The mean is the most commonly used type of average. It is the total of all the numbers divided by how many numbers there are. In this case there are 9 numbers with a total of 5+5+5+8+12+14+21+33+38 = 141. The mean is therefore 141 divided by 9, or 15.667 (rounded to 15.7).

Median

For an odd number of numbers, the median is the middle number when they are arranged in order. The above set has 9 numbers, and they are in order: the middle number, i.e. the median, is 12. (For an even number of numbers the median is mid-way between the two middle numbers. So if we had just 8 numbers, 5, 5, 5, 8, 12, 14, 21, 33, the median would be mid-way between 8 and 12, i.e. 10.)

Mode

The mode is the number that occurs most frequently in a data set. For this set, the number 5 occurs most often, so the mode is 5. (If two or more numbers tie for most frequent, the data set is bimodal or multimodal. In that case the mode may be of limited usefulness.)
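The three averages defined above can be checked with a short sketch using Python's standard `statistics` module. The variable name `data` is ours, chosen for illustration.

```python
from statistics import mean, median, multimode

data = [5, 5, 5, 8, 12, 14, 21, 33, 38]

# Mean: total divided by the count (141 / 9).
print(round(mean(data), 1))   # 15.7

# Median: the middle value of the ordered data (odd count).
print(median(data))           # 12

# With an even count (drop the 38), the median is midway
# between the two middle values, 8 and 12.
print(median(data[:-1]))      # 10.0

# multimode returns every value tied for most frequent,
# so a bimodal data set would give a list of two values.
print(multimode(data))        # [5]
```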

Significance

In everyday speech, an event is significant if it is important or meaningful in some way. So, the election of Barack Obama was significant because he was the first black US president.

In statistics, however, significance is a technical concept – and one that is quite commonly misunderstood.

According to a current Gallup opinion poll, 45% of American adults think Barack Obama is doing a good job – his ‘approval rating’ is 45%. Suppose that in a week’s time Gallup carry out a new poll and it gives Obama an approval rating of 46%. A statistician might do a calculation or two and say that the rise from 45% to 46% is not significant. But what does that mean?

If you take a series of opinion polls the figures you get will vary even if people as a whole have not changed their views. Opinion polls use random samples, and in random samples the numbers vary. So what we want to know is whether a change (45% to 46% in this case) is the sort of thing you would expect to see from one sample to another, or indicates a genuine shift in public opinion. And that is what a statistician can calculate.

Without going into the calculation, the shift from 45% to 46% is well within the sort of variation you would expect from one sample to another – so we say it is not significant. It is the sort of change that could easily happen in a sample without there being any real increase in Obama’s approval rating in the population as a whole.

A change from 45% to 50%, however, is bigger than would be expected in a typical opinion poll. It could happen by chance, but the probability of it doing so is small. So a change like that would be described as significant, and if it happened we would suspect that there had been an increase in the approval of Obama in the American population.
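For readers who do want to see a calculation, here is a minimal sketch of one common approach, a two-proportion z-test. It is not necessarily what a polling organisation would use. The sample size of 1,000 people per poll is an assumption (the text does not give one), and the function name `poll_change_p_value` is ours.

```python
from math import sqrt
from statistics import NormalDist

def poll_change_p_value(p1, p2, n1, n2):
    """Two-sided P value for a change between two independent polls,
    using the normal approximation with a pooled proportion."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

N = 1000  # ASSUMED sample size per poll; not stated in the text

small = poll_change_p_value(0.45, 0.46, N, N)
large = poll_change_p_value(0.45, 0.50, N, N)
print(f"45% -> 46%: P = {small:.2f}")
print(f"45% -> 50%: P = {large:.3f}")
```

Under this assumption the first P value is well above 0.05 (not significant) and the second is below it. With much smaller samples even the five-point change might not be significant, so the conclusion depends on the assumed sample size.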

If you see a result described as significant at the 5% level, it means that if there were no real underlying change, there would be no more than a 5% chance of getting a result at least as extreme purely from random variation. You might suspect, therefore, that there is some real underlying change responsible for the result. If the significance level is given as 1%, the result is even less likely to have arisen by chance, so you will have a correspondingly stronger suspicion that there has been some real underlying change.

Finally, there is a potential source of confusion in the terminology. A smaller significance level indicates greater significance! But that does actually make sense: a result at the 1% level is more likely to indicate a real change than a result at the 5% level. The 1% level gives stronger evidence than the 5% level; that is, it is more significant.

More on significance: hypothesis testing

In statistical tests we usually work with two hypotheses, the null and the alternative. The null hypothesis is something like the status quo; it is the assumption we would make unless there was sufficient evidence to suggest otherwise. The alternative hypothesis represents a new state of affairs that we suspect (or perhaps hope) might be true.

For example, suppose we are testing a new medical treatment to see if it performs better than the existing standard treatment. The null hypothesis would be that the new treatment is no better (or worse) than the old; the alternative would be that it performs better.

A significance test assesses the evidence in relation to the two competing hypotheses. A significant result is one which favours the alternative rather than the null hypothesis. A highly significant result strongly favours the alternative.

The strength of the evidence for the alternative hypothesis is often summed up in a ‘P value’ (also called the significance level) – and this is the point where the explanation has to become technical. If an outcome O is said to have a P value of 0.05, for example, this means that O falls within the 5% of possible outcomes that represent the strongest evidence in favour of the alternative hypothesis rather than the null. If O has a P value of 0.01 then it falls within the 1% of possible cases giving the strongest evidence for the alternative. So the smaller the P value the stronger the evidence.
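A small worked case may make the definition more concrete. The figures are hypothetical: suppose the standard treatment helps half of patients (the null hypothesis is that the new one does no better), and the new treatment helps 16 out of 20. The P value is then the proportion of possible outcomes, weighted by their probability under the null, that favour the alternative at least as strongly as the one observed, namely 16 or more successes. The function name `binomial_p_value` is ours.

```python
from math import comb

def binomial_p_value(successes, n, p_null):
    """One-sided P value: probability of `successes` or more out of n
    when each patient is helped with probability p_null."""
    return sum(
        comb(n, k) * p_null**k * (1 - p_null) ** (n - k)
        for k in range(successes, n + 1)
    )

# Hypothetical trial: 16 of 20 patients helped, null success rate 50%.
p = binomial_p_value(16, 20, 0.5)
print(f"P = {p:.4f}")  # P = 0.0059
```

Here the outcome falls within the most extreme 1% of cases, so it would be described as significant at the 1% level.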

Of course an outcome may not have a small significance level (or P value) at all. Suppose the outcome is not significant at the 5% level. This is sometimes – and quite wrongly – interpreted to mean that there is strong evidence in favour of the null hypothesis. The proper interpretation is much more cautious: we simply don’t have strong enough evidence against the null. The alternative may still be correct, but we don’t have the data to justify that conclusion.

There are interesting parallels here with criminal cases in a court of law. The null hypothesis in court is that I am not guilty; this is the assumption we start with, the assumption we hold to unless there is sufficient evidence otherwise. The alternative is that I am guilty, and the court accepts that conclusion only if my guilt is shown ‘beyond reasonable doubt’. But if the prosecution fails to obtain a guilty verdict this does not show that I am innocent. Perhaps I am innocent; or perhaps I am guilty but the evidence is not strong enough to show that beyond reasonable doubt. In the latter case, additional evidence may emerge later and I may face a re-trial.

Likewise, if the evidence in favour of the new medical treatment is strong enough, we will want to adopt it. But if the evidence is weak we will stick with the standard treatment, at least until additional experimental evidence emerges to suggest that the new treatment may be better.

Standard deviation

The standard deviation (SD) is a measure of how spread out a dataset is. It tells us how far items in the dataset are, on average, from the mean. So the larger the SD, the more spread out the data are.

Though the SD is an average, it is an average calculated in a somewhat unusual way – and if you don’t want the technical detail you should skip to the next paragraph now. For the dataset {2, 3, 5, 8, 12} the mean is 6. So the deviations from the mean are {–4, –3, –1, 2, 6}. We square these deviations, {16, 9, 1, 4, 36}, and take their average, 13.2. Finally we take the square root of 13.2 to give us the SD, 3.63.
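The steps in the paragraph above can be followed in a few lines of Python. This is only a sketch of that particular recipe (dividing by the number of items), which the standard library calls the population standard deviation, `pstdev`.

```python
from math import sqrt
from statistics import mean, pstdev

data = [2, 3, 5, 8, 12]

m = mean(data)                            # 6
squared = [(x - m) ** 2 for x in data]    # [16, 9, 1, 4, 36]
sd = sqrt(mean(squared))                  # square root of 13.2

print(round(sd, 2))             # 3.63
print(round(pstdev(data), 2))   # 3.63, same recipe from the library
```

Note that `statistics.stdev` divides by one less than the number of items, so it would give a slightly larger answer for the same data.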

As an example, consider intelligence quotient or IQ (which remains popular despite the doubts of many psychologists and educationalists). IQ is usually measured on a scale with mean 100 and standard deviation 15, and the distribution of IQs in a population is, to a good approximation, Normal. We can therefore say that about 2/3 of people will have an IQ within 1 SD, that is 15 units, of the mean; so 2/3 of IQs will be between 85 and 115. About 95% of people will have IQs within 2 SDs of the mean, that is between 70 and 130. And it will be very rare to have an IQ more than 3 SDs from the mean, that is below 55 or above 145 (it’s about 1/8 of 1% in each case). The requirement to join Mensa is an IQ in the top 2% of the distribution; that amounts to an IQ just a little more than 2 SDs above the mean – about 131.
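The IQ figures quoted above can be reproduced with `statistics.NormalDist`, taking the mean of 100 and SD of 15 from the text and assuming the distribution is exactly Normal. The variable names are ours.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=15)

within_1_sd = iq.cdf(115) - iq.cdf(85)    # between 85 and 115
within_2_sd = iq.cdf(130) - iq.cdf(70)    # between 70 and 130
above_3_sd = 1 - iq.cdf(145)              # above 145 (one tail)
top_2_percent = iq.inv_cdf(0.98)          # cut-off for the top 2%

print(f"{within_1_sd:.3f}")     # 0.683, about 2/3
print(f"{within_2_sd:.3f}")     # 0.954, about 95%
print(f"{above_3_sd:.5f}")      # 0.00135, roughly 1/8 of 1%
print(f"{top_2_percent:.1f}")   # 130.8, i.e. about 131
```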

Regression

In statistics we often look for connections between two or more variables; we often want to know if we can say something about one variable if we know something about another. For example, we might be interested in how weight varies with height, or how blood cholesterol level varies with the dose of statins taken. The exploration of relationships between variables is called **regression analysis** or **regression modelling**. The examples given below are for a simple regression analysis involving just two variables and a simple straight line model. More complex situations arise very often in statistics, but the same general principles apply.

Figure 1 illustrates some (fictitious) data of the type that might arise from an experiment.

On the horizontal axis we have a controlled or non-random variable, x, taking values 1, 2, …, 10. On the vertical axis the values, y, are not so neat: it looks as though the y values tend to increase as x increases, but there is clearly some other variation too giving rise to fluctuations in the values of y. If we can assume that these fluctuations are random, and if we can assume that without the fluctuations in y the points would lie on a straight line, then we can calculate the regression line for y on x as shown. This line is the best estimate we can make from the given data of the relationship between y and x. If we were able to eliminate the random variation in y, the data points would lie on a straight line; it is this straight line we are estimating.

The regression line can be used to predict the y value for a given x value. Predicting y for a value of x within the range of the data (e.g. x = 6.5 as shown) is called interpolation, and it is generally pretty reliable. Predicting y for a value of x way beyond the data (e.g. x = 20) is extrapolation and it can be very inaccurate as the simple straight line relationship between x and y may break down. (Many statistical howlers arise from extrapolation: a favourite example extrapolates the growth to date in the number of Elvis impersonators to predict that 1 in 3 of the world’s population will be an Elvis impersonator by 2019.)
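As a rough illustration, the sketch below fits a regression line of y on x by ordinary least squares. That is the usual method, although the text does not name one. The y values are invented for the purpose (they are not the data behind the figure), and the names `fit_line` and `predict` are ours.

```python
from statistics import mean

# Hypothetical data: x is controlled (1 to 10), y fluctuates about a rising line.
xs = list(range(1, 11))
ys = [1.3, 1.9, 3.4, 3.8, 5.3, 5.6, 7.4, 7.7, 9.2, 9.9]

def fit_line(xs, ys):
    """Least-squares regression of ys on xs; returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

slope, intercept = fit_line(xs, ys)

def predict(x):
    """Estimated y on the fitted line for a given x."""
    return intercept + slope * x

print(f"y = {intercept:.2f} + {slope:.2f} x")
print(f"interpolated y at x = 6.5: {predict(6.5):.2f}")
print(f"extrapolated y at x = 20:  {predict(20):.2f}  (treat with suspicion)")
```

The code will happily produce a number for x = 20, which is exactly the danger: nothing in the arithmetic warns you that the straight line may no longer apply.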

In figure 2, again for fictitious data, we have a rather different situation.

Here neither variable is controlled. Perhaps we have selected a random sample of individuals and measured two things, x and y. (In a real example we might measure height and weight, or IQ and salary, hoping in each case to spot any connections there might be.) Here the variation represents the differences between individuals rather than random fluctuations or errors of measurement, and no amount of fine measurement will eliminate it. So it doesn’t make sense to think in terms of a straight line relationship between y and x. Instead, we think in terms of the average (or mean) value of y for a given x. Under the right conditions (x and y coming from a two-dimensional normal distribution) the mean value of y as x varies will form a straight line. It is this regression line that is shown in the diagram. And we can use the regression line to predict the mean value of y for a given value of x (e.g. x = 6.5 as shown). As before, this is interpolation and fairly safe. Extrapolation to values of x beyond the data is unsafe and best avoided.

Because we have two random variables here, we could perfectly well be interested in the mean value of x as y varies. That is, we could be interested in the regression line of x on y. (Note that this would not have made sense for the situation shown in figure 1 where x was a controlled variable.) Figure 3 shows the regression line for x on y in red.

This regression line would be used to estimate the mean value of x for a given y (e.g. y = 4.5 as shown).

The fact that the two regression lines are quite distinct is sometimes found puzzling, but it need not be. Asking two distinct questions – how does the mean value of y vary with x? and how does the mean value of x vary with y? – should be expected to give two distinct answers!
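A short sketch shows the two lines coming out differently for the same data. The paired values are invented for illustration, and least squares is assumed as the fitting method.

```python
from statistics import mean

# Hypothetical paired measurements (neither variable is controlled).
xs = [2.1, 3.4, 3.9, 4.6, 5.2, 5.8, 6.1, 7.3, 8.0, 8.8]
ys = [3.0, 2.2, 4.9, 3.1, 5.6, 4.0, 6.3, 4.8, 7.1, 5.5]

x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

slope_y_on_x = sxy / sxx   # change in mean y per unit of x
slope_x_on_y = sxy / syy   # change in mean x per unit of y

# Drawn on the same axes (y against x), the x-on-y line has
# slope 1 / slope_x_on_y, which is steeper than the y-on-x line.
print(f"y on x: slope {slope_y_on_x:.2f}")
print(f"x on y: slope {1 / slope_x_on_y:.2f} when plotted as y against x")

# The product of the two slopes is the squared correlation; the
# lines coincide only when the correlation is perfect (equal to 1).
r_squared = slope_y_on_x * slope_x_on_y
print(f"r squared = {r_squared:.2f}")
```

Both lines pass through the point of means, so they cross there and fan apart by an amount that depends on how weak the correlation is.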

Regression to the mean

Children of tall parents are, on average, shorter than their parents. This is a simple example of a common phenomenon: regression to the mean, the tendency for extremes to be pulled back towards the average. And for the same reason children of short parents are, on average, taller than their parents.

In the case of heights, regression to the mean is just a fact and nobody worries too much about it. However, there are many other cases in which regression to the mean is seriously misunderstood.

One famous example is the Sports Illustrated ‘jinx’.

Appearing on the cover of Sports Illustrated is said to be a ‘kiss of death’: it is frequently followed by a loss of form, by failure of some sort. But of course the jinx is just regression to the mean. You only get to be on the cover if you are doing exceptionally well. And if you are doing exceptionally well now, then when regression to the mean kicks in you will be doing worse. All good runs come to an end. If Sports Illustrated started to feature those who were doing exceptionally badly on the front cover then people would be queuing up to appear, because if you are having a bad run then you will improve when regression to the mean occurs. (Of course good and bad runs don’t come to an end in a predictable way and some runs last longer than others. As ever in statistics we are talking about averages and randomness here.)
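The ‘jinx’ can be reproduced in a simulation with no jinx in it at all. The sketch below rests on a deliberately crude assumption of ours: each athlete's performance in a season is a fixed level of skill plus an independent dose of luck, both drawn from a standard Normal distribution. The ‘cover stars’ are simply the top 1% in the first season.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

N = 10_000
skill = [random.gauss(0, 1) for _ in range(N)]       # stable ability
season1 = [s + random.gauss(0, 1) for s in skill]    # ability plus luck
season2 = [s + random.gauss(0, 1) for s in skill]    # same ability, fresh luck

# "Cover stars": indices of the top 1% of performers in season 1.
stars = sorted(range(N), key=lambda i: season1[i], reverse=True)[: N // 100]

before = mean(season1[i] for i in stars)
after = mean(season2[i] for i in stars)

print(f"cover stars, season 1: {before:.2f}")
print(f"cover stars, season 2: {after:.2f}")
print(f"everyone, season 2:    {mean(season2):.2f}")
```

The cover stars do worse in the second season than in the first, yet still better than average: they were selected partly for skill, which persists, and partly for luck, which does not.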

The Sports Illustrated ‘jinx’ is a case of mistaking correlation for causation. Appearing on the cover may be correlated with a loss of form but it is not the cause of a loss of form. This misinterpretation is much more worrying when it affects serious areas such as health, where the fallacious reasoning can appear to defend quackery.

‘I was seriously ill and tried all sorts of things to no avail. But then I tried psychic surgery and I recovered. Therefore psychic surgery works.’ No: many people who are seriously ill get better eventually. It is regression to the mean that is responsible.