8  Estimating standard deviation

Estimating variability from data

“As we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns—the ones we don’t know we don’t know.”

― Donald Rumsfeld (U.S. Secretary of Defense, 2001-2006)

In the previous chapter, we learned about the variability within a sample, and the variability of a statistic, the sample mean, measured from that sample. We relied upon knowing the standard deviation of the variable being measured to be able to discuss variability: we assumed that the SD of human heights was about 4 inches.

What if we don’t know the standard deviation already? Are there any other measurements of variability instead of standard deviation? And what do we do about categorical variables, or combinations of variables?

Although data is random, and much is unknown, we can still find ways to quantify what we do know about the uncertainty in the data!

Recall that the standard deviation, usually written as s, is a measure of how far from the true mean we think values typically fall. But what happens when we have no guesses about the true mean or the true standard deviation?

Let’s consider the following dataset, which consists of measurements regarding the self-care of a very studious college student named Lori Bilmore. For a week, we track Lori’s hours of sleep the previous night, as well as how many coffees she drinks that day and what she eats for breakfast (besides coffee!) that day.

| Day of Week | Hours of Sleep | Number of Coffees | Breakfast |
|-------------|----------------|-------------------|-----------|
| Sun         | 9              | 0                 | Cereal    |
| Mon         | 7              | 3                 | Cereal    |
| Tue         | 4              | 4                 | Muffin    |
| Wed         | 10             | 2                 | None      |
| Thu         | 5              | 5                 | Cereal    |
| Fri         | 8              | 0                 | None      |
| Sat         | 6              | 1                 | Cereal    |

We can begin by finding the usual summary statistics for our three variables: the sample means for Hours of Sleep and Number of Coffees, and the sample proportions for the categories of the variable Breakfast:

| Hours of Sleep | Number of Coffees |
|----------------|-------------------|
| 7              | 2.142857          |

| Breakfast | n | proportion |
|-----------|---|------------|
| Cereal    | 4 | 0.5714286  |
| Muffin    | 1 | 0.1428571  |
| None      | 2 | 0.2857143  |
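If you want to follow along in R, here is a minimal sketch of these summaries. (The data frame lori and its column names are just illustrative choices, not part of any real dataset.)

```r
# Hypothetical data frame holding Lori's week of self-care measurements
lori <- data.frame(
  day       = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"),
  sleep     = c(9, 7, 4, 10, 5, 8, 6),
  coffees   = c(0, 3, 4, 2, 5, 0, 1),
  breakfast = c("Cereal", "Cereal", "Muffin", "None", "Cereal", "None", "Cereal")
)

# Sample means for the two quantitative variables
mean(lori$sleep)     # 7
mean(lori$coffees)   # 2.142857

# Sample proportions for the categories of Breakfast
table(lori$breakfast) / nrow(lori)
```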

To describe Lori’s typical weekly behavior more thoroughly, we’d also like to report the uncertainty in each of these variables. Let’s start with Hours of Sleep.

If the standard deviation is a measure of how far values are from the true mean, perhaps we can look at how far each observed value fell from the sample mean as a good approximation.

| Day of Week | Hours of Sleep | Dist from mean |
|-------------|----------------|----------------|
| Sun         | 9              | 2              |
| Mon         | 7              | 0              |
| Tue         | 4              | -3             |
| Wed         | 10             | 3              |
| Thu         | 5              | -2             |
| Fri         | 8              | 1              |
| Sat         | 6              | -1             |

Okay, we can see from the Dist from mean column, which subtracts 7 from each value, that some days she got more hours of sleep than average (positive numbers), and some she got less (negative numbers).

We might be tempted to average these values - go ahead and try adding them all up. You’ll see that they add up to 0! Of course, it’s not true that the standard deviation of Lori’s hours of sleep is 0; there is plenty of variability day-to-day. It’s simply that the negatives balance out the positives.

This fact is exactly what makes the mean special as a measure of center. It is the number that balances out the negative and positive differences in the data.

We don’t want negative values to “cancel out” positive ones - being above or below the sample mean of 7 both represent variation. Thus, we’ll take the squared values instead, and look at their average.

| Day of Week | Hours of Sleep | Dist from mean | Squared dist |
|-------------|----------------|----------------|--------------|
| Sun         | 9              | 2              | 4            |
| Mon         | 7              | 0              | 0            |
| Tue         | 4              | -3             | 9            |
| Wed         | 10             | 3              | 9            |
| Thu         | 5              | -2             | 4            |
| Fri         | 8              | 1              | 1            |
| Sat         | 6              | -1             | 1            |

The average of the Squared dist column turns out to be 4. (Check this for yourself!) We therefore would say that the variable Hours of Sleep has a variance of 4.

We aren’t quite finished, though. The variance is not a very intuitive summary of the uncertainty in our Hours of Sleep variable. The “typical distance” that observations fall from the mean is not 4 - in fact, there weren’t any nights where Lori got 3 hours of sleep (4 below the mean) or 11 hours of sleep (4 above). The number 4 is not quite on the same scale as the variable Hours of Sleep, because of how we squared the distances.

Thus, to get the standard deviation instead, we’ll take the square root of the variance. The estimated standard deviation for Lori’s hours of sleep is 2.
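Here is the same by-hand calculation as a short R sketch, using an illustrative vector of the sleep values. One caution: R’s built-in var() and sd() divide by n - 1 rather than n, so they give slightly larger answers than the ones in this chapter.

```r
sleep <- c(9, 7, 4, 10, 5, 8, 6)   # Lori's hours of sleep

dists <- sleep - mean(sleep)   # distances from the sample mean of 7
sum(dists)                     # 0: the negatives balance the positives

variance <- mean(dists^2)      # 4, dividing by n as in the text
sqrt(variance)                 # 2, the estimated standard deviation

# R's built-ins divide by n - 1 instead, giving 4.67 and 2.16
var(sleep)
sd(sleep)
```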

Try this out! Make a guess about the standard deviation of the Number of Coffees variable. Then, calculate the actual estimated standard deviation from the data. Were you close?

8.0.1 Statistics and parameters revisited

Don’t forget that when we ask research questions, like “What is Lori’s average hours of sleep?”, we are implicitly asking about the parameter, i.e. the true mean hours per night over all her nights of sleep in college. We use our data to estimate that parameter.

Although we rarely ask research questions like “What is the standard deviation of Lori’s sleep?”, it’s still true that we are estimating a parameter - the true standard deviation over all her nights of sleep - by calculating the sample standard deviation from our data.

In other words, if you took a different sample the next week, you might get a slightly different estimate for both the sample mean and the sample standard deviation!

Just as you will often see the Greek letter \mu representing the true mean of a variable, you will sometimes see the Greek letter \sigma (“sigma”) representing the true standard deviation.

Since we rarely ask questions about \sigma and only estimate s from the data, we won’t use this symbol in the rest of this class.

8.0.2 Other measures of variability

You might reasonably ask why, in the process of calculating the standard deviation, we squared the values to make them all positive, instead of taking the absolute value, or raising them to the 4th power, or any other math that makes them positive.

The short answer is: there are some mathematical reasons, which you will probably never encounter in your work, why squaring these distances is convenient.

The longer answer is: those other suggestions for how to make the distances positive are in fact totally reasonable - just different - ways to quantify variability! In fact, the option of taking the absolute value of each distance before averaging leads to a measure called the mean absolute deviation, which is less commonly used than the standard deviation, but is still perfectly valid.
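For example, with the same illustrative sleep values as before, the mean absolute deviation is a one-liner in R:

```r
sleep <- c(9, 7, 4, 10, 5, 8, 6)

# Average the absolute distances instead of squaring them
mean(abs(sleep - mean(sleep)))   # 12/7, about 1.71
```

Notice it lands near, but not exactly at, the standard deviation of 2 - the two measures treat large distances differently.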

Try Exercises 2.2.1 now.

8.1 Standard deviation in dummy variables

How will we measure variability in a categorical variable?

Let’s think back to how we transformed categorical variables into dummy or one-hot-encoded variables:

| Day of Week | Breakfast | Day of Week_Sun | Day of Week_Mon | Day of Week_Tue | Day of Week_Wed | Day of Week_Thu | Day of Week_Fri | Day of Week_Sat | Breakfast_Cereal | Breakfast_Muffin | Breakfast_None |
|-------------|-----------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|-----------------|------------------|------------------|----------------|
| Sun         | Cereal    | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Mon         | Cereal    | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| Tue         | Muffin    | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Wed         | None      | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| Thu         | Cereal    | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| Fri         | None      | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| Sat         | Cereal    | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |

Now that our categorical variables have been converted into separate quantitative variables, we may proceed with finding the standard deviation of each. For example, looking at the Breakfast_Cereal column, we know the mean has to be 0.57 (4/7). We can therefore take the usual steps to calculate the variance and standard deviation:

| Day of Week | Breakfast_Cereal | Dist from mean | Squared dist |
|-------------|------------------|----------------|--------------|
| Sun         | 1                | 0.43           | 0.1849       |
| Mon         | 1                | 0.43           | 0.1849       |
| Tue         | 0                | -0.57          | 0.3249       |
| Wed         | 0                | -0.57          | 0.3249       |
| Thu         | 1                | 0.43           | 0.1849       |
| Fri         | 0                | -0.57          | 0.3249       |
| Sat         | 1                | 0.43           | 0.1849       |

\text{variance} = 0.24

\text{standard deviation} = \sqrt{\text{variance}} = \sqrt{0.24} \approx 0.49
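As a quick check, here is a sketch of the same steps in R, building the dummy variable from an illustrative vector of the breakfast values:

```r
breakfast <- c("Cereal", "Cereal", "Muffin", "None", "Cereal", "None", "Cereal")
cereal <- as.numeric(breakfast == "Cereal")   # 1 if cereal, 0 otherwise

dists    <- cereal - mean(cereal)   # mean(cereal) is 4/7, about 0.57
variance <- mean(dists^2)           # about 0.24
sqrt(variance)                      # about 0.49
```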

Do you think the standard deviation for the Breakfast_Muffin and Breakfast_None variables will be larger or smaller than it was for Breakfast_Cereal? Why?

Hint: Think about how different most of the observed values seem to be from each other.

8.1.1 Sample size and sample mean revisited

In the previous chapter, we saw how the standard deviation of \bar{x}, the sample mean, was smaller when the sample size was larger.

You may recall from the Unit One discussion of dummy variables that the mean of a dummy variable is the same thing as the proportion for that category. The mean of the Breakfast_Cereal variable was 0.57, because 57% of Lori’s breakfasts were cereal.

Thus, we can compute the standard deviation of the sample proportion in the exact same way we did for the mean, since the sample proportion is just a sample mean of a dummy variable!

\text{standard deviation of sample proportion} = \frac{\text{standard deviation of one observation}}{\sqrt{n}}

We estimated that 57% of Lori’s breakfasts are cereal. The standard deviation of our estimate was

s_{\hat{p}} = \frac{s_x}{\sqrt{n}} = \frac{0.49}{\sqrt{7}} = 0.185.

Here we are using the notation \hat{p} instead of \bar{x}, to remind ourselves that we are talking about a sample proportion, not a sample mean from a quantitative variable.

Can you guess what symbol we might use to represent the true proportion, instead of \mu for a true mean? If you guessed that it’s a Greek letter, you’re right! We often use \pi (“pi”) for the parameter.

8.1.2 Making arguments about proportions

Suppose Lori tells you, “I only eat cereal half the time.” Do you believe her? Well, if she is telling the truth, her true proportion is 0.5.

Let’s calculate the z-score of our sample proportion, in exactly the same way we did for an ordinary sample mean:

\text{z-score of } \hat{p} = \frac{\hat{p} - \pi}{s_{\hat{p}}} = \frac{0.57 - 0.5}{0.185} = 0.378

Don’t let the math notation scare you - this is still our old familiar friend the z-score.

\text{z-score} = \frac{ \text{observed value} - \text{presumed parameter}}{\text{SD of observed value}} = \frac{\hat{p} - \pi}{s_{\hat{p}}}

We have found that our sample proportion of 0.57 was less than one standard deviation from Lori’s proposed proportion of 0.5. That’s pretty close! Maybe Lori is telling the truth and she eats cereal half the time - after all, if that is true, we shouldn’t be at all surprised that she ate it 4 of 7 days this week.
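As an R sketch, with the values rounded as in the text:

```r
p_hat  <- 4/7               # sample proportion of cereal breakfasts
s_phat <- 0.49 / sqrt(7)    # SD of the sample proportion, about 0.185
(p_hat - 0.5) / s_phat      # z-score, about 0.38
```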

8.1.3 Shortcut formula

Because the sample standard deviation of a dummy variable is a very specific special case, it turns out there is a nice shortcut for calculating it, using only the corresponding sample proportion, \hat{p}:

s_x = \sqrt{\hat{p}*(1-\hat{p})}

Then, to find the standard deviation of the sample proportion, we simply divide by the square root of the sample size as usual:

s_{\hat{p}} = \sqrt{\frac{\hat{p}*(1-\hat{p})}{n}}
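A short sketch confirming the shortcut against the by-hand numbers (the small gap from 0.185 comes from having rounded s_x to 0.49 earlier):

```r
p_hat <- 4/7
sqrt(p_hat * (1 - p_hat))       # about 0.49, SD of the dummy variable
sqrt(p_hat * (1 - p_hat) / 7)   # about 0.187, SD of the sample proportion
```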

Try Exercises 2.2.2 now.

8.2 The Interquartile Range

All our analysis so far has used the standard deviation to measure uncertainty or variability. But the “typical distance from the mean” is not the only way to quantify variation among observed values!

Let’s turn now to a measure of variability that is designed to go along with the median: the interquartile range (IQR).

8.2.1 Quartiles

Recall that the median is the value that splits the observed values into half above it, half below it. For example, the median of the variable Number of Coffees is 2.

Now let’s focus only on the values below (or equal to) the median:

\;\; 0\;\; 0\;\; 1\;\; 2\;\;

The median of these values is 0.5. This is called the first quartile.

Similarly, let’s look at the values above (or equal to) the median:

\;\; 2\;\; 3\;\; 4 \;\; 5

The median of these values is 3.5. This is called the third quartile.

These values are called quartiles because, along with the median, they split the observed values into (approximately) one-quarter sections:

0 \;\; 0 \;\; | \;\; 1 \;\; [2] \;\; 3 \;\; | \;\; 4 \;\; 5

If the 1st quartile (Q1) has approximately 25% of the observations below it, and the 3rd quartile (Q3) has approximately 75% of the observations below it, what do you think the 2nd quartile would be? What about the 4th quartile? The 0th quartile?

It’s logical that the 2nd quartile should have 50% of the values below it - this is just the median! The 4th quartile should have 100% below or at it - this is the maximum! And the 0th quartile is, of course, the minimum.

Often, when describing a particular quantitative variable, we like to give the five-number summary of that variable:

  • Minimum: 0
  • Q1: 0.5
  • Median: 2
  • Q3: 3.5
  • Maximum: 5

This gives us a pretty good feel for the distribution of the variable Number of Coffees: the minimum and Q1 are close to the median, but Q3 and the maximum are a bit further from it, indicating that the values are more spread out on the high end, so this variable is a little right-skewed.

(Of course, this example is a bit silly - we are summarizing 7 observations with a five-number summary! In larger datasets, it becomes more useful to limit our summary to only five important numbers.)
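In R, the summary() function reports the five-number summary (plus the mean) directly; here is a sketch with an illustrative vector of the coffee counts:

```r
coffees <- c(0, 3, 4, 2, 5, 0, 1)
sort(coffees)      # 0 0 1 2 3 4 5
summary(coffees)   # Min 0, Q1 0.5, Median 2, Q3 3.5, Max 5 (and Mean 2.14)
```

(There are several conventions for computing quartiles; R’s default happens to agree with the split-the-halves method used here, but on other datasets the two can differ slightly.)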

8.2.2 The IQR

If we would like to think about how uncertain, or “spread out”, values of a variable are, one thing we can look at is how close together the “middle” or “typical” values are.

Try to fill in the blank: On half of the days measured, Lori drank between ____ and ____ cups of coffee. There are multiple ways to fill in the blanks that are technically correct - but if we think about the middle 50% of all observed values, we would say “Lori drank between 0.5 and 3.5 cups of coffee”.

The size of this middle range, between the first and third quartiles, is a measure of spread called the Interquartile Range (IQR). In this case, our IQR is 3.5 - 0.5 = 3.

Similarly to how we count standard deviations from the mean with the z-score, we can count IQRs away from the edges of this interquartile range. That is, we can quantify how extreme a value is by asking how far it falls from either Q1 (if it is extreme on the small end) or from Q3 (on the large end).

For example, on Thursday, Lori had a tough day, and she drank 5 cups of coffee. How extreme is this?

The distance between the value 5 and the 3rd quartile (3.5) is 1.5. This is equal to (1.5/3 =) half an IQR above the 3rd quartile. Thus, we might say that Thursday was not very extreme.
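Expressed as a quick calculation:

```r
coffees <- c(0, 3, 4, 2, 5, 0, 1)
q1 <- 0.5
q3 <- 3.5
iqr <- q3 - q1    # 3; IQR(coffees) returns the same value here
(5 - q3) / iqr    # 0.5: Thursday sits half an IQR above Q3
```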

8.3 Outliers

It is common in data analysis to attempt to identify outliers, or values that are exceptionally extreme in our data.

There are two possible reasons we might get an outlier value in a dataset:

  1. A mistake in the data collection or recording. Perhaps a value was entered wrong into a spreadsheet - for example, a 10 was typed instead of 100. This is a particularly sinister relative of missing data that we would need to deal with: the true value is gone, but nothing in the dataset looks absent.

  2. Sampling variability: It could simply be that we happened to sample an observation that is extreme. For example, if we were studying people’s heights and happened to sample 7’6” tall basketball player Yao Ming, he would certainly appear to be an outlier!

8.3.1 Identifying outliers

There is not a strict definition of what makes a value an outlier, rather than simply more extreme than the rest. The most common “rule of thumb” for extreme values is to check whether they fall outside the fences:

  • lower fence = 1st quartile - (1.5 times IQR)
  • upper fence = 3rd quartile + (1.5 times IQR)

In the Number of Coffees variable, our lower fence is (0.5 - 1.5*3 =) -4. Of course, we can’t have negative cups of coffee, so this isn’t very useful. Our upper fence is (3.5 + 1.5*3 =) 8. So we might say that if Lori has over 8 cups of coffee, we should either check our data input, or be very concerned about her! (Eight cups of coffee sounds concerning even without the math!)
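A minimal sketch of the fence check in R, using the same illustrative coffee counts:

```r
coffees <- c(0, 3, 4, 2, 5, 0, 1)
q1 <- 0.5
q3 <- 3.5
iqr <- q3 - q1

lower <- q1 - 1.5 * iqr   # -4
upper <- q3 + 1.5 * iqr   #  8

coffees[coffees < lower | coffees > upper]   # none this week
```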

It’s important to emphasize that the rule of 1.5*IQR has no secret mathematical reason - it’s just something that scientists over the years have agreed on as a good cutoff.

We can also sometimes identify outliers using the mean and standard deviation, rather than the median and IQR. There is no universal rule of thumb in this case, but anything more than four standard deviations from the mean is probably worth taking a look at.

Six Sigma is a training program for data analysis in business. It takes its name from a common cutoff of six standard deviations - a.k.a. 6*\sigma - for a process to fail a quality control check.

8.3.2 Handling outliers

Much like with missing data, we must be very cautious when deciding whether to change or remove outlier values. If we truly believe the outlier was a data mistake, then certainly the mistake should be either fixed or removed. However, if it’s possible we simply have an extreme value in our sample by luck, this requires more thought.

If our extreme value is an influential outlier, perhaps we should remove it. For example, if I want to estimate average human height by sampling 10 people, and I happen to sample Yao Ming, my estimate is going to be much higher than is reasonable! I might be better off using only the 9 non-outlier samples for my analysis.

The possibility of removing outliers presents an ethics challenge for researchers. Imagine that I hope to prove that Lori doesn’t drink much coffee, and I’m willing to be unethical in my analysis. I might declare the value of 5 to be an “outlier” and remove it, taking the sample mean from 2.14 cups per day to 1.67 cups per day.
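To see how much that single removal moves the estimate:

```r
coffees <- c(0, 3, 4, 2, 5, 0, 1)
mean(coffees)                  # 2.14, using all seven days
mean(coffees[coffees != 5])    # 1.67, with Thursday's value dropped
```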

Try Exercises 2.2.3 now.