11  Review

Stating and testing null hypotheses

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

― Sherlock Holmes

Consider the following research study: A waiter wishes to determine which customers at his restaurant are likely to give him higher tips. For one week, he records information about each table he serves. He then studies whether certain traits are associated with higher or lower tips.

This study was actually performed by a waiter in 1995. A few rows of the dataset are shown below, for reference:

 total_bill  tip percent_tip smoker  day   time size
      17.59 2.64   0.1500853     No  Sat Dinner    3
      25.21 4.29   0.1701706    Yes  Sat Dinner    2
      31.27 5.00   0.1598977     No  Sat Dinner    3
      20.08 3.15   0.1568725     No  Sat Dinner    3
      16.00 2.00   0.1250000    Yes Thur  Lunch    2

Suppose our waiter wanted to ask the question, Do people tend to spend more on their meal at Dinner than at Lunch?

We have spent the first two units of this class learning how to address that question.

First, we find the sample means and sample standard deviations of the total_bill variable, within each group (Lunch/Dinner):

   time total_bill_mean total_bill_sd total_bill_n
 Dinner           21.15         10.03           39
  Lunch           16.23          5.00           18

Next, we find the difference of sample means and the standard deviation of the difference of sample means:

\bar{x}_D - \bar{x}_L = 21.15 - 16.23 = 4.92

SD({\bar{x}_D}) = 10.03/\sqrt{39} = 1.6

SD({\bar{x}_L}) = 5.00/\sqrt{18} = 1.18

SD({\bar{x}_D} - {\bar{x}_L}) = \sqrt{1.6^2 + 1.18^2} = 1.99

Finally, we use the information about our statistic of interest, our expected value if there is no relationship, and the uncertainty in our statistic to calculate a standardized score.

z = \frac{4.92 - 0}{1.99} = 2.47

We conclude that our observed difference of means, 4.92, is 2.47 standard deviations away from what we would expect if people spent the same amount on lunch and dinner. Therefore, we found evidence that people spend more on dinner, on average.
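The whole calculation above can be checked in a few lines of Python, plugging in the numbers from the summary table:

```python
import math

# Summary statistics from the table (Dinner vs. Lunch total bills)
mean_d, sd_d, n_d = 21.15, 10.03, 39
mean_l, sd_l, n_l = 16.23, 5.00, 18

diff = mean_d - mean_l                   # difference of sample means
sd_diff = math.sqrt((sd_d / math.sqrt(n_d)) ** 2
                    + (sd_l / math.sqrt(n_l)) ** 2)

z = (diff - 0) / sd_diff                 # standardized score
print(round(diff, 2), round(sd_diff, 2), round(z, 2))  # 4.92 1.99 2.47
```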

11.1 The Null Hypothesis

In the above analysis, something happened that probably felt very natural, but was extremely important: We decided that the hypothesized value - the difference of means we would expect if lunch and dinner were equal - was zero.

Hopefully, it is clear why this is a reasonable choice. If the true mean of bills at dinner (\mu_D) is the same as the true mean of bills at lunch (\mu_L), then the true difference of means (\mu_D - \mu_L) will be zero.

In this reasoning, we came up with a null hypothesis: a claim about the parameters involved in our research question that corresponds to the idea that “nothing interesting” is happening.

To formally state a null hypothesis, we use the symbol H with a subscript 0 (“zero” or “null”). The hypothesis itself is usually written in symbols:

H_0: \mu_D = \mu_L or H_0: \mu_D - \mu_L = 0

It is also permissible to state the null hypothesis in words - but when doing so, it is important to be extremely clear that we are making a claim about parameters. That is, we are making a statement that might be true, or might not - it’s impossible to know without seeing the entire population, which we never get to do.

Good:

H_0: \text{The true mean of bill prices from dinner is the same as the true mean of bill prices from lunch.}

Not so good, because it is somewhat ambiguous about which parameters you are studying, and about the fact that you are making a claim about parameters not statistics:

H_0: \text{The average bills at dinner are the same as the average bills at lunch.}

It is absolutely incorrect to talk about statistics in your null hypothesis:

H_0: \text{The sample mean of bills at dinner is the same as the sample mean of bills at lunch.} This is not a hypothesis! We already know our sample means, $21.15 and $16.23, and they are definitely not equal. What we are trying to do is decide whether these sample means provide convincing evidence that the true means are also not equal.

11.1.1 One proportion

Suppose our waiter asked the question,

Were there more smokers or nonsmokers at this restaurant in 1995?

How would we translate this question to a null hypothesis?

First, let’s think about what parameter would address this question. An obvious statistic to calculate is the percent of tables that included a smoker versus those that did not:

 smoker  n   percent
     No 31 0.5438596
    Yes 26 0.4561404

Our parameter in this example is the true percent of dining parties that include a smoker.

Next, we’d ask ourselves what situation would be “not interesting” or “no conclusion”. In this example, our null situation is one where there are equal counts of smoking and non-smoking tables.

If the counts are equal, then clearly the percent of smoking tables is 50%. Thus, we have a null hypothesis of:

H_0: \pi_S = 0.5

or H_0: \text{The true long-term proportion of dining parties with smokers is 0.5.}
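To see where the sample percentages above come from, here is a minimal Python sketch using the counts from the output. Note that the observed proportions will essentially never equal the hypothesized 0.5 exactly:

```python
# Counts of non-smoking and smoking tables, from the output above
n_no, n_yes = 31, 26
n = n_no + n_yes

p_no = n_no / n      # observed proportion of tables with no smoker
p_yes = n_yes / n    # observed proportion of tables with a smoker

# Under H_0, the true proportion of smoking tables is 0.5;
# the sample proportions only approximate it.
print(round(p_no, 7), round(p_yes, 7))  # 0.5438596 0.4561404
```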

11.1.2 Multiple proportions

Now, what if we nuanced our question a bit to ask:

Is the proportion of diners who smoke higher on some days than on others?

The statistic we’d like to calculate is the percent of smokers versus non-smokers on each day of the week:

  day        No       Yes
  Fri 0.1666667 0.8333333
  Sat 0.6086957 0.3913043
  Sun 0.6153846 0.3846154
 Thur 0.5333333 0.4666667

Now, what is our null hypothesis? If there is no relationship between day of the week and smokers versus nonsmokers, we would see the exact same percentage of smokers on every day. In symbols,

H_0: \pi_F = \pi_{Sa} = \pi_{Su} = \pi_T

Notice that we are not saying that every day has 50% smokers. We have not made any claims about the percentage of smokers on any given day; only that they are all equal. Perhaps it is true that 10% of diners are smokers every single day. Perhaps it is true that 85% of diners are smokers every single day. Both of these would be situations where the day of the week has no impact on the type of customers who show up.
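The per-day percentages above can be reproduced from counts. The text does not show the per-day counts, so the counts below are reconstructed to be consistent with the proportions in the table - treat them as an assumption for illustration:

```python
# Per-day (non-smoker, smoker) counts. These are NOT given in the text;
# they are reconstructed to match the proportions shown above.
counts = {"Fri": (1, 5), "Sat": (14, 9), "Sun": (8, 5), "Thur": (8, 7)}

# H_0 says the four *true* smoker proportions are equal; the four
# *sample* proportions printed here will still differ from each other.
for day, (no, yes) in counts.items():
    total = no + yes
    print(day, round(no / total, 7), round(yes / total, 7))
```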

Try Exercises 3.1.1 now.

11.2 Testing hypotheses

When we perform a formal hypothesis test, we state a null hypothesis, then we see if the data observed is inconsistent with the null. That is, we are interested in the question:

Is it reasonable to think the statistic we calculated happened in a world where the null hypothesis is true?

Essentially, there are two possible truths about the universe we live in:

  1. The null hypothesis is true. The only reason we don’t see a summary statistic exactly equal to our hypothesized parameter is random variability.

  2. The null hypothesis is not true. We saw a summary statistic that is different from the hypothesized parameter because our hypothesized parameter is not reality.

Our goal is to decide which of these possibilities we most believe, based on the data we collected.

Take, for example, our question about whether people spend more money at dinner than lunch. Imagine that two people are having an argument:

Earl says:

People spend the same amount of money at lunch and at dinner, on average. Yes, in this data we saw an average of $4.92 more at dinner, but that’s not much money. Different groups of people spend different amounts, of course there’s going to be some fluctuations due to sampling variability.

Lulu says:

No way! $4.92 is not much money, but it’s over 2 standard deviations away from a difference of 0. If this fluctuation is due to sampling variability, we must have gotten really unlucky in our sample! It seems unlikely that we just randomly sampled some big spenders at dinner. It’s more reasonable to say that this data tells the real story: overall, people spend more at dinner.

Which argument seems the most reasonable to you? Hopefully, Lulu’s argument convinced you! Even if $4.92 doesn’t seem like much, it was a pretty unusual number to observe in a world where the true mean bill totals are equal. The explanation that this difference is due only to “luck of the draw” seems rather suspicious.

11.2.1 Data from the null hypothesis

Let’s now focus in on something Lulu said in her argument:

If this fluctuation is due to sampling variability, we must have gotten really unlucky in our sample! It seems unlikely that we just randomly sampled some big spenders at dinner.

Here, Lulu is claiming that we would have to get quite unlucky to, by pure chance, sample 57 dining parties and see a mean difference of $4.92.

Wouldn’t it be nice if we knew exactly how unlucky that would be?

It would be convenient if we could transport ourselves to an alternate universe, where we know for sure that the true mean bill totals are equal at dinner and at lunch. Then, we could take a sample of 57 dining parties, and see what we observe for the difference of sample means.

If we observe something bigger than $4.92, we will have demonstrated that getting $4.92 just by luck is totally possible, as Earl claims.

If we observe something smaller than $4.92, then maybe it really is unlikely, as Lulu claims.

But if we’re giving ourselves the magic power of alternate universes, why stop at one sample? Let’s take, say, 1000 samples of size 57. For each one, let’s find the difference of sample means.

Suppose we carry out this magical alternate universe process and we plot our resulting statistics. That is, we calculate 1000 differences of sample means, and put those 1000 values on a number line:

Each dot represents one “redo” of the study - a sample of 57 tables - from the null distribution, or the universe where the null hypothesis is true.

How many times did our simulated difference of sample means exceed the statistic we observed in the real data, $4.92?

It was certainly possible for a difference of $4.92 to occur in the null distribution; in fact, we even saw one simulated scenario where the difference was almost $8! However, these extreme values were uncommon. In the imaginary data for this visualization, a difference greater than $4.92 occurred only 18 out of the 1000 times we tried. (That is, there are 18 red dots to the right of the cutoff line.) This tells us that our observed statistic of a difference of $4.92 is something that would not happen very often by pure luck, if the null hypothesis is true.
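The “magic alternate universe” process can be imitated with a small simulation. The common mean and SD used below are illustrative assumptions, not values from the text, so the count of extreme differences will not match the 18/1000 in the figure exactly:

```python
import random

random.seed(42)

# One "redo" of the study in a null universe: dinner and lunch bills
# are drawn from the SAME distribution (mu and sigma are assumptions).
def null_difference(n_dinner=39, n_lunch=18, mu=19.0, sigma=9.0):
    dinner = [random.gauss(mu, sigma) for _ in range(n_dinner)]
    lunch = [random.gauss(mu, sigma) for _ in range(n_lunch)]
    return sum(dinner) / n_dinner - sum(lunch) / n_lunch

# Repeat the study 1000 times, and see how often the difference of
# sample means exceeds the $4.92 observed in the real data
diffs = [null_difference() for _ in range(1000)]
p_value = sum(d > 4.92 for d in diffs) / 1000
print(p_value)   # a small proportion of the 1000 simulated differences
```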

11.2.2 p-values

The analysis above, if we really could simulate the process 1000 times in the null universe, would prove Lulu’s point very nicely.

Instead of saying,

It seems unlikely that we just randomly sampled some big spenders at dinner.

She could argue,

If you are right and people spend equally at lunch and dinner, there would only be a 1.8% chance of taking a sample of 57 meals that had at least $4.92 more on average at dinner.

In other words, it’s still possible that Earl is correct - but he’s claiming we are experiencing a situation that only happens 1.8% of the time!

The measurement we just made, 18/1000 = 1.8%, is a p-value. The p-value measures the probability of getting a statistic at least as extreme as the one we observed, in a world where the null hypothesis is true.

In other words, the p-value measures how inconsistent the data is with the null hypothesis. A p-value of 0.25 tells us that the data is reasonably consistent: If the null were true, we’d see a summary statistic like the one we calculated from our data about 1 out of every 4 times we did a study. A p-value of 0.01 tells us that the data is very inconsistent: If the null were true, there would only be about a 1% chance of us seeing the summary statistic we found!

The discussion above all relied on one made-up idea: we pretended that we could transport to an alternate universe, sample 57 dining parties, calculate the difference of mean bill for lunch and dinner parties, and then repeat that process 1000 times. Obviously, this is not possible in reality.

For now, we are asking you to simply accept the simulated data given to you by “magic”. Soon, you will learn about the tricks we use to create simulated data.

Try Exercises 3.1.2 now.