3.1 Exercises
The following questions refer to the dataset from Chapter 3.1, in which a waiter collected information about bills and tips at his tables for a week. A few rows of the dataset are shown below:
total_bill | tip | percent_tip | smoker | day | time | size |
---|---|---|---|---|---|---|
17.59 | 2.64 | 0.1500853 | No | Sat | Dinner | 3 |
25.21 | 4.29 | 0.1701706 | Yes | Sat | Dinner | 2 |
31.27 | 5.00 | 0.1598977 | No | Sat | Dinner | 3 |
20.08 | 3.15 | 0.1568725 | No | Sat | Dinner | 3 |
16.00 | 2.00 | 0.1250000 | Yes | Thur | Lunch | 2 |
Some summary statistics for this dataset are given below:
smoker | total_bill_mean | total_bill_sd | total_bill_n |
---|---|---|---|
No | 19.01 | 8.34 | 31 |
Yes | 20.29 | 9.87 | 26 |
day | percent_tip_mean | percent_tip_sd | percent_tip_n |
---|---|---|---|
Fri | 0.15 | 0.03 | 6 |
Sat | 0.17 | 0.06 | 23 |
Sun | 0.20 | 0.09 | 13 |
Thur | 0.17 | 0.04 | 15 |
time | smoker | n |
---|---|---|
Dinner | No | 22 |
Dinner | Yes | 17 |
Lunch | No | 9 |
Lunch | Yes | 9 |
Correlations:
variable | total_bill | tip | percent_tip | size |
---|---|---|---|---|
total_bill | 1.0000000 | 0.5706443 | -0.4267555 | 0.6181611 |
tip | 0.5706443 | 1.0000000 | 0.4040693 | 0.4565800 |
percent_tip | -0.4267555 | 0.4040693 | 1.0000000 | -0.2095724 |
size | 0.6181611 | 0.4565800 | -0.2095724 | 1.0000000 |
Exercises 3.1.1
- Write the following null hypotheses in symbols:
The true mean tip amount is the same on Saturdays and Sundays.
The probability of a table having smokers is the same at lunch and at dinner.
The true mean percent tip amount is the same on Saturdays as on Sundays. (Careful! Although this has “percent” in the name, we are measuring a quantitative variable, not a categorical one. The variable percent_tip contains numbers; those numbers just happen to be percents.)
People tend to tip the same percentage no matter how expensive their total bill is. (Hint: What variables are involved in this question, what types are they, and how do we measure their relationship?)
- Translate the following research questions into a null hypothesis:
Do smoker tables have different spending habits than non-smokers?
Do dining parties with more people tend to tip higher percentages?
Are there more smoker or non-smoker groups?
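The note above that percent_tip is a quantitative variable can be checked directly against the sample rows. The short sketch below assumes that percent_tip is defined as tip divided by total_bill. That definition is not stated in this section, but it reproduces the five listed values. The function name `percent_tip` is chosen here for illustration.

```python
# The five sample rows shown in this section: (total_bill, tip, percent_tip).
rows = [
    (17.59, 2.64, 0.1500853),
    (25.21, 4.29, 0.1701706),
    (31.27, 5.00, 0.1598977),
    (20.08, 3.15, 0.1568725),
    (16.00, 2.00, 0.1250000),
]


def percent_tip(total_bill, tip):
    """Tip as a fraction of the total bill (assumed definition)."""
    return tip / total_bill


# Recompute the column and compare it with the listed values.
recomputed = [percent_tip(bill, tip) for bill, tip, _ in rows]
for (bill, tip, listed), value in zip(rows, recomputed):
    print(f"{bill:6.2f} {tip:5.2f} listed={listed:.7f} recomputed={value:.7f}")
```

Each recomputed value is an ordinary number that can be averaged, which is why hypotheses about percent_tip are written in terms of means rather than proportions.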
Exercises 3.1.2
- Consider the research question,
Is the mean tip percent higher on Saturday than on Sunday?
The following histogram shows the results of simulating data 1000 times from the null distribution.
Of all these simulated statistics, what appears to be the center value? Why does this make sense?
These simulated statistics show some random variability. What do you think is the (approximate) standard deviation of the simulated differences of sample means? Why?
In our real data, we observed a difference of sample means of
$$\bar{x}_{\text{Sat}} - \bar{x}_{\text{Sun}} = 0.03$$
What is the approximate p-value of our study? (You should “guesstimate” this from the plot, not count dots for an exact answer!)
- What do you conclude? (Give a short one-sentence answer to the research question; no need to explain your answer.)
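One common way to simulate from a null distribution like the one in this exercise is to shuffle the group labels and recompute the difference of sample means each time. The sketch below assumes that shuffling approach; the section does not say how its histogram was produced. The values in `sat` and `sun` are made-up placeholders, not the waiter's data, and the function names are hypothetical.

```python
import random


def simulate_null_diff_means(group_a, group_b, n_sims=1000, seed=0):
    """Shuffle the pooled values and recompute mean(a) - mean(b) each time."""
    rng = random.Random(seed)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(group_b)
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(pooled)  # break any link between value and group
        mean_a = sum(pooled[:n_a]) / n_a
        mean_b = sum(pooled[n_a:]) / n_b
        diffs.append(mean_a - mean_b)
    return diffs


def upper_tail_p_value(null_stats, observed):
    """Share of simulated statistics at least as large as the observed one."""
    return sum(1 for s in null_stats if s >= observed) / len(null_stats)


# Placeholder percent tips (made up for illustration, NOT the real dataset).
sat = [0.15, 0.17, 0.16, 0.16, 0.21, 0.12]
sun = [0.19, 0.22, 0.14, 0.25, 0.18]

null_diffs = simulate_null_diff_means(sat, sun)
center = sum(null_diffs) / len(null_diffs)
observed = sum(sat) / len(sat) - sum(sun) / len(sun)
print("center of simulated differences:", round(center, 3))
print("estimated p-value:", upper_tail_p_value(null_diffs, observed))
```

Because the labels are shuffled, neither group is favored, so the simulated differences should pile up around zero. The p-value is the share of the histogram at or beyond the observed difference, in the direction named by the research question.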
- Consider the research question,
Is a table less likely to have smokers at dinner than at lunch?
The following histogram shows the results of simulating data 1000 times from the null distribution.
Of all these simulated statistics, what appears to be the center value? Why does this make sense?
These simulated statistics show some random variability. What do you think is the (approximate) standard deviation of the simulated differences of sample proportions? Why?
In our real data, we observed a difference of proportions of
$$\hat{p}_{D} - \hat{p}_{L} = -0.064$$
What is the approximate p-value of our study?
- What do you conclude?
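For this question the raw data can be rebuilt from the time/smoker count table above (17 of 39 dinner tables and 9 of 18 lunch tables had smokers), so the simulation can be sketched with the real counts. The shuffling approach and the function name are assumptions. The section does not state how its histogram was generated, and a different random seed will give slightly different results.

```python
import random

# Counts taken from the time/smoker summary table in this section.
DINNER_SMOKER, DINNER_TOTAL = 17, 22 + 17
LUNCH_SMOKER, LUNCH_TOTAL = 9, 9 + 9

# Observed difference of sample proportions (dinner minus lunch).
observed = DINNER_SMOKER / DINNER_TOTAL - LUNCH_SMOKER / LUNCH_TOTAL


def simulate_null_diff_props(n_sims=1000, seed=0):
    """Shuffle smoker labels across all tables, then split into dinner/lunch."""
    rng = random.Random(seed)
    n_smoker = DINNER_SMOKER + LUNCH_SMOKER
    n_tables = DINNER_TOTAL + LUNCH_TOTAL
    labels = [1] * n_smoker + [0] * (n_tables - n_smoker)
    diffs = []
    for _ in range(n_sims):
        rng.shuffle(labels)
        dinner = labels[:DINNER_TOTAL]
        lunch = labels[DINNER_TOTAL:]
        diffs.append(sum(dinner) / DINNER_TOTAL - sum(lunch) / LUNCH_TOTAL)
    return diffs


null_diffs = simulate_null_diff_props()
# Lower tail, because the question asks whether dinner is LESS likely.
p_value = sum(1 for d in null_diffs if d <= observed) / len(null_diffs)
print("observed difference:", round(observed, 3))  # -0.064
print("estimated p-value:", p_value)
```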
- Consider the research question,
Do people tend to give lower percent tips when their total bills are higher?
The following histogram shows the results of simulating data 1000 times from the null distribution.
Of all these simulated statistics, what appears to be the center value? Why does this make sense?
These simulated statistics show some random variability. What do you think is the (approximate) standard deviation of the simulated sample correlations? Why?
In our real data, we observed a sample correlation between total_bill and percent_tip of -0.43.
What is the approximate p-value of our study?
- What do you conclude?
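A null distribution for a correlation can be simulated by shuffling one variable so that any real association with the other is broken. The sketch below assumes that approach and uses only the five sample rows shown at the start of this section. Five rows are far too few to answer the exercise, so treat the output as an illustration of the mechanics only. The function names are hypothetical.

```python
import math
import random

# Only the five sample rows listed in this section, not the full dataset.
total_bill = [17.59, 25.21, 31.27, 20.08, 16.00]
percent_tip = [0.1500853, 0.1701706, 0.1598977, 0.1568725, 0.1250000]


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_null_correlations(x, y, n_sims=1000, seed=0):
    """Shuffle y against x and recompute the correlation each time."""
    rng = random.Random(seed)
    shuffled = list(y)
    results = []
    for _ in range(n_sims):
        rng.shuffle(shuffled)
        results.append(pearson_r(x, shuffled))
    return results


null_rs = simulate_null_correlations(total_bill, percent_tip)
print("center of simulated correlations:", round(sum(null_rs) / len(null_rs), 2))
```

With the full dataset, the p-value would be the share of simulated correlations at or below the observed one, since the research question asks about a negative relationship.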