1.1 Exercises – Statistical Concepts for Data Science

Exercise 1.1.1

The following dataset consists of sales at a certain company in Indiana, with locations in five different states, over the course of a few months.

Name two ways that this data is not tidy, and how you would fix that problem.

Exercise 1.1.2

The following data contains information about flights leaving NYC airports (JFK, LGA or EWR) in 2013.

(Source: nycflights13 R package)

First, consider the first few rows of the flights dataset:

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	2	22	1319	1302	17	1422	1404	18	EV	4129	N13958	EWR	DCA	45	199	13	2	2013-02-22 13:00:00
2013	3	17	1825	1830	-5	2015	2015	0	MQ	4674	N503MQ	LGA	CLE	83	419	18	30	2013-03-17 18:00:00
2013	5	27	2125	2124	1	2353	21	-28	UA	1692	N27239	EWR	MCO	124	937	21	24	2013-05-27 21:00:00
2013	8	5	1713	1712	1	1854	1844	10	B6	408	N190JB	JFK	PWM	49	273	17	12	2013-08-05 17:00:00
2013	9	17	1713	1713	0	1852	1856	-4	EV	4202	N22909	EWR	STL	132	872	17	13	2013-09-17 17:00:00
2013	9	21	1441	1450	-9	1540	1602	-22	EV	4372	N13118	EWR	DCA	38	199	14	50	2013-09-21 14:00:00
2013	10	22	1232	1232	0	1538	1530	8	B6	1401	N583JB	JFK	FLL	160	1069	12	32	2013-10-22 12:00:00
2013	11	11	1503	1505	-2	1843	1830	13	AA	145	N5FRAA	JFK	MIA	154	1089	15	5	2013-11-11 15:00:00
2013	11	29	1410	1415	-5	1616	1636	-20	DL	673	N939DL	EWR	ATL	107	746	14	15	2013-11-29 14:00:00
2013	12	22	2018	1901	77	2342	2251	51	UA	1489	N24202	EWR	SFO	351	2565	19	1	2013-12-22 19:00:00

Important: These are just a few rows, the full dataset contains thousands of rows of data!

Click here for descriptions of the variables in the dataset. Note that:

The tail number of a plane is a unique number identifying that specific physical airplane, much like a person’s name.
The flight number of a flight is a label that an airline gives to a particular route. For example, “United Flight 52” goes from Washington, DC to Zurich, Switzerland; while “American Airlines Flight 52” goes from Indianapolis to Chicago.

What are the cases for this dataset?
1. The routes served by airlines from NYC airports
2. The departure and arrival times of the flights
3. The dates on which at least one flight left an NYC airport
4. The flights that left NYC airports in 2013
Choose one or more columns below that, together, would uniquely identify each case.
1. year
2. month
3. day
4. hour
5. minute
6. dep_time
7. arr_time
8. carrier
9. flight
10. tailnum
11. origin
12. dest

Next, consider the planes dataset. This contains information about the airplanes that were used in flights from New York in 2013. The first few rows are shown below:

tailnum	year	type	manufacturer	model	engines	seats	speed	engine
N344AA	1992	Fixed wing	GULFSTREAM AEROSPACE	G-IV	2	22	NA	Turbo-fan
N351NW	1997	Fixed wing	AIRBUS INDUSTRIE	A320-212	2	182	NA	Turbo-fan
N12563	2002	Fixed wing	EMBRAER	EMB-145LR	2	55	NA	Turbo-fan
N568UW	2013	Fixed wing	AIRBUS	A321-231	2	379	NA	Turbo-fan
N834MH	2000	Fixed wing	BOEING	767-432ER	2	300	NA	Turbo-jet
N8977A	2004	Fixed wing	BOMBARDIER INC	CL-600-2B19	2	55	NA	Turbo-fan

What are the cases for this dataset?
1. The years that planes were built
2. The planes that flew from NYC
3. The companies that make planes
4. The types of planes that fly from NYC
Which of the following column(s) are needed to uniquely identify each case? (check all that apply, but do not check more than the minimum needed to identify the case)
1. tailnum
2. year
3. type
4. manufacturer
5. model

Last, consider a dataset about the weather at the airports that flights from NYC leave from. The first few rows are shown below:

origin	year	month	day	hour	temp	dewp	humid	wind_dir	wind_speed	wind_gust	pressure	visib	time_hour
EWR	2013	1	1	1	39.02	26.06	59.37	270	10.35702	NA	1012.0	10	2013-01-01 01:00:00
EWR	2013	1	1	2	39.02	26.96	61.63	250	8.05546	NA	1012.3	10	2013-01-01 02:00:00
EWR	2013	1	1	3	39.02	28.04	64.43	240	11.50780	NA	1012.5	10	2013-01-01 03:00:00
JFK	2013	1	1	1	39.02	26.06	59.37	260	12.65858	NA	1012.6	10	2013-01-01 01:00:00
JFK	2013	1	1	2	39.02	26.06	59.37	270	11.50780	NA	1012.4	10	2013-01-01 02:00:00
JFK	2013	1	1	3	39.92	26.96	59.50	260	14.96014	NA	1012.7	10	2013-01-01 03:00:00
LGA	2013	1	1	1	39.92	26.06	57.33	260	13.80936	23.01560	1011.9	10	2013-01-01 01:00:00
LGA	2013	1	1	2	41.00	26.06	54.97	260	17.26170	25.31716	1011.5	10	2013-01-01 02:00:00
LGA	2013	1	1	3	41.00	26.06	54.97	260	16.11092	24.16638	1012.0	10	2013-01-01 03:00:00

What are the cases for this dataset?
1. Days on which the weather was measured
2. Days on which the weather was measured at a particular airport
3. Hours when the weather was measured
4. Hours when the weather was measured at a particular airport
Which of the following column(s) are needed to uniquely identify each case? (check all that apply, but do not check more than the minimum needed to identify the case)
1. origin
2. year
3. month
4. day
5. hour
6. temp

Exercise 1.1.3

India is the world’s third-largest producer and third-largest consumer of electricity. The national electric grid in India has an installed capacity of 370.106 GW as of 31 March 2020.

The data in these exercises was collected to study the energy consumption in four of of India’s largest states - Punjab, Delhi, Gujarat, and Goa - over the time period of Jan 2, 2019 to May 23, 2020.

(Data source: Kaggle)

Date	Punjab	Delhi	Gujarat	Goa
02/01/2019 00:00:00	119.9	85.8	319.5	12.8
03/01/2019 00:00:00	121.9	85.5	316.7	13.7
04/01/2019 00:00:00	118.8	83.5	301.9	12.6
05/01/2019 00:00:00	121.0	79.2	313.2	13.0
06/01/2019 00:00:00	121.4	76.6	320.7	12.9
07/01/2019 00:00:00	118.0	71.1	319.4	12.7

This dataset is currently in… (circle all that apply)
1. Long form
2. Wide form
3. Un-tidy format
4. Tidy format
What are the cases in this dataset?
1. States that were studied
2. Energy consumption of states
3. Days on which energy consumption was measured
4. Total energy consumed on each date
If we were to convert this dataset from it’s current (long/wide) form to the other form, what would possibly be the new column names?
1. “State”, “Day”, “Month”, “Year”
2. “Date”, “State”, “Consumption”
3. “Date”, “Punjab Consumption”, “Delhi Consumption”, “Gujarat Consumption”, “Goa Consumption”
4. “State”, “Consumption on 02/01/2019”, “Consumption on 02/02/2019”, …, “Consumption on 03/23/2020”
What would be the cases in this new dataset?
1. States that were studied
2. Energy consumption of states on each date
3. Days on which energy consumption was measured for a particular state
4. Total energy consumed over time by each state

Exercise 1.1.4

Let’s revisit the flights dataset from Exercise 1.1.2

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	1	1	517	515	2	830	819	11	UA	1545	N14228	EWR	IAH	227	1400	5	15	2013-01-01 05:00:00
2013	1	1	533	529	4	850	830	20	UA	1714	N24211	LGA	IAH	227	1416	5	29	2013-01-01 05:00:00
2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40	2013-01-01 05:00:00
2013	1	1	544	545	-1	1004	1022	-18	B6	725	N804JB	JFK	BQN	183	1576	5	45	2013-01-01 05:00:00
2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0	2013-01-01 06:00:00
2013	1	1	554	558	-4	740	728	12	UA	1696	N39463	EWR	ORD	150	719	5	58	2013-01-01 05:00:00

Name all variables from this dataset that might be considered quantitative.
Name all variables from this dataset that might be considered categorical.
Name all variables from this dataset that might be considered other.
What are the levels of the variable origin?

Exercise 1.1.5

Let’s revisit the planes dataset from Exercise 1.1.2

tailnum	year	type	manufacturer	model	engines	seats	speed	engine
N10156	2004	Fixed wing	EMBRAER	EMB-145XR	2	55	NA	Turbo-fan
N102UW	1998	Fixed wing	AIRBUS INDUSTRIE	A320-214	2	182	NA	Turbo-fan
N103US	1999	Fixed wing	AIRBUS INDUSTRIE	A320-214	2	182	NA	Turbo-fan
N104UW	1999	Fixed wing	AIRBUS INDUSTRIE	A320-214	2	182	NA	Turbo-fan
N10575	2002	Fixed wing	EMBRAER	EMB-145LR	2	55	NA	Turbo-fan
N105UW	1999	Fixed wing	AIRBUS INDUSTRIE	A320-214	2	182	NA	Turbo-fan

Below, we can see the possible values for some of the variables in this dataset:

# A tibble: 6 × 1
  engine       
  <chr>        
1 Turbo-fan    
2 Turbo-jet    
3 Reciprocating
4 4 Cycle      
5 Turbo-shaft  
6 Turbo-prop

# A tibble: 2 × 1
  type      
  <chr>     
1 Fixed wing
2 Rotorcraft

# A tibble: 4 × 1
  engines
    <int>
1       2
2       1
3       4
4       3

Name all variables in this dataset that could be considered discrete.
Name all variables in this dataset that could be considered ordinal.
Name all variables in this dataset that could be considered binary.

Now, let’s consider altering the flights dataset from Exercise 1.1.2

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	2	22	1319	1302	17	1422	1404	18	EV	4129	N13958	EWR	DCA	45	199	13	2	2013-02-22 13:00:00
2013	3	17	1825	1830	-5	2015	2015	0	MQ	4674	N503MQ	LGA	CLE	83	419	18	30	2013-03-17 18:00:00
2013	5	27	2125	2124	1	2353	21	-28	UA	1692	N27239	EWR	MCO	124	937	21	24	2013-05-27 21:00:00
2013	8	5	1713	1712	1	1854	1844	10	B6	408	N190JB	JFK	PWM	49	273	17	12	2013-08-05 17:00:00
2013	9	17	1713	1713	0	1852	1856	-4	EV	4202	N22909	EWR	STL	132	872	17	13	2013-09-17 17:00:00
2013	9	21	1441	1450	-9	1540	1602	-22	EV	4372	N13118	EWR	DCA	38	199	14	50	2013-09-21 14:00:00
2013	10	22	1232	1232	0	1538	1530	8	B6	1401	N583JB	JFK	FLL	160	1069	12	32	2013-10-22 12:00:00
2013	11	11	1503	1505	-2	1843	1830	13	AA	145	N5FRAA	JFK	MIA	154	1089	15	5	2013-11-11 15:00:00
2013	11	29	1410	1415	-5	1616	1636	-20	DL	673	N939DL	EWR	ATL	107	746	14	15	2013-11-29 14:00:00
2013	12	22	2018	1901	77	2342	2251	51	UA	1489	N24202	EWR	SFO	351	2565	19	1	2013-12-22 19:00:00

Suppose I wanted to make a new variable called late, that shows whether the flight arrived late or not. What quantitative variable would we need to bin to create this new variable? What would those bins be?
Suppose I wanted to convert the origin column to dummy variables. What would be the new column names in the dataset after this process, and what possible values could they have?
Suppose I wanted to make a new variable called flight_type, with possible values “International” or “Domestic”. What categorical variable would I use to combine categories and make this new variable? Explain.