1.1 Exercises

Exercise 1.1.1

The following dataset consists of sales at a certain company in Indiana, with locations in five different states, over the course of a few months.

Indiana Sales Data

Name two ways that this data is not tidy, and how you would fix that problem.

Exercise 1.1.2

The following data contains information about flights leaving NYC airports (JFK, LGA or EWR) in 2013.

(Source: nycflights13 R package)

First, consider the first few rows of the flights dataset:

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 2 22 1319 1302 17 1422 1404 18 EV 4129 N13958 EWR DCA 45 199 13 2 2013-02-22 13:00:00
2013 3 17 1825 1830 -5 2015 2015 0 MQ 4674 N503MQ LGA CLE 83 419 18 30 2013-03-17 18:00:00
2013 5 27 2125 2124 1 2353 21 -28 UA 1692 N27239 EWR MCO 124 937 21 24 2013-05-27 21:00:00
2013 8 5 1713 1712 1 1854 1844 10 B6 408 N190JB JFK PWM 49 273 17 12 2013-08-05 17:00:00
2013 9 17 1713 1713 0 1852 1856 -4 EV 4202 N22909 EWR STL 132 872 17 13 2013-09-17 17:00:00
2013 9 21 1441 1450 -9 1540 1602 -22 EV 4372 N13118 EWR DCA 38 199 14 50 2013-09-21 14:00:00
2013 10 22 1232 1232 0 1538 1530 8 B6 1401 N583JB JFK FLL 160 1069 12 32 2013-10-22 12:00:00
2013 11 11 1503 1505 -2 1843 1830 13 AA 145 N5FRAA JFK MIA 154 1089 15 5 2013-11-11 15:00:00
2013 11 29 1410 1415 -5 1616 1636 -20 DL 673 N939DL EWR ATL 107 746 14 15 2013-11-29 14:00:00
2013 12 22 2018 1901 77 2342 2251 51 UA 1489 N24202 EWR SFO 351 2565 19 1 2013-12-22 19:00:00

Important: These are just a few rows, the full dataset contains thousands of rows of data!

Click here for descriptions of the variables in the dataset. Note that:

  • The tail number of a plane is a unique number identifying that specific physical airplane, much like a person’s name.

  • The flight number of a flight is a label that an airline gives to a particular route. For example, “United Flight 52” goes from Washington, DC to Zurich, Switzerland; while “American Airlines Flight 52” goes from Indianapolis to Chicago.

  1. What are the cases for this dataset?

    1. The routes served by airlines from NYC airports

    2. The departure and arrival times of the flights

    3. The dates on which at least one flight left an NYC airport

    4. The flights that left NYC airports in 2013

  2. Choose one or more columns below that, together, would uniquely identify each case.

    1. year

    2. month

    3. day

    4. hour

    5. minute

    6. dep_time

    7. arr_time

    8. carrier

    9. flight

    10. tailnum

    11. origin

    12. dest

Next, consider the planes dataset. This contains information about the airplanes that were used in flights from New York in 2013. The first few rows are shown below:

tailnum year type manufacturer model engines seats speed engine
N344AA 1992 Fixed wing GULFSTREAM AEROSPACE G-IV 2 22 NA Turbo-fan
N351NW 1997 Fixed wing AIRBUS INDUSTRIE A320-212 2 182 NA Turbo-fan
N12563 2002 Fixed wing EMBRAER EMB-145LR 2 55 NA Turbo-fan
N568UW 2013 Fixed wing AIRBUS A321-231 2 379 NA Turbo-fan
N834MH 2000 Fixed wing BOEING 767-432ER 2 300 NA Turbo-jet
N8977A 2004 Fixed wing BOMBARDIER INC CL-600-2B19 2 55 NA Turbo-fan
  1. What are the cases for this dataset?

    1. The years that planes were built

    2. The planes that flew from NYC

    3. The companies that make planes

    4. The types of planes that fly from NYC

  2. Which of the following column(s) are needed to uniquely identify each case? (check all that apply, but do not check more than the minimum needed to identify the case)

    1. tailnum

    2. year

    3. type

    4. manufacturer

    5. model

Last, consider a dataset about the weather at the airports that flights from NYC leave from. The first few rows are shown below:

origin year month day hour temp dewp humid wind_dir wind_speed wind_gust precip pressure visib time_hour
EWR 2013 1 1 1 39.02 26.06 59.37 270 10.35702 NA 0 1012.0 10 2013-01-01 01:00:00
EWR 2013 1 1 2 39.02 26.96 61.63 250 8.05546 NA 0 1012.3 10 2013-01-01 02:00:00
EWR 2013 1 1 3 39.02 28.04 64.43 240 11.50780 NA 0 1012.5 10 2013-01-01 03:00:00
JFK 2013 1 1 1 39.02 26.06 59.37 260 12.65858 NA 0 1012.6 10 2013-01-01 01:00:00
JFK 2013 1 1 2 39.02 26.06 59.37 270 11.50780 NA 0 1012.4 10 2013-01-01 02:00:00
JFK 2013 1 1 3 39.92 26.96 59.50 260 14.96014 NA 0 1012.7 10 2013-01-01 03:00:00
LGA 2013 1 1 1 39.92 26.06 57.33 260 13.80936 23.01560 0 1011.9 10 2013-01-01 01:00:00
LGA 2013 1 1 2 41.00 26.06 54.97 260 17.26170 25.31716 0 1011.5 10 2013-01-01 02:00:00
LGA 2013 1 1 3 41.00 26.06 54.97 260 16.11092 24.16638 0 1012.0 10 2013-01-01 03:00:00
  1. What are the cases for this dataset?
    1. Days on which the weather was measured

    2. Days on which the weather was measured at a particular airport

    3. Hours when the weather was measured

    4. Hours when the weather was measured at a particular airport

  2. Which of the following column(s) are needed to uniquely identify each case? (check all that apply, but do not check more than the minimum needed to identify the case)
    1. origin

    2. year

    3. month

    4. day

    5. hour

    6. temp

Exercise 1.1.3

India is the world’s third-largest producer and third-largest consumer of electricity. The national electric grid in India has an installed capacity of 370.106 GW as of 31 March 2020.

The data in these exercises was collected to study the energy consumption in four of of India’s largest states - Punjab, Delhi, Gujarat, and Goa - over the time period of Jan 2, 2019 to May 23, 2020.

(Data source: Kaggle)

Date Punjab Delhi Gujarat Goa
02/01/2019 00:00:00 119.9 85.8 319.5 12.8
03/01/2019 00:00:00 121.9 85.5 316.7 13.7
04/01/2019 00:00:00 118.8 83.5 301.9 12.6
05/01/2019 00:00:00 121.0 79.2 313.2 13.0
06/01/2019 00:00:00 121.4 76.6 320.7 12.9
07/01/2019 00:00:00 118.0 71.1 319.4 12.7
  1. This dataset is currently in… (circle all that apply)

    1. Long form

    2. Wide form

    3. Un-tidy format

    4. Tidy format

  2. What are the cases in this dataset?

    1. States that were studied

    2. Energy consumption of states

    3. Days on which energy consumption was measured

    4. Total energy consumed on each date

  3. If we were to convert this dataset from it’s current (long/wide) form to the other form, what would possibly be the new column names?

    1. “State”, “Day”, “Month”, “Year”

    2. “Date”, “State”, “Consumption”

    3. “Date”, “Punjab Consumption”, “Delhi Consumption”, “Gujarat Consumption”, “Goa Consumption”

    4. “State”, “Consumption on 02/01/2019”, “Consumption on 02/02/2019”, …, “Consumption on 03/23/2020”

  4. What would be the cases in this new dataset?

    1. States that were studied

    2. Energy consumption of states on each date

    3. Days on which energy consumption was measured for a particular state

    4. Total energy consumed over time by each state

Exercise 1.1.4

Let’s revisit the flights dataset from Exercise 1.1.2

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00
  1. Name all variables from this dataset that might be considered quantitative.
  2. Name all variables from this dataset that might be considered categorical.
  3. Name all variables from this dataset that might be considered other.
  4. What are the levels of the variable origin?

Exercise 1.1.5

Let’s revisit the planes dataset from Exercise 1.1.2

tailnum year type manufacturer model engines seats speed engine
N10156 2004 Fixed wing EMBRAER EMB-145XR 2 55 NA Turbo-fan
N102UW 1998 Fixed wing AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N103US 1999 Fixed wing AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N104UW 1999 Fixed wing AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan
N10575 2002 Fixed wing EMBRAER EMB-145LR 2 55 NA Turbo-fan
N105UW 1999 Fixed wing AIRBUS INDUSTRIE A320-214 2 182 NA Turbo-fan

Below, we can see the possible values for some of the variables in this dataset:

# A tibble: 6 × 1
  engine       
  <chr>        
1 Turbo-fan    
2 Turbo-jet    
3 Reciprocating
4 4 Cycle      
5 Turbo-shaft  
6 Turbo-prop   
# A tibble: 2 × 1
  type      
  <chr>     
1 Fixed wing
2 Rotorcraft
# A tibble: 4 × 1
  engines
    <int>
1       2
2       1
3       4
4       3
  1. Name all variables in this dataset that could be considered discrete.
  2. Name all variables in this dataset that could be considered ordinal.
  3. Name all variables in this dataset that could be considered binary.

Now, let’s consider altering the flights dataset from Exercise 1.1.2

year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 2 22 1319 1302 17 1422 1404 18 EV 4129 N13958 EWR DCA 45 199 13 2 2013-02-22 13:00:00
2013 3 17 1825 1830 -5 2015 2015 0 MQ 4674 N503MQ LGA CLE 83 419 18 30 2013-03-17 18:00:00
2013 5 27 2125 2124 1 2353 21 -28 UA 1692 N27239 EWR MCO 124 937 21 24 2013-05-27 21:00:00
2013 8 5 1713 1712 1 1854 1844 10 B6 408 N190JB JFK PWM 49 273 17 12 2013-08-05 17:00:00
2013 9 17 1713 1713 0 1852 1856 -4 EV 4202 N22909 EWR STL 132 872 17 13 2013-09-17 17:00:00
2013 9 21 1441 1450 -9 1540 1602 -22 EV 4372 N13118 EWR DCA 38 199 14 50 2013-09-21 14:00:00
2013 10 22 1232 1232 0 1538 1530 8 B6 1401 N583JB JFK FLL 160 1069 12 32 2013-10-22 12:00:00
2013 11 11 1503 1505 -2 1843 1830 13 AA 145 N5FRAA JFK MIA 154 1089 15 5 2013-11-11 15:00:00
2013 11 29 1410 1415 -5 1616 1636 -20 DL 673 N939DL EWR ATL 107 746 14 15 2013-11-29 14:00:00
2013 12 22 2018 1901 77 2342 2251 51 UA 1489 N24202 EWR SFO 351 2565 19 1 2013-12-22 19:00:00
  1. Suppose I wanted to make a new variable called late, that shows whether the flight arrived late or not. What quantitative variable would we need to bin to create this new variable? What would those bins be?
  2. Suppose I wanted to convert the origin column to dummy variables. What would be the new column names in the dataset after this process, and what possible values could they have?
  3. Suppose I wanted to make a new variable called flight_type, with possible values “International” or “Domestic”. What categorical variable would I use to combine categories and make this new variable? Explain.