1  Introduction to tidy tabular data

Preparing and describing datasets

“From the moment you start tidying, you will be compelled to reset your life. As a result, your life will start to change… Tidying is just a tool, not the final destination. The true goal should be to establish the lifestyle you want most once your house has been put in order.”

― Marie Kondō, The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing

Most data you will work with is stored - perhaps after some processing! - in what we call tabular format. This simply means that we can write it down in a grid with rows and columns.

For example, a teacher wishing to keep track of her students might create a data table as follows:

Student ID

Name

Age

Classes Taken

Favorite Subject

Notes

00626

Stitch

unknown

3

Math

does not play well with others

00011

Jane Hopper

10

0

Reading

prefers to go by “El”

00006

26

5

Math

24601

Jean Valjean

50

5

Music

only speaks French

Fictitious dataset based on characters from Lilo & Stitch, Stranger Things, Battlestar Galactica, and Les Miserables.

In this class roster, each row represents a single student, each column represents some type of information collected about the students, and each section in the grid is filled out with that information for a specific student.

To use formal vocabulary, we would say that each column is a variable, i.e., the name for a piece of information we are collecting, like “Age” or “Name”. Each row is a case or observational unit; these are the things that we collect information about. The cells of the table contain the values for a particular variable and case. When a dataset is recorded in this way, we call that tidy data.

You might have noticed that some information in this table is missing. We will talk about that problem later in this chapter!

1.1 Un-tidy Data

Out in the wild world of real data, you may encounter tabular datasets that are recorded in an “un-tidy” way. For example, sometimes the table is “flipped” or “transposed”, meaning that the variables are written on the rows, and each column is a case:

ID

00626

00011

00006

24601

Name

Stitch

Jane Hopper

Jean Valjean

Age

unknown

10

26

50

Classes Taken

3

0

5

5

Favorite Subject

Math

Reading

Math

Music

Notes

does not play well with others

prefers to go by “El”

only speaks French

This might seem easier to read to some, but when it comes to doing Data Science using computational tools, a flipped table won’t be understood correctly by most software.

Sometimes, you might see multiple variables recorded together, or column labels split across two rows:

Student ID

Name, Age

Subject

Notes

X.classes.taken.

X__

Favorite

Least.Favorite

X_

00626 (3)

Stitch

Math

Writing

does not play well with others

00011 (0)

Jane Hopper, Age 10

Reading

Public Speaking

prefers to go by “El”

00006 (5)

Age 26

Math

Ethics

24601 (5)

Jean Valjean, Age 50

Music

Law

only speaks French

This example presents us with some challenges: How do we separate the Name, Age column into two separate ones? How do we fix the two Subjects columns, so that a computer reading this dataset does not think the first row represents a student? What do we do about Classes Taken being included in parentheses in the Student ID column?

When you are presented with data for analysis, your first step will likely be to tidy it up into a format that is consistent and understandable by machines.

Complete Exercise 1.1.1 now.

1.2 Identifying cases

Of course, in order to tidy up a dataset, you need to have a good understanding of what each row should represent. Sometimes, this is straightforward, as in the student roster dataset - each row of our dataset is a different individual student. We even have a column containing unique labels (also known as an index column) for each case: the Student ID column.

Why do you think the Student ID column might be a better unique label than the Name column?

Other times, though, it is not so clear what our cases are. Suppose, for example, the teacher of our strange class listed above is also recording student participation in class that week.

Student

Day

Type of Participation

Jean Valjean

Monday

Class presentation

Jane Hopper

Monday

Asked question

Stitch

Tuesday

Handed out pencils

Jane Hopper

Thursday

Asked question

Jane Hopper

Thursday

Asked question

At first glance, it may be tempting to say that the cases for this tabular dataset are once again individual students. But that doesn’t quite work - we see that “Jane Hopper” is listed twice! Therefore, the rows must represent something different. Here, our cases are moments of class participation. Each row is a unique instance of a student interacting with the class.

In this dataset, we unfortunately do not have an index column. We cannot even use two or three of the columns together to make unique labels - we can see that Jane Hopper asked two questions on Thursday, and so the last two rows of the dataset are identical to each other, even though they represent two different cases of a moment of participation.

When identifying cases, it can help to make a new index column, especially if no unique label exists already. For example, we might add to the above dataset:

Moment of Participation

Student

Day

Type of Participation

1

Jean Valjean

Monday

Class presentation

2

Jane Hopper

Monday

Asked question

3

Stitch

Tuesday

Handed out pencils

4

Jane Hopper

Thursday

Asked question

5

Jane Hopper

Thursday

Asked question

The new variable Moment of Participation is not really containing any meaningful information; it is simply helping us to keep track of our cases and what each one represents.

Complete Exercise 1.1.2 now.

1.3 Long versus wide format

The quote at the start of this chapter tells us,

“The true goal [of tidying] should be to establish the lifestyle you want most once your house has been put in order.”

The same idea applies when talking about datasets. There is not always one single “best” tidy format; it depends on the questions you are trying to answer, or the analyses you are trying to do.

For example, suppose a teacher wants to record grades on assignments in the class. She might choose to write down the data like this:

ID Number

Name

Quiz 1 Grade

Quiz 2 Grade

Quiz 3 Grade

00626

Stitch

95

96

51

00011

Jane Hopper

72

86

93

00006

100

80

100

24601

Jean Valjean

87

67

78

Certainly, this dataset could be considered tidy: Each row is a student, and each column is information about that student, including their grades on each quiz. This format makes a lot of sense if we want to answer questions about the progress of each individual student. We can see that Jane Hopper steadily improved over the three quizzes, while Stitch had a bit of trouble with the third one.

But what if we want to study overall class performance? Maybe we would like to know the average grade across all quizzes. From a computational standpoint, this data format makes our job harder. We would like to be able to use a command like

> Take the average of the Grade column.

Alas, we don’t have a Grade column. Our data is in wide form: the grades for each quiz are spread out across three different columns. We would instead have to ask a computer to

> Add the Quiz 1 Grade column to the Quiz 2 Grade column.

> Add this total to the Quiz 3 Grade column.

> Divide the total by 12. (4 students times 3 quizzes = 12 total grades)

That is much more annoying, isn’t it? And imagine if we had ten quizzes instead of three!

Instead, we’d like to have our data in long form, like this:

ID Number

Name

Quiz Number

Grade

00626

Stitch

1

95

00011

Jane Hopper

1

72

00006

1

100

24601

Jean Valjean

1

87

00626

Stitch

2

96

00011

Jane Hopper

2

86

00006

2

80

24601

Jean Valjean

2

67

00626

Stitch

3

51

00011

Jane Hopper

3

93

00006

3

100

24601

Jean Valjean

3

78

This long format certainly feels a bit less concise. We’ve traded out a table with 4 rows and 5 columns for one with 12 rows and 4 columns, which takes up more space. However, for some analyses we might want to do, the data is now tidier.

In this long form table, the rows now represent unique quizzes taken. We have no single label or index column, but we do have each row uniquely defined by which student is taking which quiz. In other words, the Student ID and Quiz Number columns can be combined to label each case.

Ultimately, the “tidiest” format for your data is the one that makes your analysis understandable and easy to compute.

Complete Exercise 1.1.3 now.

2 Describing variables

Once we have our data neat, tidy, and ready to go, it’s time to think about what information is contained in this dataset, and what we might be able to do with it. We would like to identify what types of variables are present in our dataset, and what typical values we could see for those variables.

2.1 Labels and Indices

We have already seen one important variable type: a label. This is a column that does not truly represent information being studied; instead, it is providing some kind of identifying information about the cases. In our class roster, the Student ID and Name columns are both labels, since they tell us which student the row represents.

When a label is a unique identifier for the cases, we can also call it an index. In our class roster, Student ID is probably an index, since we wouldn’t expect two students to have the same Student ID. However, Name is probably not an index, since it’s possible to have, for example, two different students both named “Jane Hopper”.

2.2 Quantitative Variables

When we gather information about our cases, sometimes this comes in quantitative also called numeric, form. For example, in our class roster, the Age variable is a number in years and the Classes Taken variable is a number of previous courses; and in our quiz scores dataset, the Grade column contains the points on the exam out of 100.

At first glance, it may seem easy to identify which variables are quantitative: it’s the ones with numbers! Not so fast. Remember the Student ID column in the class roster? That was a number, too.

To be a truly quantitative variable, the number needs to be measuring a quantity or amount. Some questions you can ask yourself are:

  • What units is this variable measured in? For the Age variable, the answer is “years”. For the Grade variable, it is points. For the Student ID variable, it is… not answerable, because this is not a quantitative variable.

  • Is one value “bigger” than another? A Grade of 85 is higher than a grade of 72. But a Student ID of 00006 is not “smaller” or “less” than an ID of 00011.

  • Can we add two values together? Jane Hopper and Jean Valjean have been alive for a total of 10 + 50 = 60 years. However, we would not say that their total Student ID is 24601 + 00011 = 24612.

2.3 Categorical Variables

The other way we might record information, besides numeric, is by noting which of many possible categories is relevant. These variables are called categorical or qualitative. Some coding languages also call them factor variables.

A good example of a categorical variable is the column Favorite Subject in the class roster. Two of our students have the same favorite subject, “Math”, and it is easy to imagine a new student joining the class who also has “Music” as their favorite. Each of our students is choosing a favorite from among the available categories.

The difference between a categorical variable and a label can seem a little fuzzy - recall that we said that Name was a label, even though we could imagine a new student also named “Jane Hopper”. One question to ask yourself is, Is there a limited number of possible values? We could not write down every single possible Name that any new student might have. But we could write down a list of all the school subjects offered at our school.

These possible values, or categories, for a categorical variable are often called the levels of the variable.

Note that levels of a categorical variable can be represented by numbers! For example, consider the long form of the quiz grades dataset. The variable Quiz Number appears as numbers, but it represents categories: Quiz 1, Quiz 2, or Quiz 3. Remember the “questions to ask yourself”; it’s clear that you would not, for example, add Quiz 1 + Quiz 2 to get Quiz 3; so this cannot be a true quantitative variable!

Would it be reasonable to think of the Quiz Number variable as a label, since it helps us identify the cases of the dataset? Sure, it might!

However, we might actually want to study that variable as information, such as asking “Were the average grades different across the three quizzes?” In this case, since it is real information for our study rather than simply an identifier, it probably makes more sense to call it a categorical variable.

2.4 Other types of variables

So far, there is one variable in our original class roster dataset that we have not yet described: Notes. This contains free text, written by the teacher as reminders about her students. We wouldn’t really say this is categorical, since there is not a limited set of levels that could be used. Certainly it isn’t quantitative! And we couldn’t call it a label or an index either, since it doesn’t necessary uniquely identify each student; student 00006 has no Notes at all!

In the vast modern world of Data Science, you will encounter all sorts of ways to store information, that go beyond what we normally see in tabular data. Perhaps this is something simple, like a Notes column containing free-form text, or column giving a date or a time of an event. Perhaps this is something complex, like an image or a sound file.

When this happens, don’t be afraid to simply describe a column as “Other”.

For the purposes of this textbook, we will focus on how to analyze variables that can reasonably be called quantitative or categorical; as you become more advanced in your data analysis skills, you will learn how to study these “other” types as well.

3 Converting variables

3.1 Combining categories

Categorical variables that have a lot of different categories can be difficult to analyze. Sometimes, we may want to combine categories, so we have fewer to deal with. In particular, it might be simplest to turn our variable into a binary variable, i.e., one that has only two possible categories.

Suppose the teacher creating the datasets in this chapter is teaching a math class. She might then be interested in whether the student loves math or not, and the other favorite subjects are irrelevant. Then, she might want to make a binary variable in her dataset called Loves Math:

ID.Number

Name

Favorite Subject

Loves Math

00626

Stitch

Math

Yes

00011

Jane Hopper

Reading

No

00006

Math

Yes

24601

Jean Valjean

Music

No

This new binary variable makes it easy for us to see which students will probably enjoy the math class - two of them love math, and two of them have a different favorite subject.

3.2 Dummy variables

In your Data Science journey, you will likely eventually encounter a situation where it is convenient, or even necessary computationally, to have all your variables be quantitative. But of course, we don’t want to simply throw out the information in our categorical variables! Instead, we will convert them to “sort of” numeric variables, called dummy variables. For our class roster, that might look something like this:

ID Number

Name

Favorite is Math

Favorite is Reading

Favorite is Music

00626

Stitch

1

0

0

00011

Jane Hopper

0

1

0

00006

1

0

0

24601

Jean Valjean

0

0

1

Notice that the levels of the original categorical variable Favorite Subject have now each become a separate variable! The values of each of these dummy variables can only be either 1 (if that subject is the student’s favorite) or 0 (if it is not).

We sometimes call this process one-hot encoding because only one of the Favorite is [subject] columns is activated, or “hot”, for each of the students.

3.3 Discrete variables

Just as it is sometimes convenient to turn a categorical variable into quantitative variables, it may be of interest to turn something measured numerically into categories.

For some quantitative variables, it is already true that only a few values are possible, much like a categorical variable. For example, our Classes Taken variable counts the number of previous classes that student has taken. It is impossible (presumably) for a student to have taken 1.5 or 3.63 classes, so this variable can only have values that are whole numbers, also known as integers. A quantitative variable that can only use integers is called discrete. Depending on the goals of your analysis, it might be reasonable to think of a discrete variable as actually being categorical. Perhaps we could regard our Classes Taken variable as having the categories/levels of “0 classes”, “1 class”, “2 classes”, etc.

Do you find yourself wondering why we might ever want to do this switch from quantitative to categorical? Chapters 1.2 and 1.4 will show you some examples!

3.4 Binning

For quantitative variables that are not discrete, which we call continuous, we can’t just pretend the numbers are labels. For example, if we look at the grades on the first exam in this class, we wouldn’t want to have 100 categories called “100 points”, “99.5 points”, etc. That would be way too many categories to study! Instead, we need to do some binning where we combine ranges of values into categories. A natural binning of the Exam 1 Grade variable might be the traditional letter grade ranges, where 90-100 is an “A”, 80-90 is a “B”, etc.

ID Number

Name

Quiz 1 Grade

Quiz 1 Letter Grade

00626

Stitch

95

A

00011

Jane Hopper

72

B

00006

100

B

24601

Jean Valjean

87

D

3.5 Ordinal variables

The new categorical variable that we derived from Quiz 1 Grade has a very unique quality: it has a natural order. Although the values of this variable are not categories, not numbers, we still have a clear sequence for the levels: A, B, C, D, F. When a categorical variable has a “correct” order, we call it ordinal.

Complete Exercise 1.1.5 now.

3.6 Dates and times

One complicated variable variety you might see is related to date or time.

For example, imagine our student Jean Valjean took a particular class many years in a row. We might want to keep track of his progress:

ID Number

Name

Year

Grade

24601

Jean Valjean

1815

66

24601

Jean Valjean

1816

73

24601

Jean Valjean

1817

84

24601

Jean Valjean

1818

71

24601

Jean Valjean

1819

89

Consider the Year variable. We might reasonably treat this variable as:

Quantitative: In some sense, time is an “amount” of something, as it “counts” the number of days/months/years since the first one. We might reasonably say that year is a quantitative variable, measuring the time since Year 0. We could also consider this to be continuous, since it’s technically possible to have a year “1815.5”, i.e. 1,815 and a half years since year 0. However, in some contexts we might treat it as discrete, in the sense that we only measure once per whole year, and so we will never record numbers like “1815.5”.

Categorical: Often, we are trying to compare data over different periods of time. Perhaps we are asking questions like “Did Jean Valjean get the best grades in 1815, 1816, or 1817?” In this situation, we are studying three categories: one for each of the three years. Of course

Other: In many applications, we are simply keeping track of the date as a way to label our data instances. Or, perhaps we want to do something very complex that studies behavior over time. If we don’t have a clear reason to treat our date or time variable as quantitative or categorical for a particular analysis, it is reasonable to just call it “Other” or “Datetime” type.

Complete Exercise 1.1.4 now.

4 Missing data

It’s now time for us to tackle a glaring problem in our class roster - something you will encounter in nearly every real world dataset you analyze: missing data.

There are three instances of incomplete information in the class roster, each exemplifying a different type of missing data.

4.1 “Not Available” data

Notice that the name of student number 00006 is listed as NA. While it’s possible that her name genuinely is spelled “NA”, it’s more likely that this is a placeholder in the dataset, that stands for Not Available or Not Applicable.

There are all kinds of reasons that information may be replaced with NA in a cell in a data table. Perhaps the teacher creating the class roster forgot to ask Student 00006 her name. Perhaps there was an error in typing the data into a computer spreadsheet, and that cell was accidentally left blank. Perhaps Student 00006 does not actually have a given name, and prefers to be known only as “Six”.

Regardless of the reason, we are left with a data entry that is explicitly marked as missing. Unless we have a way to go back and talk to the original creator of the dataset - which is rarely feasible in real data science settings - we will never know why the information was not recorded. We only know that it is an inconvenience to deal with.

While NAis the most common way to denote unavailable data in a tidy data table, you will sometimes see other placeholders. One common version of this is to use a nonsensical number in place of missing information; for example, you might see a person’s age recorded as -1 or an exam score written as -99.

This approach is not recommended, since it can be easy to overlook these numbers and accidentally use them in calculations. However, it’s important to keep an eye out in your data for values that stand out or seem illogical; these might be coding for missing information.

4.2 Unknowable information

While an NA marks missing data, it doesn’t help us understand why the data is missing. Sometimes, missing data is recorded more specifically, in a way that includes the reasons. For example, in our class roster, the age of the student Stitch is listed as “unknown”. This tells us that the missing information - Stitch’s age - was not accidentally deleted or forgotten. In fact, nobody seems to know his true age, and so it was impossible to collect that information.

In this case, we know that the data is missing “on purpose”. It’s not just a gap in our dataset; it’s actually an interesting piece of information: in the same what that we know more about Jane Hopper when we know she is 10 years old, we also learn more about Stitch when we hear that his age is a mystery.

Even though more explanation feels better than an NA, we run into a problem when we think about variable types. Would you say that Age is a quantitative or categorical variable? Of course, it’s a countable number, so it’s quantitative! But the value unknown is a word, not a number. It doesn’t fit the variable type!

This can be a major inconvenience when it comes to reading data into a computer with code. A computer might see the value unknown, and thus wrongly assume that the entire Age variable was intended to be non-numeric!

4.3 Blank cells

Lastly, we may have situations where there is simply nothing written in a data cell. We see this in the class roster, where Student “Six” has no Notes.

Is this truly missing information? Or is it simply that the teacher did not have any additional notes to recorded about Six? There’s simply no way to tell from the dataset alone.

Typically, blank information will be interpreted by computers and data analysts as missing, and will often be converted to NA. It’s always important, though, to keep context in mind. A blank cell in the Name column would probably be missing data - but a blank cell in the Notes column may very well be a legitimate observation of “No notes needed”.

5 Summary: communicating with data collaborators

Phew! That’s a lot of variable descriptions and a lot of vocabulary. Let’s bring it all together now.

Suppose you are the teacher, and you have created some tidy datasets for your course. You would like to describe these to a collaborator. Your documentation might say something like this:

5.1 Class roster

Student ID

Name

Age

Classes Taken

Favorite Subject

Notes

00626

Stitch

unknown

3

Math

does not play well with others

00011

Jane Hopper

10

0

Reading

prefers to go by “El”

00006

26

5

Math

24601

Jean Valjean

50

5

Music

only speaks French

Fictitious dataset based on characters from Lilo & Stitch, Stranger Things, Battlestar Galactica, and Les Miserables.

My first dataset is a class roster. There are four rows (cases) and five columns (variables). Each row represents a student in the class. Our index column is called Student ID, and it contains the unique Student ID of the student. We also have a label for the student’s name. There is one observation missing in this column, indicated by an “NA” value.

The variables in the dataset are:

  • Age: The age of the student, in years. This continuous, quantitative variable ranges from 10 to 50 in our data. There is one missing observation, indicated by an “unknown” value.

  • Classes Taken: The number of previous courses the student has taken at our school. This discrete, quantitative variable ranges from 0 to 5 in our data.

  • Favorite Subject: The student’s favorite subject. This categorical variable has levels “Math”, “Music” and “Reading”.

  • Is Child: A binary variable derived from binning the Age variable, indicating whether the student is below 18 or not.

  • Notes: This column contains free text with notes about each student, where needed. Some notes are blank.

5.2 Quiz grades dataset

ID Number

Name

Quiz Number

Grade

00626

Stitch

1

95

00011

Jane Hopper

1

72

00006

1

100

24601

Jean Valjean

1

87

00626

Stitch

2

96

00011

Jane Hopper

2

86

00006

2

80

24601

Jean Valjean

2

67

00626

Stitch

3

51

00011

Jane Hopper

3

93

00006

3

100

24601

Jean Valjean

3

78

My second dataset shows quiz scores for each student. Each row represents an instance of a student taking a particular quiz. This data is in long form. The Student ID and Name columns are the same as those in the class roster dataset. The cases can be uniquely identified by the Student ID column and the Quiz Number column.

The two variables are:

  • Quiz Number: An ordinal categorical variable with possible levels “1”, “2”, and “3”, indicating which quiz was taken.
  • Grade: A continuous quantitative variable indicating the student’s score on the quiz.

5.3 Participation Dataset

Index

Student

Day

Asked Question

Presentation

Handout

1

Jean Valjean

Monday

0

1

0

2

Jane Hopper

Monday

1

0

1

3

Stitch

Tuesday

0

0

0

4

Jane Hopper

Thursday

1

0

0

5

Jane Hopper

Thursday

1

0

0

My third dataset shows class participation of the students. Each row represents an instance of a student participating in class. This data is in wide form. The Student ID and Name columns are the same as those in the class roster dataset. The Index column is an arbitrary index label.

The type of participation observed has been one-hot encoded, so the variables in this dataset are:

  • Day: The day of the week that the participation moment happened.
  • Asked Question: A dummy variable telling whether the participation type was question asking.
  • Presentation: A dummy variable telling whether the participation type was a class presentation.
  • Handout: A dummy variable telling whether the participation type was passing out supplies.