1 Introduction to tidy tabular data

Preparing and describing datasets

“From the moment you start tidying, you will be compelled to reset your life. As a result, your life will start to change… Tidying is just a tool, not the final destination. The true goal should be to establish the lifestyle you want most once your house has been put in order.”

― Marie Kondō, The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing

Most data you will work with is stored - perhaps after some processing! - in what we call tabular format. This simply means that we can write it down in a grid with rows and columns.

For example, a teacher wishing to keep track of her students might create a data table as follows:

Student ID	Name	Age	Classes Taken	Favorite Subject	Notes
00626	Stitch	unknown	3	Math	does not play well with others
00011	Jane Hopper	10	0	Reading	prefers to go by “El”
00006		26	5	Math
24601	Jean Valjean	50	5	Music	only speaks French
Fictitious dataset based on characters from Lilo & Stitch, Stranger Things, Battlestar Galactica, and Les Miserables.

In this class roster, each row represents a single student, each column represents some type of information collected about the students, and each section in the grid is filled out with that information for a specific student.

To use formal vocabulary, we would say that each column is a variable, i.e., the name for a piece of information we are collecting, like “Age” or “Name”. Each row is a case or observational unit; these are the things that we collect information about. The cells of the table contain the values for a particular variable and case. When a dataset is recorded in this way, we call that tidy data.

You might have noticed that some information in this table is missing. We will talk about that problem later in this chapter!

1.1 Un-tidy Data

Out in the wild world of real data, you may encounter tabular datasets that are recorded in an “un-tidy” way. For example, sometimes the table is “flipped” or “transposed”, meaning that the variables are written on the rows, and each column is a case:

ID	00626	00011	00006	24601
Name	Stitch	Jane Hopper		Jean Valjean
Age	unknown	10	26	50
Classes Taken	3	0	5	5
Favorite Subject	Math	Reading	Math	Music
Notes	does not play well with others	prefers to go by “El”		only speaks French

This might seem easier to read to some, but when it comes to doing Data Science using computational tools, a flipped table won’t be understood correctly by most software.

Sometimes, you might see multiple variables recorded together, or column labels split across two rows:

Student ID	Name, Age	Subject		Notes
X.classes.taken.	X__	Favorite	Least.Favorite	X_
00626 (3)	Stitch	Math	Writing	does not play well with others
00011 (0)	Jane Hopper, Age 10	Reading	Public Speaking	prefers to go by “El”
00006 (5)	Age 26	Math	Ethics
24601 (5)	Jean Valjean, Age 50	Music	Law	only speaks French

This example presents us with some challenges: How do we separate the Name, Age column into two separate ones? How do we fix the two Subjects columns, so that a computer reading this dataset does not think the first row represents a student? What do we do about Classes Taken being included in parentheses in the Student ID column?

When you are presented with data for analysis, your first step will likely be to tidy it up into a format that is consistent and understandable by machines.

Complete Exercise 1.1.1 now.

1.2 Identifying cases

Of course, in order to tidy up a dataset, you need to have a good understanding of what each row should represent. Sometimes, this is straightforward, as in the student roster dataset - each row of our dataset is a different individual student. We even have a column containing unique labels (also known as an index column) for each case: the Student ID column.

Why do you think the Student ID column might be a better unique label than the Name column?

Other times, though, it is not so clear what our cases are. Suppose, for example, the teacher of our strange class listed above is also recording student participation in class that week.

Student	Day	Type of Participation
Jean Valjean	Monday	Class presentation
Jane Hopper	Monday	Asked question
Stitch	Tuesday	Handed out pencils
Jane Hopper	Thursday	Asked question
Jane Hopper	Thursday	Asked question

At first glance, it may be tempting to say that the cases for this tabular dataset are once again individual students. But that doesn’t quite work - we see that “Jane Hopper” is listed twice! Therefore, the rows must represent something different. Here, our cases are moments of class participation. Each row is a unique instance of a student interacting with the class.

In this dataset, we unfortunately do not have an index column. We cannot even use two or three of the columns together to make unique labels - we can see that Jane Hopper asked two questions on Thursday, and so the last two rows of the dataset are identical to each other, even though they represent two different cases of a moment of participation.

When identifying cases, it can help to make a new index column, especially if no unique label exists already. For example, we might add to the above dataset:

Moment of Participation	Student	Day	Type of Participation
1	Jean Valjean	Monday	Class presentation
2	Jane Hopper	Monday	Asked question
3	Stitch	Tuesday	Handed out pencils
4	Jane Hopper	Thursday	Asked question
5	Jane Hopper	Thursday	Asked question

The new variable Moment of Participation is not really containing any meaningful information; it is simply helping us to keep track of our cases and what each one represents.

Complete Exercise 1.1.2 now.

1.3 Long versus wide format

The quote at the start of this chapter tells us,

“The true goal [of tidying] should be to establish the lifestyle you want most once your house has been put in order.”

The same idea applies when talking about datasets. There is not always one single “best” tidy format; it depends on the questions you are trying to answer, or the analyses you are trying to do.

For example, suppose a teacher wants to record grades on assignments in the class. She might choose to write down the data like this:

ID Number	Name	Quiz 1 Grade	Quiz 2 Grade	Quiz 3 Grade
00626	Stitch	95	96	51
00011	Jane Hopper	72	86	93
00006		100	80	100
24601	Jean Valjean	87	67	78

Certainly, this dataset could be considered tidy: Each row is a student, and each column is information about that student, including their grades on each quiz. This format makes a lot of sense if we want to answer questions about the progress of each individual student. We can see that Jane Hopper steadily improved over the three quizzes, while Stitch had a bit of trouble with the third one.

But what if we want to study overall class performance? Maybe we would like to know the average grade across all quizzes. From a computational standpoint, this data format makes our job harder. We would like to be able to use a command like

> Take the average of the Grade column.

Alas, we don’t have a Grade column. Our data is in wide form: the grades for each quiz are spread out across three different columns. We would instead have to ask a computer to

> Add the Quiz 1 Grade column to the Quiz 2 Grade column.

> Add this total to the Quiz 3 Grade column.

> Divide the total by 12. (4 students times 3 quizzes = 12 total grades)

That is much more annoying, isn’t it? And imagine if we had ten quizzes instead of three!

Instead, we’d like to have our data in long form, like this:

ID Number	Name	Quiz Number	Grade
00626	Stitch	1	95
00011	Jane Hopper	1	72
00006		1	100
24601	Jean Valjean	1	87
00626	Stitch	2	96
00011	Jane Hopper	2	86
00006		2	80
24601	Jean Valjean	2	67
00626	Stitch	3	51
00011	Jane Hopper	3	93
00006		3	100
24601	Jean Valjean	3	78

This long format certainly feels a bit less concise. We’ve traded out a table with 4 rows and 5 columns for one with 12 rows and 4 columns, which takes up more space. However, for some analyses we might want to do, the data is now tidier.

In this long form table, the rows now represent unique quizzes taken. We have no single label or index column, but we do have each row uniquely defined by which student is taking which quiz. In other words, the Student ID and Quiz Number columns can be combined to label each case.

Ultimately, the “tidiest” format for your data is the one that makes your analysis understandable and easy to compute.

Complete Exercise 1.1.3 now.

2 Describing variables

Once we have our data neat, tidy, and ready to go, it’s time to think about what information is contained in this dataset, and what we might be able to do with it. We would like to identify what types of variables are present in our dataset, and what typical values we could see for those variables.

2.1 Labels and Indices

We have already seen one important variable type: a label. This is a column that does not truly represent information being studied; instead, it is providing some kind of identifying information about the cases. In our class roster, the Student ID and Name columns are both labels, since they tell us which student the row represents.

When a label is a unique identifier for the cases, we can also call it an index. In our class roster, Student ID is probably an index, since we wouldn’t expect two students to have the same Student ID. However, Name is probably not an index, since it’s possible to have, for example, two different students both named “Jane Hopper”.

2.2 Quantitative Variables

When we gather information about our cases, sometimes this comes in quantitative also called numeric, form. For example, in our class roster, the Age variable is a number in years and the Classes Taken variable is a number of previous courses; and in our quiz scores dataset, the Grade column contains the points on the exam out of 100.

At first glance, it may seem easy to identify which variables are quantitative: it’s the ones with numbers! Not so fast. Remember the Student ID column in the class roster? That was a number, too.

To be a truly quantitative variable, the number needs to be measuring a quantity or amount. Some questions you can ask yourself are:

What units is this variable measured in? For the Age variable, the answer is “years”. For the Grade variable, it is points. For the Student ID variable, it is… not answerable, because this is not a quantitative variable.
Is one value “bigger” than another? A Grade of 85 is higher than a grade of 72. But a Student ID of 00006 is not “smaller” or “less” than an ID of 00011.
Can we add two values together? Jane Hopper and Jean Valjean have been alive for a total of 10 + 50 = 60 years. However, we would not say that their total Student ID is 24601 + 00011 = 24612.

2.3 Categorical Variables

The other way we might record information, besides numeric, is by noting which of many possible categories is relevant. These variables are called categorical or qualitative. Some coding languages also call them factor variables.

A good example of a categorical variable is the column Favorite Subject in the class roster. Two of our students have the same favorite subject, “Math”, and it is easy to imagine a new student joining the class who also has “Music” as their favorite. Each of our students is choosing a favorite from among the available categories.

The difference between a categorical variable and a label can seem a little fuzzy - recall that we said that Name was a label, even though we could imagine a new student also named “Jane Hopper”. One question to ask yourself is, Is there a limited number of possible values? We could not write down every single possible Name that any new student might have. But we could write down a list of all the school subjects offered at our school.

These possible values, or categories, for a categorical variable are often called the levels of the variable.

Note that levels of a categorical variable can be represented by numbers! For example, consider the long form of the quiz grades dataset. The variable Quiz Number appears as numbers, but it represents categories: Quiz 1, Quiz 2, or Quiz 3. Remember the “questions to ask yourself”; it’s clear that you would not, for example, add Quiz 1 + Quiz 2 to get Quiz 3; so this cannot be a true quantitative variable!

Would it be reasonable to think of the Quiz Number variable as a label, since it helps us identify the cases of the dataset? Sure, it might!

However, we might actually want to study that variable as information, such as asking “Were the average grades different across the three quizzes?” In this case, since it is real information for our study rather than simply an identifier, it probably makes more sense to call it a categorical variable.

2.4 Other types of variables

So far, there is one variable in our original class roster dataset that we have not yet described: Notes. This contains free text, written by the teacher as reminders about her students. We wouldn’t really say this is categorical, since there is not a limited set of levels that could be used. Certainly it isn’t quantitative! And we couldn’t call it a label or an index either, since it doesn’t necessary uniquely identify each student; student 00006 has no Notes at all!

In the vast modern world of Data Science, you will encounter all sorts of ways to store information, that go beyond what we normally see in tabular data. Perhaps this is something simple, like a Notes column containing free-form text, or column giving a date or a time of an event. Perhaps this is something complex, like an image or a sound file.

When this happens, don’t be afraid to simply describe a column as “Other”.

For the purposes of this textbook, we will focus on how to analyze variables that can reasonably be called quantitative or categorical; as you become more advanced in your data analysis skills, you will learn how to study these “other” types as well.

3 Converting variables

3.1 Combining categories

Categorical variables that have a lot of different categories can be difficult to analyze. Sometimes, we may want to combine categories, so we have fewer to deal with. In particular, it might be simplest to turn our variable into a binary variable, i.e., one that has only two possible categories.

Suppose the teacher creating the datasets in this chapter is teaching a math class. She might then be interested in whether the student loves math or not, and the other favorite subjects are irrelevant. Then, she might want to make a binary variable in her dataset called Loves Math:

ID.Number	Name	Favorite Subject	Loves Math
00626	Stitch	Math	Yes
00011	Jane Hopper	Reading	No
00006		Math	Yes
24601	Jean Valjean	Music	No

This new binary variable makes it easy for us to see which students will probably enjoy the math class - two of them love math, and two of them have a different favorite subject.

3.2 Dummy variables

In your Data Science journey, you will likely eventually encounter a situation where it is convenient, or even necessary computationally, to have all your variables be quantitative. But of course, we don’t want to simply throw out the information in our categorical variables! Instead, we will convert them to “sort of” numeric variables, called dummy variables. For our class roster, that might look something like this:

ID Number	Name	Favorite is Math	Favorite is Reading	Favorite is Music
00626	Stitch	1	0	0
00011	Jane Hopper	0	1	0
00006		1	0	0
24601	Jean Valjean	0	0	1

Notice that the levels of the original categorical variable Favorite Subject have now each become a separate variable! The values of each of these dummy variables can only be either 1 (if that subject is the student’s favorite) or 0 (if it is not).

We sometimes call this process one-hot encoding because only one of the Favorite is [subject] columns is activated, or “hot”, for each of the students.

3.3 Discrete variables

Just as it is sometimes convenient to turn a categorical variable into quantitative variables, it may be of interest to turn something measured numerically into categories.

For some quantitative variables, it is already true that only a few values are possible, much like a categorical variable. For example, our Classes Taken variable counts the number of previous classes that student has taken. It is impossible (presumably) for a student to have taken 1.5 or 3.63 classes, so this variable can only have values that are whole numbers, also known as integers. A quantitative variable that can only use integers is called discrete. Depending on the goals of your analysis, it might be reasonable to think of a discrete variable as actually being categorical. Perhaps we could regard our Classes Taken variable as having the categories/levels of “0 classes”, “1 class”, “2 classes”, etc.

Do you find yourself wondering why we might ever want to do this switch from quantitative to categorical? Chapters 1.2 and 1.4 will show you some examples!

3.4 Binning

For quantitative variables that are not discrete, which we call continuous, we can’t just pretend the numbers are labels. For example, if we look at the grades on the first exam in this class, we wouldn’t want to have 100 categories called “100 points”, “99.5 points”, etc. That would be way too many categories to study! Instead, we need to do some binning where we combine ranges of values into categories. A natural binning of the Exam 1 Grade variable might be the traditional letter grade ranges, where 90-100 is an “A”, 80-90 is a “B”, etc.

ID Number	Name	Quiz 1 Grade	Quiz 1 Letter Grade
00626	Stitch	95	A
00011	Jane Hopper	72	B
00006		100	B
24601	Jean Valjean	87	D

3.5 Ordinal variables

The new categorical variable that we derived from Quiz 1 Grade has a very unique quality: it has a natural order. Although the values of this variable are not categories, not numbers, we still have a clear sequence for the levels: A, B, C, D, F. When a categorical variable has a “correct” order, we call it ordinal.

Complete Exercise 1.1.5 now.

3.6 Dates and times

One complicated variable variety you might see is related to date or time.

For example, imagine our student Jean Valjean took a particular class many years in a row. We might want to keep track of his progress:

ID Number	Name	Year	Grade
24601	Jean Valjean	1815	66
24601	Jean Valjean	1816	73
24601	Jean Valjean	1817	84
24601	Jean Valjean	1818	71
24601	Jean Valjean	1819	89

Consider the Year variable. We might reasonably treat this variable as:

Quantitative: In some sense, time is an “amount” of something, as it “counts” the number of days/months/years since the first one. We might reasonably say that year is a quantitative variable, measuring the time since Year 0. We could also consider this to be continuous, since it’s technically possible to have a year “1815.5”, i.e. 1,815 and a half years since year 0. However, in some contexts we might treat it as discrete, in the sense that we only measure once per whole year, and so we will never record numbers like “1815.5”.

Categorical: Often, we are trying to compare data over different periods of time. Perhaps we are asking questions like “Did Jean Valjean get the best grades in 1815, 1816, or 1817?” In this situation, we are studying three categories: one for each of the three years. Of course

Other: In many applications, we are simply keeping track of the date as a way to label our data instances. Or, perhaps we want to do something very complex that studies behavior over time. If we don’t have a clear reason to treat our date or time variable as quantitative or categorical for a particular analysis, it is reasonable to just call it “Other” or “Datetime” type.

Complete Exercise 1.1.4 now.

4 Missing data

It’s now time for us to tackle a glaring problem in our class roster - something you will encounter in nearly every real world dataset you analyze: missing data.

There are three instances of incomplete information in the class roster, each exemplifying a different type of missing data.

4.1 “Not Available” data

Notice that the name of student number 00006 is listed as NA. While it’s possible that her name genuinely is spelled “NA”, it’s more likely that this is a placeholder in the dataset, that stands for Not Available or Not Applicable.

There are all kinds of reasons that information may be replaced with NA in a cell in a data table. Perhaps the teacher creating the class roster forgot to ask Student 00006 her name. Perhaps there was an error in typing the data into a computer spreadsheet, and that cell was accidentally left blank. Perhaps Student 00006 does not actually have a given name, and prefers to be known only as “Six”.

Regardless of the reason, we are left with a data entry that is explicitly marked as missing. Unless we have a way to go back and talk to the original creator of the dataset - which is rarely feasible in real data science settings - we will never know why the information was not recorded. We only know that it is an inconvenience to deal with.

While NAis the most common way to denote unavailable data in a tidy data table, you will sometimes see other placeholders. One common version of this is to use a nonsensical number in place of missing information; for example, you might see a person’s age recorded as -1 or an exam score written as -99.

This approach is not recommended, since it can be easy to overlook these numbers and accidentally use them in calculations. However, it’s important to keep an eye out in your data for values that stand out or seem illogical; these might be coding for missing information.

4.2 Unknowable information

While an NA marks missing data, it doesn’t help us understand why the data is missing. Sometimes, missing data is recorded more specifically, in a way that includes the reasons. For example, in our class roster, the age of the student Stitch is listed as “unknown”. This tells us that the missing information - Stitch’s age - was not accidentally deleted or forgotten. In fact, nobody seems to know his true age, and so it was impossible to collect that information.

In this case, we know that the data is missing “on purpose”. It’s not just a gap in our dataset; it’s actually an interesting piece of information: in the same what that we know more about Jane Hopper when we know she is 10 years old, we also learn more about Stitch when we hear that his age is a mystery.

Even though more explanation feels better than an NA, we run into a problem when we think about variable types. Would you say that Age is a quantitative or categorical variable? Of course, it’s a countable number, so it’s quantitative! But the value unknown is a word, not a number. It doesn’t fit the variable type!

This can be a major inconvenience when it comes to reading data into a computer with code. A computer might see the value unknown, and thus wrongly assume that the entire Age variable was intended to be non-numeric!

4.3 Blank cells

Lastly, we may have situations where there is simply nothing written in a data cell. We see this in the class roster, where Student “Six” has no Notes.

Is this truly missing information? Or is it simply that the teacher did not have any additional notes to recorded about Six? There’s simply no way to tell from the dataset alone.

Typically, blank information will be interpreted by computers and data analysts as missing, and will often be converted to NA. It’s always important, though, to keep context in mind. A blank cell in the Name column would probably be missing data - but a blank cell in the Notes column may very well be a legitimate observation of “No notes needed”.

5 Summary: communicating with data collaborators

Phew! That’s a lot of variable descriptions and a lot of vocabulary. Let’s bring it all together now.

Suppose you are the teacher, and you have created some tidy datasets for your course. You would like to describe these to a collaborator. Your documentation might say something like this:

5.1 Class roster

Student ID	Name	Age	Classes Taken	Favorite Subject	Notes
00626	Stitch	unknown	3	Math	does not play well with others
00011	Jane Hopper	10	0	Reading	prefers to go by “El”
00006		26	5	Math
24601	Jean Valjean	50	5	Music	only speaks French
Fictitious dataset based on characters from Lilo & Stitch, Stranger Things, Battlestar Galactica, and Les Miserables.

My first dataset is a class roster. There are four rows (cases) and five columns (variables). Each row represents a student in the class. Our index column is called Student ID, and it contains the unique Student ID of the student. We also have a label for the student’s name. There is one observation missing in this column, indicated by an “NA” value.

The variables in the dataset are:

Age: The age of the student, in years. This continuous, quantitative variable ranges from 10 to 50 in our data. There is one missing observation, indicated by an “unknown” value.
Classes Taken: The number of previous courses the student has taken at our school. This discrete, quantitative variable ranges from 0 to 5 in our data.
Favorite Subject: The student’s favorite subject. This categorical variable has levels “Math”, “Music” and “Reading”.
Is Child: A binary variable derived from binning the Age variable, indicating whether the student is below 18 or not.
Notes: This column contains free text with notes about each student, where needed. Some notes are blank.

5.2 Quiz grades dataset

ID Number	Name	Quiz Number	Grade
00626	Stitch	1	95
00011	Jane Hopper	1	72
00006		1	100
24601	Jean Valjean	1	87
00626	Stitch	2	96
00011	Jane Hopper	2	86
00006		2	80
24601	Jean Valjean	2	67
00626	Stitch	3	51
00011	Jane Hopper	3	93
00006		3	100
24601	Jean Valjean	3	78

My second dataset shows quiz scores for each student. Each row represents an instance of a student taking a particular quiz. This data is in long form. The Student ID and Name columns are the same as those in the class roster dataset. The cases can be uniquely identified by the Student ID column and the Quiz Number column.

The two variables are:

Quiz Number: An ordinal categorical variable with possible levels “1”, “2”, and “3”, indicating which quiz was taken.
Grade: A continuous quantitative variable indicating the student’s score on the quiz.

5.3 Participation Dataset

Index	Student	Day	Asked Question	Presentation	Handout
1	Jean Valjean	Monday	0	1	0
2	Jane Hopper	Monday	1	0	1
3	Stitch	Tuesday	0	0	0
4	Jane Hopper	Thursday	1	0	0
5	Jane Hopper	Thursday	1	0	0

My third dataset shows class participation of the students. Each row represents an instance of a student participating in class. This data is in wide form. The Student ID and Name columns are the same as those in the class roster dataset. The Index column is an arbitrary index label.

The type of participation observed has been one-hot encoded, so the variables in this dataset are:

Day: The day of the week that the participation moment happened.
Asked Question: A dummy variable telling whether the participation type was question asking.
Presentation: A dummy variable telling whether the participation type was a class presentation.
Handout: A dummy variable telling whether the participation type was passing out supplies.