Student ID | Name | Age | Classes Taken | Favorite Subject | Notes |
---|---|---|---|---|---|
00626 | Stitch | unknown | 3 | Math | does not play well with others |
00011 | Jane Hopper | 10 | 0 | Reading | prefers to go by “El” |
00006 | 26 | 5 | Math | ||
24601 | Jean Valjean | 50 | 5 | Music | only speaks French |
Fictitious dataset based on characters from Lilo & Stitch, Stranger Things, Battlestar Galactica, and Les Miserables. |
1 Introduction to tidy tabular data
Preparing and describing datasets
“From the moment you start tidying, you will be compelled to reset your life. As a result, your life will start to change… Tidying is just a tool, not the final destination. The true goal should be to establish the lifestyle you want most once your house has been put in order.”
― Marie Kondō, The Life-Changing Magic of Tidying Up: The Japanese Art of Decluttering and Organizing
Most data you will work with is stored - perhaps after some processing! - in what we call tabular format. This simply means that we can write it down in a grid with rows and columns.
For example, a teacher wishing to keep track of her students might create a data table as follows:
In this class roster, each row represents a single student, each column represents some type of information collected about the students, and each section in the grid is filled out with that information for a specific student.
To use formal vocabulary, we would say that each column is a variable, i.e., the name for a piece of information we are collecting, like “Age” or “Name”. Each row is a case or observational unit; these are the things that we collect information about. The cells of the table contain the values for a particular variable and case. When a dataset is recorded in this way, we call that tidy data.
1.1 Un-tidy Data
Out in the wild world of real data, you may encounter tabular datasets that are recorded in an “un-tidy” way. For example, sometimes the table is “flipped” or “transposed”, meaning that the variables are written on the rows, and each column is a case:
ID | 00626 | 00011 | 00006 | 24601 |
---|---|---|---|---|
Name | Stitch | Jane Hopper | Jean Valjean | |
Age | unknown | 10 | 26 | 50 |
Classes Taken | 3 | 0 | 5 | 5 |
Favorite Subject | Math | Reading | Math | Music |
Notes | does not play well with others | prefers to go by “El” | only speaks French |
This might seem easier to read to some, but when it comes to doing Data Science using computational tools, a flipped table won’t be understood correctly by most software.
Sometimes, you might see multiple variables recorded together, or column labels split across two rows:
Student ID | Name, Age | Subject | Notes | |
---|---|---|---|---|
X.classes.taken. | X__ | Favorite | Least.Favorite | X_ |
00626 (3) | Stitch | Math | Writing | does not play well with others |
00011 (0) | Jane Hopper, Age 10 | Reading | Public Speaking | prefers to go by “El” |
00006 (5) | Age 26 | Math | Ethics | |
24601 (5) | Jean Valjean, Age 50 | Music | Law | only speaks French |
This example presents us with some challenges: How do we separate the Name, Age
column into two separate ones? How do we fix the two Subjects
columns, so that a computer reading this dataset does not think the first row represents a student? What do we do about Classes Taken
being included in parentheses in the Student ID
column?
When you are presented with data for analysis, your first step will likely be to tidy it up into a format that is consistent and understandable by machines.
1.2 Identifying cases
Of course, in order to tidy up a dataset, you need to have a good understanding of what each row should represent. Sometimes, this is straightforward, as in the student roster dataset - each row of our dataset is a different individual student. We even have a column containing unique labels (also known as an index column) for each case: the Student ID
column.
Other times, though, it is not so clear what our cases are. Suppose, for example, the teacher of our strange class listed above is also recording student participation in class that week.
Student | Day | Type of Participation |
---|---|---|
Jean Valjean | Monday | Class presentation |
Jane Hopper | Monday | Asked question |
Stitch | Tuesday | Handed out pencils |
Jane Hopper | Thursday | Asked question |
Jane Hopper | Thursday | Asked question |
At first glance, it may be tempting to say that the cases for this tabular dataset are once again individual students. But that doesn’t quite work - we see that “Jane Hopper” is listed twice! Therefore, the rows must represent something different. Here, our cases are moments of class participation. Each row is a unique instance of a student interacting with the class.
In this dataset, we unfortunately do not have an index column. We cannot even use two or three of the columns together to make unique labels - we can see that Jane Hopper asked two questions on Thursday, and so the last two rows of the dataset are identical to each other, even though they represent two different cases of a moment of participation.
When identifying cases, it can help to make a new index column, especially if no unique label exists already. For example, we might add to the above dataset:
Moment of Participation | Student | Day | Type of Participation |
---|---|---|---|
1 | Jean Valjean | Monday | Class presentation |
2 | Jane Hopper | Monday | Asked question |
3 | Stitch | Tuesday | Handed out pencils |
4 | Jane Hopper | Thursday | Asked question |
5 | Jane Hopper | Thursday | Asked question |
The new variable Moment of Participation
is not really containing any meaningful information; it is simply helping us to keep track of our cases and what each one represents.
1.3 Long versus wide format
The quote at the start of this chapter tells us,
“The true goal [of tidying] should be to establish the lifestyle you want most once your house has been put in order.”
The same idea applies when talking about datasets. There is not always one single “best” tidy format; it depends on the questions you are trying to answer, or the analyses you are trying to do.
For example, suppose a teacher wants to record grades on assignments in the class. She might choose to write down the data like this:
ID Number | Name | Quiz 1 Grade | Quiz 2 Grade | Quiz 3 Grade |
---|---|---|---|---|
00626 | Stitch | 95 | 96 | 51 |
00011 | Jane Hopper | 72 | 86 | 93 |
00006 | 100 | 80 | 100 | |
24601 | Jean Valjean | 87 | 67 | 78 |
Certainly, this dataset could be considered tidy: Each row is a student, and each column is information about that student, including their grades on each quiz. This format makes a lot of sense if we want to answer questions about the progress of each individual student. We can see that Jane Hopper steadily improved over the three quizzes, while Stitch had a bit of trouble with the third one.
But what if we want to study overall class performance? Maybe we would like to know the average grade across all quizzes. From a computational standpoint, this data format makes our job harder. We would like to be able to use a command like
> Take the average of the Grade
column.
Alas, we don’t have a Grade
column. Our data is in wide form: the grades for each quiz are spread out across three different columns. We would instead have to ask a computer to
> Add the Quiz 1 Grade
column to the Quiz 2 Grade
column.
> Add this total to the Quiz 3 Grade
column.
> Divide the total by 12. (4 students times 3 quizzes = 12 total grades)
That is much more annoying, isn’t it? And imagine if we had ten quizzes instead of three!
Instead, we’d like to have our data in long form, like this:
ID Number | Name | Quiz Number | Grade |
---|---|---|---|
00626 | Stitch | 1 | 95 |
00011 | Jane Hopper | 1 | 72 |
00006 | 1 | 100 | |
24601 | Jean Valjean | 1 | 87 |
00626 | Stitch | 2 | 96 |
00011 | Jane Hopper | 2 | 86 |
00006 | 2 | 80 | |
24601 | Jean Valjean | 2 | 67 |
00626 | Stitch | 3 | 51 |
00011 | Jane Hopper | 3 | 93 |
00006 | 3 | 100 | |
24601 | Jean Valjean | 3 | 78 |
This long format certainly feels a bit less concise. We’ve traded out a table with 4 rows and 5 columns for one with 12 rows and 4 columns, which takes up more space. However, for some analyses we might want to do, the data is now tidier.
In this long form table, the rows now represent unique quizzes taken. We have no single label or index column, but we do have each row uniquely defined by which student is taking which quiz. In other words, the Student ID
and Quiz Number
columns can be combined to label each case.
Ultimately, the “tidiest” format for your data is the one that makes your analysis understandable and easy to compute.
2 Describing variables
Once we have our data neat, tidy, and ready to go, it’s time to think about what information is contained in this dataset, and what we might be able to do with it. We would like to identify what types of variables are present in our dataset, and what typical values we could see for those variables.
2.1 Labels and Indices
We have already seen one important variable type: a label. This is a column that does not truly represent information being studied; instead, it is providing some kind of identifying information about the cases. In our class roster, the Student ID
and Name
columns are both labels, since they tell us which student the row represents.
When a label is a unique identifier for the cases, we can also call it an index. In our class roster, Student ID
is probably an index, since we wouldn’t expect two students to have the same Student ID
. However, Name
is probably not an index, since it’s possible to have, for example, two different students both named “Jane Hopper”.
2.2 Quantitative Variables
When we gather information about our cases, sometimes this comes in quantitative also called numeric, form. For example, in our class roster, the Age
variable is a number in years and the Classes Taken
variable is a number of previous courses; and in our quiz scores dataset, the Grade
column contains the points on the exam out of 100.
At first glance, it may seem easy to identify which variables are quantitative: it’s the ones with numbers! Not so fast. Remember the Student ID
column in the class roster? That was a number, too.
To be a truly quantitative variable, the number needs to be measuring a quantity or amount. Some questions you can ask yourself are:
What units is this variable measured in? For the
Age
variable, the answer is “years”. For theGrade
variable, it is points. For theStudent ID
variable, it is… not answerable, because this is not a quantitative variable.Is one value “bigger” than another? A
Grade
of 85 is higher than a grade of 72. But aStudent ID
of 00006 is not “smaller” or “less” than an ID of 00011.Can we add two values together? Jane Hopper and Jean Valjean have been alive for a total of 10 + 50 = 60 years. However, we would not say that their total
Student ID
is 24601 + 00011 = 24612.
2.3 Categorical Variables
The other way we might record information, besides numeric, is by noting which of many possible categories is relevant. These variables are called categorical or qualitative. Some coding languages also call them factor variables.
A good example of a categorical variable is the column Favorite Subject
in the class roster. Two of our students have the same favorite subject, “Math”, and it is easy to imagine a new student joining the class who also has “Music” as their favorite. Each of our students is choosing a favorite from among the available categories.
The difference between a categorical variable and a label can seem a little fuzzy - recall that we said that Name
was a label, even though we could imagine a new student also named “Jane Hopper”. One question to ask yourself is, Is there a limited number of possible values? We could not write down every single possible Name
that any new student might have. But we could write down a list of all the school subjects offered at our school.
These possible values, or categories, for a categorical variable are often called the levels of the variable.
Note that levels of a categorical variable can be represented by numbers! For example, consider the long form of the quiz grades dataset. The variable Quiz Number
appears as numbers, but it represents categories: Quiz 1, Quiz 2, or Quiz 3. Remember the “questions to ask yourself”; it’s clear that you would not, for example, add Quiz 1 + Quiz 2 to get Quiz 3; so this cannot be a true quantitative variable!
2.4 Other types of variables
So far, there is one variable in our original class roster dataset that we have not yet described: Notes
. This contains free text, written by the teacher as reminders about her students. We wouldn’t really say this is categorical, since there is not a limited set of levels that could be used. Certainly it isn’t quantitative! And we couldn’t call it a label or an index either, since it doesn’t necessary uniquely identify each student; student 00006 has no Notes
at all!
In the vast modern world of Data Science, you will encounter all sorts of ways to store information, that go beyond what we normally see in tabular data. Perhaps this is something simple, like a Notes
column containing free-form text, or column giving a date or a time of an event. Perhaps this is something complex, like an image or a sound file.
When this happens, don’t be afraid to simply describe a column as “Other”.
For the purposes of this textbook, we will focus on how to analyze variables that can reasonably be called quantitative or categorical; as you become more advanced in your data analysis skills, you will learn how to study these “other” types as well.
3 Converting variables
3.1 Combining categories
Categorical variables that have a lot of different categories can be difficult to analyze. Sometimes, we may want to combine categories, so we have fewer to deal with. In particular, it might be simplest to turn our variable into a binary variable, i.e., one that has only two possible categories.
Suppose the teacher creating the datasets in this chapter is teaching a math class. She might then be interested in whether the student loves math or not, and the other favorite subjects are irrelevant. Then, she might want to make a binary variable in her dataset called Loves Math
:
ID.Number | Name | Favorite Subject | Loves Math |
---|---|---|---|
00626 | Stitch | Math | Yes |
00011 | Jane Hopper | Reading | No |
00006 | Math | Yes | |
24601 | Jean Valjean | Music | No |
This new binary variable makes it easy for us to see which students will probably enjoy the math class - two of them love math, and two of them have a different favorite subject.
3.2 Dummy variables
In your Data Science journey, you will likely eventually encounter a situation where it is convenient, or even necessary computationally, to have all your variables be quantitative. But of course, we don’t want to simply throw out the information in our categorical variables! Instead, we will convert them to “sort of” numeric variables, called dummy variables. For our class roster, that might look something like this:
ID Number | Name | Favorite is Math | Favorite is Reading | Favorite is Music |
---|---|---|---|---|
00626 | Stitch | 1 | 0 | 0 |
00011 | Jane Hopper | 0 | 1 | 0 |
00006 | 1 | 0 | 0 | |
24601 | Jean Valjean | 0 | 0 | 1 |
Notice that the levels of the original categorical variable Favorite Subject
have now each become a separate variable! The values of each of these dummy variables can only be either 1 (if that subject is the student’s favorite) or 0 (if it is not).
We sometimes call this process one-hot encoding because only one of the Favorite is [subject]
columns is activated, or “hot”, for each of the students.
3.3 Discrete variables
Just as it is sometimes convenient to turn a categorical variable into quantitative variables, it may be of interest to turn something measured numerically into categories.
For some quantitative variables, it is already true that only a few values are possible, much like a categorical variable. For example, our Classes Taken
variable counts the number of previous classes that student has taken. It is impossible (presumably) for a student to have taken 1.5 or 3.63 classes, so this variable can only have values that are whole numbers, also known as integers. A quantitative variable that can only use integers is called discrete. Depending on the goals of your analysis, it might be reasonable to think of a discrete variable as actually being categorical. Perhaps we could regard our Classes Taken
variable as having the categories/levels of “0 classes”, “1 class”, “2 classes”, etc.
3.4 Binning
For quantitative variables that are not discrete, which we call continuous, we can’t just pretend the numbers are labels. For example, if we look at the grades on the first exam in this class, we wouldn’t want to have 100 categories called “100 points”, “99.5 points”, etc. That would be way too many categories to study! Instead, we need to do some binning where we combine ranges of values into categories. A natural binning of the Exam 1 Grade
variable might be the traditional letter grade ranges, where 90-100 is an “A”, 80-90 is a “B”, etc.
ID Number | Name | Quiz 1 Grade | Quiz 1 Letter Grade |
---|---|---|---|
00626 | Stitch | 95 | A |
00011 | Jane Hopper | 72 | B |
00006 | 100 | B | |
24601 | Jean Valjean | 87 | D |
3.5 Ordinal variables
The new categorical variable that we derived from Quiz 1 Grade
has a very unique quality: it has a natural order. Although the values of this variable are not categories, not numbers, we still have a clear sequence for the levels: A
, B
, C
, D
, F
. When a categorical variable has a “correct” order, we call it ordinal.
3.6 Dates and times
One complicated variable variety you might see is related to date or time.
For example, imagine our student Jean Valjean took a particular class many years in a row. We might want to keep track of his progress:
ID Number | Name | Year | Grade |
---|---|---|---|
24601 | Jean Valjean | 1815 | 66 |
24601 | Jean Valjean | 1816 | 73 |
24601 | Jean Valjean | 1817 | 84 |
24601 | Jean Valjean | 1818 | 71 |
24601 | Jean Valjean | 1819 | 89 |
Consider the Year
variable. We might reasonably treat this variable as:
Quantitative: In some sense, time is an “amount” of something, as it “counts” the number of days/months/years since the first one. We might reasonably say that year is a quantitative variable, measuring the time since Year 0. We could also consider this to be continuous, since it’s technically possible to have a year “1815.5”, i.e. 1,815 and a half years since year 0. However, in some contexts we might treat it as discrete, in the sense that we only measure once per whole year, and so we will never record numbers like “1815.5”.
Categorical: Often, we are trying to compare data over different periods of time. Perhaps we are asking questions like “Did Jean Valjean get the best grades in 1815, 1816, or 1817?” In this situation, we are studying three categories: one for each of the three years. Of course
Other: In many applications, we are simply keeping track of the date as a way to label our data instances. Or, perhaps we want to do something very complex that studies behavior over time. If we don’t have a clear reason to treat our date or time variable as quantitative or categorical for a particular analysis, it is reasonable to just call it “Other” or “Datetime” type.
4 Missing data
It’s now time for us to tackle a glaring problem in our class roster - something you will encounter in nearly every real world dataset you analyze: missing data.
There are three instances of incomplete information in the class roster, each exemplifying a different type of missing data.
4.1 “Not Available” data
Notice that the name of student number 00006 is listed as NA
. While it’s possible that her name genuinely is spelled “NA”, it’s more likely that this is a placeholder in the dataset, that stands for Not Available or Not Applicable.
There are all kinds of reasons that information may be replaced with NA
in a cell in a data table. Perhaps the teacher creating the class roster forgot to ask Student 00006 her name. Perhaps there was an error in typing the data into a computer spreadsheet, and that cell was accidentally left blank. Perhaps Student 00006 does not actually have a given name, and prefers to be known only as “Six”.
Regardless of the reason, we are left with a data entry that is explicitly marked as missing. Unless we have a way to go back and talk to the original creator of the dataset - which is rarely feasible in real data science settings - we will never know why the information was not recorded. We only know that it is an inconvenience to deal with.
4.2 Unknowable information
While an NA
marks missing data, it doesn’t help us understand why the data is missing. Sometimes, missing data is recorded more specifically, in a way that includes the reasons. For example, in our class roster, the age of the student Stitch is listed as “unknown”. This tells us that the missing information - Stitch’s age - was not accidentally deleted or forgotten. In fact, nobody seems to know his true age, and so it was impossible to collect that information.
In this case, we know that the data is missing “on purpose”. It’s not just a gap in our dataset; it’s actually an interesting piece of information: in the same what that we know more about Jane Hopper when we know she is 10 years old, we also learn more about Stitch when we hear that his age is a mystery.
4.3 Blank cells
Lastly, we may have situations where there is simply nothing written in a data cell. We see this in the class roster, where Student “Six” has no Notes
.
Is this truly missing information? Or is it simply that the teacher did not have any additional notes to recorded about Six? There’s simply no way to tell from the dataset alone.
Typically, blank information will be interpreted by computers and data analysts as missing, and will often be converted to NA
. It’s always important, though, to keep context in mind. A blank cell in the Name
column would probably be missing data - but a blank cell in the Notes
column may very well be a legitimate observation of “No notes needed”.
5 Summary: communicating with data collaborators
Phew! That’s a lot of variable descriptions and a lot of vocabulary. Let’s bring it all together now.
Suppose you are the teacher, and you have created some tidy datasets for your course. You would like to describe these to a collaborator. Your documentation might say something like this:
5.1 Class roster
Student ID | Name | Age | Classes Taken | Favorite Subject | Notes |
---|---|---|---|---|---|
00626 | Stitch | unknown | 3 | Math | does not play well with others |
00011 | Jane Hopper | 10 | 0 | Reading | prefers to go by “El” |
00006 | 26 | 5 | Math | ||
24601 | Jean Valjean | 50 | 5 | Music | only speaks French |
Fictitious dataset based on characters from Lilo & Stitch, Stranger Things, Battlestar Galactica, and Les Miserables. |
My first dataset is a class roster. There are four rows (cases) and five columns (variables). Each row represents a student in the class. Our index column is called Student ID
, and it contains the unique Student ID of the student. We also have a label for the student’s name. There is one observation missing in this column, indicated by an “NA” value.
The variables in the dataset are:
Age
: The age of the student, in years. This continuous, quantitative variable ranges from 10 to 50 in our data. There is one missing observation, indicated by an “unknown” value.Classes Taken
: The number of previous courses the student has taken at our school. This discrete, quantitative variable ranges from 0 to 5 in our data.Favorite Subject
: The student’s favorite subject. This categorical variable has levels “Math”, “Music” and “Reading”.Is Child
: A binary variable derived from binning the Age variable, indicating whether the student is below 18 or not.Notes
: This column contains free text with notes about each student, where needed. Some notes are blank.
5.2 Quiz grades dataset
ID Number | Name | Quiz Number | Grade |
---|---|---|---|
00626 | Stitch | 1 | 95 |
00011 | Jane Hopper | 1 | 72 |
00006 | 1 | 100 | |
24601 | Jean Valjean | 1 | 87 |
00626 | Stitch | 2 | 96 |
00011 | Jane Hopper | 2 | 86 |
00006 | 2 | 80 | |
24601 | Jean Valjean | 2 | 67 |
00626 | Stitch | 3 | 51 |
00011 | Jane Hopper | 3 | 93 |
00006 | 3 | 100 | |
24601 | Jean Valjean | 3 | 78 |
My second dataset shows quiz scores for each student. Each row represents an instance of a student taking a particular quiz. This data is in long form. The Student ID
and Name
columns are the same as those in the class roster dataset. The cases can be uniquely identified by the Student ID
column and the Quiz Number
column.
The two variables are:
Quiz Number
: An ordinal categorical variable with possible levels “1”, “2”, and “3”, indicating which quiz was taken.Grade
: A continuous quantitative variable indicating the student’s score on the quiz.
5.3 Participation Dataset
Index | Student | Day | Asked Question | Presentation | Handout |
---|---|---|---|---|---|
1 | Jean Valjean | Monday | 0 | 1 | 0 |
2 | Jane Hopper | Monday | 1 | 0 | 1 |
3 | Stitch | Tuesday | 0 | 0 | 0 |
4 | Jane Hopper | Thursday | 1 | 0 | 0 |
5 | Jane Hopper | Thursday | 1 | 0 | 0 |
My third dataset shows class participation of the students. Each row represents an instance of a student participating in class. This data is in wide form. The Student ID
and Name
columns are the same as those in the class roster dataset. The Index
column is an arbitrary index label.
The type of participation observed has been one-hot encoded, so the variables in this dataset are:
Day
: The day of the week that the participation moment happened.Asked Question
: A dummy variable telling whether the participation type was question asking.Presentation
: A dummy variable telling whether the participation type was a class presentation.Handout
: A dummy variable telling whether the participation type was passing out supplies.