Unit 1 Glossary

Vocabulary Words

Blinding: Ensuring that subjects aren’t aware of which treatment they are being given.
Blocking: Separating experimental subjects into meaningful groups before dividing them into treatments, so the groups can be studied separately.
Cluster or Multistage sample: A sampling approach where groups of observations are randomly chosen instead of individuals.
Facets: Separating your visualization into different plots for each category.
False response bias: When subjects may give answers that do not represent their true opinions, due to the phrasing of the question or social stigma around certain answers.
Geometry: Another word for the type of plot chosen (e.g. bar plot, histogram).
Nonresponse bias: When subjects who are unable to respond to a survey are meaningfully different from those who are able to respond, leading to a biased sample.
Stratified random sample: A sampling technique where the population is divided into meaningful groups, and observations are then sampled from each group.
Undercoverage bias: When researchers are unable to collect information from a group of subjects that is meaningfully different from the ones they can access, leading to a biased sample.
Voluntary response bias: When subjects who choose to respond to a survey have systematically different opinions than those who do not choose to respond, leading to a biased sample.
biased sample: A sample that you might expect to be systematically different from the population.
binary variable: A categorical variable with only two possible categories.
binning: Making a categorical variable out of a quantitative variable by treating ranges of values as categories.
case: An entity we collect data about.
categorical variable: A variable whose values come from a limited set of categories.
cause-and-effect: A relationship in which changes in one variable produce changes in another. We can only claim two related variables have a cause-and-effect relationship if they are measured in a well-designed and randomized experiment.
continuous: A quantitative variable that is not discrete.
control group: A group of subjects whose treatment is to receive no change to their normal behavior.
convenience sample: Choosing observations for a dataset based on what is easiest to sample.
discrete: A quantitative variable that can only take integer values.
documentation: Summary information about a dataset and its source.
double-blinding: Ensuring that the subject and the researchers don’t know which subject is getting which treatment.
dummy variables: Columns consisting of only 0 or 1, representing presence or absence of a category.
experiment: Data collected in a way that includes researchers doing something that impacts the cases and studying the effects.
exploratory: Analyzing data by looking for interesting patterns or trends, rather than answering a specific research question.
factor: A programming term for a categorical variable.
index: A variable containing a unique label for each case.
label: A variable that gives identification information about the cases, rather than measured information.
legend: A section of a plot showing the viewer which colors or symbols correspond to which values or levels.
long form: A dataset with multiple measurements contained in different rows, with a corresponding name column.
lurking variable: An unmeasured variable that impacts two or more measured variables, making them appear to be related to each other.
mapping/aesthetic: Which variables are involved in the plot, and which element of the plot they impact.
missing data: Cells of a dataset where the information was not recorded, or was not properly recorded, or cannot exist.
numeric variable: A variable that measures an amount or quantity.
observational data: Data collected in a way that does not impact the cases.
observational unit: An entity we collect data about. (Warning - do not confuse this with ‘units of measure’, as in inches or centimeters.)
ordinal variable: A categorical variable where the categories have an order.
placebo: A fake treatment that has no effect, to keep subjects from knowing they are in the control group.
qualitative variable: A variable whose values come from a limited set of categories.
quantitative variable: A variable that measures an amount or quantity.
reproducible: Analysis that is clearly documented, so that it can be re-created by others.
simple random sample: Selecting observations in a way that every case in the population is equally likely to be chosen.
tabular: A word to describe data that is stored in rows and columns.
tidy data: Tabular data where every case is a row, every variable is a column, and every cell contains only one value.
treatments: The different conditions that researchers apply to subjects in an experiment.
variable: A measured characteristic or quantity in data.
wide form: A dataset with multiple measurements spread out over multiple columns.
x-axis: The horizontal line in a plot.
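
The three random sampling terms defined above (simple random sample, stratified random sample, cluster sample) can be easier to tell apart side by side. The following is a minimal sketch, not a course-provided implementation: the population of twelve students, the "section" grouping, and all function names are hypothetical and chosen only for illustration. It uses only Python's standard library.

```python
import random

# Hypothetical population: 12 students, four in each of three class sections.
population = [
    {"id": i, "section": section}
    for i, section in enumerate(["A"] * 4 + ["B"] * 4 + ["C"] * 4)
]

rng = random.Random(1)  # fixed seed so the sketch is reproducible


def simple_random_sample(cases, n):
    """Simple random sample: every case is equally likely to be chosen."""
    return rng.sample(cases, n)


def stratified_random_sample(cases, key, n_per_group):
    """Stratified sample: divide cases into groups, then sample from every group."""
    groups = {}
    for case in cases:
        groups.setdefault(case[key], []).append(case)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, n_per_group))
    return sample


def cluster_sample(cases, key, n_groups):
    """Cluster sample: randomly choose whole groups and keep every case in them."""
    group_names = sorted({case[key] for case in cases})
    chosen = set(rng.sample(group_names, n_groups))
    return [case for case in cases if case[key] in chosen]


srs = simple_random_sample(population, 6)                        # any 6 students
stratified = stratified_random_sample(population, "section", 2)  # 2 from each section
clustered = cluster_sample(population, "section", 2)             # all of 2 sections
```

In this sketch the stratified sample always contains every section, while the cluster sample always leaves one section out entirely. That difference is the one to look for when deciding which term describes a study.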

Key Skills and Concepts

Identify steps needed to make an un-tidy dataset become tidy.
Identify the cases and identify or create an index column in a dataset.
Identify what types of variables are present in a dataset, and what typical values we could see for those variables.
Combine categories of a categorical variable.
Convert a categorical column into several dummy variables via one-hot encoding.
Consider the data context of missing values, to determine how to interpret or edit them.
Sketch and interpret bar plots for categorical variables.
Sketch and interpret stacked dotplots.
Estimate the mean and median from a dotplot, histogram, or density curve.
Sketch and interpret histograms for quantitative variables.
Sketch and interpret density curves for quantitative variables.
Estimate the chances of seeing certain values by looking at a histogram or density curve.
Consider the source of the data you are analyzing, and the motivations of the people or organizations who created it.
Consider the impact and implications of your data collection or analysis: Who could be harmed, and who could be helped?
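
Several of the data-preparation skills above (creating an index column, combining categories, one-hot encoding, and interpreting missing values before summarizing) can be sketched in a few lines. This is an illustrative example under stated assumptions, not the course's own code: the survey rows, the column names, the choice to merge rare pets into "other", and the choice to skip missing values rather than fill them in are all hypothetical. It uses only Python's standard library.

```python
from statistics import mean, median

# Hypothetical raw survey rows; None marks a missing value.
rows = [
    {"name": "Ana", "pet": "dog", "hours_sleep": 7.5},
    {"name": "Ben", "pet": "cat", "hours_sleep": None},
    {"name": "Cy", "pet": "hamster", "hours_sleep": 6.0},
    {"name": "Dee", "pet": "dog", "hours_sleep": 9.0},
    {"name": "Eli", "pet": "gerbil", "hours_sleep": 8.0},
]

# Create an index column: a unique label for each case.
for i, row in enumerate(rows, start=1):
    row["case_id"] = i

# Combine categories: pets other than dog or cat are merged into "other".
keep = {"dog", "cat"}
for row in rows:
    row["pet_group"] = row["pet"] if row["pet"] in keep else "other"

# One-hot encoding: one 0/1 dummy column per category.
categories = sorted({row["pet_group"] for row in rows})
for row in rows:
    for category in categories:
        row[f"pet_{category}"] = int(row["pet_group"] == category)

# Missing values: skipped here (an assumption), not treated as 0 hours.
sleep = [row["hours_sleep"] for row in rows if row["hours_sleep"] is not None]
sleep_mean = mean(sleep)      # 7.625
sleep_median = median(sleep)  # 7.75
```

Treating Ben's missing value as 0 would pull the mean down to 6.1 hours, which is why the data context matters before choosing how to handle missing cells.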