Unit 1 Glossary
Vocabulary Words
Blinding: Ensuring that subjects aren’t aware of which treatment they are being given.
Blocking: Separating experimental subjects into meaningful groups before dividing them into treatments, so the groups can be studied separately.
Cluster or Multistage sample: A sampling approach where groups of observations are randomly chosen instead of individuals.
Facets: Separating your visualization into different plots for each category.
False response bias: When subjects may give answers that do not represent their true opinions, due to the phrasing of the question or social stigma around certain answers.
Geometry: Another word for the type of plot chosen (e.g., bar plot, histogram).
Nonresponse bias: When subjects who are unable to respond to a survey are meaningfully different from those who are able to respond, leading to a biased sample.
Stratified random sample: A sampling technique where the population is divided into meaningful groups, and observations are then sampled from each group.
Undercoverage bias: When researchers are unable to collect information from a group of subjects that is meaningfully different from the ones they can access, leading to a biased sample.
Voluntary response bias: When subjects who choose to respond to a survey have systematically different opinions than those who do not choose to respond, leading to a biased sample.
biased sample: A sample that you might expect to be systematically different from the population.
binary variable: A categorical variable with only two possible categories.
binning: Making a categorical variable out of a quantitative variable by treating ranges of values as categories.
case: An entity we collect data about.
categorical variable: A variable whose values come from a limited set of categories.
cause-and-effect: A relationship in which changes in one variable produce changes in another. We can only claim two related variables have a cause-and-effect relationship if they are measured in a well-designed and randomized experiment.
continuous: A quantitative variable that is not discrete.
control group: A group of subjects who receive no treatment, so their normal behavior is left unchanged.
convenience sample: Choosing observations for a dataset based on what is easiest to sample.
discrete: A quantitative variable that can only take integer values.
documentation: Summary information about a dataset and its source.
double-blinding: Ensuring that the subject and the researchers don’t know which subject is getting which treatment.
dummy variables: Columns consisting of only 0 or 1, representing presence or absence of a category.
experiment: Data collected in a way that includes researchers doing something that impacts the cases and studying the effects.
exploratory: Analyzing data by looking for interesting patterns or trends, rather than answering a specific research question.
factor: A programming term for a categorical variable.
index: A variable containing a unique label for each case.
label: A variable that gives identification information about the cases, rather than measured information.
legend: A section of a plot showing the viewer which colors or symbols correspond to which values or levels.
long form: A dataset with multiple measurements contained in different rows, with a corresponding name column.
lurking variable: An unmeasured variable that impacts two or more measured variables, making them appear to be related to each other.
mapping/aesthetic: Which variables are involved in the plot, and which element of the plot they impact.
missing data: Cells of a dataset where the information was not recorded, or was not properly recorded, or cannot exist.
numeric variable: A variable that measures an amount or quantity.
observational data: Data collected in a way that does not impact the cases.
observational unit: An entity we collect data about. (Warning - do not confuse this with ‘units of measure’, as in inches or centimeters.)
ordinal variable: A categorical variable where the categories have an order.
placebo: A fake treatment that has no effect, to keep subjects from knowing they are in the control group.
qualitative variable: A variable whose values come from a limited set of categories.
quantitative variable: A variable that measures an amount or quantity.
reproducible: Analysis that is clearly documented, so that it can be re-created by others.
simple random sample: Selecting observations in a way that every case in the population is equally likely to be chosen.
tabular: A word to describe data that is stored in rows and columns.
tidy data: Tabular data where every case is a row, every variable is a column, and every cell contains only one value.
treatments: The different conditions that researchers apply to subjects in an experiment.
variable: A measured characteristic or quantity in data.
wide form: A dataset with multiple measurements spread out over multiple columns.
x-axis: The horizontal axis of a plot.
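The three sampling terms in this glossary (simple random sample, stratified random sample, and cluster sample) are easiest to tell apart side by side. The following is a minimal sketch using only Python's standard library. The `students` population, the `classroom` grouping column, and all function names are made up for illustration and do not come from the course materials.

```python
import random
from collections import defaultdict


def group_by(cases, key):
    """Split a list of cases (dicts) into groups that share a value of `key`."""
    groups = defaultdict(list)
    for case in cases:
        groups[case[key]].append(case)
    return groups


def simple_random_sample(cases, n, seed=0):
    """Pick n cases so that every case is equally likely to be chosen."""
    rng = random.Random(seed)
    return rng.sample(cases, n)


def stratified_sample(cases, key, n_per_group, seed=0):
    """Divide the cases into groups, then sample from EVERY group."""
    rng = random.Random(seed)
    groups = group_by(cases, key)
    sample = []
    for name in sorted(groups):
        sample.extend(rng.sample(groups[name], n_per_group))
    return sample


def cluster_sample(cases, key, n_clusters, seed=0):
    """Randomly choose whole groups, then keep every case in those groups."""
    rng = random.Random(seed)
    groups = group_by(cases, key)
    chosen = rng.sample(sorted(groups), n_clusters)
    return [case for name in chosen for case in groups[name]]


# Hypothetical population: 12 students spread evenly over 3 classrooms.
students = [{"id": i, "classroom": "ABC"[i % 3]} for i in range(12)]

srs = simple_random_sample(students, 4)
strat = stratified_sample(students, "classroom", 2)
clust = cluster_sample(students, "classroom", 2)
```

The stratified sample always contains students from all three classrooms, while the cluster sample contains every student from only two of them. The `seed` argument is there so the sketch gives the same result each time it runs, which supports reproducible analysis.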
Key Skills and Concepts
- Identify steps needed to make an un-tidy dataset become tidy.
- Identify the cases and identify or create an index column in a dataset.
- Identify what types of variables are present in a dataset, and what typical values we could see for those variables.
- Combine categories of a categorical variable.
- Convert a categorical column into several dummy variables via one-hot encoding.
- Consider the data context of missing values, to determine how to interpret or edit them.
- Sketch and interpret bar plots for categorical variables.
- Sketch and interpret stacked dotplots.
- Estimate the mean and median from a dotplot, histogram, or density.
- Sketch and interpret histograms for quantitative variables.
- Sketch and interpret density curves for quantitative variables.
- Estimate the chances of seeing certain values by looking at a histogram or density.
- Consider the source of the data you are analyzing, and the motivations of the people or organizations who created it.
- Consider the impact and implications of your data collection or analysis: Who could be harmed, and who could be helped?
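Several of the skills above (binning, combining categories, and creating dummy variables) are small data transformations that can be sketched in a few lines. The example below is one possible illustration in plain Python, assuming a column is stored as a list of values. The function names, the age ranges, and the sample values are hypothetical, and a data analysis library would normally handle these steps.

```python
def bin_values(values, edges, labels):
    """Binning: turn a quantitative variable into a categorical one.

    A value v receives labels[i] when edges[i] <= v < edges[i + 1].
    """
    binned = []
    for v in values:
        label = None  # stays None (missing) if v falls outside every range
        for i, name in enumerate(labels):
            if edges[i] <= v < edges[i + 1]:
                label = name
        binned.append(label)
    return binned


def combine_categories(values, mapping):
    """Replace categories listed in `mapping`; leave all others unchanged."""
    return [mapping.get(v, v) for v in values]


def one_hot(values):
    """Dummy variables: one 0/1 column per category (one-hot encoding)."""
    categories = sorted(set(values))
    return {c: [1 if v == c else 0 for v in values] for c in categories}


# Hypothetical columns from a small dataset.
ages = [5, 17, 18, 42, 70]
colors = ["red", "blue", "crimson", "blue"]

age_groups = bin_values(ages, edges=[0, 18, 65, 120],
                        labels=["child", "adult", "senior"])
merged = combine_categories(colors, {"crimson": "red"})
dummies = one_hot(merged)
```

Here `age_groups` is `["child", "child", "adult", "adult", "senior"]`, `merged` is `["red", "blue", "red", "blue"]`, and `dummies` holds one 0/1 column for `"blue"` and one for `"red"`. Where the range boundaries fall (for example, whether 18 counts as a child or an adult) is a choice the analyst makes and should document.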