Unit 1 Glossary

Vocabulary Words

  • Blinding: Ensuring that subjects aren’t aware of which treatment they are being given.

  • Blocking: Separating experimental subjects into meaningful groups before dividing them into treatments, so the groups can be studied separately.

  • Cluster or Multistage sample: A sampling approach where groups of observations are randomly chosen instead of individuals.

  • Facets: Separating your visualization into a separate plot for each category.

  • False response bias: When subjects may give answers that do not represent their true opinions, due to the phrasing of the question or social stigma around certain answers.

  • Geometry: Another word for the type of plot chosen (e.g. bar plot, histogram).

  • Nonresponse bias: When subjects who are unable to respond to a survey are meaningfully different from those who are able to respond, leading to a biased sample.

  • Stratified random sample: A sampling technique where the population is divided into meaningful groups, and observations are then sampled from each group.

  • Undercoverage bias: When researchers are unable to collect information from a group of subjects that is meaningfully different from the ones they can access, leading to a biased sample.

  • Voluntary response bias: When subjects who choose to respond to a survey have systematically different opinions than those who do not choose to respond, leading to a biased sample.

  • biased sample: A sample that you might expect to be systematically different from the population.

  • binary variable: A categorical variable with only two possible categories.

  • binning: Making a categorical variable out of a quantitative variable by treating ranges of values as categories.

  • case: An entity we collect data about.

  • categorical variable: A variable whose values come from a limited set of categories.

  • cause-and-effect: We can only claim two related variables have a cause-and-effect relationship if they are measured in a well-designed and randomized experiment.

  • continuous: A quantitative variable that is not discrete; it can take any value within a range.

  • control group: A group of subjects whose assigned treatment is no change from their normal behavior.

  • convenience sample: Choosing observations for a dataset based on what is easiest to sample.

  • discrete: A quantitative variable that can only take integer values.

  • documentation: Summary information about a dataset and its source.

  • double-blinding: Ensuring that neither the subjects nor the researchers know which subject is getting which treatment.

  • dummy variables: Columns consisting of only 0 or 1, representing presence or absence of a category.

  • experiment: Data collected in a way where researchers do something that impacts the cases and then study the effects.

  • exploratory: Analyzing data by looking for interesting patterns or trends, rather than answering a specific research question.

  • factor: A programming term for a categorical variable.

  • index: A variable containing a unique label for each case.

  • label: A variable that gives identification information about the cases, rather than measured information.

  • legend: A section of a plot showing the viewer which colors or symbols correspond to which values or levels.

  • long form: A dataset with multiple measurements stacked in different rows, with a corresponding name column identifying which measurement each row holds.

  • lurking variable: An unmeasured variable that impacts two or more measured variables, making them appear to be related to each other.

  • mapping/aesthetic: Which variables are involved in the plot, and which element of the plot they impact.

  • missing data: Cells of a dataset where the information was not recorded, was not properly recorded, or cannot exist.

  • numeric variable: A variable that measures an amount or quantity.

  • observational data: Data collected in a way that does not impact the cases.

  • observational unit: An entity we collect data about. (Warning: do not confuse this with ‘units of measure’, such as inches or centimeters.)

  • ordinal variable: A categorical variable where the categories have an order.

  • placebo: A fake treatment that has no effect, given to keep subjects from knowing they are in the control group.

  • qualitative variable: A variable whose values come from a limited set of categories.

  • quantitative variable: A variable that measures an amount or quantity.

  • reproducible: Analysis that is clearly documented, so that it can be re-created by others.

  • simple random sample: Selecting observations in such a way that every case in the population is equally likely to be chosen.

  • tabular: Describes data that is stored in rows and columns.

  • tidy data: Tabular data where every case is a row, every variable is a column, and every cell contains only one value.

  • treatments: The different conditions that researchers apply to subjects in an experiment.

  • variable: A measured characteristic or quantity in data.

  • wide form: A dataset with multiple measurements spread out over multiple columns.

  • x-axis: The horizontal line in a plot.
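
The three random sampling terms above (simple random sample, stratified random sample, cluster sample) are easy to mix up, so here is a minimal sketch of how they differ. Everything in it is an assumption made for illustration: the 12-student population, the "dorm" grouping, and the helper names stratified_sample and cluster_sample are hypothetical and not part of the course materials.

```python
import random

# Hypothetical population: 12 students, each living in one of 3 dorms.
population = [
    {"id": i, "dorm": dorm}
    for i, dorm in enumerate(["North"] * 4 + ["South"] * 4 + ["East"] * 4)
]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Simple random sample: every case is equally likely to be chosen.
simple = rng.sample(population, k=6)


def stratified_sample(cases, key, per_group, rng):
    """Divide cases into groups, then sample from EVERY group."""
    groups = {}
    for case in cases:
        groups.setdefault(case[key], []).append(case)
    chosen = []
    for members in groups.values():
        chosen.extend(rng.sample(members, k=per_group))
    return chosen


def cluster_sample(cases, key, n_groups, rng):
    """Randomly choose whole groups, then keep everyone in those groups."""
    labels = sorted({case[key] for case in cases})
    picked = set(rng.sample(labels, k=n_groups))
    return [case for case in cases if case[key] in picked]


stratified = stratified_sample(population, "dorm", per_group=2, rng=rng)
clustered = cluster_sample(population, "dorm", n_groups=2, rng=rng)
```

In this sketch the stratified sample always contains students from all three dorms, while the cluster sample contains every student from two dorms and nobody from the third.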

Key Skills and Concepts

  • Identify steps needed to make an un-tidy dataset become tidy.
  • Identify the cases and identify or create an index column in a dataset.
  • Identify what types of variables are present in a dataset, and what typical values we could see for those variables.
  • Combine categories of a categorical variable.
  • Convert a categorical column into several dummy variables via one-hot encoding.
  • Consider the data context of missing values, to determine how to interpret or edit them.
  • Sketch and interpret bar plots for categorical variables.
  • Sketch and interpret stacked dotplots.
  • Estimate the mean and median from a dotplot, histogram, or density curve.
  • Sketch and interpret histograms for quantitative variables.
  • Sketch and interpret density curves for quantitative variables.
  • Estimate the chances of seeing certain values by looking at a histogram or density curve.
  • Consider the source of the data you are analyzing, and the motivations of the people or organizations who created it.
  • Consider the impact and implications of your data collection or analysis: Who could be harmed, and who could be helped?
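
A few of the skills above (binning, one-hot encoding into dummy variables, and estimating chances from a histogram) can be sketched in plain Python. This is only an illustration under stated assumptions: the ages list is made-up data, the 20-year bin width is an arbitrary choice, and the helpers bin_label and one_hot are hypothetical names rather than functions from any course library.

```python
from collections import Counter

# Hypothetical quantitative variable: ages of ten cases.
ages = [19, 22, 25, 31, 34, 38, 41, 47, 52, 58]


def bin_label(value, width=20):
    """Binning: map a number to the range (category) it falls in."""
    low = (value // width) * width
    return f"{low}-{low + width - 1}"


# Each age becomes a category such as "20-39".
age_groups = [bin_label(a) for a in ages]

# Counting cases per bin gives the bar heights of a histogram.
counts = Counter(age_groups)

# Estimated chance of seeing a value in a bin = bin count / total count.
chance_40_to_59 = counts["40-59"] / len(ages)


def one_hot(values):
    """One-hot encoding: one 0/1 dummy column per category."""
    categories = sorted(set(values))
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]


dummies = one_hot(age_groups)
```

Each row of dummies has exactly one column equal to 1, marking the category that case belongs to; all other columns in the row are 0.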