“Case Study: Hospital Medical Records”
Click here to access the data for this case study in Google Sheets.
For this Case Study, you do not need to make any edits or do any analyses on the datasets. However, you may need to click and scroll around in the spreadsheets to fully understand the data and answer the questions.
The datasets in this Case Study are taken from the MIMIC Project, a free and open dataset of anonymized hospital and medical data. (mimic1?; mimic2?; physnet?)
There are four datasets provided to you. The first, Hospital Admissions, contains information on admissions of individuals to a particular hospital. This dataset includes the admit and discharge times, with a death time if applicable; the type of admission and location of discharge; demographic and insurance information about the patient; and diagnoses.
The other three datasets are results of lab tests run on blood samples during the patients’ stay at the hospital.
The Platelet Counts dataset refers to a test that counts the number of platelets in a blood sample. Platelets are small cell fragments produced by megakaryocytes in the bone marrow. It is good to have a healthy platelet count, as this helps your body heal injuries and stop bleeding; a very low count makes bleeding harder to stop.
The Blood Oxygen dataset refers to a test where the oxygen and carbon dioxide present in the blood are measured. It is good to have an oxygen percentage close to 100%, as this means your lungs are working efficiently.
The Cholesterol dataset refers to a test that measures cholesterol, a “waxy” substance present in blood. High cholesterol can increase risk of heart disease due to clogged arteries, but low cholesterol can prevent your body from building new healthy cells. There are two types of cholesterol: low-density lipoprotein (LDL), which is associated with heart disease risk and is considered bad to have; and high-density lipoprotein (HDL), which helps get rid of LDL and is considered good to have.
The following Case Study will ask you to think about the information in these datasets and how it was recorded.
Part One: Describe the datasets
Data Tables
For each of the four datasets, answer the following:
How many variables are there?
How many observations are there?
What are the cases?
Is there an index column? If not, what label columns could be combined to uniquely identify each case?
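If you would like to see the idea of "combining label columns to uniquely identify each case" outside of a spreadsheet, here is a minimal Python sketch. The rows and the column names (patient_id, admission_id, test_time) are hypothetical and made up for illustration; they are not the headers of the real datasets.

```python
from collections import Counter

# Toy rows. The column names are hypothetical, not the real sheet headers.
rows = [
    {"patient_id": 1, "admission_id": 100, "test_time": "2020-01-01 08:00"},
    {"patient_id": 1, "admission_id": 100, "test_time": "2020-01-02 08:00"},
    {"patient_id": 2, "admission_id": 200, "test_time": "2020-01-01 08:00"},
]

def is_unique_key(rows, columns):
    """Return True if no two rows share the same values in `columns`."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return all(n == 1 for n in counts.values())

# One patient can appear on several rows, so patient_id alone is not enough.
patient_only = is_unique_key(rows, ["patient_id"])
# Adding a second label column separates the repeated rows in this toy data.
patient_and_time = is_unique_key(rows, ["patient_id", "test_time"])
print(patient_only, patient_and_time)
```

In this toy table the patient column repeats, so it cannot serve as an index by itself, while the pair of columns can.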
Variable Types
For each of the following variable descriptions, find an example in one of the datasets, and briefly explain why it fits the description. You may use the same variable as an answer to more than one list item, if appropriate.
A binary categorical variable
An ordinal categorical variable
A categorical variable that is neither binary nor ordinal
A quantitative variable
A dummy variable
A categorical variable that was created by binning a continuous quantitative variable.
A categorical variable that was probably created by binning a discrete quantitative variable.
A variable that is a date type.
A variable that is not a label or index; not a date; not quantitative; and not categorical.
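As a small illustration of two of the terms above, the following Python sketch bins a quantitative variable into ordered categories and then builds a dummy variable from one of those categories. The ages and the cut points are invented for this example and are not taken from the datasets.

```python
# Hypothetical ages; the cut points below are arbitrary choices for illustration.
ages = [23, 47, 65, 81, 35]

def bin_age(age):
    """Map a quantitative age onto one of three ordered categories."""
    if age < 40:
        return "under 40"
    if age < 65:
        return "40 to 64"
    return "65 and over"

# Binning: each number is replaced by the category it falls into.
age_groups = [bin_age(a) for a in ages]

# A dummy variable codes one category as 1 and every other category as 0.
is_65_and_over = [1 if g == "65 and over" else 0 for g in age_groups]

print(age_groups)
print(is_65_and_over)
```

Notice that the binned variable keeps a natural order, while the dummy variable only records whether a case is in one particular category.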
Missing Data
Find an example of each of the following varieties of missing data. Justify your answers.
A variable where blank cells should be treated as NA (“Not Available”) because we have no way to know why it is blank.
A variable where blank cells should be treated as NA (“Not Applicable”) because there is genuinely no possible value for the variable in that case.
A variable where blank cells are meaningful and should be treated as a real value/category.
A variable where missing information is coded as “unknown” or similar.
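One way to start looking for these varieties is to tally how often a column is blank and how often it holds an "unknown"-style code. The sketch below does this in Python on an invented column; the cell values and the list of codes are assumptions, not values confirmed to appear in the datasets. Deciding what a blank actually means still requires reading the data in context.

```python
from collections import Counter

# Toy column of cell contents; "" stands for a blank spreadsheet cell.
# These values are invented for illustration.
cells = ["MARRIED", "", "UNKNOWN", "SINGLE", "", "unknown (default)"]

# Assumed spellings of "missing" codes; a real column may use others.
UNKNOWN_CODES = ("unknown", "not specified", "unable to obtain")

def classify(cell):
    """Label a cell as blank, coded as unknown, or holding a recorded value."""
    text = cell.strip().lower()
    if text == "":
        return "blank"
    if text.startswith(UNKNOWN_CODES):
        return "coded unknown"
    return "recorded"

tally = Counter(classify(c) for c in cells)
print(tally)
```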
Tidying up
The variable diagnosis in the Hospital Admissions dataset could be called “un-tidy”, because some patients have multiple diagnoses listed in that one column. Suggest a way we might tidy up this variable.
Suppose we wanted to convert the platelet_count variable in the Platelet Count dataset to a binary variable. How would you suggest we do this?
The Cholesterol dataset is currently in wide form. What would it look like in long form? Specifically, what would be the variable names in long form, and what values would those variables contain?
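To make the wide-versus-long distinction concrete, here is a minimal Python sketch of the reshaping step on a toy table. The patient_id column, the numbers, and the new column names measurement and value are all hypothetical choices; this shows one possible layout, not the required answer.

```python
# Toy wide-form rows: one row per patient, one column per measurement.
# patient_id is a hypothetical label column and the numbers are made up.
wide = [
    {"patient_id": "A", "LDL_value": 110, "HDL_value": 45},
    {"patient_id": "B", "LDL_value": 95, "HDL_value": 60},
]

def wide_to_long(rows, id_column, value_columns, name_column, value_column):
    """Turn each (row, measurement column) pair into its own row."""
    long_rows = []
    for row in rows:
        for column in value_columns:
            long_rows.append({
                id_column: row[id_column],
                name_column: column,        # which measurement this row holds
                value_column: row[column],  # the measured number
            })
    return long_rows

long_form = wide_to_long(
    wide, "patient_id", ["LDL_value", "HDL_value"], "measurement", "value"
)
for row in long_form:
    print(row)
```

The two wide rows become four long rows, because each patient now has one row per measurement.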
Part Two: Estimating parameters to answer research questions
Summarizing categorical variables
Open the Platelet Count dataset. Click on Column D, so that the whole column is highlighted. Then, in the dropdown menu at the top of the Google Sheets window, choose “Data > Column Stats”.
Report the summary statistics of the platelet_count variable.
Choose three categorical variables from the Hospital Admissions dataset, and provide summary statistics for those.
Summarizing quantitative variables
In the Blood Oxygen dataset, get the Column Stats for the oxygen_pct_value variable. Do the same for the LDL_value and HDL_value variables in the Cholesterol dataset. For each of those three variables, report the following:
The mean
The median
The mode
Whether the variable is left-skewed, right-skewed, or symmetric
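For reference, the four summaries above can be computed as in the following Python sketch, which uses made-up numbers rather than the real data. The skew check is only a rough rule of thumb (comparing the mean to the median), and the tolerance value is an arbitrary assumption; looking at the histogram in Column Stats is a more reliable way to judge skew.

```python
import statistics

def skew_hint(data, tolerance=0.1):
    """Rule of thumb: a mean well above the median suggests right skew,
    and a mean well below the median suggests left skew."""
    gap = statistics.mean(data) - statistics.median(data)
    spread = statistics.stdev(data)
    if spread == 0 or abs(gap) <= tolerance * spread:
        return "roughly symmetric"
    return "right-skewed" if gap > 0 else "left-skewed"

# Made-up values with one large observation pulling the mean upward.
values = [1, 2, 2, 3, 4, 5, 18]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
shape = skew_hint(values)

print(mean, median, mode, shape)
```

Here the single large value drags the mean above the median, which is the typical signature of a right-skewed variable.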
Defining the population
Think about the sample represented in these datasets. What would you say is the population that we can reasonably make claims about from this data?
Parameters and Statistics
For each of the following research questions, state the parameter that would answer the question. (This will be a sentence describing an unknown number, such as “The true probability that Maria wins a game of Super Sisters.” or “The long-run average coins Maria would score.”)
Then, give the statistic that best estimates the parameter and addresses the question.
How many of the patients who are admitted to this hospital tend to survive?
Do patients generally have more HDL cholesterol, or more LDL cholesterol? By how much?
If they are measured again, do we expect Patient 10006 to have very low platelet count?
If they are measured again, do we expect Patient 10027 to have a blood oxygen percent below 95%?
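To illustrate the difference between a parameter and the statistic that estimates it, here is a short Python sketch on invented samples. The parameter is the unknown number for the whole population; the statistic is what we can actually compute from the rows we have. None of the numbers below come from the case study data.

```python
import statistics

# Made-up sample: True means the patient survived the admission.
survived = [True, True, False, True, True, False, True, True]

# Statistic: the sample proportion, which estimates the true (unknown)
# probability that an admitted patient survives.
sample_proportion = sum(survived) / len(survived)

# Made-up cholesterol measurements for the same four hypothetical patients.
hdl = [40, 55, 60, 45]
ldl = [100, 120, 90, 110]

# Statistic: the difference in sample means, which estimates the true
# average difference between LDL and HDL in the population.
mean_difference = statistics.mean(ldl) - statistics.mean(hdl)

print(sample_proportion, mean_difference)
```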
Asking Research Questions
Come up with three questions of your own. State the parameter that they refer to. Then, give a one-sentence summary making a conclusion from the data.
Part Three: Visualization
Consider again the research questions above:
How many of the patients who are admitted to this hospital tend to survive?
Do patients generally have more HDL cholesterol, or more LDL cholesterol? By how much?
If they are measured again, do we expect Patient 10006 to have very low platelet count?
If they are measured again, do we expect Patient 10027 to have a blood oxygen percent below 95%?
For each of these, sketch a plot that you would use to help address the research question. These can be rough sketches, but they should use the real data as much as possible. In Part Two, you calculated summary statistics to address these questions, and the plots you draw should match these statistics.
(You are also welcome to make these plots using software, if you would like, but this is not required - a simple sketch on paper or in a “draw on the screen” application like MS Paint is sufficient.)
Part Four: Putting it all together
Choose a research question of your own invention, i.e., not one of the four that were supplied above. (It can be one of the ones from the end of Part Two, if you wish.)
Write a short report for an imaginary doctor, sharing your analysis of the research question. Your report should be one or two paragraphs long and include a sketched plot. You should make sure to address:
The data source
The variable types for the variable(s) relevant to your research question
A statement of the sample and population involved in the analysis
Parameters and summary statistics that address your research question
A final conclusion in “real world” terms: that is, the one sentence you would use as the “newspaper” headline to share your findings.