Get the right vocabulary to make decisions this summer.
Ask questions!
Foundations
Statistics is Hard
The Boy-or-Girl Paradox
Mr. Smith says, “I have two children and at least one of them is a boy.”
What is the probability that the other child is a boy?
Gardner, M. (1961). The 2nd Scientific American Book of Mathematical Puzzles and Diversions. New York, NY: Simon and Schuster.
What do you think?
Mr. Smith has two children and at least one of them is a boy. Then the probability that the other child is a boy is...
\(1/2\)
\(1/3\)
\(1/4\)
\(2/3\)
Other
Solution - Count the Paths!
"Mr. Smith has two children and at least one of them is a boy"
The key is to realize we don't know which child is the boy: the first or the second.
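Enumerating the four equally likely sex orderings makes this concrete: the condition "at least one boy" rules out only GG.
\[
P(\text{two boys} \mid \text{at least one boy}) = \frac{P(BB)}{P(BB) + P(BG) + P(GB)} = \frac{1/4}{3/4} = \frac{1}{3}
\]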
Example - Interpreting Regression Results
Gender   Ethnicity         Year (BMI ≥ 25)
All      All               2048
Men      All               2051
         Non-Hisp. White   2049
         Non-Hisp. Black   2095
Women    All               2044
Year when the prevalence of overweight and obese individuals (BMI ≥ 25) in the US reaches 100% in a linear regression model. Adapted from Wang et al. (2008), DOI: 10.1038/oby.2008.351.
\[\begin{aligned}
\bar{x} &= 102 \\
n &= \text{ varies}
\end{aligned}\]
Z-Score Test
\[\begin{aligned}
z = \frac{\bar{x}-\mu}{\sigma / \sqrt{n}}
\htmlClass{fragment}{=\frac{\bar{x}-\mu}{\sigma} \sqrt{n}}
\end{aligned}\]
Minimal Sample Size for P-Value
P-Value   Z-Score   Minimal N
0.05      1.645     153
0.025     1.960     217
0.01      2.326     305
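Solving \(z = \frac{\bar{x}-\mu}{\sigma}\sqrt{n} \ge z_{\alpha}\) for \(n\) gives the minimal sample size \(n \ge \left(z_{\alpha}\,\sigma/(\bar{x}-\mu)\right)^2\). A minimal SAS sketch that reproduces the table, assuming \(\mu = 100\) and \(\sigma = 15\) (the classic IQ-test setting; these values are not stated above, but they are consistent with every row):

/* Smallest n at which xbar = 102 crosses each significance threshold.
   mu = 100 and sigma = 15 are assumed values, not given on the slides. */
data min_n;
  xbar = 102; mu = 100; sigma = 15;
  do z = 1.645, 1.960, 2.326;
    n = ceil((z * sigma / (xbar - mu))**2);
    output;
  end;
run;

proc print data=min_n; run;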
A significant P-value doesn't answer the question of whether the difference $\bar{x}-\mu$ actually matters clinically.
Do you think it matters how big \(|\bar{x}-\mu|/\sigma_x\) is?
Beware of False Conclusions!
More on P-Values
The P-value depends on the sample size and the sample space.
The P-value is not the probability that \(H_0\) is (not) true.
The P-value depends on fictive data: outcomes that could have been observed but were not.
The P-value is not an absolute measure.
The P-value does not take all evidence into account.
... and other issues. See chapter 1 in Lesaffre and Lawson (2012) for an accessible discussion.
Some Worthwhile Reading
Amrhein, V., Greenland, S., and McShane, B. (2019), “Scientists rise up against statistical significance,” Nature, 567, 305–307. DOI: 10.1038/d41586-019-00857-9.
Gelman, A. (2013), “Misunderstanding the p-value,” StatModeling blog.
Gelman, A. (2018), “The failure of null hypothesis significance testing when studying incremental changes, and what to do about it,” Personality and Social Psychology Bulletin, 44, 16–23. DOI: 10.1177/0146167217729162.
Gelman, A. (2019), “Thinking about "Abandon statistical significance," p-values, etc.,” StatModeling Blog.
Reproducibility
Keep All Your Materials Together
Organization
Keep your data, code, and analysis all in one folder. This helps with
Organization: no need to search all over for different components of your research project.
Backups: archive a single folder to ensure your work is not lost.
Reproducibility: others can reliably follow your steps, even years later.
Use an archiving program like zip or 7z to share/backup all or parts of your project. If your project includes PHI, there is a minimum encryption standard (AES-128).
Use Version Control
Keeps track of who changed what, when, and why.
Makes changes reversible.
Works best with text-based data.
Available as part of REDCap or manually via Git.
On that note: don't send multiple versions of a Word document around as you edit. This creates conflicting copies and loses the edit history. Use OneDrive instead.
Reproducibility
Use Science to Guide Variable Selection
Big Picture - Scientific Process
Define the estimand(s) of interest.
Create scientific model(s).
Create statistical model(s).
Analyze your models.
Define Your Estimand
Be specific. Example attributes:
Treatment condition of interest (clinical trial).
Patient population.
Variable to be obtained.
For details, review ICH E9 (R1) addendum on estimands.
Pay attention to the difference between what you want to know and what you can measure.
Your choice of estimand alters how your study is properly interpreted.
Scientific Model ≠ Statistical Model
Hypotheses do not imply unique models, and models do not imply unique hypotheses.
The scientific model represents the qualitative aspects of the data-generating process: \(X \rightarrow Y\).
Three main building blocks:
Mediators
Common Causes (forks)
Common Effects (colliders)
The Mediator
$X$ causally affects $Y$ through mediator $Z$.
Conditioning on $Z$ blocks the association.
The Common Cause (Fork)
$X$ and $Y$ share a common cause (confounder) $Z$, leading to a non-causal association between both.
Conditioning on $Z$ blocks this association.
Common Effects (Collider)
$X$ and $Y$ both cause $Z$; there is no association between $X$ and $Y$.
If you condition on $Z$, you introduce a non-causal association between $X$ and $Y$.
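A quick way to convince yourself of this is simulation. A minimal SAS sketch (all names are made up): $x$ and $y$ are generated independently, and $z$ is their common effect.

/* x and y are independent by construction; z is a collider */
data collider;
  call streaminit(1234);
  do i = 1 to 10000;
    x = rand('Normal');
    y = rand('Normal');
    z = x + y + rand('Normal');
    output;
  end;
run;

proc reg data=collider;
  crude:    model y = x;      /* slope on x is near 0 */
  adjusted: model y = x z;    /* conditioning on z induces a spurious negative slope on x */
run;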
What's the point?
Depending on what variables you use as covariates, you may either induce or obscure a causal relationship. Either way, your interpretation of the coefficients will be incorrect.
Don't do a "garbage-can" regression (see Achen 2005).
Example: The Birth-Weight Paradox
Low birth weight is a strong predictor of infant mortality.
Infants born to smoking mothers have lower birth weights on average.
Low birth weight infants born to smoking mothers have a lower infant mortality rate than those born to non-smoking mothers.
Is smoking beneficial, protecting low birth weight infants against infant mortality?
Example: The Birth-Weight Paradox
$n=4\,115\,494$ infants born in the USA in 1991; maternal smoking status is known for $3\,001\,621$ of them.
Mean birth weight: $3\,145$ g (smokers) vs. $3\,370$ g (non-smokers).
Unadjusted infant mortality rate ratio: 1.55.
Infant mortality rate ratio, adjusted for birth weight: 1.09.
Among low birth weight (LBW) infants, the mortality rate ratio adjusted for birth weight: 0.79.
Birth weight is a collider: conditioning on it opens a non-causal path between maternal smoking and mortality through unmeasured causes of low birth weight, such as birth defects. To understand the way this works, read Cinelli, Forney, and Pearl (2020).
Build A DAG
Start with your research question.
Build A DAG
Add treatment, outcome, and mediators to your DAG.
Build A DAG
Start adding common causes to outcome...
Build A DAG
... and treatment. Ignore most exogenous error terms.
Example DAG - Code
proc causalgraph;
  /* directed edges of the assumed causal diagram */
  model "CMSRP"
    HTN ==> CM, CM ==> LPERF, LPERF ==> KD,
    DM ==> MVD, MVD ==> LPERF KD,
    CAD ==> MI, MI ==> CM KD;
  /* is the effect of CM on KD identifiable given this graph? */
  identify CM ==> KD;
run;
Example DAG - SAS Output
Pro-Tip: Try Simulating Assumptions
Use your statistical model to simulate fake data.
Some benefits:
Clarifies what your model implies, e.g., how big an effect should plausibly be (see the sketch after this list).
Allows for repeat sampling, so you can see how your model behaves.
Provides an opportunity to debug your analysis code.
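For instance (a sketch; the sample size, effect size, and noise level are arbitrary assumptions): simulate a two-arm study with a known "true" treatment effect, then check whether the planned analysis recovers it.

/* Fake two-arm study; the true treatment effect is -5 (assumed, for illustration) */
data fake_trial;
  call streaminit(42);
  do id = 1 to 200;
    treat = rand('Bernoulli', 0.5);               /* 1:1 randomization */
    y = 120 - 5*treat + rand('Normal', 0, 10);
    output;
  end;
run;

proc reg data=fake_trial;
  model y = treat;    /* the estimate should land near -5 */
run;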
Reproducibility
Have a Codebook
What's a Codebook?
Gives a summary of what your project is about.
Defines variable names, their meaning, units, and possible values.
Ideally has a machine-readable component.
If you use REDCap, it can generate this for you.
Variable Naming
Variable Names should be consistent and predictable.
Treatment and control data should be in the same data set.
Then how do I distinguish between the treatment and control?
Use a "dummy variable."
What is a Dummy Variable?
A dummy variable takes either the value 0 or 1 and can be used to show group membership. Example:
Treatment   Subject   Month   Serum Cholesterol (mg/dL)
1           1         0       178
1           1         24      274
0           63        0       251
0           63        24      248
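In an analysis, the dummy then enters the model like any other numeric covariate. A sketch, assuming the rows above live in a data set called chol_study (a hypothetical name):

proc reg data=chol_study;
  model serum_chol = treatment month;   /* treatment is the 0/1 dummy */
run;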
Categorical Variables
Collect these as integers, not as strings. Use dummy or index variables.
Example:
Variable   Bad Scheme   Good Scheme
Gender     Male         0
           Female       1
BMI        Normal       1
           Overweight   2
           Obese        3
Why not just use Text?
Typing in a string is error-prone: the computer treats 'y', 'Y', 'Yes', and 'yes' each as a distinct option, even if that's not what you mean.
Statistical software needs to convert strings to numbers to work anyway. Consider this ahead of time.
Dummy/indicator and index encodings work well for most use cases. Other options exist, but prefer recoding in the analysis step over using those encodings directly in your data set.
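In SAS, for instance, you can store the integers and attach human-readable labels at analysis time with formats; the data set, variable, and format names below are hypothetical:

/* Store integers in the data; map them to labels only for output */
proc format;
  value genderf 0 = 'Male' 1 = 'Female';
  value bmif    1 = 'Normal' 2 = 'Overweight' 3 = 'Obese';
run;

proc freq data=mystudy;
  tables gender bmi_cat;
  format gender genderf. bmi_cat bmif.;
run;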
Data Collection
Have an Abstraction Protocol
Define how you will collect your data.
Where will you get the information from?
How will the source information be mapped to your codebook?
If you search medical records, what search terms will you use? Record every variant you will be using.
This protocol needs to be in writing.
Follow this religiously. Deviations kill reproducibility and make your study arbitrary. Don't be arbitrary.
How to store your Data?
Don't use Excel sheets.
Consider using REDCap.
Widely used database software in medical research.
Helps you build a codebook that is used to validate data as it is entered, preventing mistakes.
Convenient export to statistical software such as SAS or R.
Provides tracking of changes.
CSV Files.
Software agnostic.
Could be read 30 years ago, can still be read 30 years from now.
Can provide tracking if used with version control.
Requires manual data validation.
Automate What You Can
Will you remember your choices and methods exactly, even 5 years from now?
The more you do by hand, the less reproducible your study becomes.
Computers do exactly what you tell them to, every time. But you might make typos.
Let computers handle as much of the tedious work as possible, e.g.,
REDCap forms for data entry.
Scripts to automate data acquisition and formatting.
Validation and Completion
Check Your Data
Load everything into your software of choice; a first-pass SAS sketch follows this list.
Is everything formatted correctly?
Make plenty of graphs.
If necessary, fix any formatting issues with scripts. Don't do this by hand!
If you have missing observations, check your source docs to see if they were overlooked during abstraction.
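A first pass might look like this in SAS (mystudy and serum_chol are hypothetical names):

proc contents data=mystudy; run;            /* variable types and formats */

proc means data=mystudy n nmiss min max;    /* ranges and missing counts */
run;

proc sgplot data=mystudy;
  histogram serum_chol;                     /* eyeball each distribution */
run;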
What about Missing Data?
You probably will have missing data. That's ok.
The way in which the data are missing affects both the analysis and the results.
Missing Completely at Random (MCAR)
Missing at Random (MAR)
Missing not at Random (MNAR)
Key difficulty: the data itself cannot show you which mechanism is applicable in your case. But you need to know which mechanism applies to interpret the analysis correctly.
NC TraCS databases: Partnership with RTI and others. Includes insurance data, hospitalization data, cancer databases, maternal and neonatal health, etc.
References
Achen, C. H. (2005), “Let's Put Garbage-Can Regressions and Garbage-Can Probits Where They Belong,” Conflict Management and Peace Science, 22, 327–339. DOI: 10.1080/07388940500339167.
Amrhein, V., Greenland, S., and McShane, B. (2019), “Scientists rise up against statistical significance,” Nature, 567, 305–307. DOI: 10.1038/d41586-019-00857-9.
Cinelli, C., Forney, A., and Pearl, J. (2020), “A Crash Course in Good and Bad Controls,” Sociological Methods & Research, 0(0). DOI: 10.1177/00491241221099552.
Lesaffre, E., and Lawson, A. B. (2012), Bayesian Biostatistics, Statistics in Practice, West Sussex, UK: John Wiley & Sons.
McElreath, R. (2020), Statistical Rethinking: A Bayesian Course with Examples in R and Stan, CRC Texts in Statistical Science, Boca Raton, FL: CRC Press.
Motulsky, H. (2018), Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking, New York, NY: Oxford University Press.
Schafer, J. (1999), “Multiple imputation: a primer,” Statistical Methods in Medical Research, 8, 3–15.
Van Buuren, S. (2018), Flexible imputation of missing data, 2nd edition, Interdisciplinary Statistics Series, Boca Raton, FL: Chapman & Hall/CRC Press.