# 13/01/19: Causal inference as a missing data problem

Causal inference and missing data are closely related. In a sense, causal inference *is* a missing data problem. Suppose you have a terrible headache, and your friend offers you some aspirin. You take the aspirin, hoping it will cure your headache, and 20 minutes later you feel better. Did the aspirin *cause* your headache to disappear? What would you need to know to answer that question?

You'd have to know what *would* have happened had you (contrary to fact) *not* taken the aspirin, and changed nothing else. If, in that counterfactual case, your headache would still have gone away, that casts serious doubt on the claim that the aspirin caused it to go away. Unfortunately, you don't have a time machine, so you cannot see what would have happened had you not taken the aspirin. Your headache status in the counterfactual world in which you did not take aspirin is missing data.
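To make the "missing data" framing concrete, here is a minimal sketch in Python. It simulates a handful of people who each have two potential outcomes (headache gone with aspirin, headache gone without aspirin) and then hides the one that was not realized. The function name `simulate_headaches`, the column names, and the probabilities 0.7, 0.4, and 0.5 are all made up for illustration; nothing here comes from real data.

```python
import random


def simulate_headaches(n, seed=0):
    """Simulate n people, keeping only the potential outcome each one actually lived."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        # Potential outcomes: 1 = headache gone after 20 minutes, 0 = still there.
        # Both exist in the simulation, but only one will be recorded.
        y_aspirin = int(rng.random() < 0.7)      # hypothetical probability
        y_no_aspirin = int(rng.random() < 0.4)   # hypothetical probability
        took_aspirin = int(rng.random() < 0.5)   # hypothetical probability
        rows.append({
            "took_aspirin": took_aspirin,
            # The outcome in the world that did not happen is recorded as None.
            "y_aspirin": y_aspirin if took_aspirin else None,
            "y_no_aspirin": None if took_aspirin else y_no_aspirin,
        })
    return rows


rows = simulate_headaches(5)
for row in rows:
    print(row)
```

Every printed row has exactly one `None`: no matter who you look at, one of the two columns is always missing. That is the sense in which the individual causal effect can never be read straight off the observed data.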

# 13/01/19: The role of semiparametric theory in causal inference and missing data

Both causal inference and missing data involve drawing inferences about quantities that depend on unobserved variables. This is where causal inference and missing data diverge from traditional statistics. In traditional statistics we make assumptions about the underlying process that generates the observed data, and we can use the observed data to check whether those assumptions make sense. For example, we can plot bivariate data to see if there is a linear trend before we estimate what that trend is. Or if our analysis assumes a variable is normally distributed, we can plot a histogram and examine whether it looks roughly normal before we continue with the analysis. However, in causal inference and missing data, we cannot check some of the assumptions we have to make, because some of those assumptions are about unobserved things.

Unfortunately, if our assumptions are incorrect, our conclusions can be biased. We'd therefore like to make as few assumptions as possible. This is where semiparametric inference comes in. Instead of specifying the entire distribution of the data (including the parts we cannot observe, like your headache status in the hypothetical world where you didn't take aspirin), we carve out the specific part we'd like to know something about (e.g., the population-level relative risk of headache had everyone taken aspirin vs. had no one taken aspirin), make only the assumptions we need in order to estimate that quantity, and remain agnostic about every other aspect of the data-generating process. This is exactly the kind of thing semiparametric models are useful for.
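As a rough illustration of "target one quantity, stay agnostic about the rest", here is a minimal Python sketch. It makes one assumption only, stated loudly: aspirin is assigned at random, so the headache risk observed in each group stands in for the risk had everyone been in that group. Under that assumption the relative risk can be estimated from two sample proportions, with no model for anything else. This is a deliberately simple stand-in, not a general semiparametric estimator, and the names `simulate_trial` and `relative_risk` and the risks 0.3 and 0.6 are hypothetical.

```python
import random


def simulate_trial(n, seed=0):
    """Simulate (took_aspirin, still_has_headache) pairs with randomized aspirin."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        took_aspirin = int(rng.random() < 0.5)           # assumption: randomized
        p_headache = 0.3 if took_aspirin else 0.6        # hypothetical risks
        still_has_headache = int(rng.random() < p_headache)
        records.append((took_aspirin, still_has_headache))
    return records


def relative_risk(records):
    """Headache risk among aspirin takers divided by risk among non-takers."""
    treated = [y for a, y in records if a == 1]
    untreated = [y for a, y in records if a == 0]
    risk_treated = sum(treated) / len(treated)
    risk_untreated = sum(untreated) / len(untreated)
    return risk_treated / risk_untreated


records = simulate_trial(20000)
estimate = relative_risk(records)
print(f"estimated relative risk: {estimate:.2f}")
```

The simulation was built with a true relative risk of 0.3 / 0.6 = 0.5, and the estimate should land close to that. Note what the estimator never needed: any statement about how headaches are distributed across people, or about how the two potential outcomes relate to each other within a person. If aspirin were *not* randomized, the single assumption would fail and this ratio would no longer carry a causal interpretation.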