An Easy Causality Approach for Beginner Data Scientists

2 min read · Aug 26, 2024

Source of image: https://builtin.com/data-science

People often say that correlation is not causation, but they rarely talk about the methods that can actually support a causal claim. Many tend to think that causality can only be established through randomized controlled experiments such as A/B tests, but these are not always possible, and there are other strategies that work with observational data!

Have you ever needed to infer causality for something that has already happened, or been in a situation where running a randomized controlled trial (RCT) was not possible? It’s tricky, right? Now, imagine if you could build your own control group by combining other groups that, individually, wouldn’t be good enough for the job. That’s where the idea of Synthetic Control comes into play!

Here’s an example: you launched a sales promotion in a specific state to coincide with a local holiday. Now, your task is to evaluate the effectiveness of that promotion. Picking a single other state as a control doesn’t quite work, because states differ in ways that would bias your results. This is where synthetic control steps in, offering a way to estimate the impact of your intervention. All you need to understand is linear regression and how to use the design of the situation to make it work like an experiment!

In a nutshell, here’s how it works: you identify a group of control units, basically a few states that didn’t participate in the sales promotion. Then, you run a linear regression on the period before the promotion, where the dependent variable is the treated state’s sales over time and the independent variables are the sales of the control states over the same period. The fitted coefficients are the weights of each state in the synthetic control, chosen so that the weighted combination of these states best reproduces the sales behavior of the treated state before the promotion.
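The weight-fitting step can be sketched in a few lines of NumPy. This is only a minimal illustration on simulated data: the number of weeks, the number of control states, the sales figures and the names `control_pre` and `treated_pre` are all assumptions made up for the example. It follows the plain, unconstrained regression described above; the classic synthetic control method additionally restricts the weights to be non-negative and to sum to one, which this sketch does not do.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 30 pre-promotion weeks of sales for 4 control states.
n_pre, n_controls = 30, 4
control_pre = rng.normal(loc=100.0, scale=10.0, size=(n_pre, n_controls))

# Simulated treated state: a known mix of the controls plus a little noise,
# so we can check that the regression recovers the mix.
true_weights = np.array([0.5, 0.3, 0.2, 0.0])
treated_pre = control_pre @ true_weights + rng.normal(scale=0.5, size=n_pre)

# Least squares without an intercept: each coefficient is the weight
# of one control state in the synthetic control.
weights, *_ = np.linalg.lstsq(control_pre, treated_pre, rcond=None)

# Pre-period fit: how closely the synthetic state tracks the treated one.
synthetic_pre = control_pre @ weights
pre_rmse = np.sqrt(np.mean((treated_pre - synthetic_pre) ** 2))

print("weights:", weights.round(2))
print("pre-period RMSE:", round(float(pre_rmse), 2))
```

A small pre-period error is what makes the synthetic state a credible stand-in; if the fit is poor before the promotion, the comparison after it is hard to trust.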

Once this synthetic baseline is established, you compare the actual sales of the treated state during the promotion with the sales projected by the synthetic control model for the same period. The difference between these two values indicates the impact of the sales promotion!
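As a rough sketch of that comparison, the helper below takes weights that were already fitted on the pre-promotion period and computes the gap between actual and synthetic sales during the promotion. The function name `promotion_effect`, the weights and every sales figure are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def promotion_effect(treated_post, control_post, weights):
    """Return the per-period and total gap between actual and synthetic sales."""
    # Sales the synthetic control projects for the treated state.
    synthetic_post = control_post @ weights
    per_period = treated_post - synthetic_post
    return per_period, per_period.sum()

# Hypothetical weights for 3 control states, assumed to come from the
# pre-promotion regression.
weights = np.array([0.5, 0.3, 0.2])

# Hypothetical sales during 3 promotion weeks (rows) for the 3 control states.
control_post = np.array([
    [100.0, 90.0, 110.0],
    [102.0, 91.0, 108.0],
    [101.0, 93.0, 111.0],
])

# Hypothetical actual sales of the treated state in the same weeks.
treated_post = np.array([110.0, 112.0, 111.0])

per_period, total = promotion_effect(treated_post, control_post, weights)
print("weekly lift:", per_period.round(2))
print("total lift:", round(float(total), 2))
```

Looking at the gap week by week, rather than only the total, makes it easier to see whether the lift is steady or driven by a single unusual period.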

This is just one of the methods in the causal inference toolbox that blew my mind when I first learned about it. It’s really useful when a controlled experiment isn’t possible, or even when it is, but you want to save some resources.

If you’re interested in learning more about this method, here’s a short reading list:

Mostly Harmless Econometrics (J. Angrist and J.-S. Pischke)
Causal Inference in Python (M. Facure)
Causal Inference and Discovery with Python (A. Molak)

Written by Yukio
