3 Panel Models Explained in an Easy and Intuitive Way

Yukio
12 min readMay 31, 2021


Most economics students struggle with econometric models. This is totally fine. They are built on advanced (or intermediate for some, I don't know) math concepts, and it takes time for anyone to understand them. However, this becomes a problem when students start to lose interest, drop classes or leave things undone. A good way to avoid this is to start the lessons with an intuitive explanation, and only when the student gets the idea of the model do we jump into the formulas. Today, I will present 3 panel data models without using any math at all.

Quick Note: You don't have to access any of the links in the text in order to understand the models. They contain content to guide you if you are already in the mood to go deeper into the subject.

THIS ARTICLE ISN'T FOR ECONOMICS STUDENTS ONLY

Let's begin with a disclaimer: this post can help lots of aspiring data scientists as well. There isn't much to say here, the explanation is pretty straightforward: when we talk about econometrics, we are talking about inference with data. Isn't that what data scientists do?

Econometrics is not always part of data science courses. However, when we stop to think about it, the areas have many intersections (like Linear and Logistic Regression, which are part of both the economics and the data science curriculum), and both are trying to extract information from data using mathematical strategies. So even though panel models are not always taught to data science students, you can definitely take advantage of them!

THE STRUCTURE OF DATA

Economic data may be collected in different ways. You might collect everything you can from individuals at a specific moment, or you might follow them over time. Since each strategy requires a specific approach, we must start by explaining each structure:

  • Cross-Sectional: Consists of data taken at a given point in time. This is the most used structure, probably because it's the easiest to collect. The Titanic data is a good example: you have static information taken at some point in time. However, it's good to know that the data might not all come from the same time period; it's still a snapshot of a moment. Let's say you gather each customer's information one month before they unsubscribe from your product. If I unsubscribed in December/20, you will have a picture of my information in November/20. If another customer unsubscribed in November/19, his information will be from October/19. Other examples of cross-sectional data we find on Kaggle: German Credit Risk and House Prices. Some typical models we apply here are linear regression, logistic regression, decision trees and random forest classifiers, among others.
  • Time-Series: As implied by the name, we are talking about a series of information collected over time. This kind of data consists of observations of the same individual (and we are also calling stocks, prices and other things individuals) across multiple time windows. Stock prices collected on a daily basis, monthly sales from a shoe store and a country's yearly minimum wage are time-series data. Some time-series datasets found on Kaggle are Predict Future Sales and M5 Forecasting. Bonus tip: if you want to study time-series models, give PROPHET a shot. This model has been beating the accuracy of other traditional approaches.
  • Panel Data: We are now going to put both strategies described above together. Panel data is about getting a time series for each cross-sectional individual in the dataset. For example (you might recognise this one from Introductory Econometrics by Wooldridge), let's say the government gathered wage, education and employment history from a group of workers over a ten-year period. Since we are talking about following the same individuals over time, this type of data, or at least this type of model, is not that common. However, it shouldn't be like this, and I will show you how you can use them in your job.
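To make the three structures concrete, here is a minimal sketch in pandas. The workers, years and wages are invented for illustration; a panel is simply a time series for each cross-sectional unit:

```python
import pandas as pd

# Hypothetical wage data: three workers ("A", "B", "C") followed over three years.
panel = pd.DataFrame({
    "worker": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "year":   [2018, 2019, 2020] * 3,
    "wage":   [30.0, 31.5, 33.0, 22.0, 22.5, 23.0, 45.0, 47.0, 50.0],
}).set_index(["worker", "year"])

# A cross-section is one slice of the panel at a single point in time...
cross_section = panel.xs(2019, level="year")

# ...and a time series is the slice for a single individual.
time_series = panel.xs("A", level="worker")

print(cross_section)
print(time_series)
```

Slicing a panel along either axis recovers the two simpler structures, which is a handy way to remember how they relate.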

Now that you know what panel data is and how it differs from the rest, we can finally get to know these models.

PANEL MODELS' INTUITIVE EXPLANATION

You will find below several problems and how we could solve them with panel models. As already implicit (or maybe explicit) in the name of the article, we are deliberately oversimplifying the models, so that the reader won't be afraid of going deeper into them later on. All examples are based on real problems, but you can think of them as fictional, since I might change one thing or another to ease things.

INSTRUMENTAL VARIABLES (IV)

The first problem is to identify how the police are related to crime. After collecting the data, you realize that the number of cops and crime are positively related, i.e., places with more crime have a greater number of cops. Can you affirm that the cops are causing higher crime, maybe due to their truculence?

No. This could be the case, but the positive correlation could also exist because places with a higher number of crimes require a higher number of cops. As you can see, we have a case of omitted variable bias. More crime leads to more violence, but also to more cops. We find a similar example when talking about how college impacts wages. Do good colleges lead to higher wages, or is it the case that the best students go to college and also earn higher wages?

Image from chapter 5 of the online book 'Causal Inference for the Brave and True', representing three examples of omitted variable bias

How do we deal with this?

A randomized controlled trial, just like those we normally see in medicine, is always a good approach when trying to identify causality. But we can't randomly increase and decrease the number of cops in a city, nor forbid some students from going to college. That being said, we will take advantage of quasi-experiments.

Let's say a city suffered a terrorist attack. This city will probably increase its number of cops a lot. Wait… look at this, we are doing an experiment! We've just increased the number of cops in one city regardless of the crime rate there. This is exactly what an experiment in the lab looks like: you give the medicine to some people in a group and not to others, so you can control for the effects.

The attack is what we call an instrumental variable. We can now estimate the effect of the number of cops on violence, since the increase in cops is something exogenous: it doesn't have any relation to violence.

The attack is an exogenous shock, independent of the violence

What now? Well, you can just assess the effect of the attack on the number of cops and then the impact of this on the crime rate*. These two regressions are called a two-stage regression, and since most software will do this part for you, we can celebrate that the job is finally done, easy peasy.
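For those curious, the two stages can be sketched by hand with plain NumPy on simulated data. Everything here is made up for illustration: the variables (`attack`, `cops`, `crime`), the unobserved `danger` confounder and the true effect of -1.0 are my assumptions, not real figures:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Simulated city-level data. "danger" is an unobserved confounder that
# raises both the number of cops and the crime rate.
danger = rng.normal(size=n)
attack = rng.binomial(1, 0.5, size=n).astype(float)  # instrument: exogenous shock
cops = 2.0 * attack + 1.5 * danger + rng.normal(size=n)
crime = -1.0 * cops + 3.0 * danger + rng.normal(size=n)  # true causal effect: -1.0

def ols_slope(x, y):
    """Slope of y on x in a regression with an intercept."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Naive OLS is biased by the omitted "danger" variable...
naive = ols_slope(cops, crime)

# ...while two-stage least squares recovers the causal effect.
# Stage 1: predict cops from the instrument.
X1 = np.column_stack([np.ones(n), attack])
cops_hat = X1 @ np.linalg.lstsq(X1, cops, rcond=None)[0]
# Stage 2: regress crime on the predicted (exogenous) part of cops.
iv = ols_slope(cops_hat, crime)

print(f"naive OLS: {naive:.2f}, 2SLS: {iv:.2f}")
```

The naive slope comes out positive-ish (the confounding at work), while the two-stage estimate lands near the true -1.0, because only the variation in cops driven by the attack is used.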

You can apply this kind of model in your business as well. For instance, let's say you have an e-commerce store selling clothes. You have no idea what the price-demand elasticity of your product is. But you had a bug that made everything much more expensive. You've just found an instrumental variable for your analysis (the bug). Or maybe you believe your employees are less productive due to social media. Then Facebook goes down for a few hours. Again, you have an instrumental variable right there. You will find lots of examples in your workplace that will spare you the time and cost of tests.

As I promised, I won't go further into the math in this text (I promise I will write that part as soon as I can), but for those already in the mood for it, you can see the model both in theory and in Python by checking chapter 8 of Causal Inference for the Brave and True. If you want to read more about instrumental variables, check Chapter 3 of the book Mastering Metrics, by Angrist.

DIFFERENCES-IN-DIFFERENCES (DIFF-IN-DIFF)

Second, let's dive into what I have always found to be the most intuitive of all panel models. Suppose you want to know how an increase in the minimum wage impacts the unemployment rate. You also know that New Jersey increased its minimum wage, but Pennsylvania didn't. How about exploring the event?

Well, you might think about starting by checking whether unemployment grew after the increase in the minimum wage. Most people would do that, but I believe you would agree that something doesn't feel right about this approach. I mean, what if the unemployment rate grew because of other events?

The good thing is, you have Pennsylvania. So maybe you could just compare the unemployment rates of the two states? Well, again, it doesn't feel right… We could also have some omitted variable bias. Maybe one state had a worse unemployment rate prior to any minimum wage changes.

You can't just compare the two states, and you also can't just look at the changes in New Jersey over time, but what about putting these two together? How about comparing the minimum wage and unemployment rate of the two states before and after the change? This method is called differences-in-differences and is well represented by the illustration below:

Illustration from this great discussion: https://stats.stackexchange.com/questions/564/what-is-difference-in-differences

Again, you might want to apply this in your job. Let's continue with the e-commerce example. You believe product A is too expensive and should be cheaper. However, if you give product A a discount, you might face some problems in your analysis, because products A and C are substitute goods. People would stop buying product C and start buying product A because it's cheaper. Product A would sell more, but people would stop buying product C, even though they would have bought it without the discount. So you are losing money due to this wrong interpretation. How about testing in similar cities, just like in a diff-in-diff? You could give the discount in New Jersey and not in Pennsylvania and compare the impact on sales.

This might look so simple to you that you are suspicious of it. Well, there are a few assumptions we need to follow here, besides the inclusion of covariates in the regression. You can learn more about the math and theory by checking Chapter 14 of the online book Causal Inference for the Brave and True (which also covers a Python script) or Chapter 5 of the book Mastering Metrics, by Angrist.

Note: The main assumption you should remember for diff-in-diff is PARALLEL TRENDS. It means that without the increase in the minimum wage, the difference between the unemployment rates of New Jersey and Pennsylvania would have stayed constant over time.
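The double difference itself is just arithmetic on four group means. Below is a minimal sketch; the unemployment numbers are entirely fictional, not the real New Jersey/Pennsylvania figures:

```python
# Hypothetical average unemployment rates (%), before and after the
# minimum-wage increase, for the treated and the control state.
nj_before, nj_after = 7.0, 7.4   # New Jersey (raised the minimum wage)
pa_before, pa_after = 6.0, 6.9   # Pennsylvania (did not)

# First difference: change within each state over time.
nj_change = nj_after - nj_before
pa_change = pa_after - pa_before

# Second difference: under parallel trends, Pennsylvania's change stands in
# for what New Jersey would have done without the policy.
diff_in_diff = nj_change - pa_change

print(f"estimated effect: {diff_in_diff:+.1f} percentage points")
```

Here unemployment rose in both states, but it rose less in New Jersey, so the estimated effect of the policy is negative: exactly the kind of conclusion you would miss by looking at either difference alone.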

REGRESSION DISCONTINUITY DESIGN (RDD)

For the last problem, you will think about a way to assess the impact of going to a prestigious university on wages. To begin with, can you tell me whether going to MIT or Harvard will make a big difference in my earnings? Most people would answer by saying that those who went to these two universities have great earnings, so the answer is yes.

The answer might sound good enough for some people, but you and I are familiar with (again!) the omitted variable bias issue. People who went to these universities surely have great earnings, but they are also the ones who had a better education, attended the best schools and/or are just more intelligent than average. That being said, maybe their earnings are due to their education and/or intelligence.

Image from https://statisticsbyjim.com/regression/confounding-variables-bias/, think of education and intelligence as the confounder and going to the good university as the independent variable

How can we assess if going to a prestigious university has any impact whatsoever in people's earnings?

Now, an important disclaimer: I will think about this problem using the process of getting into university in Brazil. Here, we have a two-day test covering all the subjects people study in high school. Let's suppose the test has 100 points and the last student admitted got 55 points. We can't compare those who got in with those who didn't. The student who got 90 points is very different from the one who got 20, as we already discussed at the beginning of this section. However, I can compare someone who got 56 with someone who got 54. They are so close and so similar when it comes to education that our comparison is much better now. This is RDD!

The online book Causal Inference for the Brave and True brilliantly explains the intuition behind this model:

You can’t grow a business in one day, consistency and hard work are required to build wealth and it takes years before you learn how linear regression works. Under normal circumstances, nature is very cohesive and doesn’t jump around much.

Which means that when we do see jumps and spikes, they are probably artificial and often man-made situations. These events are usually accompanied by counterfactuals to the normal way of things: if a weird thing happens, this gives us some insight into what would have happened if nature was to work in a different way. Exploring these artificial jumps is at the core of Regression Discontinuity Design.

BIRTHDAYS AND FUNERAL

In America, you are allowed to drink after turning 21 (thank God in Brazil the minimum age is only 18🤣). This is definitely outside my field of study, but there are probably lots of discussions on the legal drinking age. So, how does the minimum legal drinking age impact people?

Again, we will take advantage of quasi-experiments. We have a situation in which a very small change in age (from 20 to 21) generates a big change (you go from forbidden to allowed). The change that one day causes is presented in the graph below:

Image 4.1 from Chapter 4 from Mastering Metrics

You can see how the mortality rate changes dramatically when people turn 21 and are allowed to drink. The link between turning 21 and a sharp, sustained rise in death rates is also visible in the next plot:

Image 4.2 from Chapter 4 from Mastering Metrics

This figure plots death rates (deaths per 100,000 persons per year) by month of age (defined as 30-day intervals), centered around the twenty-first birthday. At ages over 21, as we can see clearly, death rates shift up, and few of those to the right of the age-21 cutoff are below 95.

As discussed in the first example, the important thing to notice is that we are assessing the changes at the "frontier" of the event, trying to compare oranges with oranges. By comparing data from 2 days before turning 21 with 2 days after, or comparing a student who got 54 with a student who got 55, we are comparing similar things. This is the point of RDD: finding this frontier, this threshold where the change happens and we have a fair comparison.
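The frontier comparison can be sketched in a few lines of NumPy. Everything below is simulated under my own assumptions: exam scores, a cutoff at 55 points, a smooth "ability" component and a made-up admission effect of 5.0 on wages. A real RDD would fit a local regression on each side of the cutoff; comparing raw means in a narrow window is the crudest version of the idea:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Simulated admission-exam scores (0 to 100) with an admission cutoff at 55.
score = rng.uniform(0, 100, size=n)
admitted = score >= 55

# Simulated later wages: ability rises smoothly with the score, while
# admission to the prestigious university adds a jump of 5.0 at the cutoff.
wage = 20.0 + 0.3 * score + 5.0 * admitted + rng.normal(0, 2.0, size=n)

# RDD idea: only compare people just below vs. just above the threshold,
# where ability is essentially the same on both sides.
bandwidth = 2.0
just_below = (score >= 55 - bandwidth) & (score < 55)
just_above = (score >= 55) & (score < 55 + bandwidth)

effect = wage[just_above].mean() - wage[just_below].mean()
print(f"estimated jump at the cutoff: {effect:.1f}")
```

The estimate comes out roughly at the true jump of 5.0 (slightly above it, because the smooth ability slope still contributes a little inside the window; local regressions on each side remove that residual bias). Comparing everyone admitted against everyone rejected would instead mix in the huge ability gap between a 90-point and a 20-point student.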

Again, this is an oversimplified explanation of how RDD works. I will definitely write about the math involved and work through an example in Python, but for now, if you are already in the mood for it, go to Chapter 16 of the online book Causal Inference for the Brave and True and/or Chapter 4 of the book Mastering Metrics, by Angrist.

FINAL THOUGHTS

Approaching econometrics is always tough. Too much advanced math, letters and numbers, calculus, linear algebra: everything most students hate, all at once. That's why both students and teachers should be strategic when learning or teaching the subject. I always test several materials to find what suits me best, besides doing what I just did: starting by getting the idea/intuition and, only after getting it, moving on to the hard part (mathematics and coding). I really hope you enjoyed it and that my English wasn't too rusty. Please feel free to get in touch if you need some help or any additional explanation.


Written by Yukio

Mathematician with a master's degree in Economics. Working as a Data Scientist for the last 10 years.
