WEBVTT
Kind: captions
Language: en
00:00:00.090 --> 00:00:05.790
In this video, I will explain the instrumental
variable solution to the endogeneity problem.
00:00:05.790 --> 00:00:08.490
To understand the instrumental variable solution
00:00:08.490 --> 00:00:10.770
we first need to understand
the endogeneity problem.
00:00:10.770 --> 00:00:13.620
I explain the problem in
more detail in another video.
00:00:13.620 --> 00:00:15.893
But this is just a quick recap.
00:00:15.893 --> 00:00:19.530
So endogeneity occurs when
we have a regression model
00:00:19.530 --> 00:00:23.310
such as the one here, shown
graphically as a path diagram.
00:00:23.310 --> 00:00:27.540
The error term U represents any other causes of Y
00:00:27.540 --> 00:00:33.690
that are not included as
explanatory variables.
00:00:33.690 --> 00:00:36.450
So if any other cause of Y
00:00:36.450 --> 00:00:40.920
is correlated with any of
the included variables X,
00:00:40.920 --> 00:00:43.320
then we have an endogeneity problem.
00:00:43.320 --> 00:00:47.040
For example, if we are trying to
explain the company's performance,
00:00:47.040 --> 00:00:49.200
let's say ROA here,
00:00:49.200 --> 00:00:51.330
and we are trying to explain the performance by
00:00:51.330 --> 00:00:55.830
whether a company invests in a
new manufacturing plant or not.
00:00:55.830 --> 00:01:02.970
Then both investment and profitability
probably depend on the company's strategy,
00:01:02.970 --> 00:01:06.000
in which case strategy is an omitted variable
00:01:06.000 --> 00:01:09.180
that is correlated with X1 and
00:01:09.180 --> 00:01:10.830
that leads to an endogeneity problem.
00:01:11.654 --> 00:01:16.964
More generally, if we look
at the problem of X and Y,
00:01:16.964 --> 00:01:19.320
in just a bivariate case, in more detail.
00:01:19.320 --> 00:01:22.200
The correlation between X and Y
00:01:22.200 --> 00:01:29.460
is this direct path here plus
this correlation times 1,
00:01:29.460 --> 00:01:33.932
because the path from U to Y is constrained to be 1.
00:01:33.932 --> 00:01:36.613
So the correlation between X and Y is
00:01:36.613 --> 00:01:40.680
the direct regression path
plus the spurious correlation,
00:01:40.680 --> 00:01:43.650
because X correlates with the omitted cause U.
00:01:44.293 --> 00:01:46.393
The problem is that,
00:01:46.393 --> 00:01:48.720
we just observe this one correlation,
00:01:48.720 --> 00:01:50.970
so we have one unit of information from the data,
00:01:50.970 --> 00:01:53.970
and we want to estimate two different parameters.
00:01:53.970 --> 00:01:56.100
This is an under-identified model,
00:01:56.100 --> 00:01:57.750
the degrees of freedom is -1,
00:01:57.750 --> 00:02:00.900
which means that the model
can't be meaningfully estimated.
00:02:00.900 --> 00:02:04.290
So we can't estimate two
different things from one thing.
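Written out as an equation, the identification problem looks roughly like this. This is a sketch that assumes X and Y are standardized and uses r for correlations; that notation is chosen here for illustration and is not taken from the slide.

```latex
r_{XY} = \beta + r_{XU} \cdot 1
% one observed quantity: r_{XY}
% two unknown parameters: \beta and r_{XU}
% degrees of freedom: 1 - 2 = -1
```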
00:02:05.114 --> 00:02:08.340
To solve this problem we can
apply instrumental variables.
00:02:09.024 --> 00:02:11.493
The idea of an instrumental variable is that
00:02:11.493 --> 00:02:14.740
we get a third variable Z
that is correlated with X,
00:02:14.740 --> 00:02:17.903
that is a correlation that
we can test empirically,
00:02:17.903 --> 00:02:22.071
and that we can assume is
uncorrelated with U, the other causes of Y.
00:02:22.915 --> 00:02:28.720
Finding variables that qualify as
instruments is a difficult problem,
00:02:28.720 --> 00:02:33.130
because we cannot generally test the correlation
00:02:33.130 --> 00:02:34.660
between Z and U empirically,
00:02:34.660 --> 00:02:37.840
we have to argue that based on theory.
00:02:37.840 --> 00:02:39.400
I'll show you an example soon.
00:02:39.400 --> 00:02:41.410
But let's take a look at the principle first.
00:02:41.410 --> 00:02:45.400
So let's assume we have a
valid instrumental variable.
00:02:45.400 --> 00:02:49.630
So that the only reason why
Z and Y are correlated is
00:02:49.630 --> 00:02:51.550
because Z is correlated with X
00:02:51.550 --> 00:02:54.880
and Z can't be correlated with U.
00:02:54.880 --> 00:02:58.300
So when we have these correlations,
00:02:58.300 --> 00:03:00.550
the correlation between Z and Y
00:03:00.550 --> 00:03:05.350
comes from the path analysis tracing rules:
00:03:05.350 --> 00:03:06.430
we take the correlation between X and Z,
00:03:06.430 --> 00:03:09.670
and then this direct path from X to Y, to get from Z to Y.
00:03:09.670 --> 00:03:13.000
So the correlation between Y and Z is
00:03:13.000 --> 00:03:15.430
beta times the correlation between X and Z.
00:03:15.430 --> 00:03:18.250
And from here we can solve for beta,
00:03:18.250 --> 00:03:22.540
using the correlation between X and Z
and the correlation between Z and Y,
00:03:22.540 --> 00:03:24.670
which are both observable quantities
00:03:24.670 --> 00:03:30.828
and that gives us a consistent estimate of beta.
00:03:30.828 --> 00:03:32.260
So that's a way to estimate beta.
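The ratio idea can be tried out on simulated data. The following is a minimal sketch, not the lecture's own material: the function names, the coefficients 0.8, 0.6 and 0.5, and the sample size are all assumptions made for illustration. It uses covariances instead of correlations, so the estimate comes out in the original units of X and Y; the ratio of correlations would give the standardized coefficient instead.

```python
import numpy as np


def simulate(n, beta, seed):
    """Draw hypothetical data where X is endogenous and Z is a valid instrument."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)  # instrument, independent of u by construction
    u = rng.normal(size=n)  # omitted causes of Y
    x = 0.8 * z + 0.6 * u + 0.5 * rng.normal(size=n)  # X depends on Z and on U
    y = beta * x + u  # the path from U to Y is fixed to 1
    return z, x, y


def ols_slope(x, y):
    """Naive regression slope of Y on X; it includes the spurious part through U."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)


def iv_ratio(z, x, y):
    """Instrumental variable estimate: cov(Z, Y) divided by cov(Z, X)."""
    return np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]


if __name__ == "__main__":
    z, x, y = simulate(n=200_000, beta=0.5, seed=0)
    print("naive slope:", ols_slope(x, y))
    print("IV ratio:   ", iv_ratio(z, x, y))
```

With these assumed settings the naive slope should land well above the true value of 0.5, while the ratio should be close to it.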
00:03:32.662 --> 00:03:38.140
And a variable Z qualifies
as an instrumental variable,
00:03:38.140 --> 00:03:41.170
if it satisfies two criteria.
00:03:41.170 --> 00:03:44.230
First, it must have relevance for X.
00:03:44.230 --> 00:03:46.750
So X and Z must be correlated,
00:03:46.750 --> 00:03:48.550
that can be checked empirically,
00:03:48.550 --> 00:03:49.960
you just have to calculate the correlation,
00:03:49.960 --> 00:03:53.230
and we do a statistical test for the correlation.
00:03:53.230 --> 00:03:55.600
Then there is the exclusion criterion,
00:03:55.600 --> 00:03:58.060
which has to be argued based on theory.
00:03:58.060 --> 00:03:59.920
Because we don't observe U,
00:03:59.920 --> 00:04:03.940
we can't test whether Z and U are uncorrelated,
00:04:03.940 --> 00:04:06.190
that has to be argued based on theory.
00:04:06.190 --> 00:04:08.080
That is difficult to do.
00:04:08.080 --> 00:04:09.910
Let's take a look at an example.
00:04:09.910 --> 00:04:12.940
So in Mochon's paper, they
apply instrumental variables.
00:04:12.940 --> 00:04:16.720
To understand the instrumental variable used here,
00:04:16.720 --> 00:04:18.130
we have to understand first,
00:04:18.130 --> 00:04:22.060
what is the endogeneity
problem that they are addressing?
00:04:22.060 --> 00:04:24.790
So what's the issue, why use instrumental variables?
00:04:24.790 --> 00:04:27.850
Their dependent variable was point acquisition,
00:04:27.850 --> 00:04:29.980
so people are acquiring points in a service.
00:04:29.980 --> 00:04:33.280
And they are testing,
00:04:33.280 --> 00:04:38.080
whether the decision to like
the Facebook page of that service
00:04:38.080 --> 00:04:39.880
leads to more point acquisition.
00:04:39.880 --> 00:04:41.890
And they did an experiment,
00:04:41.890 --> 00:04:45.250
so they have this randomization step here,
00:04:45.250 --> 00:04:52.720
so they invited some people to like
the page that they were studying,
00:04:52.720 --> 00:04:55.150
and the rest were control.
00:04:55.150 --> 00:04:57.970
So this is randomization and it is exogenous,
00:04:57.970 --> 00:05:00.970
because there's no reasonable way that
00:05:00.970 --> 00:05:04.630
a random number
generated on my computer
00:05:04.630 --> 00:05:07.360
will be correlated with the
behavior of actual people.
00:05:07.360 --> 00:05:12.610
So it's very implausible to claim
that this would not be exogenous.
00:05:12.610 --> 00:05:14.500
So randomization is exogenous.
00:05:15.244 --> 00:05:18.340
Then we have the endogenous selection.
00:05:19.868 --> 00:05:23.860
The reason why the selection
is endogenous is that,
00:05:23.860 --> 00:05:28.960
when you're invited to like
a Facebook page of a service,
00:05:28.960 --> 00:05:34.270
whether you accept the invitation
or not probably depends on,
00:05:34.270 --> 00:05:35.890
how much you like the service,
00:05:35.890 --> 00:05:38.590
how much you use the service, and so on.
00:05:38.590 --> 00:05:44.110
So there are probably multiple
different causes that influence,
00:05:44.110 --> 00:05:48.220
whether you choose to accept the
invitation to like the service.
00:05:48.220 --> 00:05:52.360
That also influences how
active you are in the service,
00:05:52.360 --> 00:05:53.410
acquiring points.
00:05:53.953 --> 00:05:57.340
So comparing those that chose not to like,
00:05:57.340 --> 00:05:59.890
against those that did like the page,
00:05:59.890 --> 00:06:02.260
is not a valid comparison,
00:06:02.260 --> 00:06:05.380
because these two groups of
people are not comparable.
00:06:05.380 --> 00:06:07.870
That is, we have an endogenous selection here.
00:06:08.795 --> 00:06:11.050
So we have basically a few options.
00:06:11.050 --> 00:06:15.190
We can compare between treatment and control here,
00:06:15.190 --> 00:06:18.340
but that doesn't really give
us the effect of the like,
00:06:18.340 --> 00:06:21.820
because some of the people in the treatment group
00:06:21.820 --> 00:06:25.000
chose not to like the Facebook page.
00:06:25.000 --> 00:06:28.450
Also, some people in the control
could have liked the page anyway.
00:06:28.450 --> 00:06:32.290
So comparing the treatment and
control on points acquisition,
00:06:32.290 --> 00:06:35.950
doesn't really allow us to do what we want to do.
00:06:36.895 --> 00:06:41.870
We can't compare between those who chose
to like and those who chose not to like,
00:06:41.870 --> 00:06:46.040
because this is an endogenous selection.
00:06:46.040 --> 00:06:51.380
And we can't compare those that
chose to like against control,
00:06:51.380 --> 00:06:55.700
because the control contains people
that would have chosen not to like,
00:06:55.700 --> 00:06:56.960
had they been asked.
00:06:56.960 --> 00:06:58.760
So these two are not comparable either.
00:06:59.323 --> 00:07:00.650
What we can do here,
00:07:00.650 --> 00:07:02.360
and what Mochon et al. did,
00:07:02.360 --> 00:07:04.730
is apply the instrumental variable technique.
00:07:05.454 --> 00:07:07.385
So the idea is that,
00:07:08.028 --> 00:07:13.340
the treatment, the randomization here,
is correlated with choosing to like.
00:07:13.340 --> 00:07:17.810
So if you ask some people
to like a Facebook page and
00:07:17.810 --> 00:07:20.000
you don't ask the other group,
00:07:20.000 --> 00:07:26.960
then those people that you ask are
more likely to actually like the page.
00:07:26.960 --> 00:07:29.570
And this can be established empirically.
00:07:29.570 --> 00:07:31.670
So they can calculate this correlation here,
00:07:31.670 --> 00:07:38.450
and they can establish that the treatment
is a relevant instrumental variable
00:07:38.450 --> 00:07:39.440
for choosing to like.
00:07:39.440 --> 00:07:41.810
So it fulfills the relevance criterion.
00:07:42.373 --> 00:07:46.040
The treatment also fulfills
the exclusion criterion,
00:07:46.040 --> 00:07:47.690
because the treatment is randomized,
00:07:47.690 --> 00:07:51.740
it is very unlikely
that this treatment actually
00:07:51.740 --> 00:07:56.030
correlates with any other
reason that an individual person
00:07:56.030 --> 00:07:58.190
would have for liking the page.
00:07:58.190 --> 00:08:01.340
So when we have a random number
basically on our computer,
00:08:01.340 --> 00:08:03.694
which assigns people to treatment or control,
00:08:03.694 --> 00:08:09.800
then that is independent of
any attribute of those people
00:08:09.800 --> 00:08:10.580
that we randomized.
00:08:10.962 --> 00:08:13.362
So it fulfills the exclusion criterion.
00:08:13.362 --> 00:08:18.230
Then they can apply these equations to calculate,
00:08:18.230 --> 00:08:20.810
what the effect of a Facebook like is.
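With a randomized yes/no invitation as the instrument, the ratio of covariances reduces to a ratio of two differences in group means. Here is a minimal sketch on simulated data. Everything in it is hypothetical: the variable names, the unobserved "engagement" trait, and all the numbers are assumptions for illustration and are not taken from the Mochon et al. study.

```python
import numpy as np


def simulate_experiment(n, effect, seed):
    """Hypothetical experiment: randomized invitation, self-selected liking."""
    rng = np.random.default_rng(seed)
    engagement = rng.normal(size=n)  # unobserved, drives both liking and points
    invited = rng.integers(0, 2, size=n)  # randomized invitation, 0 or 1
    # Liking is self-selected: engaged users like more often, invited or not.
    liked = (engagement + 1.5 * invited + rng.normal(size=n) > 1.0).astype(float)
    points = 10.0 + effect * liked + 3.0 * engagement + rng.normal(size=n)
    return invited, liked, points


def naive_difference(liked, points):
    """Compare likers with non-likers directly; confounded by engagement."""
    return points[liked == 1].mean() - points[liked == 0].mean()


def iv_estimate(invited, liked, points):
    """Difference in mean points by invitation over difference in share liking."""
    diff_points = points[invited == 1].mean() - points[invited == 0].mean()
    diff_liked = liked[invited == 1].mean() - liked[invited == 0].mean()
    return diff_points / diff_liked


if __name__ == "__main__":
    invited, liked, points = simulate_experiment(n=200_000, effect=2.0, seed=1)
    print("naive comparison:", naive_difference(liked, points))
    print("IV estimate:     ", iv_estimate(invited, liked, points))
```

In this simulated setup the naive comparison should overstate the assumed effect of 2.0, because likers are more engaged to begin with, while the instrumental variable ratio should be close to it.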
00:08:21.554 --> 00:08:24.560
In practice, we don't work with these equations,
00:08:24.560 --> 00:08:28.100
because we usually have
multiple different variables,
00:08:28.100 --> 00:08:31.340
we have controls and we can have
multiple instrumental variables as well.
00:08:31.541 --> 00:08:33.791
So we use some other technique.
00:08:33.791 --> 00:08:37.820
And one of the simplest techniques is
called two-stage least squares.
00:08:37.820 --> 00:08:40.460
The idea of two-stage least squares is that,
00:08:40.460 --> 00:08:42.620
when we take the instrumental variable Z,
00:08:42.620 --> 00:08:46.940
then instead of just saying
that these are correlated,
00:08:46.940 --> 00:08:52.700
we regress X on Z and then we calculate
things based on these regressions.
00:08:52.700 --> 00:08:53.630
So let's see how it works.
00:08:54.170 --> 00:08:55.220
So we have first,
00:08:55.220 --> 00:08:58.160
this is the endogenous regression analysis.
00:08:58.643 --> 00:09:00.205
So we have Y,
00:09:00.386 --> 00:09:03.830
if we regress Y on X, we
have an endogeneity problem,
00:09:03.830 --> 00:09:07.250
because some causes of X are
correlated with some causes of Y.
00:09:07.773 --> 00:09:11.283
Then we have the instrumental variable here, Z.
00:09:11.283 --> 00:09:18.950
So we say that X is actually a
sum of Z multiplied by beta2,
00:09:18.950 --> 00:09:22.340
plus the error term from that regression analysis.
00:09:22.642 --> 00:09:28.072
So we have the regression analysis for
the first regression of X on Z here.
00:09:28.232 --> 00:09:29.942
And then we have,
00:09:31.070 --> 00:09:33.151
that makes the second regression.
00:09:33.151 --> 00:09:35.483
Then we can multiply this out,
00:09:35.483 --> 00:09:43.580
so we have this beta1 times beta2 times Z, that's the effect.
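The substitution can be written out step by step. In this sketch the intercepts are left out, and the symbol e for the error term of the first regression is an assumption made here, since it is not named in the lecture.

```latex
\begin{aligned}
X &= \beta_2 Z + e \\
Y &= \beta_1 X + U \\
  &= \beta_1 (\beta_2 Z + e) + U \\
  &= \beta_1 \beta_2 Z + \beta_1 e + U
\end{aligned}
```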
00:09:44.605 --> 00:09:51.080
And this is typically implemented
by running two sets of regressions.
00:09:51.080 --> 00:09:58.131
So this beta2 Z is a fitted value
of a regression analysis of X on Z.
00:09:58.312 --> 00:10:00.652
So in practice, we implement this model,
00:10:00.652 --> 00:10:03.962
by first regressing X on Z,
00:10:03.962 --> 00:10:08.252
then we take the fitted values of X,
00:10:08.252 --> 00:10:14.900
and then we regress Y on the fitted
values of X from the first regression.
00:10:14.900 --> 00:10:17.960
So we run the first regression
to get fitted values,
00:10:17.960 --> 00:10:21.380
then we run the second
regression on the fitted values
00:10:21.380 --> 00:10:24.260
and that gives us consistent
estimates of this relationship.
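The two steps can be sketched in a few lines of Python with NumPy. This is an illustration under assumptions, not a production routine: the function names and the simulated data are hypothetical, and there is only one endogenous variable and one instrument. Note also that standard errors taken from a hand-run second stage are not the correct two-stage least squares standard errors, so a dedicated routine is normally used for inference.

```python
import numpy as np


def fit_line(predictor, outcome):
    """Least-squares intercept and slope of outcome on a single predictor."""
    design = np.column_stack([np.ones_like(predictor), predictor])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef  # array holding [intercept, slope]


def two_stage_least_squares(z, x, y):
    """Manual two-stage least squares with one endogenous X and one instrument Z."""
    intercept, slope = fit_line(z, x)  # first stage: regress X on Z
    x_hat = intercept + slope * z  # fitted values of X
    _, beta_hat = fit_line(x_hat, y)  # second stage: regress Y on fitted X
    return beta_hat


def simulate(n, beta, seed):
    """Hypothetical data: Z is a valid instrument, X is endogenous through U."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=n)
    u = rng.normal(size=n)
    x = 0.8 * z + 0.6 * u + 0.5 * rng.normal(size=n)
    y = beta * x + u
    return z, x, y


if __name__ == "__main__":
    z, x, y = simulate(n=200_000, beta=0.5, seed=0)
    print("2SLS estimate:", two_stage_least_squares(z, x, y))
```

With a single instrument, the second-stage slope should match the simple ratio of cov(Z, Y) to cov(Z, X) up to rounding.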
00:10:24.260 --> 00:10:27.740
If we have more than one independent variable,
00:10:27.740 --> 00:10:29.360
say five independent variables,
00:10:29.360 --> 00:10:33.080
then we regress each one of
those five independent variables
00:10:33.080 --> 00:10:35.540
on the instruments separately.
00:10:35.540 --> 00:10:38.300
If we have variables that are not endogenous,
00:10:38.300 --> 00:10:40.970
then they qualify as instruments as well.
00:10:40.970 --> 00:10:44.300
We take fitted values of each of
those five regression analyses
00:10:44.300 --> 00:10:46.880
and use those fitted values to explain Y.
00:10:46.880 --> 00:10:51.590
And that will produce
consistent estimates of the betas,
00:10:51.590 --> 00:10:56.240
under the assumption that Z is relevant
00:10:56.240 --> 00:11:00.950
and does not correlate with
the omitted causes of Y.