WEBVTT
Kind: captions
Language: en
00:00:00.660 --> 00:00:03.660
The added variable plot or partial
regression plot is one of the most useful
00:00:03.660 --> 00:00:06.330
diagnostic plots after a regression analysis.
00:00:06.840 --> 00:00:10.590
This plot also demonstrates some
features of regression analysis.
00:00:10.871 --> 00:00:14.520
So let's take a look at what a partial
regression plot actually does.
00:00:14.520 --> 00:00:18.540
And we need some data and a
regression model to do the plot.
00:00:18.540 --> 00:00:20.370
So we have the data here,
00:00:20.370 --> 00:00:21.330
the prestige data,
00:00:21.330 --> 00:00:24.390
and we run a regression of prestige on
00:00:24.390 --> 00:00:26.821
income, education and share of women.
00:00:27.102 --> 00:00:31.980
And then we do the partial regression
plots or added variable plots.
00:00:31.980 --> 00:00:32.940
and they're shown here.
00:00:33.204 --> 00:00:34.554
So these plots,
00:00:34.554 --> 00:00:37.470
we usually do them for every independent variable,
00:00:37.470 --> 00:00:42.267
but these are basically three independent plots,
00:00:42.267 --> 00:00:45.000
and we'll just be looking at the first one now,
00:00:45.000 --> 00:00:47.943
because the other ones are
done the exact same way.
00:00:48.330 --> 00:00:52.290
So this is the first added variable
plot or partial regression plot.
00:00:52.290 --> 00:00:57.330
Why it's a partial regression plot
will become clear in a few moments.
00:00:57.330 --> 00:01:01.110
But the idea here is that we have
a line that goes through the data.
00:01:01.110 --> 00:01:03.750
So this is a scatter plot of data and then
00:01:03.750 --> 00:01:04.563
there's a regression line.
00:01:04.739 --> 00:01:06.839
So what are these data about?
00:01:06.839 --> 00:01:08.893
So these are not our raw observations; it is
00:01:08.893 --> 00:01:11.455
education conditional on the others, and
00:01:11.455 --> 00:01:13.710
prestige conditional on the others.
00:01:13.710 --> 00:01:16.140
So to understand what these observations,
00:01:16.140 --> 00:01:18.660
or what these points here signify,
00:01:18.660 --> 00:01:19.800
and what the line signifies,
00:01:19.800 --> 00:01:21.690
it's useful to understand
00:01:21.690 --> 00:01:23.130
how this is actually calculated,
00:01:23.130 --> 00:01:24.690
and it is very simple to calculate.
00:01:25.446 --> 00:01:31.110
So this is the R code for
my own added variable plot.
00:01:31.110 --> 00:01:35.550
The idea of added variable
plot is that you first,
00:01:35.550 --> 00:01:39.660
regress one of the independent variables, education here,
00:01:39.660 --> 00:01:43.320
on other independent variables,
income and women here,
00:01:43.320 --> 00:01:47.880
then we regress the dependent variable
00:01:47.880 --> 00:01:50.880
on the other independent
variables, excluding education.
00:01:50.880 --> 00:01:52.710
And then we take the residuals.
00:01:52.710 --> 00:01:55.980
So we take residual of this
regression analysis here,
00:01:55.980 --> 00:01:59.100
and then residual for this
other regression analysis here.
00:01:59.100 --> 00:02:01.350
For those of you who don't know R,
00:02:01.350 --> 00:02:04.740
the education here is the dependent variable,
00:02:04.740 --> 00:02:08.040
then income and women are
the independent variables.
00:02:08.040 --> 00:02:11.130
So it's pretty simple to understand.
00:02:11.130 --> 00:02:13.327
It's a slightly different way of writing
00:02:13.327 --> 00:02:17.951
education equals beta1 times income
plus beta2 times women plus beta0.
00:02:18.514 --> 00:02:20.610
Then we run a regression analysis,
00:02:20.610 --> 00:02:24.930
where we simply have the prestige
residuals as the dependent variable,
00:02:24.930 --> 00:02:27.480
the residuals of education
as the independent variable,
00:02:27.480 --> 00:02:29.100
and we do a scatter plot,
00:02:29.100 --> 00:02:30.704
and we draw the regression line.
00:02:30.704 --> 00:02:33.810
Then the result is the partial regression plot.
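The construction just described can be sketched outside R as well; here is a minimal Python version with numpy (the video itself uses R, and the variables below are made-up stand-ins for the prestige dataset, an assumption for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
# Made-up stand-ins for the prestige data (income, women, education, prestige)
income = rng.normal(50.0, 10.0, n)
women = rng.uniform(0.0, 100.0, n)
education = 0.5 * income + 0.1 * women + rng.normal(0.0, 5.0, n)
prestige = 2.0 * education + 0.3 * income - 0.05 * women + rng.normal(0.0, 5.0, n)

def residuals(y, xs):
    """Residuals of y regressed (with an intercept) on the columns in xs."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

# Step 1: regress education on the other independent variables, keep residuals
e_edu = residuals(education, [income, women])
# Step 2: regress prestige on the same other variables, keep residuals
e_pre = residuals(prestige, [income, women])

# The added variable plot is the scatter of e_pre against e_edu;
# both residual series are mean zero, so the fitted line needs no intercept
slope = (e_edu @ e_pre) / (e_edu @ e_edu)
print(round(slope, 3))
```

Plotting e_pre against e_edu and drawing the line with this slope reproduces the added variable plot for education.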
00:02:34.425 --> 00:02:37.884
So this plot here is the plot we built ourselves,
00:02:38.235 --> 00:02:42.840
and this is what the R command does,
00:02:42.840 --> 00:02:44.820
so what the avPlots command does in R,
00:02:44.820 --> 00:02:47.400
and this one is done using the basic plot command.
00:02:47.400 --> 00:02:50.629
So we can produce the exact same plot.
00:02:50.629 --> 00:02:55.710
The diagnostic plot for regression
analysis just adds some things:
00:02:55.710 --> 00:02:57.120
it adds a grid here,
00:02:57.120 --> 00:03:00.150
and it adds nicer labels for the plot,
00:03:00.150 --> 00:03:02.100
and for the plot axes.
00:03:02.100 --> 00:03:05.711
So otherwise this is exactly the same.
00:03:06.327 --> 00:03:10.920
So the plot here explains or tells us,
00:03:10.920 --> 00:03:15.870
what is the relationship
between education and prestige,
00:03:15.870 --> 00:03:19.680
when we eliminate all other
variables from the model.
00:03:19.680 --> 00:03:20.850
So it tells us,
00:03:20.850 --> 00:03:23.670
what is the bivariate relationship,
00:03:23.670 --> 00:03:25.500
after we control for other variables.
00:03:26.414 --> 00:03:31.680
We can also view or consider this
from the Venn diagram perspective.
00:03:31.680 --> 00:03:36.510
So the Venn diagram perspective
on regression analysis is that
00:03:36.510 --> 00:03:39.750
we have the dependent variable here,
00:03:39.750 --> 00:03:42.090
which is prestige,
00:03:42.090 --> 00:03:44.820
then we have the independent variables,
00:03:44.820 --> 00:03:49.680
this is the education and
this is the other variables.
00:03:50.506 --> 00:03:54.210
So prestige and education are correlated,
00:03:54.210 --> 00:03:55.740
this area here is the correlation.
00:03:55.740 --> 00:03:57.210
And we want to know,
00:03:57.210 --> 00:04:02.010
what part of this overall correlation is
00:04:02.010 --> 00:04:05.640
unique to education and prestige,
00:04:05.640 --> 00:04:10.290
and not accounted for by these other variables.
00:04:10.290 --> 00:04:13.380
So we can see that there's some
overlap in all the variables
00:04:13.380 --> 00:04:16.890
and there are some unique relationships
between all of these variables,
00:04:16.890 --> 00:04:19.110
and this circle here signifies two different variables.
00:04:19.567 --> 00:04:21.240
So what we do here is that
00:04:21.240 --> 00:04:26.400
we regress prestige on these other variables,
00:04:26.400 --> 00:04:29.310
we regress education on these other variables,
00:04:29.310 --> 00:04:31.320
and then we take the residual.
00:04:31.320 --> 00:04:34.139
So the residual is this part here;
00:04:34.451 --> 00:04:36.990
if this is a multiple regression model,
00:04:36.990 --> 00:04:40.860
the residual is the part that the
other variables don't explain.
00:04:41.036 --> 00:04:44.996
So if we regress education
on these other variables,
00:04:45.084 --> 00:04:47.874
prestige on these other variables,
00:04:47.874 --> 00:04:49.519
and we take the residuals,
00:04:49.519 --> 00:04:51.769
what remains is this.
00:04:52.015 --> 00:04:54.475
So we have the residual of prestige,
00:04:54.475 --> 00:04:56.190
residual of education,
00:04:56.190 --> 00:05:04.050
and now the added variable plot tells us
graphically about this bivariate relationship.
00:05:04.489 --> 00:05:09.780
So importantly the correlation
between these two variables
00:05:09.780 --> 00:05:13.410
is now the regression coefficient,
00:05:13.709 --> 00:05:16.259
if we are using standardized estimates.
00:05:16.698 --> 00:05:22.170
So the correlation tells us
this regression coefficient.
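This identity (with standardized variables, the OLS slope equals the correlation) can be verified in a few lines; a sketch in Python with numpy on made-up residual-like series (an assumption for illustration, since the video works in R):

```python
import numpy as np

rng = np.random.default_rng(2)
# Two series standing in for the prestige and education residuals
x = rng.normal(size=500)
y = 0.6 * x + rng.normal(size=500)

r = float(np.corrcoef(x, y)[0, 1])           # correlation of the two series

# Standardize both series, then take the OLS slope of y on x
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
slope_std = float(zx @ zy) / float(zx @ zx)  # slope with standardized data

print(r, slope_std)  # identical: the standardized slope is the correlation
```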
00:05:24.016 --> 00:05:25.289
And that's how it works.
00:05:25.289 --> 00:05:32.610
So here, income conditional
on others is this area,
00:05:32.610 --> 00:05:37.140
after we eliminated all the variations
that the other variables explained,
00:05:37.140 --> 00:05:43.290
prestige here on the y-axis
is this prestige residual,
00:05:43.290 --> 00:05:47.340
after we eliminated the influence of
all other variables from the data.
00:05:48.958 --> 00:05:51.562
Also, another interesting feature concerns
00:05:51.562 --> 00:05:53.734
the regression coefficient.
00:05:53.734 --> 00:06:00.660
Suppose we regress prestige on
education, income and women,
00:06:01.099 --> 00:06:06.870
and we also regress the residual from
the added variable plot regression
00:06:06.870 --> 00:06:08.700
on the other residual,
00:06:08.700 --> 00:06:15.000
so we regress the residual of
prestige on the residual of education.
00:06:15.000 --> 00:06:21.542
This regression coefficient here is exactly
the same as the regression coefficient here.
00:06:21.542 --> 00:06:26.317
So you can calculate regression
coefficients this way as well.
00:06:26.317 --> 00:06:30.073
So you can take variation away one variable at a time,
00:06:30.073 --> 00:06:34.020
and the regression coefficient
for the final variable
00:06:34.020 --> 00:06:37.440
will be the same as the regression coefficient
00:06:37.440 --> 00:06:39.180
you get if you enter all variables at the same time.
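This one-variable-at-a-time property is the Frisch-Waugh-Lovell theorem, and it can be checked numerically. A minimal sketch in Python with numpy, using made-up data in place of the prestige dataset (an assumption; the video itself works in R):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
# Made-up variables standing in for income, women and education
income = rng.normal(size=n)
women = rng.normal(size=n)
education = 0.4 * income + rng.normal(size=n)
prestige = 1.5 * education + 0.8 * income - 0.2 * women + rng.normal(size=n)

def ols(y, xs):
    """OLS with intercept; returns (coefficients, residuals)."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

# Route 1: enter all variables at the same time
beta_full, _ = ols(prestige, [education, income, women])
coef_full = beta_full[1]              # coefficient of education

# Route 2: partial income and women out of both variables first,
# then regress residual on residual
_, r_edu = ols(education, [income, women])
_, r_pre = ols(prestige, [income, women])
beta_res, _ = ols(r_pre, [r_edu])
coef_res = beta_res[1]

print(coef_full, coef_res)  # the two coefficients agree
```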
00:06:39.426 --> 00:06:43.126
So we can check that the coefficient
here is the same as here.
00:06:43.126 --> 00:06:45.696
The standard errors differ because here
00:06:45.696 --> 00:06:50.560
we assume that the effects of
income and women are known,
00:06:50.560 --> 00:06:52.510
but here they're estimated.
00:06:52.510 --> 00:06:56.020
So this is slightly different for that reason.
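Why the point estimates match but the standard errors differ can also be checked numerically; a sketch in Python with numpy and made-up data (assumed names, since the video works in R). The residual-on-residual regression reproduces the full model's residual sum of squares, so the two standard errors differ only through the degrees of freedom used to estimate the error variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
# Made-up data: x1 plays the role of education, x2 of the other controls
x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)
y = 1.0 * x1 + 0.7 * x2 + rng.normal(size=n)

def fit(y, xs):
    """OLS with intercept; returns (coefficients, residuals)."""
    X = np.column_stack([np.ones(len(y))] + list(xs))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

# Full model, and the residual-on-residual model
_, res_full = fit(y, [x1, x2])
_, r_x1 = fit(x1, [x2])
_, r_y = fit(y, [x2])
_, res_two = fit(r_y, [r_x1])

rss = res_full @ res_full        # same in both models
sxx = r_x1 @ r_x1                # same "x variation" term in both SE formulas
se_full = np.sqrt(rss / (n - 3)) / np.sqrt(sxx)                   # 3 coefficients estimated
se_resid = np.sqrt((res_two @ res_two) / (n - 2)) / np.sqrt(sxx)  # only 2 estimated

print(se_full, se_resid)  # close, but not identical
```

The residual regression treats the partialling-out step as known, so it spends fewer degrees of freedom and reports a slightly smaller standard error.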
00:06:57.004 --> 00:07:03.662
And so that regression coefficient
is actually the slope of this line.
00:07:03.855 --> 00:07:05.170
So why is this useful?
00:07:05.170 --> 00:07:09.441
It is useful because it allows
you to graphically present,
00:07:10.570 --> 00:07:14.350
how one variable influences
the dependent variable.
00:07:14.772 --> 00:07:16.602
And when you have a line,
00:07:16.602 --> 00:07:20.421
then the slope tells you everything
that you need to know about the line.
00:07:20.808 --> 00:07:24.100
But you can have more
complicated relationships, like
00:07:24.100 --> 00:07:27.940
when you fit a log-transformed dependent variable,
00:07:27.940 --> 00:07:29.650
or a log-transformed independent variable,
00:07:29.650 --> 00:07:31.474
or you fit a u-shape,
00:07:31.474 --> 00:07:33.430
where you have the square of a variable.
00:07:33.430 --> 00:07:37.330
Then you can use the same
kind of plotting for those,
00:07:37.330 --> 00:07:38.890
when you don't have a line,
00:07:38.890 --> 00:07:40.000
but you have a curve,
00:07:40.000 --> 00:07:45.760
and then you can check how that
curve explains the data controlling
00:07:45.760 --> 00:07:46.870
for all other variables.
00:07:46.870 --> 00:07:51.070
So this is useful not only for diagnostics
00:07:51.070 --> 00:07:52.960
but also for interpretation.
00:07:52.960 --> 00:07:55.930
And I have myself used this
kind of plot in one paper
00:07:55.930 --> 00:07:58.420
that I've written, for interpretation purposes.
00:07:59.299 --> 00:08:06.490
Also the idea that this regression
coefficient is the same as
00:08:06.490 --> 00:08:09.109
regressing one residual on another,
00:08:09.109 --> 00:08:11.119
allows you to understand,
00:08:11.119 --> 00:08:14.320
what this paper by Aguinis
and Vandenberg is saying.
00:08:14.320 --> 00:08:16.492
So they're saying that,
00:08:16.492 --> 00:08:20.950
if we have lots of controls in the model then
00:08:20.950 --> 00:08:26.860
we're basically just analyzing residuals from
00:08:26.860 --> 00:08:30.460
a model where the dependent variable
is first regressed on those controls,
00:08:30.460 --> 00:08:33.760
and the independent variable is
regressed on those controls as well.
00:08:33.760 --> 00:08:37.360
So we are analyzing the
relationship between two residuals.
00:08:37.360 --> 00:08:39.670
Whether that is problematic or not,
00:08:39.670 --> 00:08:43.810
is something that I will not
be going into in this video,
00:08:45.093 --> 00:08:48.190
but it's technically correct to
say that this is just a residual.