WEBVTT
Kind: captions
Language: en
00:00:00.150 --> 00:00:05.280
It's fairly common that authors center their
variables before they form the interaction term.
00:00:05.280 --> 00:00:09.060
In this video I will take a look at
whether that is actually necessary
00:00:09.060 --> 00:00:14.130
and whether you should do that or not.
In Heckman's paper the authors argued
00:00:14.130 --> 00:00:17.310
that they did center the variables
to reduce multicollinearity.
00:00:17.310 --> 00:00:23.400
The idea of centering and
multicollinearity is that if you have
00:00:23.400 --> 00:00:30.720
X and M and then you form a product of X and M,
then the product will be correlated with both X and
00:00:30.720 --> 00:00:37.110
M because those two variables form the interaction
and by centering we can reduce those correlations.
00:00:37.110 --> 00:00:43.260
So let's take a look at some data, and we
have two random numbers here, x1 and x2.
00:00:43.260 --> 00:00:48.570
Here x1 and x2 have means
of two, and here we have centered
00:00:48.570 --> 00:00:54.000
the variables x1 and x2 to have means of 0.
So the idea of centering is that you take the
00:00:54.000 --> 00:00:59.070
original variable and then you subtract the mean,
and that will make the mean of the variable to be
00:00:59.070 --> 00:01:05.370
0, and we say that the variable is centered.
The bar symbol over the X means that it's
00:01:05.370 --> 00:01:12.420
the mean of that variable.
And standardization is centering and dividing
00:01:12.420 --> 00:01:20.460
by the standard deviation. We can see here that
x1 and x2 are not very strongly correlated. So
00:01:20.460 --> 00:01:26.520
here there's no particular pattern.
But when we multiply x1 and x2 together,
00:01:26.520 --> 00:01:32.790
then that product is highly correlated with x1 and x2.
So there's a strong statistical relationship.
00:01:32.790 --> 00:01:38.040
When we center the variables, the
bivariate relationship here stays the same.
00:01:38.040 --> 00:01:45.900
But we can see that the relationship between
x1 and x2 and their product is quite different.
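
NOTE
A minimal sketch of the comparison described here, assuming simulated data rather than the data in the video. The means of 2 come from the talk; the normal distribution, the standard deviation of 1, the sample size and the seed are assumptions made for illustration. It compares the linear correlation between x1 and the product before and after centering, and then checks the absolute values to show that an association remains after centering even though it is not linear.
```python
import numpy as np
rng = np.random.default_rng(0)  # arbitrary seed (assumption)
n = 10_000
# Two independent variables with mean 2, as in the talk; sd = 1 is assumed
x1 = rng.normal(loc=2.0, scale=1.0, size=n)
x2 = rng.normal(loc=2.0, scale=1.0, size=n)
# Centering: subtract each variable's own mean
x1c = x1 - x1.mean()
x2c = x2 - x2.mean()
def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])
r_raw = corr(x1, x1 * x2)            # product of uncentered variables
r_centered = corr(x1c, x1c * x2c)    # product of centered variables
# Correlating magnitudes reveals the remaining nonlinear association
r_abs = corr(np.abs(x1c), np.abs(x1c * x2c))
print(f"uncentered corr(x1, x1*x2):       {r_raw:.2f}")
print(f"centered corr(x1c, x1c*x2c):      {r_centered:.2f}")
print(f"centered corr(|x1c|, |x1c*x2c|):  {r_abs:.2f}")
```
With independent inputs like these, the linear correlation should drop to roughly zero after centering, while the correlation of the magnitudes stays clearly positive.
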
00:01:45.900 --> 00:01:52.320
There's still a strong statistical
relationship, so when x1 or x2 goes to 0,
00:01:52.320 --> 00:01:55.980
then there's no variation
in the data, and then it's
00:01:55.980 --> 00:02:03.210
spread out when x1 and x2 increase.
So there is still a strong statistical association
00:02:03.210 --> 00:02:08.400
but it is no longer a linear association.
So what's the implication for regression
00:02:08.400 --> 00:02:15.660
analysis of this centering stuff?
On the left hand side we have the
00:02:15.660 --> 00:02:20.460
regression analysis for the data
that is not centered,
00:02:20.460 --> 00:02:24.060
and on the right hand side we have
regression analysis for centered data.
00:02:24.060 --> 00:02:32.730
And we can see that what the
centering does for the regression of Y on x1
00:02:32.730 --> 00:02:39.600
and x2 is that it just changes the intercept.
So only the intercept is different and the
00:02:39.600 --> 00:02:46.980
first-order effects of x1 and x2 are the same.
Which is quite natural because when you center
00:02:46.980 --> 00:02:53.160
you're simply subtracting something from x1
and something from x2, and that will simply,
00:02:53.160 --> 00:02:58.500
because you subtract the same number from every
observation, that will only alter the intercept,
00:02:58.500 --> 00:03:04.230
because it doesn't affect the correlations or
the covariance of x1 with x2 and the covariance
00:03:04.230 --> 00:03:07.620
between those two variables and Y.
Those are unaffected by centering,
00:03:07.620 --> 00:03:13.470
so centering will only affect means and in normal
regression analysis it only affects the intercept.
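
NOTE
A minimal sketch of this claim, assuming simulated data and made-up population coefficients (1.0, 0.5, 0.3); none of these numbers are from the video. The helper ols is a hypothetical name for a plain least-squares fit. It fits Y on x1 and x2 without an interaction, once with the raw variables and once with the centered ones, and compares the estimates.
```python
import numpy as np
rng = np.random.default_rng(1)  # arbitrary seed (assumption)
n = 500
x1 = rng.normal(2.0, 1.0, n)
x2 = rng.normal(2.0, 1.0, n)
# Made-up data-generating model, for illustration only
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(0.0, 1.0, n)
def ols(y, *cols):
    """Least-squares coefficients: intercept first, then one per column."""
    X = np.column_stack([np.ones(len(y)), *cols])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
b_raw = ols(y, x1, x2)
b_cen = ols(y, x1 - x1.mean(), x2 - x2.mean())
print("uncentered [intercept, x1, x2]:", np.round(b_raw, 3))
print("centered   [intercept, x1, x2]:", np.round(b_cen, 3))
```
The two slope estimates should agree, and the centered intercept should equal the sample mean of Y.
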
00:03:13.470 --> 00:03:20.130
The downside of centering is that once
we calculate predictions, here the predictions
00:03:20.130 --> 00:03:26.370
for this model are on the original metric.
So we will get predictions on whatever
00:03:26.370 --> 00:03:31.590
the Y is and if we calculate predictions
using this model, then the predictions
00:03:31.590 --> 00:03:37.770
will be off by the amount that we centered.
So for example if we're predicting salary
00:03:37.770 --> 00:03:43.050
and let's say this model would
give 10000 euros per year,
00:03:43.050 --> 00:03:48.960
then this model could give minus
2000 euros, which doesn't make sense
00:03:48.960 --> 00:03:54.630
unless we back convert or back translate
that effect to the non-centered variables.
00:03:54.630 --> 00:04:01.770
So centering makes predictions, and makes doing
plots that apply predictions, more difficult, and
00:04:01.770 --> 00:04:08.070
that's important for interactions for reasons
that I'll explain in the last slide.
00:04:08.070 --> 00:04:14.850
When we add an interaction term we can see
now that there are some more differences.
00:04:14.850 --> 00:04:20.460
Importantly the differences are only
in the first three coefficients.
00:04:20.460 --> 00:04:27.210
So the intercept again is different, which is
expected, but now the x1 and x2 coefficients are
00:04:27.210 --> 00:04:32.700
different, but the interaction of
x1 and x2 is the exact same number.
00:04:32.700 --> 00:04:39.090
So the centering actually doesn't
influence the interaction term at all.
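
NOTE
A minimal sketch of this result, assuming simulated data and made-up population coefficients (not the numbers on the slide). The helper fit is a hypothetical name for a least-squares fit with columns intercept, x1, x2 and x1*x2. It checks that the interaction coefficient is identical with and without centering, and that each centered first-order coefficient equals the uncentered one plus the interaction coefficient times the other variable's mean.
```python
import numpy as np
rng = np.random.default_rng(2)  # arbitrary seed (assumption)
n = 500
x1 = rng.normal(2.0, 1.0, n)
x2 = rng.normal(2.0, 1.0, n)
# Made-up data-generating model with an interaction, for illustration only
y = 1.0 + 0.5 * x1 + 0.3 * x2 + 0.4 * x1 * x2 + rng.normal(0.0, 1.0, n)
def fit(y, a, b):
    """Least-squares fit of y on [1, a, b, a*b]."""
    X = np.column_stack([np.ones(len(y)), a, b, a * b])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta
m1, m2 = x1.mean(), x2.mean()
b_raw = fit(y, x1, x2)            # uncentered model
b_cen = fit(y, x1 - m1, x2 - m2)  # centered model
print("uncentered [b0, b1, b2, b3]:", np.round(b_raw, 3))
print("centered   [b0, b1, b2, b3]:", np.round(b_cen, 3))
# First-order coefficients implied by the uncentered fit, evaluated at the means
b1_at_mean = b_raw[1] + b_raw[3] * m2
b2_at_mean = b_raw[2] + b_raw[3] * m1
print("x1 effect at mean of x2:", round(float(b1_at_mean), 3))
print("x2 effect at mean of x1:", round(float(b2_at_mean), 3))
```
The two models are reparameterizations of the same fit, so the last two printed values should match the centered first-order coefficients.
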
00:04:39.090 --> 00:04:45.210
It influences only the first-order coefficients.
So is that something that you want to do or not?
00:04:45.210 --> 00:04:52.290
To answer that question we
have to consider what exactly the centering means,
00:04:52.290 --> 00:04:56.970
and what exactly it means that
we have this interaction term here.
00:04:56.970 --> 00:05:06.960
Let's take a look at a graph. So here the x1 and
x2 effects are the effects when x1 and x2 are 0, and here
00:05:06.960 --> 00:05:14.550
the x1 and x2 effects are the main effects.
So when x1 and x2 are at their means,
00:05:14.550 --> 00:05:20.400
then that's what the x1 and x2 effects are.
What that means can be understood
00:05:20.400 --> 00:05:26.880
by looking at this graphically.
So we have here a space and there is a plane
00:05:26.880 --> 00:05:36.300
in the space. Here we have x1 on this axis,
we have x2 on this axis, and then we have Y here.
00:05:36.300 --> 00:05:42.330
So when we have two coefficients or two
variables in a regression analysis as two
00:05:42.330 --> 00:05:46.320
independent variables, then the regression
is a plane in three-dimensional space.
00:05:46.320 --> 00:05:57.210
And we can see the plane here, and because of the
interaction the effect of x1 on Y, the strength
00:05:57.210 --> 00:06:07.440
of that effect, is contingent on the value of x2.
So here when x2 is at 0, then Y along x1 simply
00:06:07.440 --> 00:06:10.590
increases a little, so the
effect is not that great.
00:06:10.590 --> 00:06:19.500
When x2 is at 5 the effect is a lot greater,
so we see a lot steeper slope here.
00:06:19.500 --> 00:06:25.890
So the idea is that the regression
slope of x1 changes as a function of x2.
00:06:25.890 --> 00:06:31.020
Also the intercept changes, so
this line goes down here.
00:06:31.020 --> 00:06:39.720
So what centering does is that normally when we
do an interaction term we take the effect of x1.
00:06:39.720 --> 00:06:43.980
So the regression with an
interaction gives you the effects of
00:06:43.980 --> 00:06:50.880
x1, effects of x2 and their product.
When we don't center our data the
00:06:50.880 --> 00:06:56.460
effect of x1 is this blue line here
so it's the effect of x1 when x2 is 0.
00:06:56.460 --> 00:07:03.720
Similarly the effect of x2 is
the effect of x2 when x1
00:07:03.720 --> 00:07:11.910
is 0. When we center, instead of taking
the effect of x1 when x2 is at 0, we
00:07:11.910 --> 00:07:19.320
take the effect of x1 when x2 is at its mean.
So we take this green line in the middle.
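
NOTE
A short worked derivation of the point above. The symbols b_0 to b_3 (and the primed versions for the centered model) are notation assumed here for illustration and do not appear in the video. The slope of x1 is a function of x2, and the centered model simply reports that slope at the mean of x2 instead of at x2 = 0, while b_3 is untouched.
```latex
\begin{align*}
\hat{y} &= b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2
  && \text{(uncentered model)} \\
\frac{\partial \hat{y}}{\partial x_1} &= b_1 + b_3 x_2
  && \text{(slope of } x_1 \text{ at a given } x_2\text{)} \\
\hat{y} &= b_0' + b_1' (x_1 - \bar{x}_1) + b_2' (x_2 - \bar{x}_2)
  + b_3 (x_1 - \bar{x}_1)(x_2 - \bar{x}_2)
  && \text{(centered model)} \\
b_1' &= b_1 + b_3 \bar{x}_2, \qquad b_2' = b_2 + b_3 \bar{x}_1 \\
b_0' &= b_0 + b_1 \bar{x}_1 + b_2 \bar{x}_2 + b_3 \bar{x}_1 \bar{x}_2
\end{align*}
```
Setting x2 = 0 in the second line gives the uncentered coefficient b_1 (the blue line), and setting x2 to its mean gives the centered coefficient b_1' (the green line).
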
00:07:19.320 --> 00:07:26.880
So the centering just influences which
of these possible lines we take: do we take it
00:07:26.880 --> 00:07:34.020
from here, from here, or perhaps all the
way from the other end of the data.
00:07:34.020 --> 00:07:41.220
So it just changes at which part of
the regression plane we are looking.
00:07:41.220 --> 00:07:47.610
But the problem is that you
have to look at multiple places.
00:07:47.610 --> 00:07:52.860
So you can't summarize this plane by
saying that the effect of x1 is this line.
00:07:52.860 --> 00:07:56.910
You have to show multiple lines.
So it doesn't really matter which of these
00:07:56.910 --> 00:08:00.510
lines you show in your regression table.
And that's the problem.
00:08:00.510 --> 00:08:07.710
So you have to do these interaction plots.
So you have to show multiple plots, so
00:08:07.710 --> 00:08:16.200
you should show that the slope of x1
depends on the value of x2. And which of
00:08:16.200 --> 00:08:22.500
these lines we show in the regression table
is arbitrary, so it doesn't really matter because
00:08:22.500 --> 00:08:28.740
we have to present this kind of plot anyway.
So what we show here, whether we have the effect
00:08:28.740 --> 00:08:34.140
of x1 here to be the blue, green or red
line doesn't really make a difference.
00:08:34.140 --> 00:08:38.700
We have to show all the lines anyway.
The problem with centering is that
00:08:38.700 --> 00:08:45.360
once we center the variables, then in the interaction
plot the predicted values of Y
00:08:45.360 --> 00:08:49.110
will be incorrect by the amount
that we centered the data.
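
NOTE
A minimal sketch of the prediction issue, assuming simulated data, made-up coefficients and hypothetical helper names (design, new_x1, new_x2). It fits the interaction model on raw and on centered variables, then predicts for new observations given on the original metric. Feeding raw values straight into the centered model gives wrong predictions; subtracting the training means first recovers the uncentered model's predictions.
```python
import numpy as np
rng = np.random.default_rng(3)  # arbitrary seed (assumption)
n = 500
x1 = rng.normal(2.0, 1.0, n)
x2 = rng.normal(2.0, 1.0, n)
# Made-up data-generating model, for illustration only
y = 1.0 + 0.5 * x1 + 0.3 * x2 + 0.4 * x1 * x2 + rng.normal(0.0, 1.0, n)
def design(a, b):
    """Design matrix with columns [1, a, b, a*b]."""
    return np.column_stack([np.ones(len(a)), a, b, a * b])
m1, m2 = x1.mean(), x2.mean()
b_raw, *_ = np.linalg.lstsq(design(x1, x2), y, rcond=None)
b_cen, *_ = np.linalg.lstsq(design(x1 - m1, x2 - m2), y, rcond=None)
# New observations expressed on the original (uncentered) metric
new_x1 = np.array([0.0, 2.0, 4.0])
new_x2 = np.array([0.0, 2.0, 5.0])
pred_raw = design(new_x1, new_x2) @ b_raw              # uncentered model
pred_naive = design(new_x1, new_x2) @ b_cen            # raw values into centered model: wrong
pred_fixed = design(new_x1 - m1, new_x2 - m2) @ b_cen  # back-converted with training means
print("uncentered model:        ", np.round(pred_raw, 2))
print("centered, raw inputs:    ", np.round(pred_naive, 2))
print("centered, means removed: ", np.round(pred_fixed, 2))
```
The extra conversion step is the added work for interaction plots that the talk refers to.
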
00:08:49.110 --> 00:08:56.580
So we can no longer do predictions usefully;
we have to convert the predictions back to
00:08:56.580 --> 00:09:02.760
the non-centered metric for them to make sense.
So centering is not useful because it doesn't
00:09:02.760 --> 00:09:05.970
do anything for the interpretation.
You will have to interpret the results
00:09:05.970 --> 00:09:11.160
with this kind of plot anyway,
and centering will be harmful
00:09:11.160 --> 00:09:15.900
for this plot because it makes
forming these plots more difficult,
00:09:15.900 --> 00:09:20.040
because you have to back convert
your variables to the original
00:09:20.040 --> 00:09:25.980
metric to get the predictions correct.
So because of these considerations
00:09:25.980 --> 00:09:31.770
my recommendation is: never center your
data. It's not useful and it is harmful.