WEBVTT
Kind: captions
Language: en

00:00:00.150 --> 00:00:05.280
It's fairly common for authors to center their variables before they form the interaction term.

00:00:05.280 --> 00:00:09.060
In this video I will take a look at whether that is actually necessary

00:00:09.060 --> 00:00:14.130
and whether you should do that or not. In Heckman's paper the authors argued

00:00:14.130 --> 00:00:17.310
that they centered the variables to reduce multicollinearity.

00:00:17.310 --> 00:00:23.400
The idea of centering and multicollinearity is that if you have

00:00:23.400 --> 00:00:30.720
X and M and then you form the product of X and M, then the product will be correlated with both X and

00:00:30.720 --> 00:00:37.110
M, because those two variables form the interaction, and by centering we can reduce those correlations.

00:00:37.110 --> 00:00:43.260
So let's take a look at some data. We have two random variables here, x1 and x2.

00:00:43.260 --> 00:00:48.570
Here x1 and x2 have means of two, and here we have centered

00:00:48.570 --> 00:00:54.000
the variables x1 and x2 to have means of 0. So the idea of centering is that you take the

00:00:54.000 --> 00:00:59.070
original variable and then you subtract the mean, and that will make the mean of the variable

00:00:59.070 --> 00:01:05.370
0, and we say that the variable is centered. The bar symbol over the x means that it's

00:01:05.370 --> 00:01:12.420
the mean of that variable. And standardization is centering and dividing

00:01:12.420 --> 00:01:20.460
by the standard deviation. We can see here that x1 and x2 are not very strongly correlated. So

00:01:20.460 --> 00:01:26.520
there's no particular pattern here. But when we multiply x1 and x2 together,

00:01:26.520 --> 00:01:32.790
then that product is highly correlated with x1 and x2. So there's a strong statistical relationship.
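NOTE
The centering, standardization, and product correlations described above can be sketched with simulated data. This is a minimal illustration, not the data shown in the video: the normal distribution, unit variance, sample size, seed, and the helper names center, standardize, and corr are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for reproducibility
n = 10_000                      # assumed sample size

# Two independent variables with means of 2 (unit variance is an assumption).
x1 = rng.normal(loc=2.0, scale=1.0, size=n)
x2 = rng.normal(loc=2.0, scale=1.0, size=n)


def center(x):
    """Subtract the sample mean so that the variable has mean 0."""
    return x - x.mean()


def standardize(x):
    """Center, then divide by the sample standard deviation."""
    return center(x) / x.std(ddof=1)


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return np.corrcoef(a, b)[0, 1]


x1c, x2c = center(x1), center(x2)

r_raw = corr(x1, x1 * x2)          # x1 against the product of the raw variables
r_centered = corr(x1c, x1c * x2c)  # x1 against the product of the centered variables

print(f"corr(x1, x1*x2), raw:      {r_raw:.3f}")
print(f"corr(x1, x1*x2), centered: {r_centered:.3f}")
```

Under these assumptions the raw correlation should come out clearly positive and the centered one close to zero, which is the linear correlation that centering is said to reduce.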
00:01:32.790 --> 00:01:38.040
When we center the variables, the bivariate relationship here stays the same.

00:01:38.040 --> 00:01:45.900
But we can see that the relationship between x1 and x2 and their product is quite different.

00:01:45.900 --> 00:01:52.320
There's still a strong statistical relationship: when x1 or x2 goes to 0,

00:01:52.320 --> 00:01:55.980
there's no variation in the data, and then it

00:01:55.980 --> 00:02:03.210
spreads out when x1 and x2 increase. So there is still a strong statistical association,

00:02:03.210 --> 00:02:08.400
but it is no longer a linear association. So what's the implication for regression

00:02:08.400 --> 00:02:15.660
analysis of this centering? On the left-hand side we have the

00:02:15.660 --> 00:02:20.460
regression analysis for the data that are not centered,

00:02:20.460 --> 00:02:24.060
and on the right-hand side we have the regression analysis for the centered data.

00:02:24.060 --> 00:02:32.730
And we can see that what the centering does for the regression of Y on x1

00:02:32.730 --> 00:02:39.600
and x2 is that it just changes the intercept. So only the intercept is different, and the

00:02:39.600 --> 00:02:46.980
first-order effects of x1 and x2 are the same. Which is quite natural, because when you center,

00:02:46.980 --> 00:02:53.160
you're simply subtracting something from x1 and something from x2, and

00:02:53.160 --> 00:02:58.500
because you subtract the same number from every observation, that will only alter the intercept,

00:02:58.500 --> 00:03:04.230
because it doesn't affect the covariance of x1 with x2 or the covariances

00:03:04.230 --> 00:03:07.620
between those two variables and Y. Those are unaffected by centering,

00:03:07.620 --> 00:03:13.470
so centering only affects means, and in a normal regression analysis it only affects the intercept.
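NOTE
The claim that centering changes only the intercept in a regression without an interaction can be checked numerically. This is a sketch under stated assumptions: the data-generating coefficients, the seed, and the helper name ols are made up for illustration and do not come from the video.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n = 500
x1 = rng.normal(2.0, 1.0, n)
x2 = rng.normal(2.0, 1.0, n)
# Hypothetical data-generating model; the coefficients are invented.
y = 1.0 + 0.5 * x1 + 0.3 * x2 + rng.normal(0.0, 1.0, n)


def ols(y, *predictors):
    """Least-squares coefficients: intercept first, then one per predictor."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_raw = ols(y, x1, x2)                          # uncentered predictors
b_cen = ols(y, x1 - x1.mean(), x2 - x2.mean())  # centered predictors

print("raw:     ", np.round(b_raw, 3))
print("centered:", np.round(b_cen, 3))
```

The two slope estimates agree, while the centered intercept becomes the sample mean of Y.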
00:03:13.470 --> 00:03:20.130
The downside of centering is that when we calculate predictions, here the predictions

00:03:20.130 --> 00:03:26.370
for this uncentered model are on the original metric. So we will get predictions in whatever

00:03:26.370 --> 00:03:31.590
units Y is in, and if we calculate predictions using this centered model, then the predictions

00:03:31.590 --> 00:03:37.770
will be off by the amount that we centered. So for example, if we're predicting salary

00:03:37.770 --> 00:03:43.050
and let's say the uncentered model would give 10000 euros per year,

00:03:43.050 --> 00:03:48.960
then the centered model could give minus 2000 euros, which doesn't make sense

00:03:48.960 --> 00:03:54.630
unless we back-convert or back-translate that effect to the non-centered variables.

00:03:54.630 --> 00:04:01.770
So centering makes predictions, and making plots that rely on predictions, more difficult, and

00:04:01.770 --> 00:04:08.070
that's important for interactions for reasons that I'll explain in the last slide.

00:04:08.070 --> 00:04:14.850
When we add an interaction term, we can see now that there are some more differences.

00:04:14.850 --> 00:04:20.460
Importantly, the differences are only in the first three coefficients.

00:04:20.460 --> 00:04:27.210
So the intercept again is different, which is expected, but now the x1 and x2 coefficients are

00:04:27.210 --> 00:04:32.700
different, but the interaction of x1 and x2 is the exact same number.

00:04:32.700 --> 00:04:39.090
So centering actually doesn't influence the interaction term at all.

00:04:39.090 --> 00:04:45.210
It influences only the first-order coefficients. So is that something that you want to do or not?

00:04:45.210 --> 00:04:52.290
To answer that question we have to consider what exactly the centering means,

00:04:52.290 --> 00:04:56.970
and what exactly it means that we have this interaction term here.
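NOTE
The comparison above, where only the first three coefficients differ once the interaction is added, can be reproduced with simulated data. This is an illustrative sketch: the data-generating coefficients, the seed, and the helper name ols_interaction are assumptions, not the model from the video.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed
n = 500
x1 = rng.normal(2.0, 1.0, n)
x2 = rng.normal(2.0, 1.0, n)
# Hypothetical data-generating model with an interaction; coefficients are invented.
y = 1.0 + 0.5 * x1 + 0.3 * x2 + 0.4 * x1 * x2 + rng.normal(0.0, 1.0, n)


def ols_interaction(y, a, b):
    """Least-squares fit of y on an intercept, a, b, and the product a*b."""
    X = np.column_stack([np.ones(len(y)), a, b, a * b])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


m1, m2 = x1.mean(), x2.mean()
raw = ols_interaction(y, x1, x2)            # product formed from raw variables
cen = ols_interaction(y, x1 - m1, x2 - m2)  # product formed from centered variables

names = ["intercept", "x1", "x2", "x1:x2"]
for name, b_raw, b_cen in zip(names, raw, cen):
    print(f"{name:>9}  raw={b_raw: .4f}  centered={b_cen: .4f}")
```

The last row prints the same estimate in both columns, while the intercept and the two first-order coefficients change.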
00:04:56.970 --> 00:05:06.960
Let's take a look at a graph. So here the x1 and x2 effects are the effects when the other variable is 0, and here

00:05:06.960 --> 00:05:14.550
the x1 and x2 effects are the main effects. So when x1 and x2 are at their means,

00:05:14.550 --> 00:05:20.400
then that's what the x1 and x2 effects are. What that means can be understood

00:05:20.400 --> 00:05:26.880
by looking at this graphically. So we have here a space, and there is a plane

00:05:26.880 --> 00:05:36.300
in the space. Here we have x1 on this axis, we have x2 on this axis, and then we have Y here.

00:05:36.300 --> 00:05:42.330
So when we have two coefficients, or two variables, in a regression analysis as

00:05:42.330 --> 00:05:46.320
independent variables, then the regression is a plane in three-dimensional space.

00:05:46.320 --> 00:05:57.210
And we can see the plane here, and because of the interaction, the strength of the effect of x1 on Y

00:05:57.210 --> 00:06:07.440
is contingent on the value of x2. So here, when x2 is at 0, Y

00:06:07.440 --> 00:06:10.590
increases only a little with x1, so the effect is not that great.

00:06:10.590 --> 00:06:19.500
When x2 is at 5, the effect is a lot greater, so we see a lot steeper slope here.

00:06:19.500 --> 00:06:25.890
So the idea is that the regression slope of x1 changes as a function of x2.

00:06:25.890 --> 00:06:31.020
Also the intercept changes, so this line goes down here.

00:06:31.020 --> 00:06:39.720
So what centering does is this. Normally, when we have an interaction term, we take the effect of x1.

00:06:39.720 --> 00:06:43.980
So a regression with an interaction gives you the effect of

00:06:43.980 --> 00:06:50.880
x1, the effect of x2, and the effect of their product. When we don't center our data, the

00:06:50.880 --> 00:06:56.460
effect of x1 is this blue line here, so it's the effect of x1 when x2 is 0.
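NOTE
The conditional slopes described above can be written out as equations. The symbols below (beta for the uncentered coefficients, gamma for the centered ones, bars for sample means) are notation assumed here for illustration and do not appear in the video.

```latex
\begin{align}
  % Uncentered model and the slope of x_1 at a given x_2
  E[Y \mid x_1, x_2] &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 \\
  \frac{\partial E[Y \mid x_1, x_2]}{\partial x_1} &= \beta_1 + \beta_3 x_2
    \quad\Rightarrow\quad \text{slope at } x_2 = 0 \text{ is } \beta_1 \\
  % Centered model: same surface, different reference point
  E[Y \mid x_1, x_2] &= \gamma_0 + \gamma_1 (x_1 - \bar{x}_1) + \gamma_2 (x_2 - \bar{x}_2)
    + \gamma_3 (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) \\
  \gamma_3 &= \beta_3, \qquad
  \gamma_1 = \beta_1 + \beta_3 \bar{x}_2, \qquad
  \gamma_2 = \beta_2 + \beta_3 \bar{x}_1 \\
  \gamma_0 &= \beta_0 + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2 + \beta_3 \bar{x}_1 \bar{x}_2
\end{align}
```

Expanding the centered model and matching terms gives the last two lines: the interaction coefficient is unchanged, and each first-order coefficient becomes the slope evaluated at the mean of the other variable.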
00:06:56.460 --> 00:07:03.720
Similarly, the effect of x2 is the effect of x2 when x1

00:07:03.720 --> 00:07:11.910
is 0. When we center, instead of taking the effect of x1 when x2 is at 0, we

00:07:11.910 --> 00:07:19.320
take the effect of x1 when x2 is at its mean. So we take this green line in the middle.

00:07:19.320 --> 00:07:26.880
So centering just influences which of these possible lines we take:

00:07:26.880 --> 00:07:34.020
from here, from here, or perhaps all the way from the other end of the data.

00:07:34.020 --> 00:07:41.220
So it just changes which part of the regression plane we are looking at.

00:07:41.220 --> 00:07:47.610
But the problem is that you have to look at multiple places.

00:07:47.610 --> 00:07:52.860
So you can't summarize this plane by saying that the effect of x1 is this line.

00:07:52.860 --> 00:07:56.910
You have to show multiple lines. So it doesn't really matter which of these

00:07:56.910 --> 00:08:00.510
lines you show in your regression table. And that's the problem.

00:08:00.510 --> 00:08:07.710
So you have to do these interaction plots. You have to show multiple plots, so

00:08:07.710 --> 00:08:16.200
you should show that the slope of x1 depends on the value of x2. And which of

00:08:16.200 --> 00:08:22.500
these lines we show in the regression table is arbitrary, so it doesn't really matter, because

00:08:22.500 --> 00:08:28.740
we have to present this kind of plot anyway. So what we show here, whether we have the effect

00:08:28.740 --> 00:08:34.140
of x1 here to be the blue, green or red line, doesn't really make a difference.

00:08:34.140 --> 00:08:38.700
We have to show all the lines anyway.
The problem with centering is that,

00:08:38.700 --> 00:08:45.360
once we center the variables, then in the interaction plot the predicted values of Y

00:08:45.360 --> 00:08:49.110
will be incorrect by the amount that we centered the data.

00:08:49.110 --> 00:08:56.580
So we can no longer do predictions usefully, or we have to convert the predictions back to

00:08:56.580 --> 00:09:02.760
the non-centered metric for them to make sense. So centering is not useful, because it doesn't

00:09:02.760 --> 00:09:05.970
do anything for the interpretation. You will have to interpret the results

00:09:05.970 --> 00:09:11.160
with this kind of plot anyway, and centering will be harmful

00:09:11.160 --> 00:09:15.900
for this plot, because it makes forming these plots more difficult,

00:09:15.900 --> 00:09:20.040
because you have to back-convert your variables to the original

00:09:20.040 --> 00:09:25.980
metric to get the predictions correct. So because of these considerations,

00:09:25.980 --> 00:09:31.770
my recommendation is: never center your data. It's not useful and it is harmful.
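NOTE
The back-conversion step mentioned above can be sketched as follows: predicted values for an interaction plot computed from an uncentered fit, from a centered fit fed original-metric values directly, and from a centered fit after shifting the inputs by the sample means. The simulated data, coefficients, seed, plot levels, and the helper names fit and predict are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)  # arbitrary seed
n = 500
x1 = rng.normal(2.0, 1.0, n)
x2 = rng.normal(2.0, 1.0, n)
# Hypothetical data-generating model; the coefficients are invented.
y = 1.0 + 0.5 * x1 + 0.3 * x2 + 0.4 * x1 * x2 + rng.normal(0.0, 1.0, n)


def fit(y, a, b):
    """Least-squares fit of y on an intercept, a, b, and the product a*b."""
    X = np.column_stack([np.ones(len(y)), a, b, a * b])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


def predict(coef, a, b):
    """Predicted Y for predictor values a and b under the fitted coefficients."""
    return coef[0] + coef[1] * a + coef[2] * b + coef[3] * a * b


m1, m2 = x1.mean(), x2.mean()
raw = fit(y, x1, x2)
cen = fit(y, x1 - m1, x2 - m2)

# Interaction-plot grid on the original metric: x1 varies, x2 held at three levels.
grid_x1 = np.linspace(x1.min(), x1.max(), 5)
for level in (m2 - x2.std(), m2, m2 + x2.std()):
    wanted = predict(raw, grid_x1, level)               # uncentered model
    naive = predict(cen, grid_x1, level)                # original values into centered model
    converted = predict(cen, grid_x1 - m1, level - m2)  # inputs shifted by the means first
    print(f"x2={level:5.2f}  wanted={wanted[0]: .3f}  "
          f"naive={naive[0]: .3f}  converted={converted[0]: .3f}")
```

In this sketch the centered fit reproduces the uncentered predictions only after the plotting values are shifted by the same means that were subtracted before fitting, which is the extra step the uncentered model avoids.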