WEBVTT Kind: captions Language: en 00:00:01.344 --> 00:00:04.020 One of the key assumptions in regression analysis 00:00:04.020 --> 00:00:07.538 with OLS estimation is that there's no endogeneity. 00:00:07.538 --> 00:00:10.860 The endogeneity issue has been ignored in the past, 00:00:10.860 --> 00:00:14.370 but it has been receiving increased attention in recent years 00:00:14.370 --> 00:00:15.790 in many editorials. 00:00:15.888 --> 00:00:18.960 And journals are increasingly requiring 00:00:18.960 --> 00:00:21.180 that the authors who submit to the journals 00:00:21.180 --> 00:00:24.630 address the issue of endogeneity explicitly. 00:00:24.630 --> 00:00:28.980 This has been a difficult issue to identify, 00:00:28.980 --> 00:00:33.621 because endogeneity cannot be tested directly from the regression results; 00:00:33.621 --> 00:00:36.450 it will require more advanced modeling techniques. 00:00:36.862 --> 00:00:39.390 Let's look at what the issue of endogeneity is about, 00:00:39.390 --> 00:00:40.740 and I will explain 00:00:40.740 --> 00:00:43.620 how you can deal with this issue in later videos. 00:00:45.000 --> 00:00:51.060 To understand endogeneity, it is useful to start from an experimental design. 00:00:51.060 --> 00:00:54.450 So in an experimental design, the assignment, 00:00:54.450 --> 00:00:59.199 here R, is a random assignment to treatment and control, 00:00:59.199 --> 00:01:02.310 then we administer some kind of treatment to one group, 00:01:02.310 --> 00:01:04.110 the other group doesn't receive the treatment, 00:01:04.110 --> 00:01:07.080 we measure the outcome variable of interest, 00:01:07.080 --> 00:01:10.980 then the difference between these two measures, 00:01:10.980 --> 00:01:14.550 post-treatment, can be interpreted as a causal effect. 00:01:15.079 --> 00:01:18.390 So what justifies interpreting this difference as causal? 00:01:18.743 --> 00:01:20.903 It is the assumption that 00:01:20.903 --> 00:01:22.463 R is exogenous. 
00:01:22.659 --> 00:01:25.530 So the R here, the random assignment, 00:01:25.530 --> 00:01:30.259 doesn't depend on the variable that we are interested in studying. 00:01:30.416 --> 00:01:33.337 For example, if we test a medicine, then 00:01:33.415 --> 00:01:36.349 who gets the medicine and who gets a placebo 00:01:36.349 --> 00:01:39.240 shouldn't depend on the initial health of the people. 00:01:39.240 --> 00:01:43.710 So it's important that this is randomized independently of 00:01:43.710 --> 00:01:44.790 what we are studying, 00:01:44.790 --> 00:01:46.807 and that guarantees exogeneity. 00:01:47.415 --> 00:01:49.445 If R is endogenous, 00:01:49.445 --> 00:01:52.740 it means that R depends somehow 00:01:52.740 --> 00:01:55.470 on the variable that we're studying, 00:01:55.470 --> 00:01:57.146 for example, people's health. 00:01:57.695 --> 00:02:01.187 Let's say that we have a medicine 00:02:01.971 --> 00:02:04.374 that has some side effects. 00:02:04.394 --> 00:02:06.614 And we have people who vary 00:02:06.614 --> 00:02:07.584 in how sick they are, 00:02:08.074 --> 00:02:11.130 and we have people who can choose 00:02:11.130 --> 00:02:13.200 whether they go to the treatment or control. 00:02:13.965 --> 00:02:15.765 In that scenario, 00:02:16.204 --> 00:02:18.900 people who are not that sick 00:02:18.900 --> 00:02:21.990 will choose to go to the control to avoid the side effects, 00:02:21.990 --> 00:02:24.360 and only those people who are really sick 00:02:24.360 --> 00:02:26.340 choose to go to the treatment group. 00:02:26.987 --> 00:02:31.830 If that happens, then the assignment to the treatment and control 00:02:31.830 --> 00:02:33.990 is no longer exogenous, 00:02:33.990 --> 00:02:35.940 instead, it's endogenous, 00:02:35.940 --> 00:02:39.150 because it depends on the health of the people, 00:02:39.150 --> 00:02:41.130 the characteristic that we study. 
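The self-selection scenario described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not part of the lecture: the function name `simulate`, the assumed treatment effect of 2.0, the normal distributions, and the rule that everyone with below-average health opts into treatment are all hypothetical choices made for illustration.

```python
import random
from statistics import mean


def simulate(self_selected, n=20000, true_effect=2.0, seed=0):
    """Return the post-treatment difference in mean outcome (treated - control).

    All numbers here are illustrative assumptions, not values from the lecture.
    """
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        health = rng.gauss(0.0, 1.0)  # initial health; higher means healthier
        if self_selected:
            # Endogenous R: the sicker half of the population opts into treatment.
            gets_treatment = health < 0.0
        else:
            # Exogenous R: a coin flip that ignores initial health.
            gets_treatment = rng.random() < 0.5
        effect = true_effect if gets_treatment else 0.0
        outcome = health + effect + rng.gauss(0.0, 0.5)
        (treated if gets_treatment else control).append(outcome)
    return mean(treated) - mean(control)


random_diff = simulate(self_selected=False)
selected_diff = simulate(self_selected=True)
print(f"randomized assignment:    {random_diff:.2f}")
print(f"self-selected assignment: {selected_diff:.2f}")
```

Under these assumptions, the randomized difference should land close to the assumed effect of 2.0, while the self-selected difference is much smaller, because the treated group started out sicker than the control group.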
00:02:42.463 --> 00:02:45.228 Because R is endogenous, 00:02:45.228 --> 00:02:48.330 there are initial differences in health between these two groups, 00:02:48.330 --> 00:02:53.760 and then we can no longer interpret this difference 00:02:53.760 --> 00:02:55.890 after the treatment as a causal effect. 00:02:56.400 --> 00:02:58.530 So that's clearly a problem. 00:02:59.059 --> 00:03:03.060 Another way of understanding endogeneity in a multiple regression context 00:03:03.060 --> 00:03:04.290 is to look at the error term. 00:03:04.290 --> 00:03:08.580 So here we have a regression model in a path diagram representation, 00:03:08.580 --> 00:03:10.830 so we have the y, the dependent variable, 00:03:10.830 --> 00:03:13.350 three x's, the independent variables, 00:03:13.350 --> 00:03:15.510 we have the intercept and the error term. 00:03:16.020 --> 00:03:20.250 And the error term here represents all possible causes of y 00:03:20.250 --> 00:03:22.770 that are not included in the model. 00:03:22.770 --> 00:03:25.470 So everything that can cause y 00:03:25.470 --> 00:03:28.560 that is not included in the list of x's here 00:03:28.560 --> 00:03:30.120 goes to the error term. 00:03:30.846 --> 00:03:34.890 If the error term or any of these omitted causes correlate 00:03:34.890 --> 00:03:36.420 with any of the included causes, 00:03:36.420 --> 00:03:40.980 then we say that, for example, x1 here 00:03:40.980 --> 00:03:44.100 becomes an endogenous explanatory variable. 00:03:44.100 --> 00:03:46.710 So if a variable is correlated with the error term, 00:03:46.710 --> 00:03:49.500 it's endogenous, and that causes problems. 00:03:49.500 --> 00:03:54.780 The general condition that one or more of these variables 00:03:54.780 --> 00:03:58.170 are correlated with the error term is called endogeneity. 00:03:59.405 --> 00:04:00.960 So that's the problem. 
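The path diagram described above can be written out as an equation. The notation below (the betas, u for the error term) is an assumption made for illustration, since the slide itself is not part of the transcript. The last line is the standard large-sample result for a regression with a single explanatory variable, and it shows why a correlation between x1 and the error term is a problem.

```latex
% Regression model from the path diagram: three x's, an intercept, and an error term u
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u

% No-endogeneity (exogeneity) assumption: no included cause correlates with the error term
\operatorname{Cov}(x_j, u) = 0 \quad \text{for } j = 1, 2, 3

% Special case with one explanatory variable: the OLS slope converges to
\operatorname{plim} \hat{\beta}_1 = \beta_1 + \frac{\operatorname{Cov}(x_1, u)}{\operatorname{Var}(x_1)}
```

If the covariance in the last line is zero, the estimate converges to the true slope; if it is not zero, the second term does not vanish no matter how large the sample is.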
00:04:00.960 --> 00:04:05.640 Under no endogeneity, we assume that the error term does not depend on, 00:04:05.640 --> 00:04:09.120 or is not correlated with, any of the explanatory variables; 00:04:09.120 --> 00:04:14.261 if it is, OLS regression will be inconsistent and biased. 00:04:15.000 --> 00:04:18.270 So how does endogeneity arise? 00:04:18.270 --> 00:04:23.483 There are three basic mechanisms that are useful to understand. 00:04:23.483 --> 00:04:25.560 A first simple mechanism is that 00:04:25.560 --> 00:04:26.880 there is a common cause, 00:04:27.213 --> 00:04:30.003 let's call it E, of X and Y, 00:04:30.003 --> 00:04:32.280 that is not included in the model. 00:04:32.280 --> 00:04:37.381 For example, if we're studying the effects of CEO gender on profitability, 00:04:37.832 --> 00:04:41.580 and there's a common cause, industry, 00:04:41.580 --> 00:04:44.822 so that some industries are more likely to hire women, 00:04:44.822 --> 00:04:48.155 some industries are more profitable than others. 00:04:48.449 --> 00:04:52.470 That's a common cause of X and Y producing a spurious correlation. 00:04:52.470 --> 00:04:56.790 If we don't include industry as a control variable, 00:04:56.790 --> 00:05:00.690 then it's an omitted cause that goes to the error term, 00:05:00.690 --> 00:05:02.280 and it's correlated with X. 00:05:02.947 --> 00:05:06.007 A more general presentation is that 00:05:06.007 --> 00:05:10.200 X is simply correlated with some unmodeled causes of Y, 00:05:10.200 --> 00:05:13.110 and that could happen for multiple different reasons. 00:05:13.110 --> 00:05:15.450 I'll provide examples at the end of this video. 00:05:16.097 --> 00:05:20.425 Then a special case that is sometimes of interest is simultaneity. 00:05:20.582 --> 00:05:24.300 So X and Y have a reciprocal causal relationship, 00:05:24.712 --> 00:05:28.012 so X causes Y and Y causes X. 
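The reciprocal relationship between X and Y can be worked through with two simple equations. This is a sketch in assumed notation: the coefficients a and b and the error terms u and v are hypothetical names, and the derivation assumes that u and v are uncorrelated and that ab is not equal to 1.

```latex
% Two-way causation: X causes Y and Y causes X
y = a x + u, \qquad x = b y + v

% Substitute the first equation into the second and solve for x
x = b (a x + u) + v
\quad\Longrightarrow\quad
x = \frac{b u + v}{1 - a b}

% X therefore contains the error term of Y, so the two are correlated
\operatorname{Cov}(x, u) = \frac{b \, \sigma_u^2}{1 - a b} \neq 0 \quad \text{whenever } b \neq 0
```

So as soon as Y has any effect on X, X is correlated with the error term of the Y equation, which is exactly the endogeneity condition.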
00:05:28.208 --> 00:05:30.038 So if X causes Y, 00:05:30.038 --> 00:05:34.920 then Y must include the error term of X, 00:05:34.920 --> 00:05:37.866 because X is a function of Y plus the error term of X, 00:05:37.866 --> 00:05:39.006 and the other way around. 00:05:39.006 --> 00:05:41.820 So that causes an endogeneity problem 00:05:41.820 --> 00:05:43.116 if you have these two-way paths. 00:05:43.586 --> 00:05:46.200 All of these issues we can deal with 00:05:46.200 --> 00:05:48.570 if we understand the issue and we know 00:05:48.570 --> 00:05:50.130 where the problem is, 00:05:50.130 --> 00:05:53.490 and we have a bit more data, more variables, 00:05:53.490 --> 00:05:56.070 but that's all going to be covered in later videos. 00:05:56.070 --> 00:05:58.050 Now it's important to understand 00:05:58.050 --> 00:05:59.160 what the problem is, 00:05:59.160 --> 00:06:02.310 and then, later on, we will talk about 00:06:02.310 --> 00:06:03.870 how to deal with the problem. 00:06:04.419 --> 00:06:07.590 Let's take a look at Deephouse's paper, 00:06:07.590 --> 00:06:08.730 the market share. 00:06:08.730 --> 00:06:12.840 So I demonstrated this before in the context of control variables. 00:06:13.350 --> 00:06:18.297 The idea is that larger firms are more strategically deviant. 00:06:25.101 --> 00:06:27.151 Larger firms are more strategically deviant, 00:06:27.220 --> 00:06:28.510 the positive correlation here. 00:06:28.510 --> 00:06:31.909 And larger firms are less profitable. 00:06:31.909 --> 00:06:36.070 If we omit the market share from the equation, 00:06:36.070 --> 00:06:38.196 we will get an omitted variable bias. 00:06:38.196 --> 00:06:40.450 And what will happen now is 00:06:40.450 --> 00:06:42.520 that market share is an omitted cause, 00:06:42.520 --> 00:06:46.060 it is a cause of ROA, not included in the model. 00:06:46.060 --> 00:06:50.342 And it will be included in the error term in the regression. 
00:06:50.342 --> 00:06:53.470 So anything that is supposed to be causing ROA 00:06:53.470 --> 00:06:55.690 that is not included in the model 00:06:55.690 --> 00:06:58.630 will be represented by the error term. 00:06:59.061 --> 00:07:02.290 And we know from these empirical results that 00:07:02.290 --> 00:07:05.290 market share and strategic deviation are correlated, 00:07:05.290 --> 00:07:10.240 therefore strategic deviation is now correlated with the error term. 00:07:10.240 --> 00:07:13.930 And strategic deviation becomes endogenous. 00:07:13.930 --> 00:07:17.440 Of course, whether a variable really is endogenous or not, 00:07:17.440 --> 00:07:21.010 we cannot really say based on the regression results. 00:07:21.010 --> 00:07:25.720 We need some additional variables called instrumental variables, 00:07:25.720 --> 00:07:26.814 that I'll talk about later. 00:07:26.814 --> 00:07:30.970 Or we have to argue the no endogeneity assumption, 00:07:30.970 --> 00:07:34.240 or exogeneity, based on existing theory. 00:07:36.298 --> 00:07:38.977 So this leads to omitted variable bias 00:07:38.977 --> 00:07:43.990 and the effect of strategic deviation will be overestimated by threefold. 00:07:45.421 --> 00:07:48.340 Let's take a look at an endogeneity problem, 00:07:48.340 --> 00:07:48.850 another one. 00:07:49.340 --> 00:07:53.020 So we have an investment in new factories, 00:07:53.020 --> 00:07:56.200 whether a company decides to invest in new factories or not. 00:07:56.200 --> 00:07:59.140 And we have these investment decisions, 00:07:59.140 --> 00:08:02.290 we are trying to explain the company's return on assets 00:08:02.290 --> 00:08:05.770 with those investments. 00:08:09.130 --> 00:08:11.170 The question, 00:08:11.170 --> 00:08:15.370 do I have an endogeneity problem, begins by asking, 00:08:15.370 --> 00:08:18.880 what do investments in new factories depend on? 00:08:18.880 --> 00:08:23.650 So why do some companies invest in new factories and others don't? 
00:08:23.650 --> 00:08:25.060 So what causes the variation? 00:08:26.040 --> 00:08:28.540 So what does the investment in new factories depend on? 00:08:29.011 --> 00:08:31.801 Well, it probably depends on company strategy. 00:08:31.801 --> 00:08:33.972 If the company's strategy is to grow, 00:08:33.972 --> 00:08:36.130 they will probably invest in new factories. 00:08:36.130 --> 00:08:38.470 And if they don't want to grow, 00:08:38.470 --> 00:08:40.600 they are probably not investing in factories. 00:08:40.600 --> 00:08:41.770 So it's that simple. 00:08:42.437 --> 00:08:46.510 Now, what is the no endogeneity assumption here? 00:08:47.020 --> 00:08:51.070 Unless we have this firm strategy as a control variable, 00:08:51.070 --> 00:08:54.370 we are assuming that return on assets 00:08:54.370 --> 00:08:58.720 otherwise is completely independent of company strategy. 00:08:58.720 --> 00:09:02.830 So company strategy can influence ROA 00:09:02.830 --> 00:09:05.170 only through influencing investments. 00:09:05.503 --> 00:09:06.973 That is of course implausible. 00:09:07.052 --> 00:09:11.732 Strategy influences performance in multiple different ways, 00:09:11.732 --> 00:09:14.784 so we have an omitted common cause, strategy. 00:09:14.803 --> 00:09:17.803 Companies' investments depend on strategy, 00:09:17.803 --> 00:09:20.203 ROA depends on strategy. 00:09:21.370 --> 00:09:24.880 Partly through investments but also through other means. 00:09:24.880 --> 00:09:27.550 If we don't control for strategy 00:09:27.550 --> 00:09:28.690 in this kind of model, 00:09:28.690 --> 00:09:30.820 we will have an endogeneity problem. 00:09:32.310 --> 00:09:37.960 So this endogeneity problem is explained really well 00:09:37.960 --> 00:09:41.140 by this editorial by Ketokivi and Guide, 00:09:41.140 --> 00:09:43.139 in the Journal of Operations Management. 
00:09:43.139 --> 00:09:47.452 And the problem is that we assume that 00:09:47.452 --> 00:09:52.521 all other causes are independent of the included causes. 00:09:52.521 --> 00:09:57.319 And that is implausible. And the problem is that 00:09:57.319 --> 00:10:00.460 our estimates will be biased and inconsistent. 00:10:01.578 --> 00:10:04.000 Then they explained an example, 00:10:04.000 --> 00:10:05.890 where you could reasonably also argue that 00:10:05.890 --> 00:10:09.220 a causal effect goes in a different direction than 00:10:09.220 --> 00:10:10.870 what the author said, 00:10:10.870 --> 00:10:14.500 and if the author doesn't really take that into consideration, 00:10:14.500 --> 00:10:18.220 then that's game over for the paper. 00:10:19.200 --> 00:10:24.850 They also explained that the endogeneity issue must be argued; 00:10:24.850 --> 00:10:26.890 if you cannot do it empirically, 00:10:26.890 --> 00:10:28.870 you have to argue based on theory. 00:10:28.870 --> 00:10:32.080 So why do you think that your independent variable, 00:10:32.080 --> 00:10:38.140 for example, investment in a new factory, 00:10:38.140 --> 00:10:41.200 is independent of any other causes of ROA? 00:10:41.200 --> 00:10:42.910 Then you have to figure out 00:10:42.910 --> 00:10:44.740 what causes ROA differences, 00:10:44.740 --> 00:10:45.760 such as company strategies, 00:10:45.760 --> 00:10:49.690 and you have to argue that manufacturing plant investments, 00:10:49.690 --> 00:10:50.710 or factory investments, 00:10:50.710 --> 00:10:52.150 are uncorrelated with strategy. 00:10:52.150 --> 00:10:54.160 That's an implausible assumption, 00:10:54.160 --> 00:10:55.960 so you have an endogeneity problem.
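The factory investment example can be sketched as a simulation in which strategy is a common cause of both investment and ROA. Everything specific here is an assumption made for illustration: the coefficients (0.8, 0.5, 1.0), the normal distributions, treating investment as a continuous level rather than a yes/no decision, and the helper names `cov`, `slope_simple`, and `slope_controlled`. It is not the analysis from the lecture or the editorial.

```python
import random


def cov(a, b):
    """Sample covariance of two equally long lists."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)


def slope_simple(y, x):
    """OLS slope of y on x (with intercept), omitting any controls."""
    return cov(x, y) / cov(x, x)


def slope_controlled(y, x, z):
    """OLS coefficient on x when y is regressed on x and z (with intercept)."""
    sxx, szz, sxz = cov(x, x), cov(z, z), cov(x, z)
    sxy, szy = cov(x, y), cov(z, y)
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)


rng = random.Random(1)
n = 20000
true_effect = 0.5  # assumed effect of investment on ROA

# Strategy is the common cause: it drives investment and also ROA directly.
strategy = [rng.gauss(0.0, 1.0) for _ in range(n)]
investment = [0.8 * s + rng.gauss(0.0, 1.0) for s in strategy]
roa = [true_effect * i + 1.0 * s + rng.gauss(0.0, 1.0)
       for i, s in zip(investment, strategy)]

naive = slope_simple(roa, investment)
controlled = slope_controlled(roa, investment, strategy)
print(f"strategy omitted:    {naive:.2f}")
print(f"strategy controlled: {controlled:.2f}")
```

With these assumed numbers, the estimate that omits strategy comes out well above the assumed effect of 0.5, because strategy sits in the error term and is correlated with investment, while the estimate that controls for strategy should be close to 0.5.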