WEBVTT 00:00:00.030 --> 00:00:03.990 I will next explain the idea of regression analysis. 00:00:03.990 --> 00:00:06.090 Regression analysis is one of the most 00:00:06.090 --> 00:00:08.520 commonly used analysis tools in quantitative research, 00:00:08.520 --> 00:00:11.820 and most applications of quantitative techniques 00:00:11.820 --> 00:00:16.200 can be thought of as special cases or extensions of this particular analysis. 00:00:16.200 --> 00:00:21.630 Regression analysis results are typically presented as a table like this. 00:00:21.630 --> 00:00:24.360 So here we have four different regression models, 00:00:24.360 --> 00:00:27.120 we have different regression coefficients, 00:00:27.120 --> 00:00:29.310 we have different model indices. 00:00:29.310 --> 00:00:34.230 There are certain assumptions behind this table that you need to understand, 00:00:34.230 --> 00:00:36.030 and you also need to understand 00:00:36.030 --> 00:00:36.960 what the numbers tell us. 00:00:36.960 --> 00:00:39.750 And we will be looking at these kinds of tables 00:00:39.750 --> 00:00:40.980 in the next couple of videos, 00:00:40.980 --> 00:00:42.180 but I will first explain 00:00:42.180 --> 00:00:45.690 what regression analysis is actually about. 00:00:45.690 --> 00:00:51.390 So in regression analysis, we have two kinds of variables, 00:00:51.390 --> 00:00:55.380 we have one dependent variable that we want to explain, 00:00:55.380 --> 00:00:59.520 for example, a company's profitability (ROA) could be a dependent variable. 00:00:59.520 --> 00:01:02.730 Then we have multiple independent variables. 00:01:02.730 --> 00:01:05.520 The independent variables are variables that 00:01:05.520 --> 00:01:07.920 we use to explain the dependent variable, 00:01:07.920 --> 00:01:12.180 for example, we could have CEO gender, company size, 00:01:12.180 --> 00:01:13.440 and company industry.
00:01:13.440 --> 00:01:17.730 Then regression analysis answers the question, 00:01:17.730 --> 00:01:20.160 how much do these variables together 00:01:20.160 --> 00:01:22.980 explain the variation of the dependent variable, 00:01:22.980 --> 00:01:28.800 and which of the variables are the most important for explaining it. 00:01:28.800 --> 00:01:32.370 So regression analysis allows us to 00:01:32.370 --> 00:01:36.480 control for alternative explanations for an observed correlation. 00:01:36.480 --> 00:01:39.930 In the case of the paper by Hekman 00:01:39.930 --> 00:01:42.750 from which the previous table was taken, 00:01:44.760 --> 00:01:48.630 they explained patient satisfaction scores with, 00:01:48.630 --> 00:01:50.550 for example, physician productivity, 00:01:50.550 --> 00:01:53.460 physician quality and physician accessibility. 00:01:53.460 --> 00:01:57.660 So you have one thing that you explain with multiple things to see 00:01:57.660 --> 00:02:02.640 which of those potential explanatory variables actually matter. 00:02:03.894 --> 00:02:07.744 The idea of regression analysis is commonly presented as a Venn diagram. 00:02:07.913 --> 00:02:14.310 This Venn diagram is useful for illustrating some properties of regression analysis. 00:02:14.310 --> 00:02:16.470 It doesn't illustrate all the properties, 00:02:16.470 --> 00:02:18.495 but it's a good starting point nevertheless. 00:02:18.495 --> 00:02:22.140 So the idea of these circles here is that 00:02:22.140 --> 00:02:27.960 this circle here represents the variation of company performance or return on assets. 00:02:27.960 --> 00:02:30.964 So this is the variation of the dependent variable. 00:02:30.964 --> 00:02:35.280 This is the variation of the independent variable that we're interested in, 00:02:35.280 --> 00:02:37.200 which in this case is CEO gender. 00:02:37.200 --> 00:02:40.179 And this is the variation in company size.
00:02:41.026 --> 00:02:42.923 Now we are interested in 00:02:43.093 --> 00:02:48.240 how much of this co-variation or correlation between gender and performance 00:02:48.240 --> 00:02:50.862 is actually due to gender, 00:02:50.862 --> 00:02:54.990 and how much is due to the effect of size, 00:02:54.990 --> 00:02:57.360 because size and gender are correlated. 00:02:57.360 --> 00:03:01.650 So we could say that the correlation between gender and performance 00:03:01.650 --> 00:03:07.500 is partly due to the presumed causal influence of CEO gender on performance, 00:03:07.500 --> 00:03:12.600 and partly because smaller companies tend to be more profitable, 00:03:12.600 --> 00:03:14.080 this correlation here, 00:03:14.368 --> 00:03:18.545 and also because smaller companies are more likely to hire women CEOs, 00:03:18.545 --> 00:03:20.886 which is this correlation here. 00:03:21.463 --> 00:03:26.610 Now we want to use regression analysis to parcel out this part 00:03:26.610 --> 00:03:30.150 that is shared by gender and size and performance, 00:03:30.150 --> 00:03:33.330 to get the unique effect of gender on performance. 00:03:33.330 --> 00:03:36.990 So we could think of regression analysis as doing something like this. 00:03:36.990 --> 00:03:43.560 So it eliminates the effect of company size on the relationship between gender and performance. 00:03:43.967 --> 00:03:48.810 Of course, we are not limited to just two independent variables, 00:03:48.810 --> 00:03:54.000 we can have multiple competing explanations for the dependent variable in the model. 00:03:54.000 --> 00:03:57.733 Typically we would have in the ballpark of 10 or 20 variables. 00:03:58.784 --> 00:04:05.383 So we can take additional bites away to get a cleaner estimate of this correlation 00:04:05.518 --> 00:04:07.582 between gender and performance, 00:04:07.582 --> 00:04:10.533 one that is free of any third causes.
00:04:10.990 --> 00:04:17.010 Ultimately we would get a clean causal effect of gender on performance, 00:04:17.010 --> 00:04:21.210 if we have included all relevant controls in the model. 00:04:21.210 --> 00:04:23.491 That, of course, is easier said than done. 00:04:25.016 --> 00:04:28.230 Regression analysis is a statistical model 00:04:28.230 --> 00:04:31.080 and a model is an equation, 00:04:31.080 --> 00:04:33.930 so whenever you hear the term model 00:04:33.930 --> 00:04:36.000 it means that there is some math, 00:04:36.000 --> 00:04:40.019 and the model can also be presented as a path diagram like this. 00:04:41.104 --> 00:04:44.130 I will first talk about the path diagram. 00:04:44.130 --> 00:04:46.140 So the path diagram here has 00:04:46.140 --> 00:04:52.000 one dependent variable y and three independent variables x, 00:04:52.000 --> 00:04:54.030 and the x's, the independent variables, 00:04:54.030 --> 00:04:55.740 are allowed to be freely correlated. 00:04:55.740 --> 00:04:59.310 Free correlation, shown by this double-headed curved arrow, 00:04:59.310 --> 00:05:02.222 means that we don't really care about 00:05:02.222 --> 00:05:07.248 how these different explanatory variables, usually denoted with x, are related, 00:05:07.248 --> 00:05:09.216 but we are interested in estimating 00:05:09.216 --> 00:05:13.604 how they explain or predict the dependent variable y. 00:05:14.028 --> 00:05:16.205 The strength of influence of each variable 00:05:16.205 --> 00:05:19.351 is quantified by a regression coefficient beta. 00:05:19.639 --> 00:05:23.910 So we have one beta for each x here, 00:05:23.910 --> 00:05:27.390 then we have beta 0 or the intercept, 00:05:27.390 --> 00:05:30.690 which tells us the base level of y 00:05:30.690 --> 00:05:37.860 when all of these explanatory or independent variables are at 0. 00:05:38.580 --> 00:05:41.598 And then we have some variation u 00:05:41.751 --> 00:05:43.500 that the model doesn't explain.
00:05:43.500 --> 00:05:46.583 So this is the remaining variation that is not explained by the model. 00:05:47.278 --> 00:05:52.517 So let's say that the model explains 20% of the variation of the dependent variable, 00:05:52.517 --> 00:05:55.124 which is fairly typical for business research. 00:05:55.429 --> 00:06:00.360 Then the unexplained variation would account for the remaining 80% of the variation of y 00:06:00.360 --> 00:06:01.680 in the data. 00:06:02.172 --> 00:06:08.130 In equation form, we can see that the y here is a weighted sum of the x's, 00:06:08.130 --> 00:06:11.160 and the weights are the regression coefficients. 00:06:11.160 --> 00:06:14.010 And each of these regression coefficients quantifies 00:06:14.010 --> 00:06:16.560 the influence of one variable, 00:06:16.560 --> 00:06:20.100 one of the independent variables, on the dependent variable. 00:06:21.202 --> 00:06:23.910 So, for example, we can model patient satisfaction 00:06:23.910 --> 00:06:27.120 as a weighted sum of physician productivity, 00:06:27.120 --> 00:06:30.660 physician quality, physician accessibility, 00:06:30.660 --> 00:06:36.150 and some variation that the model doesn't explain. 00:06:36.150 --> 00:06:38.280 What's important to understand is that 00:06:38.280 --> 00:06:40.997 these effects are independent, so 00:06:41.310 --> 00:06:44.430 when an x increases by one unit, its beta tells us 00:06:44.430 --> 00:06:49.290 the effect of that one-unit increase independently of the other variables. 00:06:49.290 --> 00:06:52.380 And also they are linear, 00:06:52.380 --> 00:06:55.830 so we assume that a one-unit increase in x 00:06:55.830 --> 00:06:58.650 is always associated with the same amount of change in y, 00:06:58.650 --> 00:07:01.840 which is quantified by the beta. 00:07:03.382 --> 00:07:08.640 Graphically, regression analysis can be understood as a line. 00:07:08.640 --> 00:07:12.210 And I will show you a two-variable regression analysis.
00:07:12.210 --> 00:07:14.460 This is also called simple regression, 00:07:14.460 --> 00:07:17.802 because we have only one independent variable. 00:07:18.497 --> 00:07:21.114 So here the independent variable is, 00:07:21.242 --> 00:07:23.850 let's say, years of education, for example, 00:07:23.850 --> 00:07:27.000 and this dependent variable here is, 00:07:27.000 --> 00:07:28.971 let's say, salary. 00:07:28.971 --> 00:07:30.982 And we are interested in knowing 00:07:30.982 --> 00:07:33.275 what the linear relationship is, 00:07:33.275 --> 00:07:36.603 so what's the best line that explains this data. 00:07:36.823 --> 00:07:41.029 So regression analysis in this simple regression with one independent variable, 00:07:41.182 --> 00:07:43.920 basically, you can think of it as 00:07:43.920 --> 00:07:47.070 plotting all the data as a scatterplot here, 00:07:47.070 --> 00:07:48.990 we will show some scatterplots a bit later, 00:07:48.990 --> 00:07:51.570 and then drawing a line through the data, 00:07:51.570 --> 00:07:54.104 which gives us the regression line. 00:07:54.540 --> 00:07:57.091 The slope of this line here, 00:07:57.091 --> 00:08:00.000 how steeply it goes up or down, 00:08:00.000 --> 00:08:03.270 is quantified by the regression coefficient. 00:08:04.745 --> 00:08:08.444 We make some assumptions when we run a regression analysis. 00:08:08.631 --> 00:08:13.410 One of the key assumptions in justifying regression analysis is 00:08:13.410 --> 00:08:19.800 that the observations are normally distributed, with equal spread, around the regression line. 00:08:19.800 --> 00:08:22.599 So when we have a regression line here, 00:08:22.599 --> 00:08:26.637 the most likely case is that the observations are close to the line. 00:08:26.724 --> 00:08:29.610 There can be some observations that are far from the line, 00:08:29.610 --> 00:08:31.860 but they should be relatively rare.
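The line-fitting idea described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the education and salary numbers are invented for the example and do not come from the lecture, and the function names `fit_simple_regression` and `r_squared` are hypothetical helpers. The fit uses ordinary least squares, the usual way of choosing the "best line", although the lecture has not named the estimation method at this point.

```python
# Simple regression (one independent variable) by ordinary least squares,
# using only the Python standard library.
# NOTE: the data below are made up for illustration; they are not from the lecture.


def fit_simple_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Co-variation of x and y, and variation of x, around their means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Share of the variation of y that the fitted line explains."""
    mean_y = sum(y) / len(y)
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_residual / ss_total


# Hypothetical sample: years of education and salary (in thousands).
education_years = [10, 12, 12, 14, 16, 16, 18, 20]
salary = [28, 35, 31, 40, 47, 44, 52, 60]

b0, b1 = fit_simple_regression(education_years, salary)

# Residuals are the vertical distances of the observations from the line,
# i.e. the part of salary the model does not explain (the u in the lecture).
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(education_years, salary)]
r2 = r_squared(education_years, salary, b0, b1)

print(f"intercept (beta 0): {b0:.2f}")
print(f"slope (beta 1):     {b1:.2f}")
print(f"explained share of variation (R squared): {r2:.2f}")
```

Here the slope plays the role of the regression coefficient, the intercept is the base level of y when x is 0, and one minus the explained share is the unexplained variation. Plotting the residuals would be one informal way to check whether they look roughly normal with equal spread around the line.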