WEBVTT
00:00:00.030 --> 00:00:03.990
I will next explain the
idea of regression analysis.
00:00:03.990 --> 00:00:06.090
Regression analysis is one of the most
00:00:06.090 --> 00:00:08.520
commonly used analysis tools
in quantitative research,
00:00:08.520 --> 00:00:11.820
and most applications of quantitative techniques
00:00:11.820 --> 00:00:16.200
can be thought of as special cases or
extensions of this particular analysis.
00:00:16.200 --> 00:00:21.630
The regression analysis results are
typically presented as a table like this.
00:00:21.630 --> 00:00:24.360
So here we have four different regression models,
00:00:24.360 --> 00:00:27.120
we have different regression coefficients,
00:00:27.120 --> 00:00:29.310
we have different model indices.
00:00:29.310 --> 00:00:34.230
There are certain assumptions behind
this table that you need to understand
00:00:34.230 --> 00:00:36.030
and also you need to understand,
00:00:36.030 --> 00:00:36.960
what the numbers tell us.
00:00:36.960 --> 00:00:39.750
And we will be looking at these kinds of tables
00:00:39.750 --> 00:00:40.980
in the next couple of videos,
00:00:40.980 --> 00:00:42.180
but I will first explain,
00:00:42.180 --> 00:00:45.690
what regression analysis is actually about.
00:00:45.690 --> 00:00:51.390
So in regression analysis, we
have two kinds of variables,
00:00:51.390 --> 00:00:55.380
we have one dependent variable
that we want to explain,
00:00:55.380 --> 00:00:59.520
for example a company's profitability,
ROA could be a dependent variable.
00:00:59.520 --> 00:01:02.730
Then we have multiple independent variables.
00:01:02.730 --> 00:01:05.520
The independent variables are variables that
00:01:05.520 --> 00:01:07.920
we use to explain the dependent variable,
00:01:07.920 --> 00:01:12.180
for example, we could have
CEO gender and company size,
00:01:12.180 --> 00:01:13.440
and company industry.
00:01:13.440 --> 00:01:17.730
Then regression analysis answers the question,
00:01:17.730 --> 00:01:20.160
how much do these variables together
00:01:20.160 --> 00:01:22.980
explain the variation of the dependent variable,
00:01:22.980 --> 00:01:28.800
and which of the variables are the
most important ones for explaining that.
00:01:28.800 --> 00:01:32.370
So regression analysis allows us to
00:01:32.370 --> 00:01:36.480
control for alternative explanations
for an observed correlation.
00:01:36.480 --> 00:01:39.930
In the case of the paper by Hekman
00:01:39.930 --> 00:01:42.750
from which the previous table was taken,
00:01:44.760 --> 00:01:48.630
they explained patient satisfaction scores with,
00:01:48.630 --> 00:01:50.550
for example physician productivity,
00:01:50.550 --> 00:01:53.460
physician quality and physician accessibility.
00:01:53.460 --> 00:01:57.660
So you have one thing that you
explain with multiple things to see,
00:01:57.660 --> 00:02:02.640
which one of those multiple potential
explanatory variables actually matters?
00:02:03.894 --> 00:02:07.744
The idea of regression analysis is
commonly presented as a Venn diagram.
00:02:07.913 --> 00:02:14.310
This Venn diagram is useful for illustrating
some properties of regression analysis,
00:02:14.310 --> 00:02:16.470
but it doesn't illustrate all the properties,
00:02:16.470 --> 00:02:18.495
but it's a good starting point nevertheless.
00:02:18.495 --> 00:02:22.140
So the idea of these circles here is that,
00:02:22.140 --> 00:02:27.960
this circle here represents the variation of
company performance or return on assets.
00:02:27.960 --> 00:02:30.964
So this is the variation
of the dependent variable.
00:02:30.964 --> 00:02:35.280
This is the variation of the independent
variable that we're interested in,
00:02:35.280 --> 00:02:37.200
which in this case is the CEO gender.
00:02:37.200 --> 00:02:40.179
And this is the variation in company size.
00:02:41.026 --> 00:02:42.923
Now we are interested in,
00:02:43.093 --> 00:02:48.240
how much of this co-variation or
correlation between gender and performance
00:02:48.240 --> 00:02:50.862
is actually due to gender,
00:02:50.862 --> 00:02:54.990
and how much is due to the effect of size
00:02:54.990 --> 00:02:57.360
because size and gender are correlated.
00:02:57.360 --> 00:03:01.650
So we could say that the correlation
between gender and performance
00:03:01.650 --> 00:03:07.500
is partly due to the presumed causal
influence of CEO gender on performance,
00:03:07.500 --> 00:03:12.600
and partly because smaller companies
tend to be more profitable,
00:03:12.600 --> 00:03:14.080
this correlation here,
00:03:14.368 --> 00:03:18.545
and also because smaller companies
are more likely to hire women CEOs,
00:03:18.545 --> 00:03:20.886
which is this correlation here.
00:03:21.463 --> 00:03:26.610
Now we want to use regression
analysis to partial out this part
00:03:26.610 --> 00:03:30.150
that is shared by gender and size and performance,
00:03:30.150 --> 00:03:33.330
to get the unique effect of gender on performance.
00:03:33.330 --> 00:03:36.990
So we could think of regression
analysis as doing something like this.
00:03:36.990 --> 00:03:43.560
So it eliminates the effect of company size on the relationship between gender and performance.
00:03:43.967 --> 00:03:48.810
Of course, we are not limited to just
two independent variables,
00:03:48.810 --> 00:03:54.000
we can have multiple competing explanations for the dependent variable in the model.
00:03:54.000 --> 00:03:57.733
Typically we would have in the
ballpark of 10 or 20 variables.
00:03:58.784 --> 00:04:05.383
So we can take additional bites away to
get a cleaner estimate of this correlation
00:04:05.518 --> 00:04:07.582
between gender and performance,
00:04:07.582 --> 00:04:10.533
that is free of any third causes.
00:04:10.990 --> 00:04:17.010
Ultimately we would get a clean estimate of the
causal effect of gender on performance,
00:04:17.010 --> 00:04:21.210
if we have included all
relevant controls to the model.
00:04:21.210 --> 00:04:23.491
That, of course, is easier said than done.
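To make the idea of controlling for company size concrete, here is a minimal simulation sketch. All the numbers and variable names (size, woman_ceo, roa, the effect of 0.5) are made up for illustration and do not come from any real data set. It assumes that smaller companies are both more profitable and more likely to have a woman CEO, as in the example above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical data: standardized company size.
size = rng.normal(size=n)
# Assumption: smaller companies are more likely to have a woman CEO.
woman_ceo = (rng.normal(size=n) - 0.8 * size > 0).astype(float)
# Assumption: the effect of gender is 0.5, and smaller companies are more profitable.
roa = 0.5 * woman_ceo - 1.0 * size + rng.normal(size=n)

def ols(y, *xs):
    """Least-squares coefficients; the first element is the intercept."""
    X = np.column_stack([np.ones(len(y)), *xs])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

naive = ols(roa, woman_ceo)             # size is left out
controlled = ols(roa, woman_ceo, size)  # size is controlled for

print("gender coefficient without size:", naive[1])
print("gender coefficient with size:   ", controlled[1])
```

In this sketch the coefficient without size mixes the gender effect with the size effect, while the coefficient with size in the model should land close to the assumed value of 0.5.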
00:04:25.016 --> 00:04:28.230
Regression analysis is a statistical model
00:04:28.230 --> 00:04:31.080
and a model is an equation,
00:04:31.080 --> 00:04:33.930
so whenever you hear the term model
00:04:33.930 --> 00:04:36.000
it means that there is some math,
00:04:36.000 --> 00:04:40.019
and the model can also be presented
as a path diagram like this.
00:04:41.104 --> 00:04:44.130
I will first talk about the path diagram.
00:04:44.130 --> 00:04:46.140
So the path diagram here has
00:04:46.140 --> 00:04:52.000
one dependent variable y,
three independent variables x,
00:04:52.000 --> 00:04:54.030
and the x's are the independent variables,
00:04:54.030 --> 00:04:55.740
they are allowed to be freely correlated.
00:04:55.740 --> 00:04:59.310
Free correlation, shown by this
double-headed curved arrow,
00:04:59.310 --> 00:05:02.222
means that we don't really care about
00:05:02.222 --> 00:05:07.248
how these different explanatory variables
usually denoted with x, are related,
00:05:07.248 --> 00:05:09.216
but we are interested in estimating,
00:05:09.216 --> 00:05:13.604
how they explain or predict
the dependent variable y.
00:05:14.028 --> 00:05:16.205
The strength of influence of each variable
00:05:16.205 --> 00:05:19.351
is quantified by a regression coefficient beta.
00:05:19.639 --> 00:05:23.910
So we have one beta for each x here,
00:05:23.910 --> 00:05:27.390
then we have beta 0 or the intercept,
00:05:27.390 --> 00:05:30.690
which tells us the base level of y,
00:05:30.690 --> 00:05:37.860
when all of these explanatory or
independent variables are at 0.
00:05:38.580 --> 00:05:41.598
And then we have some variation u,
00:05:41.751 --> 00:05:43.500
that the model doesn't explain.
00:05:43.500 --> 00:05:46.583
So this is the remaining variation
that is not explained by the model.
00:05:47.278 --> 00:05:52.517
So let's say that the model explains 20%
of the variation of the dependent variable,
00:05:52.517 --> 00:05:55.124
which is fairly typical for business research.
00:05:55.429 --> 00:06:00.360
Then the unexplained variation would
account for 80% of the total variation of y,
00:06:00.360 --> 00:06:01.680
in the data.
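The split between explained and unexplained variation can be sketched with simulated data. This is only an illustration: the data are made up, and the variances are chosen by assumption so that the model explains roughly 20% of the variation of y.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000

# Hypothetical data: var(x) = 1 and var(u) = 4, so x explains about 1/5 of y.
x = rng.normal(size=n)
u = rng.normal(scale=2.0, size=n)
y = 1.0 + 1.0 * x + u

# Fit the regression by least squares.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ coef

ss_res = np.sum(residuals ** 2)       # variation the model does not explain
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation of y

r_squared = 1 - ss_res / ss_tot
unexplained_share = ss_res / ss_tot

print("explained share (R squared):", r_squared)
print("unexplained share:          ", unexplained_share)
```

The two shares always add up to one, which is why explaining 20% leaves 80% unexplained.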
00:06:02.172 --> 00:06:08.130
In equation form, we can see that the
y here is a weighted sum of the x's,
00:06:08.130 --> 00:06:11.160
and the weights are the regression coefficients.
00:06:11.160 --> 00:06:14.010
And each of these regression
coefficients quantifies,
00:06:14.010 --> 00:06:16.560
what is the influence of one variable,
00:06:16.560 --> 00:06:20.100
one of the independent variables
on the dependent variable.
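Written out for the path diagram described above, with three independent variables, the equation would look like this. The subscripts on the x's and betas are notation assumed here for illustration; the symbols y, x, beta, beta 0 and u are the ones used in the lecture.

```latex
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u
```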
00:06:21.202 --> 00:06:23.910
So, for example, we can model patient satisfaction
00:06:23.910 --> 00:06:27.120
as a weighted sum of physician productivity,
00:06:27.120 --> 00:06:30.660
physician quality, physician accessibility,
00:06:30.660 --> 00:06:36.150
and some variation that the model doesn't explain.
00:06:36.150 --> 00:06:38.280
What's important to understand is that
00:06:38.280 --> 00:06:40.997
these effects are independent so
00:06:41.310 --> 00:06:44.430
when x increases by one unit, that beta tells us,
00:06:44.430 --> 00:06:49.290
what is the effect of a one-unit increase
independently of the other variables?
00:06:49.290 --> 00:06:52.380
And also they are linear,
00:06:52.380 --> 00:06:55.830
so that we always assume
that one unit increase in x
00:06:55.830 --> 00:06:58.650
is always associated with the
same amount of increase in y,
00:06:58.650 --> 00:07:01.840
which is quantified by the beta.
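A small sketch can show what independent and linear mean here. The intercept and coefficients below are hypothetical numbers chosen only for illustration. The point is that a one-unit increase in the first x changes the predicted y by the same amount, its beta, no matter where we start or what the other x's are.

```python
import numpy as np

# Hypothetical intercept and regression coefficients (not from any real study).
beta0 = 2.0
betas = np.array([0.5, -1.0, 0.25])

def predict(x):
    """Predicted y as a weighted sum of the x's plus the intercept."""
    return beta0 + np.dot(betas, x)

one_unit_in_x1 = np.array([1.0, 0.0, 0.0])

# Two arbitrary starting points with very different values of the x's.
x_low = np.array([1.0, 2.0, 3.0])
x_high = np.array([10.0, -4.0, 7.0])

effect_low = predict(x_low + one_unit_in_x1) - predict(x_low)
effect_high = predict(x_high + one_unit_in_x1) - predict(x_high)

print(effect_low, effect_high)  # both equal the first beta
```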
00:07:03.382 --> 00:07:08.640
Graphically, regression analysis
can be understood as a line.
00:07:08.640 --> 00:07:12.210
And I will show you two-variable
regression analysis.
00:07:12.210 --> 00:07:14.460
This is also called simple regression,
00:07:14.460 --> 00:07:17.802
because we have only one independent variable.
00:07:18.497 --> 00:07:21.114
So here the independent variable is,
00:07:21.242 --> 00:07:23.850
let's say it's years of education for example,
00:07:23.850 --> 00:07:27.000
and this dependent variable here is
00:07:27.000 --> 00:07:28.971
let's say it's salary.
00:07:28.971 --> 00:07:30.982
And we are interested in knowing,
00:07:30.982 --> 00:07:33.275
what is the linear relationship,
00:07:33.275 --> 00:07:36.603
so what's the best line that explains this data.
00:07:36.823 --> 00:07:41.029
So regression analysis in this simple
regression with one independent variable,
00:07:41.182 --> 00:07:43.920
basically, you can think of it as
00:07:43.920 --> 00:07:47.070
plotting all the data as a scatterplot here,
00:07:47.070 --> 00:07:48.990
we will show some scatterplots a bit later,
00:07:48.990 --> 00:07:51.570
and then draw a line through the data,
00:07:51.570 --> 00:07:54.104
so that gives us the regression line.
00:07:54.540 --> 00:07:57.091
The slope of this line here,
00:07:57.091 --> 00:08:00.000
how strongly it goes up or down,
00:08:00.000 --> 00:08:03.270
is quantified by the regression coefficient.
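As a rough sketch of drawing a line through the data, the following fits a simple regression to simulated education and salary data. The numbers (a salary increase of 2,500 per year of education, the noise level, the sample size) are assumptions made up for this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: years of education and salary.
education = rng.uniform(8, 20, size=200)
salary = 10_000 + 2_500 * education + rng.normal(0, 5_000, size=200)

# Fit a straight line; np.polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(education, salary, deg=1)

print("slope (regression coefficient):", slope)
print("intercept:", intercept)
```

The estimated slope should come out near the assumed 2,500, which is how much the line goes up for each additional year of education.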
00:08:04.745 --> 00:08:08.444
We make some assumptions when
we run a regression analysis.
00:08:08.631 --> 00:08:13.410
One of the key assumptions in
justifying regression analysis is
00:08:13.410 --> 00:08:19.800
that these observations then are equally and normally distributed around the regression line.
00:08:19.800 --> 00:08:22.599
So that when we have a regression line here,
00:08:22.599 --> 00:08:26.637
the most likely case is that the
observations are close to the line,
00:08:26.724 --> 00:08:29.610
but there can be some observations
that are far from the line,
00:08:29.610 --> 00:08:31.860
but they should be relatively rare.