WEBVTT
Kind: captions
Language: en

00:00:00.030 --> 00:00:04.500
Logistic regression analysis is a commonly&nbsp;
used tool for binary dependent variables.&nbsp;

00:00:04.500 --> 00:00:10.170
A binary variable is a variable that receives&nbsp;
the values of 1 and 0 and it's very commonly&nbsp;&nbsp;

00:00:10.170 --> 00:00:14.640
used for decisions that are either yes&nbsp;
or no, whether something happens or not.&nbsp;

00:00:14.640 --> 00:00:20.370
Whether a company decides to expand&nbsp;
internationally or whether it decides&nbsp;&nbsp;

00:00:20.370 --> 00:00:24.390
to stay in the home market, whether a&nbsp;
person is sick or not and that kind of data.&nbsp;

00:00:24.390 --> 00:00:30.540
To illustrate the logistic regression&nbsp;
analysis technique we need to have some&nbsp;&nbsp;

00:00:30.540 --> 00:00:34.050
example data, and this example&nbsp;
data is about girls from Warsaw.&nbsp;

00:00:34.050 --> 00:00:40.710
And the girls range from about 10 years to&nbsp;
about 18 years and the dependent variable&nbsp;&nbsp;

00:00:40.710 --> 00:00:45.360
here is called menarche, and that's whether&nbsp;
the girl has had the first period or not.&nbsp;

00:00:45.360 --> 00:00:51.570
So we can see here that girls at the age of&nbsp;
10 normally haven't had the first period,&nbsp;&nbsp;

00:00:51.570 --> 00:00:56.010
and when girls are 18, pretty&nbsp;
much everyone has had the first period.&nbsp;

00:00:56.010 --> 00:01:02.730
And we want to explain this relationship between&nbsp;
age and menarche using regression analysis.&nbsp;

00:01:02.730 --> 00:01:08.130
There are a couple of problems when&nbsp;
we apply normal regression analysis&nbsp;

00:01:08.130 --> 00:01:17.970
to this kind of data set. The first problem&nbsp;
is that the regression line here goes over 1.&nbsp;

00:01:17.970 --> 00:01:23.730
So the value here, the regression&nbsp;
line gives the expected value of&nbsp;&nbsp;

00:01:23.730 --> 00:01:28.920
the dependent variable given age.
And in this case because the dependent&nbsp;&nbsp;

00:01:28.920 --> 00:01:36.360
variable is 0 or 1, the expected value is&nbsp;
the expected probability of having had menarche.&nbsp;

00:01:36.360 --> 00:01:42.870
When we draw the line, we have a&nbsp;
problem here because the predicted&nbsp;&nbsp;

00:01:42.870 --> 00:01:48.210
probability for girls who are 18 exceeds&nbsp;
1, and probabilities are bounded between 0 and 1.&nbsp;

00:01:48.210 --> 00:01:55.320
Also, we have negative probabilities here.
This also causes a problem for regression&nbsp;&nbsp;

00:01:55.320 --> 00:02:02.340
analysis because when we have&nbsp;
small fitted values here, then all residuals&nbsp;&nbsp;

00:02:02.340 --> 00:02:09.150
are positive, so the error term can't be&nbsp;
independent of the fitted value.&nbsp;

00:02:09.150 --> 00:02:13.510
So in regression analysis we are violating&nbsp;
the no endogeneity assumption at least,&nbsp;&nbsp;

00:02:13.510 --> 00:02:20.440
and the predictions don't make any sense.
So using a linear model for this kind of data&nbsp;&nbsp;

00:02:20.440 --> 00:02:25.990
is problematic for these two reasons.&nbsp;
Using this kind of linear model would&nbsp;&nbsp;

00:02:25.990 --> 00:02:32.470
be acceptable if most girls were around&nbsp;
here, so the linear approximation would be&nbsp;&nbsp;

00:02:32.470 --> 00:02:37.660
okay because it doesn't really predict any&nbsp;
negative values, because we can't go beyond&nbsp;&nbsp;

00:02:37.660 --> 00:02:42.910
the range of the data. But if we have negative&nbsp;
predictions or predictions that exceed one within&nbsp;&nbsp;

00:02:42.910 --> 00:02:48.280
the range of the data, then we have problems.
This model is called the linear probability model,&nbsp;&nbsp;

00:02:48.280 --> 00:02:53.410
and it can be used, but there&nbsp;
are typically better alternatives.&nbsp;
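
NOTE
A minimal sketch of the linear probability model described above, fitted by
ordinary least squares. It uses simulated data rather than the Warsaw data set;
the logistic shape, its midpoint (13) and its steepness (3) are assumptions
made only for illustration. It shows the fitted line going below 0 at age 10
and above 1 at age 18.
```python
import numpy as np
# Simulated stand-in for the lecture data; midpoint 13 and steepness 3 are assumptions.
rng = np.random.default_rng(0)
age = rng.uniform(10, 18, size=4000)
p_true = 1 / (1 + np.exp(-3 * (age - 13)))
menarche = (rng.uniform(size=age.size) < p_true).astype(float)
# Linear probability model: menarche = intercept + slope * age, fitted by least squares.
slope, intercept = np.polyfit(age, menarche, deg=1)
pred_at_10 = intercept + slope * 10
pred_at_18 = intercept + slope * 18
print(f"predicted probability at age 10: {pred_at_10:.3f}")
print(f"predicted probability at age 18: {pred_at_18:.3f}")
```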

00:02:53.410 --> 00:02:59.650
To start discovering better alternatives,&nbsp;
we need to&nbsp;&nbsp;

00:02:59.650 --> 00:03:05.410
think about what the relationship looks like, and&nbsp;
we can do a nonparametric analysis. For example,&nbsp;&nbsp;

00:03:05.410 --> 00:03:11.170
we take a rolling average from the data.
So the idea of a rolling average is that we&nbsp;&nbsp;

00:03:11.170 --> 00:03:18.940
have here about 4,000 girls, and we take the&nbsp;
first 500 here, calculate the mean for these&nbsp;&nbsp;

00:03:18.940 --> 00:03:25.840
first 500, and then we mark a small dot here.
The average for these girls is zero because no&nbsp;&nbsp;

00:03:25.840 --> 00:03:33.610
one has had menarche. Then we shift this window&nbsp;
to the right a bit and check the next 500 girls, so we&nbsp;&nbsp;

00:03:33.610 --> 00:03:40.900
go from the second girl to the 501st girl,&nbsp;
calculate the average, and mark it here.&nbsp;

00:03:40.900 --> 00:03:48.520
Then we go from the third girl to the 502nd girl&nbsp;
and we calculate the average for that subsample.&nbsp;

00:03:48.520 --> 00:03:53.590
Then we continue. When we get here, we can&nbsp;
see that the mean value is about 50%.&nbsp;&nbsp;

00:03:53.590 --> 00:04:00.130
Finally, we calculate the mean&nbsp;
for all possible windows.&nbsp;

00:04:00.130 --> 00:04:04.300
We get this kind of a nonparametric&nbsp;
curve. It's nonparametric because we&nbsp;&nbsp;

00:04:04.300 --> 00:04:10.240
can't express this curve as a simple function.
We can see that this is an s-shaped curve.&nbsp;
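
NOTE
A minimal sketch of the rolling average described above: sort the girls by age,
average the outcome over girls 1 to 500, then 2 to 501, then 3 to 502, and so on.
The data are simulated, not the Warsaw data set; the sample size of 4,000 and the
window of 500 follow the lecture, while the curve parameters are assumptions.
```python
import numpy as np
# Simulated stand-in for the lecture data; midpoint 13 and steepness 3 are assumptions.
rng = np.random.default_rng(0)
age = np.sort(rng.uniform(10, 18, size=4000))
p_true = 1 / (1 + np.exp(-3 * (age - 13)))
menarche = (rng.uniform(size=age.size) < p_true).astype(float)
window = 500
# Cumulative sums let us get every window mean without an explicit loop.
csum_y = np.concatenate(([0.0], np.cumsum(menarche)))
csum_age = np.concatenate(([0.0], np.cumsum(age)))
rolling_mean = (csum_y[window:] - csum_y[:-window]) / window
rolling_age = (csum_age[window:] - csum_age[:-window]) / window
# One dot per window: mean age on the x axis, share with menarche on the y axis.
print(len(rolling_mean), rolling_mean[0], rolling_mean[-1])
```
Plotting rolling_mean against rolling_age should give an S-shaped curve that
stays between 0 and 1.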

00:04:10.240 --> 00:04:16.300
So first when girls get a little bit older&nbsp;
some girls start to have menarche but not&nbsp;&nbsp;

00:04:16.300 --> 00:04:22.090
many. And once you hit about 13 or 14, then&nbsp;
the rate of having menarche increases&nbsp;&nbsp;

00:04:22.090 --> 00:04:27.790
rapidly until the increase starts to slow down&nbsp;
at about 15, when pretty much&nbsp;&nbsp;

00:04:27.790 --> 00:04:34.840
everyone has had menarche except for a couple of&nbsp;
exceptions. And then it flattens out at one.&nbsp;

00:04:34.840 --> 00:04:42.820
This curve is called a logistic curve.
So here is the logistic curve and the idea&nbsp;&nbsp;

00:04:42.820 --> 00:04:47.800
of logistic regression analysis is that instead&nbsp;
of fitting a line we fit this logistic curve.&nbsp;&nbsp;

00:04:47.800 --> 00:04:52.540
That's the logit curve, and the interpretation&nbsp;
of the result stays the same, so the curve&nbsp;&nbsp;

00:04:52.540 --> 00:04:58.840
gives us the expected probability of a girl&nbsp;
having had menarche given their age. But this&nbsp;&nbsp;

00:04:58.840 --> 00:05:03.340
curve, as we saw on the previous&nbsp;
slide is a much better fit for the data.&nbsp;

00:05:03.340 --> 00:05:09.520
So in the data the relationship is not linear;&nbsp;
rather, it follows an S shape, and the logit&nbsp;&nbsp;

00:05:09.520 --> 00:05:13.990
curve is one such S-shaped curve that we&nbsp;
could use and it's very commonly used.&nbsp;

00:05:13.990 --> 00:05:19.420
So we get the probability of having had&nbsp;
menarche given the age from the model.&nbsp;

00:05:19.420 --> 00:05:25.390
The model can be expressed mathematically&nbsp;
because all models are just equations and&nbsp;&nbsp;

00:05:25.390 --> 00:05:29.950
the mathematical expression for this&nbsp;
logistic regression model is as follows.&nbsp;

00:05:29.950 --> 00:05:34.210
First you have the linear regression model.
So that's the linear probability model because&nbsp;&nbsp;

00:05:34.210 --> 00:05:40.720
we have a binary dependent variable, and&nbsp;
the logistic&nbsp;&nbsp;

00:05:40.720 --> 00:05:46.420
model extends the normal regression model&nbsp;
by taking a function of this fitted value.&nbsp;

00:05:46.420 --> 00:05:51.250
So we calculate the linear prediction&nbsp;
using the observed data, and then&nbsp;&nbsp;

00:05:51.250 --> 00:05:58.030
we take a function here, which gives&nbsp;
us the logit curve.&nbsp;

00:05:58.030 --> 00:06:02.290
The inverse of this function is called the&nbsp;
link function and that's the logit function.&nbsp;

00:06:02.290 --> 00:06:07.120
Whether this is called&nbsp;
an inverse function or a function doesn't matter.&nbsp;
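
NOTE
A minimal sketch of the two functions just described: the logistic function that
turns the linear prediction into a probability, and its inverse, the logit link.
The coefficients b0 and b1 are hypothetical values, chosen only so that the
curve is centered at age 13; they are not estimates from the lecture data.
```python
import numpy as np
def inv_logit(eta):
    """Logistic function: maps a linear prediction to a probability in (0, 1)."""
    return 1 / (1 + np.exp(-eta))
def logit(p):
    """Logit link: the inverse of inv_logit, mapping a probability to log odds."""
    return np.log(p / (1 - p))
# Hypothetical coefficients for illustration only.
b0, b1 = -19.5, 1.5
age = np.array([10.0, 13.0, 18.0])
eta = b0 + b1 * age      # linear prediction, unbounded
prob = inv_logit(eta)    # transformed prediction, always between 0 and 1
print(prob)
```
Applying logit to prob recovers the linear prediction eta, which is why the two
functions are called inverses of each other.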

00:06:07.120 --> 00:06:11.770
The important thing for you to understand&nbsp;
is that instead of using the predictions&nbsp;&nbsp;

00:06:11.770 --> 00:06:17.440
directly, we apply a function to the&nbsp;
predictions that&nbsp;&nbsp;

00:06:17.440 --> 00:06:25.060
transforms the predictions from a line&nbsp;
to a curve. Okay, so how do we estimate&nbsp;&nbsp;

00:06:25.060 --> 00:06:33.520
the model? We can apply OLS estimation. So we&nbsp;
apply OLS estimation, then we do diagnostics.&nbsp;

00:06:33.520 --> 00:06:43.240
So we get the residuals here. There's a residual,&nbsp;
so we can calculate it, and then we can plot&nbsp;&nbsp;

00:06:43.240 --> 00:06:47.800
residuals versus fitted, which is one of the&nbsp;
standard diagnostic plots and then we can&nbsp;&nbsp;

00:06:47.800 --> 00:06:53.380
check the normality of the residuals. We have&nbsp;
two violations of regression assumptions. First&nbsp;&nbsp;

00:06:53.380 --> 00:07:00.040
of all, the residuals are not normally&nbsp;
distributed, but that's not really a big deal.&nbsp;

00:07:00.040 --> 00:07:06.430
It's only relevant in very small samples.&nbsp;
Then we have a heteroscedasticity problem,&nbsp;&nbsp;

00:07:06.430 --> 00:07:12.100
because the variation of the residuals&nbsp;
here is a lot higher than the variation&nbsp;&nbsp;

00:07:12.100 --> 00:07:16.900
here because the variance is the square&nbsp;
of the difference, square of the residual.&nbsp;

00:07:16.900 --> 00:07:24.190
So we have a heteroscedasticity&nbsp;
problem. We are in violation of&nbsp;&nbsp;

00:07:24.190 --> 00:07:31.360
the MLR 5 and MLR 6 assumptions.
Whether that's a big deal or not we could&nbsp;&nbsp;

00:07:31.360 --> 00:07:36.880
use robust standard errors, but there are also some&nbsp;
computational difficulties when we try to apply&nbsp;&nbsp;

00:07:36.880 --> 00:07:43.630
the least squares approach to this kind of problem.
And because of those computational difficulties&nbsp;&nbsp;

00:07:43.630 --> 00:07:47.620
and because OLS is not ideal anyway&nbsp;
because of violation of these assumptions,&nbsp;&nbsp;

00:07:47.620 --> 00:07:54.280
we estimate this using a different&nbsp;
approach called maximum likelihood estimation.