WEBVTT Kind: captions Language: en 00:00:00.030 --> 00:00:04.500 Logistic regression analysis is a commonly used tool for binary dependent variables. 00:00:04.500 --> 00:00:10.170 A binary variable is a variable that takes the values 1 and 0, and it's very commonly 00:00:10.170 --> 00:00:14.640 used for decisions that are either yes or no: whether something happens or not. 00:00:14.640 --> 00:00:20.370 Whether a company decides to expand internationally or whether it decides 00:00:20.370 --> 00:00:24.390 to stay in the home markets, whether a person is sick or not, and that kind of data. 00:00:24.390 --> 00:00:30.540 To illustrate the logistic regression analysis technique we need to have some 00:00:30.540 --> 00:00:34.050 example data, and this example data are girls from Warsaw. 00:00:34.050 --> 00:00:40.710 The girls range from about 10 years to about 18 years, and the dependent variable 00:00:40.710 --> 00:00:45.360 here is called menarche, and that's whether the girl has had her first period or not. 00:00:45.360 --> 00:00:51.570 So we can see here that girls at the age of 10 normally have not had their first period, 00:00:51.570 --> 00:00:56.010 and by the time girls are 18 pretty much everyone has had their first period. 00:00:56.010 --> 00:01:02.730 And we want to explain this relationship between age and menarche using regression analysis. 00:01:02.730 --> 00:01:08.130 There are a couple of problems when we apply normal regression analysis 00:01:08.130 --> 00:01:17.970 to this kind of data set. The first problem is that the regression line here goes over 1. 00:01:17.970 --> 00:01:23.730 The regression line gives the expected value of 00:01:23.730 --> 00:01:28.920 the dependent variable given age. And in this case, because the dependent 00:01:28.920 --> 00:01:36.360 variable is 0 or 1, the expected value is the expected probability of having had menarche.
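To make the point about the regression line going over 1 concrete, here is a minimal sketch on synthetic data. The sample size, the age range, and the S-shaped "true" probability (midpoint 13.5, steepness 1.5) are assumptions chosen for illustration; this is not the Warsaw data set shown on the slide.

```python
import numpy as np

# Synthetic data only: every number below is an assumption for illustration,
# not the Warsaw data set from the lecture.
rng = np.random.default_rng(0)
n = 4000
age = rng.uniform(10, 18, n)                     # ages from about 10 to 18
p_true = 1 / (1 + np.exp(-1.5 * (age - 13.5)))   # assumed S-shaped probability
menarche = rng.binomial(1, p_true)               # binary outcome: 1 or 0

# Ordinary least squares line: menarche = intercept + slope * age.
# np.polyfit returns the highest-degree coefficient first.
slope, intercept = np.polyfit(age, menarche, 1)

# Predictions of the straight line at the two ends of the age range.
pred_10 = intercept + slope * 10
pred_18 = intercept + slope * 18
print(f"predicted probability at age 10: {pred_10:.3f}")
print(f"predicted probability at age 18: {pred_18:.3f}")
```

In this simulation the straight line predicts a value above 1 at age 18 and a value below 0 at age 10, neither of which can be a probability.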
00:01:36.360 --> 00:01:42.870 When we draw the line we have a problem here, because the predicted 00:01:42.870 --> 00:01:48.210 probability for girls that are 18 exceeds 1, and probabilities are bounded between 0 and 1. 00:01:48.210 --> 00:01:55.320 Also we have negative probabilities here. This also causes a problem for regression 00:01:55.320 --> 00:02:02.340 analysis because when we have small fitted values here, then all residuals 00:02:02.340 --> 00:02:09.150 are positive, so the error term can't be independent of the fitted value. 00:02:09.150 --> 00:02:13.510 So in regression analysis we are violating the no endogeneity assumption at least, 00:02:13.510 --> 00:02:20.440 and the predictions don't make any sense. So using a linear model for this kind of data 00:02:20.440 --> 00:02:25.990 is problematic for these two reasons. Using this kind of linear model would 00:02:25.990 --> 00:02:32.470 be acceptable if most girls were around here, so the linear approximation would be 00:02:32.470 --> 00:02:37.660 okay because it doesn't really predict any negative values, because we can't go beyond 00:02:37.660 --> 00:02:42.910 the range of the data. But if we have negative predictions or predictions that exceed one within 00:02:42.910 --> 00:02:48.280 the range of the data, then we have problems. This model is called the linear probability model 00:02:48.280 --> 00:02:53.410 and it can be used, but there are typically better alternatives. 00:02:53.410 --> 00:02:59.650 To start discovering better alternatives we need to 00:02:59.650 --> 00:03:05.410 think about what the relationship is like, and we can do a nonparametric analysis. For example, 00:03:05.410 --> 00:03:11.170 we take a rolling average from the data. 
So the idea of a rolling average is that we 00:03:11.170 --> 00:03:18.940 have here about 4,000 girls, and then we take the first 500 here, we calculate the mean for these 00:03:18.940 --> 00:03:25.840 first 500, and then we mark a small dot here. The average for these girls is zero because no 00:03:25.840 --> 00:03:33.610 one has had menarche. Then we shift this window right a bit and check the next 500 girls, so we 00:03:33.610 --> 00:03:40.900 go from the second girl to the 501st girl. We calculate the average and mark it here. 00:03:40.900 --> 00:03:48.520 Then we go from the third girl to the 502nd girl and calculate the average for that subsample. 00:03:48.520 --> 00:03:53.590 Then we continue. Here we can see that the mean value is about 50%, 00:03:53.590 --> 00:04:00.130 and finally, when we have calculated the mean for all possible windows, 00:04:00.130 --> 00:04:04.300 we get this kind of nonparametric curve. It's nonparametric because we 00:04:04.300 --> 00:04:10.240 can't express this curve as a simple function. We can see that this is an S-shaped curve. 00:04:10.240 --> 00:04:16.300 So first, when girls get a little bit older, some girls start to have menarche but not 00:04:16.300 --> 00:04:22.090 many. And once you hit about 13 or 14, the rate of having menarche increases 00:04:22.090 --> 00:04:27.790 rapidly until it starts to decrease at about 15, when pretty much 00:04:27.790 --> 00:04:34.840 everyone has had menarche except for a couple of exceptions. And then it flattens out at one. 00:04:34.840 --> 00:04:42.820 This curve is called a logistic curve. So here is the logistic curve, and the idea 00:04:42.820 --> 00:04:47.800 of logistic regression analysis is that instead of fitting a line we fit this logistic curve. 
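The rolling average described above can be sketched roughly as follows. The data are again synthetic (an assumed S-shaped probability with midpoint 13.5 and steepness 1.5), and only the sample size of about 4,000 and the window of 500 come from the lecture.

```python
import numpy as np

# Synthetic data only: the assumed probability curve is for illustration.
rng = np.random.default_rng(0)
n, window = 4000, 500
age = np.sort(rng.uniform(10, 18, n))            # girls ordered by age
p_true = 1 / (1 + np.exp(-1.5 * (age - 13.5)))   # assumed S-shaped probability
menarche = rng.binomial(1, p_true)               # binary outcome: 1 or 0

# Mean of girls 1..500, then 2..501, then 3..502, and so on.
# mode="valid" keeps only the windows that fit entirely inside the data.
kernel = np.ones(window) / window
rolling_mean = np.convolve(menarche, kernel, mode="valid")
rolling_age = np.convolve(age, kernel, mode="valid")   # window centres for plotting

print(len(rolling_mean))      # number of windows: n - window + 1
print(rolling_mean[0])        # youngest 500 girls
print(rolling_mean[-1])       # oldest 500 girls
```

Plotting `rolling_mean` against `rolling_age` should give an S-shaped curve that starts near 0 and flattens out near 1.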
00:04:47.800 --> 00:04:52.540 With the logit curve, the interpretation of the result stays the same, so the line 00:04:52.540 --> 00:04:58.840 gives us the expected probability of a girl having had menarche given her age. But this 00:04:58.840 --> 00:05:03.340 curve, as we saw from the previous slide, is a much better fit for the data. 00:05:03.340 --> 00:05:09.520 So the relationship in the data is not linear; rather, it follows an S shape, and the logit 00:05:09.520 --> 00:05:13.990 curve is one such S-shaped curve that we could use, and it's very commonly used. 00:05:13.990 --> 00:05:19.420 So we get the probability of having had menarche given age from the model. 00:05:19.420 --> 00:05:25.390 The model can be expressed mathematically, because all models are just equations, and 00:05:25.390 --> 00:05:29.950 the mathematical expression for this logistic regression model is as follows. 00:05:29.950 --> 00:05:34.210 First you have the linear regression model. That's the linear probability model, because 00:05:34.210 --> 00:05:40.720 we have a binary dependent variable, and the logistic 00:05:40.720 --> 00:05:46.420 model extends the normal regression model by taking a function of this fitted value. 00:05:46.420 --> 00:05:51.250 So we calculate the linear prediction using the observed data, and then 00:05:51.250 --> 00:05:58.030 we take a function here, which gives us the logit curve. 00:05:58.030 --> 00:06:02.290 The inverse of this function is called the link function, and that's the logit function. 00:06:02.290 --> 00:06:07.120 Whether it's called an inverse function or a function doesn't matter. 
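As a small sketch of the two functions just mentioned: the logistic function turns the linear prediction into a probability, and the logit function (the link function) is its inverse. The coefficients `b0` and `b1` below are hypothetical values picked for illustration, not estimates from the lecture.

```python
import numpy as np

def logistic(eta):
    """Inverse link: maps a linear prediction to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-eta))

def logit(p):
    """Link function: maps a probability back to the linear-prediction scale."""
    return np.log(p / (1.0 - p))

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = -20.25, 1.5
age = np.linspace(10, 18, 9)     # ages 10, 11, ..., 18

eta = b0 + b1 * age              # linear prediction (a straight line in age)
prob = logistic(eta)             # same prediction bent into an S-shaped curve

for a, p in zip(age, prob):
    print(f"age {a:4.1f}: predicted probability {p:.3f}")
```

Because the logistic function only returns values strictly between 0 and 1, the predictions stay valid probabilities no matter how large or small the linear prediction is.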
00:06:07.120 --> 00:06:11.770 The important thing for you to understand is that instead of using the predictions 00:06:11.770 --> 00:06:17.440 directly, we apply a function to the predictions that 00:06:17.440 --> 00:06:25.060 transforms the predictions from a line to a curve. Okay, so how do we estimate 00:06:25.060 --> 00:06:33.520 the model? We can apply OLS estimation. So we apply OLS estimation, then we do diagnostics. 00:06:33.520 --> 00:06:43.240 So we get the residuals here. There's a residual, so we can calculate it, then we can plot 00:06:43.240 --> 00:06:47.800 residuals versus fitted, which is one of the standard diagnostic plots, and then we can 00:06:47.800 --> 00:06:53.380 check the normality of the residuals. We have two violations of regression assumptions. First 00:06:53.380 --> 00:07:00.040 of all, the residuals are not normally distributed, but that's not really a big deal. 00:07:00.040 --> 00:07:06.430 It's only relevant in very small samples. Then we have a heteroscedasticity problem, 00:07:06.430 --> 00:07:12.100 because the variation of the residuals here is a lot higher than the variation 00:07:12.100 --> 00:07:16.900 here, because the variance is the square of the difference, the square of the residual. 00:07:16.900 --> 00:07:24.190 So we have a heteroscedasticity problem. We are in violation of 00:07:24.190 --> 00:07:31.360 the MLR 5 and MLR 6 assumptions. Whether that's a big deal or not, we could 00:07:31.360 --> 00:07:36.880 use robust standard errors, but there are also some computational difficulties when we try to apply 00:07:36.880 --> 00:07:43.630 the least squares approach to this kind of problem. 
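The heteroscedasticity point can be checked numerically with a rough sketch like the one below. It fits a straight line by least squares to synthetic data (the assumed probability curve is for illustration only) and compares the residual variance across age groups. For a 0/1 outcome with probability p the variance is p(1 - p), so in this simulation the spread is largest where the probability is near 0.5.

```python
import numpy as np

# Synthetic data only: the assumed probability curve is for illustration.
rng = np.random.default_rng(0)
n = 4000
age = rng.uniform(10, 18, n)
p_true = 1 / (1 + np.exp(-1.5 * (age - 13.5)))   # assumed S-shaped probability
menarche = rng.binomial(1, p_true)

# Least squares line and its residuals.
slope, intercept = np.polyfit(age, menarche, 1)
fitted = intercept + slope * age
resid = menarche - fitted

# Residual variance within one-year age groups.
edges = np.arange(10, 19)
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (age >= lo) & (age < hi)
    print(f"age {lo}-{hi}: residual variance {resid[in_bin].var():.3f}")

var_young = resid[age < 11].var()                    # almost everyone is 0
var_middle = resid[(age >= 13) & (age < 14)].var()   # mix of 0s and 1s
```

A residuals-versus-fitted plot of `resid` against `fitted` would show the same pattern visually.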
And because of those computational difficulties 00:07:43.630 --> 00:07:47.620 and because OLS is not ideal anyway because of the violation of these assumptions, 00:07:47.620 --> 00:07:54.280 we estimate this using a different approach called maximum likelihood estimation.
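The lecture does not go into how maximum likelihood estimation works at this point, so the following is only one possible sketch: Newton-Raphson iterations on the logistic log-likelihood, run on synthetic data generated with an assumed slope of 1.5. Centering age and the stopping rule are implementation choices made here, not something stated in the source.

```python
import numpy as np

# Synthetic data only: generated with an assumed slope of 1.5 per year of age.
rng = np.random.default_rng(0)
n = 4000
age = rng.uniform(10, 18, n)
p_true = 1 / (1 + np.exp(-1.5 * (age - 13.5)))
menarche = rng.binomial(1, p_true)

# Design matrix: intercept and centered age (centering helps numerically).
X = np.column_stack([np.ones(n), age - age.mean()])
beta = np.zeros(2)                       # start from all-zero coefficients

for iteration in range(25):
    p = 1 / (1 + np.exp(-X @ beta))      # current predicted probabilities
    gradient = X.T @ (menarche - p)      # first derivative of the log-likelihood
    hessian = X.T @ (X * (p * (1 - p))[:, None])   # negative second derivative
    step = np.linalg.solve(hessian, gradient)
    beta = beta + step                   # Newton-Raphson update
    if np.max(np.abs(step)) < 1e-10:     # stop once the estimates settle
        break

p = 1 / (1 + np.exp(-X @ beta))
log_likelihood = np.sum(menarche * np.log(p) + (1 - menarche) * np.log(1 - p))
print(f"estimated slope for age: {beta[1]:.3f}")
print(f"log-likelihood at the estimate: {log_likelihood:.1f}")
```

With this synthetic sample the estimated slope should land close to the value used to generate the data.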