WEBVTT
Kind: captions
Language: en
00:00:00.090 --> 00:00:03.240
Interpreting the logistic
regression analysis results
00:00:03.240 --> 00:00:06.860
differs a bit from normal
regression analysis interpretation.
00:00:06.860 --> 00:00:10.500
Let's take a look at the results
from logistic regression analysis,
00:00:10.500 --> 00:00:11.940
using the Menarche dataset.
00:00:12.825 --> 00:00:17.520
The R glm command gives us these results.
00:00:17.520 --> 00:00:21.510
So we'll just focus on the
actual coefficients for now,
00:00:21.510 --> 00:00:25.290
and leave these other things for another video.
00:00:25.961 --> 00:00:28.050
We have the estimate,
00:00:28.050 --> 00:00:30.000
which is the point estimate of the coefficient.
00:00:30.000 --> 00:00:31.350
Then we have the standard error,
00:00:31.350 --> 00:00:36.630
which quantifies how much the estimate is
likely to change from one sample to another,
00:00:36.630 --> 00:00:38.250
if we repeated the study.
00:00:38.250 --> 00:00:40.170
We have a Z value,
00:00:40.170 --> 00:00:43.410
which is the ratio of the estimate
divided by the standard error.
00:00:43.410 --> 00:00:47.730
So the Z value is analogous to the
T value in regression analysis.
00:00:47.730 --> 00:00:52.140
It is called a Z statistic instead of a T statistic,
00:00:52.140 --> 00:00:56.790
because the maximum likelihood estimates
are based on large sample theory,
00:00:56.790 --> 00:00:59.550
and instead of comparing
against the T distribution,
00:00:59.550 --> 00:01:03.840
we compare this against the normal distribution.
00:01:03.840 --> 00:01:09.690
So under the null hypothesis that this
estimate is zero in the population,
00:01:09.690 --> 00:01:12.300
and if the sample size is large enough,
00:01:12.300 --> 00:01:15.210
the Z value follows a standard normal distribution
00:01:15.210 --> 00:01:18.480
and that allows us to calculate the p-values.
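NOTE
The p-value calculation described above can be sketched in Python (the video uses R; this standalone snippet only illustrates the two-sided test against the standard normal distribution):
```python
import math
# Two-sided p-value for a z statistic, using the standard
# normal CDF written via the error function.
def two_sided_p(z):
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)
# A z value of about 1.96 gives the familiar 0.05 threshold.
print(round(two_sided_p(1.96), 3))  # ~0.05
```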
00:01:18.480 --> 00:01:24.060
So whether age has an effect or not
can be interpreted from these p-values,
00:01:24.060 --> 00:01:30.300
we can see that age has a large and
highly statistically significant effect.
00:01:30.300 --> 00:01:32.490
So we can confidently say that
00:01:32.490 --> 00:01:37.380
age has some kind of effect on the
probability of having had menarche.
00:01:38.356 --> 00:01:43.470
The magnitude of that effect is a
more complicated question to answer.
00:01:44.111 --> 00:01:48.480
We really can't say that the
probability of having had menarche
00:01:49.166 --> 00:01:54.210
increases by 1.6 when the
girl gets one year older.
00:01:54.210 --> 00:01:59.970
One reason is that 1.6 increase gets
us beyond the range of the data.
00:01:59.970 --> 00:02:02.730
So if the probability is initially 0,
00:02:02.730 --> 00:02:04.950
and you increase age by one,
00:02:04.950 --> 00:02:08.160
the predicted probability would be 1.62,
00:02:08.160 --> 00:02:10.020
which is impossible. So it doesn't work that way.
00:02:10.798 --> 00:02:13.950
The reason why we can't
interpret this directly is,
00:02:13.950 --> 00:02:19.500
these are the effects before we
applied the logistic link function.
00:02:19.500 --> 00:02:26.040
So these are effects on the linear predictor
and not on the actual dependent variable.
00:02:26.406 --> 00:02:28.236
So it's the same as
00:02:28.236 --> 00:02:31.380
when you apply a log transformation
to the dependent variable:
00:02:31.380 --> 00:02:33.972
the interpretation is that
00:02:34.110 --> 00:02:38.576
the coefficient tells you the
effect on the log scale,
00:02:39.000 --> 00:02:42.090
and you want to know what's the
effect on the original scale.
00:02:42.090 --> 00:02:44.160
This coefficient here tells you,
00:02:44.160 --> 00:02:47.220
what is the effect in the
scale of the linear predictor?
00:02:47.220 --> 00:02:49.350
But you are not really interested in that,
00:02:49.350 --> 00:02:50.400
you are interested in,
00:02:50.400 --> 00:02:52.770
what is the effect on the observed variable scale?
00:02:53.334 --> 00:02:57.600
So we don't interpret these directly; instead
00:02:57.600 --> 00:03:00.180
we interpret them as odds ratios.
00:03:00.897 --> 00:03:06.990
So the odds ratio is a concept that
is useful for logistic
00:03:06.990 --> 00:03:11.820
regression analysis and for
some other models as well.
00:03:13.101 --> 00:03:17.340
The idea is that odds are
the ratio of the frequencies of two outcomes.
00:03:17.340 --> 00:03:27.600
So here we have the outcome of girl having
had menarche and not having had menarche.
00:03:27.600 --> 00:03:31.920
If 1 in 100 girls have had menarche,
00:03:31.920 --> 00:03:36.450
then the odds of having had menarche are 1 to 99,
00:03:36.450 --> 00:03:39.630
because one girl out of the 100 has had it,
00:03:39.630 --> 00:03:44.070
and the remaining 99 haven't.
00:03:44.680 --> 00:03:50.580
One common use of
odds is in gambling.
00:03:50.580 --> 00:03:52.860
So if you have two soccer teams,
00:03:52.860 --> 00:03:55.980
one has won two matches in the past,
00:03:55.980 --> 00:03:59.400
another one has won five matches in the past.
00:03:59.400 --> 00:04:01.500
Then you say that based on that data
00:04:01.500 --> 00:04:04.410
the odds for the first team
winning is two to five.
00:04:04.410 --> 00:04:06.000
So that's the idea of odds.
00:04:06.930 --> 00:04:13.080
And more formally, if the
probability of an outcome is P,
00:04:13.080 --> 00:04:17.790
then the odds are defined
as P against one minus P.
00:04:17.790 --> 00:04:22.260
So it's the probability of one outcome
divided by the probability of another outcome,
00:04:22.260 --> 00:04:23.670
if you have only two possible outcomes.
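NOTE
As a quick sketch of that definition P / (1 - P) (Python here, though the analysis in the video is done in R):
```python
# Odds for a probability p, in the two-outcome case:
# the probability of the outcome against its complement.
def odds(p):
    return p / (1 - p)
print(odds(0.5))   # even chances -> odds of 1.0
print(odds(0.01))  # the 1-in-100 example -> 1/99
```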
00:04:25.256 --> 00:04:27.060
And here is where the exponential comes in:
00:04:27.060 --> 00:04:30.660
if you exponentiate the logistic
regression coefficients,
00:04:30.660 --> 00:04:34.020
those can be interpreted as odds ratios.
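NOTE
For instance, exponentiating the age coefficient of roughly 1.6 quoted earlier gives its odds ratio (a Python sketch; the coefficient value is taken from the transcript, not a fresh model fit):
```python
import math
beta_age = 1.6                 # logistic regression coefficient for age
odds_ratio = math.exp(beta_age)
print(round(odds_ratio, 1))    # close to 5
```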
00:04:34.478 --> 00:04:39.930
And the idea is that when you
exponentiate the coefficients,
00:04:39.930 --> 00:04:42.930
then the coefficients tell you that
00:04:42.930 --> 00:04:48.600
a one-unit increase in the independent variable
00:04:48.600 --> 00:04:53.610
multiplies the odds by the
exponentiated coefficient.
00:04:54.000 --> 00:04:55.410
I'll show you an example.
00:04:55.807 --> 00:04:58.110
Let's take a look at the idea of the odds ratio,
00:04:58.110 --> 00:05:02.310
and why we can interpret these
coefficients as odds ratios.
00:05:02.829 --> 00:05:05.079
So here are example odds for the data.
00:05:05.079 --> 00:05:08.610
These results are illustrative:
00:05:08.610 --> 00:05:10.230
we have the linear predictor,
00:05:10.230 --> 00:05:12.360
we have the fitted probability,
00:05:12.360 --> 00:05:13.800
we have the fitted odds,
00:05:13.800 --> 00:05:21.450
which is the probability of one outcome
against the probability of the other,
00:05:21.450 --> 00:05:22.954
and we calculate its value.
00:05:23.518 --> 00:05:28.328
So the odds for this first girl
having had menarche are 74% to 26%,
00:05:28.328 --> 00:05:30.128
which is 2.79.
00:05:30.128 --> 00:05:33.300
The odds for the second girl are 8% to 92%,
00:05:33.300 --> 00:05:35.520
which is 0.09 and so on.
00:05:35.520 --> 00:05:36.270
So these are the odds.
00:05:36.270 --> 00:05:40.860
And then we calculate marginal predictions.
00:05:40.860 --> 00:05:44.490
So in regression analysis, we are
interested in the marginal effect,
00:05:44.490 --> 00:05:47.370
that is, what is the effect
00:05:47.370 --> 00:05:51.510
of increasing one independent
variable by one unit,
00:05:51.510 --> 00:05:53.010
holding everything else constant.
00:05:53.010 --> 00:05:55.110
So we are interested in marginal effect.
00:05:55.110 --> 00:06:00.630
And let's calculate marginal effects
now for girls of different ages.
00:06:00.630 --> 00:06:03.090
So instead of using this actual data,
00:06:03.090 --> 00:06:08.040
we have a hypothetical girl at
age of 9, 10, 11, 12 and so on.
00:06:08.040 --> 00:06:12.900
We calculate the fitted
probabilities using our model.
00:06:13.037 --> 00:06:16.682
And we calculate odds.
00:06:16.743 --> 00:06:19.495
We calculate the value of the odds,
00:06:19.495 --> 00:06:22.770
and when we compare two odds here,
00:06:22.770 --> 00:06:24.330
the ratio of these two odds, shown as zeros,
00:06:24.330 --> 00:06:27.352
they are actually not exactly 0, is 4.6.
00:06:27.352 --> 00:06:31.477
So every time we go and we
increase the girl's age by one,
00:06:31.477 --> 00:06:34.950
then the odds increase by a factor of 4.6.
00:06:34.950 --> 00:06:40.860
So every additional year
multiplies the odds by 4.6.
00:06:41.400 --> 00:06:44.340
So that's the odds ratio interpretation.
00:06:44.340 --> 00:06:50.520
So the odds always change by this same factor of 4.6.
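NOTE
That constant multiplicative change can be verified with a small sketch. The intercept and slope below are made-up illustrative values, not the actual fit; the point is that the odds ratio between consecutive ages always equals exp(slope):
```python
import math
a, b = -21.0, 1.6  # hypothetical intercept and age slope
def fitted_odds(age):
    # odds are the exponential of the linear predictor
    return math.exp(a + b * age)
# Compare odds at consecutive ages: the ratio is constant.
ratios = [fitted_odds(age + 1) / fitted_odds(age) for age in range(9, 16)]
print(ratios[0], math.exp(b))  # each one-year step multiplies odds by exp(b)
```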
00:06:51.084 --> 00:06:56.280
So how do we use that in regression analysis?
00:06:56.280 --> 00:06:58.380
Well, we calculate the odds ratios,
00:06:58.380 --> 00:07:01.338
and this is for the actual data.
00:07:02.101 --> 00:07:04.800
So we calculate the odds ratio,
00:07:04.800 --> 00:07:06.157
which is about 5.
00:07:06.157 --> 00:07:09.210
And the interpretation is that,
00:07:09.210 --> 00:07:17.913
each additional year of age increases the
odds of having had menarche fivefold.
00:07:18.279 --> 00:07:21.000
So that kind of quantifies,
00:07:21.084 --> 00:07:23.100
how large the effect of age is.
00:07:23.100 --> 00:07:26.109
We know that if something is increased fivefold,
00:07:26.109 --> 00:07:27.902
then it's a pretty large effect.
00:07:27.902 --> 00:07:30.720
The problem still is that with odds ratios,
00:07:30.720 --> 00:07:35.820
we can't really say, how much does
the actual probability increase,
00:07:35.820 --> 00:07:38.490
because odds and probability
are not the same things.
00:07:38.917 --> 00:07:41.407
And quite often we want to know,
00:07:42.124 --> 00:07:47.464
how much does the probability of
having had menarche depend on the age,
00:07:47.520 --> 00:07:50.190
and what does the effect look like?
00:07:50.770 --> 00:07:56.520
To do that we would need to plot the
marginal predictions from the model.
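NOTE
A minimal sketch of such marginal predictions (again with assumed coefficients, since the real ones come from the fitted glm): the inverse logit maps the linear predictor back to a probability, which stays between 0 and 1.
```python
import math
a, b = -21.0, 1.6  # hypothetical intercept and age slope
def predicted_prob(age):
    eta = a + b * age                # linear predictor
    return 1 / (1 + math.exp(-eta))  # inverse logit -> probability
# Predicted probability of having had menarche at each age.
for age in range(9, 17):
    print(age, round(predicted_prob(age), 3))
```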