WEBVTT Kind: captions Language: en 00:00:00.090 --> 00:00:04.920 One way to interpret the regression analysis results from logistic regression is 00:00:04.920 --> 00:00:06.990 to do marginal prediction plots. 00:00:06.990 --> 00:00:10.590 This is a very useful technique because it's a generic technique. 00:00:10.590 --> 00:00:13.470 Instead of having to memorize 00:00:13.470 --> 00:00:18.120 how every possible different nonlinear regression model is interpreted, 00:00:18.120 --> 00:00:20.098 you just need one tool. 00:00:20.483 --> 00:00:25.770 Another advantage is that this tool gives you the effects 00:00:25.770 --> 00:00:28.158 on the original scale of the dependent variable. 00:00:28.296 --> 00:00:30.390 In the case of logistic regression analysis, 00:00:30.390 --> 00:00:31.770 you will directly see 00:00:31.770 --> 00:00:38.190 what the effect of each independent variable is on the predicted probability. 00:00:38.837 --> 00:00:41.310 To do plotting we need some data. 00:00:41.310 --> 00:00:45.060 I will use the Hosmer and Lemeshow data. 00:00:45.060 --> 00:00:48.923 So this is from a widely cited regression analysis book. 00:00:49.294 --> 00:00:53.549 And the data are about babies born to different kinds of mothers. 00:00:53.797 --> 00:00:55.807 The dependent variable is 00:00:55.807 --> 00:01:01.830 whether the baby was born with low birth weight, defined as less than 2.5 kilos. 00:01:01.830 --> 00:01:06.900 And we'll be looking at the weight of the mother at last menstrual period, 00:01:06.900 --> 00:01:08.430 the race of the mother, 00:01:08.430 --> 00:01:11.164 and whether the mother smoked during pregnancy, 00:01:11.164 --> 00:01:15.081 as our interesting independent variables. 00:01:16.072 --> 00:01:21.870 We are first going to fit a linear probability model and a logistic regression model to this data. 00:01:22.200 --> 00:01:24.210 And I'm using Stata here.
00:01:24.210 --> 00:01:27.420 We have the linear probability model here 00:01:27.420 --> 00:01:29.790 and we have the logistic regression model here. 00:01:30.547 --> 00:01:33.030 And the dependent variable was the low birth weight. 00:01:33.030 --> 00:01:35.520 And we can see from the linear probability model, 00:01:35.520 --> 00:01:36.780 it's easy to interpret, 00:01:36.780 --> 00:01:42.840 we get the predicted probability of having a low birth weight baby, 00:01:42.840 --> 00:01:49.322 it is 0.22 higher for black women than for white women, 00:01:49.390 --> 00:01:50.920 who are the reference category. 00:01:50.920 --> 00:01:55.710 It is 15 percentage points higher for smokers than for non-smokers. 00:01:56.068 --> 00:01:58.948 So we can directly interpret the effects. 00:01:59.429 --> 00:02:00.960 Here, with the odds ratios, 00:02:00.960 --> 00:02:08.460 we can say that the odds for a black mother are 3.5 times greater 00:02:08.460 --> 00:02:10.170 than for a white mother. 00:02:10.170 --> 00:02:15.050 But that doesn't really tell us anything about the increase in probability, 00:02:15.050 --> 00:02:19.070 because the odds ratio is a proportional effect, 00:02:19.070 --> 00:02:21.000 it's a relative effect. 00:02:21.000 --> 00:02:22.143 You have to know 00:02:22.143 --> 00:02:28.617 what the original odds are that are being multiplied by 3.5. 00:02:29.620 --> 00:02:31.940 Plotting is very useful to understand 00:02:31.940 --> 00:02:33.920 what these effects look like. 00:02:33.920 --> 00:02:37.520 So when we compare the effects of race and smoke, 00:02:37.520 --> 00:02:40.040 these are not really comparable. 00:02:40.040 --> 00:02:47.390 So it's difficult to say whether a 3.5-fold increase in odds is a larger effect than 00:02:47.390 --> 00:02:50.660 a 22 percentage point increase in probability, 00:02:50.660 --> 00:02:52.700 because they are expressed on different scales.
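A minimal sketch, in Python, of why an odds ratio alone does not pin down the change in probability: the same odds ratio of 3.5 is applied to several baseline probabilities. The baseline probabilities and the function name `prob_after_odds_ratio` are illustrative assumptions, not values from the lecture's data.

```python
def prob_after_odds_ratio(baseline_prob, odds_ratio):
    """Multiply the baseline odds by an odds ratio and convert back to a probability."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)


ODDS_RATIO = 3.5  # odds ratio for black vs. white mothers quoted in the lecture

# Hypothetical baseline probabilities for the reference group (assumption).
for p0 in (0.05, 0.25, 0.50):
    p1 = prob_after_odds_ratio(p0, ODDS_RATIO)
    print(f"baseline {p0:.2f} -> {p1:.3f} (difference {p1 - p0:.3f})")
```

The printed differences vary with the baseline, which is the point made above: the odds ratio is a relative effect, so the change on the probability scale depends on where you start.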
00:02:52.700 --> 00:02:56.000 And we're usually interested in the original scale of the variable. 00:02:56.468 --> 00:03:00.500 Also, we can't, from the logistic model directly, 00:03:00.500 --> 00:03:06.710 say what is the expected difference between black smokers and white non-smokers. 00:03:07.426 --> 00:03:09.730 In the linear probability model, the whites are the base category, 00:03:09.730 --> 00:03:15.538 so black mothers is 0.22 and smokers is 0.16, 00:03:15.538 --> 00:03:20.960 so it's about a 40 percentage point difference between black smokers and white non-smokers. 00:03:20.960 --> 00:03:22.940 Easy to see from this model. 00:03:23.780 --> 00:03:31.098 Here, in the logistic model, we say that the black mother has 3.5 times greater odds, 00:03:31.153 --> 00:03:34.363 and smokers have 2.5 times greater odds. 00:03:34.556 --> 00:03:36.656 So we multiply these together 00:03:36.656 --> 00:03:40.130 and it's about eight or nine, something like that, times higher 00:03:40.130 --> 00:03:42.830 odds for black smokers than white non-smokers. 00:03:42.830 --> 00:03:44.300 But that's difficult to interpret. 00:03:44.809 --> 00:03:46.640 So what we can do is, 00:03:46.640 --> 00:03:48.830 we can apply the marginal prediction plots. 00:03:49.284 --> 00:03:54.740 Stata's margins command or R's effects package will do that for you quite easily. 00:03:55.208 --> 00:03:56.420 This is from Stata, 00:03:56.420 --> 00:03:58.391 so this is the linear predictions. 00:03:58.391 --> 00:04:04.340 And we can see from the linear model that the effect of the mother's weight here 00:04:04.340 --> 00:04:07.190 is the same for all kinds of mothers. 00:04:07.190 --> 00:04:08.720 So we have three races here, 00:04:08.720 --> 00:04:16.310 and the effect of weight at the last menstruation is the same for all mothers. 00:04:16.310 --> 00:04:21.140 So the mothers only differ with respect to the base level, 00:04:21.140 --> 00:04:22.640 so that's the intercept, 00:04:22.640 --> 00:04:25.790 because we estimated the effect of race.
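The comparison of black smokers with white non-smokers discussed above can be sketched as simple arithmetic. This assumes, as the calculation in the lecture does, that neither model contains a race-by-smoking interaction. The numbers are the rounded figures quoted in the lecture, and the function names are hypothetical.

```python
import math

# Rounded figures quoted in the lecture.
LPM_BLACK = 0.22   # linear probability model: black vs. white
LPM_SMOKER = 0.16  # linear probability model: smoker vs. non-smoker
OR_BLACK = 3.5     # logistic model odds ratio: black vs. white
OR_SMOKER = 2.5    # logistic model odds ratio: smoker vs. non-smoker


def lpm_difference(*coefficients):
    """Linear probability model: effects add up on the probability scale."""
    return sum(coefficients)


def combined_odds_ratio(*odds_ratios):
    """Logistic model: odds ratios multiply on the odds scale."""
    return math.prod(odds_ratios)


print(f"probability difference: {lpm_difference(LPM_BLACK, LPM_SMOKER):.2f}")
print(f"combined odds ratio: {combined_odds_ratio(OR_BLACK, OR_SMOKER):.2f}")
```

The first number is directly a difference in predicted probability. The second is still a ratio of odds, so it says nothing about the probability difference until a baseline is chosen.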
00:04:26.600 --> 00:04:29.450 For the logistic regression model, we can see that 00:04:29.450 --> 00:04:31.670 the same base difference is here, 00:04:31.670 --> 00:04:33.860 but the shape of these curves is different. 00:04:33.860 --> 00:04:38.420 So these curves flatten here more, 00:04:38.420 --> 00:04:40.340 and these are a lot steeper curves. 00:04:40.932 --> 00:04:45.650 So when we have a mother that doesn't weigh much, 00:04:45.650 --> 00:04:46.970 so these are pounds, 00:04:46.970 --> 00:04:50.780 then for all races, 00:04:50.780 --> 00:04:56.000 the likelihood of having a low weight baby is large. 00:04:56.000 --> 00:05:00.590 And we can see that for all races the likelihood gets smaller as the mother's weight increases. 00:05:00.590 --> 00:05:04.370 But also that the predicted probabilities actually converge here. 00:05:04.824 --> 00:05:07.310 So if you are a very big mother, 00:05:07.310 --> 00:05:09.704 then you're going to have a very big child. 00:05:10.365 --> 00:05:17.480 And which one of these fits the data better is partly an empirical question. 00:05:17.480 --> 00:05:19.430 So one way to understand 00:05:19.430 --> 00:05:21.410 which of these plots works better 00:05:21.410 --> 00:05:24.560 is to plot the data over these plots and just see 00:05:24.560 --> 00:05:27.920 which of the two sets of lines explains the data better. 00:05:28.429 --> 00:05:32.240 We can see here that the linear probability model 00:05:32.240 --> 00:05:38.300 predicts a negative probability for some heavy white mothers. 00:05:38.878 --> 00:05:43.859 And the logistic model always predicts between 0 and 1. 00:05:43.859 --> 00:05:46.250 So this is statistically more appealing. 00:05:46.250 --> 00:05:49.520 But if we don't have any mothers here, 00:05:49.520 --> 00:05:53.740 so if all white mothers are quite light, 00:05:53.740 --> 00:05:57.110 then the fact that we predict implausible values 00:05:57.110 --> 00:05:59.300 when we go beyond our data 00:05:59.300 --> 00:06:00.470 is not really a problem.
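A minimal sketch of the two kinds of prediction lines described above, computed by hand in Python rather than with Stata's margins. All coefficients are hypothetical, chosen only so that the shapes resemble what the lecture describes; they are not estimates from the Hosmer and Lemeshow data.

```python
import math

# Hypothetical coefficients (assumptions, not fitted values).
LPM = {"intercept": 0.50, "weight": -0.0025, "black": 0.22}
LOGIT = {"intercept": 0.80, "weight": -0.015, "black": math.log(3.5)}


def inverse_logit(x):
    """Map a linear predictor to a probability strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def lpm_prediction(weight_lb, black):
    """Linear probability model: the prediction is the linear predictor itself."""
    return LPM["intercept"] + LPM["weight"] * weight_lb + LPM["black"] * black


def logit_prediction(weight_lb, black):
    """Logistic model: the linear predictor is passed through the inverse logit."""
    linear = LOGIT["intercept"] + LOGIT["weight"] * weight_lb + LOGIT["black"] * black
    return inverse_logit(linear)


print("weight  lpm_white  lpm_black  logit_white  logit_black")
for weight in range(80, 281, 40):  # mother's weight in pounds
    print(
        f"{weight:6d}  {lpm_prediction(weight, 0):9.3f}  {lpm_prediction(weight, 1):9.3f}"
        f"  {logit_prediction(weight, 0):11.3f}  {logit_prediction(weight, 1):11.3f}"
    )
```

With these made-up numbers the linear lines stay a constant 0.22 apart and drop below zero for heavy mothers in the reference group, while the logistic curves stay inside 0 and 1 and move closer together as weight increases.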
00:06:01.089 --> 00:06:04.449 So, which one of these is better, 00:06:04.820 --> 00:06:07.250 you can justify based on theory, 00:06:07.250 --> 00:06:09.440 but you can also check empirically 00:06:09.440 --> 00:06:11.568 which one fits the data better. 00:06:11.568 --> 00:06:14.570 Logistic regression analysis is typically used by default, 00:06:14.570 --> 00:06:17.433 because it's a safer choice to apply. 00:06:17.625 --> 00:06:20.780 But the linear probability model can be used as well, 00:06:20.780 --> 00:06:25.250 as long as you don't get negative predictions, 00:06:25.250 --> 00:06:28.670 or predictions that exceed 1, for any of the cases in your sample.