WEBVTT Kind: captions Language: en 00:00:00.060 --> 00:00:05.460 After watching the GLM videos, you must have the question of whether you should use these models, 00:00:05.460 --> 00:00:09.870 and if so, when and why. That's the question that I will answer in this video. 00:00:09.870 --> 00:00:17.400 The first of the two questions about whether a GLM is required or useful 00:00:17.400 --> 00:00:23.790 is: is a transformation required? So what do you think is the nature of the relationship between 00:00:23.790 --> 00:00:28.680 your independent variables and the dependent variable? Is it linear and additive, so that 00:00:28.680 --> 00:00:33.030 all the independent variables work separately and the effect on the dependent variable 00:00:33.030 --> 00:00:37.380 is their sum? Or is it exponential and multiplicative, which means that 00:00:37.380 --> 00:00:41.190 you multiply the effects of the independent variables together to get the effect on 00:00:41.190 --> 00:00:46.470 the dependent variable? Or is it perhaps an S-curve, where the effect is first 00:00:46.470 --> 00:00:53.190 very small, then increases, and then is very small again because everybody is at 100% already? 00:00:53.190 --> 00:00:58.530 So this is a question that is about theory and what kind of 00:00:58.530 --> 00:01:04.290 relationships you expect. It's not a question of how the dependent variable is 00:01:04.290 --> 00:01:09.480 distributed. So this is primarily a modeling decision, not a data decision. 00:01:09.480 --> 00:01:15.600 My practical recommendation is that you should always start with linear regression analysis 00:01:15.600 --> 00:01:21.630 and then do diagnostics. Do an added-variable plot. Do a residual-versus-fitted plot and 00:01:21.630 --> 00:01:27.360 see if there is evidence of non-linearity. If there is, then you consider these alternatives.
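A minimal sketch of the residual-versus-fitted check recommended above, using only the Python standard library. The data are hypothetical (an exponential relationship without noise), and `ols_fit` is an illustrative helper, not a function from the video or from any statistics package.

```python
import math

def ols_fit(x, y):
    """One-predictor least squares; returns (intercept, slope)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical data: y grows exponentially in x (no noise, for clarity).
x = [i / 2 for i in range(11)]          # 0.0, 0.5, ..., 5.0
y = [math.exp(0.5 * xi) for xi in x]

intercept, slope = ols_fit(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# Text version of a residual-versus-fitted plot.
# Residuals that are positive at both ends and negative in the middle
# (a U-shape) are evidence of non-linearity.
for f, r in zip(fitted, residuals):
    print(f"fitted={f:7.3f}  residual={r:+7.3f}")
```

With real data the pattern is noisier, so a plot is easier to read than a printed table, but the logic is the same: a systematic curve in the residuals suggests that the linear, additive model is not adequate.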
00:01:27.360 --> 00:01:32.940 Of course, you may have a strong theoretical reason to believe that an exponential model 00:01:32.940 --> 00:01:39.000 or an S-curve model is preferable, but still, doing the regression analysis is very cheap. It doesn't 00:01:39.000 --> 00:01:44.160 cost you much time, and it will tell you something that you didn't know before, most of the time. 00:01:44.160 --> 00:01:49.050 So starting with the regression analysis is a good idea. Then the 00:01:49.050 --> 00:01:54.210 third consideration is that some textbooks and some articles say that you should transform the 00:01:54.210 --> 00:01:59.070 dependent variable to reduce heteroscedasticity so that your standard errors will be correct. 00:01:59.070 --> 00:02:07.530 This decision has nothing to do with standard errors whatsoever. The decision of which 00:02:07.530 --> 00:02:12.660 transformation applies is driven by theory, by what you think is the best explanation for the 00:02:12.660 --> 00:02:18.780 data, and the consideration of standard errors is secondary to that. And you can 00:02:18.780 --> 00:02:25.410 always use robust standard errors to deal with any heteroscedasticity issue anyway. So this is driven 00:02:25.410 --> 00:02:31.230 by theory, not by standard error consistency or by the kind of data that you have. 00:02:31.230 --> 00:02:36.420 The next question arises once you have decided that you want to transform your 00:02:36.420 --> 00:02:41.340 dependent variable somehow. You can also transform your independent variables, but this 00:02:41.340 --> 00:02:46.350 is mostly focused on the dependent variable. Should you transform the dependent variable 00:02:46.350 --> 00:02:51.870 and then apply regression on the transformed values, or should you apply a generalized linear 00:02:51.870 --> 00:02:56.070 model, where you transform the fitted value instead of the dependent variable?
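The two options in that question can be written side by side for the exponential case. The notation here (a single predictor $x$, coefficients $\beta_0$ and $\beta_1$) is assumed for illustration and is not taken from the slides.

```latex
\begin{align}
\text{Transformed dependent variable:}\quad
  & E[\log y \mid x] = \beta_0 + \beta_1 x \\
\text{GLM with log link:}\quad
  & \log E[y \mid x] = \beta_0 + \beta_1 x
\end{align}
```

By Jensen's inequality, $E[\log y \mid x] \le \log E[y \mid x]$, so the two equations describe different quantities, and the coefficients of one are not automatically estimates of the other.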
00:02:56.070 --> 00:03:03.000 There are simple points for and against both decisions. The simple point for transforming the 00:03:03.000 --> 00:03:07.050 dependent variable is that it's simple to do. So there are no computational issues. 00:03:07.050 --> 00:03:11.760 Regression analysis will always give you results, and you can also use OLS 00:03:11.760 --> 00:03:16.680 diagnostics. Regression analysis diagnostics are very useful. They are more developed than 00:03:16.680 --> 00:03:20.790 diagnostics for GLM, and you can find more resources on how to do those. 00:03:20.790 --> 00:03:27.210 Also, regression analysis is well understood. For example, the nature of multiplicative effects, as 00:03:27.210 --> 00:03:33.210 I explained in previous videos, is something that many researchers don't fully understand. 00:03:33.210 --> 00:03:40.380 So regression analysis is more commonly understood by readers and reviewers than GLM. 00:03:40.380 --> 00:03:47.430 There are also simple points against transforming. Transforming a variable with a few 00:03:47.430 --> 00:03:52.590 discrete values is problematic. If you have a count variable with values one, two, and three, then trying to 00:03:52.590 --> 00:03:58.020 do some kind of inverse Poisson transformation on that wouldn't make much sense, because it still 00:03:58.020 --> 00:04:03.120 has three discrete values. If you have ones and zeros, a binary dependent variable, 00:04:03.120 --> 00:04:07.800 transforming the binary dependent variable will give you another binary variable. It doesn't 00:04:07.800 --> 00:04:12.930 do anything. And then you have an issue if you want to, for example, explain company size and 00:04:12.930 --> 00:04:18.510 you want to explain that with an exponential function.
Some companies have zero revenues, so 00:04:18.510 --> 00:04:23.280 how do you deal with those zeros? You can't take a log of zero, and then you need these 00:04:23.280 --> 00:04:28.080 awkward workarounds where you add 1 to the dependent variable before you take the log. 00:04:28.080 --> 00:04:35.250 So these are the simple points against transforming. There's a more rigorous way of looking at this 00:04:35.250 --> 00:04:43.770 issue. Let's look at this GLM model and the transformed model. Typically we're 00:04:43.770 --> 00:04:49.410 interested in explaining the mean of the data, or the expected value of the data given the 00:04:49.410 --> 00:04:54.210 independent variables, in which case we look at the nonlinear regression model. If we apply this 00:04:54.210 --> 00:05:02.310 transformed dependent variable model and we treat these coefficients here as if 00:05:02.310 --> 00:05:07.890 they were estimates for this original model of interest, then they are actually inconsistent. 00:05:07.890 --> 00:05:14.610 So the transformed equation is an inconsistent estimator for the original equation. So 00:05:14.610 --> 00:05:19.620 statistically thinking, you should never transform the dependent variable. You should 00:05:19.620 --> 00:05:25.080 always use the GLM, because the transformed-variable regression is an inconsistent estimator of the GLM. 00:05:25.080 --> 00:05:30.840 That may not be enough to convince all people, so let's take a look at examples. 00:05:30.840 --> 00:05:39.960 So I have this data set here. This is the data set that I've used before, and we have the 00:05:39.960 --> 00:05:46.620 distribution of income for professions that are more than half men and the distribution of income for 00:05:46.620 --> 00:05:53.340 professions that are more than half women.
So we have men-dominated and women-dominated professions, 00:05:53.340 --> 00:05:58.590 and we are interested in knowing whether men-dominated professions are making more money than 00:05:58.590 --> 00:06:03.990 women-dominated professions. And this is something that we would typically want to answer with "men 00:06:03.990 --> 00:06:11.130 make 20% more" or "50% more" instead of saying that men make four thousand Canadian dollars more, 00:06:11.130 --> 00:06:15.540 because percentages are how we typically think in these kinds of comparisons. 00:06:15.540 --> 00:06:23.250 So how do we do it? We're going to look at percentages, and we do a transformed dependent 00:06:23.250 --> 00:06:28.800 variable regression analysis. We get some estimates here. Then we can calculate predictions 00:06:28.800 --> 00:06:35.040 using these estimates. So the predictions are here in the equation. In the plot here we 00:06:35.040 --> 00:06:41.148 can see that the predicted lines here are lower than the actual sample 00:06:41.148 --> 00:06:48.180 means. So the model predicts the sample means a bit incorrectly, predicting too low, so they are 00:06:48.180 --> 00:06:53.790 predicted erroneously, and the model also predicts the difference between 00:06:53.790 --> 00:07:00.570 the men-dominated and women-dominated professions to be smaller than what it actually is. 00:07:00.570 --> 00:07:04.890 So both the actual means and the difference between the means are 00:07:04.890 --> 00:07:12.030 predicted incorrectly. The difference is not great, but it's noticeable. So based on these 00:07:12.030 --> 00:07:18.390 considerations, the GLM approach should always be preferred over transforming 00:07:18.390 --> 00:07:22.740 the dependent variable.
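The underprediction in this kind of two-group comparison can be reproduced with a small sketch. The income figures below are hypothetical, not the data from the video. With only a group dummy as the predictor, regression on log income fits each group's mean of log income, so the back-transformed prediction is the geometric mean, which is never above the arithmetic mean. A log-link Poisson model with the same dummy would instead reproduce the arithmetic group means, because its score equations set the fitted mean equal to the sample mean within each group.

```python
import math

# Hypothetical incomes (illustrative numbers, not the data from the video).
men_dominated = [4000, 6000, 9000, 15000, 26000]
women_dominated = [3000, 4000, 5000, 7000, 11000]

def arithmetic_mean(values):
    """Sample mean; also the fitted value of a log-link GLM with a group dummy."""
    return sum(values) / len(values)

def log_ols_prediction(values):
    """Back-transformed prediction from OLS on log(y) with a group dummy.

    The fitted log value is the group mean of log(y), so exp() of it
    is the geometric mean of the group.
    """
    return math.exp(sum(math.log(v) for v in values) / len(values))

groups = [("men-dominated", men_dominated), ("women-dominated", women_dominated)]
for name, incomes in groups:
    print(name,
          "sample mean:", round(arithmetic_mean(incomes)),
          "log-OLS prediction:", round(log_ols_prediction(incomes)))

# Ratio between the groups under each approach.
ratio_means = arithmetic_mean(men_dominated) / arithmetic_mean(women_dominated)
ratio_log_ols = log_ols_prediction(men_dominated) / log_ols_prediction(women_dominated)
print("ratio of sample means:", round(ratio_means, 3))
print("ratio implied by log-OLS:", round(ratio_log_ols, 3))
```

In this made-up example both group predictions from the log regression fall below the sample means, and the implied ratio between the groups is smaller than the ratio of the sample means. The direction of the ratio bias depends on the data; it is not guaranteed in general.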
Of course, doing the transformation of the dependent variable, 00:07:22.740 --> 00:07:29.640 using OLS, and doing the diagnostics is a good starting point, but in the end doing the GLM 00:07:29.640 --> 00:07:34.590 is more rigorous, and that's what the end product of your research should be. 00:07:34.590 --> 00:07:41.340 There is a nice blog post about this from William Gould, who is the founder of Stata. 00:07:41.340 --> 00:07:47.490 He makes a strong case, with some nice references, that that's actually how 00:07:47.490 --> 00:07:52.530 you should do it. So don't log-transform the dependent variable. Use the Poisson 00:07:52.530 --> 00:07:59.310 GLM or QML estimate instead, with robust standard errors. That gives you better 00:07:59.310 --> 00:08:02.520 estimates than the regression on the transformed dependent variable. 00:08:02.520 --> 00:08:10.440 So what are the practical recommendations? Once you have decided that you want to use one of these 00:08:10.440 --> 00:08:17.730 transformations, what's the modeling technique that you should apply? For the linear additive model, 00:08:17.730 --> 00:08:25.980 least squares, always. There is no reason to use anything else. That's the overall best. Weighted least 00:08:25.980 --> 00:08:31.290 squares could be slightly more efficient in some scenarios, but it's not worth the effort to do that. 00:08:31.290 --> 00:08:39.210 If you have an exponential model with multiplicative relationships, then if you know the distribution 00:08:39.210 --> 00:08:45.150 of the dependent variable given the fitted values, use maximum likelihood estimation of the 00:08:45.150 --> 00:08:49.920 generalized linear model with the correct distribution. So if you know that it's Poisson, or you know it's 00:08:49.920 --> 00:08:56.370 negative binomial, or you know that it's something else, then apply the normal GLM.
If you don't 00:08:56.370 --> 00:09:01.440 know what the distribution is, or you're uncertain about the distribution of the dependent variable, 00:09:01.440 --> 00:09:08.730 or you know that it doesn't follow any of the distributions that your statistical software 00:09:08.730 --> 00:09:15.510 supports, then apply Poisson quasi-maximum likelihood estimation with robust standard errors. 00:09:15.510 --> 00:09:22.110 So this is a similarly safe choice as using OLS is for the linear 00:09:22.110 --> 00:09:28.110 model. For the S-curve models it's the same thing. If you know the distribution, so if you know 00:09:28.110 --> 00:09:33.360 that you are using fractional response data and you know that the dependent variable is 00:09:33.360 --> 00:09:39.000 beta distributed given the predicted values, then use beta regression analysis, so maximum 00:09:39.000 --> 00:09:43.710 likelihood GLM with the correct distribution. Otherwise, if you don't know the distribution 00:09:43.710 --> 00:09:50.670 of the dependent variable, then use Bernoulli quasi-maximum likelihood with robust standard errors. 00:09:50.670 --> 00:09:57.120 So if you have fractional response data, then basically I would always recommend that you just use 00:09:57.120 --> 00:10:02.820 the normal logistic regression analysis for that, because it works. You would think that it doesn't, 00:10:02.820 --> 00:10:07.470 but it actually does, as long as this approach has been programmed into your software. 00:10:07.470 --> 00:10:15.180 Now, this has nothing to do with the transformation of the independent variables.
So this is about the 00:10:15.180 --> 00:10:20.130 dependent variable. Transforming independent variables is okay, and you can consider the 00:10:20.130 --> 00:10:25.680 log transformation or sometimes even an exponential transformation of the independent variables to 00:10:25.680 --> 00:10:30.660 get a model that you think explains your data well based on your theory, and then you estimate 00:10:30.660 --> 00:10:38.130 it with either OLS or GLM. This is about what you do with the dependent variable. 00:10:38.130 --> 00:10:44.940 The final question is: is this choice between GLM, transforming the fitted value, and 00:10:44.940 --> 00:10:49.950 transforming the dependent variable a big thing? So let's do an empirical example. 00:10:49.950 --> 00:10:57.630 We have here two models. We are using the Prestige data. We have years of education here. We have 00:10:57.630 --> 00:11:04.860 the predictions from these two models, transformed dependent variable and GLM, for the effects on income. 00:11:04.860 --> 00:11:11.580 When we look at the regression coefficients, we can see that there's a 7.5 percent difference. 00:11:11.580 --> 00:11:15.300 So this one is 0.119 and this one is 00:11:15.300 --> 00:11:22.170 0.128. So a 7.5 percent difference, and that is substantial. In many methodological 00:11:22.170 --> 00:11:27.630 papers we think that a five percent bias is something that you can ignore, but 00:11:27.630 --> 00:11:32.070 this 7.5 percent difference is something that we should care about. 00:11:32.070 --> 00:11:36.270 Also, when we look at the predictions here, we can see that the transformed 00:11:36.270 --> 00:11:44.370 dependent variable model systematically underpredicts how much the professions 00:11:44.370 --> 00:11:50.310 that require high education actually make, and this blue line here is a lot better fit 00:11:50.310 --> 00:11:58.230 to the data.
So empirically it's not a huge difference, but it's something that I 00:11:58.230 --> 00:12:02.910 think we should be concerned about, because the fix is rather simple. 00:12:02.910 --> 00:12:12.180 Now the question is: when I get papers to review where the authors use a transformation on 00:12:12.180 --> 00:12:18.270 the dependent variable, do I recommend that those papers be rejected because they 00:12:18.270 --> 00:12:25.230 don't use the GLM approach or Poisson quasi-maximum likelihood estimation instead 00:12:25.230 --> 00:12:31.140 of the transformation of the dependent variable? No. I would not say that 00:12:31.140 --> 00:12:35.880 this red line is worthless. I'm saying that the blue line is better, and I would probably 00:12:35.880 --> 00:12:42.990 recommend that the authors take a look at some articles that I have cited here that explain why 00:12:42.990 --> 00:12:47.370 the blue line is better than the red line, and then tell them to make an informed decision.
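For readers who want to see what Poisson quasi-maximum likelihood with robust standard errors involves, here is a minimal sketch in plain Python. It assumes a single predictor, uses Newton's method on the Poisson log-likelihood, and computes sandwich standard errors. The function name `poisson_qmle` and the data are hypothetical; the numbers are not the Prestige data used in the video.

```python
import math

def poisson_qmle(x, y, iterations=25):
    """Fit E[y|x] = exp(b0 + b1*x) by Newton's method on the Poisson
    log-likelihood; return (b0, b1, robust_se_b0, robust_se_b1)."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for step in range(iterations + 1):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Information matrix: sum of mu * (1, x)(1, x)'
        h00 = sum(mu)
        h01 = sum(m * xi for m, xi in zip(mu, x))
        h11 = sum(m * xi * xi for m, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        if step == iterations:
            break  # keep mu and the matrix evaluated at the final estimates
        # Score vector: sum of (y - mu) * (1, x)
        g0 = sum(yi - m for yi, m in zip(y, mu))
        g1 = sum((yi - m) * xi for yi, m, xi in zip(y, mu, x))
        # Newton update: solve the 2x2 system analytically
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    # Inverse of the information matrix ("bread" of the sandwich)
    a00, a01, a11 = h11 / det, -h01 / det, h00 / det
    # Outer product of the scores ("meat"); does not assume variance = mean
    m00 = sum((yi - m) ** 2 for yi, m in zip(y, mu))
    m01 = sum((yi - m) ** 2 * xi for yi, m, xi in zip(y, mu, x))
    m11 = sum((yi - m) ** 2 * xi * xi for yi, m, xi in zip(y, mu, x))
    v00 = a00 * (a00 * m00 + a01 * m01) + a01 * (a00 * m01 + a01 * m11)
    v11 = a01 * (a01 * m00 + a11 * m01) + a11 * (a01 * m01 + a11 * m11)
    return b0, b1, math.sqrt(v00), math.sqrt(v11)

# Hypothetical data: years of education above a baseline, income in thousands.
education = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
income = [3.0, 2.7, 5.3, 4.5, 6.1, 8.9, 6.3, 12.1, 12.8, 17.3]

b0, b1, se_b0, se_b1 = poisson_qmle(education, income)
print(f"intercept={b0:.3f} (robust SE {se_b0:.3f})")
print(f"slope={b1:.3f} (robust SE {se_b1:.3f})")
```

The dependent variable here is not a count, which is the point of the quasi-maximum likelihood approach: only the mean function needs to be correct. In practice you would use your statistical software's implementation rather than hand-written code, for example `poisson` with the `vce(robust)` option in Stata.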