WEBVTT
Kind: captions
Language: en

00:00:00.060 --> 00:00:05.460
After watching the GLM videos you must be
wondering whether you should use these models

00:00:05.460 --> 00:00:09.870
and if so when and why? And that's the 
question that I will answer in this video.

00:00:09.870 --> 00:00:17.400
Question number one, out of two questions
about whether GLM is required or useful,

00:00:17.400 --> 00:00:23.790
is this: is a transformation required? So what do you
think is the nature of the relationship between

00:00:23.790 --> 00:00:28.680
your independent variables and the dependent
variable? Is it linear and additive? So do

00:00:28.680 --> 00:00:33.030
all the independent variables work separately 
so that the effect on the dependent variable  

00:00:33.030 --> 00:00:37.380
is their sum, or is it exponential
and multiplicative, which means that

00:00:37.380 --> 00:00:41.190
you multiply the effects of independent 
variables together to get the effect on  

00:00:41.190 --> 00:00:46.470
the dependent variable? Or is it perhaps
an S-curve, where the effect is first

00:00:46.470 --> 00:00:53.190
very small, then increases, and then is very
small again because everybody is at 100% already.
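
NOTE
A minimal sketch of the three relationship shapes described above, not taken from the video. The coefficients b0, b1, b2 and the function names are hypothetical, chosen only for illustration.
```python
import math
def linear_predictor(x1, x2, b0=0.5, b1=0.3, b2=0.2):
    # Hypothetical coefficients, not estimates from any data set.
    return b0 + b1 * x1 + b2 * x2
def mean_identity(x1, x2):
    # Linear and additive: each variable adds its own effect.
    return linear_predictor(x1, x2)
def mean_exponential(x1, x2):
    # Exponential: exp(a + b) = exp(a) * exp(b), so the effects multiply.
    return math.exp(linear_predictor(x1, x2))
def mean_s_curve(x1, x2):
    # Logistic S-curve: the mean is bounded between 0 and 1.
    return 1.0 / (1.0 + math.exp(-linear_predictor(x1, x2)))
# Effect of raising x1 by one unit at two different levels of x2.
add_low = mean_identity(1, 0) - mean_identity(0, 0)
add_high = mean_identity(1, 5) - mean_identity(0, 5)
ratio_low = mean_exponential(1, 0) / mean_exponential(0, 0)
ratio_high = mean_exponential(1, 5) / mean_exponential(0, 5)
```
In this sketch add_low equals add_high (the additive effect does not depend on x2), ratio_low equals ratio_high (the multiplicative effect is a constant ratio), and the S-curve mean always stays between 0 and 1.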

00:00:53.190 --> 00:00:58.530
So this is a question that is 
about theory and what kind of  

00:00:58.530 --> 00:01:04.290
relationships you expect. It's not a
question of how the dependent variable is

00:01:04.290 --> 00:01:09.480
distributed. So this is primarily a
modeling decision, not a data decision.

00:01:09.480 --> 00:01:15.600
My practical recommendation is that you should 
always start with linear regression analysis  

00:01:15.600 --> 00:01:21.630
and then do diagnostics. Do an added-variable
plot. Do a residual-versus-fitted plot and

00:01:21.630 --> 00:01:27.360
see if there is evidence of non-linearity. If
there is, then you consider the alternatives.

00:01:27.360 --> 00:01:32.940
Of course you may have a strong theoretical 
reason to believe that an exponential model  

00:01:32.940 --> 00:01:39.000
or an S-curve model is preferable, but still, doing
the regression analysis is very cheap. It doesn't

00:01:39.000 --> 00:01:44.160
cost you much time and it will tell you something
that you didn't know before, most of the time.

00:01:44.160 --> 00:01:49.050
So starting with the regression analysis 
is a nice idea. It's a good idea. Then the  

00:01:49.050 --> 00:01:54.210
third consideration is that some textbooks and 
some articles say that you should transform the  

00:01:54.210 --> 00:01:59.070
dependent variable to reduce heteroscedasticity 
so that your standard errors will be correct.

00:01:59.070 --> 00:02:07.530
This decision has nothing to do with standard 
errors whatsoever. The decision of which  

00:02:07.530 --> 00:02:12.660
transformation applies is driven by theory: what
do you think is the best explanation for the

00:02:12.660 --> 00:02:18.780
data, and the consideration of standard
errors is secondary to that. And you can

00:02:18.780 --> 00:02:25.410
always use robust standard errors to deal with any 
heteroscedasticity issue anyway. So this is driven  

00:02:25.410 --> 00:02:31.230
by theory, not by standard error consistency
or by the kind of data that you have.

00:02:31.230 --> 00:02:36.420
The next question is that once you have 
decided that you want to transform your  

00:02:36.420 --> 00:02:41.340
dependent variable somehow (you can also
transform your independent variables, but this

00:02:41.340 --> 00:02:46.350
is mostly focused on the dependent variable),
should you transform the dependent variable

00:02:46.350 --> 00:02:51.870
and then apply regression on the transformed
values, or should you apply a generalized linear

00:02:51.870 --> 00:02:56.070
model where you transform the fitted
value instead of the dependent variable?

00:02:56.070 --> 00:03:03.000
There are simple points for and against both
decisions. A simple point for transforming the

00:03:03.000 --> 00:03:07.050
dependent variable is that it's simple to 
do. So there are no computational issues.  

00:03:07.050 --> 00:03:11.760
Regression analysis will always give 
you results and you can also use OLS  

00:03:11.760 --> 00:03:16.680
diagnostics. Regression analysis diagnostics
are very useful. They are more developed than

00:03:16.680 --> 00:03:20.790
diagnostics for GLM, and you can find
more resources on how to do those.

00:03:20.790 --> 00:03:27.210
Also regression analysis is well understood. For 
example the nature of multiplicative effects as  

00:03:27.210 --> 00:03:33.210
I explained in previous videos is something 
that many researchers don't fully understand.  

00:03:33.210 --> 00:03:40.380
So regression analysis is more commonly 
understood by readers and reviewers than GLM.

00:03:40.380 --> 00:03:47.430
There are also simple points against
transforming. Transforming a variable with a few

00:03:47.430 --> 00:03:52.590
discrete values is problematic. If you have a count
variable with values one, two, and three, then trying to

00:03:52.590 --> 00:03:58.020
do some kind of inverse Poisson transformation on
that wouldn't make much sense because it still

00:03:58.020 --> 00:04:03.120
has three discrete values. If you have
ones and zeros, a binary dependent variable,

00:04:03.120 --> 00:04:07.800
transforming a binary dependent variable will
give you another binary variable. It doesn't

00:04:07.800 --> 00:04:12.930
do anything. And then you have the issue that if 
you want to for example explain company size and  

00:04:12.930 --> 00:04:18.510
you want to explain that with an exponential
function, some companies have zero revenues, so

00:04:18.510 --> 00:04:23.280
how do you deal with those zeros? You
can't take a log of zero, so you need these

00:04:23.280 --> 00:04:28.080
awkward workarounds where you add 1 to the
dependent variable before you take the log.
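
NOTE
A small sketch of the zero problem mentioned above. The revenue figures are hypothetical and invented for illustration. It shows that the log of zero is undefined and that the add-one workaround changes what a back-transformed difference means.
```python
import math
# Hypothetical revenues; the first company has zero revenue.
revenues = [0.0, 10.0, 1000.0]
try:
    math.log(revenues[0])
    log_zero_failed = False
except ValueError:
    # math.log raises ValueError for zero.
    log_zero_failed = True
# The common workaround: log(1 + y) instead of log(y).
shifted = [math.log1p(y) for y in revenues]
# A difference in log(y) back-transforms to the ratio of revenues.
ratio_true = revenues[2] / revenues[1]
# A difference in log(1 + y) back-transforms to the ratio of (1 + y).
ratio_from_shifted = math.exp(shifted[2] - shifted[1])
```
Here ratio_true is 100, while ratio_from_shifted is 1001 / 11, which is 91, so the shifted scale no longer measures a pure percentage difference.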

00:04:28.080 --> 00:04:35.250
So these are simple points against transforming.
There's a more rigorous way of looking at this  

00:04:35.250 --> 00:04:43.770
issue. Let's look at this GLM
model and the transformed model. Typically we're

00:04:43.770 --> 00:04:49.410
interested in explaining what is the mean of the 
data or the expected value of the data given the  

00:04:49.410 --> 00:04:54.210
independent variables in which case we look at 
the nonlinear regression model. If we apply this  

00:04:54.210 --> 00:05:02.310
transformed dependent variable model and we treat
these coefficients here as if

00:05:02.310 --> 00:05:07.890
they were estimates for this original model of
interest, then they are actually inconsistent.

00:05:07.890 --> 00:05:14.610
So the transformed equation is an inconsistent
estimator for the original equation. So

00:05:14.610 --> 00:05:19.620
statistically thinking, you should never
transform the dependent variable. You should

00:05:19.620 --> 00:05:25.080
always use the GLM because the transformed
variable model is an inconsistent estimator of the GLM.

00:05:25.080 --> 00:05:30.840
That may not be enough to convince all the 
people but let's take a look at examples.

00:05:30.840 --> 00:05:39.960
So I have this data set here. This is a
data set that I've used before and we have the

00:05:39.960 --> 00:05:46.620
distribution of income for professions that are 
more than half men and distribution of income for  

00:05:46.620 --> 00:05:53.340
professions that are more than half women. So we
have men-dominated and women-dominated professions

00:05:53.340 --> 00:05:58.590
and we are interested in knowing whether men 
dominated professions are making more money than  

00:05:58.590 --> 00:06:03.990
women-dominated professions. And this is something
that we would typically want to answer with men

00:06:03.990 --> 00:06:11.130
make 20% more or 50% more instead of saying that
men make four thousand Canadian dollars more.

00:06:11.130 --> 00:06:15.540
Because percentages are how we
typically think in these kinds of comparisons.

00:06:15.540 --> 00:06:23.250
So how do we do it? We're going to look at
percentages and we do a transformed dependent

00:06:23.250 --> 00:06:28.800
variable regression analysis. We get some 
estimates here. Then we can calculate predictions  

00:06:28.800 --> 00:06:35.040
using these estimates. So the predicted lines 
are here in the equation. In the plot here we  

00:06:35.040 --> 00:06:41.148
can see that the predicted
lines here are lower than the actual sample

00:06:41.148 --> 00:06:48.180
means. So the model predicts the sample means a
bit incorrectly, predicting too low, so they are

00:06:48.180 --> 00:06:53.790
predicted erroneously, and also
the model predicts the difference between

00:06:53.790 --> 00:07:00.570
the men and women dominated professions 
to be smaller than what it actually is.

00:07:00.570 --> 00:07:04.890
So both the actual means and the 
difference between the means are  

00:07:04.890 --> 00:07:12.030
predicted incorrectly. The difference is not 
great but it's noticeable. So based on these  

00:07:12.030 --> 00:07:18.390
considerations the GLM approach should 
always be preferred over transforming  

00:07:18.390 --> 00:07:22.740
the dependent variable. Of course doing the 
transformation of the dependent variable  

00:07:22.740 --> 00:07:29.640
using OLS and doing the diagnostics is a good
starting point, but in the end doing the GLM

00:07:29.640 --> 00:07:34.590
is more rigorous, and that's what the
end product of your research should be.

00:07:34.590 --> 00:07:41.340
There is a nice blog post about this from
William Gould, who is the founder of Stata.

00:07:41.340 --> 00:07:47.490
And he makes a strong case, with some
nice references, that that's actually how

00:07:47.490 --> 00:07:52.530
you should do it. So don't log transform 
the dependent variable. Use the Poisson  

00:07:52.530 --> 00:07:59.310
GLM or QML estimator instead, with
robust standard errors. That gives you better

00:07:59.310 --> 00:08:02.520
estimates than the regression on
the transformed dependent variable.

00:08:02.520 --> 00:08:10.440
So what are the practical recommendations? Once
you have decided that you want to use one of these  

00:08:10.440 --> 00:08:17.730
transformations then what's the modeling technique 
that you should apply? So linear additive model  

00:08:17.730 --> 00:08:25.980
least squares, always. No reason to use anything
else. That's best overall, and weighted least

00:08:25.980 --> 00:08:31.290
squares could be slightly more efficient in some
scenarios, but it's not worth the effort to do that.

00:08:31.290 --> 00:08:39.210
If you have an exponential model with multiplicative
relationships then if you know the distribution  

00:08:39.210 --> 00:08:45.150
of the dependent variable given the fitted values 
then use the maximum likelihood estimation of the  

00:08:45.150 --> 00:08:49.920
generalized linear model with the correct distribution.
So if you know that it's Poisson, or you know it's

00:08:49.920 --> 00:08:56.370
negative binomial, or you know that it's something
else, then apply the normal GLM. If you don't

00:08:56.370 --> 00:09:01.440
know what the distribution is or you're uncertain 
about the distribution of the dependent variable  

00:09:01.440 --> 00:09:08.730
or you know that it doesn't follow any of the 
distributions that your statistical software  

00:09:08.730 --> 00:09:15.510
supports, then apply Poisson quasi maximum
likelihood estimation with robust standard errors.
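
NOTE
A minimal sketch of Poisson quasi maximum likelihood with a robust (sandwich) standard error, written in plain Python for one predictor. The data, the function name poisson_qmle, and the plain Newton iteration are assumptions made for illustration. The sandwich here has no small-sample correction, so statistical packages may report slightly different standard errors.
```python
import math
def poisson_qmle(x, y, iterations=25):
    # Mean model: mu = exp(b0 + b1 * x). The outcome does not need to be a count.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for step in range(iterations + 1):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        resid = [yi - mi for yi, mi in zip(y, mu)]
        # Matrix A: sum of mu * (1, x)(1, x)'.
        a00 = sum(mu)
        a01 = sum(mi * xi for mi, xi in zip(mu, x))
        a11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = a00 * a11 - a01 * a01
        if step == iterations:
            # The last pass only refreshes mu, resid and A at the final estimates.
            break
        # Score vector: sum of (y - mu) * (1, x).
        s0 = sum(resid)
        s1 = sum(ri * xi for ri, xi in zip(resid, x))
        # Newton step: b = b + inverse(A) * score.
        b0 += (a11 * s0 - a01 * s1) / det
        b1 += (a00 * s1 - a01 * s0) / det
    # Matrix B: sum of squared residuals times (1, x)(1, x)'.
    m00 = sum(ri * ri for ri in resid)
    m01 = sum(ri * ri * xi for ri, xi in zip(resid, x))
    m11 = sum(ri * ri * xi * xi for ri, xi in zip(resid, x))
    # Slope row of inverse(A), then the slope entry of inverse(A) B inverse(A).
    c0, c1 = -a01 / det, a00 / det
    var_b1 = c0 * c0 * m00 + 2.0 * c0 * c1 * m01 + c1 * c1 * m11
    return b0, b1, math.sqrt(var_b1)
# Hypothetical data: years of education and income in thousands.
education = [8, 10, 12, 14, 16]
income = [3.1, 4.0, 5.9, 8.2, 13.5]
b0, b1, robust_se_b1 = poisson_qmle(education, income)
fitted = [math.exp(b0 + b1 * xi) for xi in education]
```
The slope b1 reads as an approximate proportional change in expected income per year of education, and at the solution the fitted values sum to the same total as the observed incomes.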

00:09:15.510 --> 00:09:22.110
So this is a similarly
safe choice as OLS is for the linear

00:09:22.110 --> 00:09:28.110
model. For the S-curve models, the same thing
applies: if you know the distribution, so if you know

00:09:28.110 --> 00:09:33.360
that you are using fractional response data 
and you know that the dependent variable is  

00:09:33.360 --> 00:09:39.000
beta distributed given the predicted values,
then use a beta regression analysis, so maximum

00:09:39.000 --> 00:09:43.710
likelihood GLM with the correct distribution. 
Otherwise if you don't know the distribution  

00:09:43.710 --> 00:09:50.670
of the dependent variable, then use Bernoulli
quasi maximum likelihood with robust standard errors.

00:09:50.670 --> 00:09:57.120
So if you have fractional response data then
basically I would always recommend that you just use

00:09:57.120 --> 00:10:02.820
the normal logistic regression analysis for that 
because it works. You would think that it doesn't  

00:10:02.820 --> 00:10:07.470
but it actually does, as long as this approach
has been programmed into your computer software.

00:10:07.470 --> 00:10:15.180
Now this has nothing to do with the transformation 
of the independent variables. So this is about the  

00:10:15.180 --> 00:10:20.130
dependent variable. Transforming independent
variables is okay, and you can consider the

00:10:20.130 --> 00:10:25.680
log transformation or sometimes even exponential 
transformation of the independent variables to  

00:10:25.680 --> 00:10:30.660
get a model that you think explains your data 
well based on your theory and then you estimate  

00:10:30.660 --> 00:10:38.130
it with either OLS or GLM. This
is about what you do with the dependent variable.

00:10:38.130 --> 00:10:44.940
The final question is this: GLM
and transforming the fitted value versus

00:10:44.940 --> 00:10:49.950
transforming the dependent variable, is it a big
thing? So let's do an empirical example. So

00:10:49.950 --> 00:10:57.630
we have here two models. We are using the Prestige
data. We have years of education here. We have

00:10:57.630 --> 00:11:04.860
the predictions from these two models, the transformed
dependent variable model and the GLM, for the effect on income.

00:11:04.860 --> 00:11:11.580
When we look at the regression coefficients we 
can see that there's a 7.5 percent difference.  

00:11:11.580 --> 00:11:15.300
So this is 0.119 and this is
0.128.

00:11:15.300 --> 00:11:22.170
So a 7.5 percent difference,
which is substantial. In many methodological

00:11:22.170 --> 00:11:27.630
papers we think that a five percent bias
is something that you can ignore, but

00:11:27.630 --> 00:11:32.070
this 7.5 percent difference
is something that we should care about.

00:11:32.070 --> 00:11:36.270
Also when we look at the predictions
here we can see that the transformed

00:11:36.270 --> 00:11:44.370
dependent variable model systematically
underpredicts how much professions

00:11:44.370 --> 00:11:50.310
that require high education actually make,
and this blue line here is a much better fit

00:11:50.310 --> 00:11:58.230
to the data. So empirically it's not
a huge difference, but it's something that I

00:11:58.230 --> 00:12:02.910
think we should be concerned about
because the fix is rather simple.

00:12:02.910 --> 00:12:12.180
Now the question is: if and when I get papers
to review where authors use a transformation on

00:12:12.180 --> 00:12:18.270
the dependent variable, do I recommend
that those papers are rejected because they

00:12:18.270 --> 00:12:25.230
don't use the GLM approach or Poisson quasi maximum
likelihood estimation instead

00:12:25.230 --> 00:12:31.140
of the transformation of the dependent
variable? No. I would not say that

00:12:31.140 --> 00:12:35.880
this red line is worthless. I'm saying that
the blue line is better and I would probably

00:12:35.880 --> 00:12:42.990
recommend that the authors take a look at some
articles that I have cited here that explain why

00:12:42.990 --> 00:12:47.370
the blue line is better than the red line and 
then tell them to make an informed decision.