WEBVTT
Kind: captions
Language: en

00:00:00.060 --> 00:00:05.460
After watching the GLM videos, you must have
the question of whether you should use these models,

00:00:05.460 --> 00:00:09.870
and if so, when and why. That's the
question that I will answer in this video.

00:00:09.870 --> 00:00:17.400
The first of two questions about whether
a GLM is required or useful

00:00:17.400 --> 00:00:23.790
is: is a transformation required? What do you
think is the nature of the relationship between

00:00:23.790 --> 00:00:28.680
your independent variables and the dependent
variable? Is it linear and additive, so that

00:00:28.680 --> 00:00:33.030
all the independent variables work separately
and the effect on the dependent variable

00:00:33.030 --> 00:00:37.380
is their sum? Or is it exponential
and multiplicative, which means that

00:00:37.380 --> 00:00:41.190
you multiply the effects of the independent
variables together to get the effect on

00:00:41.190 --> 00:00:46.470
the dependent variable? Or is it perhaps
an S-curve, where the effect is first

00:00:46.470 --> 00:00:53.190
very small, then increases, and then is very
small again because everybody is at 100% already?
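
NOTE
A minimal sketch of the three functional forms described above, assuming two predictors.
The coefficient values B0, B1, B2 and the function names are hypothetical, chosen only for illustration.
```python
import math
# Hypothetical coefficients, chosen only for illustration.
B0, B1, B2 = 0.5, 0.3, 0.2
def linear_additive(x1, x2):
    # Effects add: each variable shifts y by a fixed amount.
    return B0 + B1 * x1 + B2 * x2
def exponential_multiplicative(x1, x2):
    # exp(a + b) = exp(a) * exp(b), so the effects multiply.
    return math.exp(B0 + B1 * x1 + B2 * x2)
def s_curve(x1, x2):
    # Logistic curve: flat near 0, steep in the middle, flat near 1.
    return 1.0 / (1.0 + math.exp(-(B0 + B1 * x1 + B2 * x2)))
```
In the additive form a one-unit change in x1 adds the same amount at any x2; in the exponential form it multiplies y by the same factor at any x2.
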

00:00:53.190 --> 00:00:58.530
So this is a question that is
about theory and what kind of

00:00:58.530 --> 00:01:04.290
relationships you expect. It's not a
question of how the dependent variable is

00:01:04.290 --> 00:01:09.480
distributed. So this is primarily a
modeling decision, not a data decision.

00:01:09.480 --> 00:01:15.600
My practical recommendation is that you should
always start with linear regression analysis

00:01:15.600 --> 00:01:21.630
and then do diagnostics. Do an added-variable
plot. Do a residuals-versus-fitted plot and

00:01:21.630 --> 00:01:27.360
see if there is evidence of nonlinearity. If
there is, then you consider these alternatives.

00:01:27.360 --> 00:01:32.940
Of course, you may have a strong theoretical
reason to believe that an exponential model

00:01:32.940 --> 00:01:39.000
or an S-curve model is preferable, but still, doing
the regression analysis is very cheap. It doesn't

00:01:39.000 --> 00:01:44.160
cost you much time, and it will tell you something
that you didn't know before, most of the time.
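
NOTE
A small sketch of the residuals-versus-fitted idea, assuming one predictor and noise-free data that is truly exponential.
The data and the helper ols_simple are hypothetical; in practice you would plot the residuals rather than inspect three of them.
```python
import math
def ols_simple(x, y):
    # Closed-form least squares for one predictor plus an intercept.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope
x = [i / 2 for i in range(11)]       # 0.0, 0.5, ..., 5.0
y = [math.exp(0.5 * xi) for xi in x]  # exponential relationship
intercept, slope = ols_simple(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
# A straight line fitted to a convex curve leaves positive residuals
# at both ends and negative residuals in the middle.
pattern = (residuals[0] > 0, residuals[5] < 0, residuals[-1] > 0)
```
A U-shaped residual pattern like this is the kind of evidence of nonlinearity that would lead you to consider the exponential model.
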

00:01:44.160 --> 00:01:49.050
So starting with the regression analysis
is a good idea. Then the

00:01:49.050 --> 00:01:54.210
third consideration is that some textbooks and
some articles say that you should transform the

00:01:54.210 --> 00:01:59.070
dependent variable to reduce heteroscedasticity
so that your standard errors will be correct.

00:01:59.070 --> 00:02:07.530
This decision has nothing to do with standard
errors whatsoever. The decision of which

00:02:07.530 --> 00:02:12.660
transformation applies is driven by theory: what
do you think is the best explanation for the

00:02:12.660 --> 00:02:18.780
data? The consideration of standard
errors is secondary to that. And you can

00:02:18.780 --> 00:02:25.410
always use robust standard errors to deal with any
heteroscedasticity issue anyway. So this is driven

00:02:25.410 --> 00:02:31.230
by theory, not by standard error consistency
or by the kind of data that you have.
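
NOTE
A sketch of what robust standard errors do, assuming a single predictor and the HC0 (White) formula; the video does not say which robust variant is meant.
The data are hypothetical and built so that the error spread grows away from the middle of x.
```python
import math
def ols_slope_with_ses(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    d = [xi - mx for xi in x]
    sxx = sum(di * di for di in d)
    slope = sum(di * (yi - my) for di, yi in zip(d, y)) / sxx
    intercept = my - slope * mx
    e = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    # Classical SE assumes one common error variance.
    classical = math.sqrt(sum(ei * ei for ei in e) / (n - 2) / sxx)
    # HC0 SE lets each observation carry its own squared residual.
    robust = math.sqrt(sum((di * ei) ** 2 for di, ei in zip(d, e))) / sxx
    return slope, classical, robust
x = list(range(10))
# Deterministic "noise" whose size grows with distance from the mean of x.
noise = [(-1) ** i * 0.2 * (xi - 4.5) ** 2 for i, xi in enumerate(x)]
y = [1.0 + 2.0 * xi + ni for xi, ni in zip(x, noise)]
slope, classical_se, robust_se = ols_slope_with_ses(x, y)
```
The slope estimate is the same either way; only the standard error changes, which is why the choice of transformation does not need to be driven by heteroscedasticity.
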

00:02:31.230 --> 00:02:36.420
The next question is this: once you have
decided that you want to transform your

00:02:36.420 --> 00:02:41.340
dependent variable somehow (you can also
transform your independent variables, but this

00:02:41.340 --> 00:02:46.350
is mostly focused on the dependent variable),
should you transform the dependent variable

00:02:46.350 --> 00:02:51.870
and then apply regression on the transformed
values, or should you apply a generalized linear

00:02:51.870 --> 00:02:56.070
model, where you transform the fitted
value instead of the dependent variable?

00:02:56.070 --> 00:03:03.000
There are simple points for and against both
decisions. A simple point for transforming the

00:03:03.000 --> 00:03:07.050
dependent variable is that it's simple to
do, so there are no computational issues.

00:03:07.050 --> 00:03:11.760
Regression analysis will always give
you results, and you can also use OLS

00:03:11.760 --> 00:03:16.680
diagnostics. Regression analysis diagnostics
are very useful. They are more developed than

00:03:16.680 --> 00:03:20.790
diagnostics for GLM, and you can find
more resources on how to do those.

00:03:20.790 --> 00:03:27.210
Also, regression analysis is well understood. For
example, the nature of multiplicative effects, as

00:03:27.210 --> 00:03:33.210
I explained in previous videos, is something
that many researchers don't fully understand.

00:03:33.210 --> 00:03:40.380
So regression analysis is more commonly
understood by readers and reviewers than GLM.

00:03:40.380 --> 00:03:47.430
There are also simple points against
transforming. Transforming a variable with a few

00:03:47.430 --> 00:03:52.590
discrete values is problematic. If you have a count
variable with values one, two, and three, then trying to

00:03:52.590 --> 00:03:58.020
do some kind of inverse Poisson transformation on
that wouldn't make much sense, because it still

00:03:58.020 --> 00:04:03.120
has three discrete values. If you have
ones and zeros, a binary dependent variable,

00:04:03.120 --> 00:04:07.800
transforming the binary dependent variable will
give you another binary variable. It doesn't

00:04:07.800 --> 00:04:12.930
do anything. And then you have the issue that if
you want to, for example, explain company size, and

00:04:12.930 --> 00:04:18.510
you want to explain that with an exponential
function, some companies have zero revenues. So

00:04:18.510 --> 00:04:23.280
how do you deal with those zeros? You
can't take the log of zero, and then you need these

00:04:23.280 --> 00:04:28.080
awkward workarounds where you add 1 to the
dependent variable before you take the log.
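
NOTE
A quick illustration of the two problems just described, using made-up revenue and binary values.
The helper safe_log is hypothetical; it only wraps the error that math.log raises for zero.
```python
import math
revenues = [0.0, 120.0, 4500.0]  # hypothetical revenues, one of them zero
def safe_log(value):
    # math.log(0.0) raises ValueError, so a zero has no log to regress on.
    try:
        return math.log(value)
    except ValueError:
        return None
logs = [safe_log(v) for v in revenues]
# The awkward workaround: add 1 before taking the log.
shifted_logs = [math.log(v + 1.0) for v in revenues]
# Transforming a binary variable only relabels its two values.
binary = [0, 1, 1, 0]
transformed_values = {math.sqrt(v) for v in binary}
```
The zero-revenue company gets no log at all, and the transformed binary variable still takes exactly two distinct values.
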

00:04:28.080 --> 00:04:35.250
So these are simple points against transforming.
There's a more rigorous way of looking at this

00:04:35.250 --> 00:04:43.770
issue. Let's look at the GLM
model and the transformed model. Typically we're

00:04:43.770 --> 00:04:49.410
interested in explaining the mean of the
data, or the expected value of the data given the

00:04:49.410 --> 00:04:54.210
independent variables, in which case we look at
the nonlinear regression model. If we apply the

00:04:54.210 --> 00:05:02.310
transformed dependent variable model and we treat
its coefficients as if

00:05:02.310 --> 00:05:07.890
they were estimates for the original model of
interest, then they are actually inconsistent.

00:05:07.890 --> 00:05:14.610
So the transformed equation is an inconsistent
estimator for the original equation. So,

00:05:14.610 --> 00:05:19.620
statistically thinking, you should never
transform the dependent variable. You should

00:05:19.620 --> 00:05:25.080
always use the GLM, because the transformed
model is an inconsistent estimator of the GLM.

00:05:25.080 --> 00:05:30.840
That may not be enough to convince all
people, so let's take a look at examples.
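
NOTE
A sketch of the argument in symbols, for the log transformation. The notation is an assumption and does not come from the slides:
y is the dependent variable, x the predictors, beta and gamma the two coefficient vectors, epsilon the error of the log model.
```latex
\begin{aligned}
\text{GLM (model of interest):}\quad & E[y \mid x] = \exp(x\beta) \\
\text{Transformed model:}\quad & \log y = x\gamma + \varepsilon, \qquad E[\varepsilon \mid x] = 0 \\
\text{Implied mean:}\quad & E[y \mid x] = \exp(x\gamma)\, E[\exp(\varepsilon) \mid x] \\
\text{Jensen's inequality:}\quad & \exp\bigl(E[\log y \mid x]\bigr) \le E[y \mid x]
\end{aligned}
```
If E[exp(epsilon) | x] varies with x, for example under heteroscedasticity, the slopes in gamma differ from those in beta.
Even when it is constant, exp(x gamma) understates the mean unless it is rescaled by that constant.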

00:05:30.840 --> 00:05:39.960
So I have this data set here. This is the
data set that I've used before, and we have the

00:05:39.960 --> 00:05:46.620
distribution of income for professions that are
more than half men and the distribution of income for

00:05:46.620 --> 00:05:53.340
professions that are more than half women. So we
have men-dominated and women-dominated professions,

00:05:53.340 --> 00:05:58.590
and we are interested in knowing whether
men-dominated professions make more money than

00:05:58.590 --> 00:06:03.990
women-dominated professions. And this is something
that we would typically want to answer with: men

00:06:03.990 --> 00:06:11.130
make 20% more, or 50% more, instead of saying that
men make four thousand Canadian dollars more,

00:06:11.130 --> 00:06:15.540
because the percentage is something that we
typically think of in these kinds of comparisons.

00:06:15.540 --> 00:06:23.250
So how do we do it? We're going to look at
percentages, and we do a transformed dependent

00:06:23.250 --> 00:06:28.800
variable regression analysis. We get some
estimates here. Then we can calculate predictions

00:06:28.800 --> 00:06:35.040
using these estimates. The predicted lines
are here in the equation. In the plot here we

00:06:35.040 --> 00:06:41.148
can see that the predicted
lines here are less than the actual sample

00:06:41.148 --> 00:06:48.180
means. So the model predicts the sample means a
bit incorrectly, predicting too low, so they are

00:06:48.180 --> 00:06:53.790
predicted erroneously, and the model also
predicts the difference between

00:06:53.790 --> 00:07:00.570
the men-dominated and women-dominated professions
to be smaller than what it actually is.
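
NOTE
A sketch of why the predicted lines fall below the sample means. With only a group dummy as predictor, OLS on log income
fits each group's mean log income, so the back-transformed prediction is the geometric mean.
The income figures are hypothetical, invented for illustration; they are not the data shown in the video.
```python
import math
# Hypothetical incomes, invented for illustration (not the video's data).
men = [4000.0, 6000.0, 8000.0, 26000.0]
women = [3000.0, 4000.0, 5000.0, 8000.0]
def mean(values):
    return sum(values) / len(values)
def log_ols_prediction(values):
    # exp(mean of logs) is the geometric mean, which never exceeds
    # the arithmetic mean.
    return math.exp(mean([math.log(v) for v in values]))
sample_gap = mean(men) - mean(women)
predicted_gap = log_ols_prediction(men) - log_ols_prediction(women)
```
With these made-up numbers both predictions sit below the group means, and the predicted gap is smaller than the sample gap. The first result always holds; the second depends on the data.
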

00:07:00.570 --> 00:07:04.890
So both the actual means and the
difference between the means are

00:07:04.890 --> 00:07:12.030
predicted incorrectly. The difference is not
great, but it's noticeable. So based on these

00:07:12.030 --> 00:07:18.390
considerations, the GLM approach should
always be preferred over transforming

00:07:18.390 --> 00:07:22.740
the dependent variable. Of course, doing the
transformation of the dependent variable,

00:07:22.740 --> 00:07:29.640
using OLS, and doing the diagnostics is a good
starting point, but in the end, doing the GLM

00:07:29.640 --> 00:07:34.590
is more rigorous, and that's what the final
end product of your research should be.

00:07:34.590 --> 00:07:41.340
There is a nice blog post about this from
William Gould, who is the founder of Stata.

00:07:41.340 --> 00:07:47.490
He makes a strong case, with some
nice references, that that's actually how

00:07:47.490 --> 00:07:52.530
you should do it. So don't log transform
the dependent variable. Use the Poisson

00:07:52.530 --> 00:07:59.310
GLM or QML estimator instead, with
robust standard errors. That gives you better

00:07:59.310 --> 00:08:02.520
estimates than the regression on
the transformed dependent variable.

00:08:02.520 --> 00:08:10.440
So what are the practical recommendations? Once
you have decided that you want to use one of these

00:08:10.440 --> 00:08:17.730
transformations, what's the modeling technique
that you should apply? For a linear additive model,

00:08:17.730 --> 00:08:25.980
least squares, always. No reason to use anything
else. That's the overall best. Weighted least

00:08:25.980 --> 00:08:31.290
squares could be slightly more efficient in some
scenarios, but it's not worth the effort to do that.

00:08:31.290 --> 00:08:39.210
If you have an exponential model with multiplicative
relationships, then if you know the distribution

00:08:39.210 --> 00:08:45.150
of the dependent variable given the fitted values,
use maximum likelihood estimation of the

00:08:45.150 --> 00:08:49.920
generalized linear model with the correct distribution.
So if you know that it's Poisson, you know it's

00:08:49.920 --> 00:08:56.370
negative binomial, or you know that it's something
else, then apply the normal GLM. If you don't

00:08:56.370 --> 00:09:01.440
know what the distribution is, or you're uncertain
about the distribution of the dependent variable,

00:09:01.440 --> 00:09:08.730
or you know that it doesn't follow any of the
distributions that your statistical software

00:09:08.730 --> 00:09:15.510
supports, then apply Poisson quasi-maximum
likelihood estimation with robust standard errors.
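
NOTE
A minimal sketch of Poisson quasi-maximum likelihood point estimates, assuming one predictor and a plain Newton-Raphson solver.
The function poisson_fit and the income data are hypothetical. Robust standard errors are left out to keep the sketch short;
in practice your statistical software would compute them.
```python
import math
def poisson_fit(x, y, iters=50):
    # Newton-Raphson for the mean model E[y | x] = exp(b0 + b1 * x).
    # y only needs to be non-negative; it need not be a count.
    b0, b1 = math.log(sum(y) / len(y)), 0.0
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Information matrix (2 by 2), inverted by hand.
        h00 = sum(mu)
        h01 = sum(xi * mi for xi, mi in zip(x, mu))
        h11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
# Hypothetical incomes: x = 1 for men-dominated, 0 for women-dominated.
x = [1, 1, 1, 1, 0, 0, 0, 0]
y = [4000.0, 6000.0, 8000.0, 26000.0, 3000.0, 4000.0, 5000.0, 8000.0]
b0, b1 = poisson_fit(x, y)
fitted_women = math.exp(b0)
fitted_men = math.exp(b0 + b1)
```
With a single group dummy the fitted values reproduce the two sample means, and exp(b1) is the ratio of the means, which is the "men make X% more" quantity.
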

00:09:15.510 --> 00:09:22.110
So this is a similarly
safe choice as using OLS is for the linear

00:09:22.110 --> 00:09:28.110
model. For the S-curve models, it's the same thing.
If you know the distribution, so if you know

00:09:28.110 --> 00:09:33.360
that you are using fractional response data
and you know that the dependent variable is

00:09:33.360 --> 00:09:39.000
beta distributed given the predicted values,
then use beta regression analysis, so maximum

00:09:39.000 --> 00:09:43.710
likelihood GLM with the correct distribution.
Otherwise, if you don't know the distribution

00:09:43.710 --> 00:09:50.670
of the dependent variable, then use Bernoulli
quasi-maximum likelihood with robust standard errors.

00:09:50.670 --> 00:09:57.120
So if you have fractional response data, then
basically I would always recommend that you use just

00:09:57.120 --> 00:10:02.820
the normal logistic regression analysis for that,
because it works. You would think that it doesn't,

00:10:02.820 --> 00:10:07.470
but it actually does, as long as this approach
has been programmed into your computer software.
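
NOTE
A sketch of logistic regression applied to fractional data (Bernoulli quasi-maximum likelihood), assuming one predictor
and a hand-written Newton-Raphson solver. The function fractional_logit_fit and the fractions are hypothetical.
```python
import math
def fractional_logit_fit(x, y, iters=25):
    # Newton-Raphson for E[y | x] = 1 / (1 + exp(-(b0 + b1 * x))),
    # where y may be any fraction in [0, 1], not only 0 or 1.
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * xi))) for xi in x]
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(xi * wi for xi, wi in zip(x, w))
        h11 = sum(xi * xi * wi for xi, wi in zip(x, w))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1
# Hypothetical fractions for two groups (x = 0 and x = 1).
x = [0, 0, 0, 1, 1, 1]
y = [0.1, 0.2, 0.3, 0.6, 0.8, 1.0]
b0, b1 = fractional_logit_fit(x, y)
fitted_group0 = 1.0 / (1.0 + math.exp(-b0))
fitted_group1 = 1.0 / (1.0 + math.exp(-(b0 + b1)))
```
The fitted values recover the group mean fractions (0.2 and 0.8 here), including a group that contains an observation at exactly 1.
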

00:10:07.470 --> 00:10:15.180
Now this has nothing to do with the transformation
of the independent variables. This is about the

00:10:15.180 --> 00:10:20.130
dependent variable. Transforming independent
variables is okay, and you can consider the

00:10:20.130 --> 00:10:25.680
log transformation or sometimes even an exponential
transformation of the independent variables to

00:10:25.680 --> 00:10:30.660
get a model that you think explains your data
well based on your theory, and then you estimate

00:10:30.660 --> 00:10:38.130
it with either OLS or GLM. This is
about what you do with the dependent variable.

00:10:38.130 --> 00:10:44.940
The final question is this: is GLM,
transforming the fitted value, versus

00:10:44.940 --> 00:10:49.950
transforming the dependent variable, a big
thing? So let's do an empirical example.

00:10:49.950 --> 00:10:57.630
We have here two models. We are using the Prestige
data. We have years of education here. We have

00:10:57.630 --> 00:11:04.860
the predictions from these two models, transformed
dependent variable and GLM, for the effects on income.

00:11:04.860 --> 00:11:11.580
When we look at the regression coefficients, we
can see that there's a 7.5 percent difference.

00:11:11.580 --> 00:11:15.300
So this one is 0.119
and this one is 0.128.

00:11:15.300 --> 00:11:22.170
So a 7.5 percent difference,
which is substantial. In many methodological

00:11:22.170 --> 00:11:27.630
papers we think that a five percent bias
is something that you can ignore, but

00:11:27.630 --> 00:11:32.070
this 7.5 percent difference
is something that we should care about.
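
NOTE
A quick check of the percentage, assuming the two coefficients read out in the video are 0.119 and 0.128.
The captions are garbled at that point, so these values are a best reading, not confirmed figures.
```python
coef_a = 0.119  # assumed reading of the first coefficient
coef_b = 0.128  # assumed reading of the second coefficient
# Relative difference of the larger coefficient over the smaller one.
relative_difference = (coef_b - coef_a) / coef_a
percent_difference = round(100 * relative_difference, 1)
```
Under that reading the relative difference comes out at roughly 7.5 to 7.6 percent, matching the figure quoted in the video.
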

00:11:32.070 --> 00:11:36.270
Also, when we look at the predictions
here, we can see that the transformed

00:11:36.270 --> 00:11:44.370
dependent variable model systematically
underpredicts how much professions

00:11:44.370 --> 00:11:50.310
that require high education actually make,
and this blue line here is a lot better fit

00:11:50.310 --> 00:11:58.230
to the data. So empirically it's not
a huge difference, but it's something that I

00:11:58.230 --> 00:12:02.910
think we should be concerned about,
because the fix is rather simple.

00:12:02.910 --> 00:12:12.180
Now the question is, when I get papers
to review where authors use a transformation of

00:12:12.180 --> 00:12:18.270
the dependent variable, do I recommend
that those papers be rejected because they

00:12:18.270 --> 00:12:25.230
don't use the GLM approach or quasi-maximum
likelihood estimation of Poisson instead

00:12:25.230 --> 00:12:31.140
of the transformation of the dependent
variable? No. I would not say that

00:12:31.140 --> 00:12:35.880
this red line is worthless. I'm saying that
the blue line is better, and I would probably

00:12:35.880 --> 00:12:42.990
recommend that the authors take a look at some
articles that I have cited here that explain why

00:12:42.990 --> 00:12:47.370
the blue line is better than the red line, and
then tell them to make an informed decision.