WEBVTT
Kind: captions
Language: en
00:00:00.060 --> 00:00:04.380
This video introduces
exponential models for counts.
00:00:04.380 --> 00:00:08.190
Why the video is titled 'exponential models'
00:00:08.190 --> 00:00:09.600
instead of just count data,
00:00:09.600 --> 00:00:11.820
will become clear pretty soon.
00:00:11.820 --> 00:00:14.160
So what are counts?
00:00:14.160 --> 00:00:18.030
Counts are typically counts of events,
00:00:18.030 --> 00:00:19.980
how many times does something happen?
00:00:19.980 --> 00:00:21.960
Like if you go fishing,
00:00:21.960 --> 00:00:23.730
how many fish did you get,
00:00:23.730 --> 00:00:25.470
how many times did you catch a fish?
00:00:25.470 --> 00:00:28.050
If you are running a company,
00:00:28.050 --> 00:00:30.810
how many patents does the company file per year?
00:00:30.810 --> 00:00:32.820
So they are discrete numbers, whole numbers.
00:00:32.820 --> 00:00:37.710
And they are zero or positive,
that is, non-negative.
00:00:37.710 --> 00:00:41.820
And there is some confusion around,
00:00:41.820 --> 00:00:44.460
how to model count variables in the literature.
00:00:44.460 --> 00:00:49.110
And this article in Organizational
Research Methods is one such example.
00:00:49.110 --> 00:00:53.070
So it's very commonly believed that
00:00:53.070 --> 00:00:55.080
if you have a variable that is a count,
00:00:55.080 --> 00:01:00.000
then you have to use some other model
than normal regression analysis,
00:01:00.000 --> 00:01:04.470
such as the Poisson regression analysis
or negative binomial regression analysis.
00:01:04.470 --> 00:01:10.020
And this article explains that the
application of normal regression analysis
00:01:10.020 --> 00:01:12.180
would be inappropriate for data,
00:01:12.180 --> 00:01:14.280
where the dependent variable is a count,
00:01:14.280 --> 00:01:17.880
and if you use the normal
regression analysis for a count,
00:01:17.880 --> 00:01:21.360
the results can be inefficient,
inconsistent and biased.
00:01:21.360 --> 00:01:26.010
And this statement is simply not true, generally.
00:01:26.010 --> 00:01:30.240
There are cases where normal regression
analysis should not be used for counts.
00:01:30.240 --> 00:01:36.570
But as a general statement that it is always wrong
to use normal regression analysis for counts,
00:01:36.570 --> 00:01:38.100
that is simply incorrect.
00:01:38.100 --> 00:01:41.160
How this statement is justified is
00:01:41.160 --> 00:01:44.070
by giving two references to econometrics books.
00:01:44.070 --> 00:01:46.530
The problem is that these are big books,
00:01:46.530 --> 00:01:48.330
and there are no page numbers.
00:01:48.330 --> 00:01:50.790
So we can't really check,
00:01:50.790 --> 00:01:54.150
whether these sources support the claim
00:01:54.150 --> 00:01:56.970
without going through and reading the full book,
00:01:56.970 --> 00:01:59.160
which you can't expect your reader to do.
00:01:59.160 --> 00:02:03.060
So whenever you see statements
that cite books as evidence,
00:02:03.060 --> 00:02:05.730
you should really ask for the page number.
00:02:05.730 --> 00:02:09.990
So where exactly in that book does it
say that regression analysis will be
00:02:09.990 --> 00:02:12.150
biased, inconsistent, and inefficient,
00:02:12.150 --> 00:02:14.400
if your dependent variable is a count.
00:02:14.400 --> 00:02:23.340
To understand why counts could be a problem,
or not a problem, for regression analysis,
00:02:23.340 --> 00:02:26.160
let's review the regression analysis assumptions.
00:02:26.160 --> 00:02:27.570
So this is from Wooldridge's book.
00:02:27.570 --> 00:02:34.950
And regression analysis assumes four different
things for unbiasedness and consistency.
00:02:34.950 --> 00:02:37.440
So we have a linear model,
00:02:37.440 --> 00:02:40.830
we have random sampling, no perfect
collinearity and no endogeneity.
00:02:40.830 --> 00:02:42.060
If these are true,
00:02:42.060 --> 00:02:45.180
regression analysis is consistent and unbiased.
00:02:45.180 --> 00:02:49.110
There is nothing about not
being a count variable here.
00:02:49.110 --> 00:02:53.220
There is in fact nothing about the
distribution of the dependent variable at all.
00:02:53.220 --> 00:02:54.480
It's only about,
00:02:54.480 --> 00:02:57.000
what's the expected value
of the dependent variable,
00:02:57.000 --> 00:03:00.480
or the mean given the observed
independent variables.
00:03:00.480 --> 00:03:05.580
We start getting interested in the
distribution of the dependent variable,
00:03:05.580 --> 00:03:09.660
when we have the efficiency assumption.
00:03:09.660 --> 00:03:11.842
So when you have homoscedastic errors,
00:03:11.842 --> 00:03:16.410
so that the variance of the error term doesn't
change with the explanatory variables,
00:03:16.410 --> 00:03:19.050
then regression analysis is also efficient.
00:03:19.050 --> 00:03:20.820
But again, there's no
00:03:20.820 --> 00:03:22.560
'must not be a count' assumption.
00:03:22.560 --> 00:03:27.360
So using a regression analysis
for counts is completely fine.
00:03:27.360 --> 00:03:28.860
So there is no problem with that.
00:03:28.860 --> 00:03:30.120
To demonstrate,
00:03:30.120 --> 00:03:32.310
let's have an empirical demonstration.
00:03:32.310 --> 00:03:34.260
So we have dice here.
00:03:34.260 --> 00:03:37.950
We have 30 sets of die throws.
00:03:37.950 --> 00:03:40.740
And we have the number of dice that were thrown
00:03:40.740 --> 00:03:42.420
and the number of sixes that we got.
00:03:42.420 --> 00:03:44.640
So the number of dice that were thrown,
00:03:44.640 --> 00:03:46.080
and the number of sixes that we got,
00:03:46.080 --> 00:03:50.760
are the independent variable and the dependent
variable in a simple regression analysis.
00:03:50.760 --> 00:03:52.410
We draw a regression line.
00:03:52.410 --> 00:03:53.640
Number of die throws,
00:03:53.640 --> 00:03:54.900
is the explanatory variable,
00:03:54.900 --> 00:03:56.640
and the number of sixes here.
00:03:56.640 --> 00:03:58.740
And it looks pretty good to me.
00:03:58.740 --> 00:04:01.350
So the regression line seems
to go through the data.
00:04:01.350 --> 00:04:05.550
And in fact, there is heteroscedasticity.
00:04:05.550 --> 00:04:08.220
So the variance here is
greater than variance here.
00:04:08.220 --> 00:04:10.740
But other than that, regression analysis is fine.
00:04:10.740 --> 00:04:12.720
Just use robust standard errors,
00:04:12.720 --> 00:04:16.500
and this is going to be the best way to model it.
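As a sketch of this demonstration (my own simulation, not the video's actual data; the set sizes and seed are arbitrary), we can draw die throws and fit a least squares line. The number of sixes in n fair-die throws is Binomial(n, 1/6), so the slope should land near 1/6 ≈ 0.167:

```python
# A minimal sketch of the die-throw demonstration.
import random

random.seed(0)
data = []
for _ in range(500):
    n = random.randint(1, 60)                         # dice thrown in this set
    sixes = sum(random.random() < 1 / 6 for _ in range(n))
    data.append((n, sixes))

# Closed-form OLS slope: cov(x, y) / var(x)
xs, ys = zip(*data)
mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
cov = sum((x - mx) * (y - my) for x, y in data)
var = sum((x - mx) ** 2 for x in xs)
slope = cov / var
print(round(slope, 3))  # close to 1/6
```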
00:04:16.500 --> 00:04:21.810
So what if we use Poisson regression analysis,
00:04:21.810 --> 00:04:25.020
that is commonly recommended for counts.
00:04:25.020 --> 00:04:27.900
So we use the Poisson model here.
00:04:27.900 --> 00:04:34.080
The coefficient is 0.02, and for normal
regression analysis, the coefficient is 0.17.
00:04:34.080 --> 00:04:36.690
So 0.17 is about one out of six,
00:04:36.690 --> 00:04:40.620
which we know that we get for
each additional die throw.
00:04:40.620 --> 00:04:44.520
The expected number of sixes
increases by one out of six,
00:04:44.520 --> 00:04:48.030
because that's the probability of
getting a six from a fair dice.
00:04:48.030 --> 00:04:53.610
The 0.02 should be interpreted
as a percentage increase.
00:04:53.610 --> 00:04:56.340
So relative to the current level of sixes,
00:04:57.360 --> 00:05:00.420
the expected level of sixes
increases by two percent.
00:05:00.420 --> 00:05:04.800
It doesn't really make any sense
to think about die throws that way.
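To make that interpretation concrete, here is a small worked check (the 0.02 is the Poisson coefficient quoted above; the conversion itself is just the exponential function):

```python
# In an exponential model, E[y] = exp(b0 + b1 * x), so a one-unit
# increase in x multiplies the expected count by exp(b1).
import math

b1 = 0.02
multiplier = math.exp(b1)                # multiplicative effect per die throw
percent_change = (multiplier - 1) * 100  # relative change in expected sixes
print(round(percent_change, 2))  # about 2 (percent)
```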
00:05:04.800 --> 00:05:08.010
And if we plot the Poisson regression line here,
00:05:08.010 --> 00:05:11.100
it's actually a curve because
Poisson is an exponential model.
00:05:11.100 --> 00:05:15.720
We can see that this exponential model
doesn't really explain the data at all,
00:05:15.720 --> 00:05:18.210
because when we have one throw for example,
00:05:18.210 --> 00:05:23.190
it predicts that we get
four sixes, which is impossible.
00:05:23.190 --> 00:05:27.540
And it also predicts that the number of
sixes here grows exponentially.
00:05:27.540 --> 00:05:31.920
It can't, at some point, you hit the
limit of how many times you throw.
00:05:31.920 --> 00:05:35.760
So just the fact that our
dependent variable is a count,
00:05:35.760 --> 00:05:40.020
doesn't mean that we can't
use regression analysis,
00:05:40.020 --> 00:05:43.320
or that we must use
Poisson regression analysis or
00:05:43.320 --> 00:05:46.170
some variant of that technique.
00:05:46.170 --> 00:05:52.050
The important thing about the
Poisson regression analysis is that
00:05:52.050 --> 00:05:53.400
it's an exponential model.
00:05:53.400 --> 00:06:00.000
So we're modeling the expected value
of Y as an exponential function.
00:06:00.000 --> 00:06:02.370
And this is the important part.
00:06:02.370 --> 00:06:03.840
When you have an exponential function,
00:06:03.840 --> 00:06:07.260
then the least squares is no
longer an ideal technique.
00:06:07.260 --> 00:06:14.100
If you think that your count depends linearly
and additively on your independent variables,
00:06:14.100 --> 00:06:18.540
then using normal regression
analysis is not problematic at all,
00:06:18.540 --> 00:06:21.570
in fact, it's an ideal technique
for that kind of analysis.
00:06:21.570 --> 00:06:24.900
So in Poisson regression analysis,
00:06:24.900 --> 00:06:26.640
we are using an exponential function.
00:06:26.640 --> 00:06:27.840
And that's the reason,
00:06:27.840 --> 00:06:31.320
why this video is not called
'regression analysis for counts',
00:06:31.320 --> 00:06:34.860
but instead 'exponential models for counts'.
00:06:34.860 --> 00:06:39.390
So what is the Poisson distribution?
00:06:39.390 --> 00:06:45.060
It's a distribution of the count of independent
events that occur at a constant rate.
00:06:45.060 --> 00:06:56.070
So if you have a rate of, let's say,
0.001 deaths per capita in a country,
00:06:56.070 --> 00:06:57.630
how many people die in a given year?
00:06:57.630 --> 00:06:59.130
Something like that.
00:06:59.130 --> 00:07:03.210
And what does this Poisson distribution look like?
00:07:03.930 --> 00:07:05.370
It's a discrete distribution,
00:07:05.370 --> 00:07:07.500
so we have discrete numbers.
00:07:07.500 --> 00:07:10.770
And when we have small numbers,
00:07:10.770 --> 00:07:12.180
the expected value here is 1,
00:07:12.180 --> 00:07:14.430
then we typically get 1, 2 or 3,
00:07:14.430 --> 00:07:17.490
and getting 20 is almost impossible.
00:07:17.490 --> 00:07:20.580
If we have a large value here,
00:07:20.580 --> 00:07:21.900
expected value is 9,
00:07:21.900 --> 00:07:28.020
then the range of values that we can
get ranges from about 3 to about 20,
00:07:28.020 --> 00:07:29.640
and that's still plausible to get.
00:07:29.640 --> 00:07:31.950
So what we can see here is that,
00:07:31.950 --> 00:07:35.310
the dispersion increases with the expected value.
00:07:35.310 --> 00:07:38.430
And that's a feature of
Poisson regression analysis.
00:07:38.430 --> 00:07:44.070
So when we have an expected value
00:07:44.070 --> 00:07:47.070
of 1, the variance is 1;
00:07:47.070 --> 00:07:50.040
when the expected value is 9, the variance is 9.
00:07:50.040 --> 00:07:55.200
So the variance and the mean of
Poisson distribution are the same.
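A quick simulation sketch of this equality (my own code with an arbitrary seed, not the video's; Knuth's multiplication method is one standard way to draw Poisson variates):

```python
# Check that a Poisson distribution has variance equal to its mean,
# here with mean 9.
import math
import random

random.seed(1)

def poisson_sample(lam):
    # Knuth's algorithm: multiply uniforms until the product drops
    # below exp(-lam); the number of extra steps is the draw.
    limit, k, prod = math.exp(-lam), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

draws = [poisson_sample(9) for _ in range(100_000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(round(mean, 2), round(var, 2))  # both close to 9
```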
00:07:55.200 --> 00:07:59.940
Now coming back to our example of die throws.
00:07:59.940 --> 00:08:07.470
This distribution is an ideal
distribution for modeling die throws.
00:08:07.470 --> 00:08:11.130
But we don't need to use Poisson regression analysis,
00:08:11.130 --> 00:08:14.490
because that also includes
the exponential function,
00:08:14.490 --> 00:08:15.720
which we don't need.
00:08:15.720 --> 00:08:21.600
And using the least squares
estimation technique is good enough,
00:08:21.600 --> 00:08:24.600
regardless of the distribution
of the dependent variable.
00:08:24.600 --> 00:08:29.700
So using a linear function with Poisson
distribution would be unnecessary.
00:08:29.700 --> 00:08:35.160
Sometimes if we are interested
in the actual predictions
00:08:35.160 --> 00:08:36.720
from the distribution, how they're distributed,
00:08:36.720 --> 00:08:37.860
then we could use that.
00:08:37.860 --> 00:08:43.140
But normally the Poisson distribution is
only required when we do nonlinear models.
00:08:43.140 --> 00:08:48.360
When we go to larger
expected values,
00:08:48.360 --> 00:08:53.700
Let's say we go from 2, 4, 8 and so on to 512,
00:08:53.700 --> 00:08:55.200
so these are powers of 2.
00:08:55.200 --> 00:08:58.230
We can see that the distribution
approaches the normal distribution.
00:08:58.230 --> 00:09:03.270
So with large numbers, large expected values,
00:09:03.270 --> 00:09:06.150
the Poisson distribution
approximates a normal distribution.
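A sketch of this approximation, comparing the exact Poisson probabilities to a normal density with the same mean and variance (the evaluation points are arbitrary choices of mine):

```python
# With a large expected value (512), the Poisson mass function and the
# matching normal density nearly coincide point by point.
import math

def poisson_log_pmf(k, lam):
    # log of exp(-lam) * lam^k / k!, computed in logs to avoid overflow
    return -lam + k * math.log(lam) - math.lgamma(k + 1)

def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

lam = 512
for k in (480, 512, 540):
    exact = math.exp(poisson_log_pmf(k, lam))
    approx = normal_pdf(k, lam, lam)
    print(k, round(exact / approx, 3))  # ratios close to 1
```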
00:09:06.150 --> 00:09:09.210
Whichever you use, normal distribution or Poisson,
00:09:09.210 --> 00:09:13.200
in many cases it doesn't make a difference,
00:09:13.200 --> 00:09:20.880
as long as the standard deviation of the
normal distribution is estimated as a parameter as well.
00:09:20.880 --> 00:09:23.340
So they're roughly the same.
00:09:23.340 --> 00:09:26.460
So the distribution makes the most difference,
00:09:26.460 --> 00:09:28.650
if your expected value is small.
00:09:28.650 --> 00:09:31.320
So this is distinctly non-normal as is that,
00:09:31.320 --> 00:09:33.120
but this is not as much.
00:09:33.120 --> 00:09:38.880
So you apply the Poisson regression model,
00:09:38.880 --> 00:09:45.870
when you think that the exponential
model is the right model for your data.
00:09:45.870 --> 00:09:50.310
So you're expecting that the effects
are relative to the current level,
00:09:50.310 --> 00:09:52.440
and they're multiplicative together.
00:09:52.440 --> 00:09:59.400
And you interpret these results
the same way as you would interpret results
00:09:59.400 --> 00:10:01.980
when your dependent variable is log-transformed.
00:10:01.980 --> 00:10:08.820
The number that you explain is
the expected number of events.
00:10:08.820 --> 00:10:15.930
One thing that's very common in studies
that apply these techniques is that,
00:10:15.930 --> 00:10:17.610
if we study for example,
00:10:17.610 --> 00:10:21.450
how many people die in each country
and we look at European countries.
00:10:21.450 --> 00:10:24.630
The European countries are quite
different in size from one another.
00:10:24.630 --> 00:10:29.400
Finland has about five or six million people
and Germany has over 80 million people.
00:10:29.400 --> 00:10:32.490
So we have to take that into account somehow,
00:10:32.490 --> 00:10:37.020
because we can't really compare the number of
deaths in Finland and number of deaths in Germany,
00:10:37.020 --> 00:10:38.940
unless we somehow standardize the data.
00:10:39.854 --> 00:10:42.000
Quite often,
00:10:42.000 --> 00:10:45.840
we want to understand the rate
at which something happens
00:10:45.840 --> 00:10:47.450
instead of the count.
00:10:47.859 --> 00:10:51.540
And to do that we use exposures and offsets.
00:10:51.540 --> 00:10:54.750
For example, the number of deaths
due to cancer per population,
00:10:54.750 --> 00:10:59.250
or the number of citations
per article in a journal.
00:11:00.150 --> 00:11:03.840
The population here and the article
here are what we call exposure.
00:11:03.840 --> 00:11:09.840
So this is like the total number
of units at risk,
00:11:09.840 --> 00:11:12.750
the units that could have the event occur to them.
00:11:12.750 --> 00:11:17.580
One thing that we could try,
00:11:17.580 --> 00:11:23.310
if we don't think it through, is just to
divide the number of deaths by the population.
00:11:23.310 --> 00:11:28.890
But that's highly problematic for
reasons explained in this article.
00:11:28.890 --> 00:11:31.830
So using the rate itself is a bad idea.
00:11:31.830 --> 00:11:37.650
Also, the Poisson regression analysis and
the variants of that technique are very useful,
00:11:37.650 --> 00:11:39.660
because there is a nice trick that we can apply.
00:11:39.660 --> 00:11:45.750
So when we want to model
the rate instead of modeling
00:11:45.750 --> 00:11:49.050
the actual count of deaths or count of citations,
00:11:49.050 --> 00:11:51.960
we want to estimate this kind of model.
00:11:51.960 --> 00:11:55.440
So we look at the expectation here,
00:11:55.440 --> 00:11:57.720
multiplied by the exposure.
00:11:57.720 --> 00:11:59.700
So we are interested in that kind of model.
00:11:59.700 --> 00:12:05.130
And so this gives us the rate of events,
00:12:05.130 --> 00:12:08.970
and we multiply it with the size of our unit,
00:12:08.970 --> 00:12:11.730
and that will give the actual count of events.
00:12:11.730 --> 00:12:14.370
We can apply a little bit of math
00:12:14.370 --> 00:12:19.080
and move this exposure inside
the exponential function
00:12:19.080 --> 00:12:22.830
by taking a logarithm and then
adding it to the linear predictor.
00:12:24.930 --> 00:12:29.760
Taking a logarithm and including the
variable inside the regression model
00:12:29.760 --> 00:12:34.710
without a regression coefficient, or with the
regression coefficient constrained to be one,
00:12:34.710 --> 00:12:36.500
is called an offset.
00:12:37.067 --> 00:12:41.940
So we are basically adding a
constant number to the fitted value,
00:12:41.940 --> 00:12:45.960
that's calculated based on our observation.
00:12:47.504 --> 00:12:52.200
So using an offset is something that your
statistical software will do for you.
00:12:52.200 --> 00:12:56.190
So you specify one variable as an
offset and how it works is that
00:12:56.190 --> 00:12:59.100
the statistical software takes
a logarithm of that value,
00:12:59.100 --> 00:13:01.710
adds it to the regression function,
00:13:01.710 --> 00:13:06.870
but instead of estimating a regression
coefficient, it constrains the effect to be one.
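The algebra behind the offset can be sketched in a few lines (the coefficients, covariate, and exposure below are made-up numbers, not from the video):

```python
# Adding log(exposure) to the linear predictor with coefficient 1 is
# algebraically identical to multiplying the rate by the exposure.
import math

b0, b1 = -6.5, 0.3             # hypothetical rate-model coefficients
x, exposure = 2.0, 5_500_000   # a covariate and, say, a population size

rate = math.exp(b0 + b1 * x)                              # events per unit of exposure
expected_count = exposure * rate                          # rate times population
with_offset = math.exp(b0 + b1 * x + math.log(exposure))  # offset form

print(math.isclose(expected_count, with_offset))  # True
```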
00:13:06.870 --> 00:13:13.590
And then that allows you to
interpret these effects as rates,
00:13:13.590 --> 00:13:15.360
instead of as total counts.
00:13:15.360 --> 00:13:16.620
And that's very useful.
00:13:16.620 --> 00:13:19.890
I've used that myself in one
article that I'm working on.
00:13:21.160 --> 00:13:25.780
Then we have another variant of
the Poisson regression model.
00:13:25.780 --> 00:13:29.680
So the Poisson regression model,
Poisson distribution, assumes
00:13:29.680 --> 00:13:34.420
that the variance of the distribution
of the dependent variable is the same
00:13:34.420 --> 00:13:36.580
as the expected value for a given observation.
00:13:36.580 --> 00:13:43.420
So Poisson makes the variance
assumption that the variance equals the mean.
00:13:43.420 --> 00:13:47.350
We can relax that assumption by saying that
00:13:47.350 --> 00:13:50.350
the variance equals alpha times the mean.
00:13:50.350 --> 00:13:55.270
And that will give us a negative
binomial regression analysis.
00:13:55.270 --> 00:13:57.910
So if alpha is greater than 1,
00:13:57.910 --> 00:14:01.480
then we're saying that our
data are overdispersed.
00:14:01.480 --> 00:14:05.770
And that's when negative binomial
regression analysis could be used.
00:14:05.770 --> 00:14:08.440
If alpha is less than 1,
00:14:08.440 --> 00:14:12.340
so the variance of the dependent
variable is less than the mean,
00:14:12.340 --> 00:14:15.460
then the data are underdispersed.
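One standard construction of such an overdispersed count is a Poisson variable whose mean is itself gamma-distributed, which yields the negative binomial. A sketch with made-up values, mean 3 and alpha 3 (so variance 9):

```python
# Gamma-Poisson mixture: draw the Poisson mean from a gamma distribution
# with mean mu and scale (alpha - 1), so the counts have mean mu and
# variance alpha * mu.
import math
import random

random.seed(2)

def poisson_sample(lam):
    # Knuth's multiplication method for Poisson draws
    limit, k, prod = math.exp(-lam), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

mu, alpha = 3.0, 3.0
shape, scale = mu / (alpha - 1), alpha - 1   # gamma mixing gives Var = alpha * mu
draws = [poisson_sample(random.gammavariate(shape, scale))
         for _ in range(100_000)]

mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
print(round(mean, 2), round(var, 2))  # mean near 3, variance near 9
```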
00:14:15.460 --> 00:14:17.440
So here is an example.
00:14:17.440 --> 00:14:20.470
So this is the Poisson distribution.
00:14:20.470 --> 00:14:27.168
The expectation is 1, 2 and 3,
00:14:27.168 --> 00:14:30.670
so these are powers of 2 I think or something like that.
00:14:30.670 --> 00:14:34.870
And then alpha is 2, 2, 2 and 3.
00:14:34.870 --> 00:14:38.740
So we can see that the expectation stays the same,
00:14:38.740 --> 00:14:41.200
but the variance increases.
00:14:41.200 --> 00:14:45.490
So when we say here that the
overdispersion is 3,
00:14:45.490 --> 00:14:48.610
So the variance is 3 times the mean.
00:14:48.610 --> 00:14:52.630
So the mean is about 3 or something,
00:14:52.630 --> 00:14:55.330
and the variance is a lot greater.
00:14:57.094 --> 00:15:00.280
Negative binomial regression analysis
is commonly used for these scenarios.
00:15:00.280 --> 00:15:05.200
But the choice between negative
binomial and Poisson analysis
00:15:05.200 --> 00:15:09.400
is not as straightforward as
looking at the amount of dispersion.
00:15:09.400 --> 00:15:11.167
So which of these techniques should you use?
00:15:11.167 --> 00:15:17.290
The common way of choosing
between these techniques is to
00:15:17.290 --> 00:15:21.160
fit both and then check, which
one fits the data better,
00:15:21.160 --> 00:15:22.630
using a likelihood ratio test.
00:15:22.630 --> 00:15:25.810
But there's more to that
decision than just comparing,
00:15:25.810 --> 00:15:27.700
which one of these fits the data better.
00:15:27.700 --> 00:15:32.620
So, whether you use Poisson or negative binomial
00:15:32.620 --> 00:15:33.760
depends on a couple of things,
00:15:33.760 --> 00:15:36.550
and you have to understand the
consequences of that decision.
00:15:36.550 --> 00:15:41.950
So typically when you choose an
analysis technique over another,
00:15:41.950 --> 00:15:44.560
you have a specific reason to do so.
00:15:44.560 --> 00:15:50.200
So if we compare Poisson regression analysis with negative binomial regression analysis
00:15:50.200 --> 00:15:54.760
when we know that the distribution
of the dependent variable is Poisson,
00:15:54.760 --> 00:15:59.350
the reason to use Poisson
regression analysis is that
00:15:59.350 --> 00:16:01.870
it is more efficient than negative binomial,
00:16:01.870 --> 00:16:04.870
which is consistent but
inefficient in this scenario.
00:16:06.004 --> 00:16:10.300
When there is overdispersion,
it goes the other way.
00:16:10.300 --> 00:16:14.050
So Poisson is consistent but it is inefficient,
00:16:14.050 --> 00:16:17.650
and negative binomial is consistent and efficient.
00:16:17.650 --> 00:16:22.960
Then the standard errors can be
inconsistent for Poisson, depending on
00:16:22.960 --> 00:16:26.380
which of the available standard error formulas you apply,
00:16:26.380 --> 00:16:27.580
because there are multiple.
00:16:27.580 --> 00:16:32.110
And you have to consult your statistical
software's user manual to know,
00:16:32.110 --> 00:16:33.010
which one is applied.
00:16:33.010 --> 00:16:36.160
Most likely, at least in Stata,
00:16:36.160 --> 00:16:41.710
you're using the equation that is
consistent even under overdispersion.
00:16:42.781 --> 00:16:44.881
Then we have underdispersion.
00:16:44.890 --> 00:16:49.000
And Poisson regression is consistent but inefficient,
00:16:49.000 --> 00:16:51.552
and standard errors may be inconsistent.
00:16:51.552 --> 00:16:53.920
The negative binomial is inconsistent,
00:16:53.920 --> 00:16:56.680
so the estimates will be
incorrect in large samples,
00:16:56.680 --> 00:16:57.775
and that's really bad.
00:16:58.720 --> 00:17:01.600
Okay, so this covers the three scenarios.
00:17:01.600 --> 00:17:07.270
when the dependent variable is
distributed like a Poisson random variable,
00:17:07.270 --> 00:17:10.360
or is overdispersed or underdispersed.
00:17:10.360 --> 00:17:16.210
It's also possible that you have a count that
doesn't look like Poisson distribution at all.
00:17:16.210 --> 00:17:21.130
And in that case, Poisson
regression analysis is consistent,
00:17:21.130 --> 00:17:23.080
standard errors are inconsistent,
00:17:23.080 --> 00:17:26.080
and negative binomial regression is inconsistent.
00:17:26.080 --> 00:17:27.730
So what do we make of it?
00:17:28.750 --> 00:17:33.880
In some scenarios, the negative
binomial is more efficient than Poisson.
00:17:33.880 --> 00:17:37.300
In others, it's less efficient than Poisson.
00:17:37.300 --> 00:17:40.690
But generally, we want our
estimates to be consistent.
00:17:40.690 --> 00:17:46.030
So we may accept a bit of inefficiency,
00:17:46.030 --> 00:17:49.330
but trading that for an efficient estimator
00:17:49.330 --> 00:17:54.460
that could be inconsistent
is not worth it.
00:17:54.460 --> 00:17:56.200
You want to have something that is robust.
00:17:56.200 --> 00:18:02.050
And if your sample size is large, then efficiency
differences don't matter much.
00:18:02.650 --> 00:18:06.070
So using Poisson regression
analysis is a safe choice,
00:18:06.070 --> 00:18:08.470
if you don't know what you're doing.
00:18:08.470 --> 00:18:11.440
If you have a specific reason to believe that
00:18:11.440 --> 00:18:14.620
your dependent variable is
distributed as negative binomial,
00:18:14.620 --> 00:18:19.420
conditionally on the fitted
values, then you can use negative binomial.
00:18:19.420 --> 00:18:21.670
But using Poisson is a safer option.
00:18:21.670 --> 00:18:25.480
This is not something that is current practice,
00:18:25.480 --> 00:18:29.037
but that's what the methodological
literature suggests.
00:18:29.037 --> 00:18:32.140
We have also some extensions to these models.
00:18:32.140 --> 00:18:34.990
Zero-inflated models are one.
00:18:34.990 --> 00:18:37.600
So the idea of zero-inflated models is that
00:18:37.600 --> 00:18:41.050
sometimes you have these structural zeros,
00:18:41.050 --> 00:18:44.110
as we call them, in the sample or in the population.
00:18:44.110 --> 00:18:50.740
And Stata's user manual gives this
example of a person going fishing,
00:18:50.740 --> 00:18:52.810
or people going fishing in a national park.
00:18:52.810 --> 00:18:58.420
And the number of fish that they
catch is not distributed as Poisson,
00:18:58.420 --> 00:19:01.150
because some people choose not to fish.
00:19:01.150 --> 00:19:05.620
So people get zeros if they choose not to fish
00:19:05.620 --> 00:19:09.550
and they get zeros if they choose
to fish but they don't get any.
00:19:09.550 --> 00:19:11.920
So the number of fish that you get
00:19:11.920 --> 00:19:15.760
is probably a series of independent events,
00:19:15.760 --> 00:19:18.340
probably distributed very close to Poisson,
00:19:18.340 --> 00:19:22.240
depending on the weather and season,
00:19:22.240 --> 00:19:24.040
and maybe your fishing gear and skills,
00:19:24.040 --> 00:19:30.850
but given the time and given the person,
this is most likely very close to Poisson.
00:19:30.850 --> 00:19:35.650
Except for those people who decide
not to fish, who will get zeros.
00:19:36.850 --> 00:19:38.650
This is called a zero inflation scenario.
00:19:38.650 --> 00:19:41.650
And how we handle the zero inflation is that
00:19:41.650 --> 00:19:43.330
we estimate two models.
00:19:43.330 --> 00:19:47.500
So we estimate some kind of S curve model,
00:19:47.500 --> 00:19:50.440
typically logistic regression
analysis for structural zeros.
00:19:50.440 --> 00:19:53.320
So this is the idea of modeling,
00:19:53.320 --> 00:19:55.990
whether a person decides to fish or not.
00:19:55.990 --> 00:19:58.870
And then we have exponential count models,
00:19:58.870 --> 00:20:00.190
such as the Poisson model,
00:20:00.190 --> 00:20:02.410
for the number of fish.
00:20:02.410 --> 00:20:04.480
We could have a linear regression model as well,
00:20:04.480 --> 00:20:07.960
if we think that the linear
model is better for the data,
00:20:07.960 --> 00:20:09.760
than the exponential model.
00:20:09.760 --> 00:20:12.160
So we estimate two models at the same time
00:20:12.160 --> 00:20:14.620
and these two models give us
the likelihood that we maximize.
00:20:14.620 --> 00:20:20.500
It's important that we report both
models and interpret both models,
00:20:20.500 --> 00:20:21.580
when we report results.
00:20:21.580 --> 00:20:25.420
Because it could be interesting,
what defines the structural zeros,
00:20:25.420 --> 00:20:31.540
and if that's very different from the actual
zeros that occur from the actual process,
00:20:31.540 --> 00:20:34.450
or the non-zero values.
00:20:34.450 --> 00:20:37.090
Then we have another,
00:20:37.090 --> 00:20:40.540
a bit less commonly used
but still sometimes used,
00:20:40.540 --> 00:20:43.180
variant of these models called the hurdle model,
00:20:43.180 --> 00:20:48.880
which is similar to the zero-inflation model.
00:20:48.880 --> 00:20:55.660
But in this case, instead of looking
at the people who don't fish at all,
00:20:56.410 --> 00:21:01.990
we look at the difference between people
who get zero and people who get one or more.
00:21:01.990 --> 00:21:06.010
The example here, the typical example,
00:21:06.010 --> 00:21:07.720
is going to see a doctor.
00:21:07.720 --> 00:21:12.090
So how many times do you go and see a doctor?
00:21:13.270 --> 00:21:16.480
The first time you go to a doctor
depends on different things,
00:21:16.480 --> 00:21:21.370
than whether you go there the
second, third and fourth time.
00:21:22.810 --> 00:21:25.480
Whether you go to see a doctor
the second time probably
00:21:25.480 --> 00:21:28.510
depends a lot on what the doctor tells you.
00:21:28.510 --> 00:21:32.650
And whether you decide to go and
see a doctor in the first place
00:21:32.650 --> 00:21:35.710
can't depend on what the doctor tells you,
00:21:35.710 --> 00:21:37.210
because you haven't seen the doctor.
00:21:37.210 --> 00:21:41.200
We model these kinds of processes
using the hurdle model.
00:21:41.200 --> 00:21:42.850
The idea is that we have two models.
00:21:42.850 --> 00:21:46.420
We have again S curve model for zero and non zero,
00:21:46.420 --> 00:21:52.180
and then we have a truncated version of
exponential count model for the actual count.
00:21:52.180 --> 00:21:53.950
So we model first,
00:21:53.950 --> 00:21:55.540
does the person go to a doctor?
00:21:55.540 --> 00:21:57.880
And then we model,
00:21:57.880 --> 00:22:01.330
given that the person went
to a doctor at least once,
00:22:01.330 --> 00:22:03.760
how many times does the person go to the doctor?
00:22:03.760 --> 00:22:07.450
Again you will get two sets
of results for two models,
00:22:07.450 --> 00:22:09.610
and you usually interpret and report both.
00:22:10.335 --> 00:22:12.645
Let's take a look at an example.
00:22:12.645 --> 00:22:16.450
So this is the same
example, from Blevins' paper.
00:22:16.450 --> 00:22:21.310
They don't interpret the zero-inflation model,
00:22:21.310 --> 00:22:24.947
but they present Poisson regression,
negative binomial regression,
00:22:24.947 --> 00:22:28.660
zero-inflated Poisson, and
zero-inflated negative binomial.
00:22:28.660 --> 00:22:33.310
We're gonna be looking at the
likelihoods and the degrees of freedom.
00:22:34.360 --> 00:22:36.250
This is not actually degrees of freedom,
00:22:36.250 --> 00:22:39.400
but it's the number of parameters instead,
00:22:39.400 --> 00:22:42.220
which is incorrectly reported
as degrees of freedom.
00:22:42.220 --> 00:22:46.930
So the degrees of freedom difference between
00:22:46.930 --> 00:22:51.130
negative binomial model and
the basic Poisson model is one.
00:22:51.130 --> 00:22:55.000
That one-parameter difference aside,
these estimate the same model.
00:22:55.000 --> 00:22:56.980
So the regression coefficients are the same,
00:22:56.980 --> 00:23:00.130
but the negative binomial regression model here
00:23:00.130 --> 00:23:03.430
estimates the amount of overdispersion
relative to the Poisson distribution
00:23:03.430 --> 00:23:05.830
that we fit to the data.
00:23:06.303 --> 00:23:11.508
When we go from the basic Poisson model
to the zero-inflated Poisson model,
00:23:11.508 --> 00:23:15.400
we can see that the number of parameters
is twice that of the Poisson model.
00:23:15.400 --> 00:23:19.000
The reason for that is that
we have actually two models.
00:23:19.000 --> 00:23:22.990
So we have one model explaining
the structural zeros,
00:23:23.228 --> 00:23:24.760
the S curve model.
00:23:24.760 --> 00:23:29.322
And then we have the normal
Poisson regression model.
00:23:29.322 --> 00:23:34.330
The negative binomial results and Poisson
results are typically very close to one another,
00:23:34.330 --> 00:23:40.390
because Poisson is consistent under
the negative binomial assumptions.
00:23:40.390 --> 00:23:44.290
So if the sample size is large
then they should be very similar.
00:23:45.172 --> 00:23:52.330
The zero-inflated model results are typically
quite different from the negative binomial and Poisson results.
00:23:52.330 --> 00:23:56.320
And here we can see again the
one degree of freedom difference.
00:23:56.320 --> 00:23:59.770
How do we choose between
negative binomial and Poisson?
00:24:00.490 --> 00:24:04.510
The convention is that you
do a likelihood ratio test.
00:24:04.510 --> 00:24:08.440
So you compare the
log-likelihood of Poisson against
00:24:08.440 --> 00:24:09.940
the log-likelihood of the negative binomial.
00:24:09.940 --> 00:24:16.000
We can see that there's a difference of 400
with a one degree of freedom difference,
00:24:16.000 --> 00:24:18.340
that is highly statistically significant.
00:24:18.340 --> 00:24:22.930
So the negative binomial here is
a much better fit for the data
00:24:22.930 --> 00:24:24.970
than the basic Poisson model.
00:24:24.970 --> 00:24:33.490
The reason why negative binomial almost always
fits better than the basic Poisson is that
00:24:34.420 --> 00:24:36.400
the Poisson model assumes that
00:24:36.400 --> 00:24:43.390
all the independent variables in the
model explain the mean perfectly.
00:24:43.390 --> 00:24:47.440
So the only variation around the mean is
00:24:47.440 --> 00:24:51.790
the variation that belongs to the error term,
00:24:51.790 --> 00:24:54.430
or whatever we like to call
the Poisson distribution.
00:24:54.430 --> 00:24:57.220
In practice our models are imperfect.
00:24:57.220 --> 00:25:01.990
So there are always some variables that
we could have observed but did not,
00:25:01.990 --> 00:25:04.450
that would explain the dependent variable.
00:25:04.450 --> 00:25:08.200
And if those explained the dependent
variable to a substantial degree,
00:25:08.200 --> 00:25:12.310
then that additional variation,
00:25:12.310 --> 00:25:14.380
that could have been explained but was not,
00:25:14.380 --> 00:25:15.970
goes to the error term.
00:25:15.970 --> 00:25:18.400
So it's the same thing as
in a regression analysis.
00:25:18.400 --> 00:25:24.730
So if your R-squared is 20% then
80% of the variation is unexplained.
00:25:24.730 --> 00:25:28.210
If you add more variables,
R-squared increases to 50%,
00:25:28.210 --> 00:25:29.890
and then the error variance decreases.
00:25:29.890 --> 00:25:31.180
So the same thing happens here.
00:25:31.180 --> 00:25:34.450
If the negative binomial model
00:25:34.450 --> 00:25:37.570
fits better than the basic Poisson model,
00:25:37.570 --> 00:25:43.360
then it means that our model does not
explain the data completely.
00:25:43.360 --> 00:25:46.300
That's not a problem, as long as
00:25:46.300 --> 00:25:50.380
the omitted causes are uncorrelated
with the explanatory variables,
00:25:50.380 --> 00:25:52.750
and don't lead to an endogeneity problem.
00:25:52.750 --> 00:25:55.180
That's something to be aware of.
00:25:57.228 --> 00:26:02.170
Finally, quite often you see these
kinds of diagrams on what to do.
00:26:02.170 --> 00:26:04.360
And this is again a convention.
00:26:04.360 --> 00:26:09.730
How do we choose between the
negative binomial and Poisson model?
00:26:09.730 --> 00:26:13.420
There is no problem in using the
Poisson model for overdispersed data,
00:26:13.420 --> 00:26:16.240
as long as you adjust the
standard errors accordingly.
00:26:16.240 --> 00:26:20.770
So the current convention
is that you do both models,
00:26:20.770 --> 00:26:23.530
and then you do a likelihood ratio test between
00:26:23.530 --> 00:26:26.620
the Poisson model and the negative binomial model.
00:26:26.620 --> 00:26:30.460
If the negative binomial model
fits significantly better,
00:26:30.460 --> 00:26:32.260
that's evidence for overdispersion,
00:26:32.260 --> 00:26:34.960
and then you go for the negative binomial model.
00:26:34.960 --> 00:26:38.620
Then this article suggests that you look at,
00:26:38.620 --> 00:26:41.140
whether there are excess zeros.
00:26:41.140 --> 00:26:43.480
If there are excess zeros,
00:26:44.200 --> 00:26:48.010
then you do a Vuong test and based on that,
00:26:48.010 --> 00:26:55.030
you either choose a negative binomial model
or zero-inflated negative binomial model.
00:26:57.109 --> 00:27:00.160
The problem with this approach is that,
00:27:00.160 --> 00:27:03.970
the Vuong test is problematic and also,
00:27:03.970 --> 00:27:09.160
you should not be making modeling
decisions based on empirical results only.
00:27:09.709 --> 00:27:15.255
Zero-inflation is a hypothesis that
has a theoretical interpretation.
00:27:15.255 --> 00:27:17.530
If you use the zero inflation model,
00:27:17.530 --> 00:27:23.845
then you're making a hypothesis that your data
are actually results of two different processes,
00:27:23.845 --> 00:27:25.750
a process that generates zeroes,
00:27:25.750 --> 00:27:30.220
for example, people who never go fishing
will never catch a fish, and a count process.
00:27:30.220 --> 00:27:35.170
And so you have theoretical
guidance that you can usually use
00:27:35.170 --> 00:27:38.980
to choose whether you use zero inflation or not.
00:27:38.980 --> 00:27:41.470
So if there is a plausible mechanism for the zeros,
00:27:41.470 --> 00:27:43.540
then you apply zero inflation.
00:27:43.540 --> 00:27:47.830
Otherwise, you apply Poisson regression analysis,
00:27:47.830 --> 00:27:53.740
because zero inflation is actually not a
violation of the Poisson regression assumptions.