WEBVTT
Kind: captions
Language: en
00:00:00.090 --> 00:00:03.240
Interpreting the logistic
regression analysis results
00:00:03.240 --> 00:00:06.860
differs a bit from normal
regression analysis interpretation.
00:00:06.860 --> 00:00:10.500
Let's take a look at the results
from logistic regression analysis,
00:00:10.500 --> 00:00:11.940
using the Menarche dataset.
00:00:12.825 --> 00:00:17.520
The R glm command gives us these results.
00:00:17.520 --> 00:00:21.510
So we'll just focus on the
actual coefficients for now,
00:00:21.510 --> 00:00:25.290
and leave these other things for another video.
00:00:25.961 --> 00:00:28.050
We have the estimate,
00:00:28.050 --> 00:00:30.000
which is the point estimate of the coefficient.
00:00:30.000 --> 00:00:31.350
Then we have the standard error,
00:00:31.350 --> 00:00:36.630
which quantifies how much the estimate is
likely to change from one sample to another,
00:00:36.630 --> 00:00:38.250
if we repeated the study.
00:00:38.250 --> 00:00:40.170
We have a Z value,
00:00:40.170 --> 00:00:43.410
which is the ratio of the estimate
divided by the standard error.
00:00:43.410 --> 00:00:47.730
So the Z value is analogous to the
T value in regression analysis.
00:00:47.730 --> 00:00:52.140
It is called a Z statistic instead of a T statistic,
00:00:52.140 --> 00:00:56.790
because the maximum likelihood estimates
are based on large sample theory,
00:00:56.790 --> 00:00:59.550
and instead of comparing
against the T distribution,
00:00:59.550 --> 00:01:03.840
we compare this against the normal distribution.
00:01:03.840 --> 00:01:09.690
So under the null hypothesis that this
estimate is zero in the population,
00:01:09.690 --> 00:01:12.300
and if the sample size is large enough,
00:01:12.300 --> 00:01:15.210
the Z value follows a standard normal distribution
00:01:15.210 --> 00:01:18.480
and that allows us to calculate the p-values.
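NOTE
The p-value calculation described above can be sketched in Python (the video uses R; this standalone snippet only illustrates the two-sided test against the standard normal distribution):
```python
import math
# Two-sided p-value for a z statistic, using the standard
# normal CDF written via the error function.
def two_sided_p(z):
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return 2 * (1 - cdf)
# A z value of about 1.96 gives the familiar 0.05 threshold.
print(round(two_sided_p(1.96), 3))  # ~0.05
```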
00:01:18.480 --> 00:01:24.060
So whether age has an effect or not
can be interpreted from these p-values,
00:01:24.060 --> 00:01:30.300
we can see that age has a large and
highly statistically significant effect.
00:01:30.300 --> 00:01:32.490
So we can confidently say that
00:01:32.490 --> 00:01:37.380
age has some kind of effect on the
probability of having had menarche.
00:01:38.356 --> 00:01:43.470
The magnitude of that effect is a
more complicated question to answer.
00:01:44.111 --> 00:01:48.480
We really can't say that the
probability of having had menarche
00:01:49.166 --> 00:01:54.210
increases by 1.6 when the
girl gets one year older.
00:01:54.210 --> 00:01:59.970
One reason is that 1.6 increase gets
us beyond the range of the data.
00:01:59.970 --> 00:02:02.730
So if the probability is initially 0,
00:02:02.730 --> 00:02:04.950
and you increase age by one,
00:02:04.950 --> 00:02:08.160
the predicted probability would be 1.62,
00:02:08.160 --> 00:02:10.020
which is impossible. So it doesn't work that way.
00:02:10.798 --> 00:02:13.950
The reason why we can't
interpret this directly is,
00:02:13.950 --> 00:02:19.500
these are the effects before we
applied the logistic link function.
00:02:19.500 --> 00:02:26.040
So these are effects on the linear predictor
and not on the actual dependent variable.
00:02:26.406 --> 00:02:28.236
So it's the same as
00:02:28.236 --> 00:02:31.380
when you apply a log transformation
to the dependent variable:
00:02:31.380 --> 00:02:33.972
the interpretation is that
00:02:34.110 --> 00:02:38.576
the coefficient tells you the
effect on the log scale,
00:02:39.000 --> 00:02:42.090
and you want to know what's the
effect on the original scale.
00:02:42.090 --> 00:02:44.160
This coefficient here tells you,
00:02:44.160 --> 00:02:47.220
what is the effect in the
scale of the linear predictor?
00:02:47.220 --> 00:02:49.350
But you are not really interested in that,
00:02:49.350 --> 00:02:50.400
you are interested in,
00:02:50.400 --> 00:02:52.770
what is the effect on the observed variable scale?
00:02:53.334 --> 00:02:57.600
So we don't interpret these directly; instead
00:02:57.600 --> 00:03:00.180
we interpret them as odds ratios.
00:03:00.897 --> 00:03:06.990
So the odds ratio is a concept that
is useful for logistic
00:03:06.990 --> 00:03:11.820
regression analysis and for
some other models as well.
00:03:13.101 --> 00:03:17.340
The idea is that odds are
the ratio of the frequencies of two outcomes.
00:03:17.340 --> 00:03:27.600
So here we have the outcome of girl having
had menarche and not having had menarche.
00:03:27.600 --> 00:03:31.920
If 1 in 100 girls have had menarche,
00:03:31.920 --> 00:03:36.450
then the odds of having had menarche are 1 to 99,
00:03:36.450 --> 00:03:39.630
because one girl out of the 100 has had it,
00:03:39.630 --> 00:03:44.070
and the remaining 99 haven't.
00:03:44.680 --> 00:03:50.580
One common use of
odds is in gambling.
00:03:50.580 --> 00:03:52.860
So if you have two soccer teams,
00:03:52.860 --> 00:03:55.980
one has won two matches in the past,
00:03:55.980 --> 00:03:59.400
another one has won five matches in the past.
00:03:59.400 --> 00:04:01.500
Then you say that based on that data
00:04:01.500 --> 00:04:04.410
the odds for the first team
winning is two to five.
00:04:04.410 --> 00:04:06.000
So that's the idea of odds.
00:04:06.930 --> 00:04:13.080
And more formally, if the
probability of an outcome is P,
00:04:13.080 --> 00:04:17.790
then the odds are defined
as P against one minus P.
00:04:17.790 --> 00:04:22.260
So it's the probability of one outcome
divided by the probability of another outcome,
00:04:22.260 --> 00:04:23.670
if you have only two possible outcomes.
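NOTE
As a quick sketch of that definition P / (1 - P) (Python here, though the analysis in the video is done in R):
```python
# Odds for a probability p, in the two-outcome case:
# the probability of the outcome against its complement.
def odds(p):
    return p / (1 - p)
print(odds(0.5))   # even chances -> odds of 1.0
print(odds(0.01))  # the 1-in-100 example -> 1/99
```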
00:04:25.256 --> 00:04:27.060
And here is where the exponential comes in:
00:04:27.060 --> 00:04:30.660
if you exponentiate the logistic
regression coefficients,
00:04:30.660 --> 00:04:34.020
those can be interpreted as odds ratios.
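NOTE
For instance, exponentiating the age coefficient of roughly 1.6 quoted earlier gives its odds ratio (a Python sketch; the coefficient value is taken from the transcript, not a fresh model fit):
```python
import math
beta_age = 1.6                 # logistic regression coefficient for age
odds_ratio = math.exp(beta_age)
print(round(odds_ratio, 1))    # close to 5
```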
00:04:34.478 --> 00:04:39.930
And the idea is that when you
exponentiate the coefficients,
00:04:39.930 --> 00:04:42.930
then the coefficients tell you that
00:04:42.930 --> 00:04:48.600
a one-unit increase in the independent variable
00:04:48.600 --> 00:04:53.610
multiplies the odds by the
exponentiated coefficient.
00:04:54.000 --> 00:04:55.410
I'll show you an example.
00:04:55.807 --> 00:04:58.110
Let's take a look at the idea of the odds ratio,
00:04:58.110 --> 00:05:02.310
and why we can interpret these
coefficients as odds ratios.
00:05:02.829 --> 00:05:05.079
So here are example odds for the data.
00:05:05.079 --> 00:05:08.610
These results are illustrative:
00:05:08.610 --> 00:05:10.230
we have the linear predictor,
00:05:10.230 --> 00:05:12.360
we have the fitted probability,
00:05:12.360 --> 00:05:13.800
we have the fitted odds,
00:05:13.800 --> 00:05:21.450
which is the probability of one outcome
against the probability of the other,
00:05:21.450 --> 00:05:22.954
and we calculate its value.
00:05:23.518 --> 00:05:28.328
So the odds for this first girl
having had menarche are 74% to 26%,
00:05:28.328 --> 00:05:30.128
which is 2.79.
00:05:30.128 --> 00:05:33.300
The odds for the second girl are 8% to 92%,
00:05:33.300 --> 00:05:35.520
which is 0.09 and so on.
00:05:35.520 --> 00:05:36.270
So these are the odds.
00:05:36.270 --> 00:05:40.860
And then we calculate marginal predictions.
00:05:40.860 --> 00:05:44.490
So in regression analysis, we are
interested in the marginal effect,
00:05:44.490 --> 00:05:47.370
that is, what is the effect
00:05:47.370 --> 00:05:51.510
of increasing one independent
variable by one unit,
00:05:51.510 --> 00:05:53.010
holding everything else constant.
00:05:53.010 --> 00:05:55.110
So we are interested in marginal effect.
00:05:55.110 --> 00:06:00.630
And let's calculate marginal effects
now for girls of different ages.
00:06:00.630 --> 00:06:03.090
So instead of using this actual data,
00:06:03.090 --> 00:06:08.040
we have a hypothetical girl at
age of 9, 10, 11, 12 and so on.
00:06:08.040 --> 00:06:12.900
We calculate the fitted
probabilities using our model.
00:06:13.037 --> 00:06:16.682
And we calculate odds.
00:06:16.743 --> 00:06:19.495
We calculate the value of the odds,
00:06:19.495 --> 00:06:22.770
and when we compare two odds here,
00:06:22.770 --> 00:06:24.330
the ratio of these two odds, shown as zeros,
00:06:24.330 --> 00:06:27.352
they are actually not exactly 0, is 4.6.
00:06:27.352 --> 00:06:31.477
So every time we go and we
increase the girl's age by one,
00:06:31.477 --> 00:06:34.950
then the odds increase by a factor of 4.6.
00:06:34.950 --> 00:06:40.860
So every additional year
multiplies the odds by 4.6.
00:06:41.400 --> 00:06:44.340
So that's the odds ratio interpretation.
00:06:44.340 --> 00:06:50.520
So the odds always change by this same factor of 4.6.
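NOTE
That constant multiplicative change can be verified with a small sketch. The intercept and slope below are made-up illustrative values, not the actual fit; the point is that the odds ratio between consecutive ages always equals exp(slope):
```python
import math
a, b = -21.0, 1.6  # hypothetical intercept and age slope
def fitted_odds(age):
    # odds are the exponential of the linear predictor
    return math.exp(a + b * age)
# Compare odds at consecutive ages: the ratio is constant.
ratios = [fitted_odds(age + 1) / fitted_odds(age) for age in range(9, 16)]
print(ratios[0], math.exp(b))  # each one-year step multiplies odds by exp(b)
```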
00:06:51.084 --> 00:06:56.280
So how do we use that in regression analysis?
00:06:56.280 --> 00:06:58.380
Well, we calculate the odds ratios,
00:06:58.380 --> 00:07:01.338
and this is for the actual data.
00:07:02.101 --> 00:07:04.800
So we calculate the odds ratio,
00:07:04.800 --> 00:07:06.157
which is about 5.
00:07:06.157 --> 00:07:09.210
And the interpretation is that,
00:07:09.210 --> 00:07:17.913
each additional year of age increases the
odds of having had menarche fivefold.
00:07:18.279 --> 00:07:21.000
So that kind of quantifies,
00:07:21.084 --> 00:07:23.100
how large the effect of age is.
00:07:23.100 --> 00:07:26.109
We know that if something is increased fivefold,
00:07:26.109 --> 00:07:27.902
then it's a pretty large effect.
00:07:27.902 --> 00:07:30.720
The problem still is that with odds ratios,
00:07:30.720 --> 00:07:35.820
we can't really say, how much does
the actual probability increase,
00:07:35.820 --> 00:07:38.490
because odds and probability
are not the same things.
00:07:38.917 --> 00:07:41.407
And quite often we want to know,
00:07:42.124 --> 00:07:47.464
how much does the probability of
having had menarche depend on the age,
00:07:47.520 --> 00:07:50.190
and what does the effect look like?
00:07:50.770 --> 00:07:56.520
To do that we would need to plot the
marginal predictions from the model.
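NOTE
A minimal sketch of such marginal predictions (again with assumed coefficients, since the real ones come from the fitted glm): the inverse logit maps the linear predictor back to a probability, which stays between 0 and 1.
```python
import math
a, b = -21.0, 1.6  # hypothetical intercept and age slope
def predicted_prob(age):
    eta = a + b * age                # linear predictor
    return 1 / (1 + math.exp(-eta))  # inverse logit -> probability
# Predicted probability of having had menarche at each age.
for age in range(9, 17):
    print(age, round(predicted_prob(age), 3))
```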