WEBVTT
00:00:00.090 --> 00:00:03.120
We will now formalize the
previous example a bit more,
00:00:03.120 --> 00:00:10.080
and we will discuss the concept of
null hypothesis significance testing,
00:00:10.080 --> 00:00:12.870
or NHST for short.
00:00:12.870 --> 00:00:18.000
The idea of null hypothesis
significance testing is that,
00:00:18.690 --> 00:00:20.970
we start with the following setup:
00:00:20.970 --> 00:00:26.130
we have some kind of estimation problem
that gives us some kind of estimate,
00:00:26.130 --> 00:00:28.590
and we define two things,
00:00:28.590 --> 00:00:33.510
we define a test statistic and
then we define a null hypothesis.
00:00:33.510 --> 00:00:35.700
So we call the test statistic,
00:00:35.700 --> 00:00:37.410
we refer to it as T,
00:00:37.410 --> 00:00:45.330
and then we need to have the sampling
distribution of T under the null hypothesis.
00:00:45.330 --> 00:00:50.190
The null hypothesis or H0 is typically
00:00:50.190 --> 00:00:52.950
a hypothesis that there is no effect,
00:00:52.950 --> 00:00:57.570
there is no correlation between
CEO gender and profitability,
00:00:57.570 --> 00:01:02.220
or there is no difference between men-led
and women-led companies in profitability.
00:01:02.220 --> 00:01:06.630
Then we derive, based on statistical theory,
00:01:06.630 --> 00:01:08.670
a reference distribution,
00:01:08.670 --> 00:01:14.370
so how would the test statistic be
distributed if there was really no effect?
00:01:14.370 --> 00:01:22.590
Then we compare the test statistic calculated
from our sample to the distribution,
00:01:22.590 --> 00:01:27.270
and we can see that, okay this
area here gives us the p-value.
00:01:27.270 --> 00:01:31.440
So it is the probability of obtaining
a test statistic at least as extreme
00:01:31.440 --> 00:01:34.500
as the one we observed, under the null hypothesis.
00:01:34.500 --> 00:01:38.220
So we compare the observed statistic to
the reference distribution to get the p-value.
00:01:38.220 --> 00:01:40.680
So that's the idea of null
hypothesis significance testing.
00:01:40.680 --> 00:01:43.890
Typically this is done by a computer for you,
00:01:43.890 --> 00:01:47.070
so you don't have to draw this normal
distribution or calculate the area,
00:01:47.070 --> 00:01:50.940
but it's useful to understand
what's going on under the hood,
00:01:50.940 --> 00:01:56.070
so you know what kind of problems we
face when we do this kind of inference.
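To make "under the hood" concrete, here is a sketch in Python with entirely made-up data. The lecture derives the reference distribution from statistical theory; this sketch instead builds it by shuffling group labels (a permutation test), which captures the same idea: how the statistic would be distributed if there were really no effect.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical profitability figures for men- and women-led companies
# (both drawn from the same distribution, so the null really holds here)
men = rng.normal(5.0, 2.0, size=50)
women = rng.normal(5.0, 2.0, size=50)
observed = women.mean() - men.mean()

# Under H0 (no effect) the group labels are exchangeable, so shuffling
# them generates the reference distribution of the test statistic
pooled = np.concatenate([men, women])
null_diffs = []
for _ in range(10_000):
    rng.shuffle(pooled)
    null_diffs.append(pooled[:50].mean() - pooled[50:].mean())
null_diffs = np.array(null_diffs)

# p-value: share of the reference distribution at least as extreme
# as the observed statistic (two-tailed)
p_value = np.mean(np.abs(null_diffs) >= abs(observed))
print(p_value)
```

In a real analysis the software typically uses a theory-derived distribution (such as Student's t) instead of a simulated one, but the compare-statistic-to-reference logic is the same.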
00:01:56.070 --> 00:02:01.530
The simplest test perhaps, using the
null hypothesis significance testing,
00:02:01.530 --> 00:02:04.080
is the t test,
00:02:04.080 --> 00:02:10.920
and the idea of a t test is that
it assumes that the estimates
00:02:10.920 --> 00:02:13.920
are normally distributed over repeated samples.
00:02:13.920 --> 00:02:17.160
That was the case when we compared two means,
00:02:17.160 --> 00:02:21.420
so the difference of two means is
approximately normally distributed
00:02:21.420 --> 00:02:23.022
when the sample size is large enough.
00:02:23.022 --> 00:02:29.190
And once we have the estimate,
00:02:29.190 --> 00:02:34.876
the test statistic is the estimate
divided by its standard error.
00:02:35.289 --> 00:02:36.660
So instead of looking at
00:02:36.660 --> 00:02:40.440
how far the estimate is from the
null hypothesis value of zero,
00:02:40.440 --> 00:02:46.200
we look at, how far the estimate, divided
by the standard error, is from zero.
00:02:47.730 --> 00:02:51.510
And this follows Student's t distribution,
00:02:51.510 --> 00:02:53.040
which looks like a normal distribution,
00:02:53.040 --> 00:02:56.700
but it's a bit wider in small samples.
00:02:56.700 --> 00:03:02.370
The idea of a t test or this estimate
divided by the standard error,
00:03:02.370 --> 00:03:05.190
is that we standardize the estimate.
00:03:05.190 --> 00:03:13.920
So remember, standardization means
subtracting the mean of the estimates,
00:03:13.920 --> 00:03:17.310
so here we assume the mean
to be the null hypothesis,
00:03:17.310 --> 00:03:20.310
so we subtract zero and it
doesn't really make a difference,
00:03:20.310 --> 00:03:23.100
and we divide by the standard deviation,
00:03:23.100 --> 00:03:27.030
which in this case is estimated by the standard error.
00:03:27.030 --> 00:03:29.970
So the t statistic tells us,
00:03:29.970 --> 00:03:35.550
how far from zero the estimate
is, on a standardized metric.
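As a sketch with hypothetical numbers (the estimate, standard error, and degrees of freedom below are invented for illustration; scipy is assumed to be available), the t statistic and its two-tailed p-value are computed like this:

```python
from scipy import stats

# Hypothetical numbers, not from the lecture: an estimated mean
# difference, its standard error, and the degrees of freedom
estimate = 1.8
std_error = 0.8
df = 48  # depends on the sample sizes

# Standardize: how many standard errors the estimate is from zero
t_stat = estimate / std_error

# Two-tailed p-value: both tail areas of Student's t distribution
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(t_stat, p_value)
```

Here the estimate is 2.25 standard errors from zero, which puts the two-tailed p-value below the conventional 0.05 threshold.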
00:03:35.550 --> 00:03:40.800
If it's more than two
standard deviations from zero,
00:03:40.800 --> 00:03:42.750
then we conclude that
00:03:42.750 --> 00:03:47.460
such an observation would be
unlikely to occur by chance alone,
00:03:47.460 --> 00:03:54.750
because 95% of the observations fall within
plus or minus two standard deviations,
00:03:54.750 --> 00:04:00.937
when the statistic is normally distributed.
00:04:00.937 --> 00:04:03.180
So, we compare this area,
00:04:03.180 --> 00:04:07.410
in practice, it often makes
sense to compare both areas here,
00:04:07.410 --> 00:04:09.150
so we calculate this area as well.
00:04:09.150 --> 00:04:10.740
The logic being that,
00:04:11.250 --> 00:04:16.140
it would be an important finding if the
difference were in the other direction as well.
00:04:16.140 --> 00:04:21.300
And this relates to, or is referred
to as one and two-tailed tests.
00:04:21.300 --> 00:04:23.580
So, which area do we compare?
00:04:23.580 --> 00:04:28.820
So normally, if we only compare
one end of the normal distribution here,
00:04:28.820 --> 00:04:30.800
this is called a one-tailed test.
00:04:30.800 --> 00:04:34.760
And if we compare the areas
00:04:34.760 --> 00:04:38.270
at both ends, here and here, together,
00:04:38.270 --> 00:04:43.190
So this is 2.5 % and this
is 2.5 % so they sum to 5 %,
00:04:43.190 --> 00:04:45.529
then that's called a two-tailed test.
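The one-tailed and two-tailed areas can be computed directly. In this sketch the observed t statistic and degrees of freedom are hypothetical, and scipy is assumed:

```python
from scipy import stats

t_stat = 2.0   # hypothetical observed t statistic
df = 100       # hypothetical degrees of freedom

# One-tailed p-value: the area in a single tail beyond the statistic
p_one_tailed = stats.t.sf(t_stat, df)

# Two-tailed p-value: both tail areas together,
# which is exactly twice the one-tailed value
p_two_tailed = 2 * stats.t.sf(abs(t_stat), df)
print(p_one_tailed, p_two_tailed)
```

The two-tailed value is exactly twice the one-tailed value, which is why switching to a one-tailed test halves the p-value.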
00:04:46.396 --> 00:04:52.400
Normally, when your statistical software
gives you a p-value from a t test,
00:04:52.400 --> 00:04:56.210
or some other test that uses something
that looks like a normal distribution,
00:04:56.210 --> 00:04:58.280
for example a z test,
00:04:58.280 --> 00:05:01.820
then it is two-tailed, so you compare both ends.
00:05:01.820 --> 00:05:06.320
And it's considered cheating
to use the one-tailed test,
00:05:06.320 --> 00:05:09.530
because what the one-tailed test does is
00:05:09.530 --> 00:05:16.730
give you a p-value that is exactly half
of the p-value of the two-tailed test.
00:05:16.730 --> 00:05:18.680
Because you have two tails here,
00:05:18.680 --> 00:05:21.350
the probability of the observation falling in
00:05:21.350 --> 00:05:24.470
either tail is twice the probability of one tail,
00:05:24.470 --> 00:05:28.400
so the one-tailed probability is half of
what the two-tailed probability would be.
00:05:28.400 --> 00:05:32.090
The problem in one-tailed tests is that,
00:05:32.090 --> 00:05:39.770
the standard is to use two tails and if
we observe a p-value in a research paper,
00:05:39.770 --> 00:05:45.410
we assume that it was computed
using a two-tailed test.
00:05:45.410 --> 00:05:54.350
Sometimes if the p-value is 0.06 and a
researcher wants it to be less than 0.05,
00:05:54.350 --> 00:05:56.480
they switch to one-tailed test,
00:05:56.480 --> 00:05:58.940
which allows them to divide the p-value by half,
00:05:58.940 --> 00:06:05.900
and they present those as if they
were two-tailed tests;
00:06:05.900 --> 00:06:08.960
that misleads readers, and it's unethical.
00:06:10.250 --> 00:06:15.320
There are basically no good reasons
ever to use these one-tailed tests,
00:06:15.320 --> 00:06:18.890
because the two-tailed test is more widely accepted and also,
00:06:18.890 --> 00:06:22.550
if someone wants to have the one-tailed
test instead of a two-tailed test,
00:06:22.550 --> 00:06:25.100
they can just divide your p-values by two,
00:06:25.100 --> 00:06:27.965
and that's the only difference.
00:06:27.965 --> 00:06:32.480
The p-values are very commonly
used in research papers.
00:06:32.480 --> 00:06:35.690
So you see papers, for example,
00:06:35.690 --> 00:06:37.160
this is from Hekman's paper,
00:06:37.160 --> 00:06:40.940
you see these p-values reported alongside statistics,
00:06:40.940 --> 00:06:42.740
so you have a regression estimate here,
00:06:42.740 --> 00:06:45.290
and then there is p-value less than 0.01,
00:06:45.290 --> 00:06:47.300
that is statistically significant,
00:06:47.300 --> 00:06:50.630
you see this n.s., which means non-significant, or
00:06:50.630 --> 00:06:53.630
you can see p-value is greater than 0.05.
00:06:53.630 --> 00:07:00.860
So for some reason, we have decided
that a 5 % p-value is the gold standard,
00:07:00.860 --> 00:07:03.530
and if you have less than 5 %, then it's a good thing,
00:07:03.530 --> 00:07:05.900
if you have more than 5 %, that's a bad thing,
00:07:05.900 --> 00:07:07.460
so that's an arbitrary threshold.
00:07:07.460 --> 00:07:13.010
So, a paper could have
hundreds of p-values easily,
00:07:13.010 --> 00:07:17.392
so they are very commonly used in research articles.
00:07:18.569 --> 00:07:20.690
The p-value relates to two different things,
00:07:20.690 --> 00:07:24.080
that is, to two different kinds of errors.
00:07:24.080 --> 00:07:30.590
And we have two things in statistical analysis,
00:07:30.590 --> 00:07:31.640
we have the population,
00:07:31.640 --> 00:07:34.088
and we have the sample.
00:07:34.088 --> 00:07:40.700
We want to make an inference that
something exists in the population
00:07:40.700 --> 00:07:43.386
using the sample data.
00:07:43.386 --> 00:07:46.258
So we calculate a test statistic,
00:07:46.258 --> 00:07:50.750
and if the test statistic rejects the
null hypothesis in the sample,
00:07:50.750 --> 00:07:54.350
then we conclude that
00:07:54.350 --> 00:07:57.080
the null doesn't hold in the population.
00:07:57.080 --> 00:08:00.950
But that's not actually always the case.
00:08:00.950 --> 00:08:03.890
When you get a p-value that is small,
00:08:03.890 --> 00:08:08.810
it's also possible that it
is a false positive finding.
00:08:08.810 --> 00:08:12.860
So p < 0.05 means that,
00:08:12.860 --> 00:08:14.420
if there was no effect,
00:08:14.420 --> 00:08:19.520
then the probability of getting
the kind of result that you just got
00:08:19.520 --> 00:08:21.820
would be less than 5%.
00:08:21.820 --> 00:08:26.840
So in 1 out of 20 samples from a population,
00:08:26.840 --> 00:08:28.217
you would be getting a false positive,
00:08:28.217 --> 00:08:31.370
if the null hypothesis holds.
00:08:31.370 --> 00:08:33.980
So it's possible that it's false positive,
00:08:33.980 --> 00:08:38.390
but it's also possible that it's a true positive.
00:08:38.390 --> 00:08:40.370
So the problem is, we don't know.
00:08:40.370 --> 00:08:43.610
We have evidence that it would be unlikely that
00:08:43.610 --> 00:08:47.120
we would get an effect estimate by chance only.
00:08:47.120 --> 00:08:50.600
Then we conclude that maybe
it wasn't by chance only,
00:08:50.600 --> 00:08:52.100
but we can't know for sure.
00:08:54.300 --> 00:08:56.640
So this is type 1 error,
00:08:56.640 --> 00:08:59.130
then we have type 2 error,
which is a false negative.
00:08:59.130 --> 00:09:02.280
Let's say that the null hypothesis
does not hold in the population,
00:09:02.280 --> 00:09:07.470
and let's say that women-led companies are
really more profitable than men-led companies,
00:09:07.470 --> 00:09:13.500
but for some reason, our study
couldn't find the difference.
00:09:13.500 --> 00:09:14.640
So that would be a false negative.
00:09:14.640 --> 00:09:17.490
And there is the case that we say that
00:09:17.490 --> 00:09:19.920
we can't reject the null hypothesis,
00:09:19.920 --> 00:09:23.490
we can't reject the claim
that there's no difference,
00:09:23.490 --> 00:09:25.110
and there really is no difference,
00:09:25.110 --> 00:09:27.390
so that's also a valid finding.
00:09:27.390 --> 00:09:32.065
So we want to be sure that we either
have true positives or true negatives.
00:09:32.582 --> 00:09:38.790
For the probability of false positives
under the null hypothesis,
00:09:38.790 --> 00:09:42.810
we consider 5% or less acceptable.
00:09:42.810 --> 00:09:46.590
So if we say that the p-value is valid,
00:09:46.590 --> 00:09:50.119
then it should behave as expected.
00:09:50.119 --> 00:09:56.088
So it's okay for the p-value to be less than 0.05
00:09:56.088 --> 00:09:58.415
only, say, 3% of the time,
00:09:58.415 --> 00:10:02.610
if the null hypothesis holds in the population.
00:10:02.610 --> 00:10:06.630
So we have a conservative test, that's okay.
00:10:06.630 --> 00:10:10.200
If we make errors, we want to err on the side of caution.
00:10:10.200 --> 00:10:16.320
But if our p-value was less than
0.05, let's say 7 % of the time,
00:10:16.320 --> 00:10:19.830
then we would say that it's too liberal and
00:10:19.830 --> 00:10:22.950
it's not a valid p-value
for that particular test,
00:10:22.950 --> 00:10:25.620
because it doesn't follow
the reference distribution.
00:10:25.620 --> 00:10:29.584
It's important that when the null hypothesis holds,
00:10:29.584 --> 00:10:33.810
our p-values don't indicate
support for an effect too often.
00:10:34.616 --> 00:10:37.800
Then we have another concept
called statistical power,
00:10:37.800 --> 00:10:40.294
whereas what we just discussed was the false-positive rate.
00:10:40.294 --> 00:10:43.890
And statistical power is something that,
00:10:43.890 --> 00:10:49.650
once we have a test whose false-positive
rate doesn't exceed the accepted level,
00:10:49.650 --> 00:10:53.160
we want the test to identify an effect,
00:10:53.160 --> 00:10:56.916
when it exists, as frequently as possible.
00:10:57.144 --> 00:10:59.792
Typically we are okay with 80% power,
00:10:59.792 --> 00:11:03.390
but there are studies with way less power.
00:11:03.390 --> 00:11:07.770
So 80% power means that when
there is an effect in the population,
00:11:07.770 --> 00:11:12.300
then in four out of five studies,
we would actually detect an effect.
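Power can likewise be estimated by simulation. In this sketch the effect size (a 0.4 standard deviation shift) and group size (100) are hypothetical choices that happen to give roughly 80% power; numpy and scipy are assumed:

```python
import numpy as np
from scipy import stats

# Simulate studies where a real effect exists in the population:
# group b is shifted by 0.4 standard deviations (hypothetical effect).
rng = np.random.default_rng(1)
n_sims, n, effect = 5_000, 100, 0.4

detections = 0
for _ in range(n_sims):
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(effect, 1.0, size=n)
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        detections += 1

power = detections / n_sims
print(power)  # roughly 0.8: about four out of five studies detect it
```

Running such a simulation before collecting data is one common way to choose a sample size large enough to reach the desired power.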
00:11:12.300 --> 00:11:15.360
The question is, which one is more important?
00:11:15.360 --> 00:11:19.710
So we're not okay with more
than 5 % false-positive rates,
00:11:19.710 --> 00:11:23.430
but we are okay with 20 % false-negative rates,
00:11:23.430 --> 00:11:26.520
because 80 % power means a 20 % false-negative rate.
00:11:26.520 --> 00:11:28.710
Then the reason why,
00:11:29.730 --> 00:11:33.420
we are so much more worried
about false positives is that,
00:11:33.420 --> 00:11:36.870
positive effects typically have
some kind of policy implications.
00:11:36.870 --> 00:11:42.480
If we find out that a medicine
doesn't do us any good,
00:11:42.480 --> 00:11:44.850
then no one is going to take the
medicine, and we continue the research.
00:11:44.850 --> 00:11:48.270
If we find out that the medicine helps people,
00:11:48.270 --> 00:11:50.490
then people will start taking the medicine.
00:11:50.490 --> 00:11:53.490
If it's a false positive finding,
00:11:53.490 --> 00:11:57.750
then people will take a medicine that is useless,
00:11:57.750 --> 00:11:59.610
or could be even harmful to them.
00:11:59.610 --> 00:12:05.370
So false positives have harmful implications much more often than false negatives,
00:12:05.370 --> 00:12:08.970
and that's the reason why we
want to avoid false positives.
00:12:08.970 --> 00:12:11.850
We have agreed that it's okay to have a 5 % rate,
00:12:11.850 --> 00:12:15.840
hence p is less than 0.05, but not more.
00:12:16.470 --> 00:12:21.300
Of course, in some scenarios, if you
have a really life-critical application,
00:12:21.300 --> 00:12:27.060
then you could be using a p-value
threshold of 0.001, for example.
00:12:27.060 --> 00:12:31.530
So 0.05 is not the one correct value,
00:12:31.530 --> 00:12:33.600
that's just the convention in many fields.
00:12:33.600 --> 00:12:35.970
Some other fields use smaller values and
00:12:35.970 --> 00:12:38.520
you can use smaller values in
an individual study as well.