WEBVTT
Kind: captions
Language: en
00:00:00.030 --> 00:00:04.380
It's very common that in research we're
interested in a large number of people or
00:00:04.380 --> 00:00:12.540
organizations. For example in political polling
it's usually interesting to know what the
00:00:12.540 --> 00:00:19.410
popularity of a political party is. However if
we consider national-level popularity then
00:00:19.410 --> 00:00:25.770
measuring everybody's opinion would involve
calling millions of people, and that's in most
00:00:25.770 --> 00:00:32.370
cases impractical. Instead what we do is that we
take a smaller number of people called a sample,
00:00:32.370 --> 00:00:38.400
for example we call 300 people or a thousand people,
we ask their opinions about political parties,
00:00:38.400 --> 00:00:46.200
and we use that sample to calculate an estimate
of the population popularity of that
00:00:46.200 --> 00:00:52.950
political party. If the sample is well chosen
and if it's large enough then the popularity
00:00:52.950 --> 00:01:00.540
from the sample gets very close to the actual
population popularity. Another thing we do
00:01:00.540 --> 00:01:08.190
when we do polling is that we want to tell the
readers of our poll how certain we are about the
00:01:08.190 --> 00:01:13.710
result. To do so we present the margin of error.
Let's say that the political party's popularity
00:01:13.710 --> 00:01:24.900
is 21% plus or minus 1 percentage point. That
degree of uncertainty is quantified by the standard error.
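As a sketch of that margin-of-error idea: the standard error of a sample proportion p from n respondents is sqrt(p(1-p)/n). The sample size of 1,500 below is an assumed value for illustration, not a figure from the lecture.

```python
import math

def proportion_se(p, n):
    """Standard error of a sample proportion p estimated from n respondents."""
    return math.sqrt(p * (1 - p) / n)

# A party polling at 21% in an assumed sample of 1,500 people:
se = proportion_se(0.21, 1500)
print(f"21%, with a standard error of about {se * 100:.1f} percentage points")
```

With 1,500 respondents one standard error is roughly one percentage point, which is in the ballpark of the plus-or-minus figure in the example above.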
00:01:24.900 --> 00:01:32.670
I'm gonna go through these four concepts next.
So let's take an example: let's assume that we
00:01:32.670 --> 00:01:39.120
have a university with, let's say, 10,000 students
and staff, and we want to calculate the
00:01:39.120 --> 00:01:43.920
mean height of the people that are affiliated
with the university, including students and staff.
00:01:43.920 --> 00:01:49.950
And there are different ways of doing that.
First we have to understand the basic concepts.
00:01:49.950 --> 00:01:57.870
So here our population is everyone who is enrolled
at the University. The actual list of people who
00:01:57.870 --> 00:02:04.470
have been admitted, and the list
of people who are employed, available from the
00:02:04.470 --> 00:02:10.080
university administration forms our sampling
frame, which is the operational definition, the
00:02:10.080 --> 00:02:16.380
actual list of people that we think belong to
our population. Then we take a random sample
00:02:16.380 --> 00:02:23.220
and from that random sample we hope that we can
learn something about the population. We could of
00:02:23.220 --> 00:02:28.380
course take other kinds of samples as well but for
now we'll just talk about random samples, because
00:02:28.380 --> 00:02:35.400
that simplifies things a lot, so
let's go to the example now. And we have different
00:02:35.400 --> 00:02:42.840
strategies for measuring the height or estimating
the mean height of people at the university.
00:02:42.840 --> 00:02:47.970
One obvious strategy is to take a small sample of
people and then measure everybody's height.
00:02:47.970 --> 00:02:53.700
Take a sum and divide it by the number of people
which gives us the average height of the sample or
00:02:53.700 --> 00:03:00.090
the sample mean of the height. So we can do that
and here's some data. So we have a hypothetical
00:03:00.090 --> 00:03:08.190
university with a population mean height of
about 169.6 centimeters, and we have five samples.
00:03:08.190 --> 00:03:13.950
So we have a sample size of ten people, and we have
their measured heights. Here we can see that some
00:03:13.950 --> 00:03:18.690
people are shorter than average, some people
are taller than average, and some people are
00:03:18.690 --> 00:03:27.150
very tall. In the first sample the sample mean is
161, so we underestimate the population value by
00:03:27.150 --> 00:03:35.520
about 8 centimeters. The second sample gives
169.56, so that's very close to the actual population
00:03:35.520 --> 00:03:43.200
value. The third random sample gives us 173, which
overestimates the population value. Then we have
00:03:43.200 --> 00:03:51.270
163, which underestimates again, and 168, which is
close to the true population mean. Now the
00:03:51.270 --> 00:03:59.820
question is why these values differ, so why do
we get a different estimate from each sample. That
00:03:59.820 --> 00:04:07.950
is because in a random sample sometimes it happens
that tall people get selected more often in that
00:04:07.950 --> 00:04:13.020
sample than short people. Sometimes we randomly
select short people more often than tall people.
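To make that sampling variation concrete, here is a small simulation. It is a sketch with assumed numbers (a synthetic normal population with mean 169.6 cm and SD 10 cm), not the lecture's actual data:

```python
import random
import statistics

random.seed(1)
# Hypothetical university population of 10,000 heights in cm (assumed parameters)
population = [random.gauss(169.6, 10) for _ in range(10_000)]

# Five random samples of 10 people each: the sample mean differs every time
means = []
for i in range(5):
    sample = random.sample(population, 10)
    means.append(statistics.mean(sample))
    print(f"sample {i + 1}: mean height {means[-1]:.1f} cm")
```

Each run of the loop is one "study", and the five sample means scatter around the population mean just as the five samples on the slide do.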
00:04:13.020 --> 00:04:20.790
So the estimate varies from sample to
sample, and this is called the sampling variance of
00:04:20.790 --> 00:04:28.590
an estimator. Estimator here means any strategy
that we can apply to data to calculate an estimate. So
00:04:28.590 --> 00:04:35.370
these estimates vary from sample to sample. Now
two questions arise: how do we make the estimates
00:04:35.370 --> 00:04:42.810
more precise, and can we improve the estimates?
Because if the population value is 169 and our
00:04:42.810 --> 00:04:50.880
estimates vary between 161 and 173 that's quite
imprecise, and the second question is how do
00:04:50.880 --> 00:04:59.370
we quantify the uncertainty? If we just say that
we estimate the mean height to be 161, that's
00:04:59.370 --> 00:05:06.030
quite an irresponsible thing to do, because we are
not telling our audience that our sample size is
00:05:06.030 --> 00:05:12.600
so small that these estimates are very imprecise.
Recall my example from political polling,
00:05:12.600 --> 00:05:17.820
when you see a poll number there's always
the margin of error attached to that
00:05:17.820 --> 00:05:23.820
particular point estimate of popularity.
Let's take a look at the effect of sample
00:05:23.820 --> 00:05:29.910
size. One obvious strategy for making our
estimates calculated from a sample better
00:05:29.910 --> 00:05:36.150
is to increase the sample size. So here is a
distribution of 10,000 random samples from our
00:05:36.150 --> 00:05:44.310
population using a sample size of 10. Typically
we get estimates that are close to
00:05:44.310 --> 00:05:49.620
the correct population value, sometimes we
get estimates that are way too small, and
00:05:49.620 --> 00:05:56.430
sometimes estimates that are way too large. So
once we increase the sample size to 50 this red
00:05:56.430 --> 00:06:02.430
line here, we can see that the estimates from
repeated samples are now distributed
00:06:02.430 --> 00:06:09.870
between plus or minus about seven centimeters from the
population value here. So the estimates are
00:06:09.870 --> 00:06:16.590
more precise than what we got from ten
observations. If we further increase the sample
00:06:16.590 --> 00:06:22.680
size to 200, now we get plus or minus
3 centimeters around the population mean, so our
00:06:22.680 --> 00:06:28.950
precision increases here. If we have the
full population then we have the full population
00:06:28.950 --> 00:06:36.330
value. So when we have a sample, our estimates
typically improve as the sample size increases.
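The shrinking spread can be reproduced in code. A sketch under assumed parameters (a synthetic normal population and 2,000 replications per sample size, rather than the slide's 10,000, to keep it fast):

```python
import random
import statistics

random.seed(2)
# Assumed normal population of heights (cm), not the slide's actual data
population = [random.gauss(169.6, 10) for _ in range(10_000)]

# Draw many repeated samples and watch the spread of the sample mean shrink
spreads = {}
for n in (10, 50, 200):
    means = [statistics.mean(random.sample(population, n)) for _ in range(2_000)]
    spreads[n] = statistics.stdev(means)
    print(f"n = {n:3d}: SD of the sample means = {spreads[n]:.2f} cm")
```

The spread roughly follows SD divided by the square root of n, so quadrupling the sample size halves the spread of the estimates.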
00:06:36.330 --> 00:06:41.560
That's referred to as the consistency property
of an estimator that I will talk about on the next
00:06:41.560 --> 00:06:50.110
slide. Then another thing: besides being
uncertain, we have to quantify that uncertainty.
00:06:50.110 --> 00:06:56.560
To quantify the uncertainty we have to quantify
the dispersion. The question of uncertainty quantification
00:06:56.560 --> 00:07:04.840
refers to the question: if we were to repeat the
study over and over again, how much the estimates
00:07:04.840 --> 00:07:11.380
would vary from sample to sample. So we want to
quantify the sampling variance of the estimate.
00:07:11.380 --> 00:07:18.340
So we quantify how widely the different
estimates are dispersed. Remember that we have
00:07:18.340 --> 00:07:27.250
two statistics that quantify dispersion. We have
the standard deviation and the variance. Typically in
00:07:27.250 --> 00:07:33.130
estimation we are interested in the standard deviation
because it is in the same metric as the estimate.
00:07:33.130 --> 00:07:41.410
So if the estimate is 160 centimeters, we can
say that the standard error is
00:07:41.410 --> 00:07:48.820
5 centimeters. So the standard error is an
estimate of what would be the standard deviation
00:07:48.820 --> 00:07:54.550
of repeated samples from the same population.
Of course we would ideally want to calculate
00:07:54.550 --> 00:08:01.870
the actual standard deviation of these 10,000
replications, but consider for example political
00:08:01.870 --> 00:08:10.600
polling. If you were asked to provide a standard
deviation of the same poll repeated over 10,000
00:08:10.600 --> 00:08:15.940
times, you would have to actually do the 10,000
replications to be able to calculate that
00:08:15.940 --> 00:08:22.690
standard deviation, and that's not a practical thing
to do. Therefore we use standard error which is an
00:08:22.690 --> 00:08:31.000
estimate of this standard deviation. In the same
way as the sample mean is an estimate of
00:08:31.000 --> 00:08:37.090
the population mean, the standard error is an
estimate of the standard deviation of
00:08:37.090 --> 00:08:42.820
the sample mean over repeated samples. How the
standard error is calculated is not relevant
00:08:42.820 --> 00:08:50.110
at this point; you just have to understand that it
quantifies the dispersion over the same study if
00:08:50.110 --> 00:08:57.610
it was repeated over independent random samples.
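The lecture leaves the formula aside, but a common estimate of the standard error of a sample mean is the sample standard deviation divided by the square root of n. A sketch, with assumed synthetic data, checking that one-sample estimate against the standard deviation over 10,000 actual replications:

```python
import math
import random
import statistics

random.seed(3)
# Assumed normal population of heights (cm) for illustration
population = [random.gauss(169.6, 10) for _ in range(10_000)]

# Standard error estimated from ONE sample of 50 people...
one_sample = random.sample(population, 50)
se = statistics.stdev(one_sample) / math.sqrt(len(one_sample))

# ...compared with the actual SD of the sample mean over 10,000 replications
means = [statistics.mean(random.sample(population, 50)) for _ in range(10_000)]
sd_of_means = statistics.stdev(means)

print(f"standard error from one sample:  {se:.2f} cm")
print(f"SD over 10,000 repeated samples: {sd_of_means:.2f} cm")
```

The two numbers come out close, which is the point: one sample is enough to estimate how much repeated samples would disperse, without actually repeating the study.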
Let's get back to our task. So far we have only
00:08:57.610 --> 00:09:05.620
discussed the sample mean. Taking the mean of a sample
is an obvious strategy if we want to calculate the
00:09:05.620 --> 00:09:11.110
population mean or estimate the population mean.
But that's not the only strategy. If you
00:09:11.110 --> 00:09:18.130
take a sample of, let's say, 30 people in a
class and you measure everybody's height, it takes
00:09:18.130 --> 00:09:23.800
some time. Sometimes the time or effort
of doing the calculation is an issue for you,
00:09:23.800 --> 00:09:29.320
so we could for example just take one person
from the class and measure their height.
00:09:29.320 --> 00:09:35.860
And if we get 160 centimeters, then that's
a ballpark estimate. It's still an estimate;
00:09:35.860 --> 00:09:39.610
it's not very precise, but it's an estimate
nevertheless, it's valid in some sense, and
00:09:39.610 --> 00:09:46.240
it's easy to calculate. Then again, with that
alternative strategy of course we would
00:09:46.240 --> 00:09:53.200
be omitting 29 people from our sample of 30. So
that's not a good strategy. Another quick strategy
00:09:53.200 --> 00:09:59.740
for calculating the estimate for the height, is
to allow people to self-organize into a line,
00:09:59.740 --> 00:10:05.260
so we tell people that the shortest person goes to the
back of the class, the tallest person goes to the
00:10:05.260 --> 00:10:10.360
front of the class, and everyone else
goes in between those two
00:10:10.360 --> 00:10:15.790
people ordered by their height. So people can
self-organize that way pretty quickly. Then we
00:10:15.790 --> 00:10:21.790
just go and we measure the height of the person
in the middle. That's the sample median, and that's
00:10:21.790 --> 00:10:28.750
an OK strategy, because it estimates the population
mean under certain conditions. So there are
00:10:28.750 --> 00:10:34.150
different ways of calculating an estimate of
the population mean: we could use the sample mean,
00:10:34.150 --> 00:10:39.760
we could use the height of the first person
that we see in the class, or we could use the
00:10:39.760 --> 00:10:45.340
median of the people in the class. So
which strategy should be used in this case? The
00:10:45.340 --> 00:10:52.960
mean is the best, but to make an informed choice
of which is the most preferable, we have to first
00:10:52.960 --> 00:10:59.770
define what is best. So every time when we
say that something is the best,
00:10:59.770 --> 00:11:05.260
we have some kind of criterion. For example the
best ice hockey team is the one that won the most
00:11:05.260 --> 00:11:14.320
matches. The best runner is the one with the
fastest time, and the best student in the class
00:11:14.320 --> 00:11:18.880
is the one with the highest grade. So whenever
we say that something
00:11:18.880 --> 00:11:25.360
is the best, we have to have some criteria.
And so we have to go and talk about different
00:11:25.360 --> 00:11:31.870
properties that these estimation strategies
could have when we decide which one is the best.
00:11:31.870 --> 00:11:42.190
So estimators can have certain properties.
Estimator refers to again any strategy or
00:11:42.190 --> 00:11:47.620
any calculation that you apply to your sample
to get one value that is an estimate of the
00:11:47.620 --> 00:11:54.880
population. One minimal quality that a
useful estimator must have is that the estimator
00:11:54.880 --> 00:11:59.740
must be consistent, and that consistency
means that if we increase the sample size,
00:11:59.740 --> 00:12:04.720
then our estimates will get better. So
the sample mean is a consistent estimator
00:12:04.720 --> 00:12:11.830
because it improves with sample size. Consistency also
requires that if we have the full population,
00:12:11.830 --> 00:12:19.120
and we apply our calculation strategy to the full
population then we will get the correct population
00:12:19.120 --> 00:12:26.320
result. So consistency guarantees that our study
will get better as the sample size increases. Of course
00:12:26.320 --> 00:12:32.170
in reality we can't study full populations because
of cost issues but have to rely on samples, and
00:12:32.170 --> 00:12:39.100
therefore there are other things that we need to
consider as well besides consistency. The second
00:12:39.100 --> 00:12:45.490
important thing is unbiasedness. If an estimator
is unbiased it means that it is free of systematic
00:12:45.490 --> 00:12:53.230
error. For example, we would get a biased estimate
of the height if our measurement tape were
00:12:53.230 --> 00:12:58.810
actually shorter than what it says on
its scale. With the numbers on the tape not being
00:12:58.810 --> 00:13:05.470
correct, that would be a biased estimator.
So the definition of unbiasedness means that if
00:13:05.470 --> 00:13:12.820
we repeat the study many, many times, then even
if an individual study could be quite incorrect,
00:13:12.820 --> 00:13:20.050
on average those studies would provide us
the correct result. That is important because of
00:13:20.050 --> 00:13:25.900
how science works. So the idea of science and
research is that we accumulate knowledge, so
00:13:25.900 --> 00:13:31.720
we have studies and they're added to the body
of knowledge, and then at some point someone
00:13:31.720 --> 00:13:38.230
looks at a hundred studies and asks, OK, so
what is the average effect of one thing
00:13:38.230 --> 00:13:45.430
on another. If those studies are unbiased or free
of systematic error, then average of multiple
00:13:45.430 --> 00:13:50.710
repeated studies of the same issue provides us
a pretty good estimate of the population value.
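This accumulation idea can be sketched in code: each small study is imprecise, but the average of many unbiased studies lands near the population value (all numbers here are assumptions for illustration):

```python
import random
import statistics

random.seed(4)
# Assumed population: 10,000 heights (cm) with mean near 169.6
population = [random.gauss(169.6, 10) for _ in range(10_000)]

# 100 small, imprecise but unbiased studies of n = 10 each
studies = [statistics.mean(random.sample(population, 10)) for _ in range(100)]

print(f"single studies range from {min(studies):.1f} to {max(studies):.1f} cm")
print(f"average over 100 studies: {statistics.mean(studies):.1f} cm")
```

Individual studies can miss by many centimeters, yet their average sits within a fraction of a centimeter of the population mean, because the errors are not systematic.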
00:13:50.710 --> 00:13:59.410
In reality we often have to work with estimators
that are slightly biased but still consistent;
00:13:59.410 --> 00:14:06.460
sometimes we have multiple unbiased estimators
and we have to make a choice so which estimator
00:14:06.460 --> 00:14:13.900
do we choose? The sample median and sample mean are
both unbiased for this particular scenario. So
00:14:13.900 --> 00:14:19.060
which one do we use, which one is the best? We
have to consider efficiency. So efficiency
00:14:19.060 --> 00:14:28.600
is a property that compares two or more
estimation strategies, and the one that has the
00:14:28.600 --> 00:14:34.660
least variation over repeated samples, so it's
the most precise: its individual estimates are
00:14:34.660 --> 00:14:41.110
expected to be closer to the population value
than with alternative strategies. That is called
00:14:41.110 --> 00:14:49.540
an efficient estimator, and the property is called
efficiency. Then finally we have normality. It's
00:14:49.540 --> 00:14:56.320
useful for statistical inference if the estimates
are normally distributed over repeated samples or
00:14:56.320 --> 00:15:01.780
at least follow some other known distribution.
Why that's important will be discussed a bit
00:15:01.780 --> 00:15:10.540
later. Now, OK, so this is a bit of, let's say,
statistical theory, concepts
00:15:10.540 --> 00:15:17.650
or terms that you may not encounter in empirical
articles. So why is knowing this important? Or is
00:15:17.650 --> 00:15:25.210
it just nice-to-know stuff? This is important for
two reasons. The first reason is that if you study a
00:15:25.210 --> 00:15:32.350
good book about statistical analysis or research
methods, you will see these terms and unless you
00:15:32.350 --> 00:15:37.540
know what these terms refer to, it's difficult
to understand what you're reading. The second
00:15:37.540 --> 00:15:44.500
thing is that in regression analysis, which
is a pretty basic tool that we'll talk about later, the
00:15:44.500 --> 00:15:49.480
choice of regression analysis is pretty obvious
in certain scenarios, but in other scenarios you
00:15:49.480 --> 00:15:54.040
have different competing options that you could
choose and there are trade-offs. So you could
00:15:54.040 --> 00:16:00.430
use an estimator that is very inefficient but
unbiased, or you could have a slightly biased
00:16:00.430 --> 00:16:06.850
but efficient estimator. So which one do you
choose? You have to understand these concepts to
00:16:06.850 --> 00:16:14.680
make choices. Let's take a look at an example. So
here is the height example again and we have six
00:16:14.680 --> 00:16:20.530
estimation strategies. We have the
sample mean that we discussed. We have the
00:16:20.530 --> 00:16:26.470
sample median, which is an OK strategy: take the
person in the middle and measure their height. We
00:16:26.470 --> 00:16:31.570
have the height of the first observation, which,
if you're really in a hurry, is sometimes
00:16:31.570 --> 00:16:38.560
a fast way of estimating things. Then we have
our three completely made-up strategies. One
00:16:38.560 --> 00:16:44.800
is absolute value of the sample mean around
the population value. So I'm just using that
00:16:44.800 --> 00:16:51.370
to get that kind of shape and we have sample
mean plus 100 divided by sample size. So this
00:16:51.370 --> 00:16:58.120
is an unreasonable strategy as well. And then
we have this random guess between 140 and 200.
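The six strategies can be written down as small functions. This is a sketch: the function names here are made up, and the two artificial estimators assume a known population mean of 169.6 cm, as on the slide:

```python
import random
import statistics

POP_MEAN = 169.6  # assumed known population mean (cm), used by the made-up strategies

def sample_mean(s):
    return statistics.mean(s)

def sample_median(s):
    return statistics.median(s)

def first_observation(s):
    return s[0]

def abs_around_pop_mean(s):
    # the sample mean's deviation, folded above the population value
    return POP_MEAN + abs(statistics.mean(s) - POP_MEAN)

def mean_plus_100_over_n(s):
    return statistics.mean(s) + 100 / len(s)

def random_guess(s):
    return random.uniform(140, 200)

random.seed(5)
sample = [random.gauss(POP_MEAN, 10) for _ in range(10)]
for est in (sample_mean, sample_median, first_observation,
            abs_around_pop_mean, mean_plus_100_over_n, random_guess):
    print(f"{est.__name__:20s} -> {est(sample):.1f} cm")
```

Writing them out makes the later comparisons concrete: the folded estimator can never fall below the population value, the "+100/n" term shrinks as n grows, and the random guess ignores the sample entirely.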
00:16:58.120 --> 00:17:08.050
So consistency. Do these estimators get better
as sample size increases? For the sample mean
00:17:08.050 --> 00:17:15.910
obviously yes, it will get better, so it's
consistent: we can see that these estimates get
00:17:15.910 --> 00:17:21.910
closer and closer to the population value as the
sample size increases. Same thing here: the absolute
00:17:21.910 --> 00:17:28.630
value of the sample mean around the population value
is not a very good estimator, because the estimates
00:17:28.630 --> 00:17:33.610
are systematically too large, but if you increase the
sample size, these estimates will get better.
00:17:33.610 --> 00:17:37.630
They're still pretty bad,
still systematically too large,
00:17:37.630 --> 00:17:46.480
but they will get better. The first observation
is inconsistent because sample size
00:17:46.480 --> 00:17:53.560
has no effect. Consistency is about whether
things improve as the sample size increases, and the
00:17:53.560 --> 00:18:00.700
first observation doesn't really improve: the number of
observations in our sample doesn't really influence
00:18:00.700 --> 00:18:08.560
the height of the first person in the
class. Sample median is consistent. Sample mean
00:18:08.560 --> 00:18:18.100
plus 100 divided by sample size is consistent if
the population size is indefinitely large. So
00:18:18.100 --> 00:18:25.450
if the population is very, very large, this term goes
to zero and the values go to the actual population
00:18:25.450 --> 00:18:32.350
value. Then we have a guess between 140 and 200,
and that is inconsistent because it doesn't depend
00:18:32.350 --> 00:18:41.050
on the sample size. So we have our four consistent
estimators and two inconsistent ones. The next
00:18:41.050 --> 00:18:48.370
property was unbiasedness. The sample mean is unbiased
because we can see that these observations are
00:18:48.370 --> 00:18:53.950
equally spread out around the population value. So
if we take a mean regardless of the sample size,
00:18:53.950 --> 00:19:01.030
we will get the correct population value. The absolute
value of the sample mean is biased. We see
00:19:01.030 --> 00:19:06.820
that these estimates are
systematically too large, and that is
00:19:06.820 --> 00:19:12.550
the definition of biasedness: there's systematic
error in the estimates. The first observation is actually
00:19:12.550 --> 00:19:19.900
unbiased, because even though this is a really
bad way of estimating things, because it doesn't
00:19:19.900 --> 00:19:26.890
improve with sample size, it is unbiased because
on average repeated estimates are correct at the
00:19:26.890 --> 00:19:32.020
population value. In reality it's very difficult
to come up with a scenario where there is an
00:19:32.020 --> 00:19:37.270
unbiased estimator that is inconsistent and that
will still be useful, because typically if we
00:19:37.270 --> 00:19:45.250
have an unbiased estimator it is also consistent.
Then the sample median is unbiased: it's correct on
00:19:45.250 --> 00:19:53.860
average. Sample mean plus 100 divided by sample
size is biased, systematically too large, and this
00:19:53.860 --> 00:20:02.630
last one is slightly biased. You can't see it from
here, but it is. Then we have efficiency. The sample mean is
00:20:02.630 --> 00:20:11.420
efficient and we compare efficiency against other
unbiased estimators. So sample mean is certainly
00:20:11.420 --> 00:20:17.240
more precise than taking the first observation. So
that's very clear. The difference is very clear.
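The efficiency comparison can be sketched by simulation: draw many repeated samples and compare how widely each unbiased estimator's results spread (synthetic normal population with assumed parameters):

```python
import random
import statistics

random.seed(6)
# Assumed normal population of heights (cm)
population = [random.gauss(169.6, 10) for _ in range(10_000)]

estimators = {
    "sample mean": statistics.mean,
    "sample median": statistics.median,
    "first observation": lambda s: s[0],
}

# Dispersion of each estimator over 5,000 repeated samples of size 30
spread = {}
for name, est in estimators.items():
    results = [est(random.sample(population, 30)) for _ in range(5_000)]
    spread[name] = statistics.stdev(results)
    print(f"{name:18s}: SD over repeated samples = {spread[name]:.2f} cm")
```

All three are unbiased here, but the sample mean disperses the least, the median a bit more, and the first observation by far the most, which is the efficiency ordering described in the lecture.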
00:20:17.240 --> 00:20:24.950
So this is spread widely here and here they are
much closer to the population value. Sample mean
00:20:24.950 --> 00:20:31.310
is also more precise than the sample median. You
can't see it with the naked eye, since the difference is
00:20:31.310 --> 00:20:38.600
very small, but the sample mean has been proven to be,
for this particular case and generally, more
00:20:38.600 --> 00:20:48.500
efficient. Then the absolute value of the sample mean
is actually slightly more precise than the sample
00:20:48.500 --> 00:20:55.640
mean, so these estimates are less dispersed,
but because this is a biased estimator,
00:20:55.640 --> 00:21:02.360
considering its efficiency doesn't make much
sense; it is efficient but it doesn't
00:21:02.360 --> 00:21:09.350
really count. The first observation is inefficient because its estimates
spread widely compared to others. Sample median is
00:21:09.350 --> 00:21:16.640
inefficient because the sample mean is better. Then
sample mean plus 100 divided by n is
00:21:16.640 --> 00:21:24.500
equally efficient as the sample
mean, because the dispersion of its distribution
00:21:24.500 --> 00:21:29.150
is the same, but it's biased. So comparing
the efficiency doesn't really make sense,
00:21:29.150 --> 00:21:34.670
if we compared the efficiency of these
two biased estimators, then this one would be
00:21:34.670 --> 00:21:40.220
inefficient. But again, the comparison of
the efficiency of biased estimators doesn't
00:21:40.220 --> 00:21:45.230
make much sense. And the random guess is inefficient
because its estimates are spread out quite widely.