It's very common in research that we're interested in a large number of people or organizations. For example, in political polling it's usually interesting to know the popularity of a political party. However, if we consider popularity at the national level, then measuring everybody's opinion would involve calling millions of people, and that's in most cases impractical. Instead, what we do is take a smaller number of people, called a sample. For example, we call 300 people or 1,000 people, we ask their opinions about political parties, and we use that sample to calculate an estimate of the population popularity of that political party. If the sample is well chosen and if it's large enough, then the popularity calculated from the sample gets very close to the actual population popularity.

Another thing we do in polling is tell the readers of our poll how certain we are about the result. To do so we present the margin of error. Let's say that the political party's popularity is 21% plus or minus 1 percentage point. That degree of uncertainty is quantified by the standard error.

I'm going to go through these four concepts next, so let's take an example. Let's assume that we have a university with, say, 10,000 students and staff, and we want to calculate the mean height of the people affiliated with the university, including students and staff. There are different ways of doing that, but first we have to understand the basic concepts. Here our population is everyone who is enrolled at or employed by the university. The actual list of people who have been admitted and the list of people who are employed, available from the university administration, forms our sampling frame, which is the operational definition: the actual list of people that we think belong to our population. Then we take a random sample, and from that random sample we hope that we can learn something about the population. We could of course take other kinds of samples as well, but for now we'll just talk about random samples, because that simplifies things a lot.

So let's go to the example now. We have different strategies for measuring, or estimating, the mean height of people at the university. One obvious strategy is to take a small sample of people and measure everybody's height, take the sum, and divide it by the number of people, which gives us the average height of the sample, or the sample mean of the height. So we can do that, and here's some data.
So we have a hypothetical university with a population mean height of 169.96 centimeters, and we have five samples, each with a sample size of ten people and their measured heights. Here we can see that some people are shorter than average, some people are taller than average, and some people are very tall. In the first sample the sample mean is 161, so we underestimate the population value by about 8 centimeters. The second sample gives 169.56, which is very close to the actual population value. The third random sample gives us 173, which overestimates the population value. Then we have 163, which underestimates again, and 168, which is close to the true population mean.

Now the question is: why do these values differ? Why do we get a different estimate from each sample? That is because in a random sample it sometimes happens that tall people get selected more often than short people, and sometimes we randomly select more short people than tall people. So the estimate varies from sample to sample, and this is called the sampling variance of an estimator. Estimator here means any strategy that we can apply to data to calculate an estimate.

So these estimates vary from sample to sample, and now two questions arise: how do we make the estimates more precise, and how do we quantify the uncertainty? The first matters because if the population value is 169 and our estimates vary between 161 and 173, that's quite imprecise. The second matters because if we just say that we estimate the mean height to be 161, that's quite an irresponsible thing to do: we are not telling our audience that our sample size is so small that the estimate is very imprecise. Recall my example from political polling: when you see a poll number, there's always a margin of error attached to that particular point estimate of popularity.
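To see this sampling variance concretely, here is a minimal simulation sketch in Python with NumPy. The population here is made up for illustration (a normal distribution with a mean of 169.96 cm and a standard deviation of 10 cm is my assumption, not the lecture's actual data): every random sample of ten people produces a different sample mean.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical population of 10,000 people; the normal shape and the
# 10 cm standard deviation are assumptions made for this sketch.
population = rng.normal(loc=169.96, scale=10.0, size=10_000)

# Five independent random samples of 10 people each, as on the slide:
# each one yields a different estimate of the population mean.
for i in range(5):
    sample = rng.choice(population, size=10, replace=False)
    print(f"sample {i + 1}: mean = {sample.mean():.2f} cm")

print(f"population mean: {population.mean():.2f} cm")
```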
Let's take a look at the effect of sample size. One obvious strategy for making the estimates we calculate from a sample better is to increase the sample size. Here is a distribution of 10,000 random samples from our population using a sample size of 10. Typically we get estimates that are close to the correct population value, but sometimes we get estimates that are way too small and sometimes estimates that are way too large. Once we increase the sample size to 50, this red line here, we can see that the estimates from repeated samples are now distributed within plus or minus about seven of the population value, so the estimates are more precise than what we got from ten observations. If we further increase the sample size to 200, we get plus or minus 3 centimeters around the population mean, so our precision increases. And if we have the full population, then we have the full population value. So when we work with a sample, our estimates typically improve as the sample size increases. That's referred to as the consistency property of an estimator, which I will talk about in the next slide.

Then there's the other thing besides precision: we have to quantify the uncertainty, and to quantify the uncertainty we have to quantify the dispersion. The question of uncertainty quantification is this: if we were to repeat the study over and over again, how much would the estimates vary from sample to sample? So we want to quantify the sampling variance of the estimate, that is, how widely the different estimates are dispersed. Remember that we have two statistics that quantify dispersion: the standard deviation and the variance. For estimates we are typically interested in the standard deviation, because it is in the same metric as the estimate itself. So if the estimate is 160 centimeters, we can say that the standard error is plus or minus 5 centimeters.

The standard error is an estimate of what the standard deviation of repeated samples from the same population would be. Of course we would ideally want to calculate the actual standard deviation of those 10,000 replications, but consider for example political polling: if you were asked to provide the standard deviation of the same poll repeated 10,000 times, you would have to actually do the 10,000 replications to be able to calculate that standard deviation, and that's not a practical thing to do. Therefore we use the standard error, which is an estimate of this standard deviation. In the same way as the sample mean is an estimate of the population mean, the standard error is an estimate of the standard deviation of the sample mean over repeated samples. How the standard error is calculated is not relevant at this point; you just have to understand that it quantifies the dispersion of the same study if it were repeated over independent random samples.
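Although the formula isn't needed yet, a hedged aside may make the idea concrete: for the sample mean, the usual standard error estimate is the sample standard deviation divided by the square root of the sample size. The sketch below (same made-up normal population as in the previous sketch) compares the "impractical" route of actually repeating the study 10,000 times against that single-sample formula.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
population = rng.normal(loc=169.96, scale=10.0, size=10_000)

n = 50  # sample size per replication

# Impractical route: repeat the study 10,000 times and compute the
# standard deviation of the resulting sample means directly.
means = [rng.choice(population, size=n, replace=False).mean()
         for _ in range(10_000)]
print(f"SD of 10,000 sample means:      {np.std(means):.3f}")

# Practical route: one sample, standard error = s / sqrt(n).
one_sample = rng.choice(population, size=n, replace=False)
se = one_sample.std(ddof=1) / np.sqrt(n)
print(f"standard error from one sample: {se:.3f}")
```

Both numbers should land near each other, which is the whole point: the standard error approximates the dispersion of the replications without our having to run them.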
Let's get back to our task. So far we have only discussed the sample mean, and taking the mean of a sample is an obvious strategy if we want to estimate the population mean. But it's not the only strategy. If you take a sample of, let's say, 30 people in a class and you measure everybody's height, that takes some time. If time or effort is an issue for you, we could for example just take one person from the class and measure their height. If we get 160 centimeters, then that's a ballpark estimate. It's not very precise, but it's an estimate nevertheless, it's valid in some sense, and it's easy to calculate. The downside of this alternative strategy is of course that we would be omitting the other 29 people from our sample of 30, so it's not a good strategy.

Another quick strategy for calculating an estimate of the height is to allow people to self-organize into a line. We tell them that the shortest person goes to the back of the class, the tallest person goes to the front of the class, and everyone else goes in between those two people, ordered by their height. People can self-organize that way pretty quickly. Then we just go and measure the height of the person in the middle. That's the sample median, and it's an OK strategy: it estimates the population mean under certain conditions, for example when the distribution of heights is symmetric.

So there are different ways of calculating an estimate of the population mean: we could use the sample mean, we could use the height of the first person that we see in the class, or we could use the median of the people in the class. Which strategy should be used in this case? The mean is the best here, but to make an informed choice of which is the most preferable, we first have to define what "best" means. Every time we say that something is the best, we have some kind of criterion. For example, the best ice hockey team is the one that won the most matches, the best runner is the one with the smallest time, and the best student in the class is the one with the highest grade. So when we say that something is the best, we have to have some criteria, and that means we have to talk about the different properties that these estimation strategies could have when we decide which one is the best.
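As a small sketch, here is what the three strategies look like on one invented class sample (the 30 heights are simulated, since the lecture doesn't give the raw class data):

```python
import numpy as np

rng = np.random.default_rng(seed=3)
# One made-up sample: 30 measured heights in centimeters.
heights = rng.normal(loc=169.96, scale=10.0, size=30)

print(f"sample mean:       {heights.mean():.1f} cm")      # measure everyone
print(f"first observation: {heights[0]:.1f} cm")          # measure one person
print(f"sample median:     {np.median(heights):.1f} cm")  # middle of the line
```

All three are estimates of the same population mean; the question the lecture turns to next is how to judge which of them is best.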
So estimators can have certain properties. Estimator refers to, again, any strategy or any calculation that you apply to your sample to get one value that is an estimate of the population value. One minimum quality that a useful estimator must have is that it must be consistent. Consistency means that if we increase the sample size, then our estimates get better. The sample mean is a consistent estimator because it improves with sample size. Consistency also requires that if we have the full population and we apply our calculation strategy to it, then we get the correct population result. So consistency guarantees that a study gets better as its sample size increases. Of course, in reality we can't study full populations because of cost issues; we have to rely on samples, and therefore there are other things we need to consider besides consistency.

The second important property is unbiasedness. If an estimator is unbiased, it means that it is free of systematic error. For example, a biased estimate of the height would result from a measurement tape that is actually shorter than what its scale says: if we believed the numbers on the tape, every measurement would be systematically too large, and that would make a biased estimator. The definition of unbiasedness is that if we repeat the study many, many times, then even if an individual study could be far off, on average those studies provide us the correct result. That is important because of how science works. The idea of science as a process is that we accumulate knowledge: studies are added to the body of knowledge, and at some point someone looks at a hundred studies and asks what the average effect of one thing on another is. If those studies are unbiased, that is, free of systematic error, then the average of multiple repeated studies of the same issue provides a pretty good estimate of the population value.

In reality we often have to work with estimators that are slightly biased but still consistent. Sometimes we also have multiple unbiased estimators and we have to make a choice between them. The sample median and the sample mean are both unbiased in this particular scenario, so which one do we use, which one is the best? For that we have to consider efficiency. Efficiency is a property that compares two or more estimation strategies: the one that has the least variation over repeated samples, so that it is the most precise and its individual estimates are expected to be closer to the population value than those of the alternative strategies, is called an efficient estimator, and the property is called efficiency.

Then finally we have normality. It is useful for statistical inference if the estimates are normally distributed over repeated samples, or at least follow some other known distribution. Why that's important will be discussed a bit later.
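A sketch of the unbiasedness idea, using an invented systematic error (the faulty tape and its 2 cm offset are my parameterization, not from the slides): averaging many replications washes out random sampling error but never the systematic part.

```python
import numpy as np

rng = np.random.default_rng(seed=4)
population = rng.normal(loc=169.96, scale=10.0, size=10_000)

honest, faulty = [], []
for _ in range(10_000):
    sample = rng.choice(population, size=10, replace=False)
    honest.append(sample.mean())        # random error only
    faulty.append(sample.mean() + 2.0)  # tape reads 2 cm too high

# Any single study can be far off, but only the systematic error
# survives averaging over repeated studies.
print(f"average of honest estimates: {np.mean(honest):.2f}")
print(f"average of faulty estimates: {np.mean(faulty):.2f}")
print(f"population mean:             {population.mean():.2f}")
```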
Now, OK, this is a bit of, let's say, statistical theory, concepts and terms that you may not encounter in empirical articles. So why is knowing this important, or is it just nice-to-know stuff? It is important for two reasons. The first reason is that if you study a good book about statistical analysis or research methods, you will see these terms, and unless you know what they refer to, it's difficult to understand what you're reading. The second reason concerns regression analysis, which is a pretty basic tool that we'll talk about later. The choice of regression estimator is pretty obvious in certain scenarios, but in other scenarios you have different competing options that you could choose, and there are trade-offs. You could use an estimator that is very inefficient but unbiased, or you could have a slightly biased but efficient estimator. Which one do you choose? You have to understand these concepts to make such choices.

Let's take a look at an example. Here is the height example again, and we have six estimation strategies. We have the sample mean that we already discussed. We have the sample median, which is an OK strategy: take the person in the middle of the line and measure their height. We have the height of the first observation, which, if you're really in a hurry, is a fast way of estimating things. Then we have three completely made-up strategies. One is the absolute value of the sample mean around the population value; I'm just using that to get that kind of shape. Another is the sample mean plus 100 divided by the sample size, which is an unreasonable strategy as well. And then we have a random guess between 140 and 200.
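Written as code, the six strategies might look like the sketch below. The three made-up estimators are reconstructed from their verbal descriptions on the slide, so treat the exact formulas as my reading rather than a confirmed source:

```python
import numpy as np

MU = 169.96  # true population mean, known here only because the data are simulated
rng = np.random.default_rng(seed=5)

def sample_mean(x):       return np.mean(x)
def sample_median(x):     return np.median(x)
def first_observation(x): return x[0]

def abs_mean_around_mu(x):
    # Folds the sample mean's deviation upward, so estimates are always >= MU.
    return MU + abs(np.mean(x) - MU)

def mean_plus_100_over_n(x):
    # Adds a systematic 100/n that shrinks as the sample grows.
    return np.mean(x) + 100 / len(x)

def random_guess(x):
    # Ignores the data entirely.
    return rng.uniform(140, 200)

sample = rng.normal(MU, 10.0, size=50)
for est in (sample_mean, sample_median, first_observation,
            abs_mean_around_mu, mean_plus_100_over_n, random_guess):
    print(f"{est.__name__:22s} -> {est(sample):.2f}")
```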
So, consistency: do these estimators get better as the sample size increases? For the sample mean, obviously yes; it is consistent, and we can see that the estimates get closer and closer to the population value as the sample size increases. The same goes for the absolute value of the sample mean around the population value. It's not a very good estimator, because its estimates are systematically too large, but if you increase the sample size, the estimates do get better; they're still pretty bad, still systematically too large, but they improve. The first observation is inconsistent, because sample size has no effect on it: consistency is about whether things improve as the sample size increases, and the number of observations in our sample doesn't influence the height of the first person in the class. The sample median is consistent. The sample mean plus 100 divided by the sample size is consistent if the population size is indefinitely large: if the population is very, very large, the 100/n term goes to zero and the values go to the actual population value. Then we have the guess between 140 and 200, and that is inconsistent because it doesn't depend on the sample size at all. So we have four consistent estimators and two inconsistent ones.

The next property was unbiasedness. The sample mean is unbiased, because we can see that the estimates are equally spread out around the population value; if we take the mean of the estimates, regardless of the sample size, we get the correct population value. The absolute value of the sample mean around the population value is biased: its estimates are systematically too large, and that is the definition of bias, systematic error in the estimates. The first observation is actually unbiased: even though it is a really bad way of estimating things, because it doesn't improve with sample size, on average its repeated estimates are correct at the population value. In reality it's very difficult to come up with a scenario where an unbiased but inconsistent estimator would still be useful, because typically if we have an unbiased estimator, it is also consistent. The sample median is unbiased, correct on average. The sample mean plus 100 divided by the sample size is biased, systematically too large. And the random guess between 140 and 200 is slightly biased: you can't see it from the figure, but it is, since such a guess averages 170 rather than the population mean of 169.96.
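The mean-plus-100/n strategy is the clearest illustration of "biased but consistent", and a short sketch can show it: at every fixed sample size the average estimate is too high by exactly 100/n, but that systematic part shrinks toward zero as n grows.

```python
import numpy as np

MU = 169.96
rng = np.random.default_rng(seed=6)

for n in (10, 50, 200, 1000):
    estimates = [rng.normal(MU, 10.0, size=n).mean() + 100 / n
                 for _ in range(10_000)]
    print(f"n={n:5d}: average estimate = {np.mean(estimates):.2f} "
          f"(bias = 100/n = {100 / n:.2f})")
```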
Then we have efficiency, and we compare efficiency against the other unbiased estimators. The sample mean is efficient. It is certainly more precise than taking the first observation; that difference is very clear, since the first-observation estimates are spread widely while the sample means are much closer to the population value. The sample mean is also more precise than the sample median. You can't see that with the plain eye, the difference is very small, but it has been proven, both for this particular case and more generally. The absolute value of the sample mean around the population value is actually slightly more precise than the sample mean, its estimates are less dispersed, but because it is a biased estimator, considering its efficiency doesn't make much sense; it is efficient, but that doesn't really count. The first observation is inefficient, because its estimates spread widely compared to the others. The sample median is inefficient, because the sample mean is better. The sample mean plus 100 divided by n is efficient, or rather equally efficient as the sample mean, because its dispersion is the same, but it is biased, so again comparing its efficiency doesn't really make sense; if we compared the efficiency of these two biased estimators with each other, then this one would be the inefficient one. But the comparison of efficiency between biased estimators doesn't make much sense. And finally, the random guess is inefficient, because its estimates are spread out quite widely.
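To close, a sketch of the efficiency comparison between the two unbiased strategies: over repeated samples from a normal population, the sample mean's estimates are less dispersed than the sample median's (for normal data the median's sampling standard deviation is known to be roughly 25% larger, which is why the difference is hard to see by eye on the slide).

```python
import numpy as np

rng = np.random.default_rng(seed=7)

means, medians = [], []
for _ in range(10_000):
    sample = rng.normal(169.96, 10.0, size=50)
    means.append(sample.mean())
    medians.append(np.median(sample))

# Smaller spread over repeated samples = the more efficient estimator.
print(f"SD of sample means:   {np.std(means):.3f}")
print(f"SD of sample medians: {np.std(medians):.3f}")
```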