Normally in a research study, we cannot study full populations because of practical issues, so we have to rely on a sample. When we take that sample, there are multiple things that we need to consider, and a number of things that can go wrong, producing results that are either biased or inefficient. So let's take a look at some issues related to sampling.

First, we have a population. That population is the thing that we want to study, so we want to say something about the population. Let's say we are studying that population using a survey that we mail to companies. To send out the invitations to participate, we have to have an address for every company, so we need some kind of operational definition of our population. We call that operational population a sampling frame. The sampling frame is an actual list of companies or people or whatever things we are studying, while the population is the conceptual definition of the thing we are studying. If we are studying individuals in Finland, for example, the sampling frame could come from the population register, and it could contain millions of people.
From the sampling frame, we then actually take the sample. Typically we choose people randomly; the random sample is the simplest way of taking a sample, and it is often the most desirable way as well. Then we send out our survey, some companies choose to participate and others choose not to, and we get an actual dataset that we can work with. Now a number of things can go wrong, and we have to take those into consideration.

Let's take an example of what this framework means. Say we are studying the population of young Finnish high-technology companies. That is a conceptual definition; we then need an operational definition of what it means to be a young company and what it means to be a technology firm. No one maintains a list of technology companies, so we have to operationalize the concept in a way that we can actually get data for. The sampling frame could be, for example, business IDs, that is, registered corporations. A business ID is not the same thing as a company: one organization can have multiple business IDs.
But we have to have some kind of operational definition that we can actually get data for, and we can get data on the business IDs, or legal entities, behind these companies. So let's define young technology companies as companies that are zero to three years old and have certain industry codes, for example 62 or 72, which correspond to information technology industries. That is our operational definition, and it allows us to get a list of actual companies.

Then we take a sample. Let's say we randomly select a thousand firms from a list of maybe five or ten thousand companies, whatever the sampling frame is. The reason for taking a sample here is cost: whenever we email or mail, address acquisition costs money or effort, and if we mail physical letters, there are printing costs. Then we get the actual data; for example, ten percent of the informants that were invited to participate decide to respond to the survey.

So what can go wrong with this kind of design? There are multiple things. The relevant question with the sampling frame is: does our operational definition of the population match the conceptual one?
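As a sketch of this step, here is how the operational definition might be applied to build a sampling frame and then draw a random sample from it. The firm records and field names are hypothetical, purely for illustration; a real frame would come from a business register.

```python
import random

# Hypothetical register extract; in practice this would come from
# a business register, not a hard-coded list
firms = [
    {"business_id": "1234567-8", "age": 2, "industry": 62},
    {"business_id": "2345678-9", "age": 10, "industry": 62},
    {"business_id": "3456789-0", "age": 1, "industry": 72},
    {"business_id": "4567890-1", "age": 3, "industry": 10},
]

# Operational definition: 0-3 years old AND industry code 62 or 72
frame = [f for f in firms if f["age"] <= 3 and f["industry"] in (62, 72)]

# A simple random sample is then just a random draw from the frame
random.seed(0)
sample = random.sample(frame, k=min(len(frame), 1000))
```

The conceptual population never appears in the code at all, which is exactly the point: everything downstream operates on the frame, so any mismatch between frame and population is baked in from here on.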
Does the frame match the population? The second question is: how large is the sample size? If the sample is randomly chosen, the only thing we can decide is how many observations we get, so is a thousand enough? When we plan for sample size, we have to take the expected response rate into consideration. If we expect a ten percent response rate and we need five hundred full responses for our analysis, then we should send out the invitation to five thousand companies, so we would have five thousand randomly chosen firms instead of one thousand.

The most problematic part is that the people, or companies, who decide to respond may not be randomly chosen. If, out of the thousand companies invited to participate, a random 10% respond, that only means that we have inefficiency: increasing the sample size would make our estimates more precise, but that's it. A more problematic condition occurs if this 10 percent is chosen systematically, because that leads to biased results.
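The arithmetic of planning for non-response can be captured in a small helper. This is a sketch and the function name is my own, not a standard one:

```python
import math

def invitations_needed(target_responses, expected_response_rate):
    """How many invitations to send so that the expected number of
    responses reaches the target, rounding up to a whole invitation."""
    return math.ceil(target_responses / expected_response_rate)

# 500 needed responses at a 10% expected response rate -> 5000 invitations
print(invitations_needed(500, 0.10))
```

Note that this only fixes the expected number of responses; it does nothing about who responds, which is the bias problem discussed next.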
For example, if our survey was about innovativeness, and companies that are more innovative are more likely to participate, then any regression analysis involving innovation as a dependent variable would produce biased results.

Let's take a look at why that happens. This is a classic example from Berk's 1983 paper, demonstrating the relationship between education and income: when education increases, income goes up as well, a linear relationship. What will happen if people with low income either don't provide data, or people who don't have much education simply decide not to work? Let's set a threshold: no one below this income level actually provides us data. If we eliminate that part of the data, what will happen to our regression estimates? There are two things that will happen.
First, the regression results will be biased, because now we are fitting the regression to the remaining data, and we have cut exactly the observations that produce negative residuals; they are negative residuals because they are mostly below the regression line. With the negative residuals excluded and the positive residuals included, the regression line is pulled up, so the regression results will be biased. Second, we have no idea what the effect looks like in the excluded group of low-income, low-education people; and even for those people for whom we have data, the results are biased. So if our sample is selected systematically based on the variable that we study, then our results will be biased, and the magnitude of the bias can be great in some instances.

This is not just an academic concern; I will next demonstrate a couple of examples. There is a widely known business book called Good to Great by Jim Collins. It has sold millions of copies and provided inspiration for lots of managers; it also received great attention in Finland when it was first translated into Finnish. So many people think this is a valuable book. And how was the book written?
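A small simulation makes the bias concrete. All numbers here are made up; this is a sketch of the truncation setting Berk describes, not his data:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
education = rng.uniform(8, 20, n)                # years of schooling
income = 2.0 * education + rng.normal(0, 5, n)   # true slope is 2.0

# Slope estimated on the full data recovers roughly 2.0
slope_full = np.polyfit(education, income, 1)[0]

# Truncate from below: low earners drop out of the sample, which
# removes mostly negative residuals at low education levels
keep = income > 30
slope_truncated = np.polyfit(education[keep], income[keep], 1)[0]

print(slope_full, slope_truncated)
```

Running this, the truncated slope comes out clearly flatter than the full-sample slope: the fitted line is pulled up at the low-education end, exactly as the figure from Berk's paper illustrates.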
Well, there is a slight problem. It is presented as an academic study, and it kind of is, but there are methodological problems with this book. In the book, the authors basically chose a large number of good companies and, based on some accounting measures, followed the performance of those companies for 40 years with a research team. They found 11 companies that were initially good companies and then became great companies, according to the definitions that the authors use. Jim Collins is the first author, but he had a team of researchers helping him write the book. So they chose 11 companies that performed extremely well, studied what made those companies perform that well, asked why these companies performed better than others, and then wrote a book about it.

There are two problems with that. First, if you choose companies that happened to be great in the past, then you are sampling on the dependent variable. If a company happens to be good for a chance reason, it will get selected, or at least some of these companies could get selected because of chance.
And indeed, when other researchers looked at these companies over the following 15-year period, only one of the eleven remained great. So we can attribute the choice of these eleven companies to a chance explanation. Second, when a company is performing well, people start to attribute that performance to something that the company did. That is called the halo effect. When you identify companies that are doing well and then ask people to evaluate why those companies are doing well, people answer 'well, they are doing well because of something they did in the past'. It is also possible that these companies just happened to be lucky, and the fact that only one out of eleven stayed great over the 15-year period under study underlines that luck is the likely explanation. These companies happened to be great for reasons unknown, and then people attributed that greatness to something the companies did. So this design does not provide evidence of causality.

Let's take another example, this one from Morgan and Winship's book on causal inference.
They describe a hypothetical college where entry depends on the SAT exam, basically the American high school exit exam, and a motivation score that is measured somehow. The motivation score and the SAT score are weakly and positively dependent on each other, and college entry depends on both of them. In the data, plotted with the SAT score on one axis and the motivation score on the other, some applicants were not accepted to the college and the ones marked with circles were admitted: the sum of the SAT score and the motivation score determines who gets to go to this hypothetical college. There is a weak positive relationship between the two scores, around 0.1 correlation, but it is not visible to the naked eye.

What happens if we measure the correlation only from those people who were admitted to the college? If we only observe students who got into the college, there is a strong negative correlation. We get a strong negative correlation because we only studied those who were admitted. A smart principal of this college might ask whether that result also replicates among the students who were not accepted, and it does: you get the same negative result there.
This negative result has very little to do with the actual relationship between motivation and SAT score; instead, it is a function of how we selected the sample. If we choose the sample so that the sum of the motivation score and the SAT score must be above or below a threshold, then we get this kind of negative correlation simply because of the selection effect. This is called the selection effect, and the outcome is selection bias. Whenever you take a sample, unless you are careful that your sample is actually a random sample of the population under study, you risk having selection bias in your analysis, and the bias can be great.

Let's take another, really practical example. I went to the building fair in Vantaa a couple of years ago, and a construction company was presenting an idea called a container home. It is a small home, the size of a shipping container, and these can be built as condominiums. The idea is that you can increase the density of housing by having these very small apartments. They wanted to get feedback on the idea. So how was the feedback collected?
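The college example is easy to reproduce by simulation. This is a sketch with made-up numbers, following the structure of the Morgan and Winship example rather than their exact figures:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# SAT and motivation scores: standard normal, correlation about 0.1
scores = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.1], [0.1, 1.0]], size=n)
sat, motivation = scores[:, 0], scores[:, 1]

# Admission depends on the SUM of the two scores exceeding a cutoff
admitted = sat + motivation > 1.5

corr_all = np.corrcoef(sat, motivation)[0, 1]
corr_admitted = np.corrcoef(sat[admitted], motivation[admitted])[0, 1]

print(round(corr_all, 2), round(corr_admitted, 2))
```

In the full sample the correlation hovers around the true +0.1, but among the admitted students it flips strongly negative: conditioning on the sum means a student with a low SAT score can only be in the sample if their motivation score is high, and vice versa.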
They had a polling station where you could indicate whether you agree or disagree with the statement that the container home is a good idea. The way it was set up, you walked along the road, and the container home was on the side of the road, so you could choose to just walk by, or you could choose to go in. If you went in, you walked through the apartment to the balcony, and that is where the polling station was.

So what is the problem; why could that produce a selection effect? Of course, people who are not interested at all, who think this is a stupid idea, will just walk past the container home, and they will never see the polling station, which is behind it. You actually have to show enough interest to walk all the way through the container home, and only after you have seen the home do you give an opinion. The counter-argument to this selection bias is that you only want responses from people who have actually seen what it looks like inside. But that is not as important as the fact that people who think it is a stupid idea in the first place will just walk by without providing any data.
So this is an introduction to issues related to sampling, and there are multiple techniques that you can apply. These selection effects can be modelled, and you can also do sampling in many different ways to increase your efficiency and to avoid the risk of selection bias. There are other sampling techniques as well: if you are a Stata user, Stata has a separate user manual for survey data that discusses different sampling designs, and here are some references that you may be interested in.

The typical sample in a statistics book is a random sample, and that is also what I will be covering in this course, because assuming that the sample is random simplifies things a lot. The second kind of sample that is very common is a cluster sample. A cluster sample is a sample where the observations are no longer equally likely to be selected; a random sample, by contrast, is defined as a sample where each observation in the population is equally likely to be selected. A cluster sample refers to a scenario where you, for example, have to interview people at their homes. If you take a random sample of all Finnish households, then you will have to travel all over Finland to get your data.
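The interview-at-home scenario is typically handled with a two-stage cluster draw: sample a few cities first, then households within each sampled city. A minimal sketch, with invented household lists and counts:

```python
import random

random.seed(2)

# Hypothetical frame: households grouped by city (the city is the cluster)
households = {city: [f"{city}-{i}" for i in range(100)]
              for city in ["Helsinki", "Tampere", "Turku", "Oulu", "Vantaa"]}

# Stage 1: draw 2 of the 5 cities at random
cities = random.sample(sorted(households), 2)

# Stage 2: draw 20 households within each drawn city
sample = [h for city in cities for h in random.sample(households[city], 20)]
```

Households in the three undrawn cities have zero chance of selection in this realization, and two households in the same drawn city are selected together, which is exactly the clustering of selection probabilities described below.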
So in practice, we choose a couple of cities, from those cities a couple of streets, and we then sample people from those streets, or just interview everyone on those streets. We take samples from clusters. If your neighbours are interviewed, then it is more likely that you are interviewed as well: the probability of being selected is clustered, so if you live close to people who are likely to be selected, you are a lot more likely to be selected as well. A cluster sample causes some problems that we will talk about later.

One way that we can deal with the cluster sampling issue is called a stratified sample, or a stratified random sample. A stratified random sample concerns situations where you have, for example, an uneven distribution of people, or the cluster sample issue. Let's say we have a school with 300 students, out of which 30 are minority students. In that scenario, taking a random sample of 50 students is likely to give you a very small number of minority students. So it makes sense to sample separately from the minority students and separately from the other students, so that you get a sample that is better for your study.
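The school example can be sketched in code. The student records and the helper function are hypothetical, for illustration only:

```python
import random

random.seed(1)

# Hypothetical school: 300 students, 30 of whom are minority students
students = [{"id": i, "minority": i < 30} for i in range(300)]

def stratified_sample(population, key, sizes):
    """Draw a simple random sample within each stratum; `sizes` maps
    each stratum value to the number of draws from that stratum."""
    sample = []
    for value, k in sizes.items():
        stratum = [p for p in population if p[key] == value]
        sample.extend(random.sample(stratum, k))
    return sample

# A single random draw of 50 would include only about
# 50 * 30/300 = 5 minority students on average; stratifying
# lets us fix the count, e.g. 15 minority and 35 others
sample = stratified_sample(students, "minority", {True: 15, False: 35})
```

Within each stratum this is still a random sample, which is what makes the design easy to analyze while guaranteeing enough observations from the small group.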
Stratification refers to first dividing the sampling frame into different strata, or sets, and then taking a random sample from each set. Stratification improves the distribution of your variables; it produces random samples that can be better in some instances, and it is a very commonly used sampling design.

So these are the three most commonly used sampling designs. In a random sample, everybody is equally likely to be selected. A cluster sample means you choose people from certain areas, which you choose in advance, so people in other areas have a zero chance of being selected. Stratified random sampling means that you divide your sampling frame into different strata based on criteria, for example race or education level, and then take a random sample from each of those strata separately, which provides some statistical benefits.

Then we have a fourth type of commonly used sample, the convenience sample. A convenience sample is none of the above; it is something that we just happen to get.
In most cases, if you do a survey study and send out invitations, the people or organizations that you choose to invite may be a random sample, but in the end, those that you get data for are not a random sample of those who got the invitation; rather, it is a convenience sample, just the companies that we happen to get. Convenience samples are debated to some extent. Some people argue that they should be avoided; others argue that convenience samples are useful, because they allow us to do designs that would not be possible with random samples. But you have to understand these different concepts to understand the issues related to sampling that I will cover in later videos.