WEBVTT

00:00:00.060 --> 00:00:06.720
Another feature that people typically check in their data is the presence of outliers.

00:00:06.720 --> 00:00:12.660
Outliers are influential observations, or observations that are clearly different from the others.

00:00:12.660 --> 00:00:22.860
While the absence of outliers is not an assumption of regression analysis, there are sometimes reasons to delete them.

00:00:22.860 --> 00:00:25.170
So you have to understand why you have an outlier.

00:00:25.170 --> 00:00:26.970
Let's take a look at outliers.

00:00:26.970 --> 00:00:29.850
Here we have the Prestige dataset.

00:00:29.850 --> 00:00:39.090
We have a regression line for the effect of education on prestige, and it is a nice, clean regression line.

00:00:39.090 --> 00:00:43.680
The observations are homoscedastic and they are spread evenly around the regression line.

00:00:44.730 --> 00:00:45.510
There are no problems.

00:00:45.510 --> 00:00:51.630
What happens if we have one observation that is very far from the others?

00:00:51.630 --> 00:00:55.020
We have an outlier here, so what will that outlier do?

00:00:55.020 --> 00:01:07.770
The outlier will pull the regression line toward itself, and with the outlier included in the data, the slope of the regression line is a bit smaller.

00:01:07.770 --> 00:01:16.620
The line also no longer goes through the middle of the remaining observations; rather, it goes too low here and too high here.

00:01:16.620 --> 00:01:21.840
So clearly, we don't want to have the outlier here.

00:01:21.840 --> 00:01:27.270
But before we decide what to do with the outlier, we have to consider the different mechanisms that can produce one.

00:01:27.270 --> 00:01:29.460
So what is this observation really about?

00:01:29.460 --> 00:01:33.540
It could be a data entry mistake.

00:01:33.540 --> 00:01:41.760
For example, the occupation's prestige really should be 70, but somebody entered 17 into our dataset.

00:01:41.760 --> 00:01:50.640
Or, if these were companies, the outlier could be a company that is outside our population.

00:01:50.640 --> 00:01:59.640
If we do a survey of small technology companies, we can accidentally send the survey to a large technology company.

00:01:59.640 --> 00:02:08.550
The large technology company would be outside our population, so it should not be part of our sample.

00:02:08.550 --> 00:02:12.360
Or it could be a case that is genuinely unique.

00:02:12.360 --> 00:02:28.950
If we're studying the growth of small technology-based companies, then for example Supercell, the Finnish game developer that makes billions of euros of revenue from games on the App Store, is an outlier.
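To see the pulling effect described above, here is a minimal sketch in Python. It uses simulated education and prestige values (hypothetical numbers, not the actual Prestige dataset from the lecture) and compares the fitted slope with and without a single mistyped observation.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)

# Simulate a clean education -> prestige relationship
# (hypothetical values, not the lecturer's Prestige dataset).
education = rng.uniform(8, 16, size=50)
prestige = 10 + 3.5 * education + rng.normal(0, 4, size=50)

def fit_slope(x, y):
    """OLS slope of y on x, with an intercept."""
    X = sm.add_constant(x)
    return sm.OLS(y, X).fit().params[1]

print("slope without outlier:", fit_slope(education, prestige))

# Add one outlier: a high-education occupation whose prestige
# should be about 70 but was entered as 17.
education_out = np.append(education, 16.0)
prestige_out = np.append(prestige, 17.0)

print("slope with outlier:   ", fit_slope(education_out, prestige_out))
```

With the outlier included, the printed slope is noticeably smaller, which is exactly the "pulling toward itself" effect the lecture describes.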
00:02:28.950 --> 00:02:37.350
While they technically are a small and young technology-based company, they are so different from other companies in their performance

00:02:37.350 --> 00:02:51.420
that including that particular company is something we probably don't want to do, because a regression model typically aims to explain the bulk of the data, where most of the observations are.

00:02:51.420 --> 00:03:08.700
So outliers could be observations that are truly unique, which may be worth studying separately as case studies; they could be data entry mistakes; or they could be observations that don't belong to our population and were included in the sample accidentally.

00:03:08.700 --> 00:03:12.030
The effect of an outlier depends on two different things.

00:03:12.030 --> 00:03:16.680
First is the residual: how far the outlier is from the regression line.

00:03:16.680 --> 00:03:26.340
The outlier pulls the regression line toward itself, and the strength of that pull is related to the residual.

00:03:26.340 --> 00:03:28.170
Remember that we minimize the sum of squared residuals.

00:03:28.710 --> 00:03:38.790
If one observation has a very large residual, it pulls the regression line very strongly, because it is the square of the residual that matters.

00:03:38.790 --> 00:03:46.410
The other concept is leverage: if we pull the regression line here, near the end of the range where there are few observations,

00:03:46.410 --> 00:03:55.560
then we have a lot more leverage and the regression line moves more than if we pull it from the middle, where there are lots of observations.

00:03:55.560 --> 00:04:01.800
Pulling the regression line from the middle here has close to zero leverage, so the outlier wouldn't really matter.

00:04:02.550 --> 00:04:07.110
We check both leverage and residuals when we do outlier diagnostics.

00:04:09.450 --> 00:04:13.980
When we identify outliers, there are three important steps in the process.

00:04:13.980 --> 00:04:17.940
Deephouse's article is a really great example of how to deal with outliers.

00:04:17.940 --> 00:04:24.990
First, report how you identified the outliers; Deephouse used residuals.

00:04:24.990 --> 00:04:33.090
They identified banks with large residuals, and then they analyzed the outliers.

00:04:33.090 --> 00:04:47.790
That is the second step: what is the outlier like? Is it a data entry mistake, is it a company that shouldn't be in the sample, or is it a unique case that is not representative of the other banks, even if it technically belongs to the population?

00:04:47.790 --> 00:04:52.230
They identified two banks that were merging.

00:04:52.230 --> 00:04:59.100
If you have banks that are merging, that is probably quite a different observation from the others.

00:04:59.100 --> 00:05:02.760
They decided to drop that observation from the sample.

00:05:02.760 --> 00:05:08.280
The third step is to explain what you did and what the outcome of doing so was.
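The leverage and residual diagnostics mentioned above can be computed directly. Below is a small sketch, again on simulated data, using standard statsmodels influence measures: hat values for leverage, externally studentized residuals for distance from the line, and Cook's distance, which combines the two. The cutoffs (|t| > 3, leverage above twice the mean) are common rules of thumb, not values from the lecture.

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(1)
x = rng.uniform(8, 16, size=50)
y = 10 + 3.5 * x + rng.normal(0, 4, size=50)

# Append one suspicious observation at the edge of the x range,
# where leverage is high.
x = np.append(x, 17.0)
y = np.append(y, 17.0)

model = sm.OLS(y, sm.add_constant(x)).fit()
infl = OLSInfluence(model)

hat = infl.hat_matrix_diag               # leverage
rstud = infl.resid_studentized_external  # studentized residuals
cooks = infl.cooks_distance[0]           # combines leverage and residual

# Flag cases that are extreme on either dimension.
for i in np.where((np.abs(rstud) > 3) | (hat > 2 * hat.mean()))[0]:
    print(f"obs {i}: leverage={hat[i]:.3f}, "
          f"studentized residual={rstud[i]:.2f}, Cook's D={cooks[i]:.3f}")
```

The appended observation is flagged on both dimensions: it sits at the edge of the x range (high leverage) and far from the fitted line (large studentized residual).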
00:05:08.280 --> 00:05:18.330
They explained what the effect of dropping the outlier was, and they concluded that it didn't really make a difference whether they included that observation in the sample or not.

00:05:18.330 --> 00:05:19.740
That's a very good example.

00:05:21.000 --> 00:05:27.810
If you want to read more about outliers and good practices, I recommend this paper by Aguinis and his students.

00:05:27.810 --> 00:05:38.490
They describe how you identify outliers in regression analysis, structural equation models, and multilevel models, and how you can deal with them.

00:05:38.490 --> 00:05:44.160
Sometimes outliers are problematic; sometimes they are data entry mistakes, which can be fixed.

00:05:44.160 --> 00:05:49.980
Sometimes outliers are truly interesting cases that you should study separately.

00:05:49.980 --> 00:05:54.480
That's what the Deephouse paper did.
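The third step, reporting the effect of dropping the outlier, amounts to fitting the model with and without the flagged case and comparing the results. Here is a minimal sketch of that robustness check on the same kind of simulated data as above; the flagged index and the data are illustrative, not from the Deephouse study.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
x = rng.uniform(8, 16, size=50)
y = 10 + 3.5 * x + rng.normal(0, 4, size=50)
x = np.append(x, 17.0)
y = np.append(y, 17.0)  # index 50 is the suspected outlier

X = sm.add_constant(x)
full = sm.OLS(y, X).fit()

keep = np.arange(len(y)) != 50  # drop the flagged case
trimmed = sm.OLS(y[keep], X[keep]).fit()

print("slope, full sample:     %.3f (p=%.3g)"
      % (full.params[1], full.pvalues[1]))
print("slope, outlier removed: %.3f (p=%.3g)"
      % (trimmed.params[1], trimmed.pvalues[1]))
# Report both estimates: if the conclusions agree, the result is
# robust to the outlier, as in the Deephouse example.
```

Reporting both estimates, rather than silently dropping the case, is what makes the treatment of outliers transparent to the reader.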