WEBVTT

00:00:00.320 --> 00:00:04.240
I will now show an example of factor analysis, which will hopefully help

00:00:04.240 --> 00:00:10.160
in understanding the basic concepts. Briefly, factor analysis is a technique that

00:00:10.160 --> 00:00:14.880
allows you to discover dimensions from the data. It answers the question: what do these

00:00:14.880 --> 00:00:17.200
variables have in common? And there are two variants.

00:00:17.200 --> 00:00:21.200
There's exploratory factor analysis and then there's confirmatory factor analysis.

00:00:21.200 --> 00:00:26.160
Exploratory factor analysis is easier to apply, and that is something that I usually teach to

00:00:26.160 --> 00:00:28.480
beginners first. Confirmatory factor

00:00:28.480 --> 00:00:33.920
analysis is a more advanced technique. This video is about exploratory factor analysis.

00:00:33.920 --> 00:00:37.200
To do factor analysis, we need some data. And our data are here.

00:00:37.840 --> 00:00:42.640
So these data are from R, and I'll be using R for the analysis.

00:00:42.640 --> 00:00:49.040
And the data are the scores for the men's decathlon in the 1988 Olympics.

00:00:49.040 --> 00:00:54.320
So decathlon is a sport where you have to do 10 individual sports.

00:00:54.320 --> 00:01:02.320
We have the 100 meter run, long jump, shot put, high jump, 400 meter run, 110 meter hurdles,

00:01:02.320 --> 00:01:06.240
discus throw, pole vault, javelin throw and 1500 meter run.

00:01:07.040 --> 00:01:10.800
We're going to factor analyze these data. But before we do factor analysis,

00:01:10.800 --> 00:01:16.240
let's take a look at what the data look like. So these are the first 15 observations.

00:01:16.240 --> 00:01:19.520
Observations are in the rows and the variables are in the columns.

00:01:19.520 --> 00:01:22.480
To understand what these numbers mean, we need to understand the units.

00:01:23.040 --> 00:01:27.120
We have the running sports, which are in seconds.
Less is better;

00:01:27.120 --> 00:01:32.320
more is slower and therefore not as good. And then we have the throwing sports.

00:01:32.960 --> 00:01:35.600
More is better. And the jumping sports.

00:01:35.600 --> 00:01:37.360
More is better. So these are meters,

00:01:37.360 --> 00:01:44.080
more meters is better and fewer seconds is better. Okay, let's factor analyze these variables to

00:01:44.080 --> 00:01:48.240
see what they have in common. We will take two factors first.

00:01:48.240 --> 00:01:53.360
So the factor analysis results are here. And I have ordered the sports

00:01:53.360 --> 00:01:58.480
according to the factor loadings. And we can see that all the running sports,

00:01:58.480 --> 00:02:03.680
particularly those that are about short-distance running, less than 400 meters,

00:02:03.680 --> 00:02:08.640
or that involve short runs, short sprints, like long jump or

00:02:09.840 --> 00:02:16.880
pole vault, belong to the first factor. This first factor we could label as running

00:02:16.880 --> 00:02:22.480
speed, or running, because that is what the items that load on it are about.

00:02:23.200 --> 00:02:27.040
Then the second factor. We have shot put, we have javelin throw,

00:02:27.040 --> 00:02:33.600
discus throw and pole vault, which load on that. So we could say that this is upper body strength.

00:02:33.600 --> 00:02:40.960
All the throwing sports are there. Why is pole vault loading

00:02:40.960 --> 00:02:46.400
on both running and upper body strength? Well, consider what the sport is about.

00:02:46.400 --> 00:02:49.840
You first run very fast, you sprint, you gather speed.

00:02:50.400 --> 00:02:53.920
Then you take the pole, put it in the hole, and hold on to it.

00:02:53.920 --> 00:02:58.400
And then you must use your upper body strength to get yourself over the bar.

00:02:58.400 --> 00:03:01.920
So it requires both running and upper body strength.

00:03:01.920 --> 00:03:08.160
Therefore it loads on both of these factors. We can also extract more factors.

00:03:08.160 --> 00:03:12.720
So if we take three factors, then we get even more dimensions from the data.

00:03:13.360 --> 00:03:18.640
So we can see now that the first factor, the running sports, is actually running speed.

00:03:18.640 --> 00:03:25.280
So previously we had the 1500 meter run loading on the first factor.

00:03:25.280 --> 00:03:28.080
But now it does not load on the first factor anymore.

00:03:28.080 --> 00:03:32.160
So it's only about speed now. The second factor contains the

00:03:32.160 --> 00:03:36.720
upper body strength sports still. And then the third factor contains

00:03:37.840 --> 00:03:42.640
running stamina. You basically need stamina for the 1500 meter run,

00:03:42.640 --> 00:03:47.840
which is the main item that loads on this. And the others don't load as highly.

00:03:47.840 --> 00:03:52.880
The 400 meter run loads to some extent because you need more stamina for that

00:03:52.880 --> 00:03:58.240
than for, for example, the 100 meter run, which does not really load on running stamina.

00:03:58.240 --> 00:04:04.160
So when we extract more and more factors, then factor analysis quite often takes existing

00:04:04.160 --> 00:04:10.880
factors and splits them into subfactors. Clearly, if we think about these different sports,

00:04:10.880 --> 00:04:16.480
running fast and being able to maintain your speed for a longer time

00:04:16.480 --> 00:04:19.760
are two different capabilities that a person could have.

00:04:19.760 --> 00:04:22.320
It would make sense to take three factors from these data.

00:04:23.360 --> 00:04:26.320
We can also take more factors. So if we take four factors,

00:04:26.320 --> 00:04:32.480
what is different is that the fourth factor here simply contains high jump and nothing else.

00:04:33.440 --> 00:04:39.040
High jump is quite a unique sport because it's not about running speed, it's not about stamina,

00:04:39.040 --> 00:04:44.000
it's simply about how high you can jump. And it's not related to upper body strength.

00:04:45.360 --> 00:04:52.400
When we start taking more and more factors from these data, eventually we will have each

00:04:52.400 --> 00:04:57.200
sport belonging to its own individual factor. If we have a set of 10 variables, like we have here,

00:04:58.400 --> 00:05:03.520
we can extract 10 different factors because there's always some uniqueness to each sport.

00:05:04.080 --> 00:05:09.600
But quite commonly we start by extracting two or three factors, depending on our theory.

00:05:09.600 --> 00:05:15.760
Then we stop when we consider that adding more factors would not add any value.

00:05:15.760 --> 00:05:21.520
At that point we would just start getting these individual sports and saying that all sports are different.

00:05:21.520 --> 00:05:25.360
That is pretty obvious, and it does not answer the question of what the sports have in common.

00:05:26.560 --> 00:05:30.320
We can take a look at the correlation matrix to see what the factor analysis does.

00:05:31.360 --> 00:05:37.280
These are the correlations between the sports. And factor analysis basically finds those

00:05:37.280 --> 00:05:41.600
combinations of sports that are highly correlated with one another.

00:05:42.160 --> 00:05:47.280
We have here the running sports. They are highly correlated, and they're less correlated with other

00:05:47.280 --> 00:05:51.200
sports than they are with one another. We could say that all these

00:05:51.200 --> 00:05:56.560
sports measure your running speed. Then we have the upper body strength

00:05:56.560 --> 00:06:04.320
factor here; all the throwing sports belong here. So they are highly correlated with one another

00:06:04.880 --> 00:06:10.320
and less with the running sport items. Then we have the running stamina factor.

00:06:10.320 --> 00:06:15.760
So we have the 1500 meter run here and the 400 meter run here, which are correlated

00:06:15.760 --> 00:06:19.360
because they require stamina. For some reason, long jump is here too.

00:06:19.360 --> 00:06:24.800
Maybe the athletes that are good at long jump are also good at these stamina sports.

00:06:25.440 --> 00:06:29.280
And we can also see the fourth factor here, high jump.

00:06:29.280 --> 00:06:32.320
It is very unique. It's not highly correlated with any of the

00:06:32.320 --> 00:06:38.480
other sports, so it depends on a different set of skills than the others.

00:06:39.200 --> 00:06:45.920
Of course, we could simply interpret this correlation matrix directly,

00:06:45.920 --> 00:06:50.640
but it's a lot easier to do it with factor analysis on your computer,

00:06:50.640 --> 00:06:54.640
because it simplifies things: you have fewer numbers to look at, particularly

00:06:54.640 --> 00:07:00.400
if the number of variables grows larger than 10.
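
NOTE
The point about reading the correlation matrix directly can be illustrated with a minimal sketch. It is written in Python rather than the R used in the video, and it assumes synthetic scores, not the 1988 decathlon data. The names simulate, pearson and correlation_matrix, and the event labels, are hypothetical and chosen only for this illustration. Two made-up events are driven by a latent "speed" ability and two by a latent "strength" ability, with every score oriented so that higher is better.
```python
import random
def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5
def simulate(n_athletes=500, seed=0):
    """Synthetic scores (assumption): each event = 0.8 * ability + 0.6 * noise."""
    rng = random.Random(seed)
    # Which latent ability drives which hypothetical event.
    drivers = {"sprint_a": "speed", "sprint_b": "speed",
               "throw_a": "strength", "throw_b": "strength"}
    data = {event: [] for event in drivers}
    for _ in range(n_athletes):
        ability = {"speed": rng.gauss(0, 1), "strength": rng.gauss(0, 1)}
        for event, latent in drivers.items():
            data[event].append(0.8 * ability[latent] + 0.6 * rng.gauss(0, 1))
    return data
def correlation_matrix(data):
    """Nested dict of pairwise Pearson correlations between all variables."""
    names = list(data)
    return {a: {b: pearson(data[a], data[b]) for b in names} for a in names}
corr = correlation_matrix(simulate())
# Print the matrix; events sharing a latent ability should form visible blocks.
for a in corr:
    print(a.ljust(9), " ".join(f"{corr[a][b]:6.2f}" for b in corr))
```
In this sketch the two sprint events correlate strongly with each other and only weakly with the two throws, which is the block pattern that factor analysis summarizes as loadings. With real decathlon data, timed events are scored in seconds, so their correlations with distance events would be expected to come out negative unless the times are reversed first.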