WEBVTT
Kind: captions
Language: en

00:00:00.120 --> 00:00:03.480
There are two things that we need to consider before we can even

00:00:03.480 --> 00:00:09.660
start estimating a confirmatory factor analysis model: scale setting and identification.

00:00:09.660 --> 00:00:14.790
Scale setting means that every variable must have a metric. So we have to be able to

00:00:14.790 --> 00:00:21.150
estimate the variance and sometimes the mean of every variable. And identification means

00:00:21.150 --> 00:00:26.460
that the data provide enough information to estimate the model that we want to estimate.

00:00:26.460 --> 00:00:34.080
So the confirmatory factor analysis framework is very flexible and it's possible to define models

00:00:34.080 --> 00:00:40.080
that are mathematically impossible to estimate uniquely. So in this video we will go through

00:00:40.080 --> 00:00:45.870
what requirements you have to consider before you can even estimate the model meaningfully.

00:00:45.870 --> 00:00:50.520
Let's take a look at this model with just two indicators. We have

00:00:50.520 --> 00:00:57.420
indicators a1 and a2 and then we want to estimate factor A. And we have two

00:00:57.420 --> 00:01:01.470
variances - these two error variances here - and then we have two factor loadings. So

00:01:01.470 --> 00:01:06.990
we have four things that we want to estimate and so four free parameters.

00:01:06.990 --> 00:01:13.680
Then we start estimating it. We calculate the model-implied correlations. So we have two

00:01:13.680 --> 00:01:20.970
variances, the variance of a2 and the variance of a1, and then one correlation. So we have three

00:01:20.970 --> 00:01:28.290
unique elements of information from the data that we model using these four parameters.

00:01:28.290 --> 00:01:34.260
The problem is that now we have three units of information and we have four

00:01:34.260 --> 00:01:37.590
things that we want to estimate.
So the degrees of freedom is minus

00:01:37.590 --> 00:01:42.150
one and that cannot be estimated, or it cannot be estimated meaningfully.

00:01:42.150 --> 00:01:48.300
The reason, or the intuitive understanding, is that

00:01:48.300 --> 00:01:53.400
you cannot estimate four things from three things. So that's the idea.

00:01:53.400 --> 00:02:00.840
You have to have at least as much information as what you want to estimate. So this is

00:02:00.840 --> 00:02:05.430
not identified and there are ways that we can simplify the model to actually be

00:02:05.430 --> 00:02:09.330
able to estimate something or we can add more indicators to make it identified.

00:02:09.330 --> 00:02:16.020
So this is not identified because the degrees of freedom is negative. And factor analysis

00:02:16.020 --> 00:02:21.570
without additional constraints always requires at least three indicators.

00:02:21.570 --> 00:02:28.050
Factor analysis of only two indicators is not a very meaningful analysis anyway because while you

00:02:28.050 --> 00:02:34.920
can make it identified by saying that these factor loadings for example are the same - that would

00:02:34.920 --> 00:02:40.980
identify the model - the estimation wouldn't give you any meaningful information anyway.

00:02:40.980 --> 00:02:48.600
So let's take another example or work more with this example. So let's assume that our

00:02:48.600 --> 00:02:56.490
correlation matrix for this two-factor model, each factor with one indicator, is this: we have a1 and

00:02:56.490 --> 00:03:02.970
b1, they're correlated at 0.1, and we have three parameters that we want to estimate.

00:03:02.970 --> 00:03:10.800
So you can't. We have one correlation that depends on three parameters and these other variances

00:03:10.800 --> 00:03:16.860
don't depend on the model or they do depend but we don't really care about those in this video.

00:03:16.860 --> 00:03:24.030
So why is the correlation between a1 and b1 so low?
There are basically three different

00:03:24.030 --> 00:03:30.810
options. It's possible that a1 and b1 are both highly reliable indicators of these factors A

00:03:30.810 --> 00:03:38.910
and B and that A and B are just weakly correlated. It's also possible

00:03:38.910 --> 00:03:45.150
that A and B are highly correlated but a1 is unreliable and therefore we observe only a

00:03:45.150 --> 00:03:51.000
small correlation, or it's possible that A and B are highly correlated but b1 is unreliable.

00:03:51.000 --> 00:03:57.240
The problem is that we cannot know which of these three options is correct

00:03:57.240 --> 00:04:02.040
because they all have the same empirical implication, which is that this correlation

00:04:02.040 --> 00:04:08.610
here is quite small. So that's another example of a non-identification problem.

00:04:08.610 --> 00:04:14.700
Here we are estimating five things: we have two error variances, two factor loadings and

00:04:14.700 --> 00:04:20.040
one factor correlation. We are trying to estimate them from just three elements of information.

00:04:20.040 --> 00:04:23.340
We can't do that. The model is not identified. We

00:04:23.340 --> 00:04:27.600
cannot know which one of these three explanations is correct empirically.

00:04:27.600 --> 00:04:34.950
Of course we can then use theory and rule out one of these alternative explanations - based on

00:04:34.950 --> 00:04:39.270
theory - but that goes beyond our factor analysis estimates and identification.

00:04:39.270 --> 00:04:43.650
So this model is not identified. It cannot be estimated meaningfully.

00:04:43.650 --> 00:04:48.930
Let's take a look at scale setting now. So identification basically

00:04:48.930 --> 00:04:54.780
means that you have at least as much information as what you estimate.
So the number

00:04:54.780 --> 00:05:00.750
of unique elements in the correlation matrix of the indicators must exceed or be the same

00:05:00.750 --> 00:05:03.570
as the number of free parameters that you estimate from the model.

00:05:03.570 --> 00:05:11.160
Okay. So normally - in exploratory factor analysis - we have standardized factors, so the

00:05:11.160 --> 00:05:16.950
idea is that all the factors have variances of one and means of zero in the exploratory analysis

00:05:16.950 --> 00:05:23.250
and that defines the scale of these variables. So every variable must have a variance. In exploratory

00:05:23.250 --> 00:05:30.030
analysis the factors are scaled to have unit variance, so they're standardized, and then all

00:05:30.030 --> 00:05:34.230
the factor loadings are standardized regression coefficients for that reason.

00:05:34.230 --> 00:05:42.000
Then what if we don't standardize the factors? So instead of saying that

00:05:42.000 --> 00:05:49.230
the factors' variances are one we are estimating the factors' variances. So we add this factor

00:05:49.230 --> 00:05:53.880
variance here and this factor variance here, so we have fifteen free parameters. We still

00:05:53.880 --> 00:05:59.400
have 21 units of information from which we estimate but we estimate 15 different things

00:05:59.400 --> 00:06:06.600
so the degrees of freedom is 6, which means that this model is overidentified. So it's

00:06:06.600 --> 00:06:11.490
positive. So in principle it is possible to estimate this model meaningfully.

00:06:11.490 --> 00:06:18.030
We can do the estimation. So let's assume that that's our observed correlation matrix.

00:06:18.030 --> 00:06:24.240
That's our implied correlation matrix. Then we can find the values for the phi

00:06:24.240 --> 00:06:32.610
and the lambdas so that this implied matrix reproduces this correlation matrix perfectly.
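
The counting used so far (unique elements of the indicator correlation matrix minus free parameters) can be written as a small helper. This is a minimal sketch in plain Python, not something shown in the video. The function names are made up for illustration, and the breakdown of the fifteen parameters into 6 loadings, 6 error variances, 1 factor covariance and 2 factor variances is an assumption inferred from the 21 units of information, which correspond to six indicators.

```python
def unique_elements(n_indicators: int) -> int:
    """Unique variances and covariances among the indicators: p(p + 1) / 2."""
    return n_indicators * (n_indicators + 1) // 2


def degrees_of_freedom(n_indicators: int, n_free_parameters: int) -> int:
    """Units of information minus the number of free parameters."""
    return unique_elements(n_indicators) - n_free_parameters


# Two indicators, one factor: 2 loadings + 2 error variances = 4 free parameters.
print(degrees_of_freedom(2, 4))   # -1

# Six indicators, two factors (assumed breakdown): 6 loadings + 6 error variances
# + 1 factor covariance + 2 factor variances = 15 free parameters.
print(degrees_of_freedom(6, 15))  # 6
```

A negative result means the model cannot be estimated uniquely. A non-negative result only says that the counting requirement is met.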

00:06:32.610 --> 00:06:38.370
In this case that's possible because these correlations all have the same

00:06:38.370 --> 00:06:41.580
values. Generally in small samples you will never

00:06:41.580 --> 00:06:45.960
completely reproduce the data but in this example you do, just to simplify things.

00:06:45.960 --> 00:06:54.570
So we can estimate and that's one set of estimates that will give you the exact fit

00:06:54.570 --> 00:06:58.710
between the observed correlation matrix and the implied correlation matrix.

00:06:58.710 --> 00:07:09.270
So we're fine, right? It turns out we have a small problem because there's another set of estimates

00:07:09.270 --> 00:07:14.730
that also reproduces the correlation matrix perfectly using the implied correlation

00:07:14.730 --> 00:07:22.080
matrix. So you can plug these values into the equations and see that they produce the

00:07:22.080 --> 00:07:28.890
exact same implied correlations. So we have here factor A's variance is 1 versus factor

00:07:28.890 --> 00:07:36.450
B's variance is 2, and yet they produce the same fit. So what do we

00:07:36.450 --> 00:07:43.590
do? We can go and come up with infinitely many examples. So if factor A's variance is

00:07:43.590 --> 00:07:51.390
0.5 then we will have different values for the factor loadings but still the empirical

00:07:51.390 --> 00:07:56.130
correlation matrix is reproduced perfectly by the model-implied correlation matrix.

00:07:56.130 --> 00:08:03.420
So this is the problem of scale setting of latent variables in confirmatory factor analysis models.

00:08:03.420 --> 00:08:12.810
So we need to set the metric. The factors themselves, because we don't observe the factors,

00:08:12.810 --> 00:08:18.780
are just arbitrary entities. We don't know whether they vary from 0 to 1 or 0 to 1 million

00:08:18.780 --> 00:08:25.350
or minus 5 to plus 10 or whatever. We don't know their range.
We don't know their variances. We

00:08:25.350 --> 00:08:32.310
don't know their means. We have to specify the scale of each latent variable, each factor, ourselves.

00:08:32.310 --> 00:08:37.590
In exploratory analysis we typically don't model means and then we assume

00:08:37.590 --> 00:08:41.670
that the variances, or we fix the variances, of the factors to be ones.

00:08:41.670 --> 00:08:52.650
In confirmatory analysis there are reasons why we don't fix the variances to ones. That I'll

00:08:52.650 --> 00:09:03.390
explain a bit later. But the problem generally is that we must define whether we are talking

00:09:03.390 --> 00:09:10.980
about centimetres or inches, Celsius or Fahrenheit. They quantify the same

00:09:10.980 --> 00:09:19.140
exact thing and they are equally good measures from a statistical perspective for measuring

00:09:19.140 --> 00:09:23.910
length or temperature. We have to agree on what scale we're using.

00:09:23.910 --> 00:09:32.580
So also a regression gives us the one unit change - the effect of a one unit

00:09:32.580 --> 00:09:37.050
change in the independent variable on the dependent variable. Considering

00:09:37.050 --> 00:09:42.540
regression coefficients only makes sense after we have considered how

00:09:42.540 --> 00:09:48.120
we define the unit. So what is the unit of A and what is the unit of B?

00:09:48.120 --> 00:09:55.440
We have to set them manually. So we have to decide on a scale setting approach. In exploratory

00:09:55.440 --> 00:10:01.800
analysis, as I said, we typically say that factor A and factor B and all factors have variances of

00:10:01.800 --> 00:10:08.550
one. That produces standardized factor loadings, which are standardized regression coefficients

00:10:08.550 --> 00:10:16.110
of the indicators on the factors or, in the case of uncorrelated factors, they equal correlations.

00:10:16.110 --> 00:10:22.830
We use that in exploratory factor analysis.
We cannot use that in a structural regression

00:10:22.830 --> 00:10:26.880
model. A structural regression model is an extension of a factor analysis

00:10:26.880 --> 00:10:30.810
model where we allow regression relationships between the factors.

00:10:30.810 --> 00:10:36.090
The reason why we can't use this approach is that the variation of

00:10:36.090 --> 00:10:41.370
an endogenous variable - so a variable that depends on other variables - is determined by

00:10:41.370 --> 00:10:47.070
those other variables. So we can't say the variable's variance is one if that variance

00:10:47.070 --> 00:10:51.300
depends on other things in the model. But that's beyond this video.

00:10:51.300 --> 00:11:00.570
Another very common approach is that we fix the first

00:11:00.570 --> 00:11:05.610
indicator's loading to be one. And this is the default scale setting approach in

00:11:05.610 --> 00:11:10.290
most structural regression modelling or confirmatory factor analysis software.

00:11:10.290 --> 00:11:16.830
The reason is that this can be used pretty much always regardless of what kind of variables

00:11:16.830 --> 00:11:21.420
we have here as A and B and what kind of relationship will be specified between A

00:11:21.420 --> 00:11:29.370
and B. And the idea is that - if we assume that classical test theory

00:11:29.370 --> 00:11:36.420
holds, so all these errors here are just random noise - then the variance of A is

00:11:36.420 --> 00:11:43.950
whatever is the variance of the true score of a1. So that's also appealing: if we consider

00:11:43.950 --> 00:11:51.660
that the only source of error is random noise, then the variance of factor A is the variance

00:11:51.660 --> 00:11:57.420
of a1, or what the variance of a1 would be if it wasn't contaminated with this random noise here.

00:11:57.420 --> 00:12:05.520
So that's one reason why this is appealing.
It allows us to consider

00:12:05.520 --> 00:12:14.730
the scale of these indicators without error variance, assuming classical test theory holds for the data.

00:12:14.730 --> 00:12:21.150
And this is such a common approach that there's a rule of thumb that I

00:12:21.150 --> 00:12:24.900
present: always use the first indicators to fix the scale.

00:12:24.900 --> 00:12:32.670
We can see that the papers that we have used as examples in these videos are using this

00:12:32.670 --> 00:12:39.840
approach. Mesquita and Lazzarini - you can see all loadings of first indicators are ones. So they set

00:12:39.840 --> 00:12:46.830
the scale of the latent variable by fixing this loading to one and then they have the Z-statistic

00:12:46.830 --> 00:12:54.060
here and you can see that the first indicators don't have a Z-statistic.

00:12:54.060 --> 00:12:59.910
The reason is that they are not estimated from the data. Instead the researcher says that these

00:12:59.910 --> 00:13:05.730
are ones. If something is not estimated it doesn't vary from sample

00:13:05.730 --> 00:13:12.840
to sample. So it doesn't have a standard error. So we can't calculate the Z-statistic for it.

00:13:12.840 --> 00:13:20.760
We can see the same in Yli-Renko's paper. In Yli-Renko's paper the first loading is not

00:13:20.760 --> 00:13:27.000
one but it doesn't have a Z-statistic and it doesn't have

00:13:27.000 --> 00:13:33.090
a standard error, so that's an indication that they actually fixed the first loading to be one to

00:13:33.090 --> 00:13:39.750
identify, or scale, the latent variables.
If you want to have standardized factor loadings

00:13:39.750 --> 00:13:47.160
- so if you want to have loadings that are expressed in the scale of the exploratory

00:13:47.160 --> 00:13:53.880
analysis where the factor variances are ones - then you can rescale the confirmatory factor

00:13:53.880 --> 00:14:00.420
analysis results afterwards. Your software will produce that for you if you check the

00:14:00.420 --> 00:14:05.520
standardized estimates option there. So these are standardized estimates but the

00:14:05.520 --> 00:14:10.320
scaling has been done after estimation. So you first estimate an unstandardized

00:14:10.320 --> 00:14:18.690
confirmatory factor analysis where each factor is scaled by fixing the first indicator's loading, then

00:14:18.690 --> 00:14:25.320
you scale the resulting solution. That's the same approach that you use for standardized regression

00:14:25.320 --> 00:14:30.300
coefficients. You first estimate the regression, then you scale the parameter estimates later.

00:14:30.300 --> 00:14:36.270
So the summary of identification of confirmatory factor analysis models.

00:14:36.270 --> 00:14:44.730
A model is identified if every latent variable has a scale and if the degrees of freedom is

00:14:44.730 --> 00:14:50.940
non-negative, and also every part of the model has to be identified.

00:14:50.940 --> 00:14:57.870
In confirmatory factor analysis - after we have established that every latent variable, every factor, has

00:14:57.870 --> 00:15:05.130
a scale - all factors with three indicators are always identified. So three indicators: if you

00:15:05.130 --> 00:15:10.710
have three variables you can always run a factor analysis no matter what. Then if you have two

00:15:10.710 --> 00:15:18.270
indicators, we can either say that both are equally reliable, so we fix the factor

00:15:18.270 --> 00:15:27.690
loadings to be ones, or we can embed this factor in a larger system.
So with just two variables alone

00:15:27.690 --> 00:15:35.040
we can't estimate a factor model unless we fix these factor loadings to be the same. If we embed

00:15:35.040 --> 00:15:42.540
this two-indicator factor into a larger factor analysis then we can

00:15:42.540 --> 00:15:47.940
estimate because we can use information from other indicators to estimate these factor loadings.

00:15:47.940 --> 00:15:52.140
And one single indicator rule: if we have a factor with just a single

00:15:52.140 --> 00:15:56.280
indicator then we cannot estimate the reliability of the indicator because

00:15:56.280 --> 00:16:00.150
you cannot estimate reliability based on just one measure. That's the idea.

00:16:00.150 --> 00:16:07.770
We have to assume what the error variance is and typically we do that by constraining the error

00:16:07.770 --> 00:16:14.370
variance to be zero. So we say that this factor A or construct A is measured without any error if we

00:16:14.370 --> 00:16:19.470
can't estimate it. Of course we could constrain the error variance to be something else. If we

00:16:19.470 --> 00:16:26.610
know that the indicator has typically been shown to be eighty percent reliable, then we can fix this

00:16:26.610 --> 00:16:33.060
error variance here to be 20 percent of the observed variance of the indicator, but that's rarely done.

00:16:33.060 --> 00:16:37.290
So identification is a requirement for estimation.

00:16:37.290 --> 00:16:41.040
If our model is not identified it cannot be meaningfully estimated.

00:16:41.040 --> 00:16:45.870
Identification basically asks whether you have enough information to estimate

00:16:45.870 --> 00:16:51.360
the model. If we have one correlation we can't estimate two different things

00:16:51.360 --> 00:16:56.460
from one correlation. You need at least one unit of information for

00:16:56.460 --> 00:17:01.170
everything that you estimate. Ideally you have more information, so there is redundancy.

00:17:01.170 --> 00:17:07.620
So we need to have a scale for all latent variables and the degrees of freedom must be

00:17:07.620 --> 00:17:14.400
non-negative. Ideally it is positive and the more positive it is the better our model tests are.
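
The scale-setting problem described in this video can also be checked numerically. The following is a minimal sketch in plain Python, assuming a one-factor model with three indicators. The loadings of 0.7 and error variances of 0.51 are made-up illustration values, not the numbers from the slides, and the function names are hypothetical. It shows that doubling the factor variance while shrinking the loadings to compensate gives exactly the same implied matrix, so the data cannot tell the two solutions apart.

```python
def implied_covariances(loadings, factor_variance, error_variances):
    """Model-implied covariance matrix of a one-factor model.

    Off-diagonal: loading_i * loading_j * factor_variance
    Diagonal:     loading_i ** 2 * factor_variance + error_variance_i
    """
    n = len(loadings)
    return [
        [
            loadings[i] * loadings[j] * factor_variance
            + (error_variances[i] if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]


def rescale_loadings(loadings, factor_variance, new_factor_variance):
    """Adjust the loadings so that a new factor variance implies the same matrix."""
    ratio = (factor_variance / new_factor_variance) ** 0.5
    return [loading * ratio for loading in loadings]


def max_abs_difference(m1, m2):
    """Largest element-wise difference between two matrices."""
    return max(abs(a - b) for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))


# Illustration values only (assumed, not from the lecture).
loadings = [0.7, 0.7, 0.7]
errors = [0.51, 0.51, 0.51]

# Solution 1: factor variance fixed to 1.
solution_1 = implied_covariances(loadings, 1.0, errors)

# Solution 2: factor variance 2, loadings shrunk to compensate.
solution_2 = implied_covariances(rescale_loadings(loadings, 1.0, 2.0), 2.0, errors)

print(max_abs_difference(solution_1, solution_2) < 1e-12)  # True
```

Fixing either the factor variance or one loading removes this freedom, which is what the two scale-setting approaches in the video do.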