WEBVTT
Kind: captions
Language: en
00:00:00.030 --> 00:00:03.930
Let's take a look at an empirical
example of confirmatory factor analysis.
00:00:03.930 --> 00:00:10.800
Our data set for the example comes from Mesquita
and Lazzarini. This is a nice paper because they
00:00:10.800 --> 00:00:17.010
present a correlation matrix of all the data at
the indicator level. So we can use their table one
00:00:17.010 --> 00:00:23.580
- shown here - to calculate all the confirmatory
factor analysis and structural regression models
00:00:23.580 --> 00:00:29.700
that the article presents, and we will also get
- for the most part - the exact same results.
00:00:29.700 --> 00:00:34.080
So let's check how the confirmatory factor
00:00:34.080 --> 00:00:38.370
analysis is estimated in R and
what the results look like.
00:00:38.370 --> 00:00:46.170
Specifying the factor analysis model requires
a bit of work. I'll explain the details of
00:00:46.170 --> 00:00:52.650
this syntax a bit later but generally what we do
first is that we specify the model. So we have to
00:00:52.650 --> 00:00:59.910
specify the indicators and for every indicator we
specify one factor - in this particular case - and
00:00:59.910 --> 00:01:08.310
then we estimate the model using the covariance matrix and
finally we'll plot the results as a path diagram.
00:01:08.310 --> 00:01:13.680
So that's the plotting command and I have added
some options to make the plot look a bit nicer.
00:01:13.680 --> 00:01:21.270
So let's take a look at the model specification
in more detail. I have color
00:01:21.270 --> 00:01:28.350
coded this: blue is for factors and green is
for indicators. So we specify
00:01:28.350 --> 00:01:33.870
the factors and then we specify
how each indicator loads on its factor.
00:01:33.870 --> 00:01:41.820
So we have the factor horizontal measured with three
indicators. We have the factor innovation measured
00:01:41.820 --> 00:01:46.800
with two indicators and then we have the factor
competition measured with a single indicator.
00:01:46.800 --> 00:01:50.580
So we have three-indicator factors, two-indicator
00:01:50.580 --> 00:01:55.650
factors, and single-indicator factors,
which are the three scenarios that I
00:01:55.650 --> 00:02:01.200
explained in the video about
factor scale setting and identification.
00:02:01.200 --> 00:02:08.850
So what parameters do we need to estimate?
We need to estimate factor loadings. We
00:02:08.850 --> 00:02:15.510
are going to be scaling each latent variable
using the first indicator fixing technique, so
00:02:15.510 --> 00:02:21.480
we will estimate factor variances and factor
covariances and indicator error variances.
00:02:21.480 --> 00:02:28.470
And the model is identified using the
following approach. We need to set the
00:02:28.470 --> 00:02:34.230
scale of each latent
variable. We use the first indicator fixing,
00:02:34.230 --> 00:02:40.920
so we fix the first loading at one. That's the
default setting, so we don't have to specify it
00:02:40.920 --> 00:02:48.660
here, and then we need to consider how the
three, two, and one indicator rules are applied.
00:02:48.660 --> 00:02:54.330
So we have these three indicator factors.
They're always identified. We have two
00:02:54.330 --> 00:02:59.970
indicator factors. They're identified because
they are embedded in a larger system of factors.
00:02:59.970 --> 00:03:05.430
So we have these two indicator factors
where we can use information from other
00:03:05.430 --> 00:03:09.630
factors to identify those loadings so
we don't have to do anything special
00:03:09.630 --> 00:03:17.130
and then for one indicator factors we
fix the error variances to be zero.
00:03:17.130 --> 00:03:20.550
So we say that these single indicators or single
00:03:20.550 --> 00:03:24.660
indicator factors are perfectly
reliable. So we say that the error
00:03:24.660 --> 00:03:29.490
variances are zero for indicators that
are sole indicators of their factors.
00:03:29.490 --> 00:03:38.010
As a path diagram the result looks like
this. So we have factor
00:03:39.720 --> 00:03:46.170
covariances here - these curves. We have factor
variances - this curve that starts from a factor and
00:03:46.170 --> 00:03:52.380
then comes back to the factor. We have factor
loadings - these arrows from factors to
00:03:52.380 --> 00:03:56.940
the indicators - and then we have indicator
error variances, these curved arrows here.
00:03:56.940 --> 00:04:05.940
Then these dashed arrows are something that
has been fixed. So that's constrained to be
00:04:05.940 --> 00:04:11.190
one and that's constrained to be
zero. So a single indicator
00:04:11.190 --> 00:04:15.180
factor's error variance is constrained to be zero.
00:04:15.180 --> 00:04:23.400
So that's what we have, and there are some
funny things. So we can see here that
00:04:23.400 --> 00:04:27.450
we have some error variances that are
negative. So this is a Heywood case and
00:04:27.450 --> 00:04:32.040
I have another video explaining what
a Heywood case is and why it occurs.
00:04:32.040 --> 00:04:39.240
So we have negative variances - they are
close to zero - so we can conclude that maybe
00:04:39.240 --> 00:04:46.470
these indicators are just highly reliable and
the error variance is actually close to zero.
00:04:46.470 --> 00:04:53.160
It's positive but close to zero, and because of
sampling error we get small negative values.
00:04:53.160 --> 00:04:57.600
So these are small negative values. We
don't really care about that. We assume
00:04:57.600 --> 00:05:02.400
that they are highly reliable instead of this
being a symptom of model misspecification.
00:05:02.400 --> 00:05:08.220
Then I say that these results mostly
match what's reported in the paper.
00:05:08.220 --> 00:05:12.300
So there's a small mismatch in the factor loadings
00:05:12.300 --> 00:05:18.570
but otherwise these factor loadings here
match exactly what the article reports.
00:05:18.570 --> 00:05:28.560
In text form, the output gives us a couple
of things. So we have our estimation
00:05:28.560 --> 00:05:32.910
information first: we have the degrees of
freedom and we have the chi-square that I'll
00:05:32.910 --> 00:05:38.790
explain in the next video. Then we have
the actual estimates, and in the estimates
00:05:38.790 --> 00:05:45.630
list we have the estimate, standard error,
z-value, and p-value, and this goes on
00:05:45.630 --> 00:05:50.070
- it's a very long printout
- and then we have some warnings.
00:05:50.070 --> 00:05:55.470
So the warning here is that we have the Heywood
case so both of these warnings relate to that.
00:05:55.470 --> 00:06:01.890
Let's take a look at the estimation
information part next. So this is the
00:06:01.890 --> 00:06:08.610
same kind of information that is given to you by
any structural regression modeling software.
00:06:08.610 --> 00:06:13.980
So it's not exclusive to R. You will get this
estimation information and the actual estimates.
00:06:13.980 --> 00:06:20.370
Let's take a look at the estimation information
and the degrees of freedom first. So the degrees
00:06:20.370 --> 00:06:29.490
of freedom is 147 and that's the same as
reported in the article. So where did that 147 come from?
00:06:29.490 --> 00:06:36.990
It is a good exercise to calculate the
degrees of freedom by hand because then you
00:06:36.990 --> 00:06:42.930
will understand what was estimated. There's
a nice paper by Cortina and colleagues where
00:06:42.930 --> 00:06:49.140
they calculate these degrees of freedom
from published articles and they check
00:06:49.140 --> 00:06:53.310
whether they actually match the reported
degrees of freedom, and they don't always
00:06:53.310 --> 00:06:58.020
match so that's an indication that there is
something funny going on in the analysis.
00:06:58.020 --> 00:07:01.560
Let's do the degrees of freedom
calculation. So where does the
00:07:01.560 --> 00:07:09.870
147 come from? We first have 231 unique
elements of information. So the
00:07:09.870 --> 00:07:15.180
correlation matrix of all the indicators
has 231 unique elements. So that's the
00:07:15.180 --> 00:07:19.590
amount of information. Then we start
to subtract things that we estimate.
00:07:19.590 --> 00:07:25.170
So we estimate 10 factor variances. So
we have 10 factors. Each factor has an
00:07:25.170 --> 00:07:30.660
estimated variance. Then we estimate
45 factor covariances. So 10 variables
00:07:30.660 --> 00:07:35.700
have 45 unique correlations. Then
we subtract 11 factor loadings.
00:07:35.700 --> 00:07:43.650
So remember that we always fix the first
loading to be 1 to identify the factor. We had
00:07:43.650 --> 00:07:49.020
21 indicators - 10 are used
for scaling the factors - so we estimate
00:07:49.020 --> 00:07:55.800
11 loadings - then we have 18 indicator error
variances. We had 21 indicators but three are
00:07:55.800 --> 00:08:01.320
single indicator factors, so we have to fix their
error variances to be zero, and that gives 147.
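The arithmetic above can be checked with a short script; this is just the bookkeeping described here, nothing model-specific:

```python
# Degrees of freedom for the CFA: 21 indicators, 10 factors,
# 3 of which are single-indicator factors.
n_ind = 21
n_fac = 10

information = n_ind * (n_ind + 1) // 2         # 231 unique elements in the correlation matrix
factor_variances = n_fac                       # 10
factor_covariances = n_fac * (n_fac - 1) // 2  # 45 unique factor covariances
free_loadings = n_ind - n_fac                  # 11 (one loading per factor is fixed to 1)
error_variances = n_ind - 3                    # 18 (single-indicator errors fixed to 0)

df = (information - factor_variances - factor_covariances
      - free_loadings - error_variances)
print(information, df)  # 231 147
```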
00:08:01.320 --> 00:08:07.110
So that's the degrees of freedom. We can check
that our analysis actually matches what was done
00:08:07.110 --> 00:08:12.570
in the paper by comparing the degrees of
freedom and also comparing the chi-square.
00:08:12.570 --> 00:08:19.350
The 147 degrees of freedom tells us
that we have excess information that
00:08:19.350 --> 00:08:27.120
we could estimate 147 more parameters
if we wanted to. After 147 more parameters
00:08:27.120 --> 00:08:29.910
we would have used all the information and
couldn't estimate anything more.
00:08:29.910 --> 00:08:36.030
We can also use the excess information to
check if the excess information matches
00:08:36.030 --> 00:08:42.240
the predictions from our model and that
is the idea of model testing. So we can
00:08:42.240 --> 00:08:47.250
use the redundant information to test
the model. So we have more information
00:08:47.250 --> 00:08:52.980
than we need for model estimation. We can
ask whether the additional information is
00:08:52.980 --> 00:08:58.680
consistent with our estimates. If it is then
we conclude that the model fits the data well.
00:08:58.680 --> 00:09:04.590
So the idea of model testing is
that we have the data correlation
00:09:04.590 --> 00:09:08.430
matrix here - so that's the first
six indicators - then we have the
00:09:08.430 --> 00:09:12.690
implied correlation matrix here and then we
have the residual correlation matrix here.
00:09:12.690 --> 00:09:19.950
Again the estimation criterion was to make this
residual correlation matrix as close to all zeros
00:09:19.950 --> 00:09:25.560
as possible by adjusting the model parameters
that produce the implied correlation matrix.
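As a toy illustration of implied and residual correlations (all numbers here are made up): for a single standardized factor, the model-implied correlation between two indicators is the product of their loadings, and the residual is the observed correlation minus the implied one.

```python
# Made-up example: one factor, three indicators, standardized solution.
loadings = [0.8, 0.7, 0.9]       # hypothetical standardized factor loadings
observed = [[1.00, 0.58, 0.70],  # hypothetical sample correlation matrix
            [0.58, 1.00, 0.65],
            [0.70, 0.65, 1.00]]

p = len(loadings)
# Implied correlation: loading[i] * loading[j] off the diagonal.
implied = [[1.0 if i == j else loadings[i] * loadings[j]
            for j in range(p)] for i in range(p)]
# Residuals: the estimation tries to drive these toward zero.
residual = [[round(observed[i][j] - implied[i][j], 2)
             for j in range(p)] for i in range(p)]
print(residual)  # small off-diagonal values such as 0.02 and -0.02
```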
00:09:28.110 --> 00:09:34.230
These are pretty close to zero and if our
model fits the data perfectly it means
00:09:34.230 --> 00:09:40.380
that it reproduces the data perfectly
- all residuals are zero - and we want
00:09:40.380 --> 00:09:44.610
to know if the model is correct for
the population. So the
00:09:44.610 --> 00:09:55.050
question that we ask now is whether this
model would have produced the population
00:09:55.050 --> 00:09:59.700
correlation matrix if we had access to
that actual population correlation matrix.
00:09:59.700 --> 00:10:05.070
In small samples the actual sample correlations
are slightly off, so they're not exactly at the
00:10:05.070 --> 00:10:09.450
population values and therefore the
residuals are not exactly at zero.
00:10:09.450 --> 00:10:17.400
So we ask the question are these differences
from zero small enough that we can attribute
00:10:17.400 --> 00:10:24.240
them to chance? So is it plausible to say
that the model is correct but it doesn't
00:10:24.240 --> 00:10:30.660
reproduce the data exactly because of small
sample fluctuations in the correlations?
00:10:30.660 --> 00:10:37.890
This question - can these residual correlations be
due to chance only - is what the chi-square statistic
00:10:37.890 --> 00:10:44.460
quantifies. So we have the chi-square
statistic here. It's a function of
00:10:44.460 --> 00:10:52.020
these residuals and - it doesn't
really have an interpretation by itself - but it's
00:10:52.020 --> 00:11:00.510
distributed as chi-square with 147 degrees of
freedom and we can calculate the p-value for it.
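To make that last step concrete, here is a small sketch of computing a chi-square p-value. The chi-square value below is hypothetical (the transcript only gives the resulting p-value of 0.25), and to stay within the standard library it uses the Wilson-Hilferty cube-root normal approximation rather than the exact chi-square tail that SEM software reports:

```python
import math

def chi2_pvalue(x, df):
    # Wilson-Hilferty approximation: (X/df)**(1/3) is roughly normal
    # with mean 1 - 2/(9*df) and variance 2/(9*df); accurate for large df.
    z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))  # right tail: 1 - Phi(z)

# Hypothetical chi-square statistic with df = 147:
print(chi2_pvalue(158.2, 147))  # roughly 0.25, so we would not reject the null
```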
00:11:00.510 --> 00:11:10.980
The p-value here is 0.25. So we say that if
the residuals were all 0 in the population
00:11:10.980 --> 00:11:20.250
then we would get this kind of result - or a more
extreme one - by chance only 25% of the time. So we
00:11:20.250 --> 00:11:28.110
then cannot reject the null hypothesis. The null
hypothesis is that these differences are by chance only. We
00:11:28.110 --> 00:11:33.180
cannot reject the null hypothesis therefore
we say that the model fits the data well.
00:11:33.180 --> 00:11:35.670
This is the logic of the chi-square testing
00:11:35.670 --> 00:11:40.110
in confirmatory factor analysis
and structural regression models.
00:11:40.110 --> 00:11:44.640
So we want to say that these differences
are small enough that we can attribute
00:11:44.640 --> 00:11:52.290
them to chance only and we accept the null or
actually we fail to reject the null. So then
00:11:52.290 --> 00:11:58.590
we conclude that this evidence does not allow
us to conclude that the model is misspecified.
00:11:58.590 --> 00:12:05.850
So we want to have a p-value here that is
non-significant because it indicates that
00:12:05.850 --> 00:12:12.810
our model is a plausible representation of
the data and we conclude that the model fits.
00:12:12.810 --> 00:12:17.370
Let's take a look at the estimation
information again. So estimation
00:12:18.570 --> 00:12:25.350
information gives us the p-value, the
degrees of freedom, and the chi-square
00:12:25.350 --> 00:12:28.500
statistic - then we get the estimates
and then we get these warnings.
00:12:28.500 --> 00:12:35.670
So every time you get warnings you
need to actually look at what the warnings
00:12:35.670 --> 00:12:44.250
mean. So here the output actually tells us
that we should run inspect(fit, "theta"). The
00:12:44.250 --> 00:12:50.760
theta matrix is the residual
indicator error term
00:12:50.760 --> 00:12:56.160
covariance matrix estimated from the
data, and we should investigate it.
00:12:56.160 --> 00:13:03.360
So recall that we have the Heywood case. We have
these three negative error variances and then
00:13:03.360 --> 00:13:10.560
when we inspect the theta matrix - so
the theta matrix contains the estimated
00:13:10.560 --> 00:13:17.070
indicator error term
variances - all the covariances
00:13:17.070 --> 00:13:20.580
between the error terms are constrained
to be 0 because we didn't estimate them
00:13:20.580 --> 00:13:26.490
in this model. And we can see here that
we have these three negative values here.
00:13:26.490 --> 00:13:34.350
So what do we do with that? We conclude that
these are so close to zero that it's plausible
00:13:34.350 --> 00:13:42.300
that they are actually small positive numbers and
that this is just a small sampling fluctuation.