WEBVTT
Kind: captions
Language: en
00:00:00.030 --> 00:00:03.930
Let's take a look at an empirical
example of confirmatory factor analysis.
00:00:03.930 --> 00:00:10.800
Our data set for the example comes from Mesquita
and Lazzarini. This is a nice paper because they
00:00:10.800 --> 00:00:17.010
present a correlation matrix of all the data at
the indicator level. So we can use their table one
00:00:17.010 --> 00:00:23.580
- shown here - to calculate all the confirmatory
factor analysis and structural regression models
00:00:23.580 --> 00:00:29.700
that the article presents, and we will also get
- for the most part - the exact same results.
00:00:29.700 --> 00:00:34.080
So let's check how the confirmatory factor
00:00:34.080 --> 00:00:38.370
analysis is estimated in R and
what the results look like.
00:00:38.370 --> 00:00:46.170
Specifying the factor analysis model requires
a bit of work. I'll explain the details of
00:00:46.170 --> 00:00:52.650
this syntax a bit later but generally what we do
first is that we specify the model. So we have to
00:00:52.650 --> 00:00:59.910
specify the indicators and for every indicator we
specify one factor - in this particular case - and
00:00:59.910 --> 00:01:08.310
then we estimate the model using the covariance matrix and
finally we'll plot the results as a path diagram.
00:01:08.310 --> 00:01:13.680
So that's the plotting command and I have added
some options to make the plot look a bit nicer.
00:01:13.680 --> 00:01:21.270
So let's take a look at the model specification
in more detail. I have color
00:01:21.270 --> 00:01:28.350
coded this: blue is for factors and green is
for indicators. So we specify
00:01:28.350 --> 00:01:33.870
the factors and then we specify
how each indicator loads on its factor.
00:01:33.870 --> 00:01:41.820
So we have the factor horizontal measured with three
indicators. We have the factor innovation measured
00:01:41.820 --> 00:01:46.800
with two indicators and then we have the factor
competition measured with a single indicator.
00:01:46.800 --> 00:01:50.580
So we have three-indicator factors, two-indicator
00:01:50.580 --> 00:01:55.650
factors, and single-indicator factors,
which are the three scenarios that I
00:01:55.650 --> 00:02:01.200
explained in the video about
factor scale setting and identification.
00:02:01.200 --> 00:02:08.850
So what parameters do we need to estimate?
We need to estimate factor loadings. We
00:02:08.850 --> 00:02:15.510
are going to be scaling each latent variable
using the first indicator fixing technique, so
00:02:15.510 --> 00:02:21.480
we will estimate factor variances and factor
covariances and indicator error variances.
00:02:21.480 --> 00:02:28.470
And the model is identified using the
following approach. We need to set the
00:02:28.470 --> 00:02:34.230
scale of each latent
variable. We use the first indicator fixing,
00:02:34.230 --> 00:02:40.920
so we fix the first loading at one. That's the
default setting, so we don't have to specify it
00:02:40.920 --> 00:02:48.660
here, and then we need to consider how the
three, two, and one indicator rules are applied.
00:02:48.660 --> 00:02:54.330
So we have these three indicator factors.
They're always identified. We have two
00:02:54.330 --> 00:02:59.970
indicator factors. They're identified because
they are embedded in a larger system of factors.
00:02:59.970 --> 00:03:05.430
So we have these two indicator factors
where we can use information from other
00:03:05.430 --> 00:03:09.630
factors to identify those loadings so
we don't have to do anything special
00:03:09.630 --> 00:03:17.130
and then for one indicator factors we
fix the error variances to be zero.
00:03:17.130 --> 00:03:20.550
So we say that these single indicators or single
00:03:20.550 --> 00:03:24.660
indicator factors are perfectly
reliable. So we say that the error
00:03:24.660 --> 00:03:29.490
variances are zero for indicators that
are sole indicators of their factors.
00:03:29.490 --> 00:03:38.010
As a path diagram the result looks like
this. So we have factor
00:03:39.720 --> 00:03:46.170
covariances here - these curves. We have factor
variances - this curve that starts from a factor and
00:03:46.170 --> 00:03:52.380
then comes back to the factor. We have factor
loadings - these arrows from factors to
00:03:52.380 --> 00:03:56.940
the indicators - and then we have indicator
error variances, these curved arrows here.
00:03:56.940 --> 00:04:05.940
Then these dashed arrows are something that
has been fixed. So that's constrained to be
00:04:05.940 --> 00:04:11.190
one and that's constrained to be
zero. So a single indicator
00:04:11.190 --> 00:04:15.180
factor's error variance is constrained to be zero.
00:04:15.180 --> 00:04:23.400
So that's what we have, and there are some
funny things. So we can see here that
00:04:23.400 --> 00:04:27.450
we have some error variances that are
negative. So this is a Heywood case and
00:04:27.450 --> 00:04:32.040
I have another video explaining what
a Heywood case is and why it occurs.
00:04:32.040 --> 00:04:39.240
So we have negative variances - they are
close to zero - so we can conclude that maybe
00:04:39.240 --> 00:04:46.470
these indicators are just highly reliable and
the error variance is actually close to zero.
00:04:46.470 --> 00:04:53.160
It's positive but close to zero, and because of
sampling error we get small negative values.
00:04:53.160 --> 00:04:57.600
So these are small negative values. We
don't really care about that. We assume
00:04:57.600 --> 00:05:02.400
that they are highly reliable instead of this
being a symptom of model misspecification.
00:05:02.400 --> 00:05:08.220
Then I say that these results mostly
match what's reported in the paper.
00:05:08.220 --> 00:05:12.300
So there's a small mismatch in the factor loadings
00:05:12.300 --> 00:05:18.570
but otherwise these factor loadings here
match exactly what the article reports.
00:05:18.570 --> 00:05:28.560
In text form, the output gives us a couple
of things. So we have our estimation
00:05:28.560 --> 00:05:32.910
information first: we have the degrees of
freedom and we have the chi-square that I'll
00:05:32.910 --> 00:05:38.790
explain in the next video. Then we have
the actual estimates, and in the estimates
00:05:38.790 --> 00:05:45.630
list we have the estimate, standard error,
z-value, and p-value, and this goes on
00:05:45.630 --> 00:05:50.070
- it's a very long printout
- and then we have some warnings.
00:05:50.070 --> 00:05:55.470
So the warning here is that we have the Heywood
case so both of these warnings relate to that.
00:05:55.470 --> 00:06:01.890
Let's take a look at the estimation
information part next. So this is the
00:06:01.890 --> 00:06:08.610
same kind of information that is given to you by
any structural regression modeling software.
00:06:08.610 --> 00:06:13.980
So it's not exclusive to R. You will get this
estimation information and the actual estimates.
00:06:13.980 --> 00:06:20.370
Let's take a look at the estimation information
and the degrees of freedom first. So the degrees
00:06:20.370 --> 00:06:29.490
of freedom is 147 and that's the same as
reported in the article. So where did that 147 come from?
00:06:29.490 --> 00:06:36.990
It is a good exercise to calculate the
degrees of freedom by hand because then you
00:06:36.990 --> 00:06:42.930
will understand what was estimated. There's
a nice paper by Cortina and colleagues where
00:06:42.930 --> 00:06:49.140
they calculate these degrees of freedom
from published articles and they check
00:06:49.140 --> 00:06:53.310
whether they actually match the reported
degrees of freedom, and they don't always
00:06:53.310 --> 00:06:58.020
match so that's an indication that there is
something funny going on in the analysis.
00:06:58.020 --> 00:07:01.560
Let's do the degrees of freedom
calculation. So where does the
00:07:01.560 --> 00:07:09.870
147 come from? We first have 231 unique
elements of information. So the
00:07:09.870 --> 00:07:15.180
correlation matrix of all the indicators
has 231 unique elements. So that's the
00:07:15.180 --> 00:07:19.590
amount of information. Then we start
to subtract things that we estimate.
00:07:19.590 --> 00:07:25.170
So we estimate 10 factor variances. So
we have 10 factors. Each factor has an
00:07:25.170 --> 00:07:30.660
estimated variance. Then we estimate
45 factor covariances. So 10 variables
00:07:30.660 --> 00:07:35.700
have 45 unique correlations. Then
we subtract 11 factor loadings.
00:07:35.700 --> 00:07:43.650
So remember that we always fix the first
loading to be 1 to identify the factor. We had
00:07:43.650 --> 00:07:49.020
21 indicators - 10 are used
for scaling the factors - so we estimate
00:07:49.020 --> 00:07:55.800
11 loadings - then we have 18 indicator error
variances. We had 21 indicators but three are
00:07:55.800 --> 00:08:01.320
single indicator factors, so we have to fix their
error variances to be zero, and that gives 147.
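The arithmetic above can be checked with a short script; this is just the bookkeeping described here, nothing model-specific:

```python
# Degrees of freedom for the CFA: 21 indicators, 10 factors,
# 3 of which are single-indicator factors.
n_ind = 21
n_fac = 10

information = n_ind * (n_ind + 1) // 2         # 231 unique elements in the correlation matrix
factor_variances = n_fac                       # 10
factor_covariances = n_fac * (n_fac - 1) // 2  # 45 unique factor covariances
free_loadings = n_ind - n_fac                  # 11 (one loading per factor is fixed to 1)
error_variances = n_ind - 3                    # 18 (single-indicator errors fixed to 0)

df = (information - factor_variances - factor_covariances
      - free_loadings - error_variances)
print(information, df)  # 231 147
```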
00:08:01.320 --> 00:08:07.110
So that's the degrees of freedom. We can check
that our analysis actually matches what was done
00:08:07.110 --> 00:08:12.570
in the paper by comparing the degrees of
freedom and also comparing the chi-square.
00:08:12.570 --> 00:08:19.350
The 147 degrees of freedom tells us
that we have excess information that
00:08:19.350 --> 00:08:27.120
we could estimate 147 more parameters
if we wanted to. After 147 more parameters
00:08:27.120 --> 00:08:29.910
we would have used all the information and
couldn't estimate anything more.
00:08:29.910 --> 00:08:36.030
We can also use the excess information to
check if the excess information matches
00:08:36.030 --> 00:08:42.240
the predictions from our model and that
is the idea of model testing. So we can
00:08:42.240 --> 00:08:47.250
use the redundant information to test
the model. So we have more information
00:08:47.250 --> 00:08:52.980
than we need for model estimation. We can
ask whether the additional information is
00:08:52.980 --> 00:08:58.680
consistent with our estimates. If it is then
we conclude that the model fits the data well.
00:08:58.680 --> 00:09:04.590
So the idea of model testing is
that we have the data correlation
00:09:04.590 --> 00:09:08.430
matrix here - so that's the first
six indicators - then we have the
00:09:08.430 --> 00:09:12.690
implied correlation matrix here and then we
have the residual correlation matrix here.
00:09:12.690 --> 00:09:19.950
Again the estimation criterion was to make this
residual correlation matrix as close to all zeros
00:09:19.950 --> 00:09:25.560
as possible by adjusting the model parameters
that produce the implied correlation matrix.
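As a toy illustration of implied and residual correlations (all numbers here are made up): for a single standardized factor, the model-implied correlation between two indicators is the product of their loadings, and the residual is the observed correlation minus the implied one.

```python
# Made-up example: one factor, three indicators, standardized solution.
loadings = [0.8, 0.7, 0.9]       # hypothetical standardized factor loadings
observed = [[1.00, 0.58, 0.70],  # hypothetical sample correlation matrix
            [0.58, 1.00, 0.65],
            [0.70, 0.65, 1.00]]

p = len(loadings)
# Implied correlation: loading[i] * loading[j] off the diagonal.
implied = [[1.0 if i == j else loadings[i] * loadings[j]
            for j in range(p)] for i in range(p)]
# Residuals: the estimation tries to drive these toward zero.
residual = [[round(observed[i][j] - implied[i][j], 2)
             for j in range(p)] for i in range(p)]
print(residual)  # small off-diagonal values such as 0.02 and -0.02
```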
00:09:28.110 --> 00:09:34.230
These are pretty close to zero and if our
model fits the data perfectly it means
00:09:34.230 --> 00:09:40.380
that it reproduces the data perfectly
- all residuals are zero - and we want
00:09:40.380 --> 00:09:44.610
to know if the model is correct for
the population. So the
00:09:44.610 --> 00:09:55.050
question that we ask now is whether this
model would have produced the population
00:09:55.050 --> 00:09:59.700
correlation matrix if we had access to
that actual population correlation matrix.
00:09:59.700 --> 00:10:05.070
In small samples the actual sample correlations
are slightly off, so they're not exactly at the
00:10:05.070 --> 00:10:09.450
population values and therefore the
residuals are not exactly at zero.
00:10:09.450 --> 00:10:17.400
So we ask the question are these differences
from zero small enough that we can attribute
00:10:17.400 --> 00:10:24.240
them to chance? So is it plausible to say
that the model is correct but it doesn't
00:10:24.240 --> 00:10:30.660
reproduce the data exactly because of small
sample fluctuations in the correlations?
00:10:30.660 --> 00:10:37.890
This question - can these residual correlations be
due to chance only - is what the chi-square statistic
00:10:37.890 --> 00:10:44.460
quantifies. So we have the chi-square
statistic here. It's a function of
00:10:44.460 --> 00:10:52.020
these residuals and - it doesn't
really have an interpretation by itself - but it's
00:10:52.020 --> 00:11:00.510
distributed as chi-square with 147 degrees of
freedom and we can calculate the p-value for it.
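To make that last step concrete, here is a small sketch of computing a chi-square p-value. The chi-square value below is hypothetical (the transcript only gives the resulting p-value of 0.25), and to stay within the standard library it uses the Wilson-Hilferty cube-root normal approximation rather than the exact chi-square tail that SEM software reports:

```python
import math

def chi2_pvalue(x, df):
    # Wilson-Hilferty approximation: (X/df)**(1/3) is roughly normal
    # with mean 1 - 2/(9*df) and variance 2/(9*df); accurate for large df.
    z = ((x / df) ** (1 / 3) - (1 - 2 / (9 * df))) / math.sqrt(2 / (9 * df))
    return 0.5 * (1 - math.erf(z / math.sqrt(2)))  # right tail: 1 - Phi(z)

# Hypothetical chi-square statistic with df = 147:
print(chi2_pvalue(158.2, 147))  # roughly 0.25, so we would not reject the null
```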
00:11:00.510 --> 00:11:10.980
The p-value here is 0.25. So we say that if
the residuals were all 0 in the population
00:11:10.980 --> 00:11:20.250
then we would get this kind of result - or a more
extreme one - by chance only 25% of the time. So we
00:11:20.250 --> 00:11:28.110
then cannot reject the null hypothesis. The null
hypothesis is that these differences are by chance only. We
00:11:28.110 --> 00:11:33.180
cannot reject the null hypothesis therefore
we say that the model fits the data well.
00:11:33.180 --> 00:11:35.670
This is the logic of the chi-square testing
00:11:35.670 --> 00:11:40.110
in confirmatory factor analysis
and structural regression models.
00:11:40.110 --> 00:11:44.640
So we want to say that these differences
are small enough that we can attribute
00:11:44.640 --> 00:11:52.290
them to chance only and we accept the null or
actually we fail to reject the null. So then
00:11:52.290 --> 00:11:58.590
we conclude that this evidence does not allow
us to conclude that the model is misspecified.
00:11:58.590 --> 00:12:05.850
So we want to have a p-value here that is
non-significant because it indicates that
00:12:05.850 --> 00:12:12.810
our model is a plausible representation of
the data and we conclude that the model fits.
00:12:12.810 --> 00:12:17.370
Let's take a look at the estimation
information again. So estimation
00:12:18.570 --> 00:12:25.350
information gives us the p-value, the
degrees of freedom, and the chi-square
00:12:25.350 --> 00:12:28.500
statistic - then we get the estimates
and then we get these warnings.
00:12:28.500 --> 00:12:35.670
So every time you get warnings you
need to actually look at what the warnings
00:12:35.670 --> 00:12:44.250
mean. So here the output actually tells us
that we should run inspect(fit, "theta"). The
00:12:44.250 --> 00:12:50.760
theta matrix is the residual
indicator error term
00:12:50.760 --> 00:12:56.160
covariance matrix estimated from the
data, and we should investigate it.
00:12:56.160 --> 00:13:03.360
So recall that we have the Heywood case. We have
these three negative error variances and then
00:13:03.360 --> 00:13:10.560
when we inspect the theta matrix - so
the theta matrix contains the estimated
00:13:10.560 --> 00:13:17.070
indicator error term
variances - all the covariances
00:13:17.070 --> 00:13:20.580
between the error terms are constrained
to be 0 because we didn't estimate them
00:13:20.580 --> 00:13:26.490
in this model. And we can see here that
we have these three negative values here.
00:13:26.490 --> 00:13:34.350
So what do we do with that? We conclude that
these are so close to zero that it's plausible
00:13:34.350 --> 00:13:42.300
that they are actually small positive numbers and
that this is just a small sampling fluctuation.