NMST539 | Lab Session 8

Factor Analysis and Applications in R

Summer Term 2021/2022 | 12/04/22

source file (UTF8 encoding)

The R software is available for download (free of charge) from the website: https://www.r-project.org

A user-friendly interface (one of many): RStudio.

Manuals and introductions to R (in Czech or English):

  • Bína, V., Komárek, A., and Komárková, L.: Jak na jazyk R. (in Czech) | (PDF manual)
  • Komárek, A.: Základy práce s R. (in Czech) | (PDF manual)
  • Kulich, M.: Velmi stručný úvod do R. (in Czech) | (PDF manual)
  • Venables, W.N., Smith, D.M., and the R Core Team: An Introduction to R. | (PDF manual)
  • De Vries, A. and Meys, J.: R for Dummies. (ISBN-13: 978-1119055808) | (PDF book)



Theoretical references
  • Härdle, W. and Simar, L.: Applied Multivariate Statistical Analysis. Springer, 2015.
  • Mardia, K., Kent, J., and Bibby, J.: Multivariate Analysis. Academic Press, 1979.

1. Factor Analysis

Factor analysis is an effective (not just exploratory) statistical method meant for dimensionality reduction. It is especially suitable in situations where one needs to use the data for some further analysis and statistical modeling (e.g. regression). Factor analysis is used to describe the overall correlation structure among some set of variables by using a potentially lower number of variables which are called factors. These factors are, however, unobserved – latent random variables.

The factor analysis approach searches for groups of covariates which are similar in the sense of their mutual correlation. All variables with mutually high correlations are then represented by a factor (or a linear combination of factors) instead.

In some sense, the factor analysis approach can be considered a generalization of the classical principal component analysis (PCA), with one great advantage at the end – usually a more convenient interpretation of the factors.

In the statistical software R, the function factanal() is available under the standard R installation. Besides that, there are many more additional functions and packages which can be downloaded and installed in R (for instance, the Factanal() function in the library ‘FAiR’ or the fa.promax() function in the library ‘psych’).

For our purposes we mainly use the standard R function factanal(). Let us again recall the biological metrics data from the Czech Republic. The data represent 65 different river localities in the Czech Republic, where various biological metrics are assessed at each locality (17 metrics in total).

rm(list = ls())
bioData <- read.csv("http://msekce.karlin.mff.cuni.cz/~maciak/NMST539/bioData.csv", header = T)



The correlation structure (which will later be assessed using the factor analysis approach) can either be estimated using a standard variance-covariance matrix (command var(bioData[,2:18])) or, instead, it can be visualized using the corrplot() function from the ‘corrplot’ package.

library(corrplot)
par(mfrow = c(1,2))
corrplot(cor(bioData[,2:18]), method="ellipse")
corrplot(cor(bioData[,2:18]), method="number",  number.cex = 0.5)

Note that the (sample) variance-covariance matrix depends on the scale used to measure the given covariates. On the other hand, the (sample) correlation matrix is already scale invariant. Compare the following outputs:

all.equal(cov(bioData[,2:18]), cov(scale(bioData[,2:18])))
all.equal(cor(bioData[,2:18]), cor(scale(bioData[,2:18])))

cor(bioData[,2:18])
cov(scale(bioData[,2:18]))
cor(scale(bioData[,2:18]))



The idea behind the factanal() function (the factor analysis method itself) is to consider a \(p\)-dimensional (random) vector \(\boldsymbol{X}\) and to express this vector using a set of \(k\) factors, where we typically require that \(k \ll p\) (dimensionality reduction). The model, which is fitted by the factor analysis approach when applying the factanal() function, can be expressed as

\[ \boldsymbol{X} = \Lambda \boldsymbol{F} + \boldsymbol{e}, \]

where \(\Lambda\) is a \(p\times k\) dimensional (theoretically nonrandom) matrix of so-called factor loadings, \(\boldsymbol{F}\) is a \(k\)-dimensional (random) vector representing \(k\) (random) factors (common factors or scores, respectively), and \(\boldsymbol{e} = (e_1, \dots, e_p)^\top\) is an approximation error (the specific factors). Thus, the general notation assumes some common factors in \(\boldsymbol{F}\) and specific factors in \(\boldsymbol{e}\).

The form of the expression above closely resembles the linear regression model, where

\[ \boldsymbol{Y} = \mathbb{X} \boldsymbol{\beta} + \boldsymbol{\varepsilon}, \] for \(\boldsymbol{Y}\) being the vector of dependent variables, \(\mathbb{X}\) the given model matrix (random or non-random) of the type \(n \times p\), \(\boldsymbol{\beta} \in \mathbb{R}^p\) the vector of unknown parameters (to be estimated), and \(\boldsymbol{\varepsilon}\) an (unobserved) random vector – the error terms.

However, the tricky part of the factor decomposition \(\boldsymbol{X} = \Lambda \boldsymbol{F} + \boldsymbol{e}\) is that besides the random vector \(\boldsymbol{X}\), no other quantity is directly observed. The problem, as such, is unsolvable in a direct and unique way unless we impose some additional restrictions. This is done by specifying a variance-covariance structure among all quantities which appear in the given expression. The following restrictions are commonly used:

  • The (random) common factors \(\boldsymbol{F}\) are required to be uncorrelated, with unit variances, such that \(Var (\boldsymbol{F}) = \mathbb{I}_{k};\)
  • The specific factors (error terms) in \(\boldsymbol{e}\) are mutually uncorrelated, each with some variance, thus \(Var (\boldsymbol{e}) = \boldsymbol{\Psi} = Diag\{\psi_{1}, \dots, \psi_{p}\}\), where \(Var (e_{i}) = \psi_{i}\);
  • The variance-covariance matrix (covariance structure, respectively) of \(\boldsymbol{X}\) is decomposed as \(Var(\boldsymbol{X}) = \Sigma = \Lambda \Lambda^{\top} + \boldsymbol{\Psi}\);
  • The mutual covariance between the common factors \(\boldsymbol{F}\) and the specific factors \(\boldsymbol{e}\) is zero, thus \(Cov(\boldsymbol{F}, \boldsymbol{e}) = \boldsymbol{0}\);
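
Note that the third restriction is in fact a consequence of the remaining ones, since

\[ Var(\boldsymbol{X}) = Var(\Lambda \boldsymbol{F} + \boldsymbol{e}) = \Lambda Var(\boldsymbol{F}) \Lambda^\top + Var(\boldsymbol{e}) = \Lambda \Lambda^\top + \boldsymbol{\Psi}, \]

where the cross terms vanish because \(Cov(\boldsymbol{F}, \boldsymbol{e}) = \boldsymbol{0}\).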



A common technique used in the factor analysis approach is to assume that the matrix \(\Lambda^\top \boldsymbol{\Psi}^{-1} \Lambda\) is diagonal. This introduces additional equations needed to solve the problem:

  • from the factor decomposition itself, thus the equation \(\Sigma = \Lambda \Lambda^\top + \boldsymbol{\Psi}\), we have \(p (p + 1)/2\) equations in total;
  • the total number of unknown parameters to find is \(p \times k\) parameters from the matrix \(\Lambda\) and, in addition, \(p\) parameters from the diagonal matrix \(\boldsymbol{\Psi}\);
  • the additional requirement of diagonality of \(\Lambda^\top \boldsymbol{\Psi}^{-1} \Lambda\) introduces \(k(k - 1)/2\) additional equations (the off-diagonal elements of this symmetric \(k \times k\) matrix must equal zero);
  • therefore, the overall degrees of freedom of the factor model (the number of equations minus the number of unknown parameters) can be obtained as \[ \frac{(p - k)^2}{2} - \frac{k + p}{2}; \]

The degrees-of-freedom value obtained from the equation above can be used to get an idea of the overall (upper) number of factors which can be effectively estimated from the underlying model.
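
For a quick illustration, the following small helper (an ad-hoc function, not part of any package) evaluates this quantity for our data with \(p = 17\):

faDf <- function(p, k) ((p - k)^2 - (p + k)) / 2   # degrees of freedom of the k-factor model

# with p = 17 the degrees of freedom remain nonnegative up to k = 11 factors
sapply(1:12, function(k) faDf(17, k))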

Note


  • Factor analysis is not unique – any orthogonal rotation of the common factors (with the loadings rotated correspondingly) gives the same decomposition;
    (use the fact that a matrix \(\boldsymbol{R}\) is a rotation matrix if and only if \(\boldsymbol{R}^\top = \boldsymbol{R}^{-1}\) and prove the statement given above)
  • Factor analysis is invariant with respect to the scale;
    (the common factors in \(\boldsymbol{F}\) will remain the same; the matrices \(\Lambda\) and \(\boldsymbol{\Psi}\) will change correspondingly)
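
The rotation invariance is also easy to verify numerically; a minimal sketch with an arbitrary loading matrix (any orthogonal matrix \(\boldsymbol{R}\) would do):

Lambda <- matrix(rnorm(17 * 3), 17, 3)   # an arbitrary 17 x 3 matrix of loadings
theta <- pi/6                            # rotate the first two factors by 30 degrees
R <- diag(3)
R[1:2, 1:2] <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), 2, 2)
all.equal(Lambda %*% t(Lambda), (Lambda %*% R) %*% t(Lambda %*% R))   # TRUE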



Since the factors are not unique with respect to rotation, it is useful to provide some rotation which makes sense. The automatic procedure which tries to find the factors in such a way that the original covariates can be split into disjoint sets is called the varimax procedure.
To use the varimax procedure for detecting the right rotation, we can use the additional parameter ‘rotation=“varimax”’ when calling the function factanal() in R.

fa1 <- factanal(bioData[,2:18], factors = 3, rotation="varimax")
print(fa1, digits=2, cutoff=.6, sort=TRUE)
## 
## Call:
## factanal(x = bioData[, 2:18], factors = 3, rotation = "varimax")
## 
## Uniquenesses:
##    SaprInd      Lital       RETI     EPTAbu       Marg   Metaritr     JepAbu 
##       0.15       0.09       0.05       0.05       0.03       0.02       0.10 
##  Epiritral Hyporitral     Ntaxon     Nceled     Bindex     EPTTax     PosAbu 
##       0.07       0.39       0.21       0.09       0.19       0.10       0.35 
##    Spasaci      MDS_1      MDS_2 
##       0.11       0.13       0.45 
## 
## Loadings:
##            Factor1 Factor2 Factor3
## SaprInd    -0.79                  
## Lital       0.94                  
## RETI        0.94                  
## EPTAbu      0.90                  
## Metaritr    0.96                  
## JepAbu      0.68            0.65  
## Epiritral   0.89                  
## Hyporitral  0.71                  
## PosAbu      0.79                  
## Spasaci     0.94                  
## MDS_1       0.91                  
## Marg                0.98          
## Ntaxon              0.65          
## Nceled              0.94          
## Bindex              0.79          
## EPTTax              0.83          
## MDS_2                       0.71  
## 
##                Factor1 Factor2 Factor3
## SS loadings       8.96    4.05    1.41
## Proportion Var    0.53    0.24    0.08
## Cumulative Var    0.53    0.77    0.85
## 
## Test of the hypothesis that 3 factors are sufficient.
## The chi square statistic is 414.98 on 88 degrees of freedom.
## The p-value is 6.88e-44

Is such a factor representation enough? Compare the model with some others, for an increasing number of factors used:

fa2 <- factanal(bioData[,2:18], factors = 4, rotation="varimax")
fa3 <- factanal(bioData[,2:18], factors = 5, rotation="varimax")
fa4 <- factanal(bioData[,2:18], factors = 9, rotation="varimax")
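
Each fitted object stores the corresponding goodness-of-fit test (components STATISTIC, dof, and PVAL of a factanal object), so the models can be compared directly:

# chi-square statistic, degrees of freedom, and p-value for each fit
sapply(list(fa1, fa2, fa3, fa4),
       function(f) c(statistic = f$STATISTIC, df = f$dof, pval = f$PVAL))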

There are mainly two different approaches to deciding on the number of factors which should be used. For instance, it is easy to see that for each element \(X_{j}\) from the random vector \(\boldsymbol{X} = (X_{1}, \dots, X_{p})^\top\) it holds that

\[ var(X_{j}) =h_{j}^2 + \psi_{j}, \qquad \textrm{where} \qquad h_{j}^2 = \sum_{\ell = 1}^{k} q_{j \ell}^2, \]

and \(\boldsymbol{\Psi} = Diag\{\psi_{1}, \dots, \psi_{p}\}\), and \(\Lambda = \{q_{j \ell}\}_{j, \ell = 1}^{p,k}\). The quantity \(h_{j}^2\) is the part of the overall variability of \(X_{j}\) which is explained by the common factors in \(\boldsymbol{F}\) (also called the communality), while the second quantity, \(\psi_{j}\), is the variability which is still left unexplained (also called the specific variability).

Theoretically speaking (see the limitations with respect to the degrees of freedom above), it is obvious that for \(k = p\) factors we would obtain \(var(X_{j}) = \sum_{\ell = 1}^{p} q_{j \ell}^2\), and the specific variability would be equal to zero, thus \(\psi_j = 0\), for any \(j = 1, \dots, p\).

Therefore, to decide on how many factors are needed, one can base the decision on a comparison of the communality and the specific variability.
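
For the three-factor model fitted above this comparison is immediate; since factanal() works with standardized variables, the communality and the specific variability sum to one for each covariate:

h2 <- rowSums(fa1$loadings[, 1:3]^2)                     # communalities
cbind(communality = h2, uniqueness = fa1$uniquenesses)   # rows sum (approximately) to one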

Alternative approaches to judging the right number of factors:
  • Statistical pre-analysis
    One usually runs, for instance, a principal component analysis to determine how many factors should be enough (in order to have some reasonable proportion of the overall variability explained);

  • Expert judgement
    Sometimes (under some favourable interpretation options) the estimated factors nicely correspond with some latent variables which are not observed but can be identified by expert judgement.

  • Maximum Likelihood Test
    Using the likelihood approach we can also test the null hypothesis \(H_{0}\) that \(\Sigma = \Lambda\Lambda^\top + \boldsymbol{\Psi}\) against a general alternative \(H_{1}\) specifying no restrictions on the variance-covariance matrix \(\Sigma\). The likelihood ratio test is given by the test statistic \[ T = - 2 \log \Bigg[\frac{\textrm{maximum likelihood under $H_{0}$}}{\textrm{maximum likelihood under $H_{1}$}}\Bigg]. \] However, this approach assumes that the data follow a normal distribution with some specific parameters. Moreover, factor analysis is usually applied so that the factors admit a nice interpretation, rather than as a formal statistical test.



Note

  • Another option for determining the optimal number of factors to be extracted in the analysis is to use the tools from the library ‘nFactors’ (the library can be installed by install.packages("nFactors"));

library(nFactors)
ev <- eigen(cor(bioData[,2:18]))   # eigenvalues of the sample correlation matrix
ap <- parallel(subject=nrow(bioData[,2:18]), var=ncol(bioData[,2:18]), rep=100, cent=.05)
nS <- nScree(x=ev$values, aparallel=ap$eigen$qevpea)

plotnScree(nS)                     # scree plot with several stopping rules

load <- fa1$loadings[,1:2]   # loadings on the first two factors
plot(load, type="n")         # set up an empty plot
text(load, labels=names(bioData[,2:18]), cex=.7)

lines(c(-1,1), c(0,0), lty = 3)   # dotted coordinate axes through the origin
lines(c(0,0), c(-1,1), lty = 3)

Alternatively, using the library FactoMineR and the function PCA(), we can obtain the complete factor map (a map of the covariates together with their mutual correlations).

library(FactoMineR)
result <- PCA(bioData[,2:18])



Note


  • The factanal() function in R uses scaled variables as the starting point for the factor analysis – the variance-covariance matrix of \(\boldsymbol{X}\) is replaced by the corresponding correlation matrix with ones on its diagonal. Remember that factor analysis is, unlike principal component analysis, scale invariant, as we focus on the covariance structure (the off-diagonal elements of the variance-covariance matrix) rather than the variance structure (the diagonal elements of the variance-covariance matrix).


Now we have an idea that three factors should perhaps be enough. In order to get the corresponding score values (those are the values of \(\boldsymbol{F}\) in the expression above, given for each observation – in our case \(n = 65\)) from the factanal() function, we need an additional parameter to be specified:

fa1 <- factanal(bioData[,2:18], factors = 3, scores = "regression")
fa1$scores
##           Factor1     Factor2      Factor3
##  [1,] -0.12449662  0.49318396  0.019226166
##  [2,] -0.33849930 -0.39230318  0.119550138
##  [3,] -0.22044557 -0.38580832  2.320568152
##  [4,]  0.82566318  1.20827606 -0.940891112
##  [5,] -0.13932079  0.89310776 -0.532355972
##  [6,] -1.01455286 -0.65600386 -0.375228204
##  [7,] -0.82343873 -0.62783914 -0.701274829
##  [8,] -0.98728818 -0.13782643 -0.433312546
##  [9,]  0.22418720  0.88632819  0.795206993
## [10,]  0.52292639  1.06261823  0.295526977
## [11,]  0.29559045  1.25010730 -0.529517247
## [12,] -0.50645369 -0.62925864 -0.373956123
## [13,] -0.93565845 -1.35045706 -0.637327245
## [14,] -0.95854352 -0.20085958  1.192711934
## [15,]  0.42955162 -0.44374222 -1.634492803
## [16,]  1.52669131  0.90258023 -0.647334942
## [17,] -1.06358562 -0.98644111 -0.614191669
## [18,] -0.17714883  0.29705426 -0.273963895
## [19,] -1.17296128  0.15482177  0.740586340
## [20,] -0.78876331  1.64768437  1.135756173
## [21,] -0.68592393  1.86494676  0.622906404
## [22,]  0.03378541  0.51636339 -0.492220100
## [23,] -0.98102088  0.67287079  0.410026081
## [24,]  1.72144792  0.93062040 -1.616832483
## [25,] -0.76259043 -1.35110582 -0.851702050
## [26,] -0.91691053 -1.09563523 -0.287746278
## [27,] -0.28310899 -0.08405725 -0.554590635
## [28,]  1.01603809  0.64177486 -1.212012177
## [29,]  0.02890208  0.83487030  0.220552649
## [30,]  2.69443073 -1.16900885  0.606585871
## [31,] -0.47082315 -0.18615905  2.222434682
## [32,]  1.95090263 -1.29573724  1.303667307
## [33,]  0.28558011  1.04412770  0.364248451
## [34,] -0.18909235  0.97786635  0.090795648
## [35,] -0.48400090 -0.41407901 -0.459370022
## [36,] -0.86086894 -1.09247855 -0.356923055
## [37,] -0.83806848 -0.92300383 -0.815412673
## [38,] -0.91971280 -0.62821762 -0.098009986
## [39,] -0.87317702 -0.29454466 -0.077311663
## [40,]  0.96502394 -1.12911936  4.327243670
## [41,] -0.63704689  0.20145093  0.197835134
## [42,] -0.19959392  1.30938359  0.667994311
## [43,]  3.12996486 -1.62033577  0.044354682
## [44,]  1.20880189  0.64341909 -0.149924195
## [45,]  0.06135524 -0.02822842 -0.557196522
## [46,] -0.86225779 -1.61696484 -1.552055930
## [47,] -0.61532943 -0.45390266 -0.419664831
## [48,] -0.14042956  0.55569570 -0.113699662
## [49,] -0.76637088  0.22722578 -0.178657943
## [50,]  0.92871029  1.82419989 -0.448391534
## [51,]  1.52938447  1.37527251 -1.001758261
## [52,] -0.06380299 -2.13721348  0.714719247
## [53,] -0.48147547  0.55671249  1.028660417
## [54,] -0.01773075 -0.25909705 -0.961153139
## [55,]  0.44621941  0.62532458 -0.231747386
## [56,] -0.48055946  0.70571466  0.712530342
## [57,] -0.57932009 -0.16888336  0.267753908
## [58,]  0.41294090  0.99097762 -0.110134319
## [59,] -0.54167359  0.01836471  0.755755149
## [60,] -0.08414099  0.39657781  0.326141043
## [61,] -0.92105906 -1.35457565 -0.853275093
## [62,]  2.50167388 -1.84503963 -0.802830680
## [63,] -0.84252936 -0.72197336 -0.191797001
## [64,]  1.79639695 -1.48006272  0.583465441
## [65,]  0.21360641  1.45044090  0.001460895

and the corresponding loadings (the matrix \(\Lambda\) in the expression above) are obtained as

fa1$loadings
## 
## Loadings:
##            Factor1 Factor2 Factor3
## SaprInd    -0.794  -0.457  -0.109 
## Lital       0.939   0.164         
## RETI        0.941   0.203  -0.142 
## EPTAbu      0.903           0.357 
## Marg                0.984         
## Metaritr    0.959   0.193  -0.150 
## JepAbu      0.684  -0.115   0.646 
## Epiritral   0.886   0.182  -0.340 
## Hyporitral  0.710   0.184   0.275 
## Ntaxon     -0.572   0.654  -0.190 
## Nceled      0.102   0.936  -0.151 
## Bindex      0.421   0.789         
## EPTTax      0.456   0.834         
## PosAbu      0.785          -0.159 
## Spasaci     0.939                 
## MDS_1       0.907   0.190   0.113 
## MDS_2      -0.165  -0.129   0.714 
## 
##                Factor1 Factor2 Factor3
## SS loadings      8.961   4.053   1.409
## Proportion Var   0.527   0.238   0.083
## Cumulative Var   0.527   0.766   0.848
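
As a numerical check of the decomposition \(\Sigma = \Lambda \Lambda^\top + \boldsymbol{\Psi}\), the reproduced correlation matrix can be compared with the sample one:

L3 <- fa1$loadings[, 1:3]
SigmaHat <- L3 %*% t(L3) + diag(fa1$uniquenesses)   # reproduced correlation matrix
max(abs(SigmaHat - cor(bioData[,2:18])))            # largest absolute discrepancy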



With some additional R code we can even nicely visualize the corresponding factor loadings (assuming there are three latent factors altogether). This can be helpful, for instance, for experts when deciding about the number of factors to be used (interpreting the underlying latent variables).

rbPal <- colorRampPalette(c('red','blue'))   # color palette from red (loading -1) to blue (loading 1)

loadings <- data.frame(fa1$loadings[,1:3])
# for each factor: a color coding the loading value and a label combining the covariate name with the loading
loadings$colF1 <- rbPal(20)[as.numeric(cut(loadings[,1],breaks = seq(-1,1,length = 20)))]
loadings$nam1 <- paste(row.names(loadings), round(loadings[,1], digits = 3), sep = "\n")
loadings$colF2 <- rbPal(20)[as.numeric(cut(loadings[,2],breaks = seq(-1,1,length = 20)))]
loadings$nam2 <- paste(row.names(loadings), round(loadings[,2], digits = 3), sep = "\n")
loadings$colF3 <- rbPal(20)[as.numeric(cut(loadings[,3],breaks = seq(-1,1,length = 20)))]
loadings$nam3 <- paste(row.names(loadings), round(loadings[,3], digits = 3), sep = "\n")

par(mfrow = c(3,1))
barplot(abs(loadings[,1]), col = loadings$colF1, ylim = c(0,1), ylab = "Factor 1", names.arg = loadings$nam1, cex.names = 0.8)
barplot(abs(loadings[,2]), col = loadings$colF2, ylim = c(0,1), ylab = "Factor 2", names.arg = loadings$nam2, cex.names = 0.8)
barplot(abs(loadings[,3]), col = loadings$colF3, ylim = c(0,1), ylab = "Factor 3", names.arg = loadings$nam3, cex.names = 0.8)




Try by yourself

  • Use the factor analysis approach and consider various numbers of factors to explain the variance-covariance structure.
  • Obtain the corresponding regression scores and reconstruct the original covariates using the factor values and the given factor loadings.
  • Use some reasonable error measure and compare the approximation errors with respect to the various numbers of factors (one possible starting point is sketched below).
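
A minimal sketch of the reconstruction step (assuming the three-factor fit fa1 with the regression scores obtained above; the comparison is done on the standardized scale used internally by factanal()):

X <- scale(bioData[,2:18])               # standardized data
Xhat <- fa1$scores %*% t(fa1$loadings)   # rank-3 reconstruction Lambda F
mean((X - Xhat)^2)                       # overall mean squared approximation error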



2. Application in Regression or SEM

Structural Equation Models (SEM) are statistical modelling techniques showing great potential, especially for describing causal dependencies between endogenous and exogenous variables – the measurement model shows the relations between the latent variables and their indicators (which is also the case for the latent factors \(\boldsymbol{F}\) in the factor analysis).

In such situations it is common to first draw a schematic diagram of the assumed underlying structure. Later, the diagram is used to specify the model which is fitted at the end.

An example of such a diagram is below.



In the R software, the package ‘sem’ is available for structural equation modelling.

library("sem")

and a simple example using the dataset ‘PoliticalDemocracy’ from the package ‘lavaan’:

library("lavaan")
data(PoliticalDemocracy)
PoliticalDemocracy
##       y1       y2        y3        y4        y5       y6       y7        y8
## 1   2.50 0.000000  3.333333  0.000000  1.250000 0.000000 3.726360  3.333333
## 2   1.25 0.000000  3.333333  0.000000  6.250000 1.100000 6.666666  0.736999
## 3   7.50 8.800000  9.999998  9.199991  8.750000 8.094061 9.999998  8.211809
## 4   8.90 8.800000  9.999998  9.199991  8.907948 8.127979 9.999998  4.615086
## 5  10.00 3.333333  9.999998  6.666666  7.500000 3.333333 9.999998  6.666666
## 6   7.50 3.333333  6.666666  6.666666  6.250000 1.100000 6.666666  0.368500
## 7   7.50 3.333333  6.666666  6.666666  5.000000 2.233333 8.271257  1.485166
## 8   7.50 2.233333  9.999998  1.496333  6.250000 3.333333 9.999998  6.666666
## 9   2.50 3.333333  3.333333  3.333333  6.250000 3.333333 3.333333  3.333333
## 10 10.00 6.666666  9.999998  8.899991  8.750000 6.666666 9.999998 10.000000
## 11  7.50 3.333333  9.999998  6.666666  8.750000 3.333333 9.999998  6.666666
## 12  7.50 3.333333  6.666666  6.666666  8.750000 3.333333 6.666666  6.666666
## 13  7.50 3.333333  9.999998  6.666666  7.500000 3.333333 6.666666 10.000000
## 14  7.50 7.766664  9.999998  6.666666  7.500000 0.000000 9.999998  0.000000
## 15  7.50 9.999998  3.333333 10.000000  7.500000 6.666666 9.999998 10.000000
## 16  7.50 9.999998  9.999998  7.766666  7.500000 1.100000 6.666666  6.666666
## 17  2.50 3.333333  6.666666  6.666666  5.000000 1.100000 6.666666  0.368500
## 18  1.25 0.000000  3.333333  3.333333  1.250000 3.333333 3.333333  3.333333
## 19 10.00 9.999998  9.999998 10.000000  8.750000 9.999998 9.999998 10.000000
## 20  7.50 3.333299  3.333333  6.666666  7.500000 2.233299 6.666666  2.948164
## 21 10.00 9.999998  9.999998 10.000000 10.000000 9.999998 9.999998 10.000000
## 22  1.25 0.000000  0.000000  0.000000  2.500000 0.000000 0.000000  0.000000
## 23  2.50 0.000000  3.333333  3.333333  2.500000 0.000000 3.333333  3.333333
## 24  7.50 6.666666  9.999998 10.000000  7.500000 6.666666 9.999998  7.766666
## 25  8.50 9.999998  6.666666  6.666666  8.750000 9.999998 7.351018  6.666666
## 26  6.10 0.000000  5.400000  3.333333  0.000000 0.000000 4.696028  3.333333
## 27  3.30 0.000000  6.666666  3.333333  6.250000 0.000000 6.666666  3.333333
## 28  2.90 3.333333  6.666666  3.333333  2.385559 0.000000 3.177568  1.116666
## 29  9.20 0.000000  9.900000  3.333333  7.609660 0.000000 8.118828  3.333333
## 30  6.90 0.000000  6.666666  3.333333  4.226033 0.000000 0.000000  0.000000
## 31  2.90 0.000000  3.333333  3.333333  5.000000 0.000000 3.333333  3.333333
## 32  2.00 0.000000  0.000000  0.000000  0.000000 0.000000 0.000000  0.000000
## 33  5.00 0.000000  3.333333  3.333333  5.000000 0.000000 3.333333  3.333333
## 34  5.00 0.000000  9.999998  3.333333  0.000000 0.000000 3.333333  0.744370
## 35  4.10 9.999998  4.700000  6.666666  3.750000 0.000000 7.827667  6.666666
## 36  6.30 9.999998  9.999998  6.666666  6.250000 2.233333 6.666666  2.955702
## 37  5.20 4.999998  6.600000  3.333333  3.633403 1.100000 3.314128  3.333333
## 38  5.00 3.333333  6.400000  6.666666  2.844997 0.000000 4.429657  1.485166
## 39  3.10 4.999998  4.200000  5.000000  3.750000 0.000000 6.164304  3.333333
## 40  4.10 9.999998  6.666666  3.333333  5.000000 0.000000 4.938089  2.233333
## 41  5.00 9.999998  6.666666  1.666666  5.000000 0.000000 6.666666  0.368500
## 42  5.00 7.700000  6.666666  8.399997  6.250000 4.358243 9.999998  4.141377
## 43  5.00 6.200000  9.999998  6.060997  5.000000 2.782771 6.666666  4.974739
## 44  5.60 4.900000  0.000000  0.000000  6.555647 4.055463 6.666666  3.821796
## 45  5.70 4.800000  0.000000  0.000000  6.555647 4.055463 0.000000  0.000000
## 46  7.50 9.999998  7.900000  6.666666  3.750000 9.999998 7.631891  6.666666
## 47  2.50 0.000000  6.666666  3.333333  2.500000 0.000000 0.000000  0.000000
## 48  8.90 9.999998  9.700000  6.666666  5.000000 9.999998 9.556024  6.666666
## 49  7.60 0.000000 10.000000  0.000000  5.000000 1.100000 6.666666  1.099999
## 50  7.80 9.999998  6.666666  6.666666  5.000000 3.333333 6.666666  6.666666
## 51  2.50 0.000000  6.666666  3.333333  5.000000 0.000000 6.666666  3.333333
## 52  3.80 0.000000  5.100000  0.000000  3.750000 0.000000 6.666666  1.485166
## 53  5.00 3.333333  3.333333  2.233333  5.000000 3.333333 6.666666  5.566663
## 54  6.25 3.333333  9.999998  2.955702  6.250000 5.566663 9.999998  6.666666
## 55  1.25 0.000000  3.333333  0.000000  2.500000 0.000000 0.000000  0.000000
## 56  1.25 0.000000  4.700000  0.736999  2.500000 0.000000 3.333333  3.333333
## 57  1.25 0.000000  6.666666  0.000000  2.500000 0.000000 5.228375  0.000000
## 58  7.50 7.766664  9.999998  6.666666  7.500000 3.333333 9.999998  6.666666
## 59  2.50 0.000000  6.666666  4.433333  5.000000 0.000000 6.666666  1.485166
## 60  7.50 9.999998  9.999998 10.000000  8.750000 9.999998 9.999998 10.000000
## 61  1.25 0.000000  0.000000  0.000000  1.250000 0.000000 0.000000  0.000000
## 62  1.25 0.000000  0.000000  0.000000  0.000000 0.000000 0.000000  0.000000
## 63  2.50 0.000000  0.000000  0.000000  0.000000 0.000000 6.666666  2.948164
## 64  6.25 2.233299  6.666666  2.970332  3.750000 3.333299 6.666666  3.333333
## 65  7.50 9.999998  9.999998 10.000000  7.500000 9.999998 9.999998 10.000000
## 66  5.00 0.000000  6.100000  0.000000  5.000000 3.333333 9.999998  3.333333
## 67  7.50 9.999998  9.999998 10.000000  3.750000 9.999998 9.999998 10.000000
## 68  4.90 2.233333  9.999998  0.000000  5.000000 0.000000 3.621989  3.333333
## 69  5.00 0.000000  8.200000  0.000000  5.000000 0.000000 0.000000  0.000000
## 70  2.90 3.333333  6.666666  3.333333  2.500000 3.333333 6.666666  3.333333
## 71  5.40 9.999998  6.666666  3.333333  3.750000 6.666666 6.666666  1.485166
## 72  7.50 8.800000  9.999998  6.066666  7.500000 6.666666 9.999998  6.666666
## 73  7.50 7.000000  9.999998  6.852998  7.500000 6.348340 6.666666  7.508044
## 74 10.00 6.666666  9.999998 10.000000 10.000000 6.666666 9.999998 10.000000
## 75  3.75 3.333333  0.000000  0.000000  1.250000 3.333333 0.000000  0.000000
##          x1       x2       x3
## 1  4.442651 3.637586 2.557615
## 2  5.384495 5.062595 3.568079
## 3  5.961005 6.255750 5.224433
## 4  6.285998 7.567863 6.267495
## 5  5.863631 6.818924 4.573679
## 6  5.533389 5.135798 3.892270
## 7  5.308268 5.075174 3.316213
## 8  5.347108 4.852030 4.263183
## 9  5.521461 5.241747 4.115168
## 10 5.828946 5.370638 4.446216
## 11 5.916202 6.423247 3.791545
## 12 5.398163 6.246107 4.535708
## 13 6.622736 7.872074 4.906154
## 14 5.204007 5.225747 4.561047
## 15 5.509388 6.202536 4.586286
## 16 5.262690 5.820083 3.948911
## 17 4.700480 5.023881 4.394491
## 18 5.209486 4.465908 4.510268
## 19 5.916202 6.732211 5.829084
## 20 6.523562 6.992096 6.424591
## 21 6.238325 6.746412 5.741711
## 22 5.976351 6.712956 5.948168
## 23 5.631212 5.937536 5.686755
## 24 6.033086 6.093570 4.611429
## 25 6.196444 6.704414 5.475261
## 26 4.248495 2.708050 1.740830
## 27 5.141664 4.564348 2.255134
## 28 4.174387 3.688879 3.046927
## 29 4.382027 2.890372 1.711279
## 30 4.290459 1.609438 1.001674
## 31 4.934474 4.234107 1.418971
## 32 3.850148 1.945910 2.345229
## 33 5.181784 4.394449 3.167167
## 34 5.062595 4.595120 3.834970
## 35 4.691348 4.143135 2.255134
## 36 4.248495 3.367296 3.217506
## 37 5.564520 5.236442 2.677633
## 38 4.727388 3.610918 1.418971
## 39 4.143135 2.302585 1.418971
## 40 4.317488 4.955827 4.249888
## 41 5.141664 4.430817 3.046927
## 42 4.488636 3.465736 2.013579
## 43 4.615121 4.941642 2.255134
## 44 3.850148 2.397895 1.740830
## 45 3.970292 2.397895 1.050741
## 46 3.784190 3.091042 2.113313
## 47 3.806662 2.079442 2.137561
## 48 4.532599 3.610918 1.587802
## 49 5.117994 4.934474 3.834970
## 50 5.049856 5.111988 4.381490
## 51 5.393628 5.638355 4.169451
## 52 4.477337 3.931826 2.474671
## 53 5.257495 5.840642 5.001796
## 54 5.379897 5.505332 3.299937
## 55 5.298317 6.274762 4.381490
## 56 4.859812 5.669881 3.537416
## 57 4.969813 5.564520 4.510268
## 58 6.011267 6.253829 5.001796
## 59 5.075174 5.252273 5.350708
## 60 6.736967 7.125283 6.330518
## 61 5.225747 5.451038 3.167167
## 62 4.025352 1.791759 2.657972
## 63 4.234107 2.708050 2.474671
## 64 4.644391 5.564520 3.046927
## 65 4.418841 4.941642 3.380653
## 66 4.262680 4.219508 4.368462
## 67 4.875197 4.700480 3.834970
## 68 4.189655 1.386294 1.418971
## 69 4.521789 4.127134 2.113313
## 70 4.653960 3.555348 1.881917
## 71 4.477337 3.091042 1.987909
## 72 5.337538 5.631212 3.491004
## 73 6.129050 6.403574 5.001796
## 74 5.003946 4.962845 3.976994
## 75 4.488636 4.897840 2.867566

with the corresponding model structure which we want to fit:

Firstly, we define the model:

model <- '
  # measurement model
    ind60 =~ x1 + x2 + x3
    dem60 =~ y1 + y2 + y3 + y4
    dem65 =~ y5 + y6 + y7 + y8
  # regressions
    dem60 ~ ind60
    dem65 ~ ind60 + dem60
  # residual correlations
    y1 ~~ y5
    y2 ~~ y4 + y6
    y3 ~~ y7
    y4 ~~ y8
    y6 ~~ y8
'
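
In the lavaan model syntax, the operator ‘=~’ defines a latent variable in terms of its observed indicators, ‘~’ specifies a regression, and ‘~~’ specifies a (residual) variance or covariance.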

And now, we can fit the corresponding SEM model:

fit <- sem(model, data=PoliticalDemocracy)
summary(fit, standardized=TRUE)
## lavaan 0.6-8 ended normally after 68 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        31
##                                                       
##   Number of observations                            75
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                                38.125
##   Degrees of freedom                                35
##   P-value (Chi-square)                           0.329
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   ind60 =~                                                              
##     x1                1.000                               0.670    0.920
##     x2                2.180    0.139   15.742    0.000    1.460    0.973
##     x3                1.819    0.152   11.967    0.000    1.218    0.872
##   dem60 =~                                                              
##     y1                1.000                               2.223    0.850
##     y2                1.257    0.182    6.889    0.000    2.794    0.717
##     y3                1.058    0.151    6.987    0.000    2.351    0.722
##     y4                1.265    0.145    8.722    0.000    2.812    0.846
##   dem65 =~                                                              
##     y5                1.000                               2.103    0.808
##     y6                1.186    0.169    7.024    0.000    2.493    0.746
##     y7                1.280    0.160    8.002    0.000    2.691    0.824
##     y8                1.266    0.158    8.007    0.000    2.662    0.828
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   dem60 ~                                                               
##     ind60             1.483    0.399    3.715    0.000    0.447    0.447
##   dem65 ~                                                               
##     ind60             0.572    0.221    2.586    0.010    0.182    0.182
##     dem60             0.837    0.098    8.514    0.000    0.885    0.885
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##  .y1 ~~                                                                 
##    .y5                0.624    0.358    1.741    0.082    0.624    0.296
##  .y2 ~~                                                                 
##    .y4                1.313    0.702    1.871    0.061    1.313    0.273
##    .y6                2.153    0.734    2.934    0.003    2.153    0.356
##  .y3 ~~                                                                 
##    .y7                0.795    0.608    1.308    0.191    0.795    0.191
##  .y4 ~~                                                                 
##    .y8                0.348    0.442    0.787    0.431    0.348    0.109
##  .y6 ~~                                                                 
##    .y8                1.356    0.568    2.386    0.017    1.356    0.338
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .x1                0.082    0.019    4.184    0.000    0.082    0.154
##    .x2                0.120    0.070    1.718    0.086    0.120    0.053
##    .x3                0.467    0.090    5.177    0.000    0.467    0.239
##    .y1                1.891    0.444    4.256    0.000    1.891    0.277
##    .y2                7.373    1.374    5.366    0.000    7.373    0.486
##    .y3                5.067    0.952    5.324    0.000    5.067    0.478
##    .y4                3.148    0.739    4.261    0.000    3.148    0.285
##    .y5                2.351    0.480    4.895    0.000    2.351    0.347
##    .y6                4.954    0.914    5.419    0.000    4.954    0.443
##    .y7                3.431    0.713    4.814    0.000    3.431    0.322
##    .y8                3.254    0.695    4.685    0.000    3.254    0.315
##     ind60             0.448    0.087    5.173    0.000    1.000    1.000
##    .dem60             3.956    0.921    4.295    0.000    0.800    0.800
##    .dem65             0.172    0.215    0.803    0.422    0.039    0.039
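
The fitted dependence structure can also be drawn automatically; for instance, with the semPaths() function from the ‘semPlot’ package (assuming the package is installed):

library(semPlot)
semPaths(fit, what = "est")   # path diagram with the estimated coefficients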




Try by yourself

  • Consider the dataset HolzingerSwineford1939 from the R package library("lavaan"), which can be loaded by the command

    data(HolzingerSwineford1939)
  • Apply the factor analysis approach to the covariates \(X_{1}, \dots, X_9\) and consider three different factors when explaining the underlying covariance structure among these covariates:

    fa_Hol <- factanal(HolzingerSwineford1939[,7:15], factors = 3, rotation="varimax")
    print(fa_Hol, digits=2, cutoff=.45, sort=TRUE)
    ## 
    ## Call:
    ## factanal(x = HolzingerSwineford1939[, 7:15], factors = 3, rotation = "varimax")
    ## 
    ## Uniquenesses:
    ##   x1   x2   x3   x4   x5   x6   x7   x8   x9 
    ## 0.51 0.75 0.54 0.28 0.24 0.31 0.50 0.47 0.54 
    ## 
    ## Loadings:
    ##    Factor1 Factor2 Factor3
    ## x4  0.83                  
    ## x5  0.86                  
    ## x6  0.80                  
    ## x1          0.62          
    ## x3          0.66          
    ## x7                  0.70  
    ## x8                  0.71  
    ## x9                  0.52  
    ## x2          0.49          
    ## 
    ##                Factor1 Factor2 Factor3
    ## SS loadings       2.18    1.34    1.33
    ## Proportion Var    0.24    0.15    0.15
    ## Cumulative Var    0.24    0.39    0.54
    ## 
    ## Test of the hypothesis that 3 factors are sufficient.
    ## The chi square statistic is 22.38 on 12 degrees of freedom.
    ## The p-value is 0.0335
  • What is the interpretation of the results from above? Could you propose a linear regression model (or a SEM, alternatively) to get some quantitative results?
  • The confirmatory factor analysis can do the work for you… The R function cfa() from the R package “lavaan” is very similar to the sem() function used before. In fact, these two functions are almost identical.

    HS.model <- ' visual  =~ x1 + x2 + x3
                  textual =~ x4 + x5 + x6
                  speed   =~ x7 + x8 + x9 '
    
    fit <- cfa(HS.model, data=HolzingerSwineford1939)
    summary(fit, fit.measures=TRUE)
    ## lavaan 0.6-8 ended normally after 35 iterations
    ## 
    ##   Estimator                                         ML
    ##   Optimization method                           NLMINB
    ##   Number of model parameters                        21
    ##                                                       
    ##   Number of observations                           301
    ##                                                       
    ## Model Test User Model:
    ##                                                       
    ##   Test statistic                                85.306
    ##   Degrees of freedom                                24
    ##   P-value (Chi-square)                           0.000
    ## 
    ## Model Test Baseline Model:
    ## 
    ##   Test statistic                               918.852
    ##   Degrees of freedom                                36
    ##   P-value                                        0.000
    ## 
    ## User Model versus Baseline Model:
    ## 
    ##   Comparative Fit Index (CFI)                    0.931
    ##   Tucker-Lewis Index (TLI)                       0.896
    ## 
    ## Loglikelihood and Information Criteria:
    ## 
    ##   Loglikelihood user model (H0)              -3737.745
    ##   Loglikelihood unrestricted model (H1)      -3695.092
    ##                                                       
    ##   Akaike (AIC)                                7517.490
    ##   Bayesian (BIC)                              7595.339
    ##   Sample-size adjusted Bayesian (BIC)         7528.739
    ## 
    ## Root Mean Square Error of Approximation:
    ## 
    ##   RMSEA                                          0.092
    ##   90 Percent confidence interval - lower         0.071
    ##   90 Percent confidence interval - upper         0.114
    ##   P-value RMSEA <= 0.05                          0.001
    ## 
    ## Standardized Root Mean Square Residual:
    ## 
    ##   SRMR                                           0.065
    ## 
    ## Parameter Estimates:
    ## 
    ##   Standard errors                             Standard
    ##   Information                                 Expected
    ##   Information saturated (h1) model          Structured
    ## 
    ## Latent Variables:
    ##                    Estimate  Std.Err  z-value  P(>|z|)
    ##   visual =~                                           
    ##     x1                1.000                           
    ##     x2                0.554    0.100    5.554    0.000
    ##     x3                0.729    0.109    6.685    0.000
    ##   textual =~                                          
    ##     x4                1.000                           
    ##     x5                1.113    0.065   17.014    0.000
    ##     x6                0.926    0.055   16.703    0.000
    ##   speed =~                                            
    ##     x7                1.000                           
    ##     x8                1.180    0.165    7.152    0.000
    ##     x9                1.082    0.151    7.155    0.000
    ## 
    ## Covariances:
    ##                    Estimate  Std.Err  z-value  P(>|z|)
    ##   visual ~~                                           
    ##     textual           0.408    0.074    5.552    0.000
    ##     speed             0.262    0.056    4.660    0.000
    ##   textual ~~                                          
    ##     speed             0.173    0.049    3.518    0.000
    ## 
    ## Variances:
    ##                    Estimate  Std.Err  z-value  P(>|z|)
    ##    .x1                0.549    0.114    4.833    0.000
    ##    .x2                1.134    0.102   11.146    0.000
    ##    .x3                0.844    0.091    9.317    0.000
    ##    .x4                0.371    0.048    7.779    0.000
    ##    .x5                0.446    0.058    7.642    0.000
    ##    .x6                0.356    0.043    8.277    0.000
    ##    .x7                0.799    0.081    9.823    0.000
    ##    .x8                0.488    0.074    6.573    0.000
    ##    .x9                0.566    0.071    8.003    0.000
    ##     visual            0.809    0.145    5.564    0.000
    ##     textual           0.979    0.112    8.737    0.000
    ##     speed             0.384    0.086    4.451    0.000
  • What would be the corresponding diagram capturing the dependence structure from the model above?




Homework Assignment

(Deadline: Lab Session 9 / 19.04.2022)

  • Consider the same dataset you applied the principal component analysis to and apply the factor analysis now;
  • Use relevant covariates only and compare the outcomes of both PCA and FA;
  • Use some graphical tools to visualize the results;
  • Add your solution to your personal web page by Monday, 18.04.2022, 23:59 (CET) at the latest.