Department of Statistics and Data Science Seminar, NUS
Assistant Professor in Statistics, Universiti Brunei Darussalam
Visiting Fellow, London School of Economics and Political Science
October 18, 2024
Jamil, H., Moustaki, I., & Skinner, C. (2024). Pairwise likelihood estimation and limited information goodness-of-fit test statistics for binary factor analysis models under complex survey sampling. Br. J. Math. Stat. Psychol. Advance online publication. https://doi.org/10.1111/bmsp.12358
Context
Employ latent variable models (factor models) to analyse binary data \(y_1,\dots,y_p\) collected via simple random or complex sampling.
\(i\) | \(y_1\) | \(y_2\) | \(y_3\) | \(y_4\) | \(y_5\) |
---|---|---|---|---|---|
1 | 1 | 0 | 0 | 1 | 1 |
2 | 1 | 1 | 1 | 1 | 1 |
3 | 1 | 1 | 1 | 0 | 1 |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
\(n\) | 1 | 0 | 0 | 1 | 1 |
\(i\) | \(y_1\) | \(y_2\) | \(y_3\) | \(y_4\) | \(y_5\) | Pattern |
---|---|---|---|---|---|---|
1 | 1 | 0 | 0 | 1 | 1 | 10011 |
2 | 1 | 1 | 1 | 1 | 1 | 11111 |
3 | 1 | 1 | 1 | 0 | 1 | 11101 |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
\(n\) | 1 | 0 | 0 | 1 | 1 | 10011 |
\(r\) | \(y_1\) | \(y_2\) | \(y_3\) | \(y_4\) | \(y_5\) | Pattern | Obs. freq |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 11111 | 343 |
2 | 1 | 1 | 0 | 1 | 1 | 11011 | 153 |
3 | 1 | 0 | 1 | 1 | 1 | 10111 | 71 |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
\(R\) | 0 | 1 | 1 | 1 | 0 | 01110 | 1 |
\(R = 2^p\) possible response patterns in total (here \(2^5 = 32\))
\(r\) | Pattern | Obs. freq |
---|---|---|
1 | 11111 | 343 |
2 | 11011 | 153 |
3 | 10111 | 71 |
\(\vdots\) | \(\vdots\) | \(\vdots\) |
32 | 01110 | 1 |
\(r\) | Pattern | Obs. freq | Exp. freq |
---|---|---|---|
1 | 11111 | 343 | 342.1 |
2 | 11011 | 153 | 151.3 |
3 | 10111 | 71 | 62.81 |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
32 | 01110 | 1 | 0.948 |
\(r\) | Pattern | Obs. freq | Exp. freq |
---|---|---|---|
1 | 11111 | 343 | 342.1 |
2 | 11011 | 153 | 151.3 |
3 | 10111 | 71 | 62.81 |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
28 | 01000 | 1 | 1.831 |
29 | 01010 | 1 | 3.276 |
30 | 01100 | 1 | 0.948 |
31 | 01101 | 0 | 0.013 |
32 | 01110 | 0 | 0.009 |
\(r\) | Pattern | Obs. freq | Exp. freq |
---|---|---|---|
1 | 11111 | 343 | 342.1 |
2 | 11011 | 153 | 151.3 |
3 | 10111 | 71 | 62.81 |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
28 | 01000 | 1 | 1.831 |
29 | 01010 | 1 | 3.276 |
30 | 01100 | 1 | 0.948 |
31 | 01101 | 0 | 0.013 |
32 | 01110 | 0 | 0.009 |
\(r\) | Pattern | Obs. freq | Exp. freq |
---|---|---|---|
1 | 11111 | 360.9 | 342.1 |
2 | 11011 | 181.2 | 151.3 |
3 | 10111 | 68.05 | 62.81 |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\vdots\) |
28 | 01000 | 1.716 | 1.831 |
29 | 01010 | 1.120 | 3.276 |
30 | 01100 | 0.591 | 0.948 |
31 | 01101 | 0 | 0.013 |
32 | 01110 | 0 | 0.009 |
The joint probability of observing a response pattern \({\mathbf c}_r=(c_{r1},\dots,c_{rp})^\top\), for each \(r=1,\dots,R:=2^p\), is given by \[ \pi_r = \operatorname{P}({\mathbf y}= {\mathbf c}_r) = \operatorname{P}(y_1=c_{r1},\dots,y_p=c_{rp}), \tag{1}\] with \(\sum_{r=1}^R \pi_r = 1\).
Suppose observations \({\mathcal Y}:= \big\{{\mathbf y}^{(h)}\big\}_{h=1}^n\) are obtained, where each unit \(h\) has a probability of selection \(1/w_h\).
For convenience, sampling weights are typically normalised so that \(\sum_{h=1}^n w_h = n\).
The weighted frequency of pattern \(r\) is then \[ \hat n_r = \sum_{h=1}^n w_h [{\mathbf y}^{(h)} = {\mathbf c}_r]. \tag{2}\]
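As a minimal sketch (assuming a data frame `dat` with binary columns `y1`–`y5` and a normalised weight column `wt`; these names are illustrative, not from the paper's code), the weighted pattern frequencies of Equation 2 can be tabulated as:

```r
# Weighted response-pattern frequencies (Equation 2); `dat` and `wt` are
# assumed, illustrative names.
pattern <- apply(dat[paste0("y", 1:5)], 1, paste0, collapse = "")
n_hat <- tapply(dat$wt, pattern, sum)    # \hat n_r for each observed pattern
sort(n_hat, decreasing = TRUE)           # patterns ordered by weighted frequency
```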
\[ \mathop{\mathrm{E}}(\hat{\mathbf n}) = n{\boldsymbol\pi}\hspace{2em}\text{and}\hspace{2em} \mathop{\mathrm{Var}}(\hat{\mathbf n}) = n\big( \ {\color{lightgray}\overbrace{\color{black}\mathop{\mathrm{diag}}({\boldsymbol\pi}) - {\boldsymbol\pi}{\boldsymbol\pi}^\top}^{{\boldsymbol\Sigma}}} \ \big). \]
\[ H_0: {\boldsymbol\pi}= {\boldsymbol\pi}({\boldsymbol\theta}) \]
Component likelihoods \(L({\boldsymbol\theta}\mid {\mathbf y}\in {\mathcal A}_k)\) are either conditional (Besag, 1974; Liang, 1987; Molenberghs & Verbeke, 2006) or marginal (Chandler & Bate, 2007; Cox & Reid, 2004; Varin, 2008) densities.
Composite likelihood enjoys several nice features (Varin et al., 2011): it is relatively efficient, robust, and easier to maximise (smoother likelihood surface).
One may enjoy the approximate picture despite not being able to see every blade of grass.
The pairwise log-likelihood takes the form (Katsikatsou et al., 2012) \[ \log \operatorname{\mathcal L_{\text P}}({\boldsymbol\theta}\mid {\mathcal Y}) = \sum_{i<j} \sum_{c}\sum_{c'} \hat n_{cc'}^{(ij)} \log \pi_{cc'}^{(ij)}({\boldsymbol\theta}), \tag{7}\] where \(\hat n_{cc'}^{(ij)} = \sum_h w_h [{\mathbf y}^{(h)}_i = c, {\mathbf y}^{(h)}_j = c']\).
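The weighted bivariate margins \(\hat n_{cc'}^{(ij)}\) are simply weighted two-way tables. For example, continuing the sketch with the assumed `dat` and `wt`:

```r
# Weighted 2x2 table for the pair (y1, y2): cell (c, c') contains
# \hat n_{cc'}^{(12)}, the sum of weights over units with y1 = c, y2 = c'.
xtabs(wt ~ y1 + y2, data = dat)
```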
The evaluation of Equation 7 now involves only bivariate normal integrals! \[ \pi_{cc'}^{(ij)}({\boldsymbol\theta}) = \iint_{\tilde{{\mathcal C}}_{cc'}} \phi_2\big({\mathbf y}^*_{ij} \mid {\mathbf 0}, {\boldsymbol\Sigma}_{y^*}^{(ij)} ({\boldsymbol\theta})\big) \mathop{\mathrm{d}}\hspace{0.5pt}\!{\mathbf y}^*_{ij} \]
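As an illustration (a sketch, not the package's internals), one such bivariate probability can be computed with the mvtnorm package, borrowing thresholds and loadings from the one-factor model shown later:

```r
library(mvtnorm)

# P(y_i = 1, y_j = 1) = P(y_i* > tau_i, y_j* > tau_j), where the implied
# correlation of the underlying variables is lambda_i * lambda_j in a
# one-factor model with standardised latent variable.
tau_i <- -1.43; tau_j <- -0.55          # illustrative thresholds
rho_ij <- 0.80 * 0.70                   # implied tetrachoric correlation
pi_11 <- pmvnorm(
  lower = c(tau_i, tau_j),
  upper = c(Inf, Inf),
  corr  = matrix(c(1, rho_ij, rho_ij, 1), nrow = 2)
)
```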
\[ \begin{aligned} \hat{\mathbf H}&= - \frac{1}{\sum_h w_h} \sum_h \nabla^2\log \operatorname{\mathcal L_{\text P}}({\boldsymbol\theta}\mid {\mathbf y}^{(h)}) \Bigg|_{{\boldsymbol\theta}= \hat{\boldsymbol\theta}_{\text{PL}}} \hspace{2em}\text{and} \\ \hat{\mathbf J}&= \frac{1}{\sum_h w_h} \sum_h \nabla\log \operatorname{\mathcal L_{\text P}}({\boldsymbol\theta}\mid {\mathbf y}^{(h)}) \nabla\log \operatorname{\mathcal L_{\text P}}({\boldsymbol\theta}\mid {\mathbf y}^{(h)})^\top \Bigg|_{{\boldsymbol\theta}= \hat{\boldsymbol\theta}_{\text{PL}}}. \end{aligned} \tag{9}\] The asymptotic covariance of \(\hat{\boldsymbol\theta}_{\text{PL}}\) is then estimated by the Godambe sandwich \(\hat{\mathbf H}^{-1}\hat{\mathbf J}\hat{\mathbf H}^{-1}/n\).
The most common full-information tests are Pearson's \(X^2\) and the likelihood ratio \(G^2\) statistics, both asymptotically distributed as chi square under \(H_0\). When \(p\) grows, however, the pattern table becomes sparse (many expected frequencies near zero, as seen above) and the chi-square approximation becomes unreliable.
Consider instead the fit of the lower-order moments:
Univariate: \(\ \dot\pi_i := \operatorname{P}(y_i = 1)\)
Bivariate: \(\ \dot\pi_{ij} := \operatorname{P}(y_i = 1, y_j=1)\)
Collectively, \[ {\boldsymbol\pi}_2 = \begin{pmatrix} \dot{\boldsymbol\pi}_1 \\ \dot{\boldsymbol\pi}_2 \\ \end{pmatrix} = \begin{pmatrix} (\dot\pi_1, \dots, \dot\pi_p)^\top \\ \big(\dot\pi_{ij}\big)_{i<j} \\ \end{pmatrix} \] which has dimension \[ S=p + p(p-1)/2 \ll R \] (for \(p=5\): \(S = 15\) versus \(R = 32\)).
Consider \({\mathbf T}_2: \mathbb{R}^R \to \mathbb{R}^S\) defined by \({\boldsymbol\pi}\mapsto {\boldsymbol\pi}_2\). To illustrate, consider \(p=3\) so that \(R=2^3=8\) and \(S=3+3=6\).
\[ {\color{lightgray}\overbrace{\color{black} \left( \begin{array}{c} \dot\pi_1 \\ \dot\pi_2 \\ \dot\pi_3 \\ \hdashline \dot\pi_{12} \\ \dot\pi_{13} \\ \dot\pi_{23} \\ \end{array} \right) \vphantom{ \begin{array}{c} \pi_{000} \\ \pi_{100} \\ \pi_{010} \\ \pi_{001} \\ \pi_{110} \\ \pi_{101} \\ \pi_{011} \\ \pi_{111} \\ \end{array} } }^{{\boldsymbol\pi}_2}} = {\color{lightgray}\overbrace{\color{black} \left( \begin{array}{cccccccc} 0 & 1 & 0 & 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 \\ \hdashline 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \\ \end{array} \right) \vphantom{ \begin{array}{c} \pi_{000} \\ \pi_{100} \\ \pi_{010} \\ \pi_{001} \\ \pi_{110} \\ \pi_{101} \\ \pi_{011} \\ \pi_{111} \\ \end{array} } }^{{\mathbf T}_2}} \ {\color{lightgray}\overbrace{\color{black} \left( \begin{array}{c} \pi_{000} \\ \pi_{100} \\ \pi_{010} \\ \pi_{001} \\ \pi_{110} \\ \pi_{101} \\ \pi_{011} \\ \pi_{111} \\ \end{array} \right) }^{{\boldsymbol\pi}}} \]
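A short sketch of this construction in R (the pattern ordering below follows `expand.grid`, which differs from the ordering displayed above):

```r
# Build the S x R indicator matrix T_2 for p = 3 and map joint pattern
# probabilities to the univariate and bivariate moments.
p <- 3
patterns <- as.matrix(expand.grid(rep(list(0:1), p)))  # R x p, R = 2^p rows
uni <- t(patterns)                                     # rows: [y_i = 1] indicators
biv <- combn(p, 2, function(ij) patterns[, ij[1]] * patterns[, ij[2]])
T2 <- rbind(uni, t(biv))                               # S x R, S = p + choose(p, 2)

pi_full <- rep(1 / 2^p, 2^p)  # e.g. a uniform joint distribution
T2 %*% pi_full                # pi_2: univariate margins 0.5, bivariate 0.25
```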
Theorem
Consider the lower-order residuals \(\hat{\mathbf e}_2 = {\mathbf p}_2 - {\boldsymbol\pi}_2(\hat{\boldsymbol\theta}_{\text{PL}})\), where \({\mathbf p}_2\) is the sample counterpart of \({\boldsymbol\pi}_2\). Then as \(n\to\infty\), \[ \sqrt n \, \hat{\mathbf e}_2 \xrightarrow{\text D} {\mathop{\mathrm{N}}}_S\left({\mathbf 0}, {\boldsymbol\Omega}_2\right), \] where \({\boldsymbol\Omega}_2 = \left( {\mathbf I}- {\boldsymbol\Delta}_2{\mathcal H}({\boldsymbol\theta})^{-1} {\mathbf B}({\boldsymbol\theta}) \right) {\boldsymbol\Sigma}_2 \left( {\mathbf I}- {\boldsymbol\Delta}_2{\mathcal H}({\boldsymbol\theta})^{-1} {\mathbf B}({\boldsymbol\theta}) \right)^\top\), with \({\boldsymbol\Sigma}_2 = {\mathbf T}_2{\boldsymbol\Sigma}{\mathbf T}_2^\top\), \({\boldsymbol\Delta}_2 = \partial{\boldsymbol\pi}_2({\boldsymbol\theta})/\partial{\boldsymbol\theta}^\top\), and \({\mathcal H}\) and \({\mathbf B}\) arising from the asymptotic linearisation of \(\hat{\boldsymbol\theta}_{\text{PL}}\).
To use \({\boldsymbol\Omega}_2\) in practice, replace the matrices involved with their sample ("hat") versions.
LIGOF test statistics generally take the quadratic form \[ X^2 = n \hat{\mathbf e}_2^\top \hat{\boldsymbol\Xi}\hat{\mathbf e}_2, \] where \({\boldsymbol\Xi}(\hat{\boldsymbol\theta}) =: \hat{\boldsymbol\Xi}\xrightarrow{\text P} {\boldsymbol\Xi}\) is some \(S\times S\) weight matrix. Generally, \(X^2\) is referred to a chi square distribution under \(H_0\), because (Mathai & Provost, 1992) \[ X^2 \xrightarrow{\text D} \sum_{s=1}^S \delta_s\chi^2_1 \quad \text{as} \quad n\to\infty, \] where the \(\delta_s\) are the eigenvalues of \({\mathbf M}= {\boldsymbol\Omega}_2{\boldsymbol\Xi}\). Two cases:

- If \({\boldsymbol\Xi}\) is chosen so that the non-zero eigenvalues of \({\mathbf M}\) all equal 1 (as for the Wald-type statistics below), then \(X^2\) is asymptotically \(\chi^2\) with \(S-m\) degrees of freedom.
- Otherwise, the weighted sum must be approximated, e.g. by moment matching, giving estimated degrees of freedom.
\[\begin{gathered}X^2 = n \hat{\mathbf e}_2^\top \hat{\boldsymbol\Xi}\hat{\mathbf e}_2 \\\sqrt n \hat{\mathbf e}_2 \approx {\mathop{\mathrm{N}}}_S ({\mathbf 0}, \hat{\boldsymbol\Omega}_2)\end{gathered}\]
 | Name | \(\hat{{\boldsymbol\Xi}}\) | D.f. |
---|---|---|---|
1 | Wald | \(\hat{{\boldsymbol\Omega}}^+_2\) | \(S-m\) |
2 | Wald (VCF) | \({\boldsymbol\Xi}\hat{{\boldsymbol\Omega}}_2{\boldsymbol\Xi}\) | \(S-m\) |
3 | Wald (Diag.) | \(\mathop{\mathrm{diag}}(\hat{{\boldsymbol\Omega}}_2)^{-1}\) | est. |
4 | Pearson | \(\mathop{\mathrm{diag}}(\hat{{\boldsymbol\pi}}_2)^{-1}\) | est. |
5 | RSS | \(\mathbf I\) | est. |
6 | Multinomial | \(\hat{{\boldsymbol\Sigma}}_2^{-1}\) | est. |
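For the rows with estimated degrees of freedom, the reference distribution is calibrated to the weighted chi-square sum. A two-moment sketch (the MM3 variants in the output below match three moments; `Omega2_hat`, `Xi_hat`, and `X2` are assumed to be at hand):

```r
# Two-moment matching: approximate sum_s delta_s chi^2_1 by a * chisq(df),
# equating E[X2] = tr(M) and Var[X2] = 2 tr(M^2).
M  <- Omega2_hat %*% Xi_hat
c1 <- sum(diag(M))            # tr(M):   sum of eigenvalues delta_s
c2 <- sum(diag(M %*% M))      # tr(M^2): sum of squared eigenvalues
a  <- c2 / c1                 # scale factor
df <- c1^2 / c2               # estimated degrees of freedom
pchisq(X2 / a, df = df, lower.tail = FALSE)   # approximate p-value
```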
M1: 1F5V

Loadings
5 x 1 sparse Matrix of class "dgCMatrix"
[1,] 0.80
[2,] 0.70
[3,] 0.47
[4,] 0.38
[5,] 0.34
Thresholds
[1] -1.43 -0.55 -0.13 -0.72
[5] -1.13
Factor correlations
[1] 1
M2: 1F8V

Loadings
8 x 1 sparse Matrix of class "dgCMatrix"
[1,] 0.80
[2,] 0.70
[3,] 0.47
[4,] 0.38
[5,] 0.34
[6,] 0.80
[7,] 0.70
[8,] 0.47
Thresholds
[1] -1.43 -0.55 -0.13 -0.72
[5] -1.13 -1.43 -0.55 -0.13
Factor correlations
[1] 1
M3: 1F15V

Loadings
15 x 1 sparse Matrix of class "dgCMatrix"
[1,] 0.80
[2,] 0.70
[3,] 0.47
[4,] 0.38
[5,] 0.34
[6,] 0.80
[7,] 0.70
[8,] 0.47
[9,] 0.38
[10,] 0.34
[11,] 0.80
[12,] 0.70
[13,] 0.47
[14,] 0.38
[15,] 0.34
Thresholds
[1] -1.43 -0.55 -0.13 -0.72
[5] -1.13 -1.43 -0.55 -0.13
[9] -0.72 -1.13 -1.43 -0.55
[13] -0.13 -0.72 -1.13
Factor correlations
[1] 1
M4: 2F10V

Loadings
10 x 2 sparse Matrix of class "dgCMatrix"
[1,] 0.80 .
[2,] 0.70 .
[3,] 0.47 .
[4,] 0.38 .
[5,] 0.34 .
[6,] . 0.80
[7,] . 0.70
[8,] . 0.47
[9,] . 0.38
[10,] . 0.34
Thresholds
[1] -1.43 -0.55 -0.13 -0.72
[5] -1.13 -1.43 -0.55 -0.13
[9] -0.72 -1.13
Factor correlations
[,1] [,2]
[1,] 1.0 0.3
[2,] 0.3 1.0
M5: 3F15V

Loadings
15 x 3 sparse Matrix of class "dgCMatrix"
[1,] 0.80 . .
[2,] 0.70 . .
[3,] 0.47 . .
[4,] 0.38 . .
[5,] 0.34 . .
[6,] . 0.80 .
[7,] . 0.70 .
[8,] . 0.47 .
[9,] . 0.38 .
[10,] . 0.34 .
[11,] . . 0.80
[12,] . . 0.70
[13,] . . 0.47
[14,] . . 0.38
[15,] . . 0.34
Thresholds
[1] -1.43 -0.55 -0.13 -0.72
[5] -1.13 -1.43 -0.55 -0.13
[9] -0.72 -1.13 -1.43 -0.55
[13] -0.13 -0.72 -1.13
Factor correlations
[,1] [,2] [,3]
[1,] 1.0 0.2 0.3
[2,] 0.2 1.0 0.4
[3,] 0.3 0.4 1.0
Using M1: 1F5V, generate a population of fixed size \(N\). Then assign each unit \(h\) a probability of selection \[
w_h^{-1} = \frac{1}{1 + \exp(y_1^*)},
\] where \(y_1^*\) is the underlying variable for item 1. Larger values of \(y_1^*\) result in smaller probabilities of selection (informative sampling).
Sample \(n\in\{500, 1000, 5000\}\) units from a population of size \(N=n/0.01\), i.e. a 1% sampling fraction, so no finite population correction (FPC) is needed (Lumley, 2004).
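A sketch of this sampling scheme (the object names are illustrative):

```r
set.seed(123)
n <- 1000
N <- n / 0.01                        # population size; 1% sampling fraction
y1_star <- rnorm(N)                  # stand-in for the underlying variable of y1
p_sel <- 1 / (1 + exp(y1_star))      # probability of selection, w_h^{-1}
idx <- sample(N, size = n, prob = p_sel)  # approximate PPS draw
wt <- 1 / p_sel[idx]                 # design weights for the sampled units
wt <- wt * n / sum(wt)               # normalise so that sum(wt) = n
```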
In repeated sampling (\(B=1000\) replications), we assess the performance of pairwise maximum likelihood estimation without (PML) and with (PMLW) sampling weights, in terms of bias, confidence interval coverage, and the ratio of empirical SD to estimated SE.
Bias (%) of the PML and PMLW estimates:

 | True value | PML (\(n{=}500\)) | PMLW (\(n{=}500\)) | PML (\(n{=}1000\)) | PMLW (\(n{=}1000\)) | PML (\(n{=}5000\)) | PMLW (\(n{=}5000\)) |
---|---|---|---|---|---|---|---|
**Loadings** | | | | | | | |
\(\lambda_1\) | 0.80 | −2.6% | −1.2% | −2.6% | −0.8% | −2.3% | −0.3% |
\(\lambda_2\) | 0.70 | 8.5% | 0.2% | 8.7% | 0.3% | 8.4% | −0.3% |
\(\lambda_3\) | 0.47 | 14.8% | 1.8% | 13.8% | 0.7% | 13.4% | 0.2% |
\(\lambda_4\) | 0.38 | 13.2% | 1.1% | 12.8% | 0.5% | 12.9% | 0.5% |
\(\lambda_5\) | 0.34 | 10.2% | −1.5% | 9.8% | −1.8% | 11.4% | −0.1% |
**Thresholds** | | | | | | | |
\(\tau_1\) | −1.43 | −2.2% | 0.5% | −2.4% | 0.3% | −2.5% | 0.2% |
\(\tau_2\) | −0.55 | 5.6% | 0.3% | 5.6% | 0.2% | 5.5% | 0.0% |
\(\tau_3\) | −0.13 | 43.4% | 1.4% | 42.3% | 0.2% | 42.0% | −0.2% |
\(\tau_4\) | −0.72 | 4.5% | 0.2% | 4.6% | 0.3% | 4.5% | 0.1% |
\(\tau_5\) | −1.13 | 2.0% | 0.1% | 2.6% | 0.6% | 2.1% | 0.1% |
**Factor correlations** | | | | | | | |
\(\rho_{12}\) | 0.20 | 20.0% | 5.4% | 17.0% | 2.8% | 16.7% | 2.6% |
\(\rho_{13}\) | 0.30 | 14.6% | 1.7% | 12.6% | −0.3% | 13.0% | −0.0% |
\(\rho_{23}\) | 0.40 | 4.0% | 1.7% | 2.6% | −0.0% | 2.5% | −0.1% |
Coverage (nominal 95%) and SD/SE ratios; the three column groups correspond to \(n = 500\), \(n = 1000\), and \(n = 5000\):

 | Cov. PML | Cov. PMLW | SD/SE PML | SD/SE PMLW | Cov. PML | Cov. PMLW | SD/SE PML | SD/SE PMLW | Cov. PML | Cov. PMLW | SD/SE PML | SD/SE PMLW |
---|---|---|---|---|---|---|---|---|---|---|---|---|
**Loadings** | | | | | | | | | | | | |
\(\lambda_1\) | 95.1% | 95.8% | 1.02 | 0.99 | 93.6% | 94.1% | 1.07 | 1.02 | 91.7% | 96.0% | 1.19 | 0.98 |
\(\lambda_2\) | 89.1% | 96.1% | 1.22 | 0.94 | 80.0% | 95.6% | 1.48 | 0.97 | 31.8% | 95.7% | 2.66 | 1.01 |
\(\lambda_3\) | 82.5% | 95.7% | 1.33 | 0.94 | 75.1% | 94.3% | 1.58 | 1.00 | 23.5% | 95.1% | 2.88 | 1.01 |
\(\lambda_4\) | 91.4% | 95.3% | 1.14 | 0.97 | 83.7% | 94.7% | 1.33 | 1.01 | 50.7% | 95.3% | 2.16 | 0.96 |
\(\lambda_5\) | 92.6% | 95.7% | 1.04 | 0.96 | 92.1% | 96.0% | 1.09 | 0.96 | 72.0% | 94.4% | 1.67 | 1.01 |
**Thresholds** | | | | | | | | | | | | |
\(\tau_1\) | 95.2% | 96.1% | 1.01 | 0.94 | 92.7% | 96.8% | 1.09 | 0.92 | 72.7% | 96.6% | 1.68 | 0.93 |
\(\tau_2\) | 96.3% | 97.9% | 0.95 | 0.86 | 93.1% | 96.4% | 1.12 | 0.92 | 66.6% | 96.5% | 1.80 | 0.89 |
\(\tau_3\) | 86.0% | 94.8% | 1.39 | 0.99 | 73.4% | 96.9% | 1.64 | 0.93 | 10.3% | 96.9% | 3.19 | 0.94 |
\(\tau_4\) | 91.9% | 94.1% | 1.12 | 1.01 | 89.6% | 95.1% | 1.22 | 0.97 | 63.9% | 95.4% | 1.90 | 0.96 |
\(\tau_5\) | 95.1% | 95.8% | 1.02 | 0.98 | 92.5% | 95.7% | 1.12 | 0.98 | 80.5% | 95.4% | 1.46 | 1.02 |
**Factor correlations** | | | | | | | | | | | | |
\(\rho_{12}\) | 92.1% | 93.6% | 1.10 | 1.01 | 91.2% | 94.6% | 1.13 | 1.00 | 79.6% | 95.3% | 1.49 | 0.97 |
\(\rho_{13}\) | 92.7% | 95.9% | 1.07 | 0.96 | 88.8% | 94.1% | 1.20 | 1.04 | 70.6% | 95.5% | 1.71 | 0.98 |
\(\rho_{23}\) | 91.8% | 92.7% | 1.06 | 1.04 | 95.6% | 95.1% | 0.99 | 0.99 | 93.7% | 95.1% | 1.05 | 0.99 |
Simulate a population of \(10^6\) students clustered within classrooms and stratified by school type (which correlates with ability).
Type | \(N\) (schools) | Avg. classes | Avg. class size |
---|---|---|---|
A | 400 | 33.0 | 15.2 |
B | 1000 | 19.6 | 25.6 |
C | 600 | 24.9 | 20.4 |
Intraclass correlation: \(\text{ICC} \in (0.05, 0.60)\)
⚡ PML estimation offers a computationally efficient alternative that mitigates some challenges inherent in the UV approach for binary factor models.
🧩 For samples with unequal probability of selection, sampling weights are easily incorporated in the PML estimation routine.
🔍 Sparsity impairs the dependability of GOF tests, but this is circumvented by considering lower-order statistics.
R software implementation in {lavaan} (>= 0.6-17) to carry out weighted PML, and separately in {lavaan.bingof} for the LIGOF tests.
fit <- lavaan::sem(
  model = "eta1 =~ y1 + y2 + y3 + y4 + y5",          # one-factor, five-item model (M1: 1F5V)
  data = lavaan.bingof::gen_data_bin_wt(n = 1000),   # simulated binary data with weights
  std.lv = TRUE,                                     # fix latent variance to 1
  estimator = "PML",                                 # pairwise maximum likelihood
  sampling.weights = "wt"                            # column of sampling weights
)
lavaan.bingof::all_tests(fit)
# A tibble: 6 × 6
X2 df name pval Xi_rank S
<dbl> <dbl> <chr> <dbl> <int> <int>
1 3.39 5 Wald 0.641 13 15
2 3.34 5 WaldVCF 0.648 5 15
3 1.40 3.00 WaldDiag,MM3 0.705 15 15
4 1.10 3.35 Pearson,MM3 0.826 15 15
5 1.77 4.04 RSS,MM3 0.782 15 15
6 1.11 3.35 Multn,MM3 0.825 15 15