Weighted pairwise likelihood and limited information goodness-of-fit tests for binary factor models

Department of Statistics and Data Science Seminar, NUS

Haziq Jamil

Assistant Professor in Statistics, Universiti Brunei Darussalam
Visiting Fellow, London School of Economics and Political Science

https://haziqj.ml/plgof-nus/

October 18, 2024

Irini Moustaki
London School of Economics and Political Science

Chris Skinner (1954–2020)
London School of Economics and Political Science


Jamil, H., Moustaki, I., & Skinner, C. (2024). Pairwise likelihood estimation and limited information goodness-of-fit test statistics for binary factor analysis models under complex survey sampling. Br. J. Math. Stat. Psychol. Advance online publication. https://doi.org/10.1111/bmsp.12358

Introduction

Context

Employ latent variable models (factor models) to analyse binary data \(y_1,\dots,y_p\) collected via simple random or complex sampling.

(Psychometrics)
Behavioural checklist

(Education)
Achievement test

(Sociology)
Intergenerational support

Introduction (cont.)

Raw data for \(n\) individuals on \(p=5\) binary items; each response vector corresponds to a pattern:

\(i\) \(y_1\) \(y_2\) \(y_3\) \(y_4\) \(y_5\) Pattern
1 1 0 0 1 1 10011
2 1 1 1 1 1 11111
3 1 1 1 0 1 11101
\(\vdots\) \(\vdots\) \(\vdots\)
\(n\) 1 0 0 1 1 10011

Collapsing over the \(R = 2^p = 32\) distinct patterns gives observed frequencies, and fitting a model yields expected frequencies:

\(r\) Pattern Obs. freq Exp. freq
1 11111 343 342.1
2 11011 153 151.3
3 10111 71 62.81
\(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\)
28 01000 1 1.831
29 01010 1 3.276
30 01100 1 0.948
31 01101 0 0.013
32 01110 0 0.009

Note the sparsity in the tail of the table: many patterns are observed once or not at all. Under complex sampling, the weighted observed frequencies need not be whole numbers:

\(r\) Pattern Obs. freq Exp. freq
1 11111 360.9 342.1
2 11011 181.2 151.3
3 10111 68.05 62.81
\(\vdots\) \(\vdots\) \(\vdots\) \(\vdots\)
28 01000 1.716 1.831
29 01010 1.120 3.276
30 01100 0.591 0.948
31 01101 0 0.013
32 01110 0 0.009

Definitions

  • Let \({\mathbf y}= (y_1, \ldots, y_p)^\top \in\{0,1\}^p\) be a vector of Bernoulli random variables.
  • The joint probability of observing a response pattern \({\mathbf c}_r=(c_{r1},\dots,c_{rp})^\top\), for each \(r=1,\dots,R:=2^p\), is given by \[ \pi_r = \operatorname{P}({\mathbf y}= {\mathbf c}_r) = \operatorname{P}(y_1=c_{r1},\dots,y_p=c_{rp}), \tag{1}\] with \(\sum_{r=1}^R \pi_r = 1\).

  • Suppose observations \({\mathcal Y}:= \big\{{\mathbf y}^{(h)}\big\}_{h=1}^n\) are obtained, where each unit \(h\) has a probability of selection \(1/w_h\).

  • For convenience, sampling weights are typically normalised so that \(\sum_{h=1}^n w_h = n\).

    • Simple random sampling (SRS): \(w_h=1, \forall h\).
    • Stratified sampling: \(w_h = N_{s(h)}/N\), the stratum fractions.
    • Etc.

Multinomial distribution

  • Let \(p_r = \hat n_r \big/ \sum_h w_h\) be the \(r\)th entry of the \(R\)-vector of proportions \({\mathbf p}\), with

\[ \hat n_r = \sum_{h=1}^n w_h [{\mathbf y}^{(h)} = {\mathbf c}_r]. \tag{2}\]

  • The random vector \(\hat{\mathbf n}= (\hat n_1,\dots,\hat n_R)^\top\) follows a multinomial distribution with \(n\) trials, \(R\) categories, and probability vector \({\boldsymbol\pi}:=(\pi_1,\dots,\pi_R)^\top\), with

\[ \mathop{\mathrm{E}}(\hat{\mathbf n}) = n{\boldsymbol\pi}\hspace{2em}\text{and}\hspace{2em} \mathop{\mathrm{Var}}(\hat{\mathbf n}) = n\big( \overbrace{\mathop{\mathrm{diag}}({\boldsymbol\pi}) - {\boldsymbol\pi}{\boldsymbol\pi}^\top}^{{\boldsymbol\Sigma}} \big). \]

  • It is well known (Agresti, 2012) that for IID samples \[ \sqrt n ({\mathbf p}- {\boldsymbol\pi}) \xrightarrow{\text D} {\mathop{\mathrm{N}}}_R({\mathbf 0}, {\boldsymbol\Sigma}) \tag{3}\] as \(n\to\infty\). This result extends to complex sampling (Fuller, 2009), although \({\boldsymbol\Sigma}\) need not take the multinomial form.
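
As a small illustration, a minimal R sketch (not from the paper) of the weighted pattern frequencies in Equation 2 and their proportions, assuming a 0/1 response matrix Y and normalised weights w (both hypothetical):

# Sketch: weighted response-pattern frequencies (Equation 2) and the
# corresponding proportions p_r. Y is an n x p matrix of 0/1 responses,
# w a length-n vector of normalised sampling weights (both assumed given).
pattern_props <- function(Y, w) {
  pat <- apply(Y, 1, paste0, collapse = "")  # e.g. "10011"
  n_hat <- tapply(w, pat, sum)               # weighted frequency per observed pattern
  n_hat / sum(w)                             # proportions; unobserved patterns are absent
}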

Parametric models

\[ H_0: {\boldsymbol\pi}= {\boldsymbol\pi}({\boldsymbol\theta}) \]

  • E.g. a binary factor model via the underlying variable (UV) approach (subject to identifiability constraints) \[ \begin{gathered} y_i = \begin{cases} 1 & y_i^* > \tau_i \\ 0 & y_i^* \leq \tau_i \end{cases} \\ {\mathbf y}^* = {\boldsymbol\Lambda}{\boldsymbol\eta}+ {\boldsymbol\epsilon}\\ {\boldsymbol\eta}\sim {\mathop{\mathrm{N}}}_q({\mathbf 0}, {\boldsymbol\Psi}), \hspace{3pt} {\boldsymbol\epsilon}\sim {\mathop{\mathrm{N}}}_p({\mathbf 0}, {\boldsymbol\Theta}_{\epsilon}) \end{gathered} \tag{4}\]
  • The log-likelihood for \({\boldsymbol\theta}= ({\boldsymbol\lambda}^\top, {\boldsymbol\rho}^\top, {\boldsymbol\tau}^\top)^\top \in\mathbb{R}^m\) (loadings, factor correlations, and thresholds) is \[ \log L({\boldsymbol\theta}\mid {\mathcal Y}) = \sum_{r=1}^R \hat n_r \log \pi_r({\boldsymbol\theta}) \tag{5}\] where \(\pi_r({\boldsymbol\theta}) = \int_{{\mathcal C}_r} \phi_p({\mathbf y}^* \mid {\mathbf 0}, {\boldsymbol\Lambda}{\boldsymbol\Psi}{\boldsymbol\Lambda}^\top + {\boldsymbol\Theta}_\epsilon) \mathop{\mathrm{d}}\hspace{0.5pt}\!{\mathbf y}^*\).
  • Full information maximum likelihood (FIML) may be difficult in practice (it requires \(p\)-dimensional normal integrals, and perfect separation may occur).
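
For intuition, a minimal R sketch (not from the paper) of simulating binary data from the one-factor version of model (4), assuming standardised latent responses so that the diagonal of \({\boldsymbol\Theta}_\epsilon\) is \(1-\lambda_i^2\); the loadings and thresholds mirror the simulation study's M1 values:

# Sketch: simulate n responses from a one-factor UV model with p = 5 items.
set.seed(1)
n      <- 1000
lambda <- c(0.80, 0.70, 0.47, 0.38, 0.34)                     # loadings
tau    <- c(-1.43, -0.55, -0.13, -0.72, -1.13)                # thresholds
eta    <- rnorm(n)                                            # latent factor, Psi = 1
eps    <- sapply(sqrt(1 - lambda^2), rnorm, n = n, mean = 0)  # unique errors
ystar  <- tcrossprod(eta, lambda) + eps                       # n x p latent responses y*
Y      <- 1L * (ystar > matrix(tau, n, 5, byrow = TRUE))      # y = 1 iff y* > tau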

Composite likelihood

  • Terminology: also known as pseudo-likelihood, quasi-likelihood (in the sense of Wedderburn (1974), or of misspecified models), or limited information methods.
  • Let \(\{{\mathcal A}_1,\dots,{\mathcal A}_K\}\) be a set of marginal or conditional events (partitioning the variable space). The composite likelihood is defined as (Lindsay, 1988) \[ {\mathcal L}({\boldsymbol\theta}\mid {\mathbf y}) = \prod_{k=1}^K L({\boldsymbol\theta}\mid {\mathbf y}\in {\mathcal A}_k)^{\omega_k}, \] where the \(\omega_k\) are optional non-negative weights.

An analogy

One may enjoy the approximate picture despite
not being able to see every blade of grass.

Pairwise likelihood estimation

  • For pairs of variables \(y_i\) and \(y_j\), \(i,j=1,\dots,p\), and \(i<j\), define \[ \pi_{cc'}^{(ij)}({\boldsymbol\theta}) = \operatorname{P}_{{\boldsymbol\theta}}(y_i = c, y_j = c'), \hspace{2em} c,c'\in\{0,1\}. \tag{6}\] There are \(\tilde R = 4 \times \binom{p}{2}\) such probabilities, with \(\sum_{c,c'} \pi_{cc'}^{(ij)}({\boldsymbol\theta}) = 1\) for each pair \((i,j)\).
  • The pairwise log-likelihood takes the form (Katsikatsou et al., 2012) \[ \log \operatorname{\mathcal L_{\text P}}({\boldsymbol\theta}\mid {\mathcal Y}) = \sum_{i<j} \sum_{c}\sum_{c'} \hat n_{cc'}^{(ij)} \log \pi_{cc'}^{(ij)}({\boldsymbol\theta}), \tag{7}\] where \(\hat n_{cc'}^{(ij)} = \sum_h w_h [{\mathbf y}^{(h)}_i = c, {\mathbf y}^{(h)}_j = c']\).

  • The evaluation of Equation 7 now involves only bivariate normal integrals! \[ \pi_{cc'}^{(ij)}({\boldsymbol\theta}) = \iint_{\tilde{{\mathcal C}}_{cc'}} \phi_2\big({\mathbf y}^*_{ij} \mid {\mathbf 0}, {\boldsymbol\Sigma}_{y^*}^{(ij)} ({\boldsymbol\theta})\big) \mathop{\mathrm{d}}\hspace{0.5pt}\!{\mathbf y}^*_{ij} \]
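
For example, a sketch of one such bivariate probability using mvtnorm::pmvnorm, under a one-factor model with standardised \({\mathbf y}^*\) so that \(\operatorname{Cor}(y_i^*, y_j^*) = \lambda_i\lambda_j\) (the numerical values are illustrative):

library(mvtnorm)

# pi_{11}^{(ij)}(theta) = P(y_i* > tau_i, y_j* > tau_j)
pi_11 <- function(lambda_i, lambda_j, tau_i, tau_j) {
  rho <- lambda_i * lambda_j  # implied tetrachoric correlation (one factor)
  pmvnorm(lower = c(tau_i, tau_j), upper = c(Inf, Inf),
          corr = matrix(c(1, rho, rho, 1), 2, 2))
}
pi_11(0.8, 0.7, -1.43, -0.55)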

MPLE properties

  • Let \(\hat{\boldsymbol\theta}_{\text{PL}}= \mathop{\mathrm{argmax}}_{{\boldsymbol\theta}} \operatorname{\mathcal L_{\text P}}({\boldsymbol\theta}\mid {\mathcal Y})\). Under certain regularity conditions (Varin et al., 2011), as \(n\to\infty\), \[ \sqrt n (\hat{\boldsymbol\theta}_{\text{PL}}- {\boldsymbol\theta}) \xrightarrow{\text D} {\mathop{\mathrm{N}}}_m \left( {\mathbf 0}, \left\{ {\mathcal H}({\boldsymbol\theta}){\mathcal J}({\boldsymbol\theta})^{-1}{\mathcal H}({\boldsymbol\theta}) \right\}^{-1} \right),\hspace{1em} \text{where} \tag{8}\]
    • \({\mathcal H}({\boldsymbol\theta})=-\mathop{\mathrm{E}}\big[\nabla^2\log \operatorname{\mathcal L_{\text P}}({\boldsymbol\theta}\mid {\mathbf y}^{(h)})\big]\) is the sensitivity matrix; and
    • \({\mathcal J}({\boldsymbol\theta})=\mathop{\mathrm{Var}}\big[\nabla\log\operatorname{\mathcal L_{\text P}}({\boldsymbol\theta}\mid {\mathbf y}^{(h)})\big]\) is the variability matrix.

\[ \begin{aligned} \hat{\mathbf H}&= - \frac{1}{\sum_h w_h} \sum_h \nabla^2\log \operatorname{\mathcal L_{\text P}}({\boldsymbol\theta}\mid {\mathbf y}^{(h)}) \Bigg|_{{\boldsymbol\theta}= \hat{\boldsymbol\theta}_{\text{PL}}} \hspace{2em}\text{and} \\ \hat{\mathbf J}&= \frac{1}{\sum_h w_h} \sum_h \nabla\log \operatorname{\mathcal L_{\text P}}({\boldsymbol\theta}\mid {\mathbf y}^{(h)}) \nabla\log \operatorname{\mathcal L_{\text P}}({\boldsymbol\theta}\mid {\mathbf y}^{(h)})^\top \Bigg|_{{\boldsymbol\theta}= \hat{\boldsymbol\theta}_{\text{PL}}}. \end{aligned} \tag{9}\]
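
Numerically, the sandwich variance implied by Equations 8 and 9 can be assembled as in the sketch below, assuming G is an \(n\times m\) matrix whose \(h\)th row is the weighted per-unit gradient contribution at \(\hat{\boldsymbol\theta}_{\text{PL}}\), and H is \(\hat{\mathbf H}\) (both obtained elsewhere, e.g. numerically; names hypothetical):

# Sketch: sandwich variance of theta-hat from (8)-(9); G and H assumed given.
vcov_pml <- function(G, H, w) {
  J <- crossprod(G) / sum(w)      # J-hat as in (9)
  Hinv <- solve(H)
  (Hinv %*% J %*% Hinv) / sum(w)  # {H J^{-1} H}^{-1} / n = H^{-1} J H^{-1} / n
}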

Limited information GOF tests

Goodness-of-fit

  • GOF tests are usually constructed by inspecting the fit of the joint probabilities \(\hat\pi_r := \pi_r(\hat{\boldsymbol\theta})\).
  • Most common tests are

    • LR: \(X^2 = 2n\sum_r p_r\log( p_r/\hat\pi_r)\);
    • Pearson: \(X^2 = n\sum_r ( p_r - \hat\pi_r)^2 / \hat\pi_r\).

    These tests are asymptotically chi square (with \(R - m - 1\) degrees of freedom) under \(H_0\).

  • Both are likely to face sparsity issues (small or zero cell counts), which distort the chi-square approximation.

Limited information goodness-of-fit (LIGOF)

Consider instead the fit of the lower order moments.

  • Univariate: \(\ \dot\pi_i := \operatorname{P}(y_i = 1)\)

  • Bivariate: \(\ \dot\pi_{ij} := \operatorname{P}(y_i = 1, y_j=1)\)

  • Collectively, \[ {\boldsymbol\pi}_2 = \begin{pmatrix} \dot{\boldsymbol\pi}_1 \\ \dot{\boldsymbol\pi}_2 \end{pmatrix} = \begin{pmatrix} (\dot\pi_1, \dots, \dot\pi_p)^\top \\ \big(\dot\pi_{ij}\big)_{i<j} \end{pmatrix}, \] which is of dimension \[ S = p + p(p-1)/2 \ll R. \]
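
A sketch of the weighted sample counterpart \({\mathbf p}_2\) (used later in the residuals), assuming a 0/1 data matrix Y and normalised weights w (both hypothetical):

# Sketch: weighted univariate and bivariate margins, stacked as p2.
p2_hat <- function(Y, w) {
  uni <- colSums(w * Y) / sum(w)                 # P-hat(y_i = 1)
  pairs <- combn(ncol(Y), 2)
  bi <- apply(pairs, 2, function(ij)
    sum(w * Y[, ij[1]] * Y[, ij[2]]) / sum(w))   # P-hat(y_i = 1, y_j = 1)
  c(uni, bi)                                     # length S = p + p(p-1)/2
}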

Transformation matrix

Consider \({\mathbf T}_2: \mathbb{R}^R \to \mathbb{R}^S\) defined by \({\boldsymbol\pi}\mapsto {\boldsymbol\pi}_2\). To illustrate, consider \(p=3\) so that \(R=2^3=8\) and \(S=3+3=6\).

\[ \overbrace{\left( \begin{array}{c} \dot\pi_1 \\ \dot\pi_2 \\ \dot\pi_3 \\ \hdashline \dot\pi_{12} \\ \dot\pi_{13} \\ \dot\pi_{23} \end{array} \right)}^{{\boldsymbol\pi}_2} = \overbrace{\left( \begin{array}{cccccccc} 0 & 1 & 0 & 0 & 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 1 & 1 & 1 \\ \hdashline 0 & 0 & 0 & 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 \end{array} \right)}^{{\mathbf T}_2} \ \overbrace{\left( \begin{array}{c} \pi_{000} \\ \pi_{100} \\ \pi_{010} \\ \pi_{001} \\ \pi_{110} \\ \pi_{101} \\ \pi_{011} \\ \pi_{111} \end{array} \right)}^{{\boldsymbol\pi}} \]
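
The matrix \({\mathbf T}_2\) can be constructed programmatically; a sketch for general \(p\) (note that expand.grid orders the patterns differently from the display above):

# Sketch: build T2 for p binary items, mapping joint pattern probabilities
# to univariate and bivariate margins.
make_T2 <- function(p) {
  patterns <- as.matrix(expand.grid(rep(list(0:1), p)))  # 2^p patterns (rows)
  T1 <- t(patterns)                                      # rows pick out P(y_i = 1)
  pairs <- combn(p, 2)
  Tb <- apply(pairs, 2, function(ij) patterns[, ij[1]] * patterns[, ij[2]])
  rbind(T1, t(Tb))                                       # S x 2^p
}
dim(make_T2(3))  # 6 x 8, as in the display above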

Asymptotic distribution of residuals

Theorem

Consider the lower order residuals \(\hat{\mathbf e}_2 = {\mathbf p}_2 - {\boldsymbol\pi}_2(\hat{\boldsymbol\theta}_{\text{PL}})\). Then as \(n\to\infty\), \[ \sqrt n \, \hat{\mathbf e}_2 \xrightarrow{\text D} {\mathop{\mathrm{N}}}_S\left({\mathbf 0}, {\boldsymbol\Omega}_2\right) \] where \({\boldsymbol\Omega}_2 = \left( {\mathbf I}- {\boldsymbol\Delta}_2{\mathcal H}({\boldsymbol\theta})^{-1} {\mathbf B}({\boldsymbol\theta}) \right) {\boldsymbol\Sigma}_2 \left( {\mathbf I}- {\boldsymbol\Delta}_2{\mathcal H}({\boldsymbol\theta})^{-1} {\mathbf B}({\boldsymbol\theta}) \right)^\top\),

  • \({\boldsymbol\Sigma}_2 = {\mathbf T}_2{\boldsymbol\Sigma}{\mathbf T}_2^\top\) (uni & bivariate multinomial matrix);
  • \({\boldsymbol\Delta}_2 = {\mathbf T}_2 \big(\partial\pi_r({\boldsymbol\theta}) / \partial\theta_k \big)_{r,k}\) (uni & bivariate derivatives);
  • \({\mathcal H}({\boldsymbol\theta})\) (sensitivity matrix); and
  • \({\mathbf B}({\boldsymbol\theta})\) (some transformation matrix dependent on \({\boldsymbol\theta}\)).

To use \({\boldsymbol\Omega}_2\) in practice, replace the relevant matrices by their plug-in ("hat") versions evaluated at \(\hat{\boldsymbol\theta}_{\text{PL}}\).
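
A plug-in sketch, assuming the four ingredient matrices have already been estimated at \(\hat{\boldsymbol\theta}_{\text{PL}}\):

# Sketch: Omega2-hat assembled from its estimated components.
Omega2_hat <- function(Sigma2, Delta2, H, B) {
  A <- diag(nrow(Sigma2)) - Delta2 %*% solve(H) %*% B  # I - Delta2 H^{-1} B
  A %*% Sigma2 %*% t(A)
}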

Distribution of test statistics

LIGOF test statistics generally take the quadratic form \[ X^2 = n \hat{\mathbf e}_2^\top \hat{\boldsymbol\Xi}\hat{\mathbf e}_2, \] where \({\boldsymbol\Xi}(\hat{\boldsymbol\theta}) =: \hat{\boldsymbol\Xi}\xrightarrow{\text P} {\boldsymbol\Xi}\) is some \(S\times S\) weight matrix. Generally, \(X^2\) is referred to a chi square distribution under \(H_0\), because (Mathai & Provost, 1992) \[ X^2 \xrightarrow{\text D} \sum_{s=1}^S \delta_s\chi^2_1 \quad \text{as} \quad n\to\infty, \] where the \(\delta_s\) are the eigenvalues of \({\mathbf M}= {\boldsymbol\Omega}_2{\boldsymbol\Xi}\). Two cases:

  1. If \({\mathbf M}\) is idempotent, then the chi-square distribution is exact.
  2. Otherwise, \(X^2\) is a weighted sum of chi squares, which can be approximated by a chi square whose degrees of freedom must be estimated (Cai et al., 2006); see the moment-matching sketch below.
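
A sketch of a two-moment (Satterthwaite-type) matching in the spirit of Cai et al. (2006), given estimates of \({\boldsymbol\Omega}_2\) and \({\boldsymbol\Xi}\):

# Sketch: approximate p-value for X2 by matching the first two moments of
# sum_s delta_s chi^2_1, where the delta_s are eigenvalues of M = Omega2 Xi.
p_value_mm2 <- function(X2, Omega2, Xi) {
  M  <- Omega2 %*% Xi
  t1 <- sum(diag(M))          # sum of eigenvalues
  t2 <- sum(diag(M %*% M))    # sum of squared eigenvalues
  scale <- t2 / t1            # scale factor c
  df <- t1^2 / t2             # estimated degrees of freedom
  pchisq(X2 / scale, df = df, lower.tail = FALSE)
}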

Test statistics used





\[\begin{gathered}X^2 = n \hat{\mathbf e}_2^\top \hat{\boldsymbol\Xi}\hat{\mathbf e}_2 \\\sqrt n \hat{\mathbf e}_2 \approx {\mathop{\mathrm{N}}}_S ({\mathbf 0}, \hat{\boldsymbol\Omega}_2)\end{gathered}\]


Name \(\hat{{\boldsymbol\Xi}}\) D.f.
1 Wald \(\hat{{\boldsymbol\Omega}}^+_2\) \(S-m\)
2 Wald (VCF) \({\boldsymbol\Xi}\hat{{\boldsymbol\Omega}}_2{\boldsymbol\Xi}\) \(S-m\)
3 Wald (Diag.) \(\mathop{\mathrm{diag}}(\hat{{\boldsymbol\Omega}}_2)^{-1}\) est.
4 Pearson \(\mathop{\mathrm{diag}}(\hat{{\boldsymbol\pi}}_2)^{-1}\) est.
5 RSS \(\mathbf I\) est.
6 Multinomial \(\hat{{\boldsymbol\Sigma}}_2^{-1}\) est.

Simulation study

Factor models

M1: 1F5V (1 factor, 5 variables)

Loadings
5 x 1 sparse Matrix of class "dgCMatrix"
         
[1,] 0.80
[2,] 0.70
[3,] 0.47
[4,] 0.38
[5,] 0.34

Thresholds
[1] -1.43 -0.55 -0.13 -0.72
[5] -1.13

Factor correlations
[1] 1

M2: 1F8V (1 factor, 8 variables)

Loadings
8 x 1 sparse Matrix of class "dgCMatrix"
         
[1,] 0.80
[2,] 0.70
[3,] 0.47
[4,] 0.38
[5,] 0.34
[6,] 0.80
[7,] 0.70
[8,] 0.47

Thresholds
[1] -1.43 -0.55 -0.13 -0.72
[5] -1.13 -1.43 -0.55 -0.13

Factor correlations
[1] 1

M3: 1F15V (1 factor, 15 variables)

Loadings
15 x 1 sparse Matrix of class "dgCMatrix"
          
 [1,] 0.80
 [2,] 0.70
 [3,] 0.47
 [4,] 0.38
 [5,] 0.34
 [6,] 0.80
 [7,] 0.70
 [8,] 0.47
 [9,] 0.38
[10,] 0.34
[11,] 0.80
[12,] 0.70
[13,] 0.47
[14,] 0.38
[15,] 0.34

Thresholds
 [1] -1.43 -0.55 -0.13 -0.72
 [5] -1.13 -1.43 -0.55 -0.13
 [9] -0.72 -1.13 -1.43 -0.55
[13] -0.13 -0.72 -1.13

Factor correlations
[1] 1

M4: 2F10V (2 factors, 10 variables)

Loadings
10 x 2 sparse Matrix of class "dgCMatrix"
               
 [1,] 0.80 .   
 [2,] 0.70 .   
 [3,] 0.47 .   
 [4,] 0.38 .   
 [5,] 0.34 .   
 [6,] .    0.80
 [7,] .    0.70
 [8,] .    0.47
 [9,] .    0.38
[10,] .    0.34

Thresholds
 [1] -1.43 -0.55 -0.13 -0.72
 [5] -1.13 -1.43 -0.55 -0.13
 [9] -0.72 -1.13

Factor correlations
     [,1] [,2]
[1,]  1.0  0.3
[2,]  0.3  1.0

M5: 3F15V (3 factors, 15 variables)

Loadings
15 x 3 sparse Matrix of class "dgCMatrix"
                    
 [1,] 0.80 .    .   
 [2,] 0.70 .    .   
 [3,] 0.47 .    .   
 [4,] 0.38 .    .   
 [5,] 0.34 .    .   
 [6,] .    0.80 .   
 [7,] .    0.70 .   
 [8,] .    0.47 .   
 [9,] .    0.38 .   
[10,] .    0.34 .   
[11,] .    .    0.80
[12,] .    .    0.70
[13,] .    .    0.47
[14,] .    .    0.38
[15,] .    .    0.34

Thresholds
 [1] -1.43 -0.55 -0.13 -0.72
 [5] -1.13 -1.43 -0.55 -0.13
 [9] -0.72 -1.13 -1.43 -0.55
[13] -0.13 -0.72 -1.13

Factor correlations
     [,1] [,2] [,3]
[1,]  1.0  0.2  0.3
[2,]  0.2  1.0  0.4
[3,]  0.3  0.4  1.0

Experiment 1: Informative sampling

  • Using M1: 1F5V, generate a population of fixed size \(N\). Then assign each unit \(h\) a probability of selection as follows: \[ w_h^{-1} = \frac{1}{1 + \exp(y_1^*)}. \] Larger values of \(y_1^*\) result in smaller probabilities of selection (a sketch follows this list).

  • Sample \(n\in\{500, 1000, 5000\}\) units from a population of size \(N=n/0.01\) (a 1% sampling fraction), so no finite population correction is needed (Lumley, 2004).

  • In repeated sampling (\(B=1000\) replications), we assess the performance of unweighted (PML) and weighted (PMLW) pairwise estimation with respect to

    • Bias
    • Coverage for 95% CI
    • SD/SE ratio
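
A minimal sketch of this selection mechanism (ystar1 stands in for the model-generated \(y_1^*\) values):

n <- 500
N <- n / 0.01                    # population size, 1% sampling fraction
ystar1 <- rnorm(N)               # placeholder for the model-implied y_1*
pr_sel <- 1 / (1 + exp(ystar1))  # probability of selection w_h^{-1}
idx <- sample(N, n, prob = pr_sel)
wt  <- 1 / pr_sel[idx]           # survey weights for the sampled units
wt  <- wt * n / sum(wt)          # normalise so that sum(wt) = n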

Informative sampling (results)

Relative mean bias, shown as true value followed by PML and PMLW at \(n=500\), \(n=1000\), and \(n=5000\) (in that order):

Parameter True PML PMLW PML PMLW PML PMLW

Loadings
\(\lambda_1\) 0.80 −2.6% −1.2% −2.6% −0.8% −2.3% −0.3%
\(\lambda_2\) 0.70 8.5% 0.2% 8.7% 0.3% 8.4% −0.3%
\(\lambda_3\) 0.47 14.8% 1.8% 13.8% 0.7% 13.4% 0.2%
\(\lambda_4\) 0.38 13.2% 1.1% 12.8% 0.5% 12.9% 0.5%
\(\lambda_5\) 0.34 10.2% −1.5% 9.8% −1.8% 11.4% −0.1%

Thresholds
\(\tau_1\) −1.43 −2.2% 0.5% −2.4% 0.3% −2.5% 0.2%
\(\tau_2\) −0.55 5.6% 0.3% 5.6% 0.2% 5.5% 0.0%
\(\tau_3\) −0.13 43.4% 1.4% 42.3% 0.2% 42.0% −0.2%
\(\tau_4\) −0.72 4.5% 0.2% 4.6% 0.3% 4.5% 0.1%
\(\tau_5\) −1.13 2.0% 0.1% 2.6% 0.6% 2.1% 0.1%

Factor correlations
\(\rho_{12}\) 0.20 20.0% 5.4% 17.0% 2.8% 16.7% 2.6%
\(\rho_{13}\) 0.30 14.6% 1.7% 12.6% −0.3% 13.0% −0.0%
\(\rho_{23}\) 0.40 4.0% 1.7% 2.6% −0.0% 2.5% −0.1%

Informative sampling (results)

Coverage of 95% CIs and SD/SE ratios at \(n=500\), \(n=1000\), and \(n=5000\) (in that order). Within each sample size: Coverage (PML, PMLW), then SD/SE (PML, PMLW):

Loadings
\(\lambda_1\) 95.1% 95.8% 1.02 0.99 93.6% 94.1% 1.07 1.02 91.7% 96.0% 1.19 0.98
\(\lambda_2\) 89.1% 96.1% 1.22 0.94 80.0% 95.6% 1.48 0.97 31.8% 95.7% 2.66 1.01
\(\lambda_3\) 82.5% 95.7% 1.33 0.94 75.1% 94.3% 1.58 1.00 23.5% 95.1% 2.88 1.01
\(\lambda_4\) 91.4% 95.3% 1.14 0.97 83.7% 94.7% 1.33 1.01 50.7% 95.3% 2.16 0.96
\(\lambda_5\) 92.6% 95.7% 1.04 0.96 92.1% 96.0% 1.09 0.96 72.0% 94.4% 1.67 1.01

Thresholds
\(\tau_1\) 95.2% 96.1% 1.01 0.94 92.7% 96.8% 1.09 0.92 72.7% 96.6% 1.68 0.93
\(\tau_2\) 96.3% 97.9% 0.95 0.86 93.1% 96.4% 1.12 0.92 66.6% 96.5% 1.80 0.89
\(\tau_3\) 86.0% 94.8% 1.39 0.99 73.4% 96.9% 1.64 0.93 10.3% 96.9% 3.19 0.94
\(\tau_4\) 91.9% 94.1% 1.12 1.01 89.6% 95.1% 1.22 0.97 63.9% 95.4% 1.90 0.96
\(\tau_5\) 95.1% 95.8% 1.02 0.98 92.5% 95.7% 1.12 0.98 80.5% 95.4% 1.46 1.02

Factor correlations
\(\rho_{12}\) 92.1% 93.6% 1.10 1.01 91.2% 94.6% 1.13 1.00 79.6% 95.3% 1.49 0.97
\(\rho_{13}\) 92.7% 95.9% 1.07 0.96 88.8% 94.1% 1.20 1.04 70.6% 95.5% 1.71 0.98
\(\rho_{23}\) 91.8% 92.7% 1.06 1.04 95.6% 95.1% 0.99 0.99 93.7% 95.1% 1.05 0.99

Experiment 2: Educational survey

Simulate a population of \(10^6\) students clustered within classrooms and stratified by school type (which correlates with abilities).

Type \(N\) Classes Avg. class size
A 400 33.0 15.2
B 1000 19.6 25.6
C 600 24.9 20.4

\(\text{ICC} \in (0.05, 0.60)\)

  • Cluster sample: Sample \(n_C\) schools with probability proportional to size (PPS), then sample one classroom via SRS, then select all students in the classroom.
  • Stratified cluster sample: Within each stratum, sample \(n_S\) schools via SRS, then sample one classroom via SRS, then select all students in the classroom (a sketch of the cluster-sampling step follows).
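
A rough sketch of the cluster-sampling step (the data frames schools and classrooms, and their columns, are hypothetical):

n_C <- 50
# Draw schools with probability proportional to size (approximate PPS).
sel_schools <- sample(nrow(schools), n_C, prob = schools$n_students)
# For each selected school, pick one classroom via SRS; all of its
# students then enter the sample.
sel_classes <- sapply(sel_schools, function(s) {
  cls <- which(classrooms$school == s)
  cls[sample.int(length(cls), 1)]
})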

Educational survey sampling weights

Educational survey (results)

Type I error rates (\(\alpha = 5\%\), \(n=5000\))

Educational survey (results)

Power analysis (\(\alpha = 5\%\))

Summary

Conclusions

PML estimation offers a computationally efficient alternative that mitigates some challenges inherent in the underlying variable (UV) approach for binary factor models.

🧩 For samples with unequal probability of selection, sampling weights are easily incorporated in the PML estimation routine.

🔍 Sparsity impairs the dependability of GOF tests, but this is circumvented by considering lower-order statistics.

  • ✅ Generally, all tests have acceptable Type I error rates, except the WaldDiag test.
  • 🛠️ Wald and Pearson tests need \(\hat{{\boldsymbol\Omega}}_2^{-1}\), but WaldVCF and WaldDiag do not.
  • ⚠️ Ignoring sampling weights may lead to inflated Type I errors and lower power.
  • 🏆 Pearson test seems the most robust.

Software

R implementation: weighted PML estimation is available in {lavaan} (>= 0.6-17), and the LIGOF tests in the companion package {lavaan.bingof}.

fit <- lavaan::sem(
  model = "eta1 =~ y1 + y2 + y3 + y4 + y5",         # one-factor, five-item model
  data = lavaan.bingof::gen_data_bin_wt(n = 1000),  # simulated binary data with weights
  std.lv = TRUE,                                    # standardise the latent variable
  estimator = "PML",                                # pairwise maximum likelihood
  sampling.weights = "wt"                           # weights column in the data
)
lavaan.bingof::all_tests(fit)                       # all six LIGOF test statistics
# A tibble: 6 × 6
     X2    df name          pval Xi_rank     S
  <dbl> <dbl> <chr>        <dbl>   <int> <int>
1  3.39  5    Wald         0.641      13    15
2  3.34  5    WaldVCF      0.648       5    15
3  1.40  3.00 WaldDiag,MM3 0.705      15    15
4  1.10  3.35 Pearson,MM3  0.826      15    15
5  1.77  4.04 RSS,MM3      0.782      15    15
6  1.11  3.35 Multn,MM3    0.825      15    15

Thanks!

https://haziqj.ml/plgof-nus

References

Agresti, A. (2012). Categorical data analysis (3rd ed.). John Wiley & Sons.
Asparouhov, T. (2005). Sampling weights in latent variable modeling. Structural Equation Modeling, 12(3), 411–434.
Bartholomew, D. J., & Leung, S. O. (2002). A goodness of fit test for sparse \(2^p\) contingency tables. British Journal of Mathematical and Statistical Psychology, 55(1), 1–15.
Besag, J. (1974). Spatial Interaction and the Statistical Analysis of Lattice Systems. Journal of the Royal Statistical Society: Series B (Methodological), 36(2), 192–225. https://doi.org/10.1111/j.2517-6161.1974.tb00999.x
Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited-information goodness-of-fit testing of item response theory models for sparse \(2^p\) tables. British Journal of Mathematical and Statistical Psychology, 59(1), 173–194. https://doi.org/10.1348/000711005X66419
Chandler, R. E., & Bate, S. (2007). Inference for Clustered Data Using the Independence Loglikelihood. Biometrika, 94(1), 167–183. https://www.jstor.org/stable/20441361
Cox, D. R., & Reid, N. (2004). A Note on Pseudolikelihood Constructed from Marginal Densities. Biometrika, 91(3), 729–737. https://www.jstor.org/stable/20441134
Fuller, W. A. (2009). Sampling statistics. John Wiley & Sons.
Katsikatsou, M., Moustaki, I., Yang-Wallentin, F., & Jöreskog, K. G. (2012). Pairwise likelihood estimation for factor analysis models with ordinal data. Computational Statistics & Data Analysis, 56(12), 4243–4258.
Liang, K.-Y. (1987). Extended Mantel-Haenszel Estimating Procedure for Multivariate Logistic Regression Models. Biometrics, 43(2), 289–299. https://doi.org/10.2307/2531813
Lindsay, B. G. (1988). Composite likelihood methods. In N. U. Prabhu (Ed.), Contemporary Mathematics (Vol. 80, pp. 221–239). American Mathematical Society. https://doi.org/10.1090/conm/080/999014
Lumley, T. (2004). Analysis of Complex Survey Samples. Journal of Statistical Software, 9, 1–19. https://doi.org/10.18637/jss.v009.i08
Mathai, A. M., & Provost, S. B. (1992). Quadratic forms in random variables: Theory and applications. Dekker.
Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit testing in \(2^n\) contingency tables: A unified framework. Journal of the American Statistical Association, 100(471), 1009–1020.
Maydeu-Olivares, A., & Joe, H. (2008). An overview of limited information goodness-of-fit testing in multidimensional contingency tables. New Trends in Psychometrics, 253–262.
Molenberghs, G., & Verbeke, G. (2006). Models for discrete longitudinal data. Springer.
Muthén, B. O., & Satorra, A. (1995). Complex Sample Data in Structural Equation Modeling. Sociological Methodology, 25, 267–316. https://doi.org/10.2307/271070
Reiser, M. (1996). Analysis of residuals for the multinomial item response model. Psychometrika, 61, 509–528.
Skinner, C. J. (1989). Domain means, regression and multivariate analysis. In C. J. Skinner, D. Holt, & T. M. F. Smith (Eds.), Analysis of complex surveys (pp. 59–87). Wiley.
Varin, C. (2008). On composite marginal likelihoods. AStA Advances in Statistical Analysis, 92(1), 1–28. https://doi.org/10.1007/s10182-008-0060-7
Varin, C., Reid, N., & Firth, D. (2011). An overview of composite likelihood methods. Statistica Sinica, 21(1), 5–42.
Wedderburn, R. W. M. (1974). Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika, 61(3), 439–447. https://doi.org/10.2307/2334725
Zhao, Y., & Joe, H. (2005). Composite likelihood estimation in multivariate data analysis. The Canadian Journal of Statistics, 33(3), 335–356.