Let \({\mathbf Y}= (Y_1, \dots, Y_p)^\top \in \{0,1\}^p\) be a vector of Bernoulli random variables. Consider a response pattern \({\mathbf y}= (y_1,\dots,y_p)^\top\), where each \(y_i\in\{0,1\}\). The probability of observing such a response pattern is given by the joint distribution \[\begin{align*} \pi &= \Pr({\mathbf Y}= {\mathbf y}) = \Pr(Y_1=y_1,\dots,Y_p=y_p). \tag{1} \end{align*}\] Note that there are a total of \(R=2^p\) joint probabilities \(\pi_r\), one for each possible response pattern \({\mathbf y}_r\), \(r=1,\dots,R\).
When we consider a parametric model with parameter vector \({\boldsymbol\theta}\in\mathbb{R}^m\), we write \(\pi_r({\boldsymbol\theta})\) to indicate each joint probability, and \[\begin{equation} {\boldsymbol\pi}({\boldsymbol\theta}) = \begin{pmatrix} \pi_{1}({\boldsymbol\theta}) \\ \vdots \\ \pi_{R}({\boldsymbol\theta}) \\ \end{pmatrix} \in [0,1]^R \tag{2} \end{equation}\] for the vector of joint probabilities, with \(\sum_{r=1}^R \pi_{r}({\boldsymbol\theta}) =1\).
Model-based probabilities
The model of interest is a factor model, commonly used in social statistics. Under the underlying variable (UV) approach, the observed binary responses \(y_i\) are manifestations of latent, continuous variables \(Y_i^*\), \(i=1,\dots,p\). The connection is made as follows: \[ Y_i = \begin{cases} 1 & Y_i^* > \tau_i \\ 0 & Y_i^* \leq \tau_i, \end{cases} \] where \(\tau_i\) is the threshold associated with the variable \(Y_i^*\). For convenience, each \(Y_i^*\) is taken to be a standard normal random variable. The factor model takes the form \[ \mathbf Y^* = \boldsymbol\Lambda\boldsymbol \eta + \boldsymbol \epsilon, \] where each component is explained below:
\(\mathbf Y^* = (Y_1^*,\dots,Y_p^*)^\top \in \mathbb R^p\) are the underlying variables;
\(\boldsymbol\Lambda \in \mathbb R^{p \times q}\) is the matrix of loadings;
\(\boldsymbol \eta = (\eta_1,\dots,\eta_q)^\top \in \mathbb R^q\) is the vector of latent factors;
\(\boldsymbol \epsilon \in \mathbb R^p\) are the error terms associated with the items (also known as unique variables).
We also make some distributional assumptions, namely
\(\boldsymbol\eta \sim \operatorname{N}_q(\mathbf 0, \boldsymbol\Psi)\), where \({\boldsymbol\Psi}\) is a correlation matrix, i.e. for \(k,l\in\{1,\dots,q\}\), \[ {\boldsymbol\Psi}_{kl} = \begin{cases} 1 & \text{if } k = l \\ \rho(\eta_k, \eta_l) & \text{if } k \neq l. \end{cases} \]
\(\boldsymbol \epsilon \sim \operatorname{N}_p(\mathbf 0, \boldsymbol\Theta_\epsilon)\), with \(\boldsymbol\Theta_\epsilon = \mathbf I - \operatorname{diag}(\boldsymbol\Lambda \boldsymbol\Psi \boldsymbol\Lambda^\top)\).
These two assumptions, together with \(\operatorname{Cov}(\boldsymbol\eta, \boldsymbol\epsilon) = \mathbf 0\), imply that \(\mathbf Y^*\sim\operatorname{N}_p(\mathbf 0,{\boldsymbol\Sigma}_{{\mathbf y}^*})\), where \[\begin{align} {\boldsymbol\Sigma}_{{\mathbf y}^*} = \operatorname{Var}(\mathbf Y^*) &= \boldsymbol\Lambda\boldsymbol\Psi\boldsymbol\Lambda^\top + \boldsymbol\Theta_\epsilon \\ &= \mathbf I + \big(\boldsymbol\Lambda\boldsymbol\Psi\boldsymbol\Lambda^\top - \operatorname{diag}(\boldsymbol\Lambda \boldsymbol\Psi \boldsymbol\Lambda^\top)\big). \end{align}\] The parameter vector for this factor model is denoted \(\boldsymbol\theta^\top = (\boldsymbol\lambda, \boldsymbol\psi, \boldsymbol\tau) \in\mathbb{R}^m\); it contains the vectors of free, non-redundant parameters in \(\boldsymbol\Lambda\) and \(\boldsymbol\Psi\) respectively, as well as the vector of all free thresholds.
Under this factor model, the probability of response pattern \(\mathbf y_r\) is \[\begin{align} \pi_{r}({\boldsymbol\theta}) &= \Pr(\mathbf Y = \mathbf y_r \mid \boldsymbol\theta) \\ &= \idotsint_A \phi_p(\mathbf y^* \mid \mathbf 0, \boldsymbol\Sigma_{\mathbf y^*} ) \, \text{d}\mathbf y^*, \end{align}\] where \(\phi_p(\cdot \mid \boldsymbol\mu,\boldsymbol\Sigma)\) is the density function of the \(p\)-dimensional normal distribution with mean \(\boldsymbol\mu\) and covariance matrix \(\boldsymbol\Sigma\). The integral is taken over the set of latent values consistent with the response pattern, \[ A = \{ \mathbf y^* \in \mathbb R^p \mid y_i^* > \tau_i \text{ if } y_{ri}=1, \ y_i^* \leq \tau_i \text{ if } y_{ri}=0, \ i=1,\dots,p \}. \]
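In practice, each \(\pi_r({\boldsymbol\theta})\) is a rectangle probability of the multivariate normal distribution of \(\mathbf Y^*\). A minimal R sketch of this computation using `mvtnorm::pmvnorm()` is shown below; the loadings, thresholds and response pattern are made-up values for illustration, not output from the package.

```r
# Hedged sketch (not the package's internal routine): evaluate pi_r(theta)
# as a multivariate normal rectangle probability.
library(mvtnorm)

p      <- 3
Lambda <- matrix(c(0.8, 0.7, 0.6), ncol = 1)   # illustrative loadings (one factor)
Psi    <- matrix(1, 1, 1)                      # factor correlation matrix
tau    <- c(-0.2, 0.1, 0.3)                    # illustrative thresholds

# Var(Y*) = Lambda Psi Lambda' + Theta_eps, which has unit diagonal
Sigma_ystar <- Lambda %*% Psi %*% t(Lambda)
diag(Sigma_ystar) <- 1

# Probability of the response pattern y_r = (1, 0, 1):
# integrate over Y_i* > tau_i when y_i = 1 and Y_i* <= tau_i when y_i = 0.
y_r   <- c(1, 0, 1)
lower <- ifelse(y_r == 1, tau, -Inf)
upper <- ifelse(y_r == 1, Inf, tau)
pi_r  <- pmvnorm(lower = lower, upper = upper, mean = rep(0, p),
                 sigma = Sigma_ystar)
pi_r
```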
Parameter estimation
Suppose that \(n\) observations \({\mathbf y}^{(h)}\), \(h=1,\dots,n\), of \({\mathbf Y}\) are obtained. For the purpose of generalising from independent samples to complex samples, suppose that each unit \(h\) in the sample is assigned a (normalised) survey weight \(w_h\) with \(\sum_h w_h = n\). Of course, if an independent simple random sampling scheme is implemented, then each \(w_h=1\).
The sample proportion for each response pattern \(r\) is written \(p_r = \hat n_r/n\), where \[\begin{equation} \hat n_r = \sum_h w_h [{\mathbf y}^{(h)} = {\mathbf y}_r], \end{equation}\] and \([\cdot]\) denotes the Iverson bracket. In other words, \(\hat n_r\) is the (weighted) frequency count of response pattern \(r\), and \(\sum_{r=1}^R \hat n_r = n\). The vector \(\hat{\mathbf n}= (\hat n_1, \dots, \hat n_R)^\top\) then follows a multinomial distribution with parameters \(n\), \(R\), and \({\boldsymbol\pi}({\boldsymbol\theta})\). The probability mass function of \(\hat{\mathbf n}\) is given by \[\begin{equation} f_{\hat{\mathbf n}}(x_1,\dots,x_R) = n! \prod_{r=1}^R \frac{1}{x_r!} \big[\pi_{r}({\boldsymbol\theta})\big]^{x_r}, \end{equation}\] and the log-likelihood for \({\boldsymbol\theta}\) given the observed frequencies \(\hat{\mathbf n}\) (ignoring constants) is then \[\begin{equation} \ell({\boldsymbol\theta}) = \log f_{\hat{\mathbf n}}(\hat n_1,\dots,\hat n_R) = \sum_{r=1}^{R} \hat n_r \log \pi_{r}({\boldsymbol\theta}). \end{equation}\]
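As an illustration of the weighted frequency counts above, here is a small hedged R sketch; the objects `Y`, `w` and the pattern ordering are assumptions made for this example only.

```r
# Illustrative sketch: weighted pattern counts n_r and proportions p_r from an
# n x p binary data matrix Y with normalised survey weights w (simulated here).
set.seed(1)
n <- 100; p <- 3
Y <- matrix(rbinom(n * p, 1, 0.5), nrow = n)
w <- rep(1, n)  # simple random sampling: all weights equal to one

obs_pattern  <- apply(Y, 1, paste, collapse = "")  # observed pattern per unit
all_patterns <- apply(expand.grid(rep(list(0:1), p)), 1, paste, collapse = "")

# Iverson bracket: indicator that unit h exhibits pattern r
n_hat <- sapply(all_patterns, function(r) sum(w * (obs_pattern == r)))
p_hat <- n_hat / n
sum(n_hat)  # equals n by construction, since the weights sum to n
```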
The maximum likelihood estimator \(\hat{\boldsymbol\theta}_{\text{ML}}\) satisfies \(\hat{\boldsymbol\theta}_{\text{ML}}= \mathop{\mathrm{argmax}}_{{\boldsymbol\theta}} \ell({\boldsymbol\theta})\). Maximum likelihood theory tells us that, under certain regularity conditions, as \(n\to\infty\), \[\begin{equation}\label{eq:limitdisttheta} \sqrt n (\hat{\boldsymbol\theta} - {\boldsymbol\theta}) \xrightarrow{\text D} {\mathop{\mathrm{N}}}_m\big({\mathbf 0}, {\mathcal I}({\boldsymbol\theta})^{-1}\big), \end{equation}\] where \(\mathbb{R}^{m\times m} \ni {\mathcal I}= -n^{-1}\mathop{\mathrm{E}}\nabla^2 \ell({\boldsymbol\theta})\) is the (unit) expected Fisher information matrix. It can be further shown that \({\mathcal I}= {\boldsymbol\Delta}^\top {\mathbf D}^{-1} {\boldsymbol\Delta}\), where
- \({\boldsymbol\Delta}_{r,k} = \frac{\partial\pi_r({\boldsymbol\theta})}{\partial\theta_k}\), \(r=1,\dots,R\), \(k=1,\dots,m\); and
- \({\mathbf D}= \mathop{\mathrm{diag}}({\boldsymbol\pi}({\boldsymbol\theta}))\).
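As a small illustration of the formula \({\mathcal I}= {\boldsymbol\Delta}^\top {\mathbf D}^{-1} {\boldsymbol\Delta}\), the following R sketch computes it from a Jacobian matrix `Delta` and a probability vector `pi_theta`, both assumed to be already available (e.g. computed numerically); this is not the package's internal code.

```r
# Hedged sketch: unit expected Fisher information I = Delta' D^{-1} Delta,
# where Delta is the R x m Jacobian of pi(theta) and D = diag(pi(theta)).
fisher_info <- function(Delta, pi_theta) {
  # dividing the rows of Delta by pi_theta is equivalent to D^{-1} %*% Delta
  t(Delta) %*% (Delta / pi_theta)
}
```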
The maximum likelihood estimator belongs to the class of best asymptotically normal (BAN) estimators \(\hat{\boldsymbol\theta}\) of \({\boldsymbol\theta}\) that satisfy \[\begin{equation} \sqrt n (\hat{\boldsymbol\theta} - {\boldsymbol\theta}) = \sqrt n \, {\mathbf B}\big({\mathbf p} - {\boldsymbol\pi}({\boldsymbol\theta})\big) + o_p(1) \tag{3} \end{equation}\] for some \(m\times R\) matrix \({\mathbf B}\). In the specific case of maximum likelihood estimation, \({\mathbf B}= {\mathcal I}^{-1} {\boldsymbol\Delta}^\top {\mathbf D}^{-1}\). This is proven in the corresponding article (see XXX for more details).
Pairwise likelihood estimation
The main interest of this project is to construct test statistics using composite likelihood methods, specifically the pairwise likelihood method. To define the pairwise likelihood, let \(\pi_{y_iy_j}^{(ij)}({\boldsymbol\theta})\) be the probability under the model that \(Y_i=y_i \in \{0,1\}\) and \(Y_j=y_j\in\{0,1\}\) for a pair of variables \(Y_i\) and \(Y_j\), \(i,j=1,\dots,p\), \(i<j\). The pairwise log-likelihood takes the form \[\begin{equation} \operatorname{\ell_P}({\boldsymbol\theta}) = \sum_{i<j} \sum_{y_i}\sum_{y_j} \hat n_{y_iy_j}^{(ij)} \log \pi_{y_iy_j}^{(ij)}({\boldsymbol\theta}), \end{equation}\] where \(\hat n_{y_iy_j}^{(ij)}\) is the observed (weighted) frequency of sample units with \(Y_i=y_i\) and \(Y_j=y_j\), \[ \hat n_{y_iy_j}^{(ij)} = \sum_h w_h [{\mathbf y}^{(h)}_i = y_i, {\mathbf y}^{(h)}_j = y_j]. \] Let us also define the corresponding observed pairwise proportions \(p_{y_iy_j}^{(ij)} = \hat n_{y_iy_j}^{(ij)}/n\). There are a total of \(\tilde R = 4 \times {p \choose 2}\) summands, where the ‘4’ corresponds to the four possible response pairs ‘00’, ‘10’, ‘01’, and ‘11’.
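The sketch below illustrates, in R, how the weighted pairwise frequencies and the pairwise log-likelihood could be assembled; `Y`, `w`, `theta` and the helper `pi_pair()` are illustrative assumptions (the latter could be built from bivariate normal probabilities, as in the earlier sketch), not the package's actual implementation.

```r
# Hedged sketch: weighted pairwise frequency counts n_{y_i y_j}^{(ij)} and the
# pairwise log-likelihood, given an n x p binary matrix Y, weights w, and a
# user-supplied function pi_pair(i, j, theta) returning the four model
# probabilities for pair (i, j) in the order '00', '10', '01', '11'.
pair_freq <- function(Y, w, i, j) {
  c("00" = sum(w * (Y[, i] == 0 & Y[, j] == 0)),
    "10" = sum(w * (Y[, i] == 1 & Y[, j] == 0)),
    "01" = sum(w * (Y[, i] == 0 & Y[, j] == 1)),
    "11" = sum(w * (Y[, i] == 1 & Y[, j] == 1)))
}

pl_loglik <- function(Y, w, theta, pi_pair) {
  p  <- ncol(Y)
  ll <- 0
  for (i in seq_len(p - 1)) {
    for (j in seq(i + 1, p)) {
      n_ij  <- pair_freq(Y, w, i, j)
      pi_ij <- pi_pair(i, j, theta)  # length-4 vector, same ordering as n_ij
      ll <- ll + sum(n_ij * log(pi_ij))
    }
  }
  ll
}
```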
The pairwise maximum likelihood estimator \(\hat{\boldsymbol\theta}_{\text{PL}}\) satisfies \(\hat{\boldsymbol\theta}_{\text{PL}}= \mathop{\mathrm{argmax}}_{{\boldsymbol\theta}} \operatorname{\ell_P}({\boldsymbol\theta})\). Under certain regularity conditions, \[\begin{equation} \sqrt n (\hat{\boldsymbol\theta}_{\text{PL}} - {\boldsymbol\theta}) \xrightarrow{D} {\mathop{\mathrm{N}}}_m\Big({\mathbf 0}, \big({\mathcal H}({\boldsymbol\theta}){\mathcal J}({\boldsymbol\theta})^{-1}{\mathcal H}({\boldsymbol\theta})\big)^{-1}\Big), \end{equation}\] where
- \({\mathcal H}({\boldsymbol\theta})=-\mathop{\mathrm{E}}\nabla^2\operatorname{\ell_P}({\boldsymbol\theta})\) is the sensitivity matrix; and
- \({\mathcal J}({\boldsymbol\theta})=\mathop{\mathrm{Var}}\big(\sqrt n \,\nabla\operatorname{\ell_P}({\boldsymbol\theta})\big)\) is the variability matrix.
We can also show that the pairwise maximum likelihood estimator satisfies (3), where the matrix \({\mathbf B}\) in this case is \({\mathbf B}= {\mathcal H}^{-1}\tilde{\boldsymbol\Delta}^\top\tilde {\mathbf D}^{-1} {\mathbf G}\), with
- \(\tilde{\boldsymbol\Delta}\in \mathbb{R}^{\tilde R \times m}\), the matrix of partial derivatives of the pairwise probabilities, i.e. with entries \(\frac{\partial\pi_{y_iy_j}^{(ij)}({\boldsymbol\theta})}{\partial\theta_k}\);
- \(\tilde{\mathbf D}= \mathop{\mathrm{diag}}\big((\pi_{y_iy_j}^{(ij)}({\boldsymbol\theta}))_{i<j}\big)\); and
- \({\mathbf G}\), an indicator matrix linking the \(\tilde R\) pairwise quantities to the \(R\) joint-probability quantities. See the other articles for further details.
Goodness-of-fit
Collect all the sample proportions \(p_r = \hat n_r/n\), \(r=1,\dots,R\), to form an \(R\)-vector of proportions \({\mathbf p}\). For independent samples, it is well known (Agresti 2012) that \[\begin{equation} \sqrt n ({\mathbf p} - {\boldsymbol\pi}) \xrightarrow{\text D} {\mathop{\mathrm{N}}}_R ({\mathbf 0}, {\boldsymbol\Sigma}), \tag{4} \end{equation}\] where \({\boldsymbol\Sigma}= {\mathbf D} - {\boldsymbol\pi}{\boldsymbol\pi}^\top\), with \({\mathbf D}:= \mathop{\mathrm{diag}}({\boldsymbol\pi})\). Under a parametric model, the probability vector in (4) reads \({\boldsymbol\pi}= {\boldsymbol\pi}({\boldsymbol\theta})\). An analogous result exists for complex sampling designs (Fuller 2009); however, the covariance matrix \({\boldsymbol\Sigma}\) need not take a multinomial form, and the true proportions \({\boldsymbol\pi}\) may be either a finite-population or a superpopulation quantity.
Suppose we denote by \({\boldsymbol\pi}(\hat{\boldsymbol\theta})\) the estimated probabilities under a parametric model. For estimators \(\hat{\boldsymbol\theta}\) satisfying (3), Maydeu-Olivares and Joe (2005) show that the residuals \(\hat{\mathbf e}= {\mathbf p} - {\boldsymbol\pi}(\hat{\boldsymbol\theta})\) are asymptotically normal: \[\begin{equation} \sqrt n \hat{\mathbf e}= \sqrt n \big( {\mathbf p} - {\boldsymbol\pi}(\hat{\boldsymbol\theta}) \big) \xrightarrow{\text D} {\mathop{\mathrm{N}}}_R ({\mathbf 0}, {\boldsymbol\Omega}), \end{equation}\] where \[\begin{align} {\boldsymbol\Omega} &= ({\mathbf I} - {\boldsymbol\Delta}{\mathbf B}){\boldsymbol\Sigma}({\mathbf I} - {\boldsymbol\Delta}{\mathbf B})^\top \\ &= {\boldsymbol\Sigma} - {\boldsymbol\Sigma}{\mathbf B}^\top{\boldsymbol\Delta}^\top - {\boldsymbol\Delta}{\mathbf B}{\boldsymbol\Sigma}+ {\boldsymbol\Delta}{\mathbf B}{\boldsymbol\Sigma}{\mathbf B}^\top {\boldsymbol\Delta}^\top, \end{align}\] with \({\mathbf B}\) being the transformation matrix and \({\boldsymbol\Delta}\) the matrix of partial derivatives of \({\boldsymbol\pi}({\boldsymbol\theta})\) (both described previously). Maydeu-Olivares and Joe (2005) remark that for maximum likelihood estimators, the residual covariance matrix simplifies to \({\boldsymbol\Omega}= {\boldsymbol\Sigma}- {\boldsymbol\Delta}{\mathcal I}^{-1}_1{\boldsymbol\Delta}^\top\). We do not think such a simplification exists for the pairwise likelihood case [Hold that thought. Maybe it’s possible. Still investigating.].
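For completeness, a tiny R sketch of the residual covariance formula above; `Sigma`, `Delta` and `B` are assumed to be available (e.g. estimated) matrices of conformable dimensions.

```r
# Hedged sketch of Omega = (I - Delta B) Sigma (I - Delta B)'.
residual_cov <- function(Sigma, Delta, B) {
  IDB <- diag(nrow(Sigma)) - Delta %*% B
  IDB %*% Sigma %*% t(IDB)
}
```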
Lower-order residuals
The interest here is in obtaining the lower-order marginals from the full \(R\)-dimensional probability vector \({\boldsymbol\pi}\). Namely, we would like to get
Univariate moments of the form \(\dot\pi_i = \Pr(Y_i=1)\) (of which there are \(p\)): \[ \dot{\boldsymbol\pi}_1^\top = (\dot\pi_1,\dots,\dot\pi_p) \in[0,1]^p \]
Bivariate moments of the form \(\dot\pi_{ij}=\Pr(Y_i=1,Y_j=1)\) (of which there are \({p \choose 2}\)): \[ \dot{\boldsymbol\pi}_2^\top = \Big(\dot\pi_{ij} \Big)_{\substack{i,j=1\\i<j}}^p \in [0,1]^{p(p-1)/2} \]
The bivariate moments in particular will be especially relevant when considering pairwise likelihood estimation of the parametric model. Denote by \({\boldsymbol\pi}_2 = \begin{pmatrix}\dot{{\boldsymbol\pi}}_1\\ \dot{{\boldsymbol\pi}}_2 \end{pmatrix}\) the \(S:=p(p+1)/2\) vector of univariate and bivariate moments. Correspondingly, let \({\mathbf p}_2\) be the \(S\)-vector of univariate and bivariate sample moments. It is possible to obtain \({\boldsymbol\pi}_2\) (as well as \({\mathbf p}_2\)) via a linear transformation of the full vector of probabilities \({\boldsymbol\pi}\) (cf. \({\mathbf p}\)): \[\begin{equation} {\boldsymbol\pi}_2 = {\mathbf T}_2 {\boldsymbol\pi}. \end{equation}\] Here, \({\mathbf T}_2\) is an indicator matrix of dimension \(S \times R\). See the articles for further details.
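A hedged R sketch of one way such an indicator matrix could be constructed is shown below; the pattern ordering produced by `expand.grid()` is an assumption of this example and must match the ordering used for \({\boldsymbol\pi}\), and the package itself may construct \({\mathbf T}_2\) differently.

```r
# Hedged sketch: build an S x R indicator matrix whose rows pick out the
# response patterns contributing to each univariate moment Pr(Y_i = 1) and
# each bivariate moment Pr(Y_i = 1, Y_j = 1).
make_T2 <- function(p) {
  patterns <- as.matrix(expand.grid(rep(list(0:1), p)))  # R = 2^p patterns, one per row
  uni <- t(sapply(seq_len(p), function(i) as.numeric(patterns[, i] == 1)))
  pairs <- t(combn(p, 2))
  biv <- t(apply(pairs, 1, function(ij)
    as.numeric(patterns[, ij[1]] == 1 & patterns[, ij[2]] == 1)))
  rbind(uni, biv)   # S x R, with S = p + choose(p, 2)
}

# Example for p = 3 items: pi_2 = T_2 %*% pi
T2 <- make_T2(3)
dim(T2)  # 6 x 8
```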
Consider now the lower-order residuals of up to order 2 under a parametric model, defined by \[\begin{equation} \hat{\mathbf e}_2 = {\mathbf p}_2 - {\boldsymbol\pi}_2(\hat{\boldsymbol\theta}) = {\mathbf T}_2 \big({\mathbf p} - {\boldsymbol\pi}(\hat{\boldsymbol\theta}) \big) = {\mathbf T}_2\hat{\mathbf e}. \end{equation}\] It should be straightforward to see, using the previous results, that as \(n\to\infty\) we get \[\begin{equation} \sqrt n \hat{\mathbf e}_2 = \sqrt n \big({\mathbf p}_2 - {\boldsymbol\pi}_2(\hat{\boldsymbol\theta})\big) \xrightarrow{\text D} \mathop{\mathrm{N}}({\mathbf 0}, {\boldsymbol\Omega}_2), \tag{5} \end{equation}\] where \[\begin{align} {\boldsymbol\Omega}_2 = {\mathbf T}_2{\boldsymbol\Omega}{\mathbf T}_2^\top &= {\mathbf T}_2 ({\mathbf I} - {\boldsymbol\Delta}{\mathbf B}){\boldsymbol\Sigma}({\mathbf I} - {\boldsymbol\Delta}{\mathbf B})^\top {\mathbf T}_2^\top \\ &= {\boldsymbol\Sigma}_2 - {\mathbf T}_2{\boldsymbol\Sigma}{\mathbf B}^\top{\boldsymbol\Delta}_2^\top - {\boldsymbol\Delta}_2 {\mathbf B}{\boldsymbol\Sigma}{\mathbf T}_2^\top + {\boldsymbol\Delta}_2 {\mathbf B}{\boldsymbol\Sigma}{\mathbf B}^\top {\boldsymbol\Delta}_2^\top, \tag{6} \end{align}\] with \({\boldsymbol\Sigma}_2 = {\mathbf T}_2{\boldsymbol\Sigma}{\mathbf T}_2^\top\) and \({\boldsymbol\Delta}_2={\mathbf T}_2{\boldsymbol\Delta}\) being the transformed versions of the respective matrices. For the specific case of maximum likelihood estimation, this simplifies to \({\boldsymbol\Omega}_2 = {\boldsymbol\Sigma}_2 - {\boldsymbol\Delta}_2 {\mathcal I}_1^{-1} {\boldsymbol\Delta}_2^\top\). For pairwise likelihood estimation, we show that \[ {\boldsymbol\Omega}_2 = ({\mathbf I} - {\boldsymbol\Delta}_2{\mathbf B}_2){\boldsymbol\Sigma}_2({\mathbf I} - {\boldsymbol\Delta}_2{\mathbf B}_2)^\top, \tag{7} \] where the transformation matrix in full is given by \({\mathbf B}_2 = {\mathcal H}^{-1}\tilde{\boldsymbol\Delta}^\top\tilde{\mathbf D}^{-1}{\mathbf G}{\mathbf T}_2\). For posterity, the \({\mathbf B}({\boldsymbol\theta})\) matrix in our manuscript is actually \({\mathbf B}({\boldsymbol\theta}) = \tilde{\boldsymbol\Delta}^\top\tilde{\mathbf D}^{-1} B\), where the indicator matrix \(B\) is \(B = {\mathbf G}{\mathbf T}_2\) in the notation of the present article.
As a remark, while elegant in theory, forming the matrix \({\boldsymbol\Omega}_2\) via the linear transformation \({\mathbf T}_2\) is impractical when \(p\) is large, for two reasons. One, it involves multiplying an \(S \times R\) matrix by an \(R \times R\) matrix, a computation of order \(O(SR^2) = O\big(p(p+1)2^{2p-1}\big)\). Two, storing \(R\times R\) matrices (e.g. \({\boldsymbol\Sigma}\)) may be prohibitively demanding, and their entries may involve very small numbers, so numerical precision is also a concern.
The advantage of our equation (7) is that \({\boldsymbol\Omega}_2\) can be formed from by-products of the pairwise estimation procedure, i.e. the matrices \({\mathcal H}\) and \(\tilde{\boldsymbol\Delta}\), as well as the (estimated) pairwise probabilities \(\pi_{y_iy_j}^{(ij)}({\boldsymbol\theta})\). What remains is the proper estimation of \({\boldsymbol\Sigma}_2\), and this depends on the sampling design employed. The complex sampling procedure article touches upon this briefly, with further details available in the manuscript.
Summary
In general, all (limited information) goodnessoffit test statistics take the form \[ X^2 = n \hat{\mathbf e}_2^\top \hat{\boldsymbol\Xi}\hat{\mathbf e}_2, \tag{8} \] where \(\hat{\boldsymbol\Xi}\xrightarrow{\text P} {\boldsymbol\Xi}\) is some \(S\times S\) weight matrix. We summarise the various weight matrices that we use in this R package:
| | Name | Weight (\({\boldsymbol\Xi}\)) | D.f. | R function | Remarks |
|---|-------------|------|------|------------|---------|
| 1 | Wald | \({\boldsymbol\Omega}_2^+\) | \(S-m\) | `Wald_test()` | \({\boldsymbol\Omega}_2\) may be rank deficient |
| 2 | Wald V2 | \(\mathop{\mathrm{diag}}({\boldsymbol\Omega}_2)^{-1}\) | est. | `Wald_diag_test()` | \({\boldsymbol\Omega}_2\) need not be inverted |
| 3 | Wald V3 | \({\boldsymbol\Xi}{\boldsymbol\Omega}_2{\boldsymbol\Xi}\) | \(S-m\) | `Wald_vcovf_test()` | \({\boldsymbol\Omega}_2\) need not be estimated |
| 4 | Pearson | \(\mathop{\mathrm{diag}}({\boldsymbol\pi}_2)^{-1}\) | est. | `Pearson_RS_test()` | Rao-Scott adjustment |
| 5 | Pearson V2 | \(\mathop{\mathrm{diag}}({\boldsymbol\pi}_2)^{-1}\) | est. | `Pearson_test()` | Moment matching |
| 6 | RSS | \({\mathbf I}\) | est. | `RSS_test()` | |
| 7 | Multinomial | \({\boldsymbol\Sigma}_2^{-1}\) | est. | `Multn_test()` | |
It should be noted that all of the above tests, with the exception of the Wald V3 test, require estimation of the \({\boldsymbol\Omega}_2\) matrix. For this purpose, we use the sample versions of the relevant quantities, mostly evaluating \(\pi_{y_iy_j}^{(ij)}({\boldsymbol\theta})\) at the estimates \({\boldsymbol\theta}=\hat{\boldsymbol\theta}_{\text{PL}}\), i.e.
- \({\mathbf H}= -\nabla^2\operatorname{\ell_P}({\boldsymbol\theta}) \big|_{{\boldsymbol\theta}=\hat{\boldsymbol\theta}_{\text{PL}}}\) as an estimate of \({\mathcal H}\);
- \(\hat{\tilde{\boldsymbol\Delta}}_{sk} = \frac{\partial\pi_{y_iy_j}^{(ij)}({\boldsymbol\theta})}{\partial\theta_k} \Big|_{{\boldsymbol\theta}=\hat{\boldsymbol\theta}_{\text{PL}}}\);
- \(\hat{\tilde{\mathbf D}} = \mathop{\mathrm{diag}}\big(\pi_{y_iy_j}^{(ij)}(\hat{\boldsymbol\theta}_{\text{PL}})\big)\); and
- an appropriate estimator of \({\boldsymbol\Sigma}_2\).
In general, all of these test statistics are referred to a chi-square distribution. However, for some of the tests, the degrees of freedom are not known a priori and have to be estimated. There are two methods that we consider: 1) a Rao-Scott type adjustment (Rao and Scott 1979; Rao and Scott 1981; Rao and Scott 1984); and 2) a moment-matching procedure (Bartholomew and Leung 2002). While in theory each test may be subjected to either a Rao-Scott type adjustment or a moment-matching procedure, we only consider the Rao-Scott adjustment for the Pearson test.
Test statistics
In this section, we elaborate further on the specific test statistics used.
Wald test
One strategy is to choose \({\boldsymbol\Xi}\) so that the resulting quadratic form \(X^2\) in (8) is asymptotically chi-square. This strategy is followed by Reiser (1996), Maydeu-Olivares and Joe (2005), and Maydeu-Olivares and Joe (2006). Choosing
- \(\hat{\boldsymbol\Xi}= \hat{\boldsymbol\Omega}_2^-\) (a generalised inverse); or
- \(\hat{\boldsymbol\Xi}= \hat{\boldsymbol\Omega}_2^+\) (the Moore-Penrose inverse)
ensures that \(X^2 \xrightarrow{\text D} \chi^2_k\), where \(k\) equals the rank of \({\boldsymbol\Omega}_2\). A pseudoinverse needs to be employed because \({\boldsymbol\Omega}_2\) may be of deficient rank. When \({\boldsymbol\Omega}_2\) is of full rank (rank \(S\)), \({\boldsymbol\Xi}\) is equal to the usual matrix inverse. For our implementation, and similar to Reiser (1996), we use the Moore-Penrose inverse. Finally, the degrees of freedom will be (at most) \(S-m\), i.e. \(S\) minus the number of free model parameters that need to be estimated.
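A minimal R sketch of this Wald statistic, assuming estimates `e2_hat` and `Omega2_hat` are available, is given below; the pseudoinverse is formed from an eigendecomposition (`MASS::ginv()` would be an alternative), and the code is illustrative rather than the package's implementation.

```r
# Hedged sketch of X^2 = n e2' Omega2^+ e2 with a Moore-Penrose pseudoinverse.
wald_stat <- function(e2_hat, Omega2_hat, n, tol = 1e-8) {
  eig <- eigen(Omega2_hat, symmetric = TRUE)
  pos <- eig$values > tol * max(eig$values)     # keep only non-null eigenvalues
  Omega2_plus <- eig$vectors[, pos, drop = FALSE] %*%
    diag(1 / eig$values[pos], sum(pos)) %*%
    t(eig$vectors[, pos, drop = FALSE])
  X2 <- n * drop(t(e2_hat) %*% Omega2_plus %*% e2_hat)
  df <- sum(pos)                                # rank of Omega2_hat, at most S - m
  c(X2 = X2, df = df, pval = pchisq(X2, df, lower.tail = FALSE))
}
```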
Diagonal Wald test
An alternative Wald test uses \(\hat{\boldsymbol\Xi}= \mathop{\mathrm{diag}}(\hat{\boldsymbol\Omega}_2)^{-1}\) instead of the pseudoinverse of \(\hat{\boldsymbol\Omega}_2\). We refer to this as the Diagonal Wald test. The motivation is stability and ease of computation: inverting a diagonal matrix is straightforward, whereas the full matrix grows rapidly with the number of items \(p\). However, the resulting statistic does not follow the same chi-square distribution as the original Wald test, and its distribution needs to be determined (Bartholomew and Leung 2002). We use the moment-matching procedure for this, which we briefly explain below.
Suppose that \(Y\sim \chi^2_c\). Then we assume that the test statistic \(X^2 = n\hat{\mathbf e}_2^\top \hat{\boldsymbol\Xi}\hat{\mathbf e}_2\) can be approximated by a linear transformation of this chi-square random variate, i.e. \[ X^2 \approx a + bY. \] Let \(\mu_k(X)=\mathop{\mathrm{E}}[(X - \mathop{\mathrm{E}}X)^k]\) represent the \(k\)th central moment of a random variable \(X\). The first three asymptotic central moments of \(X^2\) are (Mathai and Provost 1992, Theorem 3.2b.2, p. 53) \[\begin{equation} \mu_1(X^2) = \operatorname{tr}(\hat{\boldsymbol\Xi}{\boldsymbol\Omega}_2), \hspace{1em} \mu_2(X^2) = 2\operatorname{tr}\big((\hat{\boldsymbol\Xi}{\boldsymbol\Omega}_2)^2\big), \hspace{1em} \mu_3(X^2) = 8\operatorname{tr}\big((\hat{\boldsymbol\Xi}{\boldsymbol\Omega}_2)^3\big), \tag{9} \end{equation}\] while the first three central moments of the approximating random variable are \[\begin{equation} \mu_1 = a + bc, \hspace{2em} \mu_2 = 2b^2c, \hspace{2em} \mu_3 = 8b^3c. \tag{10} \end{equation}\] Useful to note here are the relationships \(\mu_1=\mathop{\mathrm{E}}(Y)\), \(\mu_2=\mathop{\mathrm{Var}}(Y)\), and \(\mu_3=\mathop{\mathrm{E}}(Y^3)-3\mu\mathop{\mathrm{Var}}(Y)-\mu^3\) (with \(\mu=\mathop{\mathrm{E}}Y\)) for any random variable \(Y\), while the raw moments of a chi-square random variable with \(c\) degrees of freedom are given by \(\mathop{\mathrm{E}}(Y^k)=c(c+2)(c+4)\cdots(c+2k-2)\). Equating the two sets of moments in (9) and (10) yields the three-moment adjustment parameters \[\begin{equation} b = \frac{\mu_3(X^2)}{4\mu_2(X^2)}, \hspace{2em} c = \frac{\mu_2(X^2)}{2b^2}, \hspace{2em} a = \mu_1(X^2) - bc. \end{equation}\] Tail area probabilities for \(p\)-value calculations can then be computed as \[\begin{equation} \Pr(X^2 > x) \approx \Pr\left(Y > \frac{x-a}{b} \ \Big| \ Y \sim \chi^2_c\right). \end{equation}\]
We can also use two-moment and one-moment matching techniques similarly. For the two-moment adjustment, assume that \(X^2 \approx bY\) where \(Y\sim\chi^2_c\). Matching the first two asymptotic moments of \(X^2\) with those of the approximating random variable gives \[\begin{equation} b = \frac{\mu_2(X^2)}{2\mu_1(X^2)} \hspace{1em}\text{and}\hspace{1em} c = \frac{\mu_1(X^2)}{b}. \end{equation}\] As for the one-moment adjustment, \(X^2\) is again approximated by \(bY\) with \(Y\sim\chi^2_c\), but \(c\) is fixed at the number of degrees of freedom available for testing (taken to be \(c=S-m\)). Solving for \(b\) yields \[\begin{equation} b = \frac{\mu_1(X^2)}{c}. \end{equation}\] In both cases, tail probabilities are calculated as \[\begin{equation} \Pr(X^2 > x) \approx \Pr\left(Y > \frac{x}{b} \ \Big| \ Y \sim \chi^2_c\right). \end{equation}\] Maydeu-Olivares and Joe (2008) advise that one-moment matching results in inaccurate \(p\)-values.
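The following R sketch pulls the three matching schemes together; `Xi_hat`, `Omega2_hat`, the observed statistic `x2`, and the constants `S` and `m` are assumed inputs, and the function is an illustration rather than the package's implementation.

```r
# Hedged sketch: one-, two- and three-moment chi-square matching for
# X^2 = n e2' Xi e2, using the asymptotic moments mu_1, mu_2, mu_3 above.
moment_match_pval <- function(x2, Xi_hat, Omega2_hat, S, m,
                              order = c("three", "two", "one")) {
  order <- match.arg(order)
  M   <- Xi_hat %*% Omega2_hat
  mu1 <- sum(diag(M))                   # tr(Xi Omega2)
  mu2 <- 2 * sum(diag(M %*% M))         # 2 tr((Xi Omega2)^2)
  mu3 <- 8 * sum(diag(M %*% M %*% M))   # 8 tr((Xi Omega2)^3)
  if (order == "three") {
    b <- mu3 / (4 * mu2); c0 <- mu2 / (2 * b^2); a <- mu1 - b * c0
    pchisq((x2 - a) / b, df = c0, lower.tail = FALSE)
  } else if (order == "two") {
    b <- mu2 / (2 * mu1); c0 <- mu1 / b
    pchisq(x2 / b, df = c0, lower.tail = FALSE)
  } else {
    c0 <- S - m; b <- mu1 / c0
    pchisq(x2 / b, df = c0, lower.tail = FALSE)
  }
}
```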
Variance-covariance free Wald test
The drawbacks of using \(\hat{\boldsymbol\Omega}_2^+\) as the weight matrix in the quadratic form are that 1) \({\boldsymbol\Omega}_2\) needs to be estimated; and 2) the degrees of freedom \(k\) cannot be determined a priori; if determined by inspecting the eigenvalues, it can be tricky to judge which of the small eigenvalues are actually zero.
To overcome this, Maydeu-Olivares and Joe (2005) and Maydeu-Olivares and Joe (2006) suggested using a weight matrix \({\boldsymbol\Xi}\) such that \({\boldsymbol\Omega}_2\) is a g-inverse of \({\boldsymbol\Xi}\), i.e. \({\boldsymbol\Xi}= {\boldsymbol\Xi}{\boldsymbol\Omega}_2{\boldsymbol\Xi}\). Let \({\boldsymbol\Delta}_2^\perp\) be an \(S \times (S-m)\) orthogonal complement to \({\boldsymbol\Delta}_2\), i.e. a full-column-rank matrix satisfying \(({\boldsymbol\Delta}_2^\perp)^\top{\boldsymbol\Delta}_2 = {\mathbf 0}\). Then from (5), we see that \[ \sqrt n ({\boldsymbol\Delta}_2^\perp)^\top \hat {\mathbf e}_2 \xrightarrow{\text D} \mathop{\mathrm{N}}\big({\mathbf 0}, ({\boldsymbol\Delta}_2^\perp)^\top {\boldsymbol\Omega}_2 {\boldsymbol\Delta}_2^\perp\big). \tag{11} \] Because of (6), the asymptotic covariance matrix may be written \[ ({\boldsymbol\Delta}_2^\perp)^\top {\boldsymbol\Omega}_2 {\boldsymbol\Delta}_2^\perp = ({\boldsymbol\Delta}_2^\perp)^\top {\boldsymbol\Sigma}_2 {\boldsymbol\Delta}_2^\perp, \] since every product of \({\boldsymbol\Delta}_2\) with its orthogonal complement vanishes. By letting \[ {\boldsymbol\Xi}= {\boldsymbol\Delta}_2^\perp \big( ({\boldsymbol\Delta}_2^\perp)^\top {\boldsymbol\Sigma}_2 {\boldsymbol\Delta}_2^\perp \big)^{-1} ({\boldsymbol\Delta}_2^\perp)^\top, \] we can then verify that \({\boldsymbol\Xi}= {\boldsymbol\Xi}{\boldsymbol\Omega}_2{\boldsymbol\Xi}\); that is, \({\boldsymbol\Omega}_2\) is a generalised inverse of \({\boldsymbol\Xi}\). Let \(\hat{\boldsymbol\Xi}\) be an appropriate estimate of \({\boldsymbol\Xi}\), e.g. obtained by replacing the matrices above with their corresponding hatted versions. The implication is that \[ X^2 = n \hat{\mathbf e}_2^\top \hat{\boldsymbol\Xi}\hat{\mathbf e}_2 = n \hat{\mathbf e}_2^\top \hat{\boldsymbol\Delta}_2^\perp \big( (\hat{\boldsymbol\Delta}_2^\perp)^\top \hat{\boldsymbol\Sigma}_2 \hat{\boldsymbol\Delta}_2^\perp \big)^{-1} (\hat{\boldsymbol\Delta}_2^\perp)^\top \hat{\mathbf e}_2 \] converges in distribution to a \(\chi^2_{S-m}\) variate as \(n\to\infty\), due to (11) and Slutsky’s theorem. The degrees of freedom follow because \({\boldsymbol\Delta}_2^\perp\) has full column rank \(S-m\), and hence \({\boldsymbol\Xi}\) also has rank \(S-m\): the dimensions of a column space and its orthogonal complement add up to the dimension of the whole space, and since \({\boldsymbol\Delta}_2 \in \mathbb{R}^{S\times m}\) has column rank \(m\), its orthogonal complement in \(\mathbb{R}^S\) has dimension \(S-m\).
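A hedged R sketch of this statistic is given below; it assumes estimates `e2_hat`, `Delta2_hat` (of full column rank) and `Sigma2_hat` are available, and obtains the orthogonal complement from a complete QR decomposition.

```r
# Hedged sketch of the variance-covariance free Wald statistic.
vcovfree_wald <- function(e2_hat, Delta2_hat, Sigma2_hat, n) {
  S <- nrow(Delta2_hat); m <- ncol(Delta2_hat)
  Q <- qr.Q(qr(Delta2_hat), complete = TRUE)
  Delta2_perp <- Q[, (m + 1):S, drop = FALSE]   # S x (S - m); t(perp) %*% Delta2 = 0
  u  <- t(Delta2_perp) %*% e2_hat               # (S - m)-vector
  A  <- t(Delta2_perp) %*% Sigma2_hat %*% Delta2_perp
  X2 <- n * drop(t(u) %*% solve(A, u))
  c(X2 = X2, df = S - m, pval = pchisq(X2, df = S - m, lower.tail = FALSE))
}
```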
Pearson test
To construct a Pearson test statistic, let \(\hat{\boldsymbol\Xi}^{-1} = \hat{\mathbf D}_2 := \mathop{\mathrm{diag}}\big( {\boldsymbol\pi}_2(\hat{\boldsymbol\theta}) \big) \xrightarrow{\text P} {\mathbf D}_2:=\mathop{\mathrm{diag}}\big( {\boldsymbol\pi}_2({\boldsymbol\theta}) \big).\) Then \[\begin{equation}\label{eq:Pearsontest} X^2 = n\hat{\mathbf e}_2^\top\hat{\mathbf D}_2^{-1}\hat{\mathbf e}_2 = n \sum_{i=1}^p \frac{(\dot p_i - \dot\pi_i(\hat{\boldsymbol\theta}))^2}{\dot\pi_i(\hat{\boldsymbol\theta})} + n \sum_{i<j} \frac{(\dot p_{ij} - \dot\pi_{ij}(\hat{\boldsymbol\theta}))^2}{\dot\pi_{ij}(\hat{\boldsymbol\theta})}, \end{equation}\] which resembles the traditional Pearson chi-square test statistic. Note that \[\begin{equation} \sqrt n \hat{\mathbf D}_2^{-1/2}\hat{\mathbf e}_2 \xrightarrow{\text D} \mathop{\mathrm{N}}({\mathbf 0}, {\mathbf D}_2^{-1/2}{\boldsymbol\Omega}_2 {\mathbf D}_2^{-1/2}) \end{equation}\] as \(n\to\infty\), and therefore \(X^2\) has the limiting distribution of \(\sum_{s=1}^S \delta_s X_s\), where the \(\delta_s\) are the eigenvalues of \(n^{-1} {\mathbf D}_2^{-1/2}{\boldsymbol\Omega}_2 {\mathbf D}_2^{-1/2}\) and \(X_s\,\overset{\text{iid}}{\sim}\,\chi^2_1\) (Mathai and Provost 1992, Representation 3.1a.1, p. 29). These eigenvalues can be estimated by the eigenvalues of \(n^{-1} \hat{\mathbf D}_2^{-1/2}\hat{\boldsymbol\Omega}_2 \hat {\mathbf D}_2^{-1/2}\). In principle, this method can be applied more generally by considering a Cholesky decomposition \({\boldsymbol\Xi}= {\mathbf L}{\mathbf L}^\top\) instead of matrix square roots. Of course, this only works if the matrix \({\boldsymbol\Xi}\) is of full rank, which rules out the Wald test statistic above.
Let \(\bar\delta\) denote the arithmetic mean of the eigenvalues \(\delta_s\). The asymptotic mean and variance of \(X^2\) are \[\begin{align} \mathop{\mathrm{E}}X^2 &= \mathop{\mathrm{E}}\sum_{s=1}^S \delta_s X_s = \sum_{s=1}^S \delta_s = S\bar\delta = \operatorname{tr}(n^{-1} {\mathbf D}_2^{-1/2}{\boldsymbol\Omega}_2 {\mathbf D}_2^{-1/2}), \\ \mathop{\mathrm{Var}}X^2 &= \mathop{\mathrm{Var}}\sum_{s=1}^S \delta_s X_s = 2\sum_{s=1}^S\delta_s^2 = 2\sum_{s=1}^S (\delta_s - \bar\delta)^2 + 2S\bar\delta^2, \end{align}\] so the power of the test is affected by the size and dispersion of the eigenvalues (Heo 2014). Rao and Scott (1979), Rao and Scott (1981), and Rao and Scott (1984) suggested adjustments to the Pearson test as follows.
The first-order Rao-Scott type test is obtained by dividing \(X^2\) by \(\bar \delta\). This statistic is approximately distributed as \(\chi^2_S\).
The second-order Rao-Scott type test is obtained by further dividing \(X^2/\bar\delta\) by \(1+\hat a^2\), where \(\hat a=\sqrt{S^{-1}\sum_{s=1}^S (\delta_s - \bar\delta)^2}\big/\bar\delta\) is the coefficient of variation of the eigenvalues. Hence, \(1+\hat a^2= (S\bar\delta ^2)^{-1}\sum_{s=1}^S \delta_s^2\). This statistic is approximately distributed as \(\chi^2_k\), where \(k=S(1+\hat a^2)^{-1}\).
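A small R sketch of both adjustments, assuming a vector `delta` of estimated eigenvalues and the unadjusted statistic `x2`, is given below (illustrative only, not the package's implementation).

```r
# Hedged sketch: first- and second-order Rao-Scott adjusted p-values for the
# Pearson statistic, given the (estimated) eigenvalues delta_s described above.
rao_scott_pval <- function(x2, delta, order = 2) {
  S         <- length(delta)
  delta_bar <- mean(delta)
  if (order == 1) {
    # first order: X^2 / delta_bar referred to a chi-square with S df
    pchisq(x2 / delta_bar, df = S, lower.tail = FALSE)
  } else {
    # second order: further divide by 1 + a^2, with df S / (1 + a^2)
    a2 <- sum((delta - delta_bar)^2) / (S * delta_bar^2)  # squared coeff. of variation
    pchisq(x2 / (delta_bar * (1 + a2)), df = S / (1 + a2), lower.tail = FALSE)
  }
}
```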