
Let \({\mathbf Y}= (Y_1, \dots, Y_p)^\top \in \{0,1\}^p\) be a vector of Bernoulli random variables. Consider a response pattern \({\mathbf y}= (y_1,\dots,y_p)^\top\), where each \(y_i\in\{0,1\}\). The probability of observing such a response pattern is given by the joint distribution \[\begin{align*} \pi &= \Pr({\mathbf Y}= {\mathbf y}) = \Pr(Y_1=y_1,\dots,Y_p=y_p). \tag{1} \end{align*}\] Note that there are a total of \(R=2^p\) joint probabilities \(\pi_r\), one for each possible response pattern \({\mathbf y}_r\).

When we consider a parametric model with parameter vector \({\boldsymbol\theta}\in\mathbb{R}^m\), we write \(\pi_r({\boldsymbol\theta})\) to indicate each joint probability, and \[\begin{equation} {\boldsymbol\pi}({\boldsymbol\theta}) = \begin{pmatrix} \pi_{1}({\boldsymbol\theta}) \\ \vdots \\ \pi_{R}({\boldsymbol\theta}) \\ \end{pmatrix} \in [0,1]^R \tag{2} \end{equation}\] for the vector of joint probabilities, with \(\sum_{r=1}^R \pi_{r}({\boldsymbol\theta}) =1\).

Model-based probabilities

The model of interest is a factor model, commonly used in social statistics. Using an underlying variable (UV) approach, the observed binary responses \(y_i\) are treated as manifestations of latent, continuous variables \(Y_i^*\), \(i=1,\dots,p\). The connection is made as follows: \[ Y_i = \begin{cases} 1 & Y_i^* > \tau_i \\ 0 & Y_i^* \leq \tau_i, \end{cases} \] where \(\tau_i\) is the threshold associated with the variable \(Y_i^*\). For convenience, each \(Y_i^*\) is taken to be a standard normal random variable. The factor model takes the form \[ \mathbf Y^* = \boldsymbol\Lambda\boldsymbol \eta + \boldsymbol \epsilon, \] where each component is explained below:

  • \(\mathbf Y^* = (Y_1^*,\dots,Y_p^*)^\top \in \mathbf R^p\) are the underlying variables;

  • \(\boldsymbol\Lambda \in \mathbf R^{p \times q}\) is the matrix of loadings;

  • \(\boldsymbol \eta = (\eta_1,\dots,\eta_q)^\top \in \mathbf R^q\) is the vector of latent factors;

  • \(\boldsymbol \epsilon \in \mathbf R^p\) are the error terms associated with the items (aka unique variables).

We also make some distributional assumptions, namely

  1. \(\boldsymbol\eta \sim \operatorname{N}_q(\mathbf 0, \boldsymbol\Psi)\), where \({\boldsymbol\Psi}\) is a correlation matrix, i.e. for \(k,l\in\{1,\dots,q\}\), \[ {\boldsymbol\Psi}_{kl} = \begin{cases} 1 & \text{if } k = l \\ \rho(\eta_k, \eta_l) & \text{if } k \neq l. \end{cases} \]

  2. \(\boldsymbol \epsilon \sim \operatorname{N}_p(\mathbf 0, \boldsymbol\Theta_\epsilon)\), with \(\boldsymbol\Theta_\epsilon = \mathbf I - \operatorname{diag}(\boldsymbol\Lambda \boldsymbol\Psi \boldsymbol\Lambda^\top)\).

These two assumptions, together with \(\operatorname{Cov}(\boldsymbol\eta, \boldsymbol\epsilon) = \mathbf 0\), imply that \(\mathbf Y^*\sim\operatorname{N}_p(\mathbf 0,{\boldsymbol\Sigma}_{{\mathbf y}^*})\), where \[\begin{align} {\boldsymbol\Sigma}_{{\mathbf y}^*} = \operatorname{Var}(\mathbf Y^*) &= \boldsymbol\Lambda\boldsymbol\Psi\boldsymbol\Lambda^\top + \boldsymbol\Theta_\epsilon \\ &= \mathbf I + \big(\boldsymbol\Lambda\boldsymbol\Psi\boldsymbol\Lambda^\top - \operatorname{diag}(\boldsymbol\Lambda \boldsymbol\Psi \boldsymbol\Lambda^\top)\big). \end{align}\] The parameter vector for this factor model is \(\boldsymbol\theta = (\boldsymbol\lambda^\top, \boldsymbol\psi^\top, \boldsymbol\tau^\top)^\top \in\mathbb{R}^m\), which collects the vectors of free, non-redundant parameters in \(\boldsymbol\Lambda\) and \(\boldsymbol\Psi\) respectively, as well as the vector of all free thresholds.
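To make the setup concrete, here is a minimal simulation sketch of the UV factor model. All values below (the number of items \(p = 5\), a single factor \(q = 1\), and the particular loadings and thresholds) are illustrative assumptions, not part of the model description above.

```r
## Minimal sketch: simulate binary data from the UV factor model.
## The values of p, q, Lambda, Psi and tau below are purely illustrative.
set.seed(1)
n      <- 1000
p      <- 5
Lambda <- matrix(c(0.8, 0.7, 0.6, 0.5, 0.4), ncol = 1)  # p x q loadings (q = 1)
Psi    <- diag(1)                                        # factor correlation matrix
tau    <- rep(0, p)                                      # thresholds

Sigma_ystar <- Lambda %*% Psi %*% t(Lambda)
diag(Sigma_ystar) <- 1   # Theta_eps = I - diag(Lambda Psi Lambda') gives unit diagonal

ystar <- MASS::mvrnorm(n, mu = rep(0, p), Sigma = Sigma_ystar)
Y     <- 1L * sweep(ystar, 2, tau, FUN = ">")            # Y_i = 1  iff  Y_i* > tau_i
head(Y)
```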

Under this factor model, the probability of response pattern \(\mathbf y_r\) is \[\begin{align} \pi_{r}({\boldsymbol\theta}) &= \Pr(\mathbf Y = \mathbf y_r \mid \boldsymbol\theta) \\ &= \idotsint_A \phi_p(\mathbf y^* \mid \mathbf 0, \boldsymbol\Sigma_{\mathbf y^*} ) \, \text{d}\mathbf y^* \end{align}\] where \(\phi_p(\cdot \mid \boldsymbol\mu,\boldsymbol\Sigma)\) is the density function of the \(p\)-dimensional normal distribution with mean \(\boldsymbol\mu\) and variance \(\boldsymbol\Sigma\). The integral is evaluated over the rectangular region \[ A = A_1 \times \cdots \times A_p, \qquad A_i = \begin{cases} (\tau_i, \infty) & \text{if } y_{ri} = 1 \\ (-\infty, \tau_i] & \text{if } y_{ri} = 0, \end{cases} \] i.e. the set of values of \(\mathbf y^*\) consistent with the response pattern \(\mathbf y_r\).
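As a sketch, each \(\pi_r({\boldsymbol\theta})\) can be evaluated as a multivariate normal rectangle probability, for instance with mvtnorm::pmvnorm(). The particular pattern, thresholds, and covariance matrix below are illustrative and reuse Sigma_ystar from the simulation sketch above.

```r
## Sketch: pi_r(theta) as a normal rectangle probability over A = A_1 x ... x A_p.
library(mvtnorm)

pattern_prob <- function(y, tau, Sigma) {
  lower <- ifelse(y == 1, tau, -Inf)   # y_i = 1  <=>  Y_i* in (tau_i, Inf)
  upper <- ifelse(y == 1, Inf, tau)    # y_i = 0  <=>  Y_i* in (-Inf, tau_i]
  pmvnorm(lower = lower, upper = upper, mean = rep(0, length(y)), sigma = Sigma)
}

pattern_prob(y = c(1, 0, 1, 1, 0), tau = rep(0, 5), Sigma = Sigma_ystar)
```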

Parameter estimation

Suppose that \(h=1,\dots,n\) observations of \({\mathbf Y}={\mathbf y}^{(h)}\) are obtained. For the purpose of generalising from independent samples to complex samples, suppose that each unit \(h\) in the sample is assigned a (normalised) survey weight \(w_h\) with \(\sum_h w_h = n\). Of course, if an independent simple random sampling scheme is implemented, then each \(w_h=1\).

The sample proportion for each response pattern \(r\) is written \(p_r = \hat n_r/n\), where \[\begin{equation} \hat n_r = \sum_h w_h [{\mathbf y}^{(h)} = {\mathbf y}_r], \end{equation}\] and \([\cdot]\) denotes the Iverson bracket. In other words, \(\hat n_r\) represents the (weighted) frequency count of the observed response pattern \({\mathbf y}_r\), with \(\sum_{r=1}^R \hat n_r = n\). The vector \(\hat{\mathbf n}= (\hat n_1, \dots, \hat n_R)^\top\) then follows a multivariate binomial distribution, more commonly known as the multinomial distribution, with parameters \(n\), \(R\), and \({\boldsymbol\pi}({\boldsymbol\theta})\). The probability mass function of \(\hat{\mathbf n}\) is given by \[\begin{equation} f_{\hat{\mathbf n}}(x_1,\dots,x_R) = n! \prod_{r=1}^R \frac{1}{x_r!} \big[\pi_{r}({\boldsymbol\theta})\big]^{x_r}, \end{equation}\] and the log-likelihood for \({\boldsymbol\theta}\) given the observed frequencies \(\hat{\mathbf n}\) (ignoring constants) is then \[\begin{equation} \ell({\boldsymbol\theta}) = \log f_{\hat{\mathbf n}}(\hat n_1,\dots,\hat n_R) = \sum_{r=1}^{R} \hat n_r \log \pi_{r}({\boldsymbol\theta}). \end{equation}\]
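A small sketch of the weighted pattern counts and the resulting log-likelihood. The helper names are hypothetical, and the vector of model probabilities passed to loglik_multinom() is assumed to be ordered to match the counts.

```r
## Sketch: weighted response-pattern counts and the multinomial log-likelihood.
pattern_counts <- function(Y, w = rep(1, nrow(Y))) {
  patt <- apply(Y, 1, paste0, collapse = "")   # e.g. "01011"
  tapply(w, patt, sum)                         # weighted frequencies n_r-hat
}

## pi_theta is assumed to be ordered to match nhat (patterns with zero count omitted).
loglik_multinom <- function(nhat, pi_theta) sum(nhat * log(pi_theta))

nhat <- pattern_counts(Y)   # Y from the simulation sketch above
```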

The maximum likelihood estimator \(\hat{\boldsymbol\theta}_{\text{ML}}\) satisfies \(\hat{\boldsymbol\theta}_{\text{ML}}= \mathop{\mathrm{argmax}}_{{\boldsymbol\theta}} \ell({\boldsymbol\theta})\). Maximum likelihood theory tells us that, under certain regularity conditions, as \(n\to\infty\), \[\begin{equation}\label{eq:limitdisttheta} \sqrt n (\hat{\boldsymbol\theta}- {\boldsymbol\theta}) \xrightarrow{\text D} {\mathop{\mathrm{N}}}_m\big({\mathbf 0}, {\mathcal I}({\boldsymbol\theta})^{-1}\big), \end{equation}\] where \(\mathbb{R}^{m\times m} \ni {\mathcal I}= -n^{-1}\mathop{\mathrm{E}}\nabla^2 \ell({\boldsymbol\theta})\) is the (unit) expected Fisher information matrix. It can be further shown that \({\mathcal I}= {\boldsymbol\Delta}^\top {\mathbf D}^{-1} {\boldsymbol\Delta}\), where

  • \({\boldsymbol\Delta}_{r,k} = \frac{\partial\pi_r({\boldsymbol\theta})}{\partial\theta_k}\), \(r=1,\dots,R\), \(k=1,\dots,m\); and
  • \({\mathbf D}= \mathop{\mathrm{diag}}({\boldsymbol\pi}({\boldsymbol\theta}))\).

The maximum likelihood estimator belongs to the class of best asymptotically normal (BAN) estimators \(\hat{\boldsymbol\theta}\) of \({\boldsymbol\theta}\), which satisfy \[\begin{equation} \sqrt n (\hat{\boldsymbol\theta}- {\boldsymbol\theta}) = \sqrt n \, {\mathbf B}\big({\mathbf p}- {\boldsymbol\pi}({\boldsymbol\theta})\big) + o(1) \tag{3} \end{equation}\] for some \(m\times R\) matrix \({\mathbf B}\). In the specific case of maximum likelihood estimation, \({\mathbf B}= {\mathcal I}^{-1} {\boldsymbol\Delta}^\top {\mathbf D}^{-1}\). This is proven in the corresponding article (see XXX for more details).
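A minimal sketch of these quantities, assuming the Jacobian \({\boldsymbol\Delta}\) (an \(R \times m\) matrix) and the probability vector \({\boldsymbol\pi}({\boldsymbol\theta})\) (of length \(R\)) have been computed elsewhere.

```r
## Sketch: Fisher information I = Delta' D^{-1} Delta and the ML B matrix.
fisher_info <- function(Delta, pi) crossprod(Delta, Delta / pi)  # Delta' diag(pi)^{-1} Delta

B_ML <- function(Delta, pi) {
  Imat <- fisher_info(Delta, pi)
  solve(Imat, t(Delta / pi))                                     # I^{-1} Delta' D^{-1}
}
```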

Pairwise likelihood estimation

The main interest for this project is to construct test statistics using composite likelihood methods, specifically the pairwise likelihood method. In order to define the pairwise likelihood, let \(\pi_{y_iy_j}^{(ij)}({\boldsymbol\theta})\) be the probability under the model that \(Y_i=y_i \in \{0,1\}\) and \(Y_j=y_j\in\{0,1\}\) for a pair of variables \(Y_i\) and \(Y_j\), \(i,j=1,\dots,p\) and \(i<j\). The pairwise log-likelihood takes the form \[\begin{equation} \operatorname{\ell_P}({\boldsymbol\theta}) = \sum_{i<j} \sum_{y_i}\sum_{y_j} \hat n_{y_iy_j}^{(ij)} \log \pi_{y_iy_j}^{(ij)}({\boldsymbol\theta}), \end{equation}\] where \(\hat n_{y_iy_j}^{(ij)}\) is the observed (weighted) frequency of sample units with \(Y_i=y_i\) and \(Y_j=y_j\), \[ \hat n_{y_iy_j}^{(ij)} = \sum_h w_h [y^{(h)}_i = y_i, y^{(h)}_j = y_j]. \] Let us also define the corresponding observed pairwise proportions \(p_{y_iy_j}^{(ij)} = \hat n_{y_iy_j}^{(ij)}/n\). There are a total of \(\tilde R = 4 \times {p \choose 2}\) summands, where the factor 4 corresponds to the four possible pairwise response patterns ‘00’, ‘10’, ‘01’, and ‘11’.
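A sketch of the weighted pairwise frequencies \(\hat n^{(ij)}_{y_iy_j}\) and the pairwise log-likelihood. The list of \(2\times 2\) model probability tables (pi_pairs) is a hypothetical input that would come from the fitted model; the function names are illustrative.

```r
## Sketch: weighted 2 x 2 pairwise frequency tables for every item pair (i, j).
pairwise_counts <- function(Y, w = rep(1, nrow(Y))) {
  p   <- ncol(Y)
  out <- list()
  for (i in 1:(p - 1)) for (j in (i + 1):p) {
    out[[paste(i, j, sep = ",")]] <-
      xtabs(w ~ factor(Y[, i], levels = 0:1) + factor(Y[, j], levels = 0:1))
  }
  out
}

## Pairwise log-likelihood, given a list `pi_pairs` of 2 x 2 model probability
## tables in the same order as the frequency tables.
pairwise_loglik <- function(nhat_pairs, pi_pairs) {
  sum(mapply(function(nh, pp) sum(nh * log(pp)), nhat_pairs, pi_pairs))
}
```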

The pairwise maximum likelihood estimator \(\hat{\boldsymbol\theta}_{\text{PL}}\) satisfies \(\hat{\boldsymbol\theta}_{\text{PL}}= \mathop{\mathrm{argmax}}_{{\boldsymbol\theta}} \operatorname{\ell_P}({\boldsymbol\theta})\). Under certain regularity conditions, \[\begin{equation} \sqrt n (\hat{\boldsymbol\theta}_{\text{PL}}- {\boldsymbol\theta}) \xrightarrow{\text D} {\mathop{\mathrm{N}}}_m\big({\mathbf 0}, {\mathcal H}({\boldsymbol\theta})^{-1}{\mathcal J}({\boldsymbol\theta}){\mathcal H}({\boldsymbol\theta})^{-1}\big), \end{equation}\] where

  • \({\mathcal H}({\boldsymbol\theta})=-\mathop{\mathrm{E}}\nabla^2\operatorname{\ell_P}({\boldsymbol\theta})\) is the sensitivity matrix; and
  • \({\mathcal J}({\boldsymbol\theta})=\mathop{\mathrm{Var}}\big(\sqrt n \nabla\operatorname{\ell_P}({\boldsymbol\theta})\big)\) is the variability matrix.

We can also show that the pairwise maximum likelihood estimator satisfies (3), where the matrix \({\mathbf B}\) in this case is \({\mathbf B}= {\mathcal H}^{-1}\tilde{\boldsymbol\Delta}^\top\tilde {\mathbf D}^{-1} {\mathbf G}\), with

  • \(\tilde{\boldsymbol\Delta}\in \mathbb{R}^{\tilde R \times m}\) consists of partial derivatives of the pairwise probabilities, i.e. \(\frac{\partial\pi_{y_iy_j}^{(ij)}({\boldsymbol\theta})}{\partial\theta_k}\);
  • \(\tilde{\mathbf D}= \mathop{\mathrm{diag}}((\pi_{y_iy_j}^{(ij)}({\boldsymbol\theta}))_{i<j})\); and
  • \({\mathbf G}\) is an indicator matrix linking the \(\tilde R\)-dimensional pairwise quantities to the \(R\)-dimensional joint quantities. See other articles for further details.

Goodness-of-fit

Collect all the sample proportions \(p_r = \hat n_r/n\), \(r=1,\dots,R\), to form an \(R\)-vector of proportions \({\mathbf p}\). For independent samples, it is widely known (Agresti 2012) that \[\begin{equation} \sqrt n ({\mathbf p}- {\boldsymbol\pi}) \xrightarrow{\text D} {\mathop{\mathrm{N}}}_R ({\mathbf 0}, {\boldsymbol\Sigma}), \tag{4} \end{equation}\] where \({\boldsymbol\Sigma}= {\mathbf D}- {\boldsymbol\pi}{\boldsymbol\pi}^\top\), with \({\mathbf D}:= \mathop{\mathrm{diag}}({\boldsymbol\pi})\). Under a parametric model, the probability vector in (4) reads \({\boldsymbol\pi}= {\boldsymbol\pi}({\boldsymbol\theta})\). An analogous result exists for the complex sampling design case as well (Fuller 2009); however, the covariance matrix \({\boldsymbol\Sigma}\) need not take a multinomial form, and the true proportions \({\boldsymbol\pi}\) may be either a finite population or a superpopulation quantity.

Suppose we denote by \({\boldsymbol\pi}(\hat{\boldsymbol\theta})\) the estimated proportions under a parametric model. For estimators \(\hat{\boldsymbol\theta}\) satisfying (3), Maydeu-Olivares and Joe (2005) show that the distribution of the residuals \(\hat{\mathbf e}= {\mathbf p}- {\boldsymbol\pi}(\hat{\boldsymbol\theta})\) is asymptotically normal: \[\begin{equation} \sqrt n \hat{\mathbf e}= \sqrt n \big( {\mathbf p}- {\boldsymbol\pi}(\hat{\boldsymbol\theta}) \big) \xrightarrow{\text D} {\mathop{\mathrm{N}}}_R ({\mathbf 0}, {\boldsymbol\Omega}) \end{equation}\] where \[\begin{align} {\boldsymbol\Omega} &= ({\mathbf I}- {\boldsymbol\Delta}{\mathbf B}){\boldsymbol\Sigma}({\mathbf I}- {\boldsymbol\Delta}{\mathbf B})^\top \\ &= {\boldsymbol\Sigma}- {\boldsymbol\Sigma}{\mathbf B}^\top{\boldsymbol\Delta}^\top - {\boldsymbol\Delta}{\mathbf B}{\boldsymbol\Sigma}+ {\boldsymbol\Delta}{\mathbf B}{\boldsymbol\Sigma}{\mathbf B}^\top {\boldsymbol\Delta}^\top, \end{align}\] with \({\mathbf B}\) being the transformation matrix and \({\boldsymbol\Delta}\) the matrix of partial derivatives of \({\boldsymbol\pi}({\boldsymbol\theta})\) (both described previously). Maydeu-Olivares and Joe (2005) remark that for maximum likelihood estimators, the residual covariance matrix simplifies to \({\boldsymbol\Omega}= {\boldsymbol\Sigma}- {\boldsymbol\Delta}{\mathcal I}^{-1}{\boldsymbol\Delta}^\top\). We do not believe such a simplification exists for the pairwise likelihood case, although this is still under investigation.

Lower-order residuals

The interest here is in obtaining the lower-order marginals from the full \(R\)-dimensional probability vector \({\boldsymbol\pi}\). Namely, we would like to get

  1. Univariate moments of the form \(\dot\pi_i = \Pr(Y_i=1)\) (of which there are \(p\)) \[ \dot{\boldsymbol\pi}_1^\top = (\dot\pi_1,\dots,\dot\pi_p) \in[0,1]^p \]

  2. Bivariate moments of the form \(\dot\pi_{ij}=\Pr(Y_i=1,Y_j=1)\) (of which there are \({p \choose 2}\)) \[ \dot{\boldsymbol\pi}_2^\top = \big(\dot\pi_{ij} \big)_{\substack{i,j=1,\dots,p\\i<j}} \in [0,1]^{p(p-1)/2} \]

The bivariate moments in particular will be especially relevant when considering pairwise likelihood estimation of the parametric model. Denote by \({\boldsymbol\pi}_2 = \begin{pmatrix}\dot{{\boldsymbol\pi}}_1\\ \dot{{\boldsymbol\pi}}_2 \end{pmatrix}\) the \(S:=p(p+1)/2\) vector of univariate and bivariate moments. Correspondingly, let \({\mathbf p}_2\) be the \(S\)-vector of univariate and bivariate sample moments. It is possible to obtain \({\boldsymbol\pi}_2\) (as well as \({\mathbf p}_2\)) via a transformation of the full vector of probabilities \({\boldsymbol\pi}\) (cf. \({\mathbf p}\)): \[\begin{equation} {\boldsymbol\pi}_2 = {\mathbf T}_2 {\boldsymbol\pi}. \end{equation}\] Here, \({\mathbf T}_2\) is an indicator matrix of dimension \(S \times R\). See the articles for further details.
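A sketch of one way to build the indicator matrix \({\mathbf T}_2\). The row ordering of the patterns (and hence the assumed ordering of \({\boldsymbol\pi}\)) follows expand.grid(), which is an arbitrary choice made for illustration only.

```r
## Sketch: the S x R indicator matrix T_2 mapping the full probability vector
## (ordered over all R = 2^p response patterns) to univariate and bivariate margins.
make_T2 <- function(p) {
  patterns <- as.matrix(expand.grid(rep(list(0:1), p)))  # R x p, one row per pattern
  T_uni <- t(patterns)                                   # p x R: row i picks Pr(Y_i = 1)
  pairs <- combn(p, 2)
  T_biv <- apply(pairs, 2, function(ij) patterns[, ij[1]] * patterns[, ij[2]])
  rbind(T_uni, t(T_biv))                                 # S x R, with S = p(p + 1)/2
}

dim(make_T2(3))   # 6 x 8
```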

Consider now the lower-order residuals of up to order 2 under a parametric model, defined by \[\begin{equation} \hat{\mathbf e}_2 = {\mathbf p}_2 - {\boldsymbol\pi}_2(\hat{\boldsymbol\theta}) = {\mathbf T}_2 \big({\mathbf p}- {\boldsymbol\pi}(\hat{\boldsymbol\theta}) \big) = {\mathbf T}_2\hat{\mathbf e}. \end{equation}\] It should be straightforward to see, using the previous results, that as \(n\to\infty\) we get \[\begin{equation} \sqrt n \hat{\mathbf e}_2 = \sqrt n \big({\mathbf p}_2 - {\boldsymbol\pi}_2(\hat{\boldsymbol\theta})\big) \xrightarrow{\text D} \mathop{\mathrm{N}}({\mathbf 0}, {\boldsymbol\Omega}_2), \tag{5} \end{equation}\] where \[\begin{align} {\boldsymbol\Omega}_2 = {\mathbf T}_2{\boldsymbol\Omega}{\mathbf T}_2^\top &= {\mathbf T}_2 ({\mathbf I}- {\boldsymbol\Delta}{\mathbf B}){\boldsymbol\Sigma}({\mathbf I}- {\boldsymbol\Delta}{\mathbf B})^\top {\mathbf T}_2^\top \\ &= {\boldsymbol\Sigma}_2 - {\mathbf T}_2{\boldsymbol\Sigma}{\mathbf B}^\top{\boldsymbol\Delta}_2^\top - {\boldsymbol\Delta}_2 {\mathbf B}{\boldsymbol\Sigma}{\mathbf T}_2^\top + {\boldsymbol\Delta}_2 {\mathbf B}{\boldsymbol\Sigma}{\mathbf B}^\top {\boldsymbol\Delta}_2^\top, \tag{6} \end{align}\] with \({\boldsymbol\Sigma}_2 = {\mathbf T}_2{\boldsymbol\Sigma}{\mathbf T}_2^\top\) and \({\boldsymbol\Delta}_2={\mathbf T}_2{\boldsymbol\Delta}\) the transformed versions of the respective matrices. For the specific case of maximum likelihood estimation, this simplifies to \({\boldsymbol\Omega}_2 = {\boldsymbol\Sigma}_2 - {\boldsymbol\Delta}_2 {\mathcal I}^{-1} {\boldsymbol\Delta}_2^\top\). For pairwise likelihood estimation, we show that \[ {\boldsymbol\Omega}_2 = ({\mathbf I}- {\boldsymbol\Delta}_2{\mathbf B}_2){\boldsymbol\Sigma}_2({\mathbf I}- {\boldsymbol\Delta}_2{\mathbf B}_2)^\top, \tag{7} \] where the transformation matrix in full is given by \({\mathbf B}_2 = {\mathcal H}^{-1}\tilde{\boldsymbol\Delta}^\top\tilde{\mathbf D}^{-1}{\mathbf G}{\mathbf T}_2\). For reference, the \({\mathbf B}({\boldsymbol\theta})\) matrix in our manuscript is actually \({\mathbf B}({\boldsymbol\theta}) = \tilde{\boldsymbol\Delta}^\top\tilde{\mathbf D}^{-1} B\), where the indicator matrix \(B\) is \(B = {\mathbf G}{\mathbf T}_2\) in the notation of the present article.
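A direct transcription of (7) as a sketch, assuming \({\boldsymbol\Delta}_2\), \({\mathbf B}_2\), and \({\boldsymbol\Sigma}_2\) (or their estimates) are already available as matrices.

```r
## Sketch of (7): Omega_2 from pairwise-likelihood by-products.
Omega2_PL <- function(Delta2, B2, Sigma2) {
  M <- diag(nrow(Sigma2)) - Delta2 %*% B2   # I - Delta_2 B_2  (S x S)
  M %*% Sigma2 %*% t(M)
}
```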

As a remark, while elegant in theory, forming the matrix \({\boldsymbol\Omega}_2\) via the linear transformation \({\mathbf T}_2\) is impractical when \(p\) is large, for two reasons. One, it involves multiplying an \(S \times R\) matrix by an \(R \times R\) matrix, a computation of order \(O(SR^2) = O(p(p+1)2^{2p-1})\). Two, storing \(R\times R\) matrices (e.g. \({\boldsymbol\Sigma}\)) may be prohibitively demanding, and their entries can be very small, so numerical precision is also a concern.

The advantage of our equation in (7) is that \({\boldsymbol\Omega}_2\) can be formed using the byproduct of the pairwise estimation procedure, i.e. using the matrices \({\mathcal H}\) and \(\tilde{\boldsymbol\Delta}\), as well as the (estimated) pairwise probabilities \(\pi_{y_iy_j}^{(ij)}({\boldsymbol\theta})\). What remains is the proper estimation of \({\boldsymbol\Sigma}_2\), and this depends on the sampling design performed. The complex sampling procedure article touches upon this briefly, with further details available in the manuscript.

Summary

In general, all (limited information) goodness-of-fit test statistics take the form \[ X^2 = n \hat{\mathbf e}_2^\top \hat{\boldsymbol\Xi}\hat{\mathbf e}_2, \tag{8} \] where \(\hat{\boldsymbol\Xi}\xrightarrow{\text P} {\boldsymbol\Xi}\) is some \(S\times S\) weight matrix. We summarise the various weight matrices that we use in this R package:

|   | Name        | Weight (\({\boldsymbol\Xi}\))                             | D.f.    | R function        | Remarks                                         |
|---|-------------|-----------------------------------------------------------|---------|-------------------|-------------------------------------------------|
| 1 | Wald        | \({\boldsymbol\Omega}_2^+\)                               | \(S-m\) | Wald_test()       | \({\boldsymbol\Omega}_2\) may be rank deficient |
| 2 | Wald V2     | \(\mathop{\mathrm{diag}}({\boldsymbol\Omega}_2)^{-1}\)    | est.    | Wald_diag_test()  | \({\boldsymbol\Omega}_2\) need not be inverted  |
| 3 | Wald V3     | \({\boldsymbol\Xi}{\boldsymbol\Omega}_2{\boldsymbol\Xi}\) | \(S-m\) | Wald_vcovf_test() | \({\boldsymbol\Omega}_2\) need not be estimated |
| 4 | Pearson     | \(\mathop{\mathrm{diag}}({\boldsymbol\pi}_2)^{-1}\)       | est.    | Pearson_RS_test() | Rao-Scott adjustment                            |
| 5 | Pearson V2  | \(\mathop{\mathrm{diag}}({\boldsymbol\pi}_2)^{-1}\)       | est.    | Pearson_test()    | Moment matching                                 |
| 6 | RSS         | \({\mathbf I}\)                                           | est.    | RSS_test()        |                                                 |
| 7 | Multinomial | \({\boldsymbol\Sigma}_2^{-1}\)                            | est.    | Multn_test()      |                                                 |
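All tests in the table share the quadratic form (8); a generic sketch of that computation (not the package's internal implementation) is simply:

```r
## Sketch of (8): X^2 = n * e2' Xi e2 for a given weight matrix Xi.
X2_stat <- function(n, e2, Xi) as.numeric(n * crossprod(e2, Xi %*% e2))
```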

It should be noted that all of the above tests, with the exception of the Wald V3 test, require estimation of the \({\boldsymbol\Omega}_2\) matrix. For this purpose, we use the sample versions of the relevant quantities, mostly evaluating \(\pi_{y_iy_j}^{(ij)}({\boldsymbol\theta})\) at the estimates \({\boldsymbol\theta}=\hat{\boldsymbol\theta}_{\text{PL}}\), i.e.

  • \({\mathbf H}= - \nabla^2\operatorname{\ell_P}({\boldsymbol\theta}) \Big|_{{\boldsymbol\theta}=\hat{\boldsymbol\theta}_{\text{PL}}}\) as an estimate of \({\mathcal H}\);
  • \(\hat{\tilde{\boldsymbol\Delta}}_{sk} = \frac{\partial\pi_{y_iy_j}^{(ij)}({\boldsymbol\theta})}{\partial\theta_k} \Big|_{{\boldsymbol\theta}=\hat{\boldsymbol\theta}_{\text{PL}}}\) as an estimate of \(\tilde{\boldsymbol\Delta}\);
  • \(\hat{\tilde{\mathbf D}} = \mathop{\mathrm{diag}}\big(\pi_{y_iy_j}^{(ij)}(\hat{\boldsymbol\theta}_{\text{PL}})\big)\) as an estimate of \(\tilde{\mathbf D}\); and
  • an appropriate estimator of \({\boldsymbol\Sigma}_2\).

All of these test statistics are referred to a chi-square distribution. However, for some of the tests, the degrees of freedom are not known a priori and have to be estimated. There are two methods that we consider: 1) a Rao-Scott type adjustment (Rao and Scott 1979; Rao and Scott 1981; Rao and Scott 1984); and 2) a moment matching procedure (Bartholomew and Leung 2002). While in theory each test may be subjected to either a Rao-Scott type adjustment or a moment matching procedure, we only consider the Rao-Scott adjustment for the Pearson test.

Test statistics

In this section, we elaborate further on the specific test statistics used.

Wald test

One strategy is to choose \({\boldsymbol\Xi}\) so that the resulting quadratic form \(X^2\) in (8) is asymptotically chi-square. This strategy is followed by Reiser (1996), Maydeu-Olivares and Joe (2005), and Maydeu-Olivares and Joe (2006). Choosing

  • \(\hat{\boldsymbol\Xi}= \hat\Omega_2^-\) (the generalised inverse); or
  • \(\hat{\boldsymbol\Xi}= \hat\Omega_2^+\) (the Moore-Penrose inverse)

ensures that \(X^2 \xrightarrow{\text D} \chi^2_k\) where \(k\) equals the rank of \({\boldsymbol\Omega}_2\). The pseudoinverse needs to be employed because \({\boldsymbol\Omega}_2\) may be of deficient rank. When \({\boldsymbol\Omega}_2\) is full rank (rank = \(S\)), then \({\boldsymbol\Xi}\) is equal to the usual matrix inverse. For our implementation, and similar to Reiser (1996), we will be using the Moore-Penrose inverse. Finally, the degrees of freedom will be (at most) \(S-m\), i.e. \(S\) minus the number of free model parameters that need to be estimated.
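A minimal sketch of this Wald test, assuming the residual vector and an estimate of \({\boldsymbol\Omega}_2\) are available. MASS::ginv() provides the Moore-Penrose inverse, and the rank (hence the degrees of freedom) is judged from the eigenvalues using a hypothetical tolerance tol.

```r
## Sketch: Wald test with the Moore-Penrose inverse of Omega_2-hat as the weight.
wald_test_sketch <- function(n, e2, Omega2_hat, tol = 1e-8) {
  Xi <- MASS::ginv(Omega2_hat)
  X2 <- as.numeric(n * crossprod(e2, Xi %*% e2))
  df <- sum(eigen(Omega2_hat, symmetric = TRUE, only.values = TRUE)$values > tol)
  c(X2 = X2, df = df, pval = pchisq(X2, df = df, lower.tail = FALSE))
}
```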

Diagonal Wald test

An alternative Wald test is studied where \(\hat{\boldsymbol\Xi}= \mathop{\mathrm{diag}}(\hat{\boldsymbol\Omega}_2)^{-1}\) is used instead of the pseudoinverse of \({\boldsymbol\Omega}_2\). We refer to this as the Diagonal Wald test. The motivation here is stability and ease of computation, because it is straightforward to invert a diagonal matrix, rather than a full matrix whose size grows rapidly with the number of items \(p\). However, this statistic does not follow the same chi-square distribution as the original Wald test, and its distribution needs to be determined (Bartholomew and Leung 2002). We use the moment-matching procedure for this, which we briefly explain below.

Suppose that \(Y\sim \chi^2_c\). Then we assume that the test statistic \(X^2 = n\hat{\mathbf e}_2^\top \hat{\boldsymbol\Xi}\hat{\mathbf e}_2\) can be approximated by a linear transformation of this chi-square random variate, i.e. \[ X^2 \approx a + bY. \] Let \(\mu_k(X)=\mathop{\mathrm{E}}[(X - \mathop{\mathrm{E}}X)^k]\) represent the \(k\)th central moment of a random variable \(X\). The first three asymptotic central moments of \(X^2\) are (Mathai and Provost 1992, Theorem 3.2b.2, p. 53) \[\begin{equation} \mu_1(X^2) = \operatorname{tr}(\hat{\boldsymbol\Xi}{\boldsymbol\Omega}_2), \hspace{1em} \mu_2(X^2) = 2\operatorname{tr}\big((\hat{\boldsymbol\Xi}{\boldsymbol\Omega}_2)^2\big), \hspace{1em} \mu_3(X^2) = 8\operatorname{tr}\big((\hat{\boldsymbol\Xi}{\boldsymbol\Omega}_2)^3\big), \tag{9} \end{equation}\] while the first three central moments of the approximating random variable \(a + bY\) are \[\begin{equation} \mu_1 = a + bc, \hspace{2em} \mu_2 = 2b^2c, \hspace{2em} \mu_3 = 8b^3c. \tag{10} \end{equation}\] Useful to note here are the relationships \(\mu_1=\mathop{\mathrm{E}}(Y)\), \(\mu_2=\mathop{\mathrm{Var}}(Y)\), and \(\mu_3=\mathop{\mathrm{E}}(Y^3)-3\mu_1\mathop{\mathrm{Var}}(Y)-\mu_1^3\) for any random variable \(Y\), while the raw moments of a chi-square random variable with \(c\) degrees of freedom are given by the formula \(\mathop{\mathrm{E}}(Y^k)=c(c+2)(c+4)\cdots(c+2k-2)\). Equating the two sets of moments in (9) and (10) yields the three-moment adjustment parameters \[\begin{equation} b = \frac{\mu_3(X^2)}{4\mu_2(X^2)}, \hspace{2em} c = \frac{\mu_2(X^2)}{2b^2}, \hspace{2em} a = \mu_1(X^2) - bc. \end{equation}\] Tail area probabilities for \(p\)-value calculations can then be computed as \[\begin{equation} \Pr(X^2 > x) \approx \Pr\left(Y > \frac{x-a}{b} \ \Big| \ Y \sim \chi^2_c\right). \end{equation}\]
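A sketch of the three-moment adjustment, taking the computed statistic together with \(\hat{\boldsymbol\Xi}\) and an estimate of \({\boldsymbol\Omega}_2\) as inputs (function and argument names are illustrative).

```r
## Sketch: three-moment adjustment using the asymptotic moments in (9).
three_moment_pvalue <- function(X2, Xi_hat, Omega2_hat) {
  M   <- Xi_hat %*% Omega2_hat
  mu1 <- sum(diag(M))                    # tr(Xi Omega_2)
  mu2 <- 2 * sum(diag(M %*% M))          # 2 tr((Xi Omega_2)^2)
  mu3 <- 8 * sum(diag(M %*% M %*% M))    # 8 tr((Xi Omega_2)^3)
  b  <- mu3 / (4 * mu2)
  c_ <- mu2 / (2 * b^2)
  a  <- mu1 - b * c_
  pchisq((X2 - a) / b, df = c_, lower.tail = FALSE)   # Pr(Y > (x - a) / b)
}
```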

We can also use two-moment and one-moment matching techniques similarly. For the two-moment adjustment, assume that \(X^2 \approx bY\) where \(Y\sim\chi^2_c\). Then matching the first two asymptotic moments of \(X^2\) with those of the approximating random variable gives \[\begin{equation} b = \frac{\mu_2(X^2)}{2\mu_1(X^2)} \hspace{1em}\text{and}\hspace{1em} c = \frac{\mu_1(X^2)}{b}. \end{equation}\] As for the one-moment adjustment, \(X^2\) is again approximated by \(bY\) with \(Y\sim\chi^2_c\), but \(c\) is fixed to the number of degrees of freedom available for testing (taken to be \(c=S-m\)). Solving for \(b\) yields \[\begin{equation} b = \frac{\mu_1(X^2)}{c}. \end{equation}\] In both cases, tail probabilities are calculated as \[\begin{equation} \Pr(X^2 > x) \approx \Pr\left(Y > \frac{x}{b} \ \Big| \ Y \sim \chi^2_c\right). \end{equation}\] Maydeu-Olivares and Joe (2008) advise that one-moment matching results in inaccurate \(p\)-values.
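And, under the same assumptions as the sketch above, the two- and one-moment variants (for the one-moment adjustment, df would be \(S-m\)).

```r
## Sketch: two-moment (X^2 ~ bY) and one-moment (fixed df) adjustments.
two_moment_pvalue <- function(X2, Xi_hat, Omega2_hat) {
  M   <- Xi_hat %*% Omega2_hat
  mu1 <- sum(diag(M))
  mu2 <- 2 * sum(diag(M %*% M))
  b   <- mu2 / (2 * mu1)
  pchisq(X2 / b, df = mu1 / b, lower.tail = FALSE)
}

one_moment_pvalue <- function(X2, Xi_hat, Omega2_hat, df) {
  mu1 <- sum(diag(Xi_hat %*% Omega2_hat))
  pchisq(X2 / (mu1 / df), df = df, lower.tail = FALSE)
}
```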

Variance-covariance free Wald test

The drawbacks of using \(\hat{\boldsymbol\Omega}_2^+\) as the weight matrix in the quadratic form are that 1) \({\boldsymbol\Omega}_2\) needs to be estimated; and 2) the degrees of freedom \(k\) cannot be determined a priori; if judged from the eigenvalues of \(\hat{\boldsymbol\Omega}_2\), it can be difficult to tell which of the small eigenvalues are actually zero.

To overcome this, Maydeu-Olivares and Joe (2005) and Maydeu-Olivares and Joe (2006) suggested using a weight matrix \({\boldsymbol\Xi}\) such that \({\boldsymbol\Omega}_2\) is a g-inverse of \({\boldsymbol\Xi}\), i.e. \({\boldsymbol\Xi}= {\boldsymbol\Xi}{\boldsymbol\Omega}_2{\boldsymbol\Xi}\). Let \({\boldsymbol\Delta}_2^\perp\) be an \(S \times (S-m)\) orthogonal complement to \({\boldsymbol\Delta}_2\), i.e. it satisfies \(({\boldsymbol\Delta}_2^\perp)^\top{\boldsymbol\Delta}_2 = {\mathbf 0}\). Then from (5), we see that \[ \sqrt n ({\boldsymbol\Delta}_2^\perp)^\top \hat {\mathbf e}_2 \xrightarrow{\text D} \mathop{\mathrm{N}}({\mathbf 0}, ({\boldsymbol\Delta}_2^\perp)^\top {\boldsymbol\Omega}_2 {\boldsymbol\Delta}_2^\perp). \tag{11} \] Because of (6), the asymptotic covariance matrix may be written \[ ({\boldsymbol\Delta}_2^\perp)^\top {\boldsymbol\Omega}_2 {\boldsymbol\Delta}_2^\perp = ({\boldsymbol\Delta}_2^\perp)^\top {\boldsymbol\Sigma}_2 {\boldsymbol\Delta}_2^\perp, \] since all products of \({\boldsymbol\Delta}_2\) with its orthogonal complement vanish. By letting \[ {\boldsymbol\Xi}= {\boldsymbol\Delta}_2^\perp \big( ({\boldsymbol\Delta}_2^\perp)^\top {\boldsymbol\Sigma}_2 {\boldsymbol\Delta}_2^\perp \big)^{-1} ({\boldsymbol\Delta}_2^\perp)^\top, \] we can then verify \({\boldsymbol\Xi}= {\boldsymbol\Xi}{\boldsymbol\Omega}_2{\boldsymbol\Xi}\); that is, \({\boldsymbol\Omega}_2\) is a generalised inverse of \({\boldsymbol\Xi}\). Let \(\hat{\boldsymbol\Xi}\) be an appropriate estimate of \({\boldsymbol\Xi}\), e.g. by replacing the matrices above with their corresponding hat versions. The implication here is that \[ X^2 = n \hat{\mathbf e}_2^\top \hat{\boldsymbol\Xi}\hat{\mathbf e}_2 = n \hat{\mathbf e}_2^\top \hat{\boldsymbol\Delta}_2^\perp \big( (\hat{\boldsymbol\Delta}_2^\perp)^\top \hat{\boldsymbol\Sigma}_2 \hat{\boldsymbol\Delta}_2^\perp \big)^{-1} (\hat{\boldsymbol\Delta}_2^\perp)^\top \hat{\mathbf e}_2 \] converges in distribution to a \(\chi^2_{S-m}\) variate as \(n\to\infty\), due to (11) and Slutsky’s theorem. The degrees of freedom follow because \({\boldsymbol\Delta}_2^\perp\) has full column rank \(S-m\), and hence \({\boldsymbol\Xi}\) also has rank \(S-m\): since the column space of \({\boldsymbol\Delta}_2 \in \mathbb{R}^{S\times m}\) has dimension (column rank) \(m\), and the dimensions of a subspace and its orthogonal complement add up to the dimension of the whole space, the orthogonal complement in \(\mathbb{R}^S\) has dimension \(S-m\).
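A sketch of this statistic. Here the orthogonal complement \({\boldsymbol\Delta}_2^\perp\) is taken from the trailing columns of a full QR decomposition of \(\hat{\boldsymbol\Delta}_2\), which is one convenient choice among many; any orthonormal basis of the complement would do.

```r
## Sketch: variance-covariance free Wald test via an orthogonal complement.
wald_vcovf_sketch <- function(n, e2, Delta2, Sigma2_hat) {
  S <- nrow(Delta2); m <- ncol(Delta2)
  Dperp <- qr.Q(qr(Delta2), complete = TRUE)[, (m + 1):S, drop = FALSE]  # S x (S - m)
  u  <- crossprod(Dperp, e2)                      # (Delta_2-perp)' e2-hat
  V  <- crossprod(Dperp, Sigma2_hat %*% Dperp)    # (Delta_2-perp)' Sigma_2 Delta_2-perp
  X2 <- as.numeric(n * crossprod(u, solve(V, u)))
  c(X2 = X2, df = S - m, pval = pchisq(X2, df = S - m, lower.tail = FALSE))
}
```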

Pearson test

To construct a Pearson test statistic, let \(\hat{\boldsymbol\Xi}^{-1} = \hat{\mathbf D}_2 := \mathop{\mathrm{diag}}\big( {\boldsymbol\pi}_2(\hat{\boldsymbol\theta}) \big) \xrightarrow{\text P} {\mathbf D}_2:=\mathop{\mathrm{diag}}\big( {\boldsymbol\pi}_2({\boldsymbol\theta}) \big).\) Then we can see that \[\begin{equation}\label{eq:Pearsontest} X^2 = n\hat{\mathbf e}_2^\top\hat{\mathbf D}_2^{-1}\hat{\mathbf e}_2 = n \sum_{i=1}^p \frac{(\dot p_i - \dot\pi_i(\hat{\boldsymbol\theta}))^2}{\dot\pi_i(\hat{\boldsymbol\theta})} + n \sum_{i<j} \frac{(\dot p_{ij} - \dot\pi_{ij}(\hat{\boldsymbol\theta}))^2}{\dot\pi_{ij}(\hat{\boldsymbol\theta})}, \end{equation}\] which resembles the traditional Pearson chi-square test statistic. Note that \[\begin{equation} \sqrt n \hat{\mathbf D}_2^{-1/2}\hat{\mathbf e}_2 \xrightarrow{\text D} \mathop{\mathrm{N}}({\mathbf 0}, {\mathbf D}_2^{-1/2}{\boldsymbol\Omega}_2 {\mathbf D}_2^{-1/2}) \end{equation}\] as \(n\to\infty\), and therefore \(X^2\) has the limiting distribution of \(\sum_{s=1}^S \delta_s X_s\), where the \(\delta_s\) are the eigenvalues of \({\mathbf D}_2^{-1/2}{\boldsymbol\Omega}_2 {\mathbf D}_2^{-1/2}\) and \(X_s\,\overset{\text{iid}}{\sim}\,\chi^2_1\) (Mathai and Provost 1992, Representation 3.1a.1, p. 29). These eigenvalues can be estimated by the eigenvalues of \(\hat{\mathbf D}_2^{-1/2}\hat{\boldsymbol\Omega}_2 \hat {\mathbf D}_2^{-1/2}\). In principle, this method can be applied more generally by considering a Cholesky decomposition \({\boldsymbol\Xi}= {\mathbf L}{\mathbf L}^\top\) instead of matrix square roots. Of course, this only works if the matrix \({\boldsymbol\Xi}\) is of full rank, which rules out the Wald test statistic above.
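A sketch of the Pearson statistic and of the eigenvalues that characterise its limiting distribution; all inputs (the sample and fitted moment vectors and the estimate of \({\boldsymbol\Omega}_2\)) are assumed to be produced elsewhere.

```r
## Sketch: Pearson statistic and the weights of its limiting distribution.
pearson_stat <- function(n, p2, pi2_hat) n * sum((p2 - pi2_hat)^2 / pi2_hat)

pearson_eigenvalues <- function(pi2_hat, Omega2_hat) {
  Dm12 <- diag(1 / sqrt(pi2_hat))   # D_2^{-1/2}
  eigen(Dm12 %*% Omega2_hat %*% Dm12, symmetric = TRUE, only.values = TRUE)$values
}
```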

Let \(\bar\delta\) represent the arithmetic mean of the eigenvalues \(\delta_s\). The asymptotic mean and variance of \(X^2\) are \[\begin{align} \mathop{\mathrm{E}}X^2 &= \mathop{\mathrm{E}}\sum_{s=1}^S \delta_s X_s = \sum_{s=1}^S \delta_s = S\bar\delta = \operatorname{tr}({\mathbf D}_2^{-1/2}{\boldsymbol\Omega}_2 {\mathbf D}_2^{-1/2}) \\ \mathop{\mathrm{Var}}X^2 &= \mathop{\mathrm{Var}}\sum_{s=1}^S \delta_s X_s = 2\sum_{s=1}^S\delta_s^2 = 2\sum_{s=1}^S (\delta_s - \bar\delta)^2 + 2S\bar\delta^2, \end{align}\] so both the size and the dispersion of the eigenvalues affect the behaviour (and hence the power) of the test (Heo 2014). Rao and Scott (1979, 1981, 1984) suggested adjustments to the Pearson test as follows.

  1. The first-order Rao-Scott type test is obtained by dividing \(X^2\) by \(\bar \delta\). This adjusted statistic is referred to a \(\chi^2_S\) distribution.

  2. The second-order Rao-Scott type test is obtained by dividing \(X^2/\bar\delta\) by \(1+\hat a^2\), where \(\hat a=\sqrt{S^{-1}\sum_{s=1}^S (\delta_s - \bar\delta)^2}\big/\bar\delta\) is the coefficient of variation of the eigenvalues. Hence, \(1+\hat a^2= (S\bar\delta ^2)^{-1}\sum_{s=1}^S \delta_s^2\). This adjusted statistic is referred to a \(\chi^2_k\) distribution, where \(k=S(1+\hat a^2)^{-1}\). A small sketch of both adjustments follows this list.
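A sketch of both Rao-Scott adjustments, given the Pearson statistic and the estimated eigenvalues (e.g. from pearson_eigenvalues() above); the function name is illustrative.

```r
## Sketch: first- and second-order Rao-Scott adjusted p-values.
rao_scott_pvalues <- function(X2, delta) {
  S    <- length(delta)
  dbar <- mean(delta)
  a2   <- mean((delta - dbar)^2) / dbar^2   # squared coefficient of variation
  c(first_order  = pchisq(X2 / dbar, df = S, lower.tail = FALSE),
    second_order = pchisq(X2 / (dbar * (1 + a2)), df = S / (1 + a2),
                          lower.tail = FALSE))
}
```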

References

Agresti, Alan. 2012. Categorical Data Analysis. Vol. 792. John Wiley & Sons.
Bartholomew, David J, and Shing On Leung. 2002. “A Goodness of Fit Test for Sparse \(2^p\) Contingency Tables.” British Journal of Mathematical and Statistical Psychology 55 (1): 1–15.
Fuller, Wayne A. 2009. Sampling Statistics. John Wiley & Sons.
Heo, Sunyeong. 2014. “Empirical Analysis on Rao-Scott First Order Adjustment for Two Population Homogeneity Test Based on Stratified Three-Stage Cluster Sampling with PPS.” Journal of the Chosun Natural Science 7 (3): 208–13.
Katsikatsou, Myrsini, Irini Moustaki, Fan Yang-Wallentin, and Karl G Jöreskog. 2012. “Pairwise Likelihood Estimation for Factor Analysis Models with Ordinal Data.” Computational Statistics & Data Analysis 56 (12): 4243–58.
Mathai, Arakaparampil M, and Serge B Provost. 1992. Quadratic Forms in Random Variables: Theory and Applications. Dekker.
Maydeu-Olivares, Alberto, and Harry Joe. 2005. “Limited- and Full-Information Estimation and Goodness-of-Fit Testing in \(2^n\) Contingency Tables: A Unified Framework.” Journal of the American Statistical Association 100 (471): 1009–20.
———. 2006. “Limited Information Goodness-of-Fit Testing in Multidimensional Contingency Tables.” Psychometrika 71 (4): 713.
———. 2008. “An Overview of Limited Information Goodness-of-Fit Testing in Multidimensional Contingency Tables.” New Trends in Psychometrics, 253–62.
Rao, Jon N K, and Alastair J Scott. 1979. “Chi-Squared Tests for Analysis of Categorical Data from Complex Surveys.” In Proceedings of the American Statistical Association, Section on Survey Research Methods, 58–66.
Rao, Jon N K, and Alastair J Scott. 1981. “The Analysis of Categorical Data from Complex Sample Surveys: Chi-Squared Tests for Goodness of Fit and Independence in Two-Way Tables.” Journal of the American Statistical Association 76 (374): 221–30.
Rao, Jon N K, and Alastair J Scott. 1984. “On Chi-Squared Tests for Multiway Contingency Tables with Cell Proportions Estimated from Survey Data.” The Annals of Statistics, 46–60.
Reiser, Mark. 1996. “Analysis of Residuals for the Multinomial Item Response Model.” Psychometrika 61: 509–28.