To simulate a complex sampling design, our strategy is to generate data for an entire population inspired by a sampling design in education. The population consists of 2,000 schools (Primary Sampling Units, PSU) of three types: “A” (400 units), “B” (1000 units), and “C” (600 units). The school type correlates with the average abilities of its students (explained further below) and therefore gives a way of stratifying the population. Each school is assigned a random number of students from the normal distribution $$\mathop{\mathrm{N}}(500, 125^2)$$ (the number then rounded down to a whole number). Students are then assigned randomly into classes of average sizes 15, 25 and 20 respectively for each school type A, B and C. The total population size is roughly $$N=1$$ million students. The population statistics is summarised in the table below.

# Make population
pop <- make_population(model_no = 1, return_all = TRUE, seed = 123)
Schools
Students
Class size
Type $$N$$ $$N$$ Avg. SD Min. Max. Avg. SD Min. Max.
A 400 200,670 501.7 96.9 253 824 15.2 3.9 3 36
B 1,000 501,326 501.3 100.2 219 839 25.7 5.0 6 48
C 600 303,861 506.4 101.9 195 829 20.4 4.4 6 39

In the factor model, the latent variable $${\boldsymbol\eta}$$ represents the abilities of each individual student, and it is used as a stratifying variable in the following way. For each individual $$h=1,\dots,N$$, let $${\boldsymbol\eta}_h$$ represent the vector of latent abilities generated from $$N_q({\mathbf 0}, {\boldsymbol\Psi})$$ as described in a previous article. Define $$z_h=\sum_{j=1}^q \eta_{jh}$$, which is now a normal random variate centered at zero. If all the correlations in $${\boldsymbol\Psi}$$ are positive, then $$z_h$$ has variance at least $$q$$. Students are then arranged in descending order of $$z_h$$, and based on this arrangement assigned to the school type A, B or C respectively. If the observed variables $$y$$ are test items, this would suggest that school type A would consist of ‘high’ ability students, type B ‘medium’ ability student, and type C ‘low’ ability students. Note that the actualised item responses are further subjected to random errors $$\epsilon$$.

## Probability-based sample

A probability based sampling procedure may be implemented in one of three ways described below.

### Stratified sampling

From each school type (strata), select 1000 students (PSU) using SRS. Let $$N_a$$ be the total number of students in stratum $$a\in\{1,2,3\}$$. Then the probability of selection of a student in stratum $$a$$ is $$\Pr(\text{selection}) = \frac{1000}{N_a}$$. The total sample size is $$n= 3 \times 1000 = 3000$$.

### Two-stage cluster sampling

Select 140 schools (PSU; clusters) using probabilities proportional to size (PPS). For each school, select one class by SRS, and all students in that class are added to the sample. The probability of selection of a student in PSU $$b=1,\dots,2000$$ is then $\Pr(\text{selection}) = \Pr(\text{weighted school selection}) \times \frac{1}{\# \text{ classes in school }b}.$ The total sample size will vary from sample to sample, but on average will be $$n = 140 \times 21.5 = 3010$$, where $$\frac{15 \times 400 + 25 \times 1000 + 20 \times 600}{3} = 21.5$$ is the average class size per school.

### Two-stage stratified cluster sampling

For each school type (strata), select 50 schools using SRS. Then, within each school, select 1 class by SRS, and all students in that class are added to the sample. The probability of selection of a student in PSU $$b$$ from school type $$a$$ is $\Pr(\text{selection}) = \frac{50}{\# \text{ schools of type } a} \times \frac{1}{\# \text{ classes in school }b}.$ Here, the expected sample size is $$n=50 \times (15 + 25 + 20) = 3000$$.

## Generating the population data

For the purposes of checking the data generation process, we will use Model 1 parameters to generate the response patterns. That is, we assume a 1 factor model with $$p=5$$ observed response items. The true model parameter values $${\boldsymbol\theta}$$ (loadings and thresholds) are as follows:

#> lambda1 lambda2 lambda3 lambda4 lambda5    tau1    tau2    tau3    tau4    tau5
#>    0.80    0.70    0.47    0.38    0.34   -1.43   -0.55   -0.13   -0.72   -1.13

### Comparison of population and model-implied moments

# Univariate moments (model-implied)
round(pi1, 4)
#>     y1     y2     y3     y4     y5
#> 0.9236 0.7088 0.5517 0.7642 0.8708

# Univariate moments (population statistics)
round(prop1, 4)
#>     y1     y2     y3     y4     y5
#> 0.9240 0.7083 0.5515 0.7643 0.8709

# Errors (percentage)
round((prop1 - pi1) / pi1 * 100, 2)
#>    y1    y2    y3    y4    y5
#>  0.04 -0.07 -0.03  0.01  0.01

### Model fit

# Comparison of lambda values
round(coef(fit), 2)[1:5]
#> eta1=~y1 eta1=~y2 eta1=~y3 eta1=~y4 eta1=~y5
#>     0.80     0.70     0.47     0.38     0.34
true_vals[1:5]
#> lambda1 lambda2 lambda3 lambda4 lambda5
#>    0.80    0.70    0.47    0.38    0.34

# Comparison of threshold values
round(coef(fit), 2)[-(1:5)]
#> y1|t1 y2|t1 y3|t1 y4|t1 y5|t1
#> -1.43 -0.55 -0.13 -0.72 -1.13
true_vals[-(1:5)]
#>  tau1  tau2  tau3  tau4  tau5
#> -1.43 -0.55 -0.13 -0.72 -1.13

## Performing a complex sampling procedure

There are three functions that perform the three complex sampling procedures described earlier.

# Stratified sampling
(samp1 <- gen_data_bin_complex1(population = pop, seed = 123))
#> # A tibble: 3,000 × 9
#>    type  school class    wt y1    y2    y3    y4    y5
#>    <chr> <chr>  <chr> <dbl> <ord> <ord> <ord> <ord> <ord>
#>  1 A     A1     A1ab  0.599 1     1     1     1     1
#>  2 A     A1     A1y   0.599 1     1     1     1     1
#>  3 A     A10    A10a  0.599 1     1     1     1     1
#>  4 A     A10    A10b  0.599 1     1     1     0     1
#>  5 A     A10    A10i  0.599 1     1     1     1     1
#>  6 A     A10    A10s  0.599 1     1     1     1     0
#>  7 A     A101   A101g 0.599 1     1     0     1     1
#>  8 A     A101   A101l 0.599 1     1     1     1     1
#>  9 A     A101   A101q 0.599 1     1     0     1     1
#> 10 A     A101   A101v 0.599 1     1     1     1     1
#> # ℹ 2,990 more rows

# Two-stage cluster sampling
(samp2 <- gen_data_bin_complex2(population = pop, seed = 123))
#> # A tibble: 3,155 × 9
#>    type  school class    wt y1    y2    y3    y4    y5
#>    <chr> <chr>  <chr> <dbl> <ord> <ord> <ord> <ord> <ord>
#>  1 A     A100   A100l  1.48 1     1     1     1     1
#>  2 A     A100   A100l  1.48 1     1     0     1     1
#>  3 A     A100   A100l  1.48 1     1     1     1     1
#>  4 A     A100   A100l  1.48 1     1     1     0     1
#>  5 A     A100   A100l  1.48 1     1     1     1     1
#>  6 A     A100   A100l  1.48 1     1     1     1     1
#>  7 A     A100   A100l  1.48 1     1     0     1     1
#>  8 A     A100   A100l  1.48 1     1     1     1     0
#>  9 A     A100   A100l  1.48 1     1     0     0     1
#> 10 A     A100   A100l  1.48 1     1     1     1     1
#> # ℹ 3,145 more rows

# Two-stage stratified cluster sampling
(samp3 <- gen_data_bin_complex3(population = pop, seed = 123))
#> # A tibble: 3,039 × 9
#>    type  school class    wt y1    y2    y3    y4    y5
#>    <chr> <chr>  <chr> <dbl> <ord> <ord> <ord> <ord> <ord>
#>  1 A     A109   A109b 0.740 1     1     1     1     1
#>  2 A     A109   A109b 0.740 1     1     1     1     1
#>  3 A     A109   A109b 0.740 1     1     1     0     1
#>  4 A     A109   A109b 0.740 1     1     1     1     1
#>  5 A     A109   A109b 0.740 1     1     1     1     1
#>  6 A     A109   A109b 0.740 1     1     1     0     1
#>  7 A     A109   A109b 0.740 1     1     1     1     1
#>  8 A     A109   A109b 0.740 1     1     1     1     1
#>  9 A     A109   A109b 0.740 1     1     0     1     1
#> 10 A     A109   A109b 0.740 1     1     1     1     1
#> # ℹ 3,029 more rows

### Check bias in model fit

To check the bias in model fit, do the following:

1. Generate data sets using gen_data_bin_complex() for the stratified, cluster, and stratified cluster methods.

2. Fit the factor model using lavaan and compare the estimates with the true parameter values true_vals.

3. Repeat steps 1 & 2 a total of 1,000 times.

## Estimating the multinomial covariance matrix

Let $${\mathbf p}$$ be the $$R \times 1$$ vector of sample proportions (of response patterns) corresponding to the vector of population proportions $${\boldsymbol\pi}$$. Assuming iid observations, $$$\sqrt n ({\mathbf p}- {\boldsymbol\pi}) \xrightarrow{\text D}{\mathop{\mathrm{N}}}_{R}({\mathbf 0},{\boldsymbol\Sigma}), \tag{1}$$$ where $${\boldsymbol\Sigma}= \mathop{\mathrm{diag}}({\boldsymbol\pi}) - {\boldsymbol\pi}{\boldsymbol\pi}^\top$$. Under a simple random sampling procedure, the $${\boldsymbol\Sigma}$$ may be estimated using $$\hat{\boldsymbol\Sigma}= \mathop{\mathrm{diag}}(\hat{\boldsymbol\pi}) - \hat{\boldsymbol\pi}\hat{\boldsymbol\pi}^\top$$, where $$\hat{\boldsymbol\pi}$$ are the estimated value of the population proportions based on the null hypothesis being tested $$H_0: {\boldsymbol\pi}= {\boldsymbol\pi}({\boldsymbol\theta})$$.

Under a complex sampling procedure, the proportions appearing in (1) should be replaced by the weighted version with elements $$${\mathbf p}_r = \frac{\sum_h w_h [{\mathbf y}^{(h)} = {\mathbf y}_r]}{\sum_h w_h}, \hspace{2em} r=1,\dots,R.$$$ For convenience, we may do a total normalisation of the weights so that $$\sum_h w_h=n$$, the sample size. Under suitable conditions, the central limit theorem in (1) still applies, although the covariance matrix $${\boldsymbol\Sigma}$$ need not now take a multinomial form.

We now discuss how to estimate $${\boldsymbol\Sigma}$$ from a stratified multistage sample scheme where the strata are labelled $$a=1,\dots,N_S$$ and the PSUs are labelled $$b=1,\dots,n_a$$, where $$n_a$$ is the number of PSUs selected in stratum $$a$$. Let $${\mathcal S}_{ab}$$ be the set of sample units contained within PSU $$b$$ within stratum $$a$$. Define $$${\mathbf u}_{ab} = \sum_{h \in {\mathcal S}_{ab}} w_h ({\mathbf y}^{(h)} - \hat{\boldsymbol\pi}).$$$ One may think of these as the ‘deviations from the mean’ of units in PSU $$b$$ within stratum $$a$$. Further define $$$\bar{\mathbf u}_a = \frac{1}{n_a} \sum_{b=1}^{n_a} {\mathbf u}_{ab},$$$ the average of these deviations in stratum $$a$$. Then, a standard estimator of $${\boldsymbol\Sigma}$$ is given by $$$\hat{\boldsymbol\Sigma}= \frac{1}{n} \sum_a \frac{n_a}{n_a-1} \sum_b ({\mathbf u}_{ab} - \bar{\mathbf u}_a)({\mathbf u}_{ab} - \bar{\mathbf u}_a)^\top.$$$

For the three complex sampling procedures discussed earlier,

1. (Stratified sampling) There are three strata $$a=1,2,3$$ corresponding to the school types, and the PSUs are the individual students within schools in each stratum. An equal size is taken per stratum, so $$n_a=1000$$, and thus the sample units $${\mathcal S}_{ab}$$ lists out the students from various schools and classes within each stratum $$a$$ (there is no sum over $$b$$).

2. (Two-stage cluster sampling) In this case, there is only 1 stratum, $$n_a=n$$ and $${\mathbf u}_a=0$$.

1. (Two-stage stratified cluster sampling) Same as 1, but the sample units $${\mathcal S}_{ab}$$ lists out the students from school $$b$$ (from the selected class) within each stratum $$a$$.

### Comparison of estimates

Note: The $${\boldsymbol\Sigma}$$ matrix appearing in the plots below have been standardised so the plots show correlation matrices instead. These are not the tetrachoric correlations, instead they are correlations assuming that the responses are numeric 0s and 1s.