To simulate a complex sampling design, our strategy is to generate data for an entire population inspired by a sampling design in education. The population consists of 2,000 schools (Primary Sampling Units, PSU) of three types: “A” (400 units), “B” (1000 units), and “C” (600 units). The school type correlates with the average abilities of its students (explained further below) and therefore gives a way of stratifying the population. Each school is assigned a random number of students from the normal distribution N(500,1252) (the number then rounded down to a whole number). Students are then assigned randomly into classes of average sizes 15, 25 and 20 respectively for each school type A, B and C. The total population size is roughly N=1 million students. The population statistics is summarised in the table below.
# Make population
pop <- make_population(model_no = 1, return_all = TRUE, seed = 123)
Type | N | N | Avg. | SD | Min. | Max. | Avg. | SD | Min. | Max. |
---|---|---|---|---|---|---|---|---|---|---|
A | 400 | 200,670 | 501.7 | 96.9 | 253 | 824 | 15.2 | 3.9 | 3 | 36 |
B | 1,000 | 501,326 | 501.3 | 100.2 | 219 | 839 | 25.7 | 5.0 | 6 | 48 |
C | 600 | 303,861 | 506.4 | 101.9 | 195 | 829 | 20.4 | 4.4 | 6 | 39 |
In the factor model, the latent variable {\boldsymbol\eta} represents the abilities of each individual student, and it is used as a stratifying variable in the following way. For each individual h=1,\dots,N, let {\boldsymbol\eta}_h represent the vector of latent abilities generated from N_q({\mathbf 0}, {\boldsymbol\Psi}) as described in a previous article. Define z_h=\sum_{j=1}^q \eta_{jh}, which is now a normal random variate centered at zero. If all the correlations in {\boldsymbol\Psi} are positive, then z_h has variance at least q. Students are then arranged in descending order of z_h, and based on this arrangement assigned to the school type A, B or C respectively. If the observed variables y are test items, this would suggest that school type A would consist of ‘high’ ability students, type B ‘medium’ ability student, and type C ‘low’ ability students. Note that the actualised item responses are further subjected to random errors \epsilon.
Probability-based sample
A probability based sampling procedure may be implemented in one of three ways described below.
Stratified sampling
From each school type (strata), select 1000 students (PSU) using SRS. Let N_a be the total number of students in stratum a\in\{1,2,3\}. Then the probability of selection of a student in stratum a is \Pr(\text{selection}) = \frac{1000}{N_a}. The total sample size is n= 3 \times 1000 = 3000.
Two-stage cluster sampling
Select 140 schools (PSU; clusters) using probabilities proportional to size (PPS). For each school, select one class by SRS, and all students in that class are added to the sample. The probability of selection of a student in PSU b=1,\dots,2000 is then \Pr(\text{selection}) = \Pr(\text{weighted school selection}) \times \frac{1}{\# \text{ classes in school }b}. The total sample size will vary from sample to sample, but on average will be n = 140 \times 21.5 = 3010, where \frac{15 \times 400 + 25 \times 1000 + 20 \times 600}{3} = 21.5 is the average class size per school.
Two-stage stratified cluster sampling
For each school type (strata), select 50 schools using SRS. Then, within each school, select 1 class by SRS, and all students in that class are added to the sample. The probability of selection of a student in PSU b from school type a is \Pr(\text{selection}) = \frac{50}{\# \text{ schools of type } a} \times \frac{1}{\# \text{ classes in school }b}. Here, the expected sample size is n=50 \times (15 + 25 + 20) = 3000.
Generating the population data
For the purposes of checking the data generation process, we will use Model 1 parameters to generate the response patterns. That is, we assume a 1 factor model with p=5 observed response items. The true model parameter values {\boldsymbol\theta} (loadings and thresholds) are as follows:
#> lambda1 lambda2 lambda3 lambda4 lambda5 tau1 tau2 tau3 tau4 tau5
#> 0.80 0.70 0.47 0.38 0.34 -1.43 -0.55 -0.13 -0.72 -1.13
Comparison of population and model-implied moments
# Univariate moments (model-implied)
round(pi1, 4)
#> y1 y2 y3 y4 y5
#> 0.9236 0.7088 0.5517 0.7642 0.8708
# Univariate moments (population statistics)
round(prop1, 4)
#> y1 y2 y3 y4 y5
#> 0.9240 0.7083 0.5515 0.7643 0.8709
# Errors (percentage)
round((prop1 - pi1) / pi1 * 100, 2)
#> y1 y2 y3 y4 y5
#> 0.04 -0.07 -0.03 0.01 0.01
Model fit
# Comparison of lambda values
round(coef(fit), 2)[1:5]
#> eta1=~y1 eta1=~y2 eta1=~y3 eta1=~y4 eta1=~y5
#> 0.80 0.70 0.47 0.38 0.34
true_vals[1:5]
#> lambda1 lambda2 lambda3 lambda4 lambda5
#> 0.80 0.70 0.47 0.38 0.34
# Comparison of threshold values
round(coef(fit), 2)[-(1:5)]
#> y1|t1 y2|t1 y3|t1 y4|t1 y5|t1
#> -1.43 -0.55 -0.13 -0.72 -1.13
true_vals[-(1:5)]
#> tau1 tau2 tau3 tau4 tau5
#> -1.43 -0.55 -0.13 -0.72 -1.13
Performing a complex sampling procedure
There are three functions that perform the three complex sampling procedures described earlier.
# Stratified sampling
(samp1 <- gen_data_bin_complex1(population = pop, seed = 123))
#> # A tibble: 3,000 × 9
#> type school class wt y1 y2 y3 y4 y5
#> <chr> <chr> <chr> <dbl> <ord> <ord> <ord> <ord> <ord>
#> 1 A A1 A1ab 0.599 1 1 1 1 1
#> 2 A A1 A1y 0.599 1 1 1 1 1
#> 3 A A10 A10a 0.599 1 1 1 1 1
#> 4 A A10 A10b 0.599 1 1 1 0 1
#> 5 A A10 A10i 0.599 1 1 1 1 1
#> 6 A A10 A10s 0.599 1 1 1 1 0
#> 7 A A101 A101g 0.599 1 1 0 1 1
#> 8 A A101 A101l 0.599 1 1 1 1 1
#> 9 A A101 A101q 0.599 1 1 0 1 1
#> 10 A A101 A101v 0.599 1 1 1 1 1
#> # ℹ 2,990 more rows
# Two-stage cluster sampling
(samp2 <- gen_data_bin_complex2(population = pop, seed = 123))
#> # A tibble: 3,155 × 9
#> type school class wt y1 y2 y3 y4 y5
#> <chr> <chr> <chr> <dbl> <ord> <ord> <ord> <ord> <ord>
#> 1 A A100 A100l 1.48 1 1 1 1 1
#> 2 A A100 A100l 1.48 1 1 0 1 1
#> 3 A A100 A100l 1.48 1 1 1 1 1
#> 4 A A100 A100l 1.48 1 1 1 0 1
#> 5 A A100 A100l 1.48 1 1 1 1 1
#> 6 A A100 A100l 1.48 1 1 1 1 1
#> 7 A A100 A100l 1.48 1 1 0 1 1
#> 8 A A100 A100l 1.48 1 1 1 1 0
#> 9 A A100 A100l 1.48 1 1 0 0 1
#> 10 A A100 A100l 1.48 1 1 1 1 1
#> # ℹ 3,145 more rows
# Two-stage stratified cluster sampling
(samp3 <- gen_data_bin_complex3(population = pop, seed = 123))
#> # A tibble: 3,039 × 9
#> type school class wt y1 y2 y3 y4 y5
#> <chr> <chr> <chr> <dbl> <ord> <ord> <ord> <ord> <ord>
#> 1 A A109 A109b 0.740 1 1 1 1 1
#> 2 A A109 A109b 0.740 1 1 1 1 1
#> 3 A A109 A109b 0.740 1 1 1 0 1
#> 4 A A109 A109b 0.740 1 1 1 1 1
#> 5 A A109 A109b 0.740 1 1 1 1 1
#> 6 A A109 A109b 0.740 1 1 1 0 1
#> 7 A A109 A109b 0.740 1 1 1 1 1
#> 8 A A109 A109b 0.740 1 1 1 1 1
#> 9 A A109 A109b 0.740 1 1 0 1 1
#> 10 A A109 A109b 0.740 1 1 1 1 1
#> # ℹ 3,029 more rows
Check bias in model fit
To check the bias in model fit, do the following:
Generate data sets using
gen_data_bin_complex()
for the stratified, cluster, and stratified cluster methods.Fit the factor model using lavaan and compare the estimates with the true parameter values
true_vals
.Repeat steps 1 & 2 a total of 1,000 times.
Estimating the multinomial covariance matrix
Let {\mathbf p} be the R \times 1 vector of sample proportions (of response patterns) corresponding to the vector of population proportions {\boldsymbol\pi}. Assuming iid observations, \begin{equation} \sqrt n ({\mathbf p}- {\boldsymbol\pi}) \xrightarrow{\text D}{\mathop{\mathrm{N}}}_{R}({\mathbf 0},{\boldsymbol\Sigma}), \tag{1} \end{equation} where {\boldsymbol\Sigma}= \mathop{\mathrm{diag}}({\boldsymbol\pi}) - {\boldsymbol\pi}{\boldsymbol\pi}^\top. Under a simple random sampling procedure, the {\boldsymbol\Sigma} may be estimated using \hat{\boldsymbol\Sigma}= \mathop{\mathrm{diag}}(\hat{\boldsymbol\pi}) - \hat{\boldsymbol\pi}\hat{\boldsymbol\pi}^\top, where \hat{\boldsymbol\pi} are the estimated value of the population proportions based on the null hypothesis being tested H_0: {\boldsymbol\pi}= {\boldsymbol\pi}({\boldsymbol\theta}).
Under a complex sampling procedure, the proportions appearing in (1) should be replaced by the weighted version with elements \begin{equation} {\mathbf p}_r = \frac{\sum_h w_h [{\mathbf y}^{(h)} = {\mathbf y}_r]}{\sum_h w_h}, \hspace{2em} r=1,\dots,R. \end{equation} For convenience, we may do a total normalisation of the weights so that \sum_h w_h=n, the sample size. Under suitable conditions, the central limit theorem in (1) still applies, although the covariance matrix {\boldsymbol\Sigma} need not now take a multinomial form.
We now discuss how to estimate {\boldsymbol\Sigma} from a stratified multistage sample scheme where the strata are labelled a=1,\dots,N_S and the PSUs are labelled b=1,\dots,n_a, where n_a is the number of PSUs selected in stratum a. Let {\mathcal S}_{ab} be the set of sample units contained within PSU b within stratum a. Define \begin{equation} {\mathbf u}_{ab} = \sum_{h \in {\mathcal S}_{ab}} w_h ({\mathbf y}^{(h)} - \hat{\boldsymbol\pi}). \end{equation} One may think of these as the ‘deviations from the mean’ of units in PSU b within stratum a. Further define \begin{equation} \bar{\mathbf u}_a = \frac{1}{n_a} \sum_{b=1}^{n_a} {\mathbf u}_{ab}, \end{equation} the average of these deviations in stratum a. Then, a standard estimator of {\boldsymbol\Sigma} is given by \begin{equation} \hat{\boldsymbol\Sigma}= \frac{1}{n} \sum_a \frac{n_a}{n_a-1} \sum_b ({\mathbf u}_{ab} - \bar{\mathbf u}_a)({\mathbf u}_{ab} - \bar{\mathbf u}_a)^\top. \end{equation}
For the three complex sampling procedures discussed earlier,
(Stratified sampling) There are three strata a=1,2,3 corresponding to the school types, and the PSUs are the individual students within schools in each stratum. An equal size is taken per stratum, so n_a=1000, and thus the sample units {\mathcal S}_{ab} lists out the students from various schools and classes within each stratum a (there is no sum over b).
(Two-stage cluster sampling) In this case, there is only 1 stratum, n_a=n and {\mathbf u}_a=0.
- (Two-stage stratified cluster sampling) Same as 1, but the sample units {\mathcal S}_{ab} lists out the students from school b (from the selected class) within each stratum a.