Spatio-temporal modelling of property prices in Brunei Darussalam

LSE Social Statistics Seminar

Assistant Professor of Statistics

Universiti Brunei Darussalam

October 26, 2023

RPPI measures the rate at which the prices of private residential properties purchased by households are changing over time — BDCB

Ideally, any such index should capture the underlying *trend in property values*, rather than **changes in characteristics of the houses** being transacted.

Hedonic regression is commonly employed to measure the **constant-quality property price changes**, i.e. \[
y_i = f(x_i, t_i) + \epsilon_i.
\]

Some challenges faced:

- Data sparsity (infrequent transactions and missing covariates)
- Spatial and temporal autocorrelation
- Coverage and sample selection bias
- Data collection and processing

Source: github.com/Pecners

Location of Starbucks® around London.

Data source: Kaggle

Transport for London (TFL) Tube, Overground, Elizabeth line, DLR & Tram lines (including Crossrail alignment)

Data source: github/oobrien

London Middle Super Output Area (MSOA) population density (2011).

Data source: London Datastore

- Geospatial points vs point processes
- Point processes: Locations are random variables (e.g. earthquakes, crime, diseases, etc.)
- Geospatial points: Locations are fixed

- Raster/Grid data
- Pixelated (or gridded) data where each pixel is associated with a specific geographical location
- E.g. satellite imagery, digital elevation models, etc.

- Network data.
- E.g. road networks, social networks, etc.

*Simple features*refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects.- The
`{sf}`

package (Pebesma and Bivand 2023) provides support for handling spatial data in R.

```
Simple feature collection with 983 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -0.5102995 ymin: 51.28676 xmax: 0.3339955 ymax: 51.69187
Geodetic CRS: WGS 84
# A tibble: 983 × 4
MSOA11CD LAD11NM PopDen geometry
<chr> <chr> <dbl> <MULTIPOLYGON [°]>
1 E02000001 City of London 25.5 (((-0.0977199 51.52292, -0.097492 51.5…
2 E02000002 Barking and Dagenham 31.3 (((0.1475682 51.59681, 0.1481361 51.59…
3 E02000003 Barking and Dagenham 46.9 (((0.1510245 51.58589, 0.1510833 51.58…
4 E02000004 Barking and Dagenham 24.8 (((0.1851372 51.56552, 0.1833498 51.56…
5 E02000005 Barking and Dagenham 72.1 (((0.1492369 51.56738, 0.1483768 51.56…
6 E02000007 Barking and Dagenham 50.6 (((0.1561951 51.56508, 0.1543864 51.56…
7 E02000008 Barking and Dagenham 81.5 (((0.1357881 51.55903, 0.1344422 51.55…
8 E02000009 Barking and Dagenham 87.4 (((0.1162061 51.557, 0.113344 51.55732…
9 E02000010 Barking and Dagenham 76.8 (((0.1563492 51.55102, 0.1525812 51.55…
10 E02000011 Barking and Dagenham 38.8 (((0.1646052 51.55265, 0.1640759 51.55…
# ℹ 973 more rows
```

E.g. Data set with (\(n=983\)) features containing spatial attributes (

`geometry`

) and feature-related variables (`MSOA11CD`

,`LAD11NM`

,`PopDen`

).Notice the use of

`{dplyr}`

verbs! Link: Tidyverse methods for`sf`

objects

To assess – or at least, to account – for spatial patterns in the data.

Crucial to acknowledge the inherent non-randomness and interdependence within spatial data, increasing the validity and accuracy.

Numerous motivations for doing so including

- Agriculture (Besag and Higdon 1999)
- Ecology (Brewer and Nolan 2007)
- Education (Wall 2004)
- Epidemiology (Xie et al. 2017; Li et al. 2011)
- Image analysis (Gavin and Jennison 1997)

Spatial statistics are indispensable for capturing the complexity of spatial data, leading to more accurate, reliable, and nuanced insights and decisions.

Values are randomly distributed in space, independently of each other (and particularly its neighbours)

Lack of a spatial pattern.

Values are

**non-randomly**distributed in space.Large values are “attracting” other large values, and vice versa.

A clustering effect is present.

Values are

**non-randomly**distributed in space.Large values are “repelling” other large values, and vice versa.

A dispersion effect is present.

A test of “global” trends: Looks for evidence of spatial heterogeneity across the entire study area indexed by \(i=1,\dots,M\).

The Moran’s I test statistic (1948) is defined as: \[ I = \frac{M}{\sum_{i}{\sum_{j}{w_{ij}}}} \frac{\sum_{i}{\sum_{j}{w_{ij}(y_i - \bar{y})(y_j - \bar{y})}}}{\sum_{i}{(y_i - \bar{y})^2}} \in [-1, 1] \] where \(w_{ij}\) is the “spatial weight” between locations \(i\) and \(j\).

The null hypothesis is “the data are randomly distributed in space (no spatial autocorrelation)”, and is typically compared against an alternative of “>”.

The Moran’s I value can be thought of as the ratio between the observed spatial variability and the expected variability.

Two common approaches for \(p\)-value calculation:

- Under \(H_0\), \(\operatorname{E}(I) = -1/(M-1)\).
- The variance \(\operatorname{Var}(I)\) also has a closed form expression.
- So, the CLT is invoked and a \(p\)-values based on a \(Z\) score is calculated.
- In practice, the normality assumption is not always met.

Under \(H_0\), the spatial pattern is random, so we can shuffle the values and recompute \(I\).

Compute the proportion of times \(I\) is larger than the observed value.

The Monte Carlo approach can be more accurate.

For areal data it is quite common to implement \(W=A\) where \(A = (a_{ij})\) is an adjacency matrix with the following definition: \[ a_{ij} = \begin{cases} 1 & i \sim j \text{ where } i \neq j \\ 0 & \text{otherwise.} \end{cases} \]

```
A B C D E
A 0 1 1 0 0
B 1 0 1 1 1
C 1 1 0 0 1
D 0 1 0 0 1
E 0 1 1 1 0
```

Other useful definitions:

\(D = \operatorname{diag}(A1)\) is a diagonal matrix containing number of neighbours.

\(\tilde W = D^{-1}A\) is the scaled adjacency matrix (row sums are 1).

```
library(spdep)
y <- negfc_sf$value
Wl <- poly2nb(negfc_sf) |> nb2listw(style = "B")
moran.test(y, listw = Wl, alternative = "less")
```

```
Moran I test under randomisation
data: y
weights: Wl
Moran I statistic standard deviate = -1.4695, p-value = 0.07084
alternative hypothesis: less
sample estimates:
Moran I statistic Expectation Variance
-0.11145491 -0.01234568 0.00454854
```

```
Monte-Carlo simulation of Moran I
data: y
weights: Wl
number of simulations + 1: 1001
statistic = -0.11145, observed rank = 58, p-value = 0.05794
alternative hypothesis: less
```

The study area \(\mathcal S\) is partitioned into \(M\) non-overlapping areal units. To each areal unit \(i\) is associated a response variable \(y_i\) and a vector of covariates \(x_i\). Let \[ \begin{gather} y_i \sim p(y_i \mid \mu_i, \nu^2) \\ g(\mu_i) = x_i^\top\beta + \psi_i. \end{gather} \]

Parameters of interest are \(\beta\) (regression coefficients) and \(\nu^2\) (response variance).

\(g(\cdot)\) is a link function, e.g.

- \(y_i \sim \operatorname{N}(\mu_i, \nu^2) \Leftrightarrow g(x) = x\)
- \(y_i \sim \operatorname{Bin}(n_i, \pi_i) \Leftrightarrow g(x) = \log(\frac{x}{1-x})\)
- \(y_i \sim \operatorname{Pois}(\mu_i) \Leftrightarrow g(x) = \log(x)\)

Additionally, \(\psi_i\) are structured random effects to be estimated. It partly contains spatial random effects which we label \(\phi_i\).

Let \(\psi_i=\phi_i\). Then, the CAR prior (Leroux, Lei, and Breslow 2000) is defined as \[ \phi_i \mid \phi_{-i},W,\rho,\tau^2 \sim \operatorname{N}\left(\frac{\rho\sum_{j} w_{ij} \phi_j}{\rho\sum_{j} w_{ij}+1-\rho}, \frac{\tau^2}{\rho\sum_{j} w_{ij}+1-\rho}\right), \] where \(\rho\in[0,1]\) is a spatial dependence parameter; \(\tau^2\) is a variance parameter; and \(W\) is a known weight matrix (e.g. \(W=A \in \{0,1\}^{M\times M}\) adjacency matrix).

Special cases:

- \(\rho = 0\) corresponds to independence;
- \(\rho = 1\) corresponds to the
*Intrinsic*CAR; - \(\rho=1\) and \(\psi_i = \phi_i + \theta_i\), where \(\theta_i\sim\operatorname{N}(0,\sigma^2)\) – this is the BYM (1991) model for binomial, Poisson and ZIP likelihoods.

Using Brook’s lemma, the joint distribution of \(\phi\) (Besag 1974) is \[ (\phi_1,\dots,\phi_M)^\top \sim \operatorname{N}_M(0,\tau^2 Q^{-1}), \] where \[ Q = \rho[\overbrace{\operatorname{diag}(A1)}^{D}-\overbrace{A}^{DD^{-1}A}] + (1-\rho)I_M. \]

Clearly \(Q\) is a convex combination of a fully intrinsic spatial structure \(Q=D(I_M - \tilde W)\) and a completely independent structure \(Q=I_M\).

Estimating \(\rho\) from the data allows us to quantify the extent of spatial autocorrelation.

In econometrics (LeSage and Pace 2009), the simultaneous autoregressive (SAR) model is more widely used: \[ \begin{gathered} y = \rho W y + X\beta + \epsilon \\ \epsilon \sim N_M(0, \sigma^2 I_M) \end{gathered} \]

A direct spatial relationship between the response variables.

Assume spatial interactions have a global effect (impacts are spread throughout the entire network of relationships).

The \(\rho\) parameter is interpeted as the “spillover” effect.

More appropriate to model “directional flow” by customising the \(W\) matrix.

Now suppose data at each spatial unit is recorded for \(t=1,\dots,T\) time periods. Extend the model: \[ \begin{gather} y_{it} \sim p(y_{it} \mid \mu_{it}, \nu^2) \\ g(\mu_{it}) = x_{it}^\top\beta + \psi_{it}. \end{gather} \]

We’ll discuss three ways to model the temporal dependence:

- Linear time effects
- ANOVA decomposition
- Time autoregressive process

Estimate the *linear* time trends for each areal unit. The goal is to estimate areas exhibiting changing linear trends in \(y\) over time. \[
\begin{gather}
\psi_{it} = \phi_i + (\alpha + \delta_i)t \\
\phi_i \mid \phi_{-i} \sim \operatorname{CAR}(W, \rho, \tau^2) \\
\delta_i \mid \delta_{-i} \sim \operatorname{CAR}(W, \tilde\rho, \tilde\tau^2)
\end{gather}
\]

- Spatially varying intercept \(\phi_i\).
- Spatially varying time slope \(\alpha + \delta_i\), where \(\alpha\) is a “global” time effect.
- Assume linear temporal trend (like growth models).

Decomposes the spatio-temporal variation into their constituent parts. The goal is to estimate overall time trends and spatial patterns. \[ \begin{gather} \psi_{it} = \phi_i + \delta_t + \gamma_{it} \\ \phi_i \mid \phi_{-i} \sim \operatorname{CAR}(W, \rho, \tau^2) \\ \delta_i \mid \delta_{-i} \sim \operatorname{CAR}(D, \tilde\rho, \tilde\tau^2) \\ \gamma_{it} \sim \operatorname{N}(0, \omega^2) \end{gather} \]

- \(\phi\) are the spatial random effects, and \(\delta\) are the temporal random effects.
- \(D\) is the “temporal adjacency” matrix: \(D_{tt'}=1\) if \(|t-t'|=1\), otherwise 0.
- The (optional) interaction terms \(\gamma_{it}\) are not identified for normal models.

Include and AR(1) or AR(2) temporal process. The goal is to estimate the *evolution* of spatial random effects over time. \[
\begin{gather}
\psi_{it} = \phi_{it} \\
\phi^{(t)} \mid \phi^{(t-1)},\phi^{(t-2)} \sim \operatorname{N}_M\big(a_1\phi^{(t-1)} + a_2\phi^{(t-2)}, \tau^2 Q(W,\rho)^{-1}\big) \hspace{2em} t\geq 3 \\
\phi^{(1)} \sim \operatorname{N}_M\big(0, \tau^2 Q(W,\rho)^{-1}\big)
\end{gather}
\]

- The \(Q\) precision matrix is the same as the CAR Leroux model.
- Temporal autocorrelation is induced by the mean.
- Spatial autocorrelation is induced by the variance.

- Model estimated fully Bayesian using Markov chain Monte Carlo (MCMC) methods. Additional priors:
- \(\beta \sim \operatorname{N}(\mu_\beta,\Sigma_\beta)\). Default choice of mean 0 and very large independent variance.
- \(\Gamma^{-1}(1,0.01)\) for variance parameters.
- \(\operatorname{Unif}(-1,1)\) for the \(\rho\)s.

- Use the
`{CARBayesST}`

package (Lee, Rushworth, and Napier 2018).- Gibbs sampling for closed-form full conditionals (e.g. \(\beta\), \(\phi\), \(\tau^2\)).
- Metropolis or Metropolis-Hastings for the rest.
- Make use of sparsity of \(W\) (and \(D\)) to speed up computation.

- Model fit criteria: Log-likelihood; Deviance information criterion (DIC) (Spiegelhalter et al. 2002); and Watanabe-Akaike information criterion (WAIC) (Watanabe and Opper 2010).

Loan information data pertaining to residential property purchases obtained from BDCB as reported from financial institutions (2015–2022).

\(N=3,512\) sample points spread across \(T=32\) quarters containing:

`price`

: Property price in thousands of Brunei dollars`type`

: Property type (Land, Detached, Semi-Detached, Terrace, Apartment)`built_up`

: Built up area in 1,000 square feet`beds`

: No. of beds`baths`

: No. of baths`land_type`

: Land type (Freehold, Leasehold, Strata)`land_size`

: Lot area in acres

Model:

`price ~ house_characteristics + st_effects + error`

(Gaussian errors, identity link).

Brunei is administratively divided into four districts with \(M=39\) mukims (sub-districts).

Some mukims will not have any transactions due to:

- Nature reserve areas
- Government land
- No residential properties

The \(W\) matrix is the binary mukim relationship ignoring the non-observed areas.