---
title: "Raster regression"
date: April 27, 2023
author: Connor Donegan
output:
rmarkdown::html_vignette:
toc: true
toc_depth: 3 # upto three depths of headings (specified by #, ## and ###)
number_sections: true
header-includes:
- \usepackage{amsmath}
vignette: >
%\VignetteIndexEntry{Raster regression}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: raster.bib
link-citations: yes
---
This vignette provides a tutorial for fitting spatial regression models to raster data using **geostan**. The term "raster" is used here to refer to any regularly spaced set of observations such that the data can be represented spatially by a rectangular grid. Remotely sensed imagery is a common form of raster data.
**geostan** can be used for spatial regression with fairly large raster data layers, although the functionality of these models will often be limited to the estimation of regression coefficients and spatial autocorrelation parameters. Limited experience thus far finds that **geostan**'s spatial autoregressive models can be fit to raster layers with two hundred thousand observations using a laptop computer and fewer than ten minutes of sampling time.
## Demonstration
Start by loading some necessary R packages.
```{r setup, message = FALSE, warning = FALSE, eval = TRUE}
library(geostan)
library(sf)
set.seed(1127)
```
We will create a small raster data layer for the purpose of illustration.
```{r eval = TRUE}
row <- 40
col <- 30
c(N <- row * col)
sfc = st_sfc(st_polygon(list(rbind(c(0,0), c(col,0), c(col,row), c(0,0)))))
grid <- st_make_grid(sfc, cellsize = 1, square = TRUE)
grid <- st_as_sf(grid)
W <- shape2mat(grid, style = "W", queen = FALSE)
grid$z <- sim_sar(w = W, rho = 0.9)
grid$y <- -0.5 * grid$z + sim_sar(w = W, rho = .9, sigma = .3)
```
```{r fig.width = 3.25, fig.height = 3, fig.align = 'center'}
plot(grid[,'z'])
```
The following R code will fit a spatial autoregressive model to these data:
```{r eval = FALSE}
fit <- stan_sar(y ~ z, data = grid, C = W)
```
The `stan_sar` function will take the spatial weights matrix `W` and pass it through a function called `prep_sar_data` which will calculate the eigenvalues of the spatial weights matrix using `base::eigen`, as required for computational reasons. This step is prohibitive for large data sets (e.g., $N = 100,000$).
The following code would normally be used to fit a conditional autoregressive (CAR) model:
```{r eval = FALSE}
C <- shape2mat(grid, style = "B", queen = FALSE)
car_list <- prep_car_data(C, "WCAR")
fit <- stan_car(y ~ z, data = grid, car_parts = car_list)
```
Here, the `prep_car_data` function calculates the eigenvalues of the spatial weights matrix using `base::eigen`, which is not feasible for large N.
The `prep_sar_data2` and `prep_car_data2` functions are designed for raster layers. As input, they require the dimensions of the grid (number of rows and number of columns). The eigenvalues are produced very quickly using Equation 5 from @griffith_2000. The methods have certain restrictions. First, this is only applicable to raster layers---regularly spaced, rectangular grids of observations. Second, to define which observations are adjacent to one another, the "rook" criteria is used (spatially, only observations that share an edge are defined as neighbors to one another). Third, the spatial adjacency matrix will be row-standardized. This is standard (and required) for SAR models, and it corresponds to the "WCAR" specification of the CAR model [see @donegan_2022].
The following code will fit a SAR model to our `grid` data, and is suitable for much larger raster layers:
```{r eval = TRUE}
sar_list <- prep_sar_data2(row = row, col = col)
fit <- stan_sar(y ~ z,
data = grid,
centerx = TRUE,
sar_parts = sar_list,
iter = 500,
chains = 4,
slim = TRUE #,
# cores = 4, # for multi-core processing
)
print(fit)
```
The user first creates the data list using `prep_sar_data2` and then passes it to `stan_sar` using the `sar_parts` argument. Also, `slim = TRUE` is invoked to prevent the model from collecting N-length parameter vectors and quantities of interest (such as fitted values and log-likelihoods).
For large data sets and complex models, `slim = TRUE` can bring about computational improvements at the cost of losing some functionality (including the loss of convenience functions like `sp_diag`, `me_diag`, `spatial`, `resid`, and `fitted`). Many quantities of interest, such as fitted values and spatial trend terms, can still be calculated manually using the data and parameter estimates (intercept, coefficients, and spatial autocorrelation parameters).
The favorable MCMC diagnostics for this model (sufficiently large effective sample sizes `n_eff`, and `Rhat` values very near to 1), based on just 250 post-warmup iterations per chain with four MCMC chains, provides some indication as to how computationally efficient these spatial autoregressive models can be.
Also, note that Stan usually samples more efficiently when variables have been mean-centered. Using the `centerx = TRUE` argument in `stan_sar` (or any other model-fitting function in **geostan**) can be very helpful in this respect. Also note that the SAR models in **geostan** are (generally) no less computationally-efficient than the CAR models, and may even be slightly more efficient.
## References