Understanding Kriging (Part 2)

Author

Joaquin Cavieres

Published

March 23, 2026

1 Introduction

Let us continue building our understanding of the Kriging method. Recall that while Kriging can be treated as a deterministic problem solvable through linear algebra, we now shift our focus to the more common (statistical) Bayesian perspective. The primary objective of Kriging is to use observed values \(y_1, \dots, y_N\), sampled at spatial locations \(\mathbf{s}_1, \dots, \mathbf{s}_N\), to predict an unknown value \(y_{\text{new}}\) at a specific location \(\mathbf{s}_{\text{new}}\) (where \(\mathbf{s}_{\text{new}} \notin \{\mathbf{s}_1, \dots, \mathbf{s}_N\}\)).

In addition to the deterministic approach previously discussed, we can address this problem using stochastic and Bayesian frameworks. Under this view, we assume the data values \(y_i\) are observed realizations \(y(\mathbf{s}_i)\) of random variables \(Y(\mathbf{s}_i)\) belonging to a spatial random field \(Y\). Since the collected data \(y(\mathbf{s}_1), \dots, y(\mathbf{s}_N)\) represent only a single realization of the random variables \(Y(\mathbf{s}_1), \dots, Y(\mathbf{s}_N)\), our predictions are formulated using these random variables as inputs. Consequently, the predictor \(\hat{Y}(\mathbf{s})\) is itself a random variable.

Note: while our previous post used \(Z(\mathbf{s}_1), \dots, Z(\mathbf{s}_N)\) to denote the data values, we now use \(Y(\mathbf{s}_1), \dots, Y(\mathbf{s}_N)\). This change is strictly notational, to help distinguish the stochastic approach from the deterministic one.

2 Random Fields

Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space. A function \(Y:\Omega \to \mathbb{R}\) is called a random variable if it is \(\mathcal{F}/\mathcal{B}(\mathbb{R})\)-measurable, where \(\mathcal{B}(\mathbb{R})\) denotes the Borel \(\sigma\)-algebra on \(\mathbb{R}\). Now, on the same probability space, and with \(D\subseteq\mathbb{R}^d\) a domain in \(d\) dimensions, a stochastic process (or, more appropriately in our context, a random field) \(Y\) on \(D\) is a collection of random variables \(\{Y(\mathbf{s}) : \mathbf{s}\in D\}\), where for each fixed \(\mathbf{s}\in D\) the mapping \(Y(\mathbf{s}):\Omega\to\mathbb{R}\) is a random variable.

The moments of a random field \(Y\) provide useful summary information. For example, the mean of a random field \(Y\) is defined as:

\[\begin{equation} \mu(\mathbf{s}) = \mathbb{E}[Y(\mathbf{s})], \qquad \mathbf{s} \in D, \end{equation}\]

provided the expectation exists for all \(\mathbf{s} \in D\). The second central moment (variance) is:

\[\begin{equation} \operatorname{Var}(Y(\mathbf{s})) = \mathbb{E}\!\left[(Y(\mathbf{s}) - \mu(\mathbf{s}))^2\right], \qquad \mathbf{s} \in D. \end{equation}\]

And the covariance between two locations is:

\[\begin{equation} \operatorname{Cov}(Y(\mathbf{s}), Y(\mathbf{s}')) = \mathbb{E}\!\left[(Y(\mathbf{s}) - \mu(\mathbf{s}))(Y(\mathbf{s'}) - \mu(\mathbf{s'}))\right] = \sigma^2 \mathbf{C}(\mathbf{s}, \mathbf{s}'), \quad \mathbf{s},\mathbf{s'}\in D, \end{equation}\]

where the scalar \(\sigma^2 > 0\) is the process variance and \(\mathbf{C}(\mathbf{s}, \mathbf{s}')\) is the correlation function, normalized so that \(\mathbf{C}(\mathbf{s},\mathbf{s}) = 1\). Consequently, the marginal variance of \(Y\) at any location is:

\[\begin{equation} \operatorname{Var}(Y(\mathbf{s})) = \sigma^2 \mathbf{C}(\mathbf{s}, \mathbf{s}) = \sigma^2. \end{equation}\]

In the statistics literature \(\sigma^2\) is typically included in the definition of the covariance function (e.g. \(\sigma^2\exp(-\|\mathbf{s}-\mathbf{s}'\|^2/2\ell^2)\) for a Gaussian kernel), whereas in numerical analysis and scattered-data interpolation it is omitted because it is absorbed into the expansion coefficients. In the stochastic setting the process variance plays an important role, and we keep it explicit throughout.

There are many kinds of random fields, but we focus here on Gaussian random fields, also known as Gaussian processes.

3 Gaussian Random Fields and Kriging

3.1 Gaussian Random Fields

A random field \(Y\) on \(D \subseteq \mathbb{R}^d\) is called a Gaussian Random Field (GRF) if, for any finite collection of locations \(\mathbf{s}_1, \dots, \mathbf{s}_n \in D\), the joint distribution of \((Y(\mathbf{s}_1), \dots, Y(\mathbf{s}_n))^\top\) is multivariate Gaussian:

\[ \begin{pmatrix} Y(\mathbf{s}_1) \\ \vdots \\ Y(\mathbf{s}_n) \end{pmatrix} \sim \mathcal{N}\!\left(\boldsymbol{\mu},\, \boldsymbol{\Sigma}\right), \]

where \(\boldsymbol{\mu} = (\mu(\mathbf{s}_1), \dots, \mu(\mathbf{s}_n))^\top\) is the mean vector and \(\boldsymbol{\Sigma}\) is the \(n \times n\) covariance matrix with entries \(\Sigma_{ij} = \sigma^2 \mathbf{C}(\mathbf{s}_i, \mathbf{s}_j)\).

A GRF is therefore entirely characterized by its mean function \(\mu(\mathbf{s})\) and covariance function \(\sigma^2 \mathbf{C}(\mathbf{s}, \mathbf{s}')\). We write \(Y \sim \mathcal{GRF}(\mu, \sigma^2 \mathbf{C})\) to denote this. This characterization is what makes GRFs so tractable: all finite-dimensional distributions are Gaussian, and conditional distributions — the key operation for prediction — remain Gaussian and can be computed in closed form.
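As a minimal illustration of this defining property, the sketch below draws one joint sample of \(Y\) at four locations via the Cholesky factor of \(\boldsymbol{\Sigma}\). The locations, the Gaussian kernel, and the length-scale are illustrative choices for this sketch, not part of the workflow developed later.

```r
# One joint draw of (Y(s_1), ..., Y(s_4)) from a zero-mean GRF with a
# Gaussian kernel (all numbers here are illustrative).
set.seed(123)
s     <- c(0.1, 0.3, 0.6, 0.8)                         # four locations in 1-D
ell   <- 0.2                                           # length-scale (example)
Sigma <- outer(s, s, function(a, b) exp(-(a - b)^2 / (2 * ell^2)))

L <- t(chol(Sigma + 1e-10 * diag(length(s))))          # jitter for stability
y <- as.vector(L %*% rnorm(length(s)))                 # one joint Gaussian draw
y                                                      # a 4-vector: one realization
```

Any other finite set of locations would work the same way; this is exactly the finite-dimensional consistency that the definition requires.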

3.2 Stationarity

A GRF is called second-order stationary (or weakly stationary) if:

  1. The mean is constant: \(\mu(\mathbf{s}) = \mu\) for all \(\mathbf{s} \in D\).
  2. The covariance depends only on the lag vector \(\mathbf{h} = \mathbf{s} - \mathbf{s}'\):

\[ \operatorname{Cov}(Y(\mathbf{s}),\, Y(\mathbf{s}')) = \sigma^2 \mathbf{C}(\mathbf{s} - \mathbf{s}') = \sigma^2 \mathbf{C}(\mathbf{h}). \]

If the covariance depends only on the Euclidean distance \(r = \|\mathbf{h}\|\) and not on the direction, the field is called isotropic:

\[ \operatorname{Cov}(Y(\mathbf{s}),\, Y(\mathbf{s}')) = \sigma^2 \mathbf{C}(\|\mathbf{s} - \mathbf{s}'\|). \]

Isotropy reduces the covariance function to a univariate function of distance, which greatly simplifies both modeling and parameter estimation. We assume isotropy throughout.

3.3 The Matérn Covariance Family

A flexible and widely used family of isotropic covariance functions is the Matérn class:

\[ \mathbf{C}(r;\, \nu,\, \ell) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\!\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right), \qquad r \geq 0, \]

where \(r = \|\mathbf{s} - \mathbf{s}'\|\) is the Euclidean distance, \(\ell > 0\) is the range (length-scale) parameter controlling how quickly correlations decay, \(\nu > 0\) is the smoothness parameter governing the mean-square differentiability of the field, and \(K_\nu\) is the modified Bessel function of the second kind of order \(\nu\). The normalization \(\mathbf{C}(0;\nu,\ell) = 1\) is immediate from the limits of the Bessel function.

Three important special cases are the half-integer values \(\nu = 1/2\) and \(\nu = 3/2\), together with the limit \(\nu \to \infty\):

  • \(\nu = 1/2\) (exponential): \(\mathbf{C}(r) = \exp(-r/\ell)\). The resulting field is continuous but nowhere mean-square differentiable — appropriate for rough phenomena.
  • \(\nu = 3/2\): \(\mathbf{C}(r) = (1 + \sqrt{3}\,r/\ell)\exp(-\sqrt{3}\,r/\ell)\). The field is once mean-square differentiable.
  • \(\nu \to \infty\) (squared exponential / Gaussian): \(\mathbf{C}(r) = \exp(-r^2/2\ell^2)\). The field is infinitely differentiable, which is often too smooth for real geophysical data.

The choice of \(\nu\) encodes prior beliefs about the regularity of the spatial phenomenon. In practice, \(\nu = 1\) or \(\nu = 3/2\) strikes a good balance between realism and tractability.
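These closed forms can be checked numerically against the general Bessel-function expression. In the base-R sketch below, the helper `matern` and the grid of distances are ours (for checking only), not part of the workflow in Section 4.

```r
# Check that the general Matérn formula at half-integer smoothness matches
# the closed-form special cases listed above.
matern <- function(r, nu, ell) {
  s <- sqrt(2 * nu) * r / ell
  ifelse(r == 0, 1, (2^(1 - nu)) / gamma(nu) * s^nu * besselK(s, nu))
}

r   <- seq(0.01, 2, length.out = 50)   # grid of distances (illustrative)
ell <- 0.5                             # length-scale (illustrative)

# nu = 1/2 should reduce to the exponential kernel exp(-r/ell)
err_half <- max(abs(matern(r, 0.5, ell) - exp(-r / ell)))

# nu = 3/2 should reduce to (1 + sqrt(3) r/ell) exp(-sqrt(3) r/ell)
err_32 <- max(abs(matern(r, 1.5, ell) -
                  (1 + sqrt(3) * r / ell) * exp(-sqrt(3) * r / ell)))

c(err_half, err_32)                    # both close to machine precision
```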

3.4 Kriging as Bayesian Prediction

We now frame spatial prediction within the Bayesian paradigm. Suppose we observe \(\mathbf{y} = (y(\mathbf{s}_1), \dots, y(\mathbf{s}_N))^\top\) and wish to predict the unknown value \(Y(\mathbf{s}_*)\) at a new location \(\mathbf{s}_*\). We adopt the hierarchical model:

\[ \begin{aligned} Y(\mathbf{s}) &= \mu + U(\mathbf{s}), \qquad U \sim \mathcal{GRF}(0,\, \sigma^2 \mathbf{C}), \\ y(\mathbf{s}_i) &= Y(\mathbf{s}_i) + \varepsilon_i, \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0,\, \sigma_e^2), \end{aligned} \]

where \(\mu\) is the constant mean, \(U(\mathbf{s})\) is the zero-mean latent GRF with process variance \(\sigma^2\) and correlation function \(\mathbf{C}\), and \(\varepsilon_i\) is independent Gaussian measurement noise with variance \(\sigma_e^2\) — called the nugget in geostatistics. The nugget accounts for both instrument error and microscale variability that cannot be resolved at the observation scale.

Define the cross-covariance vector \(\mathbf{k}_* = \sigma^2\mathbf{c}_*\), where \(\mathbf{c}_* = (\mathbf{C}(\mathbf{s}_1, \mathbf{s}_*), \dots, \mathbf{C}(\mathbf{s}_N, \mathbf{s}_*))^\top\) collects the correlations between the observation sites and the prediction site. The joint distribution of observations and prediction target is then:

\[ \begin{pmatrix} \mathbf{y} \\ Y_* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu \mathbf{1} \\ \mu \end{pmatrix},\; \begin{pmatrix} \boldsymbol{\Sigma}_{\mathbf{y}} & \mathbf{k}_* \\ \mathbf{k}_*^\top & \sigma^2 \end{pmatrix} \right), \]

where \(\boldsymbol{\Sigma}_{\mathbf{y}} = \sigma^2 \mathbf{C}(\mathbf{S}, \mathbf{S}) + \sigma_e^2 \mathbf{I}\) is the \(N \times N\) observation covariance matrix, with \((i,j)\) entry \(\sigma^2\mathbf{C}(\mathbf{s}_i, \mathbf{s}_j) + \sigma_e^2\mathbf{1}_{[i=j]}\).

3.4.1 The Kriging Equations

By the standard formula for the conditional distribution of a partitioned multivariate Gaussian, the posterior of \(Y_* = Y(\mathbf{s}_*)\) given the data is:

\[ \boxed{Y_* \mid \mathbf{y} \sim \mathcal{N}\!\left(\hat{y}_*, \, \sigma^2_*\right),} \]

with Kriging mean:

\[ \hat{y}_* = \mu + \mathbf{k}_*^\top \boldsymbol{\Sigma}_{\mathbf{y}}^{-1} (\mathbf{y} - \mu \mathbf{1}), \]

and Kriging variance:

\[ \sigma^2_* = \sigma^2 - \mathbf{k}_*^\top \boldsymbol{\Sigma}_{\mathbf{y}}^{-1} \mathbf{k}_*, \]

where \(\mathbf{k}_* = \sigma^2\mathbf{c}_*\) is the cross-covariance vector defined above. Several key properties follow immediately:

  1. Exact interpolation: if \(\sigma_e^2 = 0\) (no nugget), the Kriging mean passes exactly through the observations, i.e. \(\hat{y}(\mathbf{s}_i) = y(\mathbf{s}_i)\), and the Kriging variance collapses to zero at those locations.
  2. Reversion to prior: as \(\mathbf{s}_*\) moves far from all observation sites, \(\mathbf{k}_* \to \mathbf{0}\), so \(\hat{y}_* \to \mu\) and \(\sigma^2_* \to \sigma^2\) — the prediction falls back to the prior mean with full prior uncertainty.
  3. Uncertainty is design-dependent: \(\sigma^2_*\) depends only on the spatial configuration \(\{\mathbf{s}_i\}\) and \(\mathbf{s}_*\), not on the observed values \(\mathbf{y}\). This means the prediction uncertainty can be evaluated — and the survey design can be optimized — before collecting data.
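Properties 1 and 2 are easy to verify on a toy problem. The sketch below uses a 1-D exponential kernel with \(\mu = 0\), no nugget, and made-up data; all of these choices are illustrative.

```r
# Toy 1-D check of exact interpolation (property 1) and reversion to the
# prior far from the data (property 2). mu = 0, no nugget, made-up values.
s      <- c(0.1, 0.4, 0.9)              # observation locations in 1-D
y      <- c(1.2, -0.5, 0.3)             # observed values (made up)
sigma2 <- 1; ell <- 0.2
C      <- function(a, b) exp(-abs(outer(a, b, "-")) / ell)

Sigma_y <- sigma2 * C(s, s)             # no nugget: sigma_e^2 = 0
krig    <- function(s_star) {
  k <- sigma2 * C(s_star, s)            # cross-covariances to the data sites
  list(mean = as.vector(k %*% solve(Sigma_y, y)),
       var  = as.vector(sigma2 - diag(k %*% solve(Sigma_y, t(k)))))
}

krig(s)$mean                 # reproduces y exactly (property 1)
krig(s)$var                  # essentially zero at the data sites
krig(100)$mean               # ~ 0 = prior mean      (property 2)
krig(100)$var                # ~ 1 = prior variance sigma^2
```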

3.4.2 Practical Computation via Cholesky

Explicitly inverting \(\boldsymbol{\Sigma}_{\mathbf{y}}\) is slower and less numerically stable than factorizing it. In practice we use the Cholesky decomposition \(\boldsymbol{\Sigma}_{\mathbf{y}} = \mathbf{L}\mathbf{L}^\top\) and define the weight vector:

\[ \boldsymbol{\alpha} = \boldsymbol{\Sigma}_{\mathbf{y}}^{-1}(\mathbf{y} - \mu\mathbf{1}), \]

computed via two triangular solves: forward-solve \(\mathbf{L}\mathbf{v} = \mathbf{y} - \mu\mathbf{1}\), then back-solve \(\mathbf{L}^\top\boldsymbol{\alpha} = \mathbf{v}\). The Kriging mean and variance then become:

\[ \hat{y}_* = \mu + \mathbf{k}_*^\top \boldsymbol{\alpha}, \qquad \sigma^2_* = \sigma^2 - \|\mathbf{L}^{-1}\mathbf{k}_*\|^2, \]

where \(\|\mathbf{L}^{-1}\mathbf{k}_*\|^2 = \mathbf{v}_*^\top\mathbf{v}_*\) with \(\mathbf{v}_* = \mathbf{L}^{-1}\mathbf{k}_*\) obtained by a forward solve. The Cholesky factorization costs \(\mathcal{O}(N^3)\) and is performed once; each subsequent prediction then costs \(\mathcal{O}(N)\) for the mean (a dot product with \(\boldsymbol{\alpha}\)) plus \(\mathcal{O}(N^2)\) for the variance (one triangular solve). This is the standard approach in most production Kriging implementations.
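As a quick numerical check, the sketch below builds a synthetic \(\boldsymbol{\Sigma}_{\mathbf{y}}\) from an exponential kernel (all numbers illustrative) and confirms that the two triangular solves reproduce \(\boldsymbol{\Sigma}_{\mathbf{y}}^{-1}(\mathbf{y} - \mu\mathbf{1})\) without ever forming the inverse.

```r
# Verify: backsolve + forwardsolve on the Cholesky factor give the same
# weight vector alpha as a direct linear solve against Sigma_y.
set.seed(1)
N  <- 50
S  <- matrix(runif(2 * N), ncol = 2)       # random locations in [0,1]^2
D  <- as.matrix(dist(S))
Sigma_y <- exp(-D / 0.3) + 0.1 * diag(N)   # exponential kernel + nugget
y  <- rnorm(N); mu <- 0.2                  # synthetic data and mean

L     <- t(chol(Sigma_y))                           # lower-triangular factor
alpha <- backsolve(t(L), forwardsolve(L, y - mu))   # two triangular solves
alpha_direct <- solve(Sigma_y, y - mu)              # reference solution

max(abs(alpha - alpha_direct))             # agree to numerical precision
```

Note that `chol()` in R returns the upper-triangular factor, hence the transpose to obtain \(\mathbf{L}\).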

4 R Code

The following R code demonstrates the full Kriging workflow: simulating a GRF from a Matérn covariance, adding noise to create observations, and predicting on a dense grid.

#======================================================
# Kriging with Matérn covariance
#======================================================
library(ggplot2)
library(gridExtra)

set.seed(42)

#------------------------------------------------------
# True parameters
#------------------------------------------------------
N_s     <- 80        # number of observation locations
range   <- 0.25      # Matérn length-scale (shadows base::range(), which remains callable)
nu      <- 1         # Matérn smoothness 
sigma_u <- 1.0       # process SD 
sigma_e <- 0.15      # nugget (measurement-noise) SD
mu_true <- 0.0       # constant mean mu

#------------------------------------------------------
# Matérn correlation function
#------------------------------------------------------
matern_cov <- function(coords1, coords2 = NULL, nu = 1, range, sigma2 = 1) {
  if (is.null(coords2)) coords2 <- coords1
  n1 <- nrow(coords1); n2 <- nrow(coords2)
  D  <- as.matrix(dist(rbind(coords1, coords2)))[1:n1, (n1+1):(n1+n2), drop = FALSE]
  D[D == 0] <- 1e-10          # avoid 0 argument to besselK
  s  <- sqrt(2 * nu) * D / range
  C  <- (2^(1 - nu)) / gamma(nu) * s^nu * besselK(s, nu)
  if (n1 == n2 && isTRUE(all.equal(coords1, coords2))) diag(C) <- 1
  sigma2 * C
}

#------------------------------------------------------
# Observation locations: N_s uniform points in [0,1]^2
#------------------------------------------------------
obs_locs <- matrix(runif(N_s * 2), ncol = 2, dimnames = list(NULL, c("s1", "s2")))

#------------------------------------------------------
# Simulate the latent field U(s) ~ GRF(0, sigma2 C) via Cholesky:
#------------------------------------------------------
Sigma_U <- matern_cov(obs_locs, nu = nu, range = range, sigma2 = sigma_u^2)
diag(Sigma_U) <- diag(Sigma_U) + 1e-9     # For numerical stability
L_U    <- t(chol(Sigma_U))
U_true <- as.vector(L_U %*% rnorm(N_s))

#------------------------------------------------------
# Observations: y(s)
#------------------------------------------------------
y_obs <- mu_true + U_true + rnorm(N_s, sd = sigma_e)

#------------------------------------------------------
# Kriging — build observation covariance Σ_y = sigma2_u C(S,S) + sigma2_e I
#------------------------------------------------------
Sigma_y <- matern_cov(obs_locs, nu = nu, range = range, sigma2 = sigma_u^2) +
           (sigma_e^2 + 1e-9) * diag(N_s)
L_obs   <- t(chol(Sigma_y))

# Kriging weight vector "alpha" via two triangular solves
alpha <- backsolve(t(L_obs), forwardsolve(L_obs, y_obs - mu_true))

#------------------------------------------------------
# Prediction: 30 × 30 regular grid on [0,1]^2
#------------------------------------------------------
N_grid    <- 30
pred_grid <- as.matrix(expand.grid(s1 = seq(0, 1, length.out = N_grid),
                                   s2 = seq(0, 1, length.out = N_grid)))

#------------------------------------------------------
# Kriging prediction at each grid point
# Cross-covariance  k* = sigma2 c*  (N_pred × N_obs matrix)
#------------------------------------------------------
k_cross  <- matern_cov(pred_grid, obs_locs, nu = nu, range = range, sigma2 = sigma_u^2)
krig_mean <- mu_true + as.vector(k_cross %*% alpha)

V <- forwardsolve(L_obs, t(k_cross))
krig_var <- pmax(sigma_u^2 - colSums(V^2), 0)
krig_sd  <- sqrt(krig_var)


#------------------------------------------------------
# Plots
#------------------------------------------------------
val_lim <- range(c(U_true, y_obs, krig_mean))

obs_df <- data.frame(s1 = obs_locs[, "s1"], s2 = obs_locs[, "s2"],
                     U  = U_true, y  = y_obs)

pred_df <- data.frame(s1 = pred_grid[, "s1"], s2   = pred_grid[, "s2"],
                      mean = krig_mean, sd   = krig_sd)

# True spatial field at observation locations
p_true <- ggplot(obs_df, aes(s1, s2, colour = U)) +
  geom_point(size = 3.5) +
  scale_colour_distiller(palette = "RdBu", direction = -1,
                         limits = val_lim, name = expression(U(bold(s)))) +
  coord_equal() + theme_bw(base_size = 11) +
  theme(panel.grid = element_blank()) +
  labs(title    = "True spatial field  U(s)",
       subtitle = bquote(sigma == .(sigma_u) ~ "  range =" ~ .(range) ~
                           "  " * nu == .(nu)),
       x = expression(s[1]), y = expression(s[2]))

# Observations
p_obs <- ggplot(obs_df, aes(s1, s2, colour = y)) +
  geom_point(size = 3.5) +
  scale_colour_distiller(palette = "RdBu", direction = -1,
                         limits = val_lim, name = expression(y(bold(s)))) +
  coord_equal() + theme_bw(base_size = 11) +
  theme(panel.grid = element_blank()) +
  labs(title    = "Observations  y(s)",
       subtitle = bquote(sigma[e] == .(sigma_e)),
       x = expression(s[1]), y = expression(s[2]))

# Kriging mean
p_krig_mean <- ggplot(pred_df, aes(s1, s2, fill = mean)) +
  geom_raster() +
  geom_point(data = obs_df, aes(s1, s2),
             inherit.aes = FALSE, colour = "black", size = 1.2) +
  scale_fill_distiller(palette = "RdBu", direction = -1,
                       limits = val_lim,
                       name = expression(hat(y)(bold(s)["*"]))) +
  coord_equal() + theme_bw(base_size = 11) +
  theme(axis.text = element_blank(), axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  labs(title    = "Kriging mean",
       x = NULL, y = NULL)

# Kriging posterior SD
p_krig_sd <- ggplot(pred_df, aes(s1, s2, fill = sd)) +
  geom_raster() +
  geom_point(data = obs_df, aes(s1, s2),
             inherit.aes = FALSE, colour = "black", size = 1.2) +
  scale_fill_distiller(palette = "YlOrRd", direction = 1,
                       limits = c(0, NA),
                       name = expression(sigma["*"](bold(s)["*"]))) +
  coord_equal() + theme_bw(base_size = 11) +
  theme(axis.text = element_blank(), axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  labs(title    = "Kriging uncertainty (posterior SD)",
       x = NULL, y = NULL)

grid.arrange(p_true, p_obs,          ncol = 2)

grid.arrange(p_krig_mean, p_krig_sd, ncol = 2)

5 Conclusion

In this post we have developed Kriging prediction under Bayesian inference. Here, the GRF assumption turns spatial interpolation into conditional Gaussian inference: given observations \(\mathbf{y}\), the posterior at any unobserved location \(\mathbf{s}_*\) is a normal distribution whose mean \(\hat{y}_*\) and variance \(\sigma^2_*\) are available in closed form through simple linear algebra.

The reason Bayesian Kriging delivers uncertainty estimates is structural: by treating the unobserved field \(U(\mathbf{s})\) as a random quantity rather than a fixed unknown, the prediction at any new location \(\mathbf{s}_*\) is not a single number but a full posterior distribution \(Y_* \mid \mathbf{y} \sim \mathcal{N}(\hat{y}_*, \sigma^2_*)\). The variance \[\begin{equation} \sigma^2_* = \sigma^2 - \mathbf{k}_*^\top \boldsymbol{\Sigma}_{\mathbf{y}}^{-1} \mathbf{k}_* \end{equation}\]

measures how much the observations have reduced our prior uncertainty \(\sigma^2\) at that specific location. A purely deterministic interpolation method, such as splines or inverse distance weighting, would produce a comparable predicted surface but discard this second quantity entirely. Retaining it matters because predictions without uncertainty are incomplete: two locations can share the same predicted grade, but one might sit surrounded by boreholes while the other lies in an unsampled region of the domain, and \(\sigma^2_*\) distinguishes them exactly. In practice this matters for decision-making (whether to approve a mining block for extraction, allocate a drilling budget, or classify a resource as measured versus inferred), because all of those decisions depend not just on the expected value but on how much confidence one can attach to it.

Beyond this, several aspects of the framework deserve attention:

  • The covariance function encodes all prior knowledge about the spatial structure \(\rightarrow\) the Matérn family gives fine-grained control over both the smoothness (via \(\nu\)) and the range of spatial dependence (via \(\ell\)). Choosing these parameters (either by expert knowledge or by maximum likelihood) is the central modeling decision in any Kriging interpolation.

  • The nugget \(\sigma_e^2\) plays a dual role: statistically, it represents measurement error; geometrically, it prevents the covariance matrix from becoming singular and controls the degree of interpolation versus smoothing.

  • The posterior variance \(\sigma^2_*\) is a map of data coverage: it does not depend on the observed values, only on the sampling design. This property is central to optimal spatial design, since one can identify under-sampled regions and decide where to collect future observations before seeing any data.

6 Comments

An interesting comment was made on LinkedIn by Ruben Roa Ureta (view the comment on LinkedIn here). In the next post I will take up his observation and compare the Maximum Likelihood method with Bayesian inference via MCMC (in estimation terms only), along with uncertainty quantification for Kriging predictions!