Let us continue our study of the Kriging method. Recall that while Kriging can be treated as a deterministic problem solvable through linear algebra, we now shift our focus to the more common (statistical) Bayesian perspective. The primary objective of Kriging is to use observed values \(y_1, \dots, y_N\), sampled at spatial locations \(\mathbf{s}_1, \dots, \mathbf{s}_N\), to predict an unknown value \(y_{\text{new}}\) at a specific location \(\mathbf{s}_{\text{new}}\) (where \(\mathbf{s}_{\text{new}} \notin \{\mathbf{s}_1, \dots, \mathbf{s}_N\}\)). In addition to the deterministic approach previously discussed, we can address this problem using stochastic and Bayesian frameworks. Under this view, we assume the data values \(y_i\) are observed realizations \(y(\mathbf{s}_i)\) of random variables \(Y(\mathbf{s}_i)\) belonging to a spatial random field \(Y\). Since the collected data \(y(\mathbf{s}_1), \dots, y(\mathbf{s}_N)\) represent only a single realization of the random variables \(Y(\mathbf{s}_1), \dots, Y(\mathbf{s}_N)\), our predictions are formulated using these random variables as inputs. Consequently, the predictor \(\hat{Y}(\mathbf{s})\) is itself a random variable.

Note: While our previous post used \(Z(\mathbf{s}_1), \dots, Z(\mathbf{s}_N)\) to denote data values, we will now use \(Y(\mathbf{s}_1), \dots, Y(\mathbf{s}_N)\). This change is strictly notational, to help distinguish the stochastic approach from the deterministic one.
2 Random Fields
Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability space. A function \(Y:\Omega \to \mathbb{R}\) is called a random variable if it is \(\mathcal{F}/\mathcal{B}(\mathbb{R})\)-measurable, where \(\mathcal{B}(\mathbb{R})\) denotes the Borel \(\sigma\)-algebra on \(\mathbb{R}\). Now, in the same probability space, and with \(D\subseteq\mathbb{R}^d\) a domain in \(d\) dimensions, a stochastic process — or more appropriately in our context a random field — \(Y\) on \(D\) is a collection of random variables \(\{Y(\mathbf{s}) : \mathbf{s}\in D\}\), where for each fixed \(\mathbf{s}\in D\) the mapping \(Y(\mathbf{s}):\Omega\to\mathbb{R}\) is a random variable. The moments of a random field \(Y\) provide useful summary information. For example, the mean of a random field \(Y\) is defined as:
\[\begin{equation}
\mu(\mathbf{s}) = \mathbb{E}[Y(\mathbf{s})], \qquad \mathbf{s} \in D,
\end{equation}\]
and its covariance function as
\[\begin{equation}
\operatorname{Cov}\big(Y(\mathbf{s}), Y(\mathbf{s}')\big) = \sigma^2\,\mathbf{C}(\mathbf{s}, \mathbf{s}'),
\end{equation}\]
where the scalar \(\sigma^2 > 0\) is the process variance and \(\mathbf{C}(\mathbf{s}, \mathbf{s}')\) is the correlation function, normalized so that \(\mathbf{C}(\mathbf{s},\mathbf{s}) = 1\). Consequently, the marginal variance of \(Y\) at any location is:
\[\begin{equation}
\operatorname{Var}\big(Y(\mathbf{s})\big) = \sigma^2\,\mathbf{C}(\mathbf{s}, \mathbf{s}) = \sigma^2.
\end{equation}\]
In the statistics literature \(\sigma^2\) is typically included in the definition of the covariance function (e.g. \(\sigma^2\exp(-(x-x')^2/2\ell^2)\) for a Gaussian kernel), whereas in numerical analysis and scattered data interpolation it is omitted because it gets absorbed into the expansion coefficients. In the stochastic setting the process variance plays an important role and we keep it explicit throughout.
There are many kinds of random fields, but we focus here on Gaussian random fields, also known as Gaussian processes.
3 Gaussian Random Fields and Kriging
3.1 Gaussian Random Fields
A random field \(Y\) on \(D \subseteq \mathbb{R}^d\) is called a Gaussian Random Field (GRF) if, for any finite collection of locations \(\mathbf{s}_1, \dots, \mathbf{s}_n \in D\), the joint distribution of \((Y(\mathbf{s}_1), \dots, Y(\mathbf{s}_n))^\top\) is multivariate Gaussian:
\[\begin{equation}
\big(Y(\mathbf{s}_1), \dots, Y(\mathbf{s}_n)\big)^\top \sim \mathcal{N}_n(\boldsymbol{\mu}, \boldsymbol{\Sigma}),
\end{equation}\]
where \(\boldsymbol{\mu} = (\mu(\mathbf{s}_1), \dots, \mu(\mathbf{s}_n))^\top\) is the mean vector and \(\boldsymbol{\Sigma}\) is the \(n \times n\) covariance matrix with entries \(\Sigma_{ij} = \sigma^2 \mathbf{C}(\mathbf{s}_i, \mathbf{s}_j)\).
A GRF is therefore entirely characterized by its mean function \(\mu(\mathbf{s})\) and covariance function \(\sigma^2 \mathbf{C}(\mathbf{s}, \mathbf{s}')\). We write \(Y \sim \mathcal{GRF}(\mu, \sigma^2 \mathbf{C})\) to denote this. This characterization is what makes GRFs so tractable: all finite-dimensional distributions are Gaussian, and conditional distributions — the key operation for prediction — remain Gaussian and can be computed in closed form.
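Because every finite-dimensional distribution of a GRF is Gaussian, a realization at any finite set of locations can be drawn with a single multivariate-normal sample. Below is a minimal base-R sketch of this idea; the squared-exponential correlation and all parameter values are illustrative choices, not part of the exposition above.

```r
# Draw one realization of a zero-mean GRF at n locations in [0, 1],
# using an assumed squared-exponential correlation for illustration.
set.seed(1)
n      <- 50
s      <- sort(runif(n))                      # 1-D locations
ell    <- 0.2                                 # length-scale (illustrative)
sigma2 <- 1.5                                 # process variance (illustrative)
D      <- as.matrix(dist(s))                  # pairwise distances
Sigma  <- sigma2 * exp(-D^2 / (2 * ell^2))    # covariance sigma^2 C(s, s')
L      <- t(chol(Sigma + 1e-8 * diag(n)))     # lower Cholesky factor, jittered
y      <- as.vector(L %*% rnorm(n))           # y ~ N(0, Sigma)
```

The same recipe (build the covariance matrix, factor it, multiply by standard normals) is exactly what the full simulation code later in this post uses for the Matérn case.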
3.2 Stationarity
A GRF is called second-order stationary (or weakly stationary) if:
The mean is constant: \(\mu(\mathbf{s}) = \mu\) for all \(\mathbf{s} \in D\).
The covariance depends only on the lag vector \(\mathbf{h} = \mathbf{s} - \mathbf{s}'\):
\[\begin{equation}
\operatorname{Cov}\big(Y(\mathbf{s}), Y(\mathbf{s}')\big) = \sigma^2\,\mathbf{C}(\mathbf{h}).
\end{equation}\]
If, in addition, the covariance depends only on the distance \(r = \|\mathbf{h}\|\) rather than on the direction of the lag, i.e. \(\mathbf{C}(\mathbf{h}) = \mathbf{C}(\|\mathbf{h}\|)\), the field is called isotropic. Isotropy reduces the covariance function to a univariate function of distance, which greatly simplifies both modeling and parameter estimation. We assume isotropy throughout.
3.3 The Matérn Covariance Family
A flexible and widely used family of isotropic covariance functions is the Matérn class:
\[\begin{equation}
\mathbf{C}(r; \nu, \ell) = \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,r}{\ell}\right)^{\!\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,r}{\ell}\right),
\end{equation}\]
where \(r = \|\mathbf{s} - \mathbf{s}'\|\) is the Euclidean distance, \(\ell > 0\) is the range (length-scale) parameter controlling how quickly correlations decay, \(\nu > 0\) is the smoothness parameter governing the mean-square differentiability of the field, and \(K_\nu\) is the modified Bessel function of the second kind of order \(\nu\). The normalization \(\mathbf{C}(0;\nu,\ell) = 1\) is immediate from the limits of the Bessel function.
Three important special cases arise for half-integer values of \(\nu\):
\(\nu = 1/2\) (exponential): \(\mathbf{C}(r) = \exp(-r/\ell)\). The resulting field is continuous but nowhere mean-square differentiable — appropriate for rough phenomena.
\(\nu = 3/2\): \(\mathbf{C}(r) = (1 + \sqrt{3}\,r/\ell)\exp(-\sqrt{3}\,r/\ell)\). The field is once mean-square differentiable.
\(\nu \to \infty\) (squared exponential / Gaussian): \(\mathbf{C}(r) = \exp(-r^2/2\ell^2)\). The field is infinitely differentiable, which is often too smooth for real geophysical data.
The choice of \(\nu\) encodes prior beliefs about the regularity of the spatial phenomenon. In practice, \(\nu = 1\) or \(\nu = 3/2\) strike a good balance between realism and tractability.
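The half-integer closed forms above can be checked numerically against the general Bessel-function definition. A small sketch in base R; the helper `matern_corr` and the parameter values are illustrative:

```r
# Matérn correlation C(r; nu, ell) from its Bessel-function definition,
# checked against the closed forms for nu = 1/2 and nu = 3/2.
matern_corr <- function(r, nu, ell) {
  s <- sqrt(2 * nu) * r / ell
  ifelse(r == 0, 1, 2^(1 - nu) / gamma(nu) * s^nu * besselK(s, nu))
}

r   <- seq(0.01, 2, by = 0.01)   # positive distances
ell <- 0.5                        # illustrative length-scale

c_half <- matern_corr(r, nu = 1/2, ell = ell)  # should match exp(-r/ell)
c_32   <- matern_corr(r, nu = 3/2, ell = ell)  # should match (1 + sqrt(3) r/ell) exp(-sqrt(3) r/ell)

max_err_half <- max(abs(c_half - exp(-r / ell)))
max_err_32   <- max(abs(c_32 - (1 + sqrt(3) * r / ell) * exp(-sqrt(3) * r / ell)))
```

Both maximum errors come out at machine-precision level, confirming that the general definition reduces to the stated special cases.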
3.4 Kriging as Bayesian Prediction
We now frame spatial prediction within the Bayesian paradigm. Suppose we observe \(\mathbf{y} = (y(\mathbf{s}_1), \dots, y(\mathbf{s}_N))^\top\) and wish to predict the unknown value \(Y(\mathbf{s}_*)\) at a new location \(\mathbf{s}_*\). We adopt the hierarchical model:
\[\begin{equation}
y(\mathbf{s}_i) = \mu + U(\mathbf{s}_i) + \varepsilon_i, \qquad i = 1, \dots, N, \qquad U \sim \mathcal{GRF}(0, \sigma^2\mathbf{C}), \qquad \varepsilon_i \overset{\text{iid}}{\sim} \mathcal{N}(0, \sigma_e^2),
\end{equation}\]
where \(\mu\) is the constant mean, \(U(\mathbf{s})\) is the zero-mean latent GRF with process variance \(\sigma^2\) and correlation function \(\mathbf{C}\), and \(\varepsilon_i\) is independent Gaussian measurement noise with variance \(\sigma_e^2\) — called the nugget in geostatistics. The nugget accounts for both instrument error and microscale variability that cannot be resolved at the observation scale.
Define the cross-covariance vector \(\mathbf{k}_* = \sigma^2\mathbf{c}_*\), where \(\mathbf{c}_* = (\mathbf{C}(\mathbf{s}_1, \mathbf{s}_*), \dots, \mathbf{C}(\mathbf{s}_N, \mathbf{s}_*))^\top\) collects the correlations between the observation sites and the prediction site. The joint distribution of observations and prediction target \(Y_* = Y(\mathbf{s}_*)\) is then:
\[\begin{equation}
\begin{pmatrix} \mathbf{y} \\ Y_* \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu\mathbf{1} \\ \mu \end{pmatrix}, \begin{pmatrix} \boldsymbol{\Sigma}_{\mathbf{y}} & \mathbf{k}_* \\ \mathbf{k}_*^\top & \sigma^2 \end{pmatrix} \right),
\end{equation}\]
where \(\boldsymbol{\Sigma}_{\mathbf{y}} = \sigma^2 \mathbf{C}(\mathbf{S}, \mathbf{S}) + \sigma_e^2 \mathbf{I}\) is the \(N \times N\) observation covariance matrix, with \((i,j)\) entry \(\sigma^2\mathbf{C}(\mathbf{s}_i, \mathbf{s}_j) + \sigma_e^2\mathbf{1}_{[i=j]}\).
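To make these ingredients concrete, the following sketch assembles \(\boldsymbol{\Sigma}_{\mathbf{y}}\) and \(\mathbf{k}_*\) for a toy configuration. The exponential correlation (Matérn \(\nu = 1/2\)) and all parameter values are chosen purely for brevity; the Cholesky factorization at the end doubles as a positive-definiteness check.

```r
# Assemble the observation covariance Sigma_y and cross-covariance k*
# for a toy configuration (exponential correlation, for brevity).
set.seed(7)
S        <- matrix(runif(10), ncol = 2)      # 5 observation sites in [0,1]^2
s_star   <- c(0.5, 0.5)                      # prediction site
sigma2   <- 1.0                              # process variance (illustrative)
sigma_e2 <- 0.1                              # nugget variance (illustrative)
ell      <- 0.3                              # length-scale (illustrative)

D       <- as.matrix(dist(S))                             # site-to-site distances
Sigma_y <- sigma2 * exp(-D / ell) + sigma_e2 * diag(nrow(S))  # sigma^2 C(S,S) + sigma_e^2 I

d_star <- sqrt(rowSums((S - matrix(s_star, nrow(S), 2, byrow = TRUE))^2))
k_star <- sigma2 * exp(-d_star / ell)        # cross-covariance vector k*

L <- t(chol(Sigma_y))                        # succeeds iff Sigma_y is SPD
```

Note the diagonal of \(\boldsymbol{\Sigma}_{\mathbf{y}}\) equals \(\sigma^2 + \sigma_e^2\), exactly as the \((i,j)\) entry formula above states for \(i = j\).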
3.4.1 The Kriging Equations
By the standard formula for the conditional distribution of a partitioned multivariate Gaussian, the posterior of \(Y_* = Y(\mathbf{s}_*)\) given the data is:
\[\begin{equation}
Y_* \mid \mathbf{y} \sim \mathcal{N}\big(\hat{y}_*, \sigma^2_*\big), \qquad \hat{y}_* = \mu + \mathbf{k}_*^\top \boldsymbol{\Sigma}_{\mathbf{y}}^{-1} (\mathbf{y} - \mu\mathbf{1}), \qquad \sigma^2_* = \sigma^2 - \mathbf{k}_*^\top \boldsymbol{\Sigma}_{\mathbf{y}}^{-1} \mathbf{k}_*,
\end{equation}\]
where \(\mathbf{k}_* = \sigma^2\mathbf{c}_*\) is the cross-covariance vector defined above. Several key properties follow immediately:
Exact interpolation: if \(\sigma_e^2 = 0\) (no nugget), the Kriging mean passes exactly through the observations, i.e. \(\hat{y}_{\mathbf{s}_i} = y(\mathbf{s}_i)\), and the variance \(\sigma^2_*\) collapses to zero at those locations.
Reversion to prior: as \(\mathbf{s}_*\) moves far from all observation sites, \(\mathbf{k}_* \to \mathbf{0}\), so \(\hat{y}_* \to \mu\) and \(\sigma^2_* \to \sigma^2\) — the prediction falls back to the prior mean with full prior uncertainty.
Uncertainty is design-dependent: \(\sigma^2_*\) depends only on the spatial configuration \(\{\mathbf{s}_i\}\) and \(\mathbf{s}_*\), not on the observed values \(\mathbf{y}\). This means the prediction uncertainty can be evaluated — and the survey design can be optimized — before collecting data.
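The first two properties can be verified numerically on a toy 1-D problem. In the sketch below the exponential covariance and all values are illustrative, and `krig` is a throwaway helper written for this check, not part of any package:

```r
# Verify two Kriging properties on a 1-D toy problem:
# (1) with no nugget the predictor interpolates the data exactly;
# (2) far from the data the prediction reverts to the prior mean and variance.
s  <- c(0.1, 0.3, 0.5, 0.7)       # observation locations
y  <- c(1.2, -0.4, 0.8, 0.1)      # observed values
mu <- 0; sigma2 <- 1; ell <- 0.15 # illustrative parameters

cov_fun <- function(a, b) sigma2 * exp(-abs(outer(a, b, "-")) / ell)  # exponential
Sigma   <- cov_fun(s, s)          # no nugget: sigma_e^2 = 0

krig <- function(s_new) {
  k <- cov_fun(s_new, s)                          # cross-covariances
  m <- mu + k %*% solve(Sigma, y - mu)            # Kriging mean
  v <- sigma2 - rowSums(k * t(solve(Sigma, t(k))))  # Kriging variance
  list(mean = as.vector(m), var = as.vector(v))
}

at_data <- krig(s)    # predictions at the data sites: mean = y, var = 0
far     <- krig(100)  # far outside the data: mean -> mu, var -> sigma2
```

Running this confirms both claims: `at_data$mean` reproduces `y` to machine precision with vanishing variance, while `far` returns the prior mean and full prior variance.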
3.4.2 Practical Computation via Cholesky
Direct inversion of \(\boldsymbol{\Sigma}_{\mathbf{y}}\) is numerically unstable for large \(N\). In practice we use the Cholesky decomposition \(\boldsymbol{\Sigma}_{\mathbf{y}} = \mathbf{L}\mathbf{L}^\top\) and define the weight vector:
\[\begin{equation}
\boldsymbol{\alpha} = \boldsymbol{\Sigma}_{\mathbf{y}}^{-1}(\mathbf{y} - \mu\mathbf{1}),
\end{equation}\]
computed via two triangular solves: forward-solve \(\mathbf{L}\mathbf{v} = \mathbf{y} - \mu\mathbf{1}\), then back-solve \(\mathbf{L}^\top\boldsymbol{\alpha} = \mathbf{v}\). The Kriging mean and variance then become:
\[\begin{equation}
\hat{y}_* = \mu + \mathbf{k}_*^\top\boldsymbol{\alpha}, \qquad \sigma^2_* = \sigma^2 - \|\mathbf{L}^{-1}\mathbf{k}_*\|^2,
\end{equation}\]
where \(\|\mathbf{L}^{-1}\mathbf{k}_*\|^2 = \mathbf{v}_*^\top\mathbf{v}_*\) with \(\mathbf{v}_* = \mathbf{L}^{-1}\mathbf{k}_*\) obtained by a forward solve. The Cholesky factorization costs \(\mathcal{O}(N^3)\) and is performed once; each subsequent prediction mean then costs only \(\mathcal{O}(N)\) operations (a dot product with \(\boldsymbol{\alpha}\)), while the predictive variance requires one additional \(\mathcal{O}(N^2)\) triangular solve per location. This is the standard approach in production Kriging implementations.
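The equivalence between the two-triangular-solve recipe for \(\boldsymbol{\alpha}\) and a direct dense solve can be checked on a small synthetic system; the covariance model and sizes below are illustrative:

```r
# Check alpha = Sigma_y^{-1} (y - mu 1) computed via Cholesky triangular
# solves against a direct dense solve on a small synthetic SPD system.
set.seed(11)
N <- 40
s <- matrix(runif(2 * N), ncol = 2)            # random 2-D locations
D <- as.matrix(dist(s))
Sigma_y <- exp(-D / 0.3) + 0.1 * diag(N)       # correlation + nugget (illustrative)
y  <- rnorm(N)
mu <- 0.2

L     <- t(chol(Sigma_y))                      # lower-triangular factor
v     <- forwardsolve(L, y - mu)               # solve L v = y - mu 1
alpha <- backsolve(t(L), v)                    # solve L' alpha = v

alpha_direct <- solve(Sigma_y, y - mu)         # reference dense solve
```

The two results agree to numerical precision, but the triangular-solve route reuses the factor `L` across any number of right-hand sides.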
4 R Code
The following R code demonstrates the full Kriging workflow: simulating a GRF from a Matérn covariance, adding noise to create observations, and predicting on a dense grid.
```r
#======================================================
# Kriging with Matérn covariance
#======================================================
library(ggplot2)
library(gridExtra)

set.seed(42)

#------------------------------------------------------
# True parameters
#------------------------------------------------------
N_s     <- 80    # number of observation locations
range   <- 0.25  # Matérn length-scale
nu      <- 1     # Matérn smoothness
sigma_u <- 1.0   # process SD
sigma_e <- 0.15  # nugget (measurement noise) SD
mu_true <- 0.0   # constant mean mu

#------------------------------------------------------
# Matérn correlation function
#------------------------------------------------------
matern_cov <- function(coords1, coords2 = NULL, nu = 1, range, sigma2 = 1) {
  if (is.null(coords2)) coords2 <- coords1
  n1 <- nrow(coords1); n2 <- nrow(coords2)
  D <- as.matrix(dist(rbind(coords1, coords2)))[1:n1, (n1 + 1):(n1 + n2), drop = FALSE]
  D[D == 0] <- 1e-10  # avoid 0 argument to besselK
  s <- sqrt(2 * nu) * D / range
  C <- (2^(1 - nu)) / gamma(nu) * s^nu * besselK(s, nu)
  if (n1 == n2 && isTRUE(all.equal(coords1, coords2))) diag(C) <- 1
  sigma2 * C
}

#------------------------------------------------------
# Observation locations: N_s uniform points in [0,1]^2
#------------------------------------------------------
obs_locs <- matrix(runif(N_s * 2), ncol = 2,
                   dimnames = list(NULL, c("s1", "s2")))

#------------------------------------------------------
# Simulate the latent field U(s) ~ GRF(0, sigma2 C) via Cholesky
#------------------------------------------------------
Sigma_U <- matern_cov(obs_locs, nu = nu, range = range, sigma2 = sigma_u^2)
diag(Sigma_U) <- diag(Sigma_U) + 1e-9  # for numerical stability
L_U    <- t(chol(Sigma_U))
U_true <- as.vector(L_U %*% rnorm(N_s))

#------------------------------------------------------
# Observations: y(s)
#------------------------------------------------------
y_obs <- mu_true + U_true + rnorm(N_s, sd = sigma_e)

#------------------------------------------------------
# Kriging: build observation covariance Sigma_y = sigma2_u C(S,S) + sigma2_e I
#------------------------------------------------------
Sigma_y <- matern_cov(obs_locs, nu = nu, range = range, sigma2 = sigma_u^2) +
  (sigma_e^2 + 1e-9) * diag(N_s)
L_obs <- t(chol(Sigma_y))

# Kriging weight vector "alpha" via two triangular solves
alpha <- backsolve(t(L_obs), forwardsolve(L_obs, y_obs - mu_true))

#------------------------------------------------------
# Prediction: 30 x 30 regular grid on [0,1]^2
#------------------------------------------------------
N_grid <- 30
pred_grid <- as.matrix(expand.grid(
  s1 = seq(0, 1, length.out = N_grid),
  s2 = seq(0, 1, length.out = N_grid)))

#------------------------------------------------------
# Kriging prediction at each grid point
# Cross-covariance k* = sigma2 c* (N_pred x N_obs matrix)
#------------------------------------------------------
k_cross <- matern_cov(pred_grid, obs_locs, nu = nu, range = range, sigma2 = sigma_u^2)
krig_mean <- mu_true + as.vector(k_cross %*% alpha)
V         <- forwardsolve(L_obs, t(k_cross))
krig_var  <- pmax(sigma_u^2 - colSums(V^2), 0)
krig_sd   <- sqrt(krig_var)

#------------------------------------------------------
# Plots
#------------------------------------------------------
val_lim <- range(c(U_true, y_obs, krig_mean))

obs_df <- data.frame(s1 = obs_locs[, "s1"], s2 = obs_locs[, "s2"],
                     U = U_true, y = y_obs)
pred_df <- data.frame(s1 = pred_grid[, "s1"], s2 = pred_grid[, "s2"],
                      mean = krig_mean, sd = krig_sd)

# True spatial field at observation locations
p_true <- ggplot(obs_df, aes(s1, s2, colour = U)) +
  geom_point(size = 3.5) +
  scale_colour_distiller(palette = "RdBu", direction = -1,
                         limits = val_lim, name = expression(U(bold(s)))) +
  coord_equal() +
  theme_bw(base_size = 11) +
  theme(panel.grid = element_blank()) +
  labs(title = "True spatial field U(s)",
       subtitle = bquote(sigma == .(sigma_u) ~ " range =" ~ .(range) ~ " " * nu == .(nu)),
       x = expression(s[1]), y = expression(s[2]))

# Observations
p_obs <- ggplot(obs_df, aes(s1, s2, colour = y)) +
  geom_point(size = 3.5) +
  scale_colour_distiller(palette = "RdBu", direction = -1,
                         limits = val_lim, name = expression(y(bold(s)))) +
  coord_equal() +
  theme_bw(base_size = 11) +
  theme(panel.grid = element_blank()) +
  labs(title = "Observations y(s)",
       subtitle = bquote(sigma[e] == .(sigma_e)),
       x = expression(s[1]), y = expression(s[2]))

# Kriging mean
p_krig_mean <- ggplot(pred_df, aes(s1, s2, fill = mean)) +
  geom_raster() +
  geom_point(data = obs_df, aes(s1, s2),
             inherit.aes = FALSE, colour = "black", size = 1.2) +
  scale_fill_distiller(palette = "RdBu", direction = -1,
                       limits = val_lim,
                       name = expression(hat(y)(bold(s)["*"]))) +
  coord_equal() +
  theme_bw(base_size = 11) +
  theme(axis.text = element_blank(), axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  labs(title = "Kriging mean", x = NULL, y = NULL)

# Kriging posterior SD
p_krig_sd <- ggplot(pred_df, aes(s1, s2, fill = sd)) +
  geom_raster() +
  geom_point(data = obs_df, aes(s1, s2),
             inherit.aes = FALSE, colour = "black", size = 1.2) +
  scale_fill_distiller(palette = "YlOrRd", direction = 1,
                       limits = c(0, NA),
                       name = expression(sigma["*"](bold(s)["*"]))) +
  coord_equal() +
  theme_bw(base_size = 11) +
  theme(axis.text = element_blank(), axis.ticks = element_blank(),
        panel.grid = element_blank()) +
  labs(title = "Kriging uncertainty (posterior SD)", x = NULL, y = NULL)

grid.arrange(p_true, p_obs, ncol = 2)
grid.arrange(p_krig_mean, p_krig_sd, ncol = 2)
```
5 Conclusion
In this post we have developed Kriging prediction under Bayesian inference. The GRF assumption turns spatial interpolation into conditional Gaussian inference: given observations \(\mathbf{y}\), the posterior at any unobserved location \(\mathbf{s}_*\) is a normal distribution whose mean \(\hat{y}_*\) and variance \(\sigma^2_*\) are available in closed form through simple linear algebra. The reason Bayesian Kriging delivers uncertainty estimates is structural: by treating the unobserved field \(U(\mathbf{s})\) as a random variable rather than a fixed unknown, the prediction at any new location \(\mathbf{s}_*\) is not a single number but a full posterior distribution \(Y_* \mid \mathbf{y} \sim \mathcal{N}(\hat{y}_*, \sigma^2_*)\). The variance \[\begin{equation}
\sigma^2_* = \sigma^2 - \mathbf{k}_*^\top \boldsymbol{\Sigma}_{\mathbf{y}}^{-1} \mathbf{k}_*
\end{equation}\]
measures how much the observations have reduced our prior uncertainty \(\sigma^2\) at that specific location. A purely deterministic interpolation method such as splines or inverse distance weighting would produce a comparable predicted surface but discard this second quantity entirely. Retaining it matters because predictions without uncertainty are incomplete: two locations can share the same predicted grade, yet one might sit surrounded by boreholes while the other lies in an unsampled region of the domain, and \(\sigma^2_*\) distinguishes them exactly. In practice this matters for decision-making (whether to approve a mining block for extraction, allocate a drilling budget, or classify a resource as measured versus inferred) because all of those decisions depend not just on the expected value but on how much confidence one can attach to it.
To round out the conclusion, several further aspects of this framework deserve emphasis:
The covariance function encodes all prior knowledge about the spatial structure \(\rightarrow\) the Matérn family gives fine-grained control over both the smoothness (via \(\nu\)) and the range of spatial dependence (via \(\ell\)). Choosing these parameters (either by expert knowledge or by maximum likelihood) is the central modeling decision in any Kriging interpolation.
The nugget \(\sigma_e^2\) plays a dual role: statistically it represents measurement error; geometrically it prevents the covariance matrix from becoming singular and controls the degree of interpolation versus smoothing.
The posterior variance \(\sigma^2_*\) is a map of data coverage: it does not depend on the observed values, only on the sampling design. This property is central to optimal spatial design: one can identify under-sampled regions and decide where to collect future observations before seeing any data.
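The nugget's geometric role described above is easy to demonstrate: with an exactly duplicated observation site (for example, a repeated measurement at one location) the correlation matrix has two identical rows, so the Cholesky factorization fails, while a small nugget on the diagonal restores positive definiteness. A toy sketch, where `chol_ok` is an ad-hoc helper written for this check:

```r
# Geometric role of the nugget: a duplicated site makes the correlation
# matrix exactly singular; adding a nugget restores positive definiteness.
s <- c(0.2, 0.2, 0.5, 0.8)          # note the exactly duplicated site at 0.2
D <- as.matrix(dist(s))
Sigma <- exp(-(D / 0.3)^2)          # smooth (Gaussian) correlation, no nugget

# Helper: does the Cholesky factorization succeed?
chol_ok <- function(M) !inherits(try(chol(M), silent = TRUE), "try-error")

no_nugget   <- chol_ok(Sigma)                   # FALSE: two identical rows
with_nugget <- chol_ok(Sigma + 0.01 * diag(4))  # TRUE: nugget regularizes
```

This is exactly why the full workflow code above adds `1e-9` to the diagonal even when simulating the latent field, and why \(\sigma_e^2 \mathbf{I}\) appears in \(\boldsymbol{\Sigma}_{\mathbf{y}}\).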
6 Comments
An interesting comment was made on LinkedIn by Ruben Roa Ureta (view the comment on LinkedIn). In the next post I will take up his observation and compare the Maximum Likelihood method with Bayesian inference via MCMC (in estimation terms only), along with uncertainty quantification for Kriging predictions!