测试
Combining heterogeneous spatial datasets with process-based spatial fusion models: A unifying framework
Author links open overlay panelCraig Wang a, Reinhard Furrer a b, SNC Study Group
Show more
Outline
Add to Mendeley
Share
Cite
https://doi.org/10.1016/j.csda.2021.107240Get rights and content
Under a Creative Commons license
open access
Abstract
In modern spatial statistics, the structure of data has become more heterogeneous. Depending on the types of spatial data, different modeling strategies are used. For example, kriging approaches for geostatistical data; Gaussian Markov random field models for lattice data; or log Gaussian Cox process models for point-pattern data. Despite these different modeling choices, the nature of underlying data-generating (latent) processes is often the same, which can be represented by some continuous spatial surfaces. A unifying framework is introduced for process-based multivariate spatial fusion models. The framework can jointly analyze all three aforementioned types of spatial data or any combinations thereof. Moreover, the framework accommodates different likelihoods for geostatistical and lattice data. It is shown that some established approaches, such as linear models of coregionalization, can be viewed as special cases of the proposed framework. A flexible and scalable implementation using R-INLA is provided. Simulation studies confirm that the prediction of latent processes improves as one moves from univariate spatial models to multivariate spatial fusion models. The framework is illustrated via a case study using datasets from a cross-sectional study linked with a national cohort in Switzerland. The differences in underlying spatial risks between respiratory disease and lung cancer are examined in the case study.
Keywords
Bayesian methods
Change of support problem
Data fusion
Gaussian process
1. Introduction
In modern spatial statistics, researchers are dealing with increased heterogeneity in the structure of collected spatial data. Different data sources may contain overlapping information concerning the same research questions. In addition, combining different datasets can mitigate the problem of limited sample size and imperfect measurements (Diggle and Lophaven, 2006). In public health and epidemiology, joint analysis of multiple diseases borrows information from related diseases to account for the underlying correlations and to uncover common risk factors.
Spatial models are useful when residuals exhibit correlation in space after known covariates are accounted for in a regression-type setting. Spatial data has long been classified into three categories, namely geostatistical (point-level) data, lattice (area-level) data and point-pattern data (Cressie, 1991). Depending on data types, different statistical models that capture the residual spatial correlation are used. In a nutshell, (1) geostatistical data are observations with geo-coordinates. For example, rainfall measurements at locations of weather stations (Kyriakidis et al., 2001). The strength of dependency can be modeled as a function of the distance separation between two locations. (2) Lattice data can be either gridded or irregularly aligned, and occur in the form of aggregated observation over areas. They are often collected in epidemiology, such as disease prevalence of each district (Chammartin et al., 2016). Gaussian Markov random field (GMRF) models such as conditionally autoregressive models are typically used to capture the spatial dependency between neighboring areas. Finally, (3) point-pattern data are observations where the locations themselves are stochastic. They are used to model epidemiological data of case locations (Gatrell et al., 1996) or other event locations such as epicenter of earthquakes (Ogata, 1988). One approach to model such data is using a point process, where the intensity function may depend on observed covariates (Gelfand et al., 2010).
Despite there are different modeling strategies for different types of data, a common purpose of all the aforementioned statistical models is to capture the residual spatial dependency between different observations after known covariates are accounted for. A natural way to do this is using Gaussian processes to model continuous spatial surfaces, which represent the underlying scientific process that drives response variables together with observed covariates. It is beneficial to jointly model heterogeneous types of spatial data on the underlying process instead of observed spatial support. Several authors (Moraga et al., 2017, Shi and Kang, 2017, Wang et al., 2018) demonstrated improved prediction performance when multiple datasets were modeled together compared to a single dataset via both simulation study and real-world applications. In addition, better scientific interpretations can be obtained by modeling different data types at the underlying process (Møller et al., 1998, Kelsall and Wakefield, 2002).
In order to model spatial data at a different support than that from the observed data, change of support problem occurs. There are methods proposed to analyze lattice data and point-pattern data using Gaussian process-based models. Kelsall and Wakefield (2002) modeled aggregated disease counts using a Gaussian process approach. Point-pattern data can be linked to Gaussian process with a log-Gaussian Cox process (LGCP) (Møller et al., 1998, Diggle et al., 2013). Taking a step further, several approaches used Gaussian process as a basis to fuse spatial data of different types. We refer to these approaches as spatial fusion models. Some earlier work on spatial fusion models (Berrocal et al., 2010, McMillan et al., 2010, Sahu et al., 2010) implemented efficient algorithms for specific model structures bundled with their data applications. Moraga et al. (2017) proposed a model to analyze spatial data available in both geostatistical and lattice type, with the same set of covariates and a response variable observed at two different spatial resolutions. Wilson and Wakefield (2018) extended the work by allowing non-Gaussian response variables. Shi and Kang (2017) proposed a fixed rank kriging-based fusion model to combine multiple lattice-type remote sensing datasets. Wang et al. (2018) utilized a nearest neighbor Gaussian process to combine geostatistical and lattice data with a flexible model structure.
In general, spatial fusion models are challenged by the trade-off between flexibility and computational efficiency, i.e., more flexible modeling structures require more generalizable inferential tools that come with a higher computational cost. For example, the fusion model in Wilson and Wakefield (2018) with normally-distributed response variables can take advantage of stochastic partial differential equation (SPDE) approaches and INLA which took few minutes to run. A more flexible modeling structure for Poisson-distributed response requires using Hamiltonian Monte Carlo-based inference method and took several weeks.
In this paper, we build on the work in Wang et al. (2018) and extend fusion modeling in two aspects. In terms of flexibility, our framework incorporates an additional data type, namely point-pattern data. To the best of our knowledge, this is the first spatial fusion framework that incorporates all three types of spatial data. We additionally allow arbitrary combinations of those three data types in multivariate settings. We propose a unifying framework that includes features of several well-established models, such as linear models of coregionalization (LMC) (Wackernagel, 2003, MacNab, 2016), spatial factor model (Wang and Wall, 2003) and shared component model (Knorr-Held and Best, 2001). In terms of computational cost, we offer an efficient SPDE and INLA-based implementation which significantly reduces computation time from hours to minutes for thousands of observations. Last but not least, we benchmark the performance of our implementation in terms of prediction and parameter estimation in simulation studies.
The rest of this paper is structured as follows: Section 2 introduces the unifying framework and explicitly shows its link to existing spatial models. Section 3 discusses implementation strategy using the SPDE approach and INLA. Section 4 illustrates the framework using two simulated scenarios and a re-analysis on epidemiological datasets to model respiratory disease risk in Kanton Zurich, Switzerland. Finally, we end with a summary followed by discussions on identifiability problems and research outlook in Section 5.
2. Process-based spatial fusion model
2.1. The unifying framework
For $$$$, we let $$$$ denote the $$$$th response variable with $$$$ observations, with a likelihood that belongs to the exponential family. Each of the $$$$ responses can take any of the following data types: (i) geostatistical data, observed at locations $$$$; (ii) lattice data observed at areas $$$$; or (iii) point-pattern data that has been discretized to regular fine grid containing mostly zeros or ones, observed at gridded locations $$$$, where $$$$ denotes the number of events in the grid cell containing $$$$. Further, we let $$$$ denote a full (column) rank $$$$ matrix of spatially-referenced covariates that are observed at the same spatial units as the corresponding response variables, $$$$ denote a vector $$$$ of fixed effect coefficients. We assume there is a $$$$ vector of zero-mean, unit variance, independent latent Gaussian processes $$$$ having a $$$$ design matrix $$$$ with rows $$$$, i.e. $$$$ is the $$$$th linear combination of Gaussian processes. Each Gaussian process is parameterized by its own covariance function. Finally, non-linear operator $$$$ subsets and aggregates some components of $$$$ such that the linear combination matches the spatial resolution of corresponding response variable. Overall, the framework is formulated as (1)$$$$where $$$$ is a link function that corresponds to the conditional distribution of $$$$.
Although the latent processes have a continuous index, we work with a finite set of locations in practice. The set of locations $$$$ to be modeled in the latent processes $$$$ comprises of locations where geostatistical data are observed, locations of sampling points for lattice data and gridded locations for point-pattern data. The non-linear operator $$$$ takes a different form depending on data types. For geostatistical and point-pattern data, the non-linear operator $$$$ subsets $$$$ to the corresponding locations of the $$$$th response variable. For lattice data, a change of support problem arises (Gotway and Young, 2002) since we only observe aggregated information while the underlying process is continuous. When the link function is linear, $$$$ subsets $$$$ to the sampling point locations and aggregates them to the corresponding areas by taking averages. With non-linear link functions, $$$$ first applies an inverse link function and then aggregates.
When modeling lattice data, we employ a sampling-points approximation approach to stochastic integrals (Gelfand et al., 2001, Fuentes and Raftery, 2005) for aggregating latent processes. Let $$$$ denote the set of all sampling points and let $$$$ denote the number of sampling points of each area. Further, to simplify the notation, we denote with $$$$ an arbitrary component of the $$$$ spatial latent process. For the $$$$th area $$$$ in $$$$th response, under a linear link function we obtain (2)$$$$with $$$$ being the area of $$$$, i.e. the latent process at area $$$$ is approximated by aggregating the process at sampling points within the area. Ecological bias will arise from non-linear link functions (Greenland, 1992), that is, the sum of non-linear functions applied to individual $$$$ is different from the non-linear function applied to the sum of individual $$$$’s. For a general link function $$$$, we can avoid such bias by using the following approximation instead of (2), (3)$$$$Under the identity link function $$$$, Eq. (3) is equivalent to Eq. (2). For Poisson response with log link function $$$$, Eq. (3) becomes (4)$$$$Typically, a small $$$$ between 5 to 10 is chosen to balance the trade-off between computational efficiency and model accuracy (Fuentes and Raftery, 2005, Liu et al., 2011).
2.2. Linking to existing models
Our proposed unifying framework utilizes elements from existing literature and combines them to create a flexible yet efficient spatial fusion model framework. As a result, there are some connections in terms of model structure between this framework and other established methods in spatial statistics. At the same time, they share the same potential identifiability issues.
In univariate settings, the unifying framework allows us to model each type of spatial data individually with a latent Gaussian process. When we have geostatistical data, the framework results in a geostatistical regression (Cressie, 1991). With Poisson-distributed lattice data, we obtain a sampling-points approximation to the model used in Kelsall and Wakefield (2002), which is an alternative modeling strategy to Besag–York–Mollé model (Besag et al., 1991). With point-pattern data, we obtain a discretized LGCP (Møller et al., 1998).
In multivariate geostatistical data settings, the design matrix $$$$ plays a pivotal role in the identifiability of model parameters. When the number of independent Gaussian processes is less than the number of responses $$$$, we obtain a spatial factor model (Wang and Wall, 2003). The latent spatial factors are assumed to have zero-mean unit-variance Gaussian processes, such that $$$$ controls the variance (partial sill) of latent processes. When $$$$, we obtain a general coregionalization framework (Wackernagel, 2003, Schmidt and Gelfand, 2003). A similar LMC framework also exists for lattice data (MacNab, 2016). Identifiability issues occur in the LMC since the number of latent values to be estimated in the latent processes is equal to the total number of observations in response variables. Additional spatial hyper-parameters and fixed-effect coefficients also need to be estimated. For this reason, regularization is done via one of the following: (1) employing empirical Bayes method by fixing some of the hyper-parameters; (2) choosing informative prior distributions in Bayesian models; or (3) using a lower triangular matrix for $$$$ (Schmidt and Gelfand, 2003). In cases of $$$$, we acquire a similar model structure as shared component models (Knorr-Held and Best, 2001) for Gaussian processes, where multiple outcomes have their own latent spatial components plus some shared spatial components. In this setting, the values in $$$$ need to be even further constrained to avoid identifiability issues (Knorr-Held and Best, 2001).
Our framework is also linked to other process-based spatial data fusion models that combine geostatistical and lattice data types. When we let the response variables represent the same information with different data types, we obtain the model presented in Wilson and Wakefield (2018), where an explicit relationship is used to link multiple response variables. If we further allow different information to be represented in the response variables, we reach the generalized spatial fusion model framework proposed in Wang et al. (2018).
To the best of our knowledge, there is no existing approach or implementation that jointly models all three types of spatial data in a multivariate framework. With those links to the existing approaches, our framework extends upon them by combining different features and enhances the overall flexibility of spatial fusion models.
3. Model implementations
It is well known that fitting full Gaussian processes in Bayesian models is computationally expensive in both univariate and multivariate settings. Marginalized and conjugate Gaussian process models dramatically save computation time but they can only be used when geostatistical data with normally-distributed outcomes is fitted (Banerjee et al., 2014, Zhang et al., 2019). There are several approaches to reduce the computational burden, such as low rank (Cressie and Johannesson, 2008, Banerjee et al., 2008, Stein, 2008) and sparse (Furrer et al., 2006, Rue et al., 2009, Datta et al., 2016) methods. Some of those approaches are utilized in existing spatial fusion models. Shi and Kang (2017) adapted the spatial basis function approach from fixed rank kriging (Cressie and Johannesson, 2008). Moraga et al. (2017) used integrated nested Laplace approximations (INLA) (Rue et al., 2009). Wang et al. (2018) exploited the nearest neighbor Gaussian process (NNGP) (Datta et al., 2016). In this paper, we offer an efficient implementation strategy for the unifying spatial fusion model framework. The strategy follows Wilson and Wakefield (2018) to use the SPDE approach and INLA, with additional approximations for non-linear link functions.
Although the computational efficiency can be improved by using NNGPs instead of full Gaussian processes, it is still not feasible to fit multiple latent processes with more than thousands of locations in $$$$. Therefore, we choose the SPDE approach and INLA for the implementation of the fusion models.
Lindgren et al. (2011) established a connection between Matérn Gaussian process and GMRFs through a SPDE approach, where a Gaussian process over finite collection of locations can be approximated by triangulating the spatial domain and using a weighted sum of basis functions as (5)$$$$where $$$$ is the number of points in the triangulation, $$$$ are Gaussian distributed weights and $$$$ are basis functions. The weights $$$$ forms a GMRF which follows a multivariate normal distribution with a sparse precision matrix (Rue and Held, 2005) that makes computation efficient. In this approximation approach, the covariance function of the Gaussian process must be a member of the Matérn family defined as (6)$$$$where $$$$ is the Euclidean distance between $$$$ and $$$$, $$$$ is the modified Bessel function of second kind with integer order $$$$, $$$$ is the partial sill, $$$$ relates to the spatial range and $$$$ is the smoothness parameter. The approximation in Eq. (5) can be written as $$$$, where $$$$ is a projection matrix that maps a GMRF defined on the triangulation mesh nodes to the observations’ locations.
INLA, which is suitable for a wide range of latent Gaussian models, can be used to fit this approximation approach for $$$$ (Lindgren and Rue, 2015). The key to implementing the spatial fusion models in INLA lies within the projection matrix, with a different structure required for each data type (Krainski et al., 2018).
•
For geostatistical data, the $$$$th row of the projection matrix corresponds to the $$$$th location, it is filled with zeros except where (1) the location is on the $$$$th vertex, then the $$$$th column is ones or (2) the location is within a triangulation area, then three cells get values based on a mixture of barycentric based weights from three neighboring vertices of the triangulation.
•
For lattice data, we construct a projection matrix that links the $$$$th area with the mean value of GRF at mesh nodes which falls into the $$$$th area in analogous to Eq. (2). If the link function is linear, increasing the mesh density will increase the number of mesh nodes that fall into each area, therefore, better approximates the average. However, it is sufficient to have a sparse mesh for non-linear link functions (Follestad and Rue, 2003).
•
Finally, for the point-pattern data, we use an augmentation approach by Simpson et al. (2016), which avoids discretizing the spatial domain into grid cells. The projection matrix is built as an identity matrix with a dimension equal to the total number of mesh nodes, row-binded with a projection matrix that is constructed on observed locations in the same way as the geostatistical case.
The final model fitting is done by stacking the projection matrices corresponding to each response variable together using inla.stack() and assigning appropriate priors using the INLA (Lindgren and Rue, 2015) package in R.
Recent advances in INLA (Martins et al., 2013) such as allowing multiple likelihoods and ‘copy’ feature made this implementation possible. When the response variables follow different distributions, the family argument in inla() is used to assign them. The underlying latent process needs to be specified only once and then assigned to different response variables using the copy argument in f() when specifying a model formula. Example R code is available in the supplementary materials.
4. Illustrations
In this section, we conduct two simulation studies and an analysis of epidemiological datasets to illustrate our proposed framework. We do not include comparisons with existing methods since they are either inflexible to be fitted with our model structure or computationally infeasible for the datasets. All results are obtained in R version 3.5.0 (R Core Team, 2018), on a Linux server with 256 GB of RAM and two Intel Xeon 6-core 2.5 GHz processors. All R code used in the simulation studies is provided in the supplementary materials.
4.1. Simulation study one
We are interested in modeling a single latent spatial process within a $$$$ square, using three spatial response variables with one response variable from each data type. First, we simulate a zero-mean GRF on densely uniformly distributed locations with a covariance matrix $$$$. We then sub-sample 200 locations to obtain the latent process at observed locations. For lattice observations, we divide the square into 100 Voronoi cells and compute aggregated GRF from all locations while accounting for ecological bias using Eq. (3). In addition, we generate covariates $$$$ and $$$$ for geostatistical and lattice response by sampling independently from a standard normal distribution. Afterwards, we generate a normally-distributed geostatistical response at the same sampled locations and a Poisson-distributed lattice response for each area. For point-pattern observations, we simulate from the same GRF on a coarse 20 × 20 grid, then exponentiate the values to obtain intensity at the grid cells. Afterwards, we generate Poisson point process using each intensity value multiplied by cell area and an offset term as the final intensity. In summary, the response variables are generated according to $$$$$$$$(7)$$$$ In the simulation, we use an exponential covariance function ($$$$), i.e. $$$$, $$$$, $$$$ two spatial locations in $$$$. The influence of different sample sizes and combination of spatial hyperparameters on predictive performance was investigated in a fusion model by Wang et al. (2018), therefore, we only consider a single setup by setting $$$$ and $$$$. In addition, we set $$$$ and $$$$. $$$$ is a constant term that controls the density of point process, and it is assigned as the product of grid cell area and an offset which takes value 0.25. $$$$ and $$$$ multiplies $$$$ with a matrix of zeros and ones to subset it to the locations of $$$$ and gridded locations of $$$$, while $$$$ applies Eq. (4) on the subset of sampling locations for $$$$.
We consider seven different model specifications within our proposed framework: three univariate models using a single data type each, namely one of geostatistical, lattice and point-pattern data; three fusion models using different combinations of two data types; and a multivariate fusion model combining all three response variables. We use penalized complexity (PC) prior for Matérn GRF (Fuglstad et al., 2019) with $$$$ fixed at 1.5, corresponding to the exponential covariance function. In addition, we choose the median practical spatial range to be 2 (corresponds to the median of $$$$ being 1) and the probability of $$$$ greater than 1.7 is 5%. The remaining priors are default options from R-INLA (Lindgren and Rue, 2015), i.e., a zero-mean normal distribution with precision 0.001 for the coefficients $$$$ and $$$$; and a Gamma distribution with shape being 1 and rate being 0.00001 for the precision $$$$. The simulation is repeated 100 times, each model runs just under one minute.
We choose an additional 1600 sites from a regular grid to evaluate the predictive performance of models on the latent process in terms of posterior standard deviation and root mean squared prediction errors (RMSPE) given by (8)$$$$under each scenario. In Eq. (8), $$$$ denotes the posterior median. Note that the latent process is kept the same under each simulation for a fair comparison. The prediction sites are located at the centers of a 40 × 40 grid that uniformly covers the sampling domain. Their predictive performance is shown in Fig. 1. The first Venn diagram shows the average posterior standard deviation over 100 simulations under different models. The second Venn diagram shows average RMSPEs. The point-process data only model has the lowest posterior standard deviations, however, its RMSPE is the largest. Overall, the RMSPEs are smaller in multivariate fusion models compared to univariate process-based models. The joint modeling of all three types of spatial data has the lowest RMSPE on the prediction of the latent process at unobserved locations, demonstrating the benefits of spatial fusion models.
Fig. 1. Venn diagram of mean posterior standard deviation and root mean squared prediction error for the unifying fusion framework fitted to each data type (and combinations thereof). Values in overlapping areas indicate results from models with multiple data types.
4.2. Simulation study two
In the second simulation study, we simulate a new scenario with three response variables of different types and two latent processes to evaluate the parameter estimates of a multivariate fusion model. We firstly simulate two independent zero-mean unit-variance GRFs $$$$ with exponential covariance function distributed on the spatial domain of $$$$ square, then compute the sub-sampled and aggregated GRF for each response variable in the same way as in simulation one.
Each response depends on the latent processes via the design matrix (9)$$$$The first geostatistical response variable only depends on the first latent process, the second lattice response variable depends on both latent processes, while the third point pattern response variable depends only on the second latent process. The response variables are generated as follows, $$$$$$$$(10)$$$$ where $$$$ consists of 500 geostatistical observations, $$$$ has 100 lattice observations and $$$$ represents the number of events observed at each of 400 cells on a 20 × 20 grid. In addition, we set $$$$, $$$$ and $$$$. $$$$ and $$$$ are defined similarly as in simulation one. Since we have two latent processes in the simulation, using any of the univariate model or fusion model with two response variables can lead to identifiability problem. Therefore, we estimate the parameters using the unifying spatial fusion model with three responses only. The model and their prior specifications are the same as in simulation one, except for the spatial range parameter and the design matrix. The prior for both $$$$ and $$$$ is a PC prior with median practical spatial range $$$$ (corresponds to the median of $$$$ being 10). For implementation purposes, the design matrix components are parameterized differently in INLA. $$$$ and $$$$ are treated as $$$$ and $$$$ of each latent process, while $$$$ and $$$$ are coefficients for the latent processes with variance $$$$ and $$$$. The former has the same prior as in simulation one, the latter has a normal prior with a mean of 1 and a standard deviation of 10. The simulation is run 100 times.
The parameter estimates based on posterior medians are displayed in Fig. 2. The PC prior in R-INLA penalizes complex structure in GRF hence tends to have a slightly over-estimated range and underestimated coefficients in $$$$ (Fuglstad et al., 2019). One example of the posterior median of fitted latent processes at locations with geostatistical observations is shown in Fig. 3. The figure shows both $$$$ and $$$$ are fitted well and the plot of fitted $$$$ and $$$$ shows their independence structure assumed in the model. The root mean squared errors are 0.54 and 0.48 for the first and second latent processes. The computation time for the INLA implementation of the fusion model is 11 min.
Fig. 2. Distribution of posterior median estimates from the unifying spatial fusion model in simulation two. The black vertical line indicates the true parameter value, dashed curves indicate prior density.
Fig. 3. One example of true versus fitted latent process at locations with geostatistical observation in simulation two, for $$$$, $$$$ and fitted $$$$ against $$$$ respectively. Pearson’s correlation coefficients $$$$ are displayed.
4.3. Application to LuftiBus-SNC dataset
In spatial epidemiology, joint analysis of multiple diseases with similar etiology allows us to separate underlying risk factors into shared and disease-specific components. In this analysis, we examine the disease-specific spatial risk surface of lung cancer and shared spatial risk components between lung cancer and respiratory disease while taking PM10 (particulate matter with diameter $$$$) pollution surface into consideration. Previously, a similar analysis was done in Wang et al. (2018) considering combined lung cancer and respiratory disease risk only.
Chronic lung disease contributes substantially to morbidity and mortality worldwide, with chronic obstructive pulmonary disease (COPD) being the third leading cause of death (Lozano et al., 2012). Forced expiratory volume in one second (FEV1) is a measure of the amount of air a person can exhale during a pulmonary test and it can be used to diagnose disease and predict respiratory-related mortality (Menezes et al., 2014). While respiratory disease and lung cancer share many common risk factors such as smoking and exposure to air pollution, it is of interest to examine the lung cancer-specific spatial risk component. It may provide insights into identifying risk factors that are solely associated with lung cancer.
Initiated as a health promotion campaign by Lunge Zürich (2017) in Switzerland, the ‘LuftiBus’ project collected lung function measurements including FEV1 and demographic information from local residents. Data from LuftiBus observed between 2003 and 2012 were deterministically linked with the census-based Swiss National Cohort (SNC) (Bopp et al., 2008). The purpose was to obtain 44,071 individuals with demographic, health and environmental variables in Switzerland. More importantly, the linkage provides us with the residential location of individual participants.
For lattice data, we use the computed expected cause-specific (respiratory and lung cancer, respectively) mortalities in each municipality, adjusted by 5-year age-group and gender based on the SNC data ($$$$ $$$$ 572,993). For point process data, we sample from a PM10 pollution map in 2010 (Geoinformation Kanton Zurich, 2015) to obtain proxy pollution sources. The probability of sampling each location is proportional to standardized PM10 annual mean value. We obtain 253 locations which are mainly distributed along highway roads.
We assume three latent spatial risk surfaces that are associated with respiratory disease and lung cancer. The first risk surface, $$$$, is shared between FEV1, respiratory mortality and lung cancer mortality, while the second risk surface $$$$ is lung cancer-specific risk surface and the third risk surface $$$$ is shared between PM10 and the other three disease-related variables. Typically with lattice data, multivariate conditional autoregressive models allow us to jointly analyze multiple responses and identify different latent components (Jin et al., 2005). Because municipal boundaries are artificial, we argue that a continuous spatial surface is a more natural modeling assumption. Therefore, we use our process-based unifying framework to conduct the analysis. Another advantage is that it allows us to incorporate the rich FEV1 data from Luftibus as well as additional pollution data.
The fusion model is structured as $$$$$$$$$$$$(11)$$$$ where $$$$ and $$$$ are the expected cause-specific mortalities. We set some coefficients of the design matrix $$$$ to zero and directly depict the association between each response variable and the latent processes using the resulting product of $$$$ and $$$$. The terms $$$$ and $$$$ are set to be negative since FEV1 is assumed to decrease as the latent risks increase. In addition, we add an additional intercept for $$$$ to indicate constant background pollution.
More than 60% of the FEV1 measurements in the linked dataset are located in the Canton of Zurich, therefore, we restrict our analysis to the Canton of Zurich. In addition, we focus the analysis on people who are 40 years or older, which results in 16,160 geostatistical observations. We use PC prior for the latent components with $$$$ corresponding to exponential covariance function, median practical range of 1 km and median $$$$ of 1. Fig. 4 shows the locations of geostatistical observation, standardized mortality ratio for respiratory disease and lung cancer and point process locations for pollution source proxy. Table 1 shows parameter estimates and Fig. 5 shows the transformed posterior estimates of the latent processes representing relative risk surfaces in the Canton of Zurich. The shared risk components between FEV1, respiratory mortality, and lung cancer mortality is the highest in urban areas, with an effective range of 1.1 km (95% CI: 0.5, 2.1) based on the exponential covariance function. The estimated relative risk is computed by exponentiating the latent process, which varies between 0.91 and 1.13. Meanwhile, high-risk areas of lung cancer-specific components are scattered around the Canton of Zurich, mainly in the north and west regions with an effective range of 0.9 km (95% CI: 0.3, 3.1). The variability is larger than the shared component with values between 0.90 and 1.20. The lung cancer-specific risk component is modeled via lattice data $$$$ only, hence it appears to have some block-wise structures compared to the shared component and have larger variability in the range estimation. Such results complement classical lattice data-based disease mapping approaches to identify potential disease hotpots at a finer resolution. The pollution surface is shared between all four response variables and has the longest effective range with 5.9 km (95% CI: 3.3, 9.4). It is the strongest around city centers.
Fig. 4. Data used in the fusion model. Top left: locations of FEV1 observations. Top right: respiratory standardized mortality ratio. Bottom left: lung cancer standardized mortality ratio. Bottom right: sampled locations with density proportional to PM10 annual mean.
Table 1. Parameter estimates and their 95% posterior credible intervals (95% CI) for the LuftiBus-SNC dataset. $$$$ and $$$$ are in meters.
Parameter | Median | 95% CI | Parameter | Median | 95% CI |
---|---|---|---|---|---|
$$$$ | 4.74 | (4.70, 4.79) | $$$$ | 0.0809 | (0.0539, 0.122) |
$$$$ | 0.907 | (0.891, 0.923) | $$$$ | 0.0566 | (0.0324, 0.0796) |
$$$$ | $$$$0.0375 | ($$$$0.0382, $$$$0.0368) | $$$$ | 0.0855 | (0.0537, 0.137) |
$$$$ | $$$$0.0643 | ($$$$0.130, $$$$0.00387) | $$$$ | 0.141 | (0.0868, 0.228) |
$$$$ | $$$$0.119 | ($$$$0.191, $$$$0.0502) | $$$$ | 0.0902 | (0.0386, 0.172) |
$$$$ | $$$$0.0501 | ($$$$0.178, $$$$0.0833) | $$$$ | 0.500 | (0.208, 1.170) |
$$$$ | 0.268 | (0.263, 0.274) | $$$$ | 0.159 | (0.0931, 0.251) |
$$$$ | 367 | (173, 690) | $$$$ | 0.123 | (0.0824, 0.168) |
$$$$ | 307 | (98, 1030) | |||
$$$$ | 1970 | (1110, 3130) |
Fig. 5. Estimated spatial relative risk surfaces. Left: shared component between FEV1, respiratory mortality and lung cancer mortality, $$$$. Middle: lung cancer mortality-specific component, $$$$. Right: shared component between PM10 and all three disease-related response variables, $$$$.
5. Summary and discussions
We have proposed a unifying process-based statistical framework to handle spatial data fusion. The framework allows simultaneously modeling all three types of spatial data, namely geostatistical, lattice and point-pattern data, to be easily incorporated into a single multivariate spatial model. This framework contains theoretical and computational elements from several existing literatures: the implementation uses the SPDE approach (Lindgren et al., 2011) and INLA (Rue et al., 2009), the sampling point approximation approach for modeling lattice data is adopted from Fuentes and Raftery (2005) and data augmentation (Simpson et al., 2016) is used in INLA. We have combined all of the individual elements and constructed this unifying framework. The framework extends upon existing flexible spatial fusion models (Wang et al., 2018, Wilson and Wakefield, 2018) by making point-pattern data also compatible, hence it completes all three spatial data types. We have shown in the first simulation study that it is advantageous to conduct multivariate analysis using multiple spatial datasets if they are available.
Identifiability issues arise when there is more than one latent spatial process in the fusion model. Similar concern has been brought up in other multivariate spatial models (Ren and Banerjee, 2013, Knorr-Held and Best, 2001). Since the model becomes invariant under certain orthogonal transformations, the design matrix $$$$ is not identifiable. Knorr-Held and Best (2001) proposed a specific constraint on the relationship among the individual elements of the design matrix. Ren and Banerjee (2013) proposed to constrain one element of each row in the design matrix to be strictly positive and to have an ordered spatial range parameter. The same constraints allow identifiable parameters in our implementation. A distinction between our proposed framework and existing multivariate models is that we can potentially have only one observation at any of the spatial locations even when we have three response variables in our model. This makes it problematic to identify more than one latent process at each location. Our implementation using the SPDE approach and INLA avoids this problem since it does not directly model the latent variable parameters at the set of locations $$$$, but on the mesh vertices. When a model involves Matérn covariance function with a smoothness parameter greater than 1, the models using the SPDE approach can be used as an approximation while the model likelihood can be directly used with a Bayesian hierarchical approach.
The usage of our proposed framework is multifaceted. The interest sometimes lies within latent spatial processes when spatial data are analyzed, which represent residual spatial correlation in the response variables after considering existing covariates. The result can be used for detecting spatial clusters of unexplained risk or shared scientific drivers for response variables, which warrant further investigation in identifying those unknown drivers. When the interest is in predicting a response variable for a newly observed spatial unit, the fusion model improves the prediction of latent processes which in turn improves response variable prediction. Furthermore, the framework can be modified to use a one-dimensional Gaussian process in the latent components such that it applies beyond spatial data. For example, it can be used in time series modeling where all the observations are in $$$$ and as well as in machine learning applications (Rasmussen and Williams, 2005).
Although our current framework assumes independent latent Gaussian processes, it can be viewed as modeling multivariate Gaussian process via the LMC approach (Genton and Kleiber, 2015). One drawback is that the number of parameters in the design matrix $$$$ increases with the number of latent Gaussian processes, which is difficult to estimate and may lead to convergence issues. In those cases, it may be advantageous to directly model the spatial structure with a cross-covariance function. In addition, the framework requires the number of latent processes $$$$ to be specified a priori.
As with any other statistical modeling, model misspecification has an impact on the inference and should be assessed via sensitivity analysis. Further research might include checking the compatibility of different data sources for spatial fusion modeling, i.e. if overlapping information exists between different spatial datasets. Such information helps to inform the model structure, especially the design matrix $$$$.
Supporting information
Supplementary material for this article is available online at the author’s website www.math.uzh.ch/furrer/download/unifying2020.zip, including all the R code used for the simulation studies and additional details on the Stan implementation of the framework using Bayesian hierarchical models.