Source author record

Jon Wakefield

Jon Wakefield appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Methodology Applications Computation stat.OT

Catalog footprint

13 works, 4 topics, 4 close collaborators

preprint, 2022, arXiv

Estimating Global and Country-Specific Excess Mortality During the COVID-19 Pandemic

Estimating the true mortality burden of COVID-19 for every country in the world is a difficult, but crucial, public health endeavor. Attributing deaths, direct or indirect, to COVID-19 is problematic. A more attainable target is the "excess deaths", the number of deaths in a particular period, relative to that expected during "normal times", and we estimate this for all countries on a monthly time scale for 2020 and 2021. The excess mortality requires two numbers, the total deaths and the expected deaths, but the former is unavailable for many countries, and so modeling is required for these countries. The expected deaths are based on historic data and we develop a model for producing expected estimates for all countries and we allow for uncertainty in the modeled expected numbers when calculating the excess. We describe the methods that were developed to produce the World Health Organization (WHO) excess death estimates. To achieve both interpretability and transparency we developed a relatively simple overdispersed Poisson count framework, within which the various data types can be modeled. We use data from countries with national monthly data to build a predictive log-linear regression model with time-varying coefficients for countries without data. For a number of countries, subnational data only are available, and we construct a multinomial model for such data, based on the assumption that the fractions of deaths in sub-regions remain approximately constant over time. Based on our modeling, the point estimate for global excess mortality, over 2020-2021, is 14.9 million, with a 95% credible interval of (13.3, 16.6) million. This leads to a point estimate of the ratio of excess deaths to reported COVID-19 deaths of 2.75, which is a huge discrepancy.
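The quantities named in this abstract are related by simple arithmetic, which the following minimal sketch spells out. Excess deaths are observed deaths minus expected deaths, and the stated ratio of 2.75 combined with the 14.9 million point estimate implies roughly 5.4 million reported COVID-19 deaths. The function names and the monthly counts are hypothetical and chosen only for illustration; this is not the WHO model, which also propagates uncertainty in the expected deaths.

```python
def excess_deaths(observed, expected):
    """Excess deaths per period: observed deaths minus expected deaths."""
    return [o - e for o, e in zip(observed, expected)]


def implied_reported_deaths(excess_total, ratio):
    """Reported deaths implied by an excess total and an excess-to-reported ratio."""
    return excess_total / ratio


# Hypothetical monthly counts for one country (illustration only).
observed = [1200, 1350, 1500]
expected = [1000, 1010, 990]
monthly_excess = excess_deaths(observed, expected)
total_excess = sum(monthly_excess)

# Figures quoted in the abstract: 14.9 million excess deaths, ratio 2.75.
implied = implied_reported_deaths(14.9e6, 2.75)
print(monthly_excess, total_excess, round(implied))
```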

preprint, 2022, arXiv

Smoothed Model-Assisted Small Area Estimation

In countries where population census data are limited, generating accurate subnational estimates of health and demographic indicators is challenging. Existing model-based geostatistical methods leverage covariate information and spatial smoothing to reduce the variability of estimates but often ignore survey design, while traditional small area estimation approaches may not incorporate both unit level covariate information and spatial smoothing in a design-consistent way. We propose a smoothed model-assisted estimator that accounts for survey design and leverages both unit level covariates and spatial smoothing. Under certain assumptions, this estimator is both design-consistent and model-consistent. We compare it with existing design-based and model-based estimators using real and simulated data.

preprint, 2022, arXiv

Spatial Aggregation with Respect to a Population Distribution

Spatial aggregation with respect to a population distribution involves estimating aggregate quantities for a population based on an observation of individuals in a subpopulation. In this context, a geostatistical workflow must account for three major sources of 'aggregation error': aggregation weights, fine scale variation, and finite population variation. However, common practice is to treat the unknown population distribution as a known population density and ignore empirical variability in outcomes. We improve common practice by introducing a 'sampling frame model' that allows aggregation models to account for the three sources of aggregation error simply and transparently. We compare the proposed and the traditional approach using two simulation studies that mimic neonatal mortality rate (NMR) data from the 2014 Kenya Demographic and Health Survey (KDHS2014). For the traditional approach, undercoverage/overcoverage depends arbitrarily on the aggregation grid resolution, while the new approach exhibits low sensitivity. The differences between the two aggregation approaches increase as the population of an area decreases. The differences are substantial at the second administrative level and finer, but also at the first administrative level for some population quantities. We find differences between the proposed and traditional approach are consistent with those we observe in an application to NMR data from the KDHS2014.

preprint, 2022, arXiv

The Central Role of the Identifying Assumption in Population Size Estimation

The problem of estimating the size of a population based on a subset of individuals observed across multiple data sources is often referred to as capture-recapture or multiple-systems estimation. This is fundamentally a missing data problem, where the number of unobserved individuals represents the missing data. As with any missing data problem, multiple-systems estimation requires users to make an untestable identifying assumption in order to estimate the population size from the observed data. If an appropriate identifying assumption cannot be found for a data set, no estimate of the population size should be produced based on that data set, as models with different identifying assumptions can produce arbitrarily different population size estimates, even with identical observed data fits. Approaches to multiple-systems estimation often do not explicitly specify identifying assumptions. This makes it difficult to decouple the specification of the model for the observed data from the identifying assumption and to provide justification for the identifying assumption. We present a re-framing of the multiple-systems estimation problem that leads to an approach which decouples the specification of the observed-data model from the identifying assumption, and discuss how common models fit into this framing. This approach takes advantage of existing software and facilitates various sensitivity analyses. We demonstrate our approach in a case study estimating the number of civilian casualties in the Kosovo war. Code used to produce this manuscript is available at https://github.com/aleshing/central-role-of-identifying-assumptions.
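The role of the identifying assumption can be seen in the simplest two-list case. The sketch below is an illustration under stated assumptions, not the authors' approach: with two lists, the observed counts are `n11` (on both lists), `n10` (first list only) and `n01` (second list only), and the unobserved count `n00` is fixed only by an assumed odds ratio between the lists. An odds ratio of 1 is the independence assumption behind the classical Lincoln-Petersen estimator. The function name, the `odds_ratio` parameter and the counts are hypothetical.

```python
def estimate_population(n11, n10, n01, odds_ratio=1.0):
    """Two-list population size under an assumed list-by-list odds ratio.

    The odds ratio n11 * n00 / (n10 * n01) cannot be estimated from the
    observed cells, so it must be assumed; odds_ratio=1.0 is independence.
    """
    if n11 <= 0:
        raise ValueError("need at least one individual observed on both lists")
    n00 = odds_ratio * n10 * n01 / n11  # imputed count of unobserved individuals
    return n11 + n10 + n01 + n00


# Hypothetical observed counts (illustration only).
n11, n10, n01 = 50, 150, 100

under_independence = estimate_population(n11, n10, n01)               # odds ratio 1
under_positive_dependence = estimate_population(n11, n10, n01, 2.0)   # odds ratio 2
print(under_independence, under_positive_dependence)
```

Both calls use exactly the same observed counts, yet the estimates differ because only the assumed odds ratio changes, which is the sense in which the assumption is untestable from the observed data.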

preprint, 2020, arXiv

Bayesian Multiresolution Modeling Of Georeferenced Data

Current implementations of multiresolution methods are limited in terms of possible types of responses and approaches to inference. We provide a multiresolution approach for spatial analysis of non-Gaussian responses using latent Gaussian models and Bayesian inference via integrated nested Laplace approximation (INLA). The approach builds on LatticeKrig, but uses a reparameterization of the model parameters that is intuitive and interpretable so that modeling and prior selection can be guided by expert knowledge about the different spatial scales at which dependence acts. The priors can be used to make inference robust and integration over model parameters allows for more accurate posterior estimates of uncertainty. The extended LatticeKrig (ELK) model is compared to a standard implementation of LatticeKrig (LK), and a standard Matérn model, and we find modest improvement in spatial oversmoothing and prediction for the ELK model for counts of secondary education completion for women in Kenya collected in the 2014 Kenya Demographic and Health Survey. Through a simulation study with Gaussian responses and a realistic mix of short and long scale dependencies, we demonstrate that the differences between the three approaches for prediction increase with distance to nearest observation.

preprint, 2020, arXiv

Estimation of Health and Demographic Indicators with Incomplete Geographic Information

In low- and middle-income countries, household surveys are a valuable source of information for a range of health and demographic indicators. Increasingly, subnational estimates are required for targeting interventions and evaluating progress towards targets. In the majority of cases, stratified cluster sampling is used, with clusters corresponding to enumeration areas. The reported geographical information varies. A common procedure, to preserve confidentiality, is to give a jittered location in which the true centroid of the cluster is displaced under a known algorithm. An alternative, used for older surveys in particular, is to report only the geographical region within which the cluster lies. In this paper, we describe a spatial hierarchical model in which we account for inaccuracies in the cluster locations. The computational algorithm we develop is fast and avoids the heavy computation of a pure MCMC approach. We illustrate by simulation the benefits of the model over naive alternatives.

preprint, 2020, arXiv

Small Area Estimation of Health Outcomes

Small area estimation (SAE) entails estimating characteristics of interest for domains, often geographical areas, in which there may be few or no samples available. SAE has a long history and a wide variety of methods have been suggested, from a bewildering range of philosophical standpoints. We describe design-based and model-based approaches and models that are specified at the area-level and at the unit-level, focusing on health applications and fully Bayesian spatial models. The use of auxiliary information is a key ingredient for successful inference when response data are sparse and we discuss a number of approaches that allow the inclusion of covariate data. SAE for HIV prevalence, using data collected from a Demographic Health Survey in Malawi in 2015-2016, is used to illustrate a number of techniques. The potential use of SAE techniques for outcomes related to COVID-19 is discussed.

preprint, 2016, arXiv

Space-time smoothing of complex survey data: Small area estimation for child mortality

Many people living in low- and middle-income countries are not covered by civil registration and vital statistics systems. Consequently, a wide variety of other types of data, including many household sample surveys, are used to estimate health and population indicators. In this paper we combine data from sample surveys and demographic surveillance systems to produce small area estimates of child mortality through time. Small area estimates are necessary to understand geographical heterogeneity in health indicators when full-coverage vital statistics are not available. For this endeavor spatio-temporal smoothing is beneficial to alleviate problems of data sparsity. The use of conventional hierarchical models requires careful thought since the survey weights may need to be considered to alleviate bias due to nonrandom sampling and nonresponse. The application that motivated this work is an estimation of child mortality rates in five-year time intervals in regions of Tanzania. Data come from Demographic and Health Surveys conducted over the period 1991-2010 and two demographic surveillance system sites. We derive a variance estimator of under five years child mortality that accounts for the complex survey weighting. For our application, the hierarchical models we consider include random effects for area, time and survey and we compare models using a variety of measures including the conditional predictive ordinate (CPO). The method we propose is implemented via the fast and accurate integrated nested Laplace approximation (INLA).

preprint, 2016, arXiv

Spatial Modeling, with Application to Complex Survey Data: Discussion of "Model-based Geostatistics for Prevalence Mapping in Low-Resource Settings", by Diggle and Giorgi

Prevalence mapping in low resource settings is an increasingly important endeavor to guide policy making and to spatially and temporally characterize the burden of disease. We will focus our discussion on consideration of the complex design when analyzing survey data, and on spatial modeling. With respect to the former, we consider two approaches: direct use of the weights, and a model-based approach using a spatial model to acknowledge clustering. For the latter we consider continuously indexed Markovian Gaussian random field models.

preprint, 2015, arXiv

InSilicoVA: A Method to Automate Cause of Death Assignment for Verbal Autopsy

Verbal autopsies (VA) are widely used to provide cause-specific mortality estimates in developing world settings where vital registration does not function well. VAs assign cause(s) to a death by using information describing the events leading up to the death, provided by care givers. Typically physicians read VA interviews and assign causes using their expert knowledge. Physician coding is often slow, and individual physicians bring bias to the coding process that results in non-comparable cause assignments. These problems significantly limit the utility of physician-coded VAs. A solution to both is to use an algorithmic approach that formalizes the cause-assignment process. This ensures that assigned causes are comparable and requires many fewer person-hours so that cause assignment can be conducted quickly without disrupting the normal work of physicians. Peter Byass' InterVA method is the most widely used algorithmic approach to VA coding and is aligned with the WHO 2012 standard VA questionnaire. The statistical model underpinning InterVA can be improved; uncertainty needs to be quantified, and the link between the population-level CSMFs and the individual-level cause assignments needs to be statistically rigorous. Addressing these theoretical concerns provides an opportunity to create new software using modern languages that can run on multiple platforms and will be widely shared. Building on the overall framework pioneered by InterVA, our work creates a statistical model for automated VA cause assignment.

preprint, 2015, arXiv

Predictive Modeling of Cholera Outbreaks in Bangladesh

Despite seasonal cholera outbreaks in Bangladesh, little is known about the relationship between environmental conditions and cholera cases. We seek to develop a predictive model for cholera outbreaks in Bangladesh based on environmental predictors. To do this, we estimate the contribution of environmental variables, such as water depth and water temperature, to cholera outbreaks in the context of a disease transmission model. We implement a method which simultaneously accounts for disease dynamics and environmental variables in a Susceptible-Infected-Recovered-Susceptible (SIRS) model. The entire system is treated as a continuous-time hidden Markov model, where the hidden Markov states are the numbers of people who are susceptible, infected, or recovered at each time point, and the observed states are the numbers of cholera cases reported. We use a Bayesian framework to fit this hidden SIRS model, implementing particle Markov chain Monte Carlo methods to sample from the posterior distribution of the environmental and transmission parameters given the observed data. We test this method using both simulation and data from Mathbaria, Bangladesh. Parameter estimates are used to make short-term predictions that capture the formation and decline of epidemic peaks. We demonstrate that our model can successfully predict an increase in the number of infected individuals in the population weeks before the observed number of cholera cases increases, which could allow for early notification of an epidemic and timely allocation of resources.
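For readers unfamiliar with the compartmental structure mentioned here, the following is a minimal deterministic sketch of SIRS dynamics using small Euler steps. It is a deliberate simplification and not the paper's method: the abstract describes a stochastic continuous-time hidden Markov model with environmental covariates, fitted by particle MCMC. The function name, the parameter names (`beta` for transmission, `gamma` for recovery, `mu` for loss of immunity) and all numerical values are hypothetical.

```python
def simulate_sirs(s0, i0, r0, beta, gamma, mu, days, dt=0.1):
    """Deterministic SIRS trajectory via forward Euler steps of size dt (days)."""
    n = s0 + i0 + r0  # total population, held constant
    s, i, r = float(s0), float(i0), float(r0)
    path = [(s, i, r)]
    for _ in range(int(round(days / dt))):
        new_infections = beta * s * i / n * dt  # S -> I
        new_recoveries = gamma * i * dt         # I -> R
        lost_immunity = mu * r * dt             # R -> S (the second "S" in SIRS)
        s += lost_immunity - new_infections
        i += new_infections - new_recoveries
        r += new_recoveries - lost_immunity
        path.append((s, i, r))
    return path


# Hypothetical parameter values (illustration only).
path = simulate_sirs(s0=990, i0=10, r0=0, beta=0.5, gamma=0.1, mu=0.01, days=30)
peak_infected = max(i for _, i, _ in path)
print(len(path), peak_infected)
```

In a model of the kind the abstract describes, the transmission rate would additionally depend on environmental predictors such as water depth and temperature, and only reported case counts would be observed.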

preprint, 2015, arXiv

Restricted Covariance Priors with Applications in Spatial Statistics

We present a Bayesian model for area-level count data that uses Gaussian random effects with a novel type of G-Wishart prior on the inverse variance-covariance matrix. Specifically, we introduce a new distribution called the truncated G-Wishart distribution that has support over precision matrices that lead to positive associations between the random effects of neighboring regions while preserving conditional independence of non-neighboring regions. We describe Markov chain Monte Carlo sampling algorithms for the truncated G-Wishart prior in a disease mapping context and compare our results to Bayesian hierarchical models based on intrinsic autoregression priors. A simulation study illustrates that using the truncated G-Wishart prior improves over the intrinsic autoregressive priors when there are discontinuities in the disease risk surface. The new model is applied to an analysis of cancer incidence data in Washington State.

preprint, 2012, arXiv

Bayesian sandwich posteriors for pseudo-true parameters

Under model misspecification, the MLE generally converges to the pseudo-true parameter, the parameter corresponding to the distribution within the model that is closest to the distribution from which the data are sampled. In many problems, the pseudo-true parameter corresponds to a population parameter of interest, and so a misspecified model can provide consistent estimation for this parameter. Furthermore, the well-known sandwich variance formula of Huber (1967) provides an asymptotically accurate sampling distribution for the MLE, even under model misspecification. However, confidence intervals based on a sandwich variance estimate may behave poorly for low sample sizes, partly due to the use of a plug-in estimate of the variance. From a Bayesian perspective, plug-in estimates of nuisance parameters generally underrepresent uncertainty in the unknown parameters, and averaging over such parameters is expected to give better performance. With this in mind, we present a Bayesian sandwich posterior distribution, whose likelihood is based on the sandwich sampling distribution of the MLE. This Bayesian approach allows for the incorporation of prior information about the parameter of interest, averages over uncertainty in the nuisance parameter and is asymptotically robust to model misspecification. In a small simulation study on estimating a regression parameter under heteroscedasticity, the addition of accurate prior information and the averaging over the nuisance parameter are both seen to improve the accuracy and calibration of confidence intervals for the parameter of interest.
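The plug-in sandwich variance that the abstract takes as its starting point can be sketched for the simplest case, a regression through the origin. This shows only the classical frequentist estimator, not the Bayesian sandwich posterior the paper proposes. The function name and the data are hypothetical; the sandwich form used is the basic one without small-sample corrections.

```python
def ols_through_origin(x, y):
    """Slope estimate with model-based and sandwich variance estimates.

    Sandwich form A^{-1} B A^{-1} with A = sum(x_i^2) and
    B = sum(x_i^2 * e_i^2), where e_i are the fitted residuals.
    """
    n = len(x)
    sxx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sxx
    resid = [yi - beta * xi for xi, yi in zip(x, y)]
    # Model-based variance assumes a common error variance for all points.
    model_var = (sum(e * e for e in resid) / (n - 1)) / sxx
    # Sandwich variance lets each point contribute its own squared residual.
    sandwich_var = sum((xi * e) ** 2 for xi, e in zip(x, resid)) / sxx ** 2
    return beta, model_var, sandwich_var


# Hypothetical data (illustration only).
x = [1.0, 2.0, 3.0, 4.0]
y = [1.1, 1.8, 3.5, 3.4]
beta, model_var, sandwich_var = ols_through_origin(x, y)
print(beta, model_var, sandwich_var)
```

The sandwich estimate plugs fitted residuals in for the unknown variances, which is the plug-in step that the abstract identifies as a source of poor interval behavior at low sample sizes.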
