Source author record

Andrew J. Holbrook

Andrew J. Holbrook appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Applications, Computation, stat.OT, Methodology, Populations and Evolution

Catalog footprint

What is connected

5 works

5 topics

4 close collaborators

preprint, 2022, arXiv

Accelerating Bayesian inference of dependency between complex biological traits

Inferring dependencies between complex biological traits while accounting for evolutionary relationships between specimens is of great scientific interest yet remains infeasible when trait and specimen counts grow large. The state-of-the-art approach uses a phylogenetic multivariate probit model to accommodate binary and continuous traits via a latent variable framework, and utilizes an efficient bouncy particle sampler (BPS) to tackle the computational bottleneck -- integrating many latent variables from a high-dimensional truncated normal distribution. This approach breaks down as the number of specimens grows and fails to reliably characterize conditional dependencies between traits. Here, we propose an inference pipeline for phylogenetic probit models that greatly outperforms BPS. The novelty lies in 1) a combination of the recent Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) with linear-time gradient evaluations and 2) a joint sampling scheme for highly correlated latent variables and correlation matrix elements. In an application exploring HIV-1 evolution from 535 viruses, the inference requires joint sampling from an 11,235-dimensional truncated normal and a 24-dimensional covariance matrix. Our method yields a 5-fold speedup compared to BPS and makes it possible to learn partial correlations between candidate viral mutations and virulence. Computational speedup now enables us to tackle even larger problems: we study the evolution of influenza H1N1 glycosylations on around 900 viruses. For broader applicability, we extend the phylogenetic probit model to incorporate categorical traits, and demonstrate its use to study Aquilegia flower and pollinator co-evolution.
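The partial correlations mentioned in this abstract can be read off a covariance matrix through its inverse. The sketch below uses the standard identity rho_ij = -P_ij / sqrt(P_ii * P_jj), where P is the precision matrix. The function name `partial_correlations` and the toy matrix are illustrative assumptions and are not taken from the paper, which estimates these quantities inside a phylogenetic probit model.

```python
import numpy as np


def partial_correlations(cov):
    """Partial correlation matrix implied by a covariance matrix.

    Hypothetical helper for illustration: entry (i, j) is the correlation
    between variables i and j after conditioning on all other variables.
    """
    precision = np.linalg.inv(np.asarray(cov, dtype=float))
    scale = np.sqrt(np.diag(precision))
    pcorr = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcorr, 1.0)  # a variable is perfectly correlated with itself
    return pcorr


# Toy example (assumed values): a chain 1 - 2 - 3, so variables 1 and 3
# are conditionally independent given variable 2.
toy_precision = np.array([[2.0, -1.0, 0.0],
                          [-1.0, 2.0, -1.0],
                          [0.0, -1.0, 2.0]])
toy_cov = np.linalg.inv(toy_precision)
toy_pcorr = partial_correlations(toy_cov)
```

In the toy example the marginal covariance between variables 1 and 3 is nonzero, yet their partial correlation vanishes, which is the distinction between marginal and conditional dependence that the abstract relies on.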

preprint, 2022, arXiv

Computational Statistics and Data Science in the Twenty-first Century

Data science has arrived, and computational statistics is its engine. As the scale and complexity of scientific and industrial data grow, the discipline of computational statistics assumes an increasingly central role among the statistical sciences. An explosion in the range of real-world applications means the development of more and more specialized computational methods, but five Core Challenges remain. We provide a high-level introduction to computational statistics by focusing on its central challenges, present recent model-specific advances and preach the ever-increasing role of non-sequential computational paradigms such as multi-core, many-core and quantum computing. Data science is bringing major changes to computational statistics, and these changes will shape the trajectory of the discipline in the 21st century.

preprint, 2022, arXiv

Generating MCMC proposals by randomly rotating the regular simplex

We present the simplicial sampler, a class of parallel MCMC methods that generate and choose from multiple proposals at each iteration. The algorithm's multiproposal randomly rotates a simplex connected to the current Markov chain state in a way that inherently preserves symmetry between proposals. As a result, the simplicial sampler leads to a simplified acceptance step: it simply chooses from among the simplex nodes with probability proportional to their target density values. We also investigate a multivariate Gaussian-based symmetric multiproposal mechanism and prove that it also enjoys the same simplified acceptance step. This insight leads to significant theoretical and practical speedups. While both algorithms enjoy natural parallelizability, we show that conventional implementations are sufficient to confer efficiency gains across an array of dimensions and a number of target distributions.
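The acceptance step described in this abstract can be sketched roughly as follows. This is a minimal illustration built only from the abstract, not the authors' implementation: the helper names, the use of a QR decomposition to draw the random rotation, the edge-length parameter, and the choice to treat the current state as one of the simplex nodes are all assumptions.

```python
import numpy as np


def regular_simplex_offsets(dim, edge):
    """Offsets from one vertex of a regular simplex to its other `dim` vertices."""
    c = (1.0 + np.sqrt(dim + 1.0)) / dim
    # Rows are dim + 1 points with pairwise distance sqrt(2).
    vertices = np.vstack([np.eye(dim), c * np.ones(dim)])
    offsets = vertices[1:] - vertices[0]  # move the first vertex to the origin
    return offsets * (edge / np.sqrt(2.0))


def random_rotation(dim, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    return q * np.sign(np.diag(r))  # sign fix so the draw is uniform (Haar)


def simplicial_step(x, log_density, offsets, rng):
    """One step: rotate the simplex about x, then pick a node by target weight."""
    q = random_rotation(x.size, rng)
    nodes = np.vstack([x, x + offsets @ q.T])  # current state plus proposals
    logw = np.array([log_density(node) for node in nodes])
    weights = np.exp(logw - logw.max())  # subtract the max for numerical stability
    return nodes[rng.choice(len(nodes), p=weights / weights.sum())]


def run_chain(log_density, x0, edge, n_steps, seed=0):
    """Run the sketch sampler and return the visited states."""
    rng = np.random.default_rng(seed)
    offsets = regular_simplex_offsets(x0.size, edge)
    states = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        states.append(simplicial_step(states[-1], log_density, offsets, rng))
    return np.array(states)
```

For instance, `run_chain(lambda x: -0.5 * x @ x, np.zeros(3), edge=2.0, n_steps=1000)` would target a standard normal in three dimensions. Each node's log density can be evaluated independently of the others, which is where the parallelism mentioned in the abstract would enter.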

preprint, 2022, arXiv

Synthesizing longitudinal cortical thickness estimates with a flexible and hierarchical multivariate measurement-error model

MRI-based entorhinal cortical thickness (eCT) measurements predict cognitive decline in Alzheimer's disease (AD) with low cost and minimal invasiveness. Two prominent imaging paradigms, FreeSurfer (FS) and Advanced Normalization Tools (ANTs), feature multiple pipelines for extracting region-specific eCT measurements from raw MRI, but the sheer complexity of these pipelines makes it difficult to choose between pipelines, compare results between pipelines, and characterize uncertainty in pipeline estimates. Worse yet, the EC is particularly difficult to image, leading to variations in thickness estimates between pipelines that overwhelm physiological variations predictive of AD. We examine the eCT outputs of seven different pipelines on MRIs from the Alzheimer's Disease Neuroimaging Initiative. Because of both theoretical and practical limitations, we have no gold standard by which to evaluate them. Instead, we use a Bayesian hierarchical model to combine the estimates. The resulting posterior distribution yields high-probability idealized eCT values that account for inherent uncertainty through a flexible multivariate error model that supports different constant offsets, standard deviations, tailedness, and correlation structures between pipelines. Our hierarchical model directly relates idealized eCTs to clinical outcomes in a way that propagates eCT estimation uncertainty to clinical estimates while accounting for longitudinal structure in the data. Surprisingly, even though it incorporates greater uncertainty in the predictor and regularization provided by the prior, the combined model reveals a stronger association between eCT and cognitive capacity than do nonhierarchical models based on data from single pipelines alone.

preprint, 2020, arXiv

Scalable Bayesian inference for self-excitatory stochastic processes applied to big American gunfire data

The Hawkes process and its extensions effectively model self-excitatory phenomena including earthquakes, viral pandemics, financial transactions, neural spike trains and the spread of memes through social networks. The usefulness of these stochastic process models within a host of economic sectors and scientific disciplines is undercut by the processes' computational burden: complexity of likelihood evaluations grows quadratically in the number of observations for both the temporal and spatiotemporal Hawkes processes. We show that, with care, one may parallelize these calculations using both central and graphics processing unit implementations to achieve over 100-fold speedups over single-core processing. Using a simple adaptive Metropolis-Hastings scheme, we apply our high-performance computing framework to a Bayesian analysis of big gunshot data generated in Washington D.C. between the years of 2006 and 2019, thereby extending a past analysis of the same data from under 10,000 to over 85,000 observations. To encourage wide-spread use, we provide hpHawkes, an open-source R package, and discuss high-level implementation and program design for leveraging aspects of computational hardware that become necessary in a big data setting.
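The quadratic cost mentioned in this abstract is visible in a direct log-likelihood evaluation for a temporal Hawkes process. The sketch below assumes a generic exponential triggering kernel with background rate `mu`, self-excitation weight `alpha` and decay rate `omega`. These names and this parameterization are illustrative assumptions; they are not claimed to match the paper's models or the hpHawkes package interface.

```python
import numpy as np


def hawkes_loglik(times, mu, alpha, omega, horizon):
    """Log-likelihood of a temporal Hawkes process observed on [0, horizon].

    Assumed intensity:
        lambda(t) = mu + alpha * sum_{t_j < t} omega * exp(-omega * (t - t_j))
    """
    times = np.asarray(times, dtype=float)
    # All pairwise lags t_i - t_j: an n-by-n array, hence O(n^2) work and memory.
    lags = times[:, None] - times[None, :]
    past = lags > 0  # only strictly earlier events excite event i
    kernel = np.where(past, omega * np.exp(-omega * np.where(past, lags, 0.0)), 0.0)
    intensity = mu + alpha * kernel.sum(axis=1)
    # Integral of the intensity over the observation window (the compensator).
    compensator = mu * horizon + alpha * np.sum(1.0 - np.exp(-omega * (horizon - times)))
    return np.sum(np.log(intensity)) - compensator
```

Each row of the pairwise array can be computed independently of the others, so a double sum of this shape is a natural target for multi-core and GPU parallelization of the kind the abstract describes.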
