Researcher profile

Marc A. Suchard

Marc A. Suchard contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
9works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

9 published item(s)

preprint2025arXiv

PDA in Action: Ten Principles for High-Quality Multi-Site Clinical Evidence Generation

Background: Distributed Research Networks (DRNs) offer significant opportunities for collaborative multi-site research and have significantly advanced healthcare research based on clinical observational data. However, generating high-quality real-world evidence using fit-for-use data from multi-site studies faces important challenges, including biases associated with various types of heterogeneity within and across sites and data sharing difficulties. Over the last ten years, Privacy-Preserving Distributed Algorithms (PDA) have been developed and utilized in numerous national and international real-world studies spanning diverse domains, from comparative effectiveness research, target trial emulation, to healthcare delivery, policy evaluation, and system performance assessment. Despite these advances, there remains a lack of comprehensive and clear guiding principles for generating high-quality real-world evidence through collaborative studies leveraging the methods under PDA. Objective: The paper aims to establish ten principles of best practice for conducting high-quality multi-site studies using PDA. These principles cover all phases of research, including study preparation, protocol development, analysis, and final reporting. Discussion: The ten principles for conducting a PDA study outline a principled, efficient, and transparent framework for employing distributed learning algorithms within DRNs to generate reliable and reproducible real-world evidence.

preprint2022arXiv

Accelerating Bayesian inference of dependency between complex biological traits

Inferring dependencies between complex biological traits while accounting for evolutionary relationships between specimens is of great scientific interest yet remains infeasible when trait and specimen counts grow large. The state-of-the-art approach uses a phylogenetic multivariate probit model to accommodate binary and continuous traits via a latent variable framework, and utilizes an efficient bouncy particle sampler (BPS) to tackle the computational bottleneck -- integrating many latent variables from a high-dimensional truncated normal distribution. This approach breaks down as the number of specimens grows and fails to reliably characterize conditional dependencies between traits. Here, we propose an inference pipeline for phylogenetic probit models that greatly outperforms BPS. The novelty lies in 1) a combination of the recent Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) with linear-time gradient evaluations and 2) a joint sampling scheme for highly correlated latent variables and correlation matrix elements. In an application exploring HIV-1 evolution from 535 viruses, the inference requires joint sampling from an 11,235-dimensional truncated normal and a 24-dimensional covariance matrix. Our method yields a 5-fold speedup compared to BPS and makes it possible to learn partial correlations between candidate viral mutations and virulence. Computational speedup now enables us to tackle even larger problems: we study the evolution of influenza H1N1 glycosylations on around 900 viruses. For broader applicability, we extend the phylogenetic probit model to incorporate categorical traits, and demonstrate its use to study Aquilegia flower and pollinator co-evolution.

preprint2022arXiv

Adjusting for both sequential testing and systematic error in safety surveillance using observational data: Empirical calibration and MaxSPRT

Post-approval safety surveillance of medical products using observational healthcare data can help identify safety issues beyond those found in pre-approval trials. When testing sequentially as data accrue, maximum sequential probability ratio testing (MaxSPRT) is a common approach to maintaining nominal type 1 error. However, the true type 1 error may still deviate from the specified one because of systematic error due to the observational nature of the analysis. This systematic error may persist even after controlling for known confounders. Here we propose to address this issue by combing MaxSPRT with empirical calibration. In empirical calibration, we assume uncertainty about the systematic error in our analysis, the source of uncertainty commonly overlooked in practice. We infer a probability distribution of systematic error by relying on a large set of negative controls: exposure-outcome where no causal effect is believed to exist. Integrating this distribution into our test statistics has previously been shown to restore type 1 error to nominal. Here we show how we can calibrate the critical value central to MaxSPRT. We evaluate this novel approach using simulations and real electronic health records, using H1N1 vaccinations during the 2009-2010 season as an example. Results show that combining empirical calibration with MaxSPRT restores nominal type 1 error. In our real-world example, adjusting for systematic error using empirical calibration has a larger impact than, and hence is just as essential as, adjusting for sequential testing using MaxSPRT. We recommend performing both, using the method described here.

preprint2022arXiv

Computational Statistics and Data Science in the Twenty-first Century

Data science has arrived, and computational statistics is its engine. As the scale and complexity of scientific and industrial data grow, the discipline of computational statistics assumes an increasingly central role among the statistical sciences. An explosion in the range of real-world applications means the development of more and more specialized computational methods, but five Core Challenges remain. We provide a high-level introduction to computational statistics by focusing on its central challenges, present recent model-specific advances and preach the ever-increasing role of non-sequential computational paradigms such as multi-core, many-core and quantum computing. Data science is bringing major changes to computational statistics, and these changes will shape the trajectory of the discipline in the 21st century.

preprint2022arXiv

Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in "large $n$ & large $p$" Bayesian sparse regression

In a modern observational study based on healthcare databases, the number of observations and of predictors typically range in the order of $10^5$ ~ $10^6$ and of $10^4$ ~ $10^5$. Despite the large sample size, data rarely provide sufficient information to reliably estimate such a large number of parameters. Sparse regression techniques provide potential solutions, one notable approach being the Bayesian methods based on shrinkage priors. In the "large n & large p" setting, however, posterior computation encounters a major bottleneck at repeated sampling from a high-dimensional Gaussian distribution, whose precision matrix $Φ$ is expensive to compute and factorize. In this article, we present a novel algorithm to speed up this bottleneck based on the following observation: we can cheaply generate a random vector $b$ such that the solution to the linear system $Φβ= b$ has the desired Gaussian distribution. We can then solve the linear system by the conjugate gradient (CG) algorithm through matrix-vector multiplications by $Φ$; this involves no explicit factorization or calculation of $Φ$ itself. Rapid convergence of CG in this context is guaranteed by the theory of prior-preconditioning we develop. We apply our algorithm to a clinically relevant large-scale observational study with n = 72,489 patients and p = 22,175 clinical covariates, designed to assess the relative risk of adverse events from two alternative blood anti-coagulants. Our algorithm demonstrates an order of magnitude speed-up in posterior inference, in our case cutting the computation time from two weeks to less than a day.

preprint2021arXiv

Combining Cox Regressions Across a Heterogeneous Distributed Research Network Facing Small and Zero Counts

Studies of the effects of medical interventions increasingly take place in distributed research settings using data from multiple clinical data sources including electronic health records and administrative claims. In such settings, privacy concerns typically prohibit sharing of individual patient data, and instead, analyses can only utilize summary statistics from the individual databases. In the specific but very common context of the Cox proportional hazards model, we show that standard meta analysis methods then lead to substantial bias when outcome counts are small. This bias derives primarily from the normal approximations that the methods utilize. Here we propose and evaluate methods that eschew normal approximations in favor of three more flexible approximations: a skew-normal, a one-dimensional grid, and a custom parametric function that mimics the behavior of the Cox likelihood function. In extensive simulation studies we demonstrate how these approximations impact bias in the context of both fixed-effects and (Bayesian) random-effects models. We then apply these approaches to three real-world studies of the comparative safety of antidepressants, each using data from four observational healthcare databases.

preprint2020arXiv

Online Bayesian phylodynamic inference in BEAST with application to epidemic reconstruction

Reconstructing pathogen dynamics from genetic data as they become available during an outbreak or epidemic represents an important statistical scenario in which observations arrive sequentially in time and one is interested in performing inference in an 'online' fashion. Widely-used Bayesian phylogenetic inference packages are not set up for this purpose, generally requiring one to recompute trees and evolutionary model parameters de novo when new data arrive. To accommodate increasing data flow in a Bayesian phylogenetic framework, we introduce a methodology to efficiently update the posterior distribution with newly available genetic data. Our procedure is implemented in the BEAST 1.10 software package, and relies on a distance-based measure to insert new taxa into the current estimate of the phylogeny and imputes plausible values for new model parameters to accommodate growing dimensionality. This augmentation creates informed starting values and re-uses optimally tuned transition kernels for posterior exploration of growing data sets, reducing the time necessary to converge to target posterior distributions. We apply our framework to data from the recent West African Ebola virus epidemic and demonstrate a considerable reduction in time required to obtain posterior estimates at different time points of the outbreak. Beyond epidemic monitoring, this framework easily finds other applications within the phylogenetics community, where changes in the data -- in terms of alignment changes, sequence addition or removal -- present common scenarios that can benefit from online inference.

preprint2020arXiv

Scalable Bayesian inference for self-excitatory stochastic processes applied to big American gunfire data

The Hawkes process and its extensions effectively model self-excitatory phenomena including earthquakes, viral pandemics, financial transactions, neural spike trains and the spread of memes through social networks. The usefulness of these stochastic process models within a host of economic sectors and scientific disciplines is undercut by the processes' computational burden: complexity of likelihood evaluations grows quadratically in the number of observations for both the temporal and spatiotemporal Hawkes processes. We show that, with care, one may parallelize these calculations using both central and graphics processing unit implementations to achieve over 100-fold speedups over single-core processing. Using a simple adaptive Metropolis-Hastings scheme, we apply our high-performance computing framework to a Bayesian analysis of big gunshot data generated in Washington D.C. between the years of 2006 and 2019, thereby extending a past analysis of the same data from under 10,000 to over 85,000 observations. To encourage wide-spread use, we provide hpHawkes, an open-source R package, and discuss high-level implementation and program design for leveraging aspects of computational hardware that become necessary in a big data setting.

preprint2018arXiv

Scalable Sparse Cox's Regression for Large-Scale Survival Data via Broken Adaptive Ridge

This paper develops a new scalable sparse Cox regression tool for sparse high-dimensional massive sample size (sHDMSS) survival data. The method is a local $L_0$-penalized Cox regression via repeatedly performing reweighted $L_2$-penalized Cox regression. We show that the resulting estimator enjoys the best of $L_0$- and $L_2$-penalized Cox regressions while overcoming their limitations. Specifically, the estimator is selection consistent, oracle for parameter estimation, and possesses a grouping property for highly correlated covariates. Simulation results suggest that when the sample size is large, the proposed method with pre-specified tuning parameters has a comparable or better performance than some popular penalized regression methods. More importantly, because the method naturally enables adaptation of efficient algorithms for massive $L_2$-penalized optimization and does not require costly data driven tuning parameter selection, it has a significant computational advantage for sHDMSS data, offering an average of 5-fold speedup over its closest competitor in empirical studies.