Source author record

Frederick A. Matsen IV

Frederick A. Matsen IV appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Populations and Evolution Genomics Methodology Discrete Mathematics Applications Computational Engineering, Finance, and Science Machine Learning math.ST Statistics Theory

Catalog footprint

What is connected

15works

9topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Bayesian inference of antibody evolutionary dynamics using multitype branching processes

When our immune system encounters foreign antigens (i.e., from pathogens), the B cells that produce our antibodies undergo a cyclic process of proliferation, mutation, and selection, improving their ability to bind to the specific antigen. Immunologists have recently developed powerful experimental techniques to investigate this process in mouse models. In one such experiment, mice are engineered with a monoclonal B-cell precursor and immunized with a model antigen. B cells are sampled from sacrificed mice after the immune response has progressed, and the mutated genetic loci encoding antibodies are sequenced. This experiment allows parallel replay of antibody evolution, but produces data at only one time point; we are unable to observe the evolutionary trajectories that lead to optimized antibody affinity in each mouse. To address this, we model antibody evolution as a multitype branching process and integrate over unobserved histories conditioned on phylogenetic signal in sequence data, leveraging parallel experimental replays for parameter inference. We infer the functional relationship between B-cell fitness and antigen binding affinity in a Bayesian framework, equipped with an efficient likelihood calculation algorithm and Markov chain Monte Carlo posterior approximation. In a simulation study, we demonstrate that a sigmoidal relationship between fitness and binding affinity can be recovered from realizations of the branching process. We then perform inference for experimental data from 52 replayed B-cell lineages sampled 15 days after immunization, yielding a total of 3,758 sampled B cells. The recovered sigmoidal curve indicates that the fitness of high-affinity B cells is over six times larger than that of low-affinity B cells, with a sharp transition from low to high fitness values as affinity increases.

preprint2022arXiv

How trustworthy is your tree? Bayesian phylogenetic effective sample size through the lens of Monte Carlo error

Bayesian inference is a popular and widely-used approach to infer phylogenies (evolutionary trees). However, despite decades of widespread application, it remains difficult to judge how well a given Bayesian Markov chain Monte Carlo (MCMC) run explores the space of phylogenetic trees. In this paper, we investigate the Monte Carlo error of phylogenies, focusing on high-dimensional summaries of the posterior distribution, including variability in estimated edge/branch (known in phylogenetics as "split") probabilities and tree probabilities, and variability in the estimated summary tree. Specifically, we ask if there is any measure of effective sample size (ESS) applicable to phylogenetic trees which is capable of capturing the Monte Carlo error of these three summary measures. We find that there are some ESS measures capable of capturing the error inherent in using MCMC samples to approximate the posterior distributions on phylogenies. We term these tree ESS measures, and identify a set of three which are useful in practice for assessing the Monte Carlo error. Lastly, we present visualization tools that can improve comparisons between multiple independent MCMC runs by accounting for the Monte Carlo error present in each chain. Our results indicate that common post-MCMC workflows are insufficient to capture the inherent Monte Carlo error of the tree, and highlight the need for both within-chain mixing and between-chain convergence assessments.

preprint2022arXiv

Inference of B cell clonal families using heavy/light chain pairing information

Next generation sequencing of B cell receptor (BCR) repertoires has become a ubiquitous tool for understanding the antibody-mediated immune response: it is now common to have large volumes of sequence data coding for both the heavy and light chain subunits of the BCR. However, until the recent development of high throughput methods of preserving heavy/light chain pairing information, these samples contained no explicit information on which heavy chain sequence pairs with which light chain sequence. One of the first steps in analyzing such BCR repertoire samples is grouping sequences into clonally related families, where each stems from a single rearrangement event. Many methods of accomplishing this have been developed, however, none so far has taken full advantage of the newly-available pairing information. This information can dramatically improve clustering performance, especially for the light chain. The light chain has traditionally been challenging for clonal family inference because of its low diversity and consequent abundance of non-clonal families with indistinguishable naive rearrangements. Here we present a method of incorporating this pairing information into the clustering process in order to arrive at a more accurate partition of the data into clonally related families. We also demonstrate two methods of fixing imperfect pairing information, which may allow for simplified sample preparation and increased sequencing depth. Finally, we describe several other improvements to the partis software package (https://github.com/psathyrella/partis).

preprint2020arXiv

Non-bifurcating phylogenetic tree inference via the adaptive LASSO

Phylogenetic tree inference using deep DNA sequencing is reshaping our understanding of rapidly evolving systems, such as the within-host battle between viruses and the immune system. Densely sampled phylogenetic trees can contain special features, including "sampled ancestors" in which we sequence a genotype along with its direct descendants, and "polytomies" in which multiple descendants arise simultaneously. These features are apparent after identifying zero-length branches in the tree. However, current maximum-likelihood based approaches are not capable of revealing such zero-length branches. In this paper, we find these zero-length branches by introducing adaptive-LASSO-type regularization estimators to phylogenetics, deriving their properties, and showing regularization to be a practically useful approach for phylogenetics.

preprint2020arXiv

Using B cell receptor lineage structures to predict affinity

We are frequently faced with a large collection of antibodies, and want to select those with highest affinity for their cognate antigen. When developing a first-line therapeutic for a novel pathogen, for instance, we might look for such antibodies in patients that have recovered. There exist effective experimental methods of accomplishing this, such as cell sorting and baiting; however they are time consuming and expensive. Next generation sequencing of B cell receptor (BCR) repertoires offers an additional source of sequences that could be tapped if we had a reliable method of selecting those coding for the best antibodies. In this paper we introduce a method that uses evolutionary information from the family of related sequences that share a naive ancestor to predict the affinity of each resulting antibody for its antigen. When combined with information on the identity of the antigen, this method should provide a source of effective new antibodies. We also introduce a method for a related task: given an antibody of interest and its inferred ancestral lineage, which branches in the tree are likely to harbor key affinity-increasing mutations? These methods are implemented as part of continuing development of the partis BCR inference package, available at https://github.com/psathyrella/partis.

preprint2019arXiv

A Bayesian Phylogenetic Hidden Markov Model for B Cell Receptor Sequence Analysis

The human body is able to generate a diverse set of high affinity antibodies, the soluble form of B cell receptors (BCRs), that bind to and neutralize invading pathogens. The natural development of BCRs must be understood in order to design vaccines for highly mutable pathogens such as influenza and HIV. BCR diversity is induced by naturally occurring combinatorial "V(D)J" rearrangement, mutation, and selection processes. Most current methods for BCR sequence analysis focus on separately modeling the above processes. Statistical phylogenetic methods are often used to model the mutational dynamics of BCR sequence data, but these techniques do not consider all the complexities associated with B cell diversification such as the V(D)J rearrangement process. In particular, standard phylogenetic approaches assume the DNA bases of the progenitor (or "naive") sequence arise independently and according to the same distribution, ignoring the complexities of V(D)J rearrangement. In this paper, we introduce a novel approach to Bayesian phylogenetic inference for BCR sequences that is based on a phylogenetic hidden Markov model (phylo-HMM). This technique not only integrates a naive rearrangement model with a phylogenetic model for BCR sequence evolution but also naturally accounts for uncertainty in all unobserved variables, including the phylogenetic tree, via posterior distribution sampling.

preprint2016arXiv

Chain Reduction Preserves the Unrooted Subtree Prune-and-Regraft Distance

The subtree prune-and-regraft (SPR) distance metric is a fundamental way of comparing evolutionary trees. It has wide-ranging applications, such as to study lateral genetic transfer, viral recombination, and Markov chain Monte Carlo phylogenetic inference. Although the rooted version of SPR distance can be com puted relatively efficiently between rooted trees using fixed-parameter-tractable algorithms, in the unrooted case previous algorithms are unable to compute distances larger than 7. One important tool for efficient computation in the rooted case is called chain reduction, which replaces an arbitrary chain of subtrees identical in both trees with a chain of three leaves. Whether chain reduction preserves SPR distance in the unrooted case has remained an open question since it was conjectured in 2001 by Allen and Steel, and was presented as a challenge question at the 2007 Isaac Newton Institute for Mathematical Sciences program on phylogenetics. In this paper we prove that chain reduction preserves the unrooted SPR distance. We do so by introducing a structure called a socket agreement forest that restricts edge modification to predetermined socket vertices, permitting detailed analysis and modification of SPR move sequences. This new chain reduction theorem reduces the unrooted distance problem to a linear size problem kernel, substantially improving on the previous best quadratic size kernel.

preprint2016arXiv

Online Bayesian phylogenetic inference: theoretical foundations via Sequential Monte Carlo

Phylogenetics, the inference of evolutionary trees from molecular sequence data such as DNA, is an enterprise that yields valuable evolutionary understanding of many biological systems. Bayesian phylogenetic algorithms, which approximate a posterior distribution on trees, have become a popular if computationally expensive means of doing phylogenetics. Modern data collection technologies are quickly adding new sequences to already substantial databases. With all current techniques for Bayesian phylogenetics, computation must start anew each time a sequence becomes available, making it costly to maintain an up-to-date estimate of a phylogenetic posterior. These considerations highlight the need for an \emph{online} Bayesian phylogenetic method which can update an existing posterior with new sequences. Here we provide theoretical results on the consistency and stability of methods for online Bayesian phylogenetic inference based on Sequential Monte Carlo (SMC) and Markov chain Monte Carlo (MCMC). We first show a consistency result, demonstrating that the method samples from the correct distribution in the limit of a large number of particles. Next we derive the first reported set of bounds on how phylogenetic likelihood surfaces change when new sequences are added. These bounds enable us to characterize the theoretical performance of sampling algorithms by bounding the effective sample size (ESS) with a given number of particles from below. We show that the ESS is guaranteed to grow linearly as the number of particles in an SMC sampler grows. Surprisingly, this result holds even though the dimensions of the phylogenetic model grow with each new added sequence.

preprint2016arXiv

The shape of the one-dimensional phylogenetic likelihood function

By fixing all parameters in a phylogenetic likelihood model except for one branch length, one obtains a one-dimensional likelihood function. In this work, we introduce a mathematical framework to characterize the shapes of such one-dimensional phylogenetic likelihood functions. This framework is based on analyses of algebraic structures on the space of all frequency patterns with respect to a polynomial representation of the likelihood functions. Using this framework, we provide conditions under which the one-dimensional phylogenetic likelihood functions are guaranteed to have at most one stationary point, and this point is the maximum likelihood branch length. These conditions are satisfied by common simple models including all binary models, the Jukes-Cantor model and the Felsenstein 1981 model. We then prove that for the simplest model that does not satisfy our conditions, namely, the Kimura 2-parameter model, the one-dimensional likelihood functions may have multiple stationary points. As a proof of concept, we construct a non-degenerate example in which the phylogenetic likelihood function has two local maxima and a local minimum. To construct such examples, we derive a general method of constructing a tree and sequence data with a specified frequency pattern at the root. We then extend the result to prove that the space of all rescaled and translated one-dimensional phylogenetic likelihood functions under the Kimura 2-parameter model is dense in the space of all non-negative continuous functions on $[0, \infty)$ with finite limits. These results indicate that one-dimensional likelihood functions under advanced evolutionary models can be more complex than it is typically assumed by phylogenetic inference algorithms; however, these complexities can be effectively captured by the Kimura 2-parameter model.

preprint2015arXiv

Consistency of VDJ rearrangement and substitution parameters enables accurate B cell receptor sequence annotation

VDJ rearrangement and somatic hypermutation work together to produce antibody-coding B cell receptor (BCR) sequences for a remarkable diversity of antigens. It is now possible to sequence these BCRs in high throughput; analysis of these sequences is bringing new insight into how antibodies develop, in particular for broadly-neutralizing antibodies against HIV and influenza. A fundamental step in such sequence analysis is to annotate each base as coming from a specific one of the V, D, or J genes, or from an N-addition (a.k.a. non-templated insertion). Previous work has used simple parametric distributions to model transitions from state to state in a hidden Markov model (HMM) of VDJ recombination, and assumed that mutations occur via the same process across sites. However, codon frame and other effects have been observed to violate these parametric assumptions for such coding sequences, suggesting that a non-parametric approach to modeling the recombination process could be useful. In our paper, we find that indeed large modern data sets suggest a model using parameter-rich per-allele categorical distributions for HMM transition probabilities and per-allele-per-position mutation probabilities, and that using such a model for inference leads to significantly improved results. We present an accurate and efficient BCR sequence annotation software package using a novel HMM "factorization" strategy. This package, called partis (https://github.com/psathyrella/partis/), is built on a new general-purpose HMM compiler that can perform efficient inference given a simple text description of an HMM.

preprint2015arXiv

Quantifying evolutionary constraints on B cell affinity maturation

The antibody repertoire of each individual is continuously updated by the evolutionary process of B cell receptor mutation and selection. It has recently become possible to gain detailed information concerning this process through high-throughput sequencing. Here, we develop modern statistical molecular evolution methods for the analysis of B cell sequence data, and then apply them to a very deep short-read data set of B cell receptors. We find that the substitution process is conserved across individuals but varies significantly across gene segments. We investigate selection on B cell receptors using a novel method that side-steps the difficulties encountered by previous work in differentiating between selection and motif-driven mutation; this is done through stochastic mapping and empirical Bayes estimators that compare the evolution of in-frame and out-of-frame rearrangements. We use this new method to derive a per-residue map of selection, which provides a more nuanced view of the constraints on framework and variable regions.

preprint2015arXiv

Ricci-Ollivier Curvature of the Rooted Phylogenetic Subtree-Prune-Regraft Graph

Statistical phylogenetic inference methods use tree rearrangement operations to perform either hill-climbing local search or Markov chain Monte Carlo across tree topologies. The canonical class of such moves are the subtree-prune-regraft (SPR) moves that remove a subtree and reattach it somewhere else via the cut edge of the subtree. Phylogenetic trees and such moves naturally form the vertices and edges of a graph, such that tree search algorithms perform a (potentially stochastic) traversal of this SPR graph. Despite the centrality of such graphs in phylogenetic inference, rather little is known about their large-scale properties. In this paper we learn about the rooted-tree version of the graph, known as the rSPR graph, by calculating the Ricci-Ollivier curvature for pairs of vertices in the rSPR graph with respect to two simple random walks on the rSPR graph. By proving theorems and direct calculation with novel algorithms, we find a remarkable diversity of different curvatures on the rSPR graph for pairs of vertices separated by the same distance. We confirm using simulation that degree and curvature have the expected impact on mean access time distributions, demonstrating relevance of these curvature results to stochastic tree search. This indicates significant structure of the rSPR graph beyond that which was previously understood in terms of pairwise distances and vertex degrees; a greater understanding of curvature could ultimately lead to improved strategies for tree search.

preprint2014arXiv

Quantifying MCMC Exploration of Phylogenetic Tree Space

In order to gain an understanding of the effectiveness of phylogenetic Markov chain Monte Carlo (MCMC), it is important to understand how quickly the empirical distribution of the MCMC converges to the posterior distribution. In this paper we investigate this problem on phylogenetic tree topologies with a metric that is especially well suited to the task: the subtree prune-and-regraft (SPR) metric. This metric directly corresponds to the minimum number of MCMC rearrangements required to move between trees in common phylogenetic MCMC implementations. We develop a novel graph-based approach to analyze tree posteriors and find that the SPR metric is much more informative than simpler metrics that are unrelated to MCMC moves. In doing so we show conclusively that topological peaks do occur in Bayesian phylogenetic posteriors from real data sets as sampled with standard MCMC approaches, investigate the efficiency of Metropolis-coupled MCMC (MCMCMC) in traversing the valleys between peaks, and show that conditional clade distribution (CCD) can have systematic problems when there are multiple peaks.

preprint2013arXiv

Abundance-weighted phylogenetic diversity measures distinguish microbial community states and are robust to sampling depth

In microbial ecology studies, the most commonly used ways of investigating alpha (within-sample) diversity are either to apply count-only measures such as Simpson's index to Operational Taxonomic Unit (OTU) groupings, or to use classical phylogenetic diversity (PD), which is not abundance-weighted. Although alpha diversity measures that use abundance information in a phylogenetic framework do exist, but are not widely used within the microbial ecology community. The performance of abundance-weighted phylogenetic diversity measures compared to classical discrete measures has not been explored, and the behavior of these measures under rarefaction (sub-sampling) is not yet clear. In this paper we compare the ability of various alpha diversity measures to distinguish between different community states in the human microbiome for three different data sets. We also present and compare a novel one-parameter family of alpha diversity measures, BWPD_θ, that interpolates between classical phylogenetic diversity (PD) and an abundance-weighted extension of PD. Additionally, we examine the sensitivity of these phylogenetic diversity measures to sampling, via computational experiments and by deriving a closed form solution for the expectation of phylogenetic quadratic entropy under re-sampling. In all three of the datasets considered, an abundance-weighted measure is the best differentiator between community states. OTU-based measures, on the other hand, are less effective in distinguishing community types. In addition, abundance-weighted phylogenetic diversity measures are less sensitive to differing sampling intensity than their unweighted counterparts. Based on these results we encourage the use of abundance-weighted phylogenetic diversity measures, especially for cases such as microbial ecology where species delimitation is difficult.

preprint2013arXiv

The mean and variance of phylogenetic diversity under rarefaction

Phylogenetic diversity (PD) depends on sampling intensity, which complicates the comparison of PD between samples of different depth. One approach to dealing with differing sample depth for a given diversity statistic is to rarefy, which means to take a random subset of a given size of the original sample. Exact analytical formulae for the mean and variance of species richness under rarefaction have existed for some time but no such solution exists for PD. We have derived exact formulae for the mean and variance of PD under rarefaction. We show that these formulae are correct by comparing exact solution mean and variance to that calculated by repeated random (Monte Carlo) subsampling of a dataset of stem counts of woody shrubs of Toohey Forest, Queensland, Australia. We also demonstrate the application of the method using two examples: identifying hotspots of mammalian diversity in Australasian ecoregions, and characterising the human vaginal microbiome. There is a very high degree of correspondence between the analytical and random subsampling methods for calculating mean and variance of PD under rarefaction, although the Monte Carlo method requires a large number of random draws to converge on the exact solution for the variance. Rarefaction of mammalian PD of ecoregions in Australasia to a common standard of 25 species reveals very different rank orderings of ecoregions, indicating quite different hotspots of diversity than those obtained for unrarefied PD. The application of these methods to the vaginal microbiome shows that a classical score used to quantify bacterial vaginosis is correlated with the shape of the rarefaction curve. The analytical formulae for the mean and variance of PD under rarefaction are both exact and more efficient than repeated subsampling. Rarefaction of PD allows for many applications where comparisons of samples of different depth is required.

Frederick A. Matsen IV

What is connected

Connect this record

See the researcher in context

Building this map preview

15 published item(s)

Bayesian inference of antibody evolutionary dynamics using multitype branching processes

How trustworthy is your tree? Bayesian phylogenetic effective sample size through the lens of Monte Carlo error

Inference of B cell clonal families using heavy/light chain pairing information

Non-bifurcating phylogenetic tree inference via the adaptive LASSO

Using B cell receptor lineage structures to predict affinity

A Bayesian Phylogenetic Hidden Markov Model for B Cell Receptor Sequence Analysis

Chain Reduction Preserves the Unrooted Subtree Prune-and-Regraft Distance

Online Bayesian phylogenetic inference: theoretical foundations via Sequential Monte Carlo

The shape of the one-dimensional phylogenetic likelihood function

Consistency of VDJ rearrangement and substitution parameters enables accurate B cell receptor sequence annotation

Quantifying evolutionary constraints on B cell affinity maturation

Ricci-Ollivier Curvature of the Rooted Phylogenetic Subtree-Prune-Regraft Graph

Quantifying MCMC Exploration of Phylogenetic Tree Space

Abundance-weighted phylogenetic diversity measures distinguish microbial community states and are robust to sampling depth

The mean and variance of phylogenetic diversity under rarefaction