Source author record

Sebastien Roch

Sebastien Roch appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

math.PR; math.ST; Populations and Evolution; Statistics Theory; Computational Engineering, Finance, and Science; Data Structures and Algorithms; Machine Learning; Social and Information Networks; Quantitative Methods; Computational Complexity; Computer Science and Game Theory; math.CA; math.CO; Networking and Internet Architecture

Catalog footprint

What is connected

26 works

14 topics

4 close collaborators

preprint, 2022, arXiv

Expanding the class of global objective functions for dissimilarity-based hierarchical clustering

Recent work on dissimilarity-based hierarchical clustering has led to the introduction of global objective functions for this classical problem. Several standard approaches, such as average linkage, as well as some new heuristics have been shown to provide approximation guarantees. Here we introduce a broad new class of objective functions which satisfy desirable properties studied in prior work. Many common agglomerative and divisive clustering methods are shown to be greedy algorithms for these objectives, which are inspired by related concepts in phylogenetics.
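
Average linkage, one of the standard approaches named above, can be sketched as a greedy agglomerative procedure. This is a minimal illustration, not the objective functions introduced in the paper: the function name `average_linkage` and the list-of-lists input format are assumptions made for the example.

```python
def average_linkage(d):
    """Greedy agglomerative clustering on a symmetric dissimilarity matrix d.

    Repeatedly merges the two clusters with the smallest average pairwise
    dissimilarity. Returns the merges as (average dissimilarity, members).
    """
    clusters = {i: (i,) for i in range(len(d))}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for pos, a in enumerate(keys):
            for b in keys[pos + 1:]:
                total = sum(d[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        avg, a, b = best
        merged = tuple(sorted(clusters.pop(a) + clusters.pop(b)))
        clusters[a] = merged  # reuse one of the old keys for the merged cluster
        merges.append((avg, merged))
    return merges


# Hypothetical toy input: points 0 and 1 are close, point 2 is far from both.
example = [[0, 1, 4], [1, 0, 5], [4, 5, 0]]
example_merges = average_linkage(example)
```

The quadratic scan over cluster pairs keeps the sketch short; it is not meant to be efficient.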

preprint, 2022, arXiv

Impossibility of phylogeny reconstruction from $k$-mer counts

We consider phylogeny estimation under a two-state model of sequence evolution by site substitution on a tree. In the asymptotic regime where the sequence lengths tend to infinity, we show that for any fixed $k$ no statistically consistent phylogeny estimation is possible from $k$-mer counts over the full leaf sequences alone. Formally, we establish that the joint distributions of $k$-mer counts over the entire leaf sequences on two distinct trees have total variation distance bounded away from $1$ as the sequence length tends to infinity. Our impossibility result implies that statistical consistency requires more sophisticated use of $k$-mer count information, such as block techniques developed in previous theoretical work.
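
For readers unfamiliar with the summary statistic, the $k$-mer counts of a single sequence can be computed as below. This is only a sketch of the statistic itself, assuming binary sequences stored as strings; the helper name `kmer_counts` is hypothetical and nothing here reproduces the paper's argument.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every length-k window of seq, overlapping windows included."""
    if k <= 0:
        raise ValueError("k must be positive")
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


# Two-state example sequence (an assumption made for illustration).
counts = kmer_counts("0110100", 2)
```

The counts discard where each window occurs along the sequence, which is the information that block-based techniques partly retain.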

preprint, 2022, arXiv

Pairwise sequence alignment at arbitrarily large evolutionary distance

Ancestral sequence reconstruction is a key task in computational biology. It consists in inferring a molecular sequence at an ancestral species of a known phylogeny, given descendant sequences at the tips of the tree. In addition to its many biological applications, it has played a key role in elucidating the statistical performance of phylogeny estimation methods. Here we establish a formal connection to another important bioinformatics problem, multiple sequence alignment, where one attempts to best align a collection of molecular sequences under some mismatch penalty score by inserting gaps. Our result is counter-intuitive: we show that perfect pairwise sequence alignment with high probability is possible in principle at arbitrarily large evolutionary distances, provided the phylogeny is known and dense enough. We use techniques from ancestral sequence reconstruction in the taxon-rich setting together with the probabilistic analysis of sequence evolution models involving insertions and deletions.

preprint, 2020, arXiv

Species tree estimation under joint modeling of coalescence and duplication: sample complexity of quartet methods

We consider species tree estimation under a standard stochastic model of gene tree evolution that incorporates incomplete lineage sorting (as modeled by a coalescent process) and gene duplication and loss (as modeled by a branching process). Through a probabilistic analysis of the model, we derive sample complexity bounds for widely used quartet-based inference methods that highlight the effect of the duplication and loss rates in both subcritical and supercritical regimes.

preprint, 2020, arXiv

Sufficient condition for root reconstruction by parsimony on binary trees with general weights

We consider the problem of inferring an ancestral state from observations at the leaves of a tree, assuming the state evolves along the tree according to a two-state symmetric Markov process. We establish a general branching rate condition under which maximum parsimony, a common reconstruction method requiring only the knowledge of the tree, succeeds better than random guessing uniformly in the depth of the tree. We thereby generalize previous results of Zhang et al. (2010) and Gascuel and Steel (2010). Our results apply to both deterministic and i.i.d. edge weights.

preprint, 2017, arXiv

Generalized least squares can overcome the critical threshold in respondent-driven sampling

In order to sample marginalized and/or hard-to-reach populations, respondent-driven sampling (RDS) and similar techniques reach their participants via peer referral. Under a Markov model for RDS, previous research has shown that if the typical participant refers too many contacts, then the variance of common estimators does not decay like $O(n^{-1})$, where $n$ is the sample size. This implies that confidence intervals will be far wider than under a typical sampling design. Here we show that generalized least squares (GLS) can effectively reduce the variance of RDS estimates. In particular, a theoretical analysis indicates that the variance of the GLS estimator is $O(n^{-1})$. We then derive two classes of feasible GLS estimators. The first class is based upon a Degree Corrected Stochastic Blockmodel for the underlying social network. The second class is based upon a rank-two model. It might be of independent interest that in both model classes, the theoretical results show that it is possible to estimate the spectral properties of the population network from the sampled observations. Simulations on empirical social networks show that the feasible GLS (fGLS) estimators can have drastically smaller error and rarely increase the error. A diagnostic plot helps to identify where fGLS will aid estimation. The fGLS estimators continue to outperform standard estimators even when they are built from a misspecified model and when there is preferential recruitment.

preprint, 2016, arXiv

Phase transition on the convergence rate of parameter estimation under an Ornstein-Uhlenbeck diffusion on a tree

Diffusion processes on trees are commonly used in evolutionary biology to model the joint distribution of continuous traits, such as body mass, across species. Estimating the parameters of such processes from tip values presents challenges because of the intrinsic correlation between the observations produced by the shared evolutionary history, thus violating the standard independence assumption of large-sample theory. For instance, Ho and Ané recently proved that the mean (also known in this context as selection optimum) of an Ornstein-Uhlenbeck process on a tree cannot be estimated consistently from an increasing number of tip observations if the tree height is bounded. Here, using a fruitful connection to the so-called reconstruction problem in probability theory, we study the convergence rate of parameter estimation in the unbounded height case. For the mean of the process, we provide a necessary and sufficient condition for the consistency of the maximum likelihood estimator (MLE) and establish a phase transition on its convergence rate in terms of the growth of the tree. In particular we show that a loss of $\sqrt{n}$-consistency (i.e., the variance of the MLE becomes $\Omega(n^{-1})$, where $n$ is the number of tips) occurs when the tree growth is larger than a threshold related to the phase transition of the reconstruction problem. For the covariance parameters, we give a novel, efficient estimation method which achieves $\sqrt{n}$-consistency under natural assumptions on the tree.

preprint, 2014, arXiv

Data Requirement for Phylogenetic Inference from Multiple Loci: A New Distance Method

We consider the problem of estimating the evolutionary history of a set of species (phylogeny or species tree) from several genes. It is known that the evolutionary history of individual genes (gene trees) might be topologically distinct from each other and from the underlying species tree, possibly confounding phylogenetic analysis. A further complication in practice is that one has to estimate gene trees from molecular sequences of finite length. We provide the first full data-requirement analysis of a species tree reconstruction method that takes into account estimation errors at the gene level. Under that criterion, we also devise a novel reconstruction algorithm that provably improves over all previous methods in a regime of interest.

preprint, 2013, arXiv

Alignment-free phylogenetic reconstruction: Sample complexity via a branching process analysis

We present an efficient phylogenetic reconstruction algorithm allowing insertions and deletions which provably achieves a sequence-length requirement (or sample complexity) growing polynomially in the number of taxa. Our algorithm is distance-based, that is, it relies on pairwise sequence comparisons. More importantly, our approach largely bypasses the difficult problem of multiple sequence alignment.

preprint, 2012, arXiv

An analytical comparison of coalescent-based multilocus methods: The three-taxon case

Incomplete lineage sorting (ILS) is a common source of gene tree incongruence in multilocus analyses. A large number of methods have been developed to infer species trees in the presence of ILS. Here we provide a mathematical analysis of several coalescent-based methods. Our analysis is performed on a three-taxon species tree and assumes that the gene trees are correctly reconstructed along with their branch lengths.

preprint, 2012, arXiv

Phylogenetic mixtures: Concentration of measure in the large-tree limit

The reconstruction of phylogenies from DNA or protein sequences is a major task of computational evolutionary biology. Common phenomena, notably variations in mutation rates across genomes and incongruences between gene lineage histories, often make it necessary to model molecular data as originating from a mixture of phylogenies. Such mixed models play an increasingly important role in practice. Using concentration of measure techniques, we show that mixtures of large trees are typically identifiable. We also derive sequence-length requirements for high-probability reconstruction.

preprint, 2012, arXiv

Recovering the tree-like trend of evolution despite extensive lateral genetic transfer: A probabilistic analysis

Lateral gene transfer (LGT) is a common mechanism of non-vertical evolution where genetic material is transferred between two more or less distantly related organisms. It is particularly common in bacteria where it contributes to adaptive evolution with important medical implications. In evolutionary studies, LGT has been shown to create widespread discordance between gene trees as genomes become mosaics of gene histories. In particular, the Tree of Life has been questioned as an appropriate representation of bacterial evolutionary history. Nevertheless, a common hypothesis is that prokaryotic evolution is primarily tree-like, but that the underlying trend is obscured by LGT. Extensive empirical work has sought to extract a common tree-like signal from conflicting gene trees. Here we give a probabilistic perspective on the problem of recovering the tree-like trend despite LGT. Under a model of randomly distributed LGT, we show that the species phylogeny can be reconstructed even in the presence of surprisingly many (almost linear number of) LGT events per gene tree. Our results, which are optimal up to logarithmic factors, are based on the analysis of a robust, computationally efficient reconstruction method and provide insight into the design of such methods. Finally, we show that our results have implications for the discovery of highways of gene sharing.

preprint, 2011, arXiv

Identifiability and inference of non-parametric rates-across-sites models on large-scale phylogenies

Mutation rate variation across loci is well known to cause difficulties, notably identifiability issues, in the reconstruction of evolutionary trees from molecular sequences. Here we introduce a new approach for estimating general rates-across-sites models. Our results imply, in particular, that large phylogenies are typically identifiable under rate variation. We also derive sequence-length requirements for high-probability reconstruction. Our main contribution is a novel algorithm that clusters sites according to their mutation rate. Following this site clustering step, standard reconstruction techniques can be used to recover the phylogeny. Our results rely on a basic insight: that, for large trees, certain site statistics experience concentration-of-measure phenomena.

preprint, 2011, arXiv

Phase Transition in Distance-Based Phylogeny Reconstruction

We introduce a new distance-based phylogeny reconstruction technique which provably achieves, at sufficiently short branch lengths, a logarithmic sequence-length requirement, improving significantly over previous polynomial bounds for distance-based methods and matching existing results for general methods. The technique is based on an averaging procedure that implicitly reconstructs ancestral sequences. By the same token, we extend previous results on phase transitions in phylogeny reconstruction to general time-reversible models. More precisely, we show that in the so-called Kesten-Stigum zone (roughly, a region of the parameter space where ancestral sequences are well approximated by "linear combinations" of the observed sequences) sequences of length $O(\log n)$ suffice for reconstruction when branch lengths are discretized. Here $n$ is the number of extant species. Our results challenge, to some extent, the conventional wisdom that estimates of evolutionary distances alone carry significantly less information about phylogenies than full sequence datasets.

preprint, 2011, arXiv

Robust estimation of latent tree graphical models: Inferring hidden states with inexact parameters

Latent tree graphical models are widely used in computational biology, signal and image processing, and network tomography. Here we design a new, efficient estimation procedure for latent tree models, including Gaussian and discrete, reversible models, that significantly improves on previous sample requirement bounds. Our techniques are based on a new hidden state estimator which is robust to inaccuracies in estimated parameters. More precisely, we prove that latent tree models can be estimated with high probability in the so-called Kesten-Stigum regime with $O(\log^2 n)$ samples where $n$ is the number of nodes.

preprint, 2010, arXiv

On the inference of large phylogenies with long branches: How long is too long?

Recent work has highlighted deep connections between sequence-length requirements for high-probability phylogeny reconstruction and the related problem of the estimation of ancestral sequences. In [Daskalakis et al.'09], building on the work of [Mossel'04], a tight sequence-length requirement was obtained for the CFN model. In particular, the required sequence length for high-probability reconstruction was shown to undergo a sharp transition (from $O(\log n)$ to $\mathrm{poly}(n)$, where $n$ is the number of leaves) at the "critical" branch length $\critmlq$ (if it exists) of the ancestral reconstruction problem. Here we consider the GTR model. For this model, recent results of [Roch'09] show that the tree can be accurately reconstructed with sequences of length $O(\log(n))$ when the branch lengths are below $\critksq$, known as the Kesten-Stigum (KS) bound. Although for the CFN model $\critmlq = \critksq$, it is known that for the more general GTR models one has $\critmlq \geq \critksq$ with a strict inequality in many cases. Here, we show that this phenomenon also holds for phylogenetic reconstruction by exhibiting a family of symmetric models $Q$ and a phylogenetic reconstruction algorithm which recovers the tree from $O(\log n)$-length sequences for some branch lengths in the range $(\critksq,\critmlq)$. Second, we prove that phylogenetic reconstruction under GTR models requires a polynomial sequence-length for branch lengths above $\critmlq$.

preprint, 2009, arXiv

Evolutionary Trees and the Ising Model on the Bethe Lattice: a Proof of Steel's Conjecture

A major task of evolutionary biology is the reconstruction of phylogenetic trees from molecular data. The evolutionary model is given by a Markov chain on a tree. Given samples from the leaves of the Markov chain, the goal is to reconstruct the leaf-labelled tree. It is well known that in order to reconstruct a tree on $n$ leaves, sample sequences of length $\Omega(\log n)$ are needed. It was conjectured by M. Steel that for the CFN/Ising evolutionary model, if the mutation probability on all edges of the tree is less than $p^{\ast} = (\sqrt{2}-1)/2^{3/2}$, then the tree can be recovered from sequences of length $O(\log n)$. The value $p^{\ast}$ is given by the transition point for the extremality of the free Gibbs measure for the Ising model on the binary tree. Steel's conjecture was proven by the second author in the special case where the tree is "balanced." The second author also proved that if all edges have mutation probability larger than $p^{\ast}$ then the length needed is $n^{\Omega(1)}$. Here we show that Steel's conjecture holds true for general trees by giving a reconstruction algorithm that recovers the tree from $O(\log n)$-length sequences when the mutation probabilities are discretized and less than $p^\ast$. Our proof and results demonstrate that extremality of the free Gibbs measure on the infinite binary tree, which has been studied before in probability, statistical physics and computer science, determines how distinguishable Gibbs measures on finite binary trees are.
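
The value of $p^{\ast}$ quoted above can be checked with a short calculation. The sketch assumes the usual parametrization, which the abstract does not spell out: an edge with mutation probability $p$ has second eigenvalue $\theta = 1 - 2p$, and the transition on the binary tree sits at $2\theta^2 = 1$. The symbol $\theta$ is introduced here for illustration only.

```latex
\theta = 1 - 2p, \qquad 2\theta^{2} = 1
\;\Longrightarrow\;
\theta^{\ast} = \frac{1}{\sqrt{2}}, \qquad
p^{\ast} = \frac{1 - \theta^{\ast}}{2}
         = \frac{\sqrt{2} - 1}{2\sqrt{2}}
         = \frac{\sqrt{2} - 1}{2^{3/2}}
         \approx 0.1464 .
```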

preprint, 2009, arXiv

Global Alignment of Molecular Sequences via Ancestral State Reconstruction

Molecular phylogenetic techniques do not generally account for such common evolutionary events as site insertions and deletions (known as indels). Instead, tree building algorithms and ancestral state inference procedures typically rely on substitution-only models of sequence evolution. In practice these methods are extended beyond this simplified setting with the use of heuristics that produce global alignments of the input sequences, an important problem which has no rigorous model-based solution. In this paper we consider a new version of the multiple sequence alignment problem in the context of stochastic indel models. More precisely, we introduce the following trace reconstruction problem on a tree (TRPT): a binary sequence is broadcast through a tree channel where we allow substitutions, deletions, and insertions; we seek to reconstruct the original sequence from the sequences received at the leaves of the tree. We give a recursive procedure for this problem with strong reconstruction guarantees at low mutation rates, providing also an alignment of the sequences at the leaves of the tree. The TRPT problem without indels has been studied in previous work (Mossel 2004, Daskalakis et al. 2006) as a bootstrapping step towards obtaining optimal phylogenetic reconstruction methods. The present work sets up a framework for extending these works to evolutionary models with indels.

preprint, 2009, arXiv

Network Delay Inference from Additive Metrics

We demonstrate the use of computational phylogenetic techniques to solve a central problem in inferential network monitoring. More precisely, we design a novel algorithm for multicast-based delay inference, i.e. the problem of reconstructing the topology and delay characteristics of a network from end-to-end delay measurements on network paths. Our inference algorithm is based on additive metric techniques widely used in phylogenetics. It runs in polynomial time and requires a sample of size only $\mathrm{poly}(\log n)$.

preprint, 2009, arXiv

On the Submodularity of Influence in Social Networks

We prove and extend a conjecture of Kempe, Kleinberg, and Tardos (KKT) on the spread of influence in social networks. A social network can be represented by a directed graph where the nodes are individuals and the edges indicate a form of social relationship. A simple way to model the diffusion of ideas, innovative behavior, or "word-of-mouth" effects on such a graph is to consider an increasing process of "infected" (or active) nodes: each node becomes infected once an activation function of the set of its infected neighbors crosses a certain threshold value. Such a model was introduced by KKT, where the authors also impose several natural assumptions: the threshold values are (uniformly) random; and the activation functions are monotone and submodular. For an initial set of active nodes $S$, let $\sigma(S)$ denote the expected number of active nodes at termination. Here we prove a conjecture of KKT: we show that the function $\sigma(S)$ is submodular under the assumptions above. We prove the same result for the expected value of any monotone, submodular function of the set of active nodes at termination.
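
The threshold process described above can be sketched with a Monte Carlo estimate of $\sigma(S)$. The example uses a linear activation function (the sum of incoming weights from active neighbors), which is one simple monotone and submodular choice rather than the general setting of the paper. The names `spread` and `sigma`, the graph encoding and the toy graph are all assumptions made for illustration.

```python
import random


def spread(in_weights, seeds, rng):
    """Run one threshold process and return the final set of active nodes.

    in_weights[v] maps each in-neighbor u of v to a nonnegative weight;
    the weights into v are assumed to sum to at most 1.
    """
    # Uniform thresholds in (0, 1], so a node with no active in-neighbors never fires.
    thresholds = {v: 1.0 - rng.random() for v in in_weights}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, nbrs in in_weights.items():
            if v in active:
                continue
            if sum(w for u, w in nbrs.items() if u in active) >= thresholds[v]:
                active.add(v)
                changed = True
    return active


def sigma(in_weights, seeds, runs=2000, seed=0):
    """Monte Carlo estimate of the expected number of active nodes at termination."""
    rng = random.Random(seed)
    return sum(len(spread(in_weights, seeds, rng)) for _ in range(runs)) / runs


# Hypothetical toy graph: a -> b with weight 1.0, b -> c with weight 0.5.
toy = {"a": {}, "b": {"a": 1.0}, "c": {"b": 0.5}}
estimate = sigma(toy, {"a"})  # the exact value for this graph is 2.5
```

On the toy graph, seeding `a` always activates `b` and activates `c` half the time, which gives the exact value 2.5 that the estimate should approach.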

preprint, 2009, arXiv

Phylogenies without Branch Bounds: Contracting the Short, Pruning the Deep

We introduce a new phylogenetic reconstruction algorithm which, unlike most previous rigorous inference techniques, does not rely on assumptions regarding the branch lengths or the depth of the tree. The algorithm returns a forest which is guaranteed to contain all edges that are: 1) sufficiently long and 2) sufficiently close to the leaves. How much of the true tree is recovered depends on the sequence length provided. The algorithm is distance-based and runs in polynomial time.

preprint, 2009, arXiv

Reconstruction on Trees: Exponential Moment Bounds for Linear Estimators

Consider a Markov chain $(\xi_v)_{v \in V} \in [k]^V$ on the infinite $b$-ary tree $T = (V,E)$ with irreducible edge transition matrix $M$, where $b \geq 2$, $k \geq 2$ and $[k] = \{1,\ldots,k\}$. We denote by $L_n$ the level-$n$ vertices of $T$. Assume $M$ has a real second-largest (in absolute value) eigenvalue $\lambda$ with corresponding real eigenvector $\nu \neq 0$. Letting $\sigma_v = \nu_{\xi_v}$, we consider the following root-state estimator, which was introduced by Mossel and Peres (2003) in the context of the "reconstruction problem" on trees: \begin{equation*} S_n = (b\lambda)^{-n} \sum_{x\in L_n} \sigma_x. \end{equation*} As noted by Mossel and Peres, when $b\lambda^2 > 1$ (the so-called Kesten-Stigum reconstruction phase) the quantity $S_n$ has uniformly bounded variance. Here, we give bounds on the moment-generating functions of $S_n$ and $S_n^2$ when $b\lambda^2 > 1$. Our results have implications for the inference of evolutionary trees.
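
The estimator $S_n$ can be simulated in the simplest special case. The sketch below assumes $k = 2$ with a binary symmetric channel of flip probability `p` on every edge, so that the second eigenvalue is $1 - 2p$ and the eigenvector can be taken as $(+1, -1)$. The function name `simulate_S_n` and this choice of channel are assumptions for illustration, not the general setting of the paper.

```python
import random


def simulate_S_n(b, p, n, root_spin, rng):
    """Sample S_n = (b * lam)^(-n) * (sum of spins at level n) on the b-ary tree.

    Spins take values +1/-1 and each child copies its parent, flipping with
    probability p, so the second eigenvalue of the channel is lam = 1 - 2p.
    """
    lam = 1.0 - 2.0 * p
    level = [root_spin]
    for _ in range(n):
        # Every vertex in the current level produces b independent children.
        level = [s if rng.random() >= p else -s
                 for s in level for _child in range(b)]
    return sum(level) / (b * lam) ** n


# Example parameters in the Kesten-Stigum phase: b * lam^2 = 2 * 0.8^2 = 1.28 > 1.
rng = random.Random(0)
samples = [simulate_S_n(2, 0.1, 6, 1, rng) for _ in range(2000)]
mean_estimate = sum(samples) / len(samples)  # should be close to the root spin, +1
```

The normalization by $(b\lambda)^{-n}$ makes the conditional mean of $S_n$ equal to the root spin, which is why the sample mean is compared with $+1$.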

preprint, 2009, arXiv

Sequence-Length Requirement of Distance-Based Phylogeny Reconstruction: Breaking the Polynomial Barrier

We introduce a new distance-based phylogeny reconstruction technique which provably achieves, at sufficiently short branch lengths, a polylogarithmic sequence-length requirement, improving significantly over previous polynomial bounds for distance-based methods. The technique is based on an averaging procedure that implicitly reconstructs ancestral sequences. By the same token, we extend previous results on phase transitions in phylogeny reconstruction to general time-reversible models. More precisely, we show that in the so-called Kesten-Stigum zone (roughly, a region of the parameter space where ancestral sequences are well approximated by "linear combinations" of the observed sequences) sequences of length $\mathrm{poly}(\log n)$ suffice for reconstruction when branch lengths are discretized. Here $n$ is the number of extant species. Our results challenge, to some extent, the conventional wisdom that estimates of evolutionary distances alone carry significantly less information about phylogenies than full sequence datasets.

preprint, 2007, arXiv

First to Market is not Everything: an Analysis of Preferential Attachment with Fitness

In this paper, we provide a rigorous analysis of preferential attachment with fitness, a random graph model introduced by Bianconi and Barabási. Depending on the shape of the fitness distribution, we observe three distinct phases: a first-mover-advantage phase, a fit-get-richer phase and an innovation-pays-off phase.
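
A small simulation can make the model concrete. The sketch assumes the common formulation in which each new vertex draws an independent fitness and attaches a single edge to an existing vertex chosen with probability proportional to fitness times degree; the abstract does not state these details. The function name `grow` and the fitness sampler are hypothetical.

```python
import random


def grow(n, fitness_sampler, rng):
    """Grow a tree on n vertices by preferential attachment with fitness.

    Returns the lists of vertex fitnesses and vertex degrees.
    """
    # Start from two vertices joined by a single edge.
    fitness = [fitness_sampler(rng), fitness_sampler(rng)]
    degree = [1, 1]
    for _ in range(2, n):
        weights = [f * d for f, d in zip(fitness, degree)]
        target = rng.choices(range(len(degree)), weights=weights, k=1)[0]
        degree[target] += 1
        fitness.append(fitness_sampler(rng))
        degree.append(1)
    return fitness, degree


# Fitness uniform on (0, 1]; the distribution is an arbitrary choice for illustration.
fitness, degree = grow(200, lambda r: 1.0 - r.random(), random.Random(0))
```

Changing the fitness sampler is the natural way to explore how the shape of the fitness distribution affects which vertices accumulate edges.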

preprint, 2007, arXiv

Incomplete Lineage Sorting: Consistent Phylogeny Estimation From Multiple Loci

We introduce a simple algorithm for reconstructing phylogenies from multiple gene trees in the presence of incomplete lineage sorting, that is, when the topology of the gene trees may differ from that of the species tree. We show that our technique is statistically consistent under standard stochastic assumptions, that is, it returns the correct tree given sufficiently many unlinked loci. We also show that it can tolerate moderate estimation errors.

preprint, 2006, arXiv

The Kesten-Stigum Reconstruction Bound Is Tight for Roughly Symmetric Binary Channels

We establish the exact threshold for the reconstruction problem for a binary asymmetric channel on the $b$-ary tree, provided that the asymmetry is sufficiently small. This is the first exact reconstruction threshold obtained in roughly a decade. We discuss the implications of our result for Glauber dynamics, phylogenetic reconstruction, and so-called "replica symmetry breaking" in spin glasses and random satisfiability problems.
