Source author record

Mike Steel

Mike Steel appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Populations and Evolution; Molecular Networks; math.CO; math.PR; Data Structures and Algorithms; Quantitative Methods; Discrete Mathematics; math.ST; Neurons and Cognition; physics.soc-ph; Statistics Theory

Catalog footprint

What is connected

55 works

11 topics

4 close collaborators


preprint 2022 arXiv

Counting and optimising maximum phylogenetic diversity sets

In conservation biology, phylogenetic diversity (PD) provides a way to quantify the impact of the current rapid extinction of species on the evolutionary `Tree of Life'. This approach recognises that extinction not only removes species but also the branches of the tree on which unique features shared by the extinct species arose. In this paper, we investigate three questions that are relevant to PD. The first asks how many sets of species of given size $k$ preserve the maximum possible amount of PD in a given tree. The number of such maximum PD sets can be very large, even for moderate-sized phylogenies. We provide a combinatorial characterisation of maximum PD sets, focusing on the setting where the branch lengths are ultrametric (e.g. proportional to time). This leads to a polynomial-time algorithm for calculating the number of maximum PD sets of size $k$ by applying a generating function; we also investigate the types of tree shapes that harbour the most (or fewest) maximum PD sets of size $k$. Our second question concerns optimising a linear function on the species (regarded as leaves of the phylogenetic tree) across all the maximum PD sets of a given size. Using the characterisation result from the first question, we show how this optimisation problem can be solved in polynomial time, even though the number of maximum PD sets can grow exponentially. Our third question considers a dual problem: If $k$ species were to become extinct, then what is the largest possible {\em loss} of PD in the resulting tree? For this question, we describe a polynomial-time solution based on dynamic programming.
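To make the first question in this abstract concrete, the following is a minimal brute-force sketch that counts maximum PD sets of size $k$ on a toy ultrametric tree. It is not the polynomial-time generating-function method described in the abstract; the tree, the names `TREE`, `LEAVES`, `pd_score` and `max_pd_sets`, and the use of rooted PD (edges on root-to-leaf paths) are assumptions made for illustration.

```python
from itertools import combinations

# Hypothetical toy tree: child -> (parent, branch length); "root" has no entry.
# It is ultrametric: every leaf lies at distance 2 from the root.
TREE = {
    "u": ("root", 1.0),
    "a": ("u", 1.0),
    "b": ("u", 1.0),
    "c": ("root", 2.0),
}
LEAVES = ["a", "b", "c"]


def pd_score(tree, species):
    """Rooted PD: total length of the edges on root-to-leaf paths of the chosen leaves."""
    edges = set()
    for leaf in species:
        node = leaf
        while node in tree:      # walk up until the root is reached
            edges.add(node)      # an edge is identified by its child endpoint
            node = tree[node][0]
    return sum(tree[e][1] for e in edges)


def max_pd_sets(tree, leaves, k):
    """Enumerate every k-subset of leaves and keep those attaining the maximum PD."""
    scored = [(pd_score(tree, s), s) for s in combinations(leaves, k)]
    best = max(score for score, _ in scored)
    return best, [set(s) for score, s in scored if abs(score - best) < 1e-12]


best, sets = max_pd_sets(TREE, LEAVES, 2)
print(best, sets)  # 4.0, attained by {a, c} and {b, c}
```

Exhaustive enumeration is exponential in the number of leaves, which is exactly why a characterisation-based counting method is of interest for larger trees.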

preprint 2021 arXiv

Combinatorial and stochastic properties of ranked tree-child networks

Tree-child networks are a recently-described class of directed acyclic graphs that have risen to prominence in phylogenetics (the study of evolutionary trees and networks). Although these networks have a number of attractive mathematical properties, many combinatorial questions concerning them remain intractable. In this paper, we show that endowing these networks with a biologically relevant ranking structure yields mathematically tractable objects, which we term ranked tree-child networks (RTCNs). We explain how to derive exact and explicit combinatorial results concerning the enumeration and generation of these networks. We also explore probabilistic questions concerning the properties of RTCNs when they are sampled uniformly at random. These questions include the lengths of random walks between the root and leaves (both from the root to the leaves and from a leaf to the root); the distribution of the number of cherries in the network; and sampling RTCNs conditional on displaying a given tree. We also formulate a conjecture regarding the scaling limit of the process that counts the number of lineages in the ancestry of a leaf. The main idea in this paper, namely using ranking as a way to achieve combinatorial tractability, may also extend to other classes of networks.

preprint 2020 arXiv

Modeling a Cognitive Transition at the Origin of Cultural Evolution using Autocatalytic Networks

Autocatalytic networks have been used to model the emergence of self-organizing structure capable of sustaining life and undergoing biological evolution. Here, we model the emergence of cognitive structure capable of undergoing cultural evolution. Mental representations of knowledge and experiences play the role of catalytic molecules, and interactions amongst them (e.g., the forging of new associations) play the role of reactions, and result in representational redescription. The approach tags mental representations with their source, i.e., whether they were acquired through social learning, individual learning (of pre-existing information), or creative thought (resulting in the generation of new information). This makes it possible to model how cognitive structure emerges, and to trace lineages of cumulative culture step by step. We develop a formal representation of the cultural transition from Oldowan to Acheulean tool technology using Reflexively Autocatalytic and Food set generated (RAF) networks. Unlike more primitive Oldowan stone tools, the Acheulean hand axe required not only the capacity to envision and bring into being something that did not yet exist, but hierarchically structured thought and action, and the generation of new mental representations: the concepts EDGING, THINNING, SHAPING, and a meta-concept, HAND AXE. We show how this constituted a key transition towards the emergence of semantic networks that were self-organizing, self-sustaining, and autocatalytic, and discuss how such networks replicated through social interaction. The model provides a promising approach to unraveling one of the greatest anthropological mysteries: that of why development of the Acheulean hand axe was followed by over a million years of cultural stasis.

preprint 2019 arXiv

A class of phylogenetic networks reconstructable from ancestral profiles

Rooted phylogenetic networks provide an explicit representation of the evolutionary history of a set $X$ of sampled species. In contrast to phylogenetic trees which show only speciation events, networks can also accommodate reticulate processes (for example, hybrid evolution, endosymbiosis, and lateral gene transfer). A major goal in systematic biology is to infer evolutionary relationships, and while phylogenetic trees can be uniquely determined from various simple combinatorial data on $X$, for networks the reconstruction question is much more subtle. Here we ask when a network can be uniquely reconstructed from its `ancestral profile' (the number of paths from each ancestral vertex to each element in $X$). We show that reconstruction holds (even within the class of all networks) for a class of networks we call `orchard networks', and we provide a polynomial-time algorithm for reconstructing any orchard network from its ancestral profile. Our approach relies on establishing a structural theorem for orchard networks, which also provides for a fast (polynomial-time) algorithm to test if any given network is of orchard type. Since the class of orchard networks includes tree-sibling tree-consistent networks and tree-child networks, our results generalise reconstruction results from 2008 and 2009. Orchard networks allow for an unbounded number $k$ of reticulation vertices, in contrast to tree-sibling tree-consistent networks and tree-child networks for which $k$ is at most $2|X|-4$ and $|X|-1$, respectively.
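The `ancestral profile' defined in this abstract (the number of paths from each ancestral vertex to each element of $X$) can be computed by a short memoised recursion over the network. The sketch below only illustrates that forward computation, not the reconstruction algorithm of the paper; the example network and the names `NETWORK` and `ancestral_profile` are hypothetical.

```python
from functools import lru_cache

# Hypothetical small network: vertex -> list of children; leaves have no children.
NETWORK = {
    "r": ["p", "q"],
    "p": ["x", "h"],
    "q": ["h", "z"],
    "h": ["y"],          # reticulation vertex with parents p and q
    "x": [], "y": [], "z": [],
}


def ancestral_profile(network):
    """Map each non-leaf vertex to its tuple of path counts to the sorted leaves."""
    leaves = sorted(v for v, children in network.items() if not children)

    @lru_cache(maxsize=None)
    def paths(vertex, leaf):
        # Number of directed paths from vertex to leaf in the acyclic network.
        if vertex == leaf:
            return 1
        return sum(paths(child, leaf) for child in network[vertex])

    return {v: tuple(paths(v, leaf) for leaf in leaves)
            for v, children in network.items() if children}


profile = ancestral_profile(NETWORK)
print(profile["r"])  # (1, 2, 1): two paths from the root reach leaf y
```

Memoisation keeps the work polynomial in the size of the network even though the number of paths themselves can grow quickly with the number of reticulations.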

preprint 2016 arXiv

Autocatalytic sets in polymer networks with variable catalysis distributions

All living systems -- from the origin of life to modern cells -- rely on a set of biochemical reactions that are simultaneously self-sustaining and autocatalytic. This notion of an autocatalytic set has been formalized graph-theoretically (as `RAF'), leading to mathematical results and polynomial-time algorithms that have been applied to simulated and real chemical reaction systems. In this paper, we investigate the emergence of autocatalytic sets in polymer models when the catalysis rate of each molecule type is drawn from some probability distribution. We show that although the average catalysis rate $f$ for RAFs to arise depends on this distribution, a universal linear upper and lower bound for $f$ (with increasing system size) still applies. However, the probability of the appearance (and size) of autocatalytic sets can vary widely, depending on the particular catalysis distribution. We use simulations to explore how tight the mathematical bounds are, and the reasons for the observed variations. We also investigate the impact of inhibition (where molecules can also inhibit reactions) on the emergence of autocatalytic sets, deriving new mathematical and algorithmic results.
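The graph-theoretic RAF notion referred to in this abstract can be illustrated with a small sketch. The loop below follows the commonly described reduction scheme of repeatedly discarding reactions whose reactants, or all of whose catalysts, cannot be built from the food set; it is a sketch under those assumptions, not the code used in the paper. The molecule names, the reaction table and the function names `closure` and `max_raf` are hypothetical.

```python
def closure(food, reactions):
    """Molecules producible from the food set using only the given reactions."""
    available = set(food)
    changed = True
    while changed:
        changed = False
        for reactants, products, _catalysts in reactions.values():
            if set(reactants) <= available and not set(products) <= available:
                available |= set(products)
                changed = True
    return available


def max_raf(food, reactions):
    """Discard unsupported reactions until the remaining set is stable."""
    current = dict(reactions)
    while True:
        available = closure(food, current)
        kept = {
            name: (reactants, products, catalysts)
            for name, (reactants, products, catalysts) in current.items()
            if set(reactants) <= available and set(catalysts) & available
        }
        if len(kept) == len(current):
            return kept          # may be empty: then no RAF exists
        current = kept


# Hypothetical system: name -> (reactants, products, catalysts).
FOOD = {"a", "b"}
REACTIONS = {
    "r1": (["a", "b"], ["ab"], ["ab"]),      # autocatalytic ligation
    "r2": (["ab", "a"], ["aba"], ["zz"]),    # catalyst zz is never produced
    "r3": (["aba", "b"], ["abab"], ["a"]),   # depends on the product of r2
}
print(sorted(max_raf(FOOD, REACTIONS)))  # ['r1']
```

In this example `r2` is removed in the first pass for lack of a catalyst, which in turn strands `r3` in the second pass, so the reduction needs more than one iteration.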

preprint 2016 arXiv

Phylogenetic mixtures and linear invariants for equal input models

The reconstruction of phylogenetic trees from molecular sequence data relies on modelling site substitutions by a Markov process, or a mixture of such processes. In general, allowing mixed processes can result in different tree topologies becoming indistinguishable from the data, even for infinitely long sequences. However, when the underlying Markov process supports linear phylogenetic invariants, then provided these are sufficiently informative, the identifiability of the tree topology can be restored. In this paper, we investigate a class of processes that support linear invariants once the stationary distribution is fixed, the `equal input model'. This model generalizes the `Felsenstein 1981' model (and thereby the Jukes--Cantor model) from four states to an arbitrary number of states (finite or infinite), and it can also be described by a `random cluster' process. We describe the structure and dimension of the vector spaces of phylogenetic mixtures and of linear invariants for any fixed phylogenetic tree (and for all trees -- the so called `model invariants'), on any number $n$ of leaves. We also provide a precise description of the space of mixtures and linear invariants for the special case of $n=4$ leaves. By combining techniques from discrete random processes and (multi-) linear algebra, our results build on a classic result that was first established by James Lake in 1987.

preprint 2015 arXiv

A consistency lemma in statistical phylogenetics

This short note provides a simple formal proof of a folklore result in statistical phylogenetics concerning the convergence of bootstrap support for a tree and its edges.

preprint 2015 arXiv

Bounds on the Expected Size of the Maximum Agreement Subtree

We prove polynomial upper and lower bounds on the expected size of the maximum agreement subtree of two random binary phylogenetic trees under both the uniform distribution and Yule-Harding distribution. This positively answers a question posed in earlier work. Determining tight upper and lower bounds remains an open problem.

preprint 2015 arXiv

Capturing a phylogenetic tree when the number of character states varies with the number of leaves

We show that for any two values $α, β>0 $ for which $α+β>1$ then there is a value $N$ so that for all $n \geq N$ the following holds. For any binary phylogenetic tree $T$ on $n$ leaves there is a set of $\lfloor n^α\rfloor$ characters that capture $T$, and for which each character takes at most $\lfloor n^β\rfloor$ distinct states. Here `capture' means that $T$ is the unique perfect phylogeny for these characters. Our short proof of this combinatorial result is based on the probabilistic method.

preprint 2015 arXiv

Circumstances in which parsimony but not compatibility will be provably misleading

Phylogenetic methods typically rely on an appropriate model of how data evolved in order to infer an accurate phylogenetic tree. For molecular data, standard statistical methods have provided an effective strategy for extracting phylogenetic information from aligned sequence data when each site (character) is subject to a common process. However, for other types of data (e.g. morphological data), characters can be too ambiguous, homoplastic or saturated to develop models that are effective at capturing the underlying process of change. To address this, we examine the properties of a classic but neglected method for inferring splits in an underlying tree, namely, maximum compatibility. By adopting a simple and extreme model in which each character either fits perfectly on some tree, or is entirely random (but it is not known which class any character belongs to) we are able to derive exact and explicit formulae regarding the performance of maximum compatibility. We show that this method is able to identify a set of non-trivial homoplasy-free characters, when the number $n$ of taxa is large, even when the number of random characters is large. By contrast, we show that a method that makes more uniform use of all the data --- maximum parsimony --- can provably estimate trees in which {\em none} of the original homoplasy-free characters support splits.

preprint 2015 arXiv

Folding and unfolding phylogenetic trees and networks

Phylogenetic networks are rooted, labelled directed acyclic graphs which are commonly used to represent reticulate evolution. There is a close relationship between phylogenetic networks and multi-labelled trees (MUL-trees). Indeed, any phylogenetic network $N$ can be 'unfolded' to obtain a MUL-tree $U(N)$ and, conversely, a MUL-tree $T$ can in certain circumstances be 'folded' to obtain a phylogenetic network $F(T)$ that exhibits $T$. In this paper, we study properties of the operations $U$ and $F$ in more detail. In particular, we introduce the class of stable networks, phylogenetic networks $N$ for which $F(U(N))$ is isomorphic to $N$, characterise such networks, and show that they are related to the well-known class of tree-sibling networks. We also explore how the concept of displaying a tree in a network $N$ can be related to displaying the tree in the MUL-tree $U(N)$. To do this, we develop a phylogenetic analogue of graph fibrations. This allows us to view $U(N)$ as the analogue of the universal cover of a digraph, and to establish a close connection between displaying trees in $U(N)$ and reconciling phylogenetic trees with networks.

preprint 2015 arXiv

Neighbourhoods of phylogenetic trees: exact and asymptotic counts

A central theme in phylogenetics is the reconstruction and analysis of evolutionary trees from a given set of data. To determine the optimal search methods for reconstructing trees, it is crucial to understand the size and structure of the neighbourhoods of trees under tree rearrangement operations. The diameter and size of the immediate neighbourhood of a tree have been well studied; however, little is known about the number of trees at distance two, three or (more generally) $k$ from a given tree. In this paper we provide a number of exact and asymptotic results concerning these quantities, and identify some key aspects of tree shape that play a role in determining these quantities. We obtain several new results for two of the main tree rearrangement operations -- Nearest Neighbour Interchange and Subtree Prune and Regraft -- as well as for the Robinson-Foulds metric on trees.

preprint 2015 arXiv

Self-sustaining autocatalytic networks within open-ended reaction systems

Given any finite and closed chemical reaction system, it is possible to efficiently determine whether or not it contains a `self-sustaining and collectively autocatalytic' subset of reactions, and to find such subsets when they exist. However, for systems that are potentially open-ended (for example, when no prescribed upper bound is placed on the complexity or size/length of molecule types), the theory developed for the finite case breaks down. We investigate a number of subtleties that arise in such systems that are absent in the finite setting, and present several new results.

preprint 2015 arXiv

Similarities as Evidence for Common Ancestry -- A Likelihood Epistemology

Darwin claims in the {\em Origin} that similarity is evidence for common ancestry, but that adaptive similarities are "almost valueless" as evidence. This claim seems reasonable for some adaptive similarities but not for others. Here we clarify and evaluate these and related matters by using the law of likelihood as an analytic tool and by considering mathematical models of three evolutionary processes -- directional selection, stabilizing selection, and drift. Our results apply both to Darwin's theory of evolution and to modern evolutionary biology.

preprint 2015 arXiv

The existence and abundance of ghost ancestors in biparental populations

In a randomly-mating biparental population of size $N$ there are, with high probability, individuals who are genealogical ancestors of every extant individual within approximately $\log_2(N)$ generations into the past. We use this result of J. Chang to prove a curious corollary under standard models of recombination: there exist, with high probability, individuals within a constant multiple of $\log_2(N)$ generations into the past who are simultaneously (i) genealogical ancestors of {\em each} of the individuals at the present, and (ii) genetic ancestors to {\em none} of the individuals at the present. Such ancestral individuals -- ancestors of everyone today that left no genetic trace -- represent `ghost' ancestors in a strong sense. In this short note, we use a simple analytical argument and simulations to estimate how many such individuals exist in finite Wright-Fisher populations.

preprint 2015 arXiv

Twisted trees and inconsistency of tree estimation when gaps are treated as missing data -- the impact of model mis-specification in distance corrections

Statistically consistent estimation of phylogenetic trees or gene trees is possible if pairwise sequence dissimilarities can be converted to a set of distances that are proportional to the true evolutionary distances. Susko et al. (2004) reported some strikingly broad results about the forms of inconsistency in tree estimation that can arise if corrected distances are not proportional to the true distances. They showed that if the corrected distance is a concave function of the true distance, then inconsistency due to long branch attraction will occur. If these functions are convex, then two "long branch repulsion" trees will be preferred over the true tree -- though these two incorrect trees are expected to be tied as the preferred tree. Here we extend their results, and demonstrate the existence of a tree shape (which we refer to as a "twisted Farris-zone" tree) for which a single incorrect tree topology will be guaranteed to be preferred if the corrected distance function is convex. We also report that the standard practice of treating gaps in sequence alignments as missing data is sufficient to produce non-linear corrected distance functions if the substitution process is not independent of the insertion/deletion process. Taken together, these results imply inconsistent tree inference under mild conditions. For example, if some positions in a sequence are constrained to be free of substitutions and insertion/deletion events while the remaining sites evolve with independent substitutions and insertion/deletion events, then the distances obtained by treating gaps as missing data can support an incorrect tree topology even given an unlimited amount of data.

preprint 2015 arXiv

Which phylogenetic networks are merely trees with additional arcs?

A binary phylogenetic network may or may not be obtainable from a tree by the addition of directed edges (arcs) between tree arcs. Here, we establish a precise and easily tested criterion (based on `2-SAT') that efficiently determines whether or not any given network can be realized in this way. Moreover, the proof provides a polynomial-time algorithm for finding one or more trees (when they exist) on which the network can be based. A number of interesting consequences are presented as corollaries; these lead to some further relevant questions and observations, which we outline in the conclusion.

preprint 2014 arXiv

A 'stochastic safety radius' for distance-based tree reconstruction

A variety of algorithms have been proposed for reconstructing trees that show the evolutionary relationships between species by comparing differences in genetic data across present-day taxa. If the leaf-to-leaf distances in a tree can be accurately estimated, then it is possible to reconstruct this tree from these estimated distances, using polynomial-time methods such as the popular `Neighbor-Joining' algorithm. There is a precise combinatorial condition under which distance-based methods are guaranteed to return a correct tree (in full or in part) based on the requirement that the input distances all lie within some `safety radius' of the true distances. Here, we explore a stochastic analogue of this condition, and mathematically establish upper and lower bounds on this `stochastic safety radius' for distance-based tree reconstruction methods. Using simulations, we show how this notion provides a new way to compare the performance of distance-based tree reconstruction methods. This may help explain why Neighbor-Joining performs so well, as its stochastic safety radius appears close to optimal (while its more classical safety radius is the same as many other less accurate methods).

preprint 2014 arXiv

Axiomatic opportunities and obstacles for inferring a species tree from gene trees

The reconstruction of a central tendency `species tree' from a large number of conflicting gene trees is a central problem in systematic biology. Moreover, it becomes particularly problematic when taxon coverage is patchy, so that not all taxa are present in every gene tree. Here, we list four apparently desirable properties that a method for estimating a species tree from gene trees could have (the strongest property states that building a species tree from input gene trees and then pruning leaves gives a tree that is the same as, or more resolved than, the tree obtained by first removing the taxa from the input trees and then building the species tree). We show that while it is technically possible to simultaneously satisfy these properties when taxon coverage is complete, they cannot all be satisfied in the more general supertree setting. In part two, we discuss a concordance-based consensus method based on Baum's `plurality clusters', and an extension to concordance supertrees.

preprint 2014 arXiv

Impacts of terraces on phylogenetic inference

Terraces are potentially large sets of trees with precisely the same likelihood or parsimony score, which can be induced by missing sequences in partitioned multi-locus phylogenetic data matrices. The set of trees on a terrace can be characterized by enumeration algorithms or consensus methods that exploit the pattern of partial taxon coverage in the data, independent of the sequence data themselves. Terraces add ambiguity and complexity to phylogenetic inference particularly in settings where inference is already challenging: data sets with many taxa and relatively few loci. In this paper we present five new findings about terraces and their impacts on phylogenetic inference. First we clarify assumptions about model parameters that are necessary for the existence of terraces. Second, we explore the dependence of terrace size on partitioning scheme and indicate how to find the partitioning scheme associated with the largest terrace containing a given tree. Third, we highlight the impact of terraces on bootstrap estimates of confidence limits in clades, and characterize the surprising result that the bootstrap proportion for a clade can be entirely determined by the frequency of bipartitions on a terrace, with some bipartitions receiving high support even when incorrect. Fourth, we dissect some effects of prior distributions of edge lengths on the computed posterior probabilities of clades on terraces, to understand an example in which long edges "attract" each other in Bayesian inference. Fifth, we show that even if data are not partitioned, patterns of missing data studied in the terrace problem can lead to instances of apparent statistical inconsistency when even a small element of heterotachy is introduced to the model generating the sequence data. Finally, we discuss strategies for remediation of some of these problems.

preprint 2014 arXiv

Majority rule has transition ratio 4 on Yule trees under a 2-state symmetric model

Inferring the ancestral state at the root of a phylogenetic tree from states observed at the leaves is a problem arising in evolutionary biology. The simplest technique -- majority rule -- estimates the root state by the most frequently occurring state at the leaves. Alternative methods -- such as maximum parsimony -- explicitly take the tree structure into account. Since either method can outperform the other on particular trees, it is useful to consider the accuracy of the methods on trees generated under some evolutionary null model, such as a Yule pure-birth model. In this short note, we answer a recently posed question concerning the performance of majority rule on Yule trees under a symmetric 2-state Markovian substitution model of character state change. We show that majority rule is accurate precisely when the ratio of the birth (speciation) rate of the Yule process to the substitution rate exceeds the value $4$. By contrast, maximum parsimony has been shown to be accurate only when this ratio is at least 6. Our proof relies on a second moment calculation, coupling, and a novel application of a reflection principle.

preprint 2014 arXiv

Reflections on the extinction-explosion dichotomy

A wide range of stochastic processes that model the growth and decline of populations exhibit a curious dichotomy: with certainty either the population goes extinct or its size tends to infinity. There is an elegant and classical theorem that explains why this dichotomy must hold under certain assumptions concerning the process. In this note, I explore how these assumptions might be relaxed further in order to obtain the same, or a similar conclusion, and obtain both positive and negative results.

preprint 2014 arXiv

The most parsimonious tree for random data

Applying a method to reconstruct a phylogenetic tree from random data provides a way to detect whether that method has an inherent bias towards certain tree `shapes'. For maximum parsimony, applied to a sequence of random 2-state data, each possible binary phylogenetic tree has exactly the same distribution for its parsimony score. Despite this pleasing and slightly surprising symmetry, some binary phylogenetic trees are more likely than others to be a most parsimonious (MP) tree for a sequence of $k$ such characters, as we show. For $k=2$, and unrooted binary trees on six taxa, any tree with a caterpillar shape has a higher chance of being an MP tree than any tree with a symmetric shape. On the other hand, if we take any two binary trees, on any number of taxa, we prove that this bias between the two trees vanishes as the number of characters grows. However, again there is a twist: MP trees on six taxa are more likely to have certain shapes than a uniform distribution on binary phylogenetic trees predicts, and this difference does not appear to dissipate as $k$ grows.
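The parsimony score discussed in this abstract can be computed for a single character with Fitch's algorithm. The sketch below assumes a rooted binary tree encoded as nested tuples of leaf names (rooting an unrooted tree on any edge gives the same score); the encoding, the example character and the name `fitch_score` are choices made for illustration and do not come from the paper.

```python
def fitch_score(tree, states):
    """Minimum number of state changes one character needs on a rooted binary tree.

    tree   -- nested 2-tuples whose leaves are taxon names (strings)
    states -- dict mapping each taxon name to its character state
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if isinstance(node, str):          # leaf: its state set is a singleton
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:                         # children agree on some state
            return common
        changes += 1                       # disagreement costs one change
        return left | right

    visit(tree)
    return changes


# Two shapes on six taxa (hypothetical example data).
CATERPILLAR = ((((("a", "b"), "c"), "d"), "e"), "f")
SYMMETRIC = (("a", "b"), (("c", "d"), ("e", "f")))
CHARACTER = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 0, "f": 1}

print(fitch_score(CATERPILLAR, CHARACTER))  # 2
print(fitch_score(SYMMETRIC, CHARACTER))    # 2
```

Summing `fitch_score` over a sequence of $k$ characters gives the parsimony score of a tree for that data set, so comparing these sums across all tree shapes is one way to explore the bias described above on small examples.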

preprint 2014 arXiv

Tracing evolutionary links between species

The idea that all life on earth traces back to a common beginning dates back at least to Charles Darwin's {\em Origin of Species}. Ever since, biologists have tried to piece together parts of this `tree of life' based on what we can observe today: fossils, and the evolutionary signal that is present in the genomes and phenotypes of different organisms. Mathematics has played a key role in helping transform genetic data into phylogenetic (evolutionary) trees and networks. Here, I will explain some of the central concepts and basic results in phylogenetics, which benefit from several branches of mathematics, including combinatorics, probability and algebra.

preprint 2014 arXiv

Tree-like Reticulation Networks - When Do Tree-like Distances Also Support Reticulate Evolution?

Hybrid evolution and horizontal gene transfer (HGT) are processes where evolutionary relationships may more accurately be described by a reticulated network than by a tree. In such a network, there will often be several paths between any two extant species, reflecting the possible pathways that genetic material may have been passed down from a common ancestor to these species. These paths will typically have different lengths but an `average distance' can still be calculated between any two taxa. In this article, we ask whether this average distance is able to distinguish reticulate evolution from pure tree-like evolution. We consider two types of reticulation networks: hybridization networks and HGT networks. For the former, we establish a general result which shows that average distances between extant taxa can appear tree-like, but only under a single hybridization event near the root; in all other cases, the two forms of evolution can be distinguished by average distances. For HGT networks, we demonstrate some analogous but more intricate results.

preprint 2013 arXiv

A matroid associated with a phylogenetic tree

A (pseudo-)metric $D$ on a finite set $X$ is said to be a `tree metric' if there is a finite tree with leaf set $X$ and non-negative edge weights so that, for all $x,y \in X$, $D(x,y)$ is the path distance in the tree between $x$ and $y$. It is well known that not every metric is a tree metric. However, when some such tree exists, one can always find one whose interior edges have strictly positive edge weights and that has no vertices of degree 2; any such tree is -- up to canonical isomorphism -- uniquely determined by $D$, and one does not even need all of the distances in order to fully (re-)construct the tree's edge weights in this case. Thus, it seems of some interest to investigate which subsets of $\binom{X}{2}$ suffice to determine (`lasso') these edge weights. In this paper, we use the results of a previous paper to discuss the structure of a matroid that can be associated with an (unweighted) $X$-tree $T$ defined by the requirement that its bases are exactly the `tight edge-weight lassos' for $T$, i.e., the minimal subsets $\mathcal{L}$ of $\binom{X}{2}$ that lasso the edge weights of $T$.
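Whether a given $D$ is a tree metric can be tested with the classical four-point condition: for every four (not necessarily distinct) elements, the two largest of the three pairwise sums $D(x,y)+D(z,w)$, $D(x,z)+D(y,w)$, $D(x,w)+D(y,z)$ must be equal. The sketch below is a direct check of that condition and is unrelated to the matroid construction of the paper; the distance encoding and the name `is_tree_metric` are assumptions made for illustration.

```python
from itertools import combinations_with_replacement


def is_tree_metric(points, dist, tol=1e-9):
    """Check the four-point condition on a symmetric distance table.

    dist maps frozenset({x, y}) to D(x, y) for distinct x and y.
    """
    def d(x, y):
        return 0.0 if x == y else dist[frozenset((x, y))]

    # Repeated elements are allowed so that the triangle inequality is covered too.
    for x, y, z, w in combinations_with_replacement(points, 4):
        sums = sorted([d(x, y) + d(z, w), d(x, z) + d(y, w), d(x, w) + d(y, z)])
        if abs(sums[2] - sums[1]) > tol:   # the two largest sums must coincide
            return False
    return True


def table(pairs):
    """Build the distance table from {(x, y): value} entries."""
    return {frozenset(pair): value for pair, value in pairs.items()}


# Quartet tree ab|cd with all five edges of length 1 (hypothetical example).
QUARTET = table({("a", "b"): 2, ("c", "d"): 2, ("a", "c"): 3,
                 ("a", "d"): 3, ("b", "c"): 3, ("b", "d"): 3})
# Shortest-path distances on a 4-cycle a-b-c-d-a, which is not tree-like.
CYCLE = table({("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
               ("a", "d"): 1, ("a", "c"): 2, ("b", "d"): 2})

print(is_tree_metric("abcd", QUARTET))  # True
print(is_tree_metric("abcd", CYCLE))    # False
```

The check inspects all of $\binom{X}{2}$; the lasso question raised in the abstract asks, by contrast, which proper subsets of these distances already pin down the edge weights.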

preprint 2013 arXiv

An Arrow-type result for inferring a species tree from gene trees

preprint 2013 arXiv

Autocatalytic Sets and Biological Specificity

A universal feature of the biochemistry of any living system is that all the molecules and catalysts that are required for reactions of the system can be built up from an available food source by repeated application of reactions from within that system. RAF (reflexively autocatalytic and food-generated) theory provides a formal way to study such processes. Beginning with Kauffman's notion of "collectively autocatalytic sets", this theory has been further developed over the last decade with the discovery of efficient algorithms and new mathematical analysis. In this paper, we study how the behaviour of a simple binary polymer model can be extended to models where the pattern of catalysis more precisely reflects the ligation and cleavage reactions involved. We find that certain properties of these models are similar to, and can be accurately predicted from, the simple binary polymer model; however, other properties lead to slightly different estimates. We also establish a number of new results concerning the structure of RAFs in these systems.

preprint 2013 arXiv

Autocatalytic sets in a partitioned biochemical network

In previous work, RAF theory has been developed as a tool for making theoretical progress on the origin of life question, providing insight into the structure and occurrence of self-sustaining and collectively autocatalytic sets within catalytic polymer networks. We present here an extension in which there are two "independent" polymer sets, where catalysis occurs within and between the sets, but there are no reactions combining polymers from both sets. Such an extension reflects the interaction between nucleic acids and peptides observed in modern cells and proposed forms of early life.

preprint2013arXiv

Consistency of Bayesian inference of resolved phylogenetic trees

Bayesian inference is now a leading technique for reconstructing phylogenetic trees from aligned sequence data. In this short note, we formally show that the maximum posterior tree topology provides a statistically consistent estimate of a fully-resolved evolutionary tree under a wide variety of conditions. This includes the inference of gene trees from aligned sequence data across the entire parameter range of branch lengths, and under general conditions on priors in models where the usual `identifiability' conditions hold. We extend this to the inference of species trees from sequence data, where the gene trees constitute `nuisance parameters', as in the program *BEAST. This note also addresses earlier concerns raised in the literature questioning the extent to which statistical consistency for Bayesian methods might hold in general.

preprint2013arXiv

Hide and seek: placing and finding an optimal tree for thousands of homoplasy-rich sequences

Finding optimal evolutionary trees from sequence data is typically an intractable problem, and there is usually no way of knowing how close to optimal the best tree from some search truly is. The problem would seem to be particularly acute when we have many taxa and when the data have high levels of homoplasy, in which the individual characters require many changes to fit on the best tree. However, a recent mathematical result has provided a precise tool to generate a small number of high-homoplasy characters for any given tree, so that this tree is provably the optimal tree under the maximum parsimony criterion. This provides, for the first time, a rigorous way to test tree search algorithms on homoplasy-rich data, where we know in advance what the `best' tree is. In this short note we consider just one search program (TNT) but show that it is able to locate the globally optimal tree correctly for 32,768 taxa, even though the characters in the dataset require, on average, 1148 state-changes each to fit on this tree, and the number of characters is only 57.

preprint2013arXiv

Predicting the ancestral character changes in a tree is typically easier than predicting the root state

Predicting the ancestral sequences of a group of homologous sequences related by a phylogenetic tree has been the subject of many studies, and numerous methods have been proposed for this purpose. Theoretical results are available that show that when the mutation rate becomes too large, reconstructing the ancestral state at the tree root is no longer feasible. Here, we also study the reconstruction of the ancestral changes that occurred along the tree edges. We show that, depending on the tree and branch length distribution, reconstructing these changes (i.e. reconstructing the ancestral state of all internal nodes in the tree) may be easier or harder than reconstructing the ancestral root state. However, results from information theory indicate that for the standard Yule tree, the task of reconstructing internal node states remains feasible, even for very high substitution rates. Moreover, computer simulations demonstrate that for more complex trees and scenarios, this result still holds. For a large variety of counting, parsimony-based and likelihood-based methods, the predictive accuracy of a randomly selected internal node in the tree is indeed much higher than the accuracy of the same method when applied to the tree root. Moreover, parsimony- and likelihood-based methods appear to be remarkably robust to sampling bias and model mis-specification.

preprint2013arXiv

Predicting the loss of phylogenetic diversity under non-stationary diversification models

For many taxa, the current high rates of extinction are likely to result in a significant loss of biodiversity. The evolutionary heritage of biodiversity is frequently quantified by a measure called phylogenetic diversity (PD). We predict the loss of PD under a wide class of phylogenetic tree models, where speciation rates and extinction rates may be time-dependent, and assuming independent random species extinctions at the present. We study the loss of PD when $K$ contemporary species are selected uniformly at random from the $N$ extant species as the surviving taxa, while the remaining $N-K$ become extinct. We consider two models of species sampling, the so-called field of bullets model, where each species independently survives the extinction event at the present with probability $p$, and a model for which the number of surviving species is fixed. We provide explicit formulae for the expected remaining PD in both models, conditional on $N=n$, conditional on $K=k$, or conditional on both events. When $N=n$ is fixed, we show the convergence to an explicit deterministic limit of the ratio of new to initial PD, as $n\to\infty$, both under the field of bullets model, and when $K=k_n$ is fixed and depends on $n$ in such a way that $k_n/n$ converges to $p$. We also prove the convergence of this ratio as $T\to\infty$ in the supercritical, time-homogeneous case, where $N$ simultaneously goes to $\infty$, thereby strengthening previous results of Mooers et al. (2012).
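For a single fixed rooted tree, the field of bullets model admits a simple calculation: an edge is retained exactly when at least one of the leaves below it survives, which happens with probability $1-(1-p)^m$ for an edge with $m$ descendant leaves. The sketch below illustrates only this fixed-tree case, not the paper's formulae for random tree models. It assumes PD is measured as the total length of the edges linking the surviving leaves to the root, and the nested-list tree encoding and the name `expected_pd` are hypothetical.

```python
def expected_pd(tree, p):
    """Expected remaining PD when each leaf survives independently with
    probability p.

    A leaf is a string; an internal vertex is a list of
    (child, edge_length) pairs.
    """
    def walk(node):
        # Returns (number of leaves below node, expected PD inside the subtree).
        if isinstance(node, str):
            return 1, 0.0
        n_total, pd_total = 0, 0.0
        for child, length in node:
            n, pd = walk(child)
            # The edge above `child` survives if any of its n leaves survives.
            pd_total += pd + length * (1.0 - (1.0 - p) ** n)
            n_total += n
        return n_total, pd_total

    return walk(tree)[1]


# Hypothetical example: cherry (a, b) on a stem of length 2, plus leaf c.
tree = [([("a", 1.0), ("b", 1.0)], 2.0), ("c", 3.0)]
print(expected_pd(tree, 0.5))
```

Setting `p = 1` recovers the total edge length of the tree, which is a quick sanity check on the encoding.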

preprint2013arXiv

The standard lateral gene transfer model is statistically consistent for pectinate four-taxon trees

Evolutionary events such as incomplete lineage sorting and lateral gene transfer constitute major problems for inferring species trees from gene trees, as they can sometimes lead to gene trees which conflict with the underlying species tree. One particularly simple and efficient way to infer species trees from gene trees under such conditions is to combine three-taxon analyses for several genes using a majority vote approach. For incomplete lineage sorting this method is known to be statistically consistent; however, in the case of lateral gene transfer it is known that a zone of inconsistency does exist for a specific four-taxon tree topology. In this paper we analyze all remaining four-taxon topologies and show that no other inconsistencies exist.
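The majority-vote step for a single triple of taxa can be sketched as follows: restrict each gene tree to the three taxa, record which pair is grouped together, and keep the most frequent grouping. This is a rough illustration under assumed conventions rather than the procedure used in the paper; rooted gene trees are encoded as nested tuples of leaf names, and the names `leaves`, `clusters`, `triplet` and `majority_triplet` are hypothetical.

```python
from collections import Counter


def leaves(tree):
    """Leaf set of a rooted tree given as nested tuples of leaf names."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clusters(tree):
    """All clusters of the tree (the leaf set below each vertex)."""
    if isinstance(tree, str):
        return {frozenset([tree])}
    result = {leaves(tree)}
    for child in tree:
        result |= clusters(child)
    return result


def triplet(tree, a, b, c):
    """Pair grouped together when the tree is restricted to {a, b, c},
    or None if the triple is unresolved."""
    cl = clusters(tree)
    for (x, y), other in (((a, b), c), ((a, c), b), ((b, c), a)):
        if any(x in s and y in s and other not in s for s in cl):
            return frozenset((x, y))
    return None


def majority_triplet(gene_trees, a, b, c):
    """Most frequent grouping across gene trees; None on a tie or no votes.
    Gene trees missing any of the three taxa are skipped."""
    votes = Counter()
    for t in gene_trees:
        if {a, b, c} <= leaves(t):
            grouping = triplet(t, a, b, c)
            if grouping is not None:
                votes[grouping] += 1
    ranked = votes.most_common()
    if not ranked or (len(ranked) > 1 and ranked[0][1] == ranked[1][1]):
        return None
    return ranked[0][0]


genes = [(("a", "b"), ("c", "d")), (("a", "b"), "c"), (("a", "c"), "b"), ("a", "d")]
print(sorted(majority_triplet(genes, "a", "b", "c")))
```

Skipping gene trees that lack one of the taxa is one simple way to handle patchy coverage; other choices are possible.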

preprint2013arXiv

The Structure of N-Player Games when Influence and Independence Collide

We study the mathematical properties of probabilistic processes in which the independent actions of $n$ players (`causes') can influence the outcome of each player (`effects'). In such a setting, each pair of outcomes will generally be statistically correlated, even if the actions of all the players provide a complete causal description of the players' outcomes, and even if we condition on the outcome of any one player's action. This correlation always holds when $n=2$, but when $n=3$ there exists a highly symmetric process, recently studied, in which each cause can influence each effect, and yet each pair of effects is probabilistically independent (even upon conditioning on any one cause). We study such symmetric processes in more detail, obtaining a complete classification for all $n \geq 3$. Using a variety of mathematical techniques, we describe the geometry and topology of the underlying probability space that allows independence and influence to coexist.

preprint2013arXiv

Tree split probabilities determine the branch lengths

The evolution of aligned DNA sequence sites is generally modeled by a Markov process operating along the edges of a phylogenetic tree. It is well known that the probability distribution on the site patterns at the tips of the tree determines the tree and its branch lengths. However, the number of patterns is typically much larger than the number of edges, suggesting considerable redundancy in the branch length estimation. In this paper we ask whether the probabilities of just the `edge-specific' patterns (the ones that correspond to a change of state on a single edge) suffice to recover the branch lengths of the tree, under a symmetric 2-state Markov process. We first show that this holds provided the branch lengths are sufficiently short, by applying the inverse function theorem. We then consider whether this restriction to short branch lengths is necessary, and show that for trees with up to four leaves it can be lifted. This leaves open the interesting question of whether this holds in general.

preprint2012arXiv

A formal model of autocatalytic sets emerging in an RNA replicator system

Background: The idea that autocatalytic sets played an important role in the origin of life is not new. However, the likelihood of autocatalytic sets emerging spontaneously has long been debated. Recently, progress has been made along two different lines. Experimental results have shown that autocatalytic sets can indeed emerge in real chemical systems, and theoretical work has shown that the existence of such self-sustaining sets is highly likely in formal models of chemical systems. Here, we take a first step towards merging these two lines of work by constructing and investigating a formal model of a real chemical system of RNA replicators exhibiting autocatalytic sets. Results: We show that the formal model accurately reproduces recent experimental results on an RNA replicator system, in particular how the system goes through a sequence of larger and larger autocatalytic sets, and how a cooperative (autocatalytic) system can outcompete an equivalent selfish system. Moreover, the model provides additional insights that could not be obtained from experiments alone, and it suggests several experimentally testable hypotheses. Conclusions: Given these additional insights and predictions, the modeling framework provides a better and more detailed understanding of the nature of chemical systems in general and the emergence of autocatalytic sets in particular. This provides an important first step in combining experimental and theoretical work on autocatalytic sets in the context of the origin of life.

preprint2012arXiv

Autocatalytic Sets Extended: Dynamics, Inhibition, and a Generalization

Background: Autocatalytic sets are often considered a necessary (but not sufficient) condition for the origin and early evolution of life. Although the idea of autocatalytic sets was already conceived of many years ago, only recently have they gained more interest, following advances in creating them experimentally in the laboratory. In our own work, we have studied autocatalytic sets extensively from a computational and theoretical point of view. Results: We present results from an initial study of the dynamics of self-sustaining autocatalytic sets (RAFs). In particular, simulations of molecular flow on autocatalytic sets are performed, to illustrate the kinds of dynamics that can occur. Next, we present an extension of our (previously introduced) algorithm for finding autocatalytic sets in general reaction networks, which can also handle inhibition. We show that in this case detecting autocatalytic sets is fixed-parameter tractable. Finally, we formulate a generalized version of the algorithm that can also be applied outside the context of chemistry and origin of life, which we illustrate with a toy example from economics. Conclusions: Having shown theoretically (in previous work) that autocatalytic sets are highly likely to exist, we conclude here that also in terms of dynamics such sets are viable and outcompete non-autocatalytic sets. Furthermore, our dynamical results confirm arguments made earlier about how autocatalytic subsets can enable their own growth or give rise to other such subsets coming into existence. Finally, our algorithmic extension and generalization show that more realistic scenarios (e.g., including inhibition) can also be dealt with within our framework, and that it can even be applied to areas outside of chemistry, such as economics.

preprint2012arXiv

Identifying a species tree subject to random lateral gene transfer

A major problem for inferring species trees from gene trees is that evolutionary processes can sometimes favour gene tree topologies that conflict with an underlying species tree. In the case of incomplete lineage sorting, this phenomenon has recently been well-studied, and some elegant solutions for species tree reconstruction have been proposed. One particularly simple and statistically consistent estimator of the species tree under incomplete lineage sorting is to combine three-taxon analyses, which are phylogenetically robust to incomplete lineage sorting. In this paper, we consider whether such an approach will also work under lateral gene transfer (LGT). By providing an exact analysis of some cases of this model, we show that there is a zone of inconsistency for triplet-based species tree reconstruction under LGT. However, a triplet-based approach will consistently reconstruct a species tree under models of LGT, provided that the expected number of LGT transfers is not too high. Our analysis involves a novel connection between the LGT problem and random walks on cyclic graphs. We have implemented a procedure for reconstructing trees subject to LGT or lineage sorting in settings where taxon coverage may be patchy and illustrate its use on two sample data sets.

preprint2012arXiv

Is the Random Tree Puzzle process the same as the Yule-Harding process?

It has been suggested that a Random Tree Puzzle (RTP) process leads to a Yule-Harding (YH) distribution, when the number of taxa becomes large. In this study, we formalize this conjecture, and we prove that the two tree distributions converge for two particular properties, which suggests that the conjecture may be true. However, we present evidence that, while the two distributions are close, the RTP appears to converge on a different distribution than does the YH.

preprint2012arXiv

Minimal autocatalytic networks

Self-sustaining autocatalytic chemical networks represent a necessary, though not sufficient, condition for the emergence of early living systems. These networks have been formalised and investigated within the framework of RAF theory, which has led to a number of insights and results concerning the likelihood of such networks forming. In this paper, we extend this analysis by focussing on how {\em small} autocatalytic networks are likely to be when they first emerge. First we show that simulations are unlikely to settle this question, by establishing that the problem of finding a smallest RAF within a catalytic reaction system is NP-hard. However, irreducible RAFs (irrRAFs) can be constructed in polynomial time, and we show it is possible to determine in polynomial time whether a bounded size set of these irrRAFs contains the smallest RAFs within a system. Moreover, we derive rigorous bounds on the sizes of small RAFs and use simulations to sample irrRAFs under the binary polymer model. We then apply mathematical arguments to prove a new result suggested by those simulations: at the transition catalysis level at which RAFs first form in this model, small RAFs are unlikely to be present. We also investigate further the relationship between RAFs and another formal approach to self-sustaining and closed chemical networks, namely chemical organisation theory (COT).

preprint2012arXiv

On Patchworks and Hierarchies

Motivated by questions in biological classification, we discuss some elementary combinatorial and computational properties of certain set systems that generalize hierarchies, namely, 'patchworks', 'weak patchworks', 'ample patchworks' and 'saturated patchworks' and also outline how these concepts relate to an apparently new 'duality theory' for cluster systems that is based on the fundamental concept of 'compatibility' of clusters.

preprint2012arXiv

Reconstructing fully-resolved trees from triplet cover distances

It is a classical result that any finite tree with positively weighted edges, and without vertices of degree 2, is uniquely determined by the weighted path distance between each pair of leaves. Moreover, it is possible for a (small) strict subset $\mathcal{L}$ of leaf pairs to suffice for reconstructing the tree and its edge weights, given just the distances between the leaf pairs in $\mathcal{L}$. It is known that any set $\mathcal{L}$ with this property for a tree in which all interior vertices have degree 3 must form a {\em cover} for $T$ -- that is, for each interior vertex $v$ of $T$, $\mathcal{L}$ must contain a pair of leaves from each pair of the three components of $T-v$. Here we provide a partial converse of this result by showing that if a set $\mathcal{L}$ of leaf pairs forms a cover of a certain type for such a tree $T$ then $T$ and its edge weights can be uniquely determined from the distances between the pairs of leaves in $\mathcal{L}$. Moreover, there is a polynomial-time algorithm for achieving this reconstruction. The result establishes a special case of a recent question concerning `triplet covers', and is relevant to a problem arising in evolutionary genomics.
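The role of the cover condition is easiest to see in the smallest case. If three leaves $x$, $y$, $z$ lie in the three different components around an interior vertex $v$, the three pairwise distances fix the distance from each leaf to $v$, for example $d(x,v) = \tfrac{1}{2}\,(D(x,y) + D(x,z) - D(y,z))$. The sketch below covers only this three-leaf step and is not the paper's reconstruction algorithm; the function name `distances_to_centre` is hypothetical.

```python
def distances_to_centre(d_xy, d_xz, d_yz):
    """Distances from leaves x, y, z to the vertex where their three paths meet,
    computed from the three pairwise leaf distances."""
    x = (d_xy + d_xz - d_yz) / 2
    y = (d_xy + d_yz - d_xz) / 2
    z = (d_xz + d_yz - d_xy) / 2
    return x, y, z


# Star tree with pendant edge lengths 2, 3 and 4.
print(distances_to_centre(5, 6, 7))
```

Dropping any one of the three distances leaves the three lengths underdetermined, which is why a pair from each pair of components is needed.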

preprint2012arXiv

Root location in random trees: A polarity property of all sampling consistent phylogenetic models except one

Neutral macroevolutionary models, such as the Yule model, give rise to a probability distribution on the set of discrete rooted binary trees over a given leaf set. Such models can provide a signal as to the approximate location of the root when only the unrooted phylogenetic tree is known, and this signal becomes relatively more significant as the number of leaves grows. In this short note, we show that among models that treat all taxa equally, and are sampling consistent (i.e. the distribution on trees is not affected by taxa yet to be included), all such models, except one, convey some information as to the location of the ancestral root in an unrooted tree.

preprint2012arXiv

The impact and interplay of long and short branches on phylogenetic information content

In molecular systematics, evolutionary trees are reconstructed from sequences at the tips under simple models of site substitution. A central question is how much sequence data is required to reconstruct a tree accurately. The answer depends on the lengths of the branches (edges) of the tree, with very short and very long edges requiring long sequences for accurate tree inference, particularly when these branch lengths are arranged in certain ways. For four-taxon trees, the sequence length question was settled for the case of a rapid speciation event in the distant past. Here, we generalize this result and show that the same sequence length requirement holds even when the speciation event is recent, provided that at least one of the four taxa is distantly related to the others. However, this equivalence disappears if a molecular clock applies, since the length of the long outgroup edge becomes largely irrelevant in the estimation of the tree topology for a recent (but not a deep) divergence. We also show how our results can be extended to models in which substitution rates vary across sites, and to settings where more than four taxa are involved.

preprint2012arXiv

The Structure of Autocatalytic Sets: Evolvability, Enablement, and Emergence

This paper presents new results from a detailed study of the structure of autocatalytic sets. We show how autocatalytic sets can be decomposed into smaller autocatalytic subsets, and how these subsets can be identified and classified. We then argue how this has important consequences for the evolvability, enablement, and emergence of autocatalytic sets. We end with some speculation on how all this might lead to a generalized theory of autocatalytic sets, which could possibly be applied to entire ecologies or even economies.

preprint2011arXiv

'Bureaucratic' set systems, and their role in phylogenetics

We say that a collection $\mathcal{C}$ of subsets of $X$ is {\em bureaucratic} if every maximal hierarchy on $X$ contained in $\mathcal{C}$ is also maximum. We characterise bureaucratic set systems and show how they arise in phylogenetics. This framework has several useful algorithmic consequences: we generalize some earlier results and derive a polynomial-time algorithm for a parsimony problem arising in phylogenetic networks.

preprint2011arXiv

Branch lengths on Yule trees and the expected loss of phylogenetic diversity

Diversification is nested, and early models suggested this could lead to a great deal of evolutionary redundancy in the Tree of Life. This result is based on a particular set of branch lengths produced by the common coalescent, where pendant branches leading to tips can be very short compared to branches deeper in the tree. Here, we analyze alternative and more realistic Yule and birth-death models. We show how censoring at the present both makes average branches one half what we might expect and makes pendant and interior branches roughly equal in length. Although dependent on whether we condition on the size of the tree, its age, or both, these results hold both for the Yule model and for birth-death models with moderate extinction. Importantly, the rough equivalency in interior and exterior branch lengths means the loss of evolutionary history with loss of species can be roughly linear. Under these models, the Tree of Life may offer limited redundancy in the face of ongoing species loss.

preprint2011arXiv

Clades, clans and reciprocal monophyly under neutral evolutionary models

The Yule model and the coalescent model are two neutral stochastic models for generating trees in phylogenetics and population genetics, respectively. Although these models are quite different, they lead to identical distributions concerning the probability that pre-specified groups of taxa form monophyletic groups (clades) in the tree. We extend earlier work to derive exact formulae for the probability of finding one or more groups of taxa as clades in a rooted tree, or as `clans' in an unrooted tree. Our findings are relevant for calculating the statistical significance of observed monophyly and reciprocal monophyly in phylogenetics.

preprint2011arXiv

Distribution of branch lengths and phylogenetic diversity under homogeneous speciation models

The constant rate birth--death process is a popular null model for speciation and extinction. If one removes extinct and non-sampled lineages, this process induces `reconstructed trees' which describe the relationship between extant lineages. We derive the probability density of the length of a randomly chosen pendant edge in a reconstructed tree. For the special case of a pure-birth process with complete sampling, we also provide the probability density of the length of an interior edge, of the length of an edge descending from the root, and of the diversity (which is the sum of all edge lengths). We show that the results depend on whether the reconstructed trees are conditioned on the number of leaves, the age, or both.

preprint2011arXiv

Predicting template-based catalysis rates in a simple catalytic reaction model

We show that in a particular model of catalytic reaction systems, known as the binary polymer model, there is a mathematical invariance between two versions of the model: (1) random catalysis and (2) template-based catalysis. In particular, we derive an analytical calculation that allows us to accurately predict the (observed) required level of catalysis in one version of the model from that in the other version, for a given probability of having self-sustaining autocatalytic sets exist in instances of both model versions. This provides a tractable connection between two models that have been investigated in theoretical origin-of-life studies.

preprint2011arXiv

The 'Butterfly effect' in Cayley graphs, and its relevance for evolutionary genomics

Suppose a finite set $X$ is repeatedly transformed by a sequence of permutations of a certain type acting on an initial element $x$ to produce a final state $y$. We investigate how 'different' the resulting state $y'$ can be from $y$ if a slight change is made to the sequence, either by deleting one permutation, or replacing it with another. Here the 'difference' between $y$ and $y'$ might be measured by the minimum number of permutations of the permitted type required to transform $y$ to $y'$, or by some other metric. We discuss this first in the general setting of sensitivity to perturbation of walks in Cayley graphs of groups with a specified set of generators. We then investigate some permutation groups and generators arising in computational genomics, and the statistical implications of the findings.

preprint2010arXiv

Inferring ancestral sequences in taxon-rich phylogenies

Statistical consistency in phylogenetics has traditionally referred to the accuracy of estimating phylogenetic parameters for a fixed number of species as we increase the number of characters. However, since sequences are often of fixed length (e.g. for a gene), whereas we are often able to sample more taxa, it is useful to consider a dual type of statistical consistency where we increase the number of species, rather than characters. This raises some basic questions: what can we learn about the evolutionary process as we increase the number of species? In particular, does having more species allow us to infer the ancestral state of characters accurately? This question is particularly relevant when sequence site evolution varies in a complex way from character to character, as well as for reconstructing ancestral sequences. In this paper, we assemble a collection of results to analyse various approaches for inferring ancestral information with increasing accuracy as the number of taxa increases.

preprint2010arXiv

Locating a tree in a phylogenetic network

Phylogenetic trees and networks are leaf-labelled graphs that are used to describe evolutionary histories of species. The Tree Containment problem asks whether a given phylogenetic tree is embedded in a given phylogenetic network. Given a phylogenetic network and a cluster of species, the Cluster Containment problem asks whether the given cluster is a cluster of some phylogenetic tree embedded in the network. Both problems are known to be NP-complete in general. In this article, we consider the restriction of these problems to several well-studied classes of phylogenetic networks. We show that Tree Containment is polynomial-time solvable for normal networks, for binary tree-child networks, and for level-$k$ networks. On the other hand, we show that, even for tree-sibling, time-consistent, regular networks, both Tree Containment and Cluster Containment remain NP-complete.

preprint2004arXiv

How much can evolved characters tell us about the tree that generated them?

In this paper we review some recent results that shed light on a fundamental question in molecular systematics: how much phylogenetic `signal' can we expect from characters that have evolved under some Markov process? There are many sides to this question and we begin by describing some explicit bounds on the probability of correctly reconstructing an ancestral state from the states observed at the tips. We show how this bound sets upper limits on the probability of tree reconstruction from aligned sequences, and we provide some new extensions that allow site-to-site rate variation or a covarion mechanism. We then explore the relationship between the number of sites required for accurate tree reconstruction and other model parameters (such as the number of species and the substitution probabilities), and we describe a phase transition that occurs when substitution probabilities exceed a critical value. In the remainder of this paper we turn to models of character evolution where the state space is assumed to be either infinite or very large. These models have some relevance to certain types of genomic data (such as gene order) and here we again investigate how many characters are required for accurate tree reconstruction.

55 published item(s)

Counting and optimising maximum phylogenetic diversity sets

Combinatorial and stochastic properties of ranked tree-child networks

Modeling a Cognitive Transition at the Origin of Cultural Evolution using Autocatalytic Networks

A class of phylogenetic networks reconstructable from ancestral profiles

Autocatalytic sets in polymer networks with variable catalysis distributions

Phylogenetic mixtures and linear invariants for equal input models

A consistency lemma in statistical phylogenetics

Bounds on the Expected Size of the Maximum Agreement Subtree

Capturing a phylogenetic tree when the number of character states varies with the number of leaves

Circumstances in which parsimony but not compatibility will be provably misleading

Folding and unfolding phylogenetic trees and networks

Neighbourhoods of phylogenetic trees: exact and asymptotic counts

Self-sustaining autocatalytic networks within open-ended reaction systems

Similarities as Evidence for Common Ancestry -- A Likelihood Epistemology

The existence and abundance of ghost ancestors in biparental populations

Twisted trees and inconsistency of tree estimation when gaps are treated as missing data -- the impact of model mis-specification in distance corrections

Which phylogenetic networks are merely trees with additional arcs?

A 'stochastic safety radius' for distance-based tree reconstruction

Axiomatic opportunities and obstacles for inferring a species tree from gene trees

Impacts of terraces on phylogenetic inference

Majority rule has transition ratio 4 on Yule trees under a 2-state symmetric model

Reflections on the extinction-explosion dichotomy

The most parsimonious tree for random data

Tracing evolutionary links between species

Tree-like Reticulation Networks - When Do Tree-like Distances Also Support Reticulate Evolution?

A matroid associated with a phylogenetic tree

An Arrow-type result for inferring a species tree from gene trees

Autocatalytic Sets and Biological Specificity

Autocatalytic sets in a partitioned biochemical network

Consistency of Bayesian inference of resolved phylogenetic trees

Hide and seek: placing and finding an optimal tree for thousands of homoplasy-rich sequences

Predicting the ancestral character changes in a tree is typically easier than predicting the root state

Predicting the loss of phylogenetic diversity under non-stationary diversification models

The standard lateral gene transfer model is statistically consistent for pectinate four-taxon trees

The Structure of N-Player Games when Influence and Independence Collide

Tree split probabilities determine the branch lengths

A formal model of autocatalytic sets emerging in an RNA replicator system

Autocatalytic Sets Extended: Dynamics, Inhibition, and a Generalization

Identifying a species tree subject to random lateral gene transfer

Is the Random Tree Puzzle process the same as the Yule-Harding process?

Minimal autocatalytic networks

On Patchworks and Hierarchies

Reconstructing fully-resolved trees from triplet cover distances

Root location in random trees: A polarity property of all sampling consistent phylogenetic models except one

The impact and interplay of long and short branches on phylogenetic information content

The Structure of Autocatalytic Sets: Evolvability, Enablement, and Emergence

'Bureaucratic' set systems, and their role in phylogenetics

Branch lengths on Yule trees and the expected loss of phylogenetic diversity

Clades, clans and reciprocal monophyly under neutral evolutionary models

Distribution of branch lengths and phylogenetic diversity under homogeneous speciation models

Predicting template-based catalysis rates in a simple catalytic reaction model

The 'Butterfly effect' in Cayley graphs, and its relevance for evolutionary genomics

Inferring ancestral sequences in taxon-rich phylogenies

Locating a tree in a phylogenetic network

How much can evolved characters tell us about the tree that generated them?