Researcher profile

John A. Rhodes

John A. Rhodes contributes to research discovery and scholarly infrastructure.

Published work

14 published items

Preprint, 2022, arXiv

Parameter Identifiability of a Multitype Pure-Birth Model of Speciation

Diversification models describe the random growth of evolutionary trees, modeling the historical relationships of species through speciation and extinction events. One class of such models allows for independently changing traits, or types, of the species within the tree, upon which speciation and extinction rates depend. Although identifiability of parameters is necessary to justify parameter estimation with a model, it has not been formally established for these models, despite their adoption for inference. This work establishes generic identifiability up to label swapping for the parameters of one of the simpler forms of such a model, a multitype pure birth model of speciation, from an asymptotic distribution derived from a single tree observation as its depth goes to infinity. Crucially for applications to available data, no observation of types is needed at any internal points in the tree, nor even at the leaves.

Preprint, 2022, arXiv

The Tree of Blobs of a Species Network: Identifiability under the Coalescent

Inference of species networks from genomic data under the Network Multispecies Coalescent Model is currently severely limited by heavy computational demands. It also remains unclear how complicated networks can be for consistent inference to be possible. As a step toward inferring a general species network, this work considers its tree of blobs, in which non-cut edges are contracted to nodes, so only tree-like relationships between the taxa are shown. An identifiability theorem, that most features of the unrooted tree of blobs can be determined from the distribution of gene quartet topologies, is established. This depends upon an analysis of gene quartet concordance factors under the model, together with a new combinatorial inference rule. The arguments for this theoretical result suggest a practical algorithm for tree of blobs inference, to be fully developed in a subsequent work.
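The contraction step described in this abstract can be sketched in a few lines of Python. This is only an illustration of the graph operation, under several assumptions that are not from the paper: the network is treated as a plain undirected graph given as an edge list, degree-2 nodes are not suppressed, and the function name `tree_of_blobs` is hypothetical. It performs no inference from gene quartets.

```python
from collections import defaultdict

def tree_of_blobs(edges):
    """Contract every non-cut edge of an undirected graph (illustrative sketch).

    edges: list of (u, v) pairs. Returns (blob_of, tree_edges), where blob_of
    maps each node to a representative of its blob and tree_edges lists the
    cut edges between blob representatives.
    """
    adj = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
        adj[v].append((u, i))

    disc, low, cut_edges = {}, {}, set()
    counter = 0

    def dfs(u, parent_edge):
        # Standard low-link search: edge i is a cut edge when the subtree
        # below it cannot reach u or any ancestor of u by another route.
        nonlocal counter
        disc[u] = low[u] = counter
        counter += 1
        for v, i in adj[u]:
            if i == parent_edge:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])
            else:
                dfs(v, i)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:
                    cut_edges.add(i)

    for node in list(adj):
        if node not in disc:
            dfs(node, None)

    # Union-find: merge the endpoints of every non-cut edge into one blob.
    parent = {n: n for n in adj}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, (u, v) in enumerate(edges):
        if i not in cut_edges:
            parent[find(u)] = find(v)

    blob_of = {n: find(n) for n in adj}
    tree_edges = [(blob_of[u], blob_of[v])
                  for i, (u, v) in enumerate(edges) if i in cut_edges]
    return blob_of, tree_edges
```

For a network consisting of a 4-cycle with one taxon attached to each cycle node, the cycle collapses to a single blob and only the four pendant edges remain, so the result is a star tree. The recursive search is adequate for small examples; a large network would need an iterative version.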

Preprint, 2020, arXiv

Inferring metric trees from weighted quartets via an intertaxon distance

A metric phylogenetic tree relating a collection of taxa induces weighted rooted triples and weighted quartets for all subsets of three and four taxa, respectively. New intertaxon distances are defined that can be calculated from these weights, and shown to exactly fit the same tree topology, but with edge weights rescaled by certain factors dependent on the associated split size. These distances are analogs for metric trees of similar ones recently introduced for topological trees that are based on induced unweighted rooted triples and quartets. The distances introduced here lead to new statistically consistent methods of inferring a metric species tree from a collection of topological gene trees generated under the multispecies coalescent model of incomplete lineage sorting. Simulations provide insight into their potential.

Preprint, 2020, arXiv

NJst and ASTRID are not statistically consistent under a random model of missing data

Species tree estimation from multi-locus datasets is statistically challenging for multiple reasons, including gene tree heterogeneity across the genome due to incomplete lineage sorting (ILS). Species tree estimation methods have been developed that operate by estimating gene trees and then using those gene trees to estimate the species tree. Several of these methods (e.g., ASTRAL, ASTRID, and NJst) are provably statistically consistent under the multi-species coalescent (MSC) model, provided that the gene trees are estimated correctly, and there is no missing data. Recently, Nute et al. (BMC Genomics 2018) addressed the question of whether these methods remain statistically consistent under random models of taxon deletion, and asserted that they do so. Here we provide a counterexample to one of these theorems, and establish that ASTRID and NJst are not statistically consistent under an i.i.d. model of taxon deletion.

Preprint, 2020, arXiv

Parameter identifiability for a profile mixture model of protein evolution

A Profile Mixture Model is a model of protein evolution, describing sequence data in which sites are assumed to follow many related substitution processes on a single evolutionary tree. The processes depend in part on different amino acid distributions, or profiles, varying over sites in aligned sequences. A fundamental question for any stochastic model, which must be answered positively to justify model-based inference, is whether the parameters are identifiable from the probability distribution they determine. Here we show that a Profile Mixture Model has identifiable parameters under circumstances in which it is likely to be used for empirical analyses. In particular, for a tree relating 9 or more taxa, both the tree topology and all numerical parameters are generically identifiable when the number of profiles is less than 74.

Preprint, 2012, arXiv

A semialgebraic description of the general Markov model on phylogenetic trees

Many of the stochastic models used in inference of phylogenetic trees from biological sequence data have polynomial parameterization maps. The image of such a map, the collection of joint distributions for a model, forms the model space. Since the parameterization is polynomial, the Zariski closure of the model space is an algebraic variety which is typically much larger than the model space, but has been usefully studied with algebraic methods. Of ultimate interest, however, is not the full variety, but only the model space. Here we develop complete semialgebraic descriptions of the model space arising from the k-state general Markov model on a tree, with slightly restricted parameters. Our approach depends upon both recently-formulated analogs of Cayley's hyperdeterminant, and the construction of certain quadratic forms from the joint distribution whose positive (semi-)definiteness encodes information about parameter values. We additionally investigate the use of Sturm sequences for obtaining similar results.

Preprint, 2012, arXiv

Species tree inference by the STAR method, and generalizations

The multispecies coalescent model describes the generation of gene trees from a rooted metric species tree, and thus provides a framework for the inference of species trees from sampled gene trees. We prove that the STAR method of Liu et al., and generalizations of it, are statistically consistent methods of topological species tree inference under this model. We discuss the impact of gene tree sampling schemes for species tree inference using generalized STAR methods, and reinterpret the original STAR as a consensus method based on clades.

Preprint, 2012, arXiv

Tensor Rank, Invariants, Inequalities, and Applications

Though algebraic geometry over $\mathbb C$ is often used to describe the closure of the tensors of a given size and complex rank, this variety includes tensors of both smaller and larger rank. Here we focus on the $n\times n\times n$ tensors of rank $n$ over $\mathbb C$, which has as a dense subset the orbit of a single tensor under a natural group action. We construct polynomial invariants under this group action whose non-vanishing distinguishes this orbit from points only in its closure. Together with an explicit subset of the defining polynomials of the variety, this gives a semialgebraic description of the tensors of rank $n$ and multilinear rank $(n,n,n)$. The polynomials we construct coincide with Cayley's hyperdeterminant in the case $n=2$, and thus generalize it. Though our construction is direct and explicit, we also recast our functions in the language of representation theory for additional insights. We give three applications in different directions: First, we develop basic topological understanding of how the real tensors of complex rank $n$ and multilinear rank $(n,n,n)$ form a collection of path-connected subsets, one of which contains tensors of real rank $n$. Second, we use the invariants to develop a semialgebraic description of the set of probability distributions that can arise from a simple stochastic model with a hidden variable, a model that is important in phylogenetics and other fields. Third, we construct simple examples of tensors of rank $2n-1$ which lie in the closure of those of rank $n$.
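For the case $n=2$ mentioned in this abstract, Cayley's hyperdeterminant of a $2\times 2\times 2$ tensor has a standard closed form, which the following Python sketch evaluates. The formula is the classical one and is not quoted from the paper; the function names `hyperdet` and `rank_one` are hypothetical, and the paper's generalized invariants for $n>2$ are not reproduced here.

```python
def hyperdet(a):
    """Cayley's hyperdeterminant of a 2x2x2 tensor given as nested lists a[i][j][k]."""
    a000, a001, a010, a011 = a[0][0][0], a[0][0][1], a[0][1][0], a[0][1][1]
    a100, a101, a110, a111 = a[1][0][0], a[1][0][1], a[1][1][0], a[1][1][1]
    # Products of entries at opposite corners of the 2x2x2 cube.
    p = [a000 * a111, a001 * a110, a010 * a101, a100 * a011]
    squares = sum(x * x for x in p)
    pairs = sum(p[i] * p[j] for i in range(4) for j in range(i + 1, 4))
    quads = a000 * a011 * a101 * a110 + a001 * a010 * a100 * a111
    return squares - 2 * pairs + 4 * quads

def rank_one(x, y, z):
    """Rank-one tensor with entries x[i] * y[j] * z[k] (helper for the examples)."""
    return [[[x[i] * y[j] * z[k] for k in range(2)] for j in range(2)] for i in range(2)]
```

The polynomial vanishes on every rank-one tensor, is nonzero on the "diagonal" tensor with `a[0][0][0] = a[1][1][1] = 1` and all other entries zero, and vanishes again on the tensor whose only nonzero entries are `a[0][0][1] = a[0][1][0] = a[1][0][0] = 1`. That last tensor is a standard example of a rank 3 tensor lying in the closure of the rank 2 tensors, matching the $2n-1$ phenomenon the abstract describes for $n=2$.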

Preprint, 2012, arXiv

When Do Phylogenetic Mixture Models Mimic Other Phylogenetic Models?

Phylogenetic mixture models, in which the sites in sequences undergo different substitution processes along the same or different trees, allow the description of heterogeneous evolutionary processes. As data sets consisting of longer sequences become available, it is important to understand such models, for both theoretical insights and use in statistical analyses. Some recent articles have highlighted disturbing "mimicking" behavior in which a distribution from a mixture model is identical to one arising on a different tree or trees. Other works have indicated such problems are unlikely to occur in practice, as they require very special parameter choices. After surveying some of these works on mixture models, we give several new results. In general, if the number of components in a generating mixture is not too large and we disallow zero or infinite branch lengths, then it cannot mimic the behavior of a non-mixture on a different tree. On the other hand, if the mixture model is locally over-parameterized, it is possible for a phylogenetic mixture model to mimic distributions of another tree model. Though theoretical questions remain, these sorts of results can serve as a guide to when the use of mixture models in either ML or Bayesian frameworks is likely to lead to statistically consistent inference, and when mimicking due to heterogeneity should be considered a realistic possibility.

Preprint, 2011, arXiv

Determining species tree topologies from clade probabilities under the coalescent

One approach to estimating a species tree from a collection of gene trees is to first estimate probabilities of clades from the gene trees, and then to construct the species tree from the estimated clade probabilities. While a greedy consensus algorithm, which consecutively accepts the most probable clades compatible with previously accepted clades, can be used for this second stage, this method is known to be statistically inconsistent under the multispecies coalescent model. This raises the question of whether it is theoretically possible to reconstruct the species tree from known probabilities of clades on gene trees. We investigate clade probabilities arising from the multispecies coalescent model, with an eye toward identifying features of the species tree. Clades on gene trees with probability greater than 1/3 are shown to reflect clades on the species tree, while those with smaller probabilities may not. Linear invariants of clade probabilities are studied both computationally and theoretically, with certain linear invariants giving insight into the clade structure of the species tree. For species trees with generic edge lengths, these invariants can be used to identify the species tree topology. These theoretical results both confirm that clade probabilities contain full information on the species tree topology and suggest future directions of study for developing statistically consistent inference methods from clade frequencies on gene trees.
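The greedy consensus procedure described in this abstract can be sketched as follows. The representation is an assumption made for illustration: clades are `frozenset`s of taxon names, probabilities come in a plain dictionary, and the function names are hypothetical. As the abstract notes, this procedure is statistically inconsistent under the multispecies coalescent, so the sketch shows the algorithm being analyzed, not a recommended estimator.

```python
def compatible(c1, c2):
    """Two clades are compatible when they are nested or disjoint."""
    return c1 <= c2 or c2 <= c1 or not (c1 & c2)

def greedy_consensus(clade_probs):
    """Accept clades in order of decreasing probability, skipping any clade
    that conflicts with one already accepted.

    clade_probs: dict mapping frozenset of taxa -> estimated probability.
    Ties are broken by dictionary order (an arbitrary choice in this sketch).
    """
    accepted = []
    ranked = sorted(clade_probs.items(), key=lambda item: -item[1])
    for clade, _prob in ranked:
        if all(compatible(clade, other) for other in accepted):
            accepted.append(clade)
    return accepted

def clades_above_one_third(clade_probs):
    """Clades with probability greater than 1/3, which the abstract states
    reflect clades on the species tree."""
    return [clade for clade, prob in clade_probs.items() if prob > 1 / 3]
```

With made-up probabilities `{ab: 0.6, abc: 0.5, bc: 0.3, cd: 0.2}` on taxa a, b, c, d, the greedy pass accepts `ab` and `abc` and rejects `bc` and `cd`, since each of those overlaps an accepted clade without being nested in it.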

Preprint, 2010, arXiv

Identifiability of Large Phylogenetic Mixture Models

Phylogenetic mixture models are statistical models of character evolution allowing for heterogeneity. Each of the classes in some unknown partition of the characters may evolve by different processes, or even along different trees. The fundamental question of whether parameters of such a model are identifiable is difficult to address, due to the complexity of the parameterization. We analyze mixture models on large trees, with many mixture components, showing that both numerical and tree parameters are indeed identifiable in these models when all trees are the same. We also explore the extent to which our algebraic techniques can be employed to extend the result to mixtures on different trees.

Preprint, 2010, arXiv

Identifying the Rooted Species Tree from the Distribution of Unrooted Gene Trees under the Coalescent

Gene trees are evolutionary trees representing the ancestry of genes sampled from multiple populations. Species trees represent populations of individuals, each with many genes, splitting into new populations or species. The coalescent process, which models ancestry of gene copies within populations, is often used to model the probability distribution of gene trees given a fixed species tree. This multispecies coalescent model provides a framework for phylogeneticists to infer species trees from gene trees using maximum likelihood or Bayesian approaches. Because the coalescent models a branching process over time, all trees are typically assumed to be rooted in this setting. Often, however, gene trees inferred by traditional phylogenetic methods are unrooted. We investigate probabilities of unrooted gene trees under the multispecies coalescent model. We show that when there are 4 species with one gene sampled per species, the distribution of unrooted gene tree topologies identifies the unrooted species tree topology and some, but not all, information in the species tree edges (branch lengths). The location of the root on the species tree is not identifiable in this situation. However, for 5 or more species with one gene sampled per species, we show that the distribution of unrooted gene tree topologies identifies the rooted species tree topology and all its internal branch lengths. The length of any pendent branch leading to a leaf of the species tree is also identifiable for any species from which more than one gene is sampled.
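The 4-species case in this abstract can be made concrete with the standard quartet formula under the multispecies coalescent, which is stated here from general knowledge, not quoted from the abstract. With one gene per species and an internal branch of length `t` (in coalescent units) on the unrooted species tree, the gene quartet matching the species tree has probability $1 - \tfrac{2}{3}e^{-t}$ and each of the other two has probability $\tfrac{1}{3}e^{-t}$. The function names below are hypothetical.

```python
import math

def quartet_probs(t):
    """Probabilities of the three unrooted gene quartet topologies.

    t: internal branch length of the unrooted 4-taxon species tree, in
    coalescent units. Returns (matching, other_1, other_2).
    """
    minor = math.exp(-t) / 3.0
    return (1.0 - 2.0 * minor, minor, minor)

def internal_length(p_matching):
    """Invert the formula: recover t from the probability of the matching
    quartet. Requires 1/3 <= p_matching < 1."""
    return -math.log(1.5 * (1.0 - p_matching))
```

Only `t` enters the distribution, so the most probable quartet gives the unrooted topology and its probability gives the internal length, while the root position and pendant branch lengths leave no trace. That is consistent with the abstract's statement that some, but not all, edge information is identifiable from 4 species.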

Preprint, 2007, arXiv

Identifying evolutionary trees and substitution parameters for the general Markov model with invariable sites

The general Markov plus invariable sites (GM+I) model of biological sequence evolution is a two-class model in which an unknown proportion of sites are not allowed to change, while the remainder undergo substitutions according to a Markov process on a tree. For statistical use it is important to know if the model is identifiable: can both the tree topology and the numerical parameters be determined from a joint distribution describing sequences only at the leaves of the tree? We establish that for generic parameters both the tree and all numerical parameter values can be recovered, up to clearly understood issues of "label swapping." The method of analysis is algebraic, using phylogenetic invariants to study the variety defined by the model. Simple rational formulas, expressed in terms of determinantal ratios, are found for recovering numerical parameters describing the invariable sites.

Preprint, 2005, arXiv

The identifiability of tree topology for phylogenetic models, including covarion and mixture models

For a model of molecular evolution to be useful for phylogenetic inference, the topology of evolutionary trees must be identifiable. That is, from a joint distribution the model predicts, it must be possible to recover the tree parameter. We establish tree identifiability for a number of phylogenetic models, including a covarion model and a variety of mixture models with a limited number of classes. The proof is based on the introduction of a more general model, allowing more states at internal nodes of the tree than at leaves, and the study of the algebraic variety formed by the joint distributions to which it gives rise. Tree identifiability is first established for this general model through the use of certain phylogenetic invariants.