Source author record

Steven N. Evans

Steven N. Evans appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

math.PR Populations and Evolution math.CO Genomics math.AC math.ST Statistics Theory Applications astro-ph.EP math.AP math.DS math.MG math.RT Molecular Networks q-fin.CP q-fin.PR q-fin.RM

Catalog footprint

What is connected

31 works

17 topics

4 close collaborators

preprint 2020 arXiv

Doob--Martin boundary of Rémy's tree growth chain

Rémy's algorithm is a Markov chain that iteratively generates a sequence of random trees in such a way that the $n^{\mathrm{th}}$ tree is uniformly distributed over the set of rooted, planar, binary trees with $2n+1$ vertices. We obtain a concrete characterization of the Doob--Martin boundary of this transient Markov chain and thereby delineate all the ways in which, loosely speaking, this process can be conditioned to "go to infinity" at large times. A (deterministic) sequence of finite rooted, planar, binary trees converges to a point in the boundary if for each $m$ the random rooted, planar, binary tree spanned by $m+1$ leaves chosen uniformly at random from the $n^{\mathrm{th}}$ tree in the sequence converges in distribution as $n$ tends to infinity -- a notion of convergence that is analogous to one that appears in the recently developed theory of graph limits. We show that a point in the Doob--Martin boundary may be identified with the following ensemble of objects: a complete separable $\mathbb{R}$-tree that is rooted and binary in a suitable sense, a diffuse probability measure on the $\mathbb{R}$-tree that allows us to make sense of sampling points from it, and a kernel on the $\mathbb{R}$-tree that describes the probability that the first of a given pair of points is below and to the left of their most recent common ancestor while the second is below and to the right. The Doob--Martin boundary corresponds bijectively to the set of extreme points of the closed convex set of normalized nonnegative harmonic functions, in other words, the minimal and full Doob--Martin boundaries coincide. These results are in the spirit of the identification of graphons as limit objects in the theory of graph limits.
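The growth step of the chain can be sketched in a few lines of Python. This is a minimal illustration of the commonly described form of Rémy's step (pick a vertex uniformly at random, graft a new internal vertex above it, and attach a new leaf as its sibling on a uniformly chosen side). The array representation and the name `remy_tree` are assumptions made for this sketch and are not taken from the paper.

```python
import random


def remy_tree(n, rng=None):
    """Run n Rémy growth steps starting from a single-vertex tree.

    Returns (root, parent, left, right); vertex i is a leaf when left[i] is None.
    """
    rng = rng or random.Random()
    parent, left, right = [None], [None], [None]
    root = 0
    for _ in range(n):
        v = rng.randrange(len(parent))      # uniform over all current vertices
        p = parent[v]
        new_internal = len(parent)
        new_leaf = new_internal + 1
        parent.extend([p, new_internal])    # parents of new_internal and new_leaf
        left.extend([None, None])
        right.extend([None, None])
        # The old subtree rooted at v goes left or right with probability 1/2.
        if rng.random() < 0.5:
            left[new_internal], right[new_internal] = v, new_leaf
        else:
            left[new_internal], right[new_internal] = new_leaf, v
        parent[v] = new_internal
        # Re-attach the grafted vertex where v used to hang.
        if p is None:
            root = new_internal
        elif left[p] == v:
            left[p] = new_internal
        else:
            right[p] = new_internal
    return root, parent, left, right
```

After `n` steps the tree has `2n+1` vertices, of which `n+1` are leaves, matching the vertex count stated in the abstract.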

preprint 2020 arXiv

Two continua of embedded regenerative sets

Given a two-sided real-valued Lévy process $(X_t)_{t \in \mathbb{R}}$, define processes $(L_t)_{t \in \mathbb{R}}$ and $(M_t)_{t \in \mathbb{R}}$ by $L_t := \sup\{h \in \mathbb{R} : h - α(t-s) \le X_s \text{ for all } s \le t\} = \inf\{X_s + α(t-s) : s \le t\}$, $t \in \mathbb{R}$, and $M_t := \sup \{ h \in \mathbb{R} : h - α|t-s| \leq X_s \text{ for all } s \in \mathbb{R} \} = \inf \{X_s + α|t-s| : s \in \mathbb{R}\}$, $t \in \mathbb{R}$. The corresponding contact sets are the random sets $\mathcal{H}_α:= \{ t \in \mathbb{R} : X_{t}\wedge X_{t-} = L_t\}$ and $\mathcal{Z}_α:= \{ t \in \mathbb{R} : X_{t}\wedge X_{t-} = M_t\}$. For a fixed $α>\mathbb{E}[X_1]$ (resp. $α>|\mathbb{E}[X_1]|$) the set $\mathcal{H}_α$ (resp. $\mathcal{Z}_α$) is non-empty, closed, unbounded above and below, stationary, and regenerative. The collections $(\mathcal{H}_α)_{α> \mathbb{E}[X_1]}$ and $(\mathcal{Z}_α)_{α> |\mathbb{E}[X_1]|}$ are increasing in $α$ and the regeneration property is compatible with these inclusions in that each family is a continuum of embedded regenerative sets in the sense of Bertoin. We show that $(\sup\{t < 0 : t \in \mathcal{H}_α\})_{α> \mathbb{E}[X_1]}$ is a càdlàg, nondecreasing, pure jump process with independent increments and determine the intensity measure of the associated Poisson process of jumps. We obtain a similar result for $(\sup\{t < 0 : t \in \mathcal{Z}_α\})_{α> |β|}$ when $(X_t)_{t \in \mathbb{R}}$ is a (two-sided) Brownian motion with drift $β$.

preprint 2016 arXiv

Bayesian inference of natural selection from allele frequency time series

The advent of accessible ancient DNA technology now allows the direct ascertainment of allele frequencies in ancestral populations, thereby enabling the use of allele frequency time series to detect and estimate natural selection. Such direct observations of allele frequency dynamics are expected to be more powerful than inferences made using patterns of linked neutral variation obtained from modern individuals. We develop a Bayesian method to make use of allele frequency time series data and infer the parameters of general diploid selection, along with allele age, in non-equilibrium populations. We introduce a novel path augmentation approach, in which we use Markov chain Monte Carlo to integrate over the space of allele frequency trajectories consistent with the observed data. Using simulations, we show that this approach has good power to estimate selection coefficients and allele age. Moreover, when applying our approach to data on horse coat color, we find that ignoring a relevant demographic history can significantly bias the results of inference. Our approach is made available in a C++ software package.

preprint 2016 arXiv

Doob-Martin compactification of a Markov chain for growing random words sequentially

We consider a Markov chain that iteratively generates a sequence of random finite words in such a way that the $n^{\mathrm{th}}$ word is uniformly distributed over the set of words of length $2n$ in which $n$ letters are $a$ and $n$ letters are $b$: at each step an $a$ and a $b$ are shuffled in uniformly at random among the letters of the current word. We obtain a concrete characterization of the Doob-Martin boundary of this Markov chain. Writing $N(u)$ for the number of letters $a$ (equivalently, $b$) in the finite word $u$, we show that a sequence $(u_n)_{n \in \mathbb{N}}$ of finite words converges to a point in the boundary if, for an arbitrary word $v$, there is convergence as $n$ tends to infinity of the probability that the selection of $N(v)$ letters $a$ and $N(v)$ letters $b$ uniformly at random from $u_n$ and maintaining their relative order results in $v$. We exhibit a bijective correspondence between the points in the boundary and ergodic random total orders on the set $\{a_1, b_1, a_2, b_2, \ldots \}$ that have distributions which are separately invariant under finite permutations of the indices of the $a'$s and those of the $b'$s. We establish a further bijective correspondence between the set of such random total orders and the set of pairs $(μ,ν)$ of diffuse probability measures on $[0,1]$ such that $\frac{1}{2}(μ+ν)$ is Lebesgue measure: the restriction of the random total order to $\{a_1, b_1, \ldots, a_n, b_n\}$ is obtained by taking $X_1, \ldots, X_n$ (resp. $Y_1, \ldots, Y_n$) i.i.d. with common distribution $μ$ (resp. $ν$), letting $(Z_1, \ldots, Z_{2n})$ be $\{X_1, Y_1, \ldots, X_n, Y_n\}$ in increasing order, and declaring that the $k^{\mathrm{th}}$ smallest element in the restricted total order is $a_i$ (resp. $b_j$) if $Z_k = X_i$ (resp. $Z_k = Y_j$).
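The sampling probability that appears in the convergence criterion can be computed exactly for short words by brute force. The sketch below enumerates every choice of $N(v)$ letters $a$ and $N(v)$ letters $b$ from $u$ and counts how often the selected letters, read in their original order, spell $v$. The function name `subword_probability` is an illustrative assumption, and the enumeration is only practical for small inputs.

```python
from fractions import Fraction
from itertools import combinations


def subword_probability(u, v):
    """Probability that a uniform choice of N(v) a's and N(v) b's from u spells v."""
    k = v.count("a")
    if v.count("b") != k or len(v) != 2 * k:
        raise ValueError("v must contain equally many a's and b's and no other letters")
    a_pos = [i for i, c in enumerate(u) if c == "a"]
    b_pos = [i for i, c in enumerate(u) if c == "b"]
    if len(a_pos) < k or len(b_pos) < k:
        raise ValueError("u has too few letters to select from")
    hits = total = 0
    for chosen_a in combinations(a_pos, k):
        for chosen_b in combinations(b_pos, k):
            total += 1
            picked = sorted(chosen_a + chosen_b)   # keep the relative order in u
            if "".join(u[i] for i in picked) == v:
                hits += 1
    return Fraction(hits, total)
```

For example, with `u = "abab"` and `v = "ab"` there are four equally likely selections, three of which read `ab`, so the function returns `Fraction(3, 4)`.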

preprint 2016 arXiv

Leading the field: Fortune favors the bold in Thurstonian choice models

Schools with the highest average student performance are often the smallest schools; localities with the highest rates of some cancers are frequently small and the effects observed in clinical trials are likely to be largest for the smallest numbers of subjects. Informal explanations of this "small-schools phenomenon" point to the fact that the sample means of smaller samples have higher variances. But this cannot be a complete explanation: If we draw two samples from a diffuse distribution that is symmetric about some point, then the chance that the smaller sample has larger mean is 50\%. A particular consequence of results proved below is that if one draws three or more samples of different sizes from the same normal distribution, then the sample mean of the smallest sample is most likely to be highest, the sample mean of the second smallest sample is second most likely to be highest, and so on; this is true even though for any pair of samples, each one of the pair is equally likely to have the larger sample mean. Our conclusions are relevant to certain stochastic choice models including the following generalization of Thurstone's Law of Comparative Judgment. There are $n$ items. Item $i$ is preferred to item $j$ if $Z_i < Z_j$, where $Z$ is a random $n$-vector of preference scores. Suppose $\mathbb{P}\{Z_i = Z_j\} = 0$ for $i \ne j$, so there are no ties. Item $k$ is the favorite if $Z_k < \min_{i\ne k} Z_i$. Let $p_i$ denote the chance that item $i$ is the favorite. We characterize a large class of distributions for $Z$ for which $p_1 > p_2 > \cdots > p_n$. Our results are most surprising when $\mathbb{P}\{Z_i < Z_j\} = \mathbb{P}\{Z_i > Z_j\} = \frac{1}{2}$ for $i \ne j$, so neither of any two items is likely to be preferred over the other in a pairwise comparison.
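The ordering claim for three or more normal samples can be explored with a small Monte Carlo experiment. The sketch below relies on the fact that the mean of $n$ independent standard normal draws is itself normal with standard deviation $1/\sqrt{n}$, so each sample mean is drawn directly. The function name `highest_mean_counts`, the sample sizes, and the trial count are illustrative assumptions; the simulation only illustrates the abstract's statement and does not reproduce the paper's proofs.

```python
import random
from math import sqrt


def highest_mean_counts(sizes, trials, seed=0):
    """Count how often each sample has the highest sample mean.

    sizes[i] is the size of sample i; all samples come from the same N(0, 1) law.
    """
    rng = random.Random(seed)
    wins = [0] * len(sizes)
    for _ in range(trials):
        # Sample mean of n i.i.d. N(0, 1) variables is N(0, 1/n).
        means = [rng.gauss(0.0, 1.0 / sqrt(n)) for n in sizes]
        wins[means.index(max(means))] += 1
    return wins
```

Calling `highest_mean_counts([1, 4, 16], 20000)` returns three counts, one per sample size, which can be compared with the ordering described above (smallest sample most often highest).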

preprint 2016 arXiv

Radix sort trees in the large

The trie-based radix sort algorithm stores pairwise different infinite binary strings in the leaves of a binary tree in a way that the Ulam-Harris coding of each leaf equals a prefix (that is, an initial segment) of the corresponding string, with the prefixes being of minimal length so that they are pairwise different. We investigate the {\em radix sort tree chains} -- the tree-valued Markov chains that arise when successively storing infinite binary strings $Z_1,\ldots, Z_n$, $n=1,2,\ldots$ according to the trie-based radix sort algorithm, where the source strings $Z_1, Z_2,\ldots$ are independent and identically distributed. We establish a bijective correspondence between the full Doob--Martin boundary of the radix sort tree chain with a {\em symmetric Bernoulli source} (that is, each $Z_k$ is a fair coin-tossing sequence) and the family of radix sort tree chains for which the common distribution of the $Z_k$ is a diffuse probability measure on $\{0,1\}^\infty$. In essence, our result characterizes all the ways that it is possible to condition such a chain of radix sort trees consistently on its behavior "in the large".
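The rule "prefixes of minimal length so that they are pairwise different" can be illustrated on finite truncations of the strings. The sketch below assigns to each string the shortest prefix that is not a prefix of any other stored string, which plays the role of the leaf address. The name `radix_sort_leaves` is an assumption for this sketch, and finite strings stand in for the infinite binary strings of the abstract, so the function raises an error when a truncation is too short to separate two strings.

```python
def radix_sort_leaves(strings):
    """Map each binary string to its minimal distinguishing prefix (its leaf address)."""
    if len(set(strings)) != len(strings):
        raise ValueError("strings must be pairwise different")
    leaves = {}
    for s in strings:
        others = [t for t in strings if t != s]
        for length in range(len(s) + 1):
            prefix = s[:length]
            # Stop at the first prefix shared with no other stored string.
            if not any(t.startswith(prefix) for t in others):
                leaves[s] = prefix
                break
        else:
            raise ValueError("truncation too short to separate %r" % s)
    return leaves
```

A single stored string sits at the root (empty prefix), and storing further strings pushes leaves deeper only as far as is needed to tell the strings apart.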

preprint 2015 arXiv

Recovering a tree from the lengths of subtrees spanned by a randomly chosen sequence of leaves

Given an edge-weighted tree $T$ with $n$ leaves, sample the leaves uniformly at random without replacement and let $W_k$, $2 \le k \le n$, be the length of the subtree spanned by the first $k$ leaves. We consider the question, "Can $T$ be identified (up to isomorphism) by the joint probability distribution of the random vector $(W_2, \ldots, W_n)$?" We show that if $T$ is known {\em a priori} to belong to one of various families of edge-weighted trees, then the answer is, "Yes." These families include the edge-weighted trees with edge-weights in general position, the ultrametric edge-weighted trees, and certain families with equal weights on all edges such as $(k+1)$-valent and rooted $k$-ary trees for $k \ge 2$ and caterpillars.

preprint 2015 arXiv

When do skew-products exist?

The classical skew-product decomposition of planar Brownian motion represents the process in polar coordinates as an autonomously Markovian radial part and an angular part that is an independent Brownian motion on the unit circle time-changed according to the radial part. Theorem 4 of Liao (2009) gives a broad generalization of this fact to a setting where there is a diffusion on a manifold $X$ with a distribution that is equivariant under the smooth action of a Lie group $K$. Under appropriate conditions, there is a decomposition into an autonomously Markovian "radial" part that lives on the space of orbits of $K$ and an "angular" part that is an independent Brownian motion on the homogeneous space $K/M$, where $M$ is the isotropy subgroup of a point of $X$, that is time-changed with a time-change that is adapted to the filtration of the radial part. We present two apparent counterexamples to Theorem 4 of Liao (2009). In the first counterexample the angular part is not a time-change of any Brownian motion on $K/M$, whereas in the second counterexample the angular part is the time-change of a Brownian motion on $K/M$ but this Brownian motion is not independent of the radial part. In both of these examples $K/M$ has dimension $1$. The statement and proof of Theorem 4 from Liao (2009) remain valid when $K/M$ has dimension greater than $1$. Our examples raise the question of what conditions lead to the usual sort of skew-product decomposition when $K/M$ has dimension $1$ and what conditions lead to there being no decomposition at all or one in which the angular part is a time-changed Brownian motion but this Brownian motion is not independent of the radial part.

preprint 2014 arXiv

Killed Brownian motion with a prescribed lifetime distribution and models of default

The inverse first passage time problem asks whether, for a Brownian motion $B$ and a nonnegative random variable $ζ$, there exists a time-varying barrier $b$ such that $\mathbb{P}\{B_s>b(s),0\leq s\leq t\}=\mathbb{P}\{ζ>t\}$. We study a "smoothed" version of this problem and ask whether there is a "barrier" $b$ such that $ \mathbb{E}[\exp(-λ\int_0^tψ(B_s-b(s))\,ds)]=\mathbb{P}\{ζ>t\}$, where $λ$ is a killing rate parameter, and $ψ:\mathbb{R}\to[0,1]$ is a nonincreasing function. We prove that if $ψ$ is suitably smooth, the function $t\mapsto \mathbb{P}\{ζ>t\}$ is twice continuously differentiable, and the condition $0<-\frac{d\log\mathbb{P}\{ζ>t\}}{dt}<λ$ holds for the hazard rate of $ζ$, then there exists a unique continuously differentiable function $b$ solving the smoothed problem. We show how this result leads to flexible models of default for which it is possible to compute expected values of contingent claims.

preprint 2014 arXiv

Protected polymorphisms and evolutionary stability of patch-selection strategies in stochastic environments

We consider a population living in a patchy environment that varies stochastically in space and time. The population is composed of two morphs (that is, individuals of the same species with different genotypes). In terms of survival and reproductive success, the associated phenotypes differ only in their habitat selection strategies. We compute invasion rates corresponding to the rates at which the abundance of an initially rare morph increases in the presence of the other morph established at equilibrium. If both morphs have positive invasion rates when rare, then there is an equilibrium distribution such that the two morphs coexist; that is, there is a protected polymorphism for habitat selection. Alternatively, if one morph has a negative invasion rate when rare, then it is asymptotically displaced by the other morph under all initial conditions where both morphs are present. We refine the characterization of an evolutionary stable strategy for habitat selection from [Schreiber, 2012] in a mathematically rigorous manner. We provide a necessary and sufficient condition for the existence of an ESS that uses all patches and determine when using a single patch is an ESS. We also provide an explicit formula for the ESS when there are two habitat types. We show that adding environmental stochasticity results in an ESS that, when compared to the ESS for the corresponding model without stochasticity, spends less time in patches with larger carrying capacities and possibly makes use of sink patches, thereby practicing a spatial form of bet hedging.

preprint 2014 arXiv

The semigroup of metric measure spaces and its infinitely divisible probability measures

A metric measure space is a complete separable metric space equipped with a probability measure that has full support. Two such spaces are equivalent if they are isometric as metric spaces via an isometry that maps the probability measure on the first space to the probability measure on the second. The resulting set of equivalence classes can be metrized with the Gromov-Prohorov metric of Greven, Pfaffelhuber and Winter. We consider the natural binary operation $\boxplus$ on this space that takes two metric measure spaces and forms their Cartesian product equipped with the sum of the two metrics and the product of the two probability measures. We show that the metric measure spaces equipped with this operation form a cancellative, commutative, Polish semigroup with a translation invariant metric and that each element has a unique factorization into prime elements. We investigate the interaction between the semigroup structure and the natural action of the positive real numbers on this space that arises from scaling the metric. For example, we show that for any given positive real numbers $a,b,c$ the trivial space is the only space $\mathcal{X}$ that satisfies $a \mathcal{X} \boxplus b \mathcal{X} = c \mathcal{X}$. We establish that there is no analogue of the law of large numbers: if $\mathbf{X}_1, \mathbf{X}_2, \ldots$ is an independent, identically distributed sequence of random spaces, then no subsequence of $\frac{1}{n} \boxplus_{k=1}^n \mathbf{X}_k$ converges in distribution unless each $\mathbf{X}_k$ is almost surely equal to the trivial space. We characterize the infinitely divisible probability measures and the Lévy processes on this semigroup, characterize the stable probability measures and establish a counterpart of the LePage representation for the latter class.

preprint 2014 arXiv

Unseparated pairs and fixed points in random permutations

In a uniform random permutation Π of [n] := {1,2,...,n}, the set of elements k in [n-1] such that Π(k+1) = Π(k) + 1 has the same distribution as the set of fixed points of Π that lie in [n-1]. We give three different proofs of this fact using, respectively, an enumeration relying on the inclusion-exclusion principle, the introduction of two different Markov chains to generate uniform random permutations, and the construction of a combinatorial bijection. We also obtain the distribution of the analogous set for circular permutations that consists of those k in [n] such that Π(k+1 mod n) = Π(k) + 1 mod n. This latter random set is just the set of fixed points of the commutator [ρ, Π], where ρ is the n-cycle (1,2,...,n). We show for a general permutation η that, under weak conditions on the number of fixed points and 2-cycles of η, the total variation distance between the distribution of the number of fixed points of [η,Π] and a Poisson distribution with expected value 1 is small when n is large.
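The equality in distribution stated in the first sentence can be checked by exhaustive enumeration for small n. The sketch below is not one of the three proofs mentioned in the abstract; it simply tabulates both random sets over all permutations of [n] and compares the resulting counts. The function names are illustrative assumptions.

```python
from collections import Counter
from itertools import permutations


def unseparated_pairs(perm):
    """Set of k in [n-1] with perm(k+1) = perm(k) + 1; perm[k-1] stores perm(k)."""
    n = len(perm)
    return frozenset(k for k in range(1, n) if perm[k] == perm[k - 1] + 1)


def fixed_points_below_n(perm):
    """Set of fixed points k of perm with k in [n-1]."""
    n = len(perm)
    return frozenset(k for k in range(1, n) if perm[k - 1] == k)


def set_distributions(n):
    """Tabulate both sets over all n! permutations of {1, ..., n}."""
    pairs, fixed = Counter(), Counter()
    for perm in permutations(range(1, n + 1)):
        pairs[unseparated_pairs(perm)] += 1
        fixed[fixed_points_below_n(perm)] += 1
    return pairs, fixed
```

For n = 3, both tables assign count 3 to the empty set and count 1 to each of {1}, {2} and {1, 2}.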

preprint 2013 arXiv

Analysis and rejection sampling of Wright-Fisher diffusion bridges

We investigate the properties of a Wright-Fisher diffusion process started from frequency x at time 0 and conditioned to be at frequency y at time T. Such a process is called a bridge. Bridges arise naturally in the analysis of selection acting on standing variation and in the inference of selection from allele frequency time series. We establish a number of results about the distribution of neutral Wright-Fisher bridges and develop a novel rejection sampling scheme for bridges under selection that we use to study their behavior.

preprint 2012 arXiv

Coalescing systems of non-Brownian particles

A well-known result of Arratia shows that one can make rigorous the notion of starting an independent Brownian motion at every point of an arbitrary closed subset of the real line and then building a set-valued process by requiring particles to coalesce when they collide. Arratia noted that the value of this process will be almost surely a locally finite set at all positive times, and a finite set almost surely if the initial value is compact: the key to both of these facts is the observation that, because of the topology of the real line and the continuity of Brownian sample paths, at the time when two particles collide one or the other of them must have already collided with each particle that was initially between them. We investigate whether such instantaneous coalescence still occurs for coalescing systems of particles where either the state space of the individual particles is not locally homeomorphic to an interval or the sample paths of the individual particles are discontinuous. We give a quite general criterion for a coalescing system of particles on a compact state space to coalesce to a finite set at all positive times almost surely and show that there is almost sure instantaneous coalescence to a locally finite set for systems of Brownian motions on the Sierpinski gasket and stable processes on the real line with stable index greater than one.

preprint 2012 arXiv

Lipschitz minorants of Brownian Motion and Levy processes

For $α> 0$, the $α$-Lipschitz minorant of a function $f: \mathbb{R} \to \mathbb{R}$ is the greatest function $m : \mathbb{R} \to \mathbb{R}$ such that $m \leq f$ and $|m(s)-m(t)| \le α|s-t|$ for all $s,t \in \mathbb{R}$, should such a function exist. If $X=(X_t)_{t \in \mathbb{R}}$ is a real-valued Lévy process that is not pure linear drift with slope $\pm α$, then the sample paths of $X$ have an $α$-Lipschitz minorant almost surely if and only if $| \mathbb{E}[X_1] | < α$. Denoting the minorant by $M$, we investigate properties of the random closed set $\mathcal{Z} := \{t \in \mathbb{R} : M_t = X_t \wedge X_{t-}\}$, which, since it is regenerative and stationary, has the distribution of the closed range of some subordinator "made stationary" in a suitable sense. We give conditions for the contact set $\mathcal{Z}$ to be countable or to have zero Lebesgue measure, and we obtain formulas that characterize the Lévy measure of the associated subordinator. We study the limit of $\mathcal{Z}$ as $α\to \infty$ and find for the so-called abrupt Lévy processes introduced by Vigon that this limit is the set of local infima of $X$. When $X$ is a Brownian motion with drift $β$ such that $|β| < α$, we calculate explicitly the densities of various random variables related to the minorant.
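When it exists, the $α$-Lipschitz minorant is given by $m(t) = \inf_s \{ f(s) + α|t-s| \}$. The sketch below evaluates this formula on a finite, evenly spaced grid and lists the grid points where the minorant touches the function. The names `lipschitz_minorant` and `contact_indices` and the grid discretisation are assumptions made for illustration; a finite window ignores the path outside it, so this is only an approximation of the objects studied in the paper.

```python
def lipschitz_minorant(values, alpha, step=1.0):
    """Greatest alpha-Lipschitz sequence lying below `values` on an evenly spaced grid."""
    n = len(values)
    return [
        min(values[j] + alpha * step * abs(i - j) for j in range(n))
        for i in range(n)
    ]


def contact_indices(values, alpha, step=1.0, tol=1e-12):
    """Grid indices where the minorant coincides with the function."""
    minorant = lipschitz_minorant(values, alpha, step)
    return [i for i, (f, m) in enumerate(zip(values, minorant)) if abs(f - m) <= tol]
```

For the grid values `[0, 3, 1, 5, 0]` with `alpha = 1`, the minorant is `[0, 1, 1, 1, 0]` and the contact indices are `[0, 2, 4]`.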

preprint 2012 arXiv

Stochastic equations on projective systems of groups

We consider stochastic equations of the form $X_k = ϕ_k(X_{k+1}) Z_k$, $k \in \mathbb{N}$, where $X_k$ and $Z_k$ are random variables taking values in a compact group $G_k$, $ϕ_k: G_{k+1} \to G_k$ is a continuous homomorphism, and the noise $(Z_k)_{k \in \mathbb{N}}$ is a sequence of independent random variables. We take the sequence of homomorphisms and the sequence of noise distributions as given, and investigate what conditions on these objects result in a unique distribution for the "solution" sequence $(X_k)_{k \in \mathbb{N}}$ and what conditions permit the existence of a solution sequence that is a function of the noise alone (that is, the solution does not incorporate extra input randomness "at infinity"). Our results extend previous work on stochastic equations on a single group that was originally motivated by Tsirelson's example of a stochastic differential equation that has a unique solution in law but no strong solutions.

preprint 2012 arXiv

Stochastic flights of propellers

Kilometer-sized moonlets in Saturn's A ring create S-shaped wakes called "propellers" in surrounding material. The Cassini spacecraft has tracked the motions of propellers for several years and finds that they deviate from Keplerian orbits having constant semimajor axes. The inferred orbital migration is known to switch sign. We show using a statistical test that the time series of orbital longitudes of the propeller Blériot is consistent with that of a time-integrated Gaussian random walk. That is, Blériot's observed migration pattern is consistent with being stochastic. We further show, using a combination of analytic estimates and collisional N-body simulations, that stochastic migration of the right magnitude to explain the Cassini observations can be driven by encounters with ring particles 10-20 m in radius. That the local ring mass is concentrated in decameter-sized particles is supported on independent grounds by occultation analyses.

preprint 2012 arXiv

Stochastic population growth in spatially heterogeneous environments

Classical ecological theory predicts that environmental stochasticity increases extinction risk by reducing the average per-capita growth rate of populations. To understand the interactive effects of environmental stochasticity, spatial heterogeneity, and dispersal on population growth, we study the following model for population abundances in $n$ patches: the conditional law of $X_{t+dt}$ given $X_t=x$ is such that when $dt$ is small the conditional mean of $X_{t+dt}^i-X_t^i$ is approximately $[x^iμ_i+\sum_j(x^j D_{ji}-x^i D_{ij})]dt$, where $X_t^i$ and $μ_i$ are the abundance and per capita growth rate in the $i$-th patch respectively, and $D_{ij}$ is the dispersal rate from the $i$-th to the $j$-th patch, and the conditional covariance of $X_{t+dt}^i-X_t^i$ and $X_{t+dt}^j-X_t^j$ is approximately $x^i x^j σ_{ij}dt$. We show for such a spatially extended population that if $S_t=(X_t^1+\cdots+X_t^n)$ is the total population abundance, then $Y_t=X_t/S_t$, the vector of patch proportions, converges in law to a random vector $Y_\infty$ as $t\to\infty$, and the stochastic growth rate $\lim_{t\to\infty}t^{-1}\log S_t$ equals the space-time average per-capita growth rate $\sum_iμ_i\mathbb{E}[Y_\infty^i]$ experienced by the population minus half of the space-time average temporal variation $\mathbb{E}[\sum_{i,j}σ_{ij}Y_\infty^i Y_\infty^j]$ experienced by the population. We derive analytic results for the law of $Y_\infty$, find which choice of the dispersal mechanism $D$ produces an optimal stochastic growth rate for a freely dispersing population, and investigate the effect on the stochastic growth rate of constraints on dispersal rates. Our results provide fundamental insights into "ideal free" movement in the face of uncertainty, the persistence of coupled sink populations, the evolution of dispersal rates, and the single large or several small (SLOSS) debate in conservation biology.

preprint 2011 arXiv

A limit theorem for occupation measures of Lévy processes in compact groups

A short proof is given of a necessary and sufficient condition for the normalized occupation measure of a Lévy process in a metrizable compact group to be asymptotically uniform with probability one.

preprint 2011 arXiv

A mutation-selection model for general genotypes with recombination

We investigate a continuous time, probability measure-valued dynamical system that describes the process of mutation-selection balance in a context where the population is infinite, there may be infinitely many loci, and there are weak assumptions on selective costs. Our model arises when we incorporate very general recombination mechanisms into a previous model of mutation and selection from Steinsaltz, Evans and Wachter (2005) and take the relative strength of mutation and selection to be sufficiently small. The resulting dynamical system is a flow of measures on the space of loci. Each such measure is the intensity measure of a Poisson random measure on the space of loci: the points of a realization of the random measure record the set of loci at which the genotype of a uniformly chosen individual differs from a reference wild type due to an accumulation of ancestral mutations. Our motivation for working in such a general setting is to provide a basis for understanding mutation-driven changes in age-specific demographic schedules that arise from the complex interaction of many genes, and hence to develop a framework for understanding the evolution of aging. We establish the existence and uniqueness of the dynamical system, provide conditions for the existence and stability of equilibrium states, and prove that our continuous-time dynamical system is the limit of a sequence of discrete-time infinite population mutation-selection-recombination models in the standard asymptotic regime where selection and mutation are weak relative to recombination and both scale at the same infinitesimal rate in the limit.

preprint 2011 arXiv

Edge principal components and squash clustering: using the special structure of phylogenetic placement data for sample comparison

Principal components (PCA) and hierarchical clustering are two of the most heavily used techniques for analyzing the differences between nucleic acid sequence samples sampled from a given environment. However, a classical application of these techniques to distances computed between samples can lack transparency because there is no ready interpretation of the axes of classical PCA plots, and it is difficult to assign any clear intuitive meaning to either the internal nodes or the edge lengths of trees produced by distance-based hierarchical clustering methods such as UPGMA. We show that more interesting and interpretable results are produced by two new methods that leverage the special structure of phylogenetic placement data. Edge principal components analysis enables the detection of important differences between samples that contain closely related taxa. Each principal component axis is simply a collection of signed weights on the edges of the phylogenetic tree, and these weights are easily visualized by a suitable thickening and coloring of the edges. Squash clustering outputs a (rooted) clustering tree in which each internal node corresponds to an appropriate "average" of the original samples at the leaves below the node. Moreover, the length of an edge is a suitably defined distance between the averaged samples associated with the two incident nodes, rather than the less interpretable average of distances produced by UPGMA. We present these methods and illustrate their use with data from the microbiome of the human vagina.

preprint 2011 arXiv

Spectra of large random trees

We analyze the eigenvalues of the adjacency matrices of a wide variety of random trees. Using general, broadly applicable arguments based on the interlacing inequalities for the eigenvalues of a principal submatrix of a Hermitian matrix and a suitable notion of local weak convergence for an ensemble of random trees, we show that the empirical spectral distributions for each of a number of random tree models converge to a deterministic (model dependent) limit as the number of vertices goes to infinity. We conclude for ensembles such as the linear preferential attachment models, random recursive trees, and the uniform random trees that the limiting spectral distribution has a set of atoms that is dense in the real line. We obtain precise asymptotics on the mass assigned to zero by the empirical spectral measures via the connection with the cardinality of a maximal matching. Moreover, we show that the total weight of a weighted matching is asymptotically equivalent to a constant multiple of the number of vertices when the edge weights are independent, identically distributed, non-negative random variables with finite expected value. We greatly extend a celebrated result obtained by Schwenk for the uniform random trees by showing that, under mild conditions, with probability converging to one, the spectrum of a realization is shared by at least one other tree. For the linear preferential attachment model with parameter $a > -1$, we show that the suitably rescaled $k$ largest eigenvalues converge jointly.
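The link between the mass at zero and matchings rests on a classical identity: for a tree with $n$ vertices whose maximum matching has $ν$ edges, the eigenvalue $0$ of the adjacency matrix has multiplicity $n - 2ν$. The sketch below computes $ν$ with the standard leaves-first greedy procedure and returns that multiplicity. The parent-list encoding (each vertex's parent has a smaller index, vertex 0 is the root) and the function names are assumptions of this sketch, not notation from the paper.

```python
def max_matching_size(parent):
    """Size of a maximum matching in a tree given as a parent list.

    parent[0] must be None and parent[i] < i for i >= 1, so that scanning indices
    in decreasing order visits every child before its parent.
    """
    matched = [False] * len(parent)
    size = 0
    for v in range(len(parent) - 1, 0, -1):
        p = parent[v]
        # Greedy rule: match an unmatched vertex to its parent whenever possible.
        if not matched[v] and not matched[p]:
            matched[v] = matched[p] = True
            size += 1
    return size


def zero_eigenvalue_multiplicity(parent):
    """Multiplicity of eigenvalue 0 of the tree's adjacency matrix: n - 2 * nu."""
    return len(parent) - 2 * max_matching_size(parent)
```

Dividing the returned multiplicity by the number of vertices gives the mass that the empirical spectral distribution of that particular tree assigns to zero.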

preprint 2011 arXiv

The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples

Using modern technology, it is now common to survey microbial communities by sequencing DNA or RNA extracted in bulk from a given environment. Comparative methods are needed that indicate the extent to which two communities differ given data sets of this type. UniFrac, a method built around a somewhat ad hoc phylogenetics-based distance between two communities, is one of the most commonly used tools for these analyses. We provide a foundation for such methods by establishing that if one equates a metagenomic sample with its empirical distribution on a reference phylogenetic tree, then the weighted UniFrac distance between two samples is just the classical Kantorovich-Rubinstein (KR) distance between the corresponding empirical distributions. We demonstrate that this KR distance and extensions of it that arise from incorporating uncertainty in the location of sample points can be written as a readily computable integral over the tree, we develop $L^p$ Zolotarev-type generalizations of the metric, and we show how the p-value of the resulting natural permutation test of the null hypothesis "no difference between the two communities" can be approximated using a functional of a Gaussian process indexed by the tree. We relate the $L^2$ case to an ANOVA-type decomposition and find that the distribution of its associated Gaussian functional is that of a computable linear combination of independent $χ_1^2$ random variables.
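For probability distributions on a tree, the Kantorovich-Rubinstein distance reduces to a sum over edges of the edge length times the absolute difference between the masses the two distributions place below that edge. The sketch below implements this for the simplified case where all mass sits on the nodes of a rooted tree. The dictionary encoding and the name `kr_distance` are assumptions for illustration; the paper's setting also allows mass along edges and uncertainty in placement, which this sketch does not cover.

```python
def kr_distance(parent, length, p, q):
    """Kantorovich-Rubinstein distance between node-supported measures on a rooted tree.

    parent: dict mapping each node to its parent (the root maps to None).
    length: dict mapping each non-root node to the length of the edge above it.
    p, q:   dicts mapping nodes to mass; nodes that are absent carry zero mass.
    """
    below = {node: 0.0 for node in parent}   # net mass (p - q) at or below each node
    for node in parent:
        delta = p.get(node, 0.0) - q.get(node, 0.0)
        walker = node
        while walker is not None:
            below[walker] += delta
            walker = parent[walker]
    # Each edge contributes its length times the net mass that must cross it.
    return sum(
        length[node] * abs(below[node])
        for node in parent
        if parent[node] is not None
    )
```

With a root `r` joined to `a` by an edge of length 1 and to `b` by an edge of length 2, a unit mass at `a` compared with a unit mass at `b` gives distance 3, the length of the path from `a` to `b`.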

preprint2011arXiv

Transcriptional regulation: Effects of promoter proximal pausing on speed, synchrony and reliability

Recent whole genome polymerase binding assays have shown that a large proportion of unexpressed genes have pre-assembled RNA pol II transcription initiation complex stably bound to their promoters. Some such promoter proximally paused genes are regulated at transcription elongation rather than at initiation; it has been proposed that this difference allows these genes to both express faster and achieve more synchronous expression across populations of cells, thus overcoming molecular "noise" arising from low copy number factors. It has been established experimentally that genes which are regulated at elongation tend to express faster and more synchronously; however, it has not been shown directly whether or not it is the change in the regulated step per se that causes this increase in speed and synchrony. We investigate this question by proposing and analyzing a continuous-time Markov chain model of polymerase complex assembly regulated at one of two steps: initial polymerase association with DNA, or release from a paused, transcribing state. Our analysis demonstrates that, over a wide range of physical parameters, increased speed and synchrony are functional consequences of elongation control. Further, we make new predictions about the effect of elongation regulation on the consistent control of total transcript number between cells, and identify which elements in the transcription induction pathway are most sensitive to molecular noise and thus may be most evolutionarily constrained. Our methods produce symbolic expressions for quantities of interest with reasonable computational effort and can be used to explore the interplay between interaction topology and molecular noise in a broader class of biochemical networks. We provide general-purpose code implementing these methods.

preprint2011arXiv

Trickle-down processes and their boundaries

It is possible to represent each of a number of Markov chains as an evolving sequence of connected subsets of a directed acyclic graph that grow in the following way: initially, all vertices of the graph are unoccupied, particles are fed in one-by-one at a distinguished source vertex, successive particles proceed along directed edges according to an appropriate stochastic mechanism, and each particle comes to rest once it encounters an unoccupied vertex. Examples include the binary and digital search tree processes, the random recursive tree process and generalizations of it arising from nested instances of Pitman's two-parameter Chinese restaurant process, tree-growth models associated with Mallows' phi model of random permutations and with Schuetzenberger's non-commutative q-binomial theorem, and a construction due to Luczak and Winkler that grows uniform random binary trees in a Markovian manner. We introduce a framework that encompasses such Markov chains, and we characterize their asymptotic behavior by analyzing in detail their Doob-Martin compactifications, Poisson boundaries and tail sigma-fields.

preprint2010arXiv

Commuting birth-and-death processes

We use methods from combinatorics and algebraic statistics to study analogues of birth-and-death processes that have as their state space a finite subset of the $m$-dimensional lattice and for which the $m$ matrices that record the transition probabilities in each of the lattice directions commute pairwise. One reason such processes are of interest is that the transition matrix is straightforward to diagonalize, and hence it is easy to compute $n$ step transition probabilities. The set of commuting birth-and-death processes decomposes as a union of toric varieties, with the main component being the closure of all processes whose nearest neighbor transition probabilities are positive. We exhibit an explicit monomial parametrization for this main component, and we explore the boundary components using primary decomposition.

preprint2010arXiv

Coverage statistics for sequence census methods

Background: We study the statistical properties of fragment coverage in genome sequencing experiments. In an extension of the classic Lander-Waterman model, we consider the effect of the length distribution of fragments. We also introduce the notion of the shape of a coverage function, which can be used to detect aberrations in coverage. The probability theory underlying these problems is essential for constructing models of current high-throughput sequencing experiments, where both sample preparation protocols and sequencing technology particulars can affect fragment length distributions. Results: We show that regardless of fragment length distribution and under the mild assumption that fragment start sites are Poisson distributed, the fragments produced in a sequencing experiment can be viewed as resulting from a two-dimensional spatial Poisson process. We then study the jump skeleton of the coverage function, and show that the induced trees are Galton-Watson trees whose parameters can be computed. Conclusions: Our results extend standard analyses of shotgun sequencing that focus on coverage statistics at individual sites, and provide a null model for detecting deviations from random coverage in high-throughput sequence census based experiments. By focusing on fragments, we are also led to a new approach for visualizing sequencing data that should be of independent interest.
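The fragment model described here can be explored with a small simulation: start sites form a Poisson process along the genome, each fragment receives an independent length, and the coverage at a site counts the fragments that contain it. This is a rough sketch and not the authors' code. The exponential length distribution, the parameter values, and the function names are all assumptions made for illustration. Away from the ends of the genome, the mean coverage should be roughly the start rate times the mean fragment length.

```python
import random

def simulate_fragments(genome_length, rate, mean_length, rng):
    """Return (start, length) pairs with Poisson start sites of the given rate.
    Lengths are exponential with the given mean (an illustrative choice only)."""
    fragments = []
    start = rng.expovariate(rate)          # gaps between Poisson points are exponential
    while start < genome_length:
        length = rng.expovariate(1.0 / mean_length)
        fragments.append((start, length))
        start += rng.expovariate(rate)
    return fragments

def coverage_at(fragments, x):
    """Number of fragments whose half-open interval [start, start + length) contains x."""
    return sum(1 for start, length in fragments if start <= x < start + length)

rng = random.Random(0)
frags = simulate_fragments(10_000.0, rate=0.5, mean_length=20.0, rng=rng)
sites = range(200, 9800, 10)               # interior sites, to avoid edge effects
mean_cov = sum(coverage_at(frags, x) for x in sites) / len(sites)
print(mean_cov)                            # compare with rate * mean_length
```

Replacing the exponential lengths with any other distribution of the same mean should leave the mean coverage roughly unchanged, which matches the abstract's point that the Poisson structure does not depend on the length distribution.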

preprint2010arXiv

Dynamics of the time to the most recent common ancestor in a large branching population

If we follow an asexually reproducing population through time, then the amount of time that has passed since the most recent common ancestor (MRCA) of all current individuals lived will change as time progresses. The resulting "MRCA age" process has been studied previously when the population has a constant large size and evolves via the diffusion limit of standard Wright--Fisher dynamics. For any population model, the sample paths of the MRCA age process are made up of periods of linear upward drift with slope +1 punctuated by downward jumps. We build other Markov processes that have such paths from Poisson point processes on $\mathbb{R}_{++}\times\mathbb{R}_{++}$ with intensity measures of the form $λ\otimesμ$ where $λ$ is Lebesgue measure, and $μ$ (the "family lifetime measure") is an arbitrary, absolutely continuous measure satisfying $μ((0,\infty))=\infty$ and $μ((x,\infty))<\infty$ for all $x>0$. Special cases of this construction describe the time evolution of the MRCA age in $(1+β)$-stable continuous state branching processes conditioned on nonextinction--a particular case of which, $β=1$, is Feller's continuous state branching process conditioned on nonextinction. As well as the continuous time process, we also consider the discrete time Markov chain that records the value of the continuous process just before and after its successive jumps. We find transition probabilities for both the continuous and discrete time processes, determine when these processes are transient and recurrent and compute stationary distributions when they exist.

preprint2010arXiv

Non-existence of Markovian time dynamics for graphical models of correlated default

Filiz et al. (2008) proposed a model for the pattern of defaults seen among a group of firms at the end of a given time period. The ingredients in the model are a graph, where the vertices correspond to the firms and the edges describe the network of interdependencies between the firms, a parameter for each vertex that captures the individual propensity of that firm to default, and a parameter for each edge that captures the joint propensity of the two connected firms to default. The correlated default model can be rewritten as a standard Ising model on the graph by identifying the set of defaulting firms in the default model with the set of sites in the Ising model for which the spin is +1. We ask whether there is a suitable continuous time Markov chain taking values in the subsets of the vertex set such that the initial state of the chain is the empty set, each jump of the chain involves the inclusion of a single extra vertex, the distribution of the chain at some fixed time horizon is the one given by the default model, and the distribution of the chain for other times is described by a probability distribution in the same family as the default model. We show for three simple but financially natural special cases that this is not possible outside of the trivial case where there is complete independence between the firms.

preprint2010arXiv

Shape-based peak identification for ChIP-Seq

We present a new algorithm for the identification of bound regions from ChIP-seq experiments. Our method for identifying statistically significant peaks from read coverage is inspired by the notion of persistence in topological data analysis and provides a non-parametric approach that is robust to noise in experiments. Specifically, our method reduces the peak calling problem to the study of tree-based statistics derived from the data. We demonstrate the accuracy of our method on existing datasets, and we show that it can discover previously missed regions and can more clearly discriminate between multiple binding events. The software T-PIC (Tree shape Peak Identification for ChIP-Seq) is available at http://math.berkeley.edu/~vhower/tpic.html

preprint1998arXiv

Transition operators of diffusions reduce zero-crossing

If $u(t,x)$ is a solution of a one--dimensional, parabolic, second--order, linear partial differential equation (PDE), then it is known that, under suitable conditions, the number of zero--crossings of the function $u(t,\cdot)$ decreases (that is, does not increase) as time $t$ increases. Such theorems have applications to the study of blow--up of solutions of semilinear PDE, time dependent Sturm Liouville theory, curve shrinking problems and control theory. We generalise the PDE results by showing that the transition operator of a (possibly time--inhomogeneous) one--dimensional diffusion reduces the number of zero--crossings of a function or even, suitably interpreted, a signed measure. Our proof is completely probabilistic and depends in a transparent manner on little more than the sample--path continuity of diffusion processes.

31 published item(s)

Doob--Martin boundary of Rémy's tree growth chain

Two continua of embedded regenerative sets

Bayesian inference of natural selection from allele frequency time series

Doob-Martin compactification of a Markov chain for growing random words sequentially

Leading the field: Fortune favors the bold in Thurstonian choice models

Radix sort trees in the large

Recovering a tree from the lengths of subtrees spanned by a randomly chosen sequence of leaves

When do skew-products exist?

Killed Brownian motion with a prescribed lifetime distribution and models of default

Protected polymorphisms and evolutionary stability of patch-selection strategies in stochastic environments

The semigroup of metric measure spaces and its infinitely divisible probability measures

Unseparated pairs and fixed points in random permutations

Analysis and rejection sampling of Wright-Fisher diffusion bridges

Coalescing systems of non-Brownian particles

Lipschitz minorants of Brownian Motion and Levy processes

Stochastic equations on projective systems of groups

Stochastic flights of propellers

Stochastic population growth in spatially heterogeneous environments

A limit theorem for occupation measures of Lévy processes in compact groups

A mutation-selection model for general genotypes with recombination

Edge principal components and squash clustering: using the special structure of phylogenetic placement data for sample comparison

Spectra of large random trees

The phylogenetic Kantorovich-Rubinstein metric for environmental sequence samples

Transcriptional regulation: Effects of promoter proximal pausing on speed, synchrony and reliability

Trickle-down processes and their boundaries

Commuting birth-and-death processes

Coverage statistics for sequence census methods

Dynamics of the time to the most recent common ancestor in a large branching population

Non-existence of Markovian time dynamics for graphical models of correlated default

Shape-based peak identification for ChIP-Seq

Transition operators of diffusions reduce zero-crossing