Source author record

Jonathan Terhorst

Jonathan Terhorst appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: Machine Learning, math.PR, math.ST, Populations and Evolution, Quantitative Methods, Statistics Theory, Genomics, math.CO, math.OC

Catalog footprint

What is connected

7 works

9 topics

4 close collaborators

Preprint, 2022, arXiv

Exact and arbitrarily accurate non-parametric two-sample tests based on rank spacings

A common method for deriving non-parametric tests is to reformulate a parametric test in terms of sample ranks. Despite being distribution free (even in finite samples), the resulting tests often display remarkable asymptotic power properties, typically matching the efficiency of their parametric counterparts. Empirically, these favorable power properties have been shown to persist in non-asymptotic regimes as well, prompting the need for finite-sample characterizations of the corresponding rank-based statistics. Here, we provide such a characterization for the family of weighted $p$-norms of rank spacings, which includes the classical tests of Mann-Whitney, Dixon, and various generalizations thereof. For $p=1$, we provide exact expressions for the involved distributions, while for $p>1$ we describe the associated moment sequences and derive an algorithm to recover the distributions of interest from these sequences in a fast and stable manner. We use this framework to develop a new family of non-parametric tests mirroring properties of generalized likelihood-ratios, prove new tail bounds for Dixon's and Greenwood's statistics, and prove a previously formulated conjecture regarding the global efficiency of rank-based tests against the $F$-test in the context of scale-families.
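As a rough illustration of what "distribution free in finite samples" means for a rank-based test, the sketch below computes the textbook Mann-Whitney U statistic and its exact null distribution by brute-force enumeration of rank assignments. This is not the paper's rank-spacing framework or its moment-based algorithm; the function names `mann_whitney_u` and `exact_null_pmf` are hypothetical, and the enumeration assumes no ties and small sample sizes.

```python
from itertools import combinations


def mann_whitney_u(x, y):
    """Count pairs (xi, yj) with xi > yj; a tie contributes 1/2."""
    return sum((xi > yj) + 0.5 * (xi == yj) for xi in x for yj in y)


def exact_null_pmf(m, n):
    """Exact null distribution of U for sample sizes m and n (no ties).

    Under the null hypothesis every choice of m ranks out of m + n is
    equally likely for the first sample, so the distribution depends
    only on (m, n) and not on the underlying data distribution.
    """
    counts = {}
    total = 0
    for ranks_x in combinations(range(1, m + n + 1), m):
        # U equals the rank sum of the first sample minus its minimum value.
        u = sum(ranks_x) - m * (m + 1) // 2
        counts[u] = counts.get(u, 0) + 1
        total += 1
    return {u: c / total for u, c in sorted(counts.items())}


if __name__ == "__main__":
    print(mann_whitney_u([3, 4], [1, 2]))
    print(exact_null_pmf(2, 2))
```

The enumeration cost grows as the binomial coefficient of m + n over m, so this is only practical for very small samples.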

Preprint, 2020, arXiv

Explaining Groups of Points in Low-Dimensional Representations

A common workflow in data exploration is to learn a low-dimensional representation of the data, identify groups of points in that representation, and examine the differences between the groups to determine what they represent. We treat this workflow as an interpretable machine learning problem by leveraging the model that learned the low-dimensional representation to help identify the key differences between the groups. To solve this problem, we introduce a new type of explanation, a Global Counterfactual Explanation (GCE), and our algorithm, Transitive Global Translations (TGT), for computing GCEs. TGT identifies the differences between each pair of groups using compressed sensing but constrains those pairwise differences to be consistent among all of the groups. Empirically, we demonstrate that TGT is able to identify explanations that accurately explain the model while being relatively sparse, and that these explanations match real patterns in the data.

Preprint, 2015, arXiv

Efficient computation of the joint sample frequency spectra for multiple populations

A wide range of studies in population genetics have employed the sample frequency spectrum (SFS), a summary statistic which describes the distribution of mutant alleles at a polymorphic site in a sample of DNA sequences. In particular, recently there has been growing interest in analyzing the joint SFS data from multiple populations to infer parameters of complex demographic histories, including variable population sizes, population split times, migration rates, admixture proportions, and so on. Although much methodological progress has been made, existing SFS-based inference methods suffer from numerical instability and high computational complexity when multiple populations are involved and the sample size is large. In this paper, we present new analytic formulas and algorithms that enable efficient computation of the expected joint SFS for multiple populations related by a complex demographic model with arbitrary population size histories (including piecewise exponential growth). Our results are implemented in a new software package called momi (MOran Models for Inference). Through an empirical study involving tens of populations, we demonstrate our improvements to numerical stability and computational complexity.
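For readers unfamiliar with the sample frequency spectrum, the following minimal sketch computes the observed SFS of a single population from a small 0/1 haplotype matrix. It is not the momi package or the paper's method for the expected joint SFS; the function name `sample_frequency_spectrum` and the input layout (one row per sampled sequence, 1 marking the mutant allele) are assumptions made for illustration.

```python
from collections import Counter


def sample_frequency_spectrum(haplotypes):
    """Observed SFS of n sampled sequences.

    haplotypes: list of n equal-length sequences of 0/1 values,
    where 1 marks the mutant (derived) allele at a site.
    Returns a list whose entry k - 1 is the number of sites carrying
    exactly k mutant alleles, for k = 1, ..., n - 1. Sites with 0 or n
    mutant alleles are not polymorphic in the sample and are skipped.
    """
    n = len(haplotypes)
    # zip(*haplotypes) iterates over sites (columns) of the matrix.
    counts = Counter(sum(site) for site in zip(*haplotypes))
    return [counts.get(k, 0) for k in range(1, n)]


if __name__ == "__main__":
    sample = [
        [0, 1, 1, 0],
        [0, 1, 0, 0],
        [1, 1, 0, 0],
    ]
    print(sample_frequency_spectrum(sample))
```

The joint SFS discussed in the abstract generalizes this to a multi-dimensional array indexed by the mutant-allele count in each population.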

Preprint, 2015, arXiv

Fundamental limits on the accuracy of demographic inference based on the sample frequency spectrum

The sample frequency spectrum (SFS) of DNA sequences from a collection of individuals is a summary statistic which is commonly used for parametric inference in population genetics. Despite the popularity of SFS-based inference methods, currently little is known about the information-theoretic limit on the estimation accuracy as a function of sample size. Here, we show that using the SFS to estimate the size history of a population has a minimax error of at least $O(1/\log s)$, where $s$ is the number of independent segregating sites used in the analysis. This rate is exponentially worse than known convergence rates for many classical estimation problems in statistics. Another surprising aspect of our theoretical bound is that it does not depend on the dimension of the SFS, which is related to the number of sampled individuals. This means that, for a fixed number $s$ of segregating sites considered, using more individuals does not help to reduce the minimax error bound. Our result pertains to populations that have experienced a bottleneck, and we argue that it can be expected to apply to many populations in nature.

Preprint, 2014, arXiv

Communication-Efficient Distributed Dual Coordinate Ascent

Communication remains the most significant bottleneck in the performance of distributed optimization algorithms for large-scale machine learning. In this paper, we propose a communication-efficient framework, CoCoA, that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication. We provide a strong convergence rate analysis for this class of algorithms, as well as experiments on real-world distributed datasets with implementations in Spark. In our experiments, we find that, compared to state-of-the-art mini-batch versions of SGD and SDCA algorithms, CoCoA converges to the same 0.001-accurate solution quality on average 25x as quickly.

Preprint, 2014, arXiv

SMaSH: A Benchmarking Toolkit for Human Genome Variant Calling

Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad-hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. Results: We propose SMaSH, a benchmarking methodology for evaluating human genome variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on this benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single nucleotide polymorphism (SNP), indel, and structural variant calling algorithms. Availability: We provide free and open access online to the SMaSH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.

Preprint, 2011, arXiv

The Kalmanson Complex

Let X be a finite set of cardinality n. The Kalmanson complex K_n is the simplicial complex whose vertices are non-trivial X-splits, and whose facets are maximal circular split systems over X. In this paper we examine K_n from three perspectives. In addition to the T-theoretic description, we show that K_n has a geometric realization as the Kalmanson conditions on a finite metric. A third description arises in terms of binary matrices which possess the circular ones property. We prove the equivalence of these three definitions. This leads to a simplified proof of the well-known equivalence between Kalmanson and circular decomposable metrics, as well as a partial description of the f-vector of K_n.
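The Kalmanson conditions mentioned above can be checked directly on a small distance matrix. The sketch below uses the standard formulation: for indices i < j < k < l in a fixed circular ordering, both d(i,j) + d(k,l) and d(i,l) + d(j,k) must be at most d(i,k) + d(j,l). The function name `is_kalmanson` is hypothetical, and the code assumes the ordering 0, 1, ..., n - 1 of the matrix rows is the candidate circular ordering; it does not search over orderings or construct the complex K_n.

```python
from itertools import combinations


def is_kalmanson(d):
    """Check the Kalmanson conditions for a symmetric distance matrix d,
    taking the row order 0, 1, ..., n - 1 as the circular ordering."""
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        # Sum over the two "crossing" pairs of the quadruple i < j < k < l.
        crossing = d[i][k] + d[j][l]
        if d[i][j] + d[k][l] > crossing or d[i][l] + d[j][k] > crossing:
            return False
    return True


def line_metric(points):
    """Distance matrix of points on the real line (illustration only)."""
    return [[abs(a - b) for b in points] for a in points]


if __name__ == "__main__":
    print(is_kalmanson(line_metric([0, 1, 2, 3])))  # points listed in sorted order
    print(is_kalmanson(line_metric([0, 2, 1, 3])))  # same points, shuffled order
```

The check examines every quadruple, so it runs in time proportional to n^4; that is adequate for illustration but not tuned for large inputs.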
