Researcher profile

Yun S. Song

Yun S. Song contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
13works
0followers
15topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

13 published item(s)

preprint2026arXiv

Bayesian inference of antibody evolutionary dynamics using multitype branching processes

When our immune system encounters foreign antigens (i.e., from pathogens), the B cells that produce our antibodies undergo a cyclic process of proliferation, mutation, and selection, improving their ability to bind to the specific antigen. Immunologists have recently developed powerful experimental techniques to investigate this process in mouse models. In one such experiment, mice are engineered with a monoclonal B-cell precursor and immunized with a model antigen. B cells are sampled from sacrificed mice after the immune response has progressed, and the mutated genetic loci encoding antibodies are sequenced. This experiment allows parallel replay of antibody evolution, but produces data at only one time point; we are unable to observe the evolutionary trajectories that lead to optimized antibody affinity in each mouse. To address this, we model antibody evolution as a multitype branching process and integrate over unobserved histories conditioned on phylogenetic signal in sequence data, leveraging parallel experimental replays for parameter inference. We infer the functional relationship between B-cell fitness and antigen binding affinity in a Bayesian framework, equipped with an efficient likelihood calculation algorithm and Markov chain Monte Carlo posterior approximation. In a simulation study, we demonstrate that a sigmoidal relationship between fitness and binding affinity can be recovered from realizations of the branching process. We then perform inference for experimental data from 52 replayed B-cell lineages sampled 15 days after immunization, yielding a total of 3,758 sampled B cells. The recovered sigmoidal curve indicates that the fitness of high-affinity B cells is over six times larger than that of low-affinity B cells, with a sharp transition from low to high fitness values as affinity increases.

preprint2022arXiv

Exact and arbitrarily accurate non-parametric two-sample tests based on rank spacings

A common method for deriving non-parametric tests is to reformulate a parametric test in terms of sample ranks. Despite being distribution free (even in finite samples), the resulting tests often display remarkable asymptotic power properties, typically matching the efficiency of their parametric counterpart. Empirically, these favorable power properties have been shown to persist in non-asymptotic regimes as well, prompting the need for finite-sample characterizations of the corresponding rank-based statistics. Here, we provide such characterization for the family of weighted $p$-norms of rank spacings, which includes the classical tests of Mann-Whitney, Dixon, and various generalizations thereof. For $p=1$, we provide exact expressions for the involved distributions, while for $p>1$ we describe the associated moment sequences and derive an algorithm to recover the distributions of interest from these sequences in a fast and stable manner. We use this framework to develop a new family of non-parametric tests mirroring properties of generalized likelihood-ratios, prove new tail bounds for Dixon's and Greenwood's statistics, and prove a previously formulated conjecture regarding the global efficiency of rank-based tests against the $F$-test in the context of scale-families.

preprint2020arXiv

The key parameters that govern translation efficiency

Translation of mRNA into protein is a fundamental yet complex biological process with multiple factors that can potentially affect its efficiency. Here, we study a stochastic model describing the traffic flow of ribosomes along the mRNA (namely, the inhomogeneous $\ell$-TASEP), and identify the key parameters that govern the overall rate of protein synthesis, sensitivity to initiation rate changes, and efficiency of ribosome usage. By analyzing a continuum limit of the model, we obtain closed-form expressions for stationary currents and ribosomal densities, which agree well with Monte Carlo simulations. Furthermore, we completely characterize the phase transitions in the system, and by applying our theoretical results, we formulate design principles that detail how to tune the key parameters we identified to optimize translation efficiency. Using ribosome profiling data from S. cerevisiae, we shows that its translation system is generally consistent with these principles. Our theoretical results have implications for evolutionary biology, as well as synthetic biology.

preprint2017arXiv

Operator Norm Inequalities between Tensor Unfoldings on the Partition Lattice

Interest in higher-order tensors has recently surged in data-intensive fields, with a wide range of applications including image processing, blind source separation, community detection, and feature extraction. A common paradigm in tensor-related algorithms advocates unfolding (or flattening) the tensor into a matrix and applying classical methods developed for matrices. Despite the popularity of such techniques, how the functional properties of a tensor changes upon unfolding is currently not well understood. In contrast to the body of existing work which has focused almost exclusively on matricizations, we here consider all possible unfoldings of an order-$k$ tensor, which are in one-to-one correspondence with the set of partitions of $\{1,\ldots,k\}$. We derive general inequalities between the $l^p$-norms of arbitrary unfoldings defined on the partition lattice. In particular, we demonstrate how the spectral norm ($p=2$) of a tensor is bounded by that of its unfoldings, and obtain an improved upper bound on the ratio of the Frobenius norm to the spectral norm of an arbitrary tensor. For specially-structured tensors satisfying a generalized definition of orthogonal decomposability, we prove that the spectral norm remains invariant under specific subsets of unfolding operations.

preprint2014arXiv

Decoding coalescent hidden Markov models in linear time

In many areas of computational biology, hidden Markov models (HMMs) have been used to model local genomic features. In particular, coalescent HMMs have been used to infer ancient population sizes, migration rates, divergence times, and other parameters such as mutation and recombination rates. As more loci, sequences, and hidden states are added to the model, however, the runtime of coalescent HMMs can quickly become prohibitive. Here we present a new algorithm for reducing the runtime of coalescent HMMs from quadratic in the number of hidden time states to linear, without making any additional approximations. Our algorithm can be incorporated into various coalescent HMMs, including the popular method PSMC for inferring variable effective population sizes. Here we implement this algorithm to speed up our demographic inference method diCal, which is equivalent to PSMC when applied to a sample of two haplotypes. We demonstrate that the linear-time method can reconstruct a population size change history more accurately than the quadratic-time method, given similar computation resources. We also apply the method to data from the 1000 Genomes project, inferring a high-resolution history of size changes in the European population.

preprint2014arXiv

SMaSH: A Benchmarking Toolkit for Human Genome Variant Calling

Motivation: Computational methods are essential to extract actionable information from raw sequencing data, and to thus fulfill the promise of next-generation sequencing technology. Unfortunately, computational tools developed to call variants from human sequencing data disagree on many of their predictions, and current methods to evaluate accuracy and computational performance are ad-hoc and incomplete. Agreement on benchmarking variant calling methods would stimulate development of genomic processing tools and facilitate communication among researchers. Results: We propose SMaSH, a benchmarking methodology for evaluating human genome variant calling algorithms. We generate synthetic datasets, organize and interpret a wide range of existing benchmarking data for real genomes, and propose a set of accuracy and computational performance metrics for evaluating variant calling methods on this benchmarking data. Moreover, we illustrate the utility of SMaSH to evaluate the performance of some leading single nucleotide polymorphism (SNP), indel, and structural variant calling algorithms. Availability: We provide free and open access online to the SMaSH toolkit, along with detailed documentation, at smash.cs.berkeley.edu.

preprint2012arXiv

An explicit transition density expansion for a multi-allelic Wright-Fisher diffusion with general diploid selection

Characterizing time-evolution of allele frequencies in a population is a fundamental problem in population genetics. In the Wright-Fisher diffusion, such dynamics is captured by the transition density function, which satisfies well-known partial differential equations. For a multi-allelic model with general diploid selection, various theoretical results exist on representations of the transition density, but finding an explicit formula has remained a difficult problem. In this paper, a technique recently developed for a diallelic model is extended to find an explicit transition density for an arbitrary number of alleles, under a general diploid selection model with recurrent parent-independent mutation. Specifically, the method finds the eigenvalues and eigenfunctions of the generator associated with the multi-allelic diffusion, thus yielding an accurate spectral representation of the transition density. Furthermore, this approach allows for efficient, accurate computation of various other quantities of interest, including the normalizing constant of the stationary distribution and the rate of convergence to this distribution.

preprint2012arXiv

Approximate sampling formulae for general finite-alleles models of mutation

Many applications in genetic analyses utilize sampling distributions, which describe the probability of observing a sample of DNA sequences randomly drawn from a population. In the one-locus case with special models of mutation such as the infinite-alleles model or the finite-alleles parent-independent mutation model, closed-form sampling distributions under the coalescent have been known for many decades. However, no exact formula is currently known for more general models of mutation that are of biological interest. In this paper, models with finitely-many alleles are considered, and an urn construction related to the coalescent is used to derive approximate closed-form sampling formulas for an arbitrary irreducible recurrent mutation model or for a reversible recurrent mutation model, depending on whether the number of distinct observed allele types is at most three or four, respectively. It is demonstrated empirically that the formulas derived here are highly accurate when the per-base mutation rate is low, which holds for many biological organisms.

preprint2012arXiv

Padé approximants and exact two-locus sampling distributions

For population genetics models with recombination, obtaining an exact, analytic sampling distribution has remained a challenging open problem for several decades. Recently, a new perspective based on asymptotic series has been introduced to make progress on this problem. Specifically, closed-form expressions have been derived for the first few terms in an asymptotic expansion of the two-locus sampling distribution when the recombination rate $ρ$ is moderate to large. In this paper, a new computational technique is developed for finding the asymptotic expansion to an arbitrary order. Computation in this new approach can be automated easily. Furthermore, it is proved here that only a finite number of terms in the asymptotic expansion is needed to recover (via the method of Padé approximants) the exact two-locus sampling distribution as an analytic function of $ρ$; this function is exact for all values of $ρ\in[0,\infty)$. It is also shown that the new computational framework presented here is flexible enough to incorporate natural selection.

preprint2011arXiv

Closed-form asymptotic sampling distributions under the coalescent with recombination for an arbitrary number of loci

Obtaining a closed-form sampling distribution for the coalescent with recombination is a challenging problem. In the case of two loci, a new framework based on asymptotic series has recently been developed to derive closed-form results when the recombination rate is moderate to large. In this paper, an arbitrary number of loci is considered and combinatorial approaches are employed to find closed-form expressions for the first couple of terms in an asymptotic expansion of the multi-locus sampling distribution. These expressions are universal in the sense that their functional form in terms of the marginal one-locus distributions applies to all finite- and infinite-alleles models of mutation.

preprint2011arXiv

Open string instantons and relative stable morphisms

We show how topological open string theory amplitudes can be computed by using relative stable morphisms in the algebraic category. We achieve our goal by explicitly working through an example which has been previously considered by Ooguri and Vafa from the point of view of physics. By using the method of virtual localization, we successfully reproduce their results for multiple covers of a holomorphic disc, whose boundary lies in a Lagrangian submanifold of a Calabi-Yau 3-fold, by Riemann surfaces with arbitrary genera and number of boundary components. In particular we show that in the case we consider there are no open string instantons with more than one boundary component ending on the Lagrangian submanifold.

preprint2010arXiv

An asymptotic sampling formula for the coalescent with Recombination

Ewens sampling formula (ESF) is a one-parameter family of probability distributions with a number of intriguing combinatorial connections. This elegant closed-form formula first arose in biology as the stationary probability distribution of a sample configuration at one locus under the infinite-alleles model of mutation. Since its discovery in the early 1970s, the ESF has been used in various biological applications, and has sparked several interesting mathematical generalizations. In the population genetics community, extending the underlying random-mating model to include recombination has received much attention in the past, but no general closed-form sampling formula is currently known even for the simplest extension, that is, a model with two loci. In this paper, we show that it is possible to obtain useful closed-form results in the case the population-scaled recombination rate $ρ$ is large but not necessarily infinite. Specifically, we consider an asymptotic expansion of the two-locus sampling formula in inverse powers of $ρ$ and obtain closed-form expressions for the first few terms in the expansion. Our asymptotic sampling formula applies to arbitrary sample sizes and configurations.

preprint1999arXiv

On the Critical Behavior of D1-brane Theories

We study renormalization-group flow patterns in theories arising on D1-branes in various supersymmetry-breaking backgrounds. We argue that the theory of N D1-branes transverse to an orbifold space can be fine-tuned to flow to the corresponding orbifold conformal field theory in the infrared, for particular values of the couplings and theta angles which we determine using the discrete symmetries of the model. By calculating various nonplanar contributions to the scalar potential in the worldvolume theory, we show that fine-tuning is in fact required at finite N, as would be generically expected. We further comment on the presence of singular conformal field theories (such as those whose target space includes a ``throat'' described by an exactly solvable CFT) in the non-supersymmetric context. Throughout the analysis two applications are considered: to gauge theory/gravity duality and to linear sigma model techniques for studying worldsheet string theory.