Source author record

Asger Hobolth

Asger Hobolth appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: Applications; Populations and Evolution; Methodology

Catalog footprint

What is connected

5 works

3 topics

4 close collaborators

Preprint, 2022, arXiv

A flexible model-based framework for robust estimation of mutational signatures

Somatic mutations in cancer can be viewed as a mixture distribution of several mutational signatures, which can be inferred using non-negative matrix factorization (NMF). Mutational signatures have previously been parametrized using either simple mono-nucleotide interaction models or general tri-nucleotide interaction models. We describe a flexible and novel framework for identifying biologically plausible parametrizations of mutational signatures, and in particular for estimating di-nucleotide interaction models. The estimation procedure is based on the expectation-maximization (EM) algorithm and regression in the log-linear quasi-Poisson model. We show that di-nucleotide interaction signatures are statistically stable and sufficiently complex to fit the mutational patterns. Di-nucleotide interaction signatures often strike the right balance between appropriately fitting the data and avoiding over-fitting. They provide a better fit to data and are biologically more plausible than mono-nucleotide interaction signatures, and the parametrization is more stable than the parameter-rich tri-nucleotide interaction signatures. We illustrate our framework on three data sets of somatic mutation counts from cancer patients.
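As a rough illustration of the NMF step mentioned in this abstract, the sketch below fits a plain, unparametrized NMF under a Poisson model using the classical multiplicative updates for the generalized Kullback-Leibler divergence, which are often described as an EM-type scheme. It is not the authors' framework: the di-nucleotide parametrization and the quasi-Poisson regression step are omitted, and the count matrix, the function names and the rank are all hypothetical.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def kl_divergence(v, wh):
    """Generalized Kullback-Leibler divergence D(V || WH); 0 * log(0) is taken as 0."""
    total = 0.0
    for v_row, m_row in zip(v, wh):
        for x, m in zip(v_row, m_row):
            total += (x * math.log(x / m) if x > 0 else 0.0) - x + m
    return total

def poisson_nmf(v, rank, iters=200, seed=0):
    """Multiplicative KL updates for V ~ W H with non-negative W and H."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.uniform(0.5, 1.5) for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.uniform(0.5, 1.5) for _ in range(n_cols)] for _ in range(rank)]
    history = []  # divergence at the start of each iteration, plus the final value
    for _ in range(iters):
        wh = matmul(w, h)
        history.append(kl_divergence(v, wh))
        # Update W: W <- W * ((V / WH) H^T) / (row sums of H)
        ratio = [[x / m for x, m in zip(vr, mr)] for vr, mr in zip(v, wh)]
        num = matmul(ratio, transpose(h))
        h_sums = [sum(row) for row in h]
        w = [[w[i][k] * num[i][k] / h_sums[k] for k in range(rank)]
             for i in range(n_rows)]
        # Update H: H <- H * (W^T (V / WH)) / (column sums of W)
        wh = matmul(w, h)
        ratio = [[x / m for x, m in zip(vr, mr)] for vr, mr in zip(v, wh)]
        num = matmul(transpose(w), ratio)
        w_sums = [sum(col) for col in zip(*w)]
        h = [[h[k][j] * num[k][j] / w_sums[k] for j in range(n_cols)]
             for k in range(rank)]
    history.append(kl_divergence(v, matmul(w, h)))
    return w, h, history

# Hypothetical count matrix (made-up numbers, not data from the paper).
counts = [
    [10, 2, 3, 8],
    [4, 9, 7, 1],
    [6, 5, 5, 5],
    [12, 1, 2, 9],
    [3, 8, 9, 2],
]
w, h, history = poisson_nmf(counts, rank=2)
```

The list `history` records the divergence across iterations, which gives a simple check that the fit is improving.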

Preprint, 2021, arXiv

A sampling algorithm to compute the set of feasible solutions for non-negative matrix factorization with an arbitrary rank

Non-negative matrix factorization (NMF) is a useful method for extracting features from multivariate data, but an important and sometimes neglected concern is that NMF can yield non-unique solutions. Often there exists a set of feasible solutions (SFS), which makes the factorization more difficult to interpret. This problem is especially neglected in cancer genomics, where NMF is used to infer information about the mutational processes present in the evolution of cancer. In this paper the extent of non-uniqueness is investigated for two mutational count data sets, and a new sampling algorithm that can find the SFS is introduced. Our sampling algorithm is easy to implement and applies to an arbitrary NMF rank. This is in contrast to the state of the art, where the NMF rank must be smaller than or equal to four. For lower ranks we show that our algorithm performs similarly to the polygon inflation algorithm developed in chemometrics. Furthermore, we show how the size of the SFS can strongly influence the apparent variability of a solution. Our sampling algorithm is implemented in the R package SFS (https://github.com/ragnhildlaursen/SFS).
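The non-uniqueness described in this abstract can be seen in a small hand-built case: if A is an invertible matrix such that both W A and A^-1 H stay non-negative, then (W A, A^-1 H) is a second valid factorization of the same data. The sketch below only demonstrates that fact; it is not the sampling algorithm from the paper, and the matrices are hypothetical.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical non-negative factors (made-up numbers).
W = [[1.0, 2.0],
     [3.0, 1.0],
     [0.0, 2.0]]
H = [[2.0, 3.0, 4.0],
     [1.0, 1.0, 2.0]]
V = matmul(W, H)

# An invertible mixing matrix and its inverse.
A = [[1.0, 0.5],
     [0.0, 1.0]]
A_inv = [[1.0, -0.5],
         [0.0, 1.0]]

# A second pair of factors for the same V.
W2 = matmul(W, A)
H2 = matmul(A_inv, H)
V2 = matmul(W2, H2)

same_product = all(
    abs(x - y) < 1e-12 for r1, r2 in zip(V, V2) for x, y in zip(r1, r2)
)
still_non_negative = all(x >= 0 for row in W2 + H2 for x in row)
```

Here W2 differs from W and H2 differs from H, yet both pairs reproduce V exactly and contain no negative entries, so the data alone cannot distinguish between them.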

Preprint, 2021, arXiv

Multivariate phase-type theory for the site frequency spectrum

Linear functions of the site frequency spectrum (SFS) play a major role in understanding and investigating genetic diversity. Estimators of the mutation rate (e.g. based on the total number of segregating sites or the average number of pairwise differences) and tests for neutrality (e.g. Tajima's D) are perhaps the most well-known examples. The distribution of linear functions of the SFS is important for constructing confidence intervals for the estimators and for determining significance thresholds for neutrality tests. These distributions are often approximated using simulation procedures. In this paper we use multivariate phase-type theory to specify, characterize and calculate the distribution of linear functions of the site frequency spectrum. In particular, we show that many of the classical estimators of the mutation rate are distributed according to a discrete phase-type distribution. Neutrality tests, however, are generally not discrete phase-type distributed. For neutrality tests we derive the probability generating function using continuous multivariate phase-type theory, and numerically invert the function to obtain the distribution. A main result is an analytically tractable formula for the probability generating function of the SFS. Software implementation of the phase-type methodology is available in the R package phasty, and R code for the reproduction of our results is available as an accompanying vignette.
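To make "linear functions of the SFS" concrete, the sketch below computes the two classical mutation-rate estimators named in this abstract as weighted sums of an unfolded site frequency spectrum. It uses the standard textbook formulas, not the phase-type machinery of the paper, and the example spectrum and function names are hypothetical.

```python
from math import comb

def watterson_theta(sfs):
    """Watterson's estimator: number of segregating sites divided by a_n."""
    n = len(sfs) + 1  # sfs[i - 1] counts sites where the derived allele occurs i times
    a_n = sum(1.0 / i for i in range(1, n))
    return sum(sfs) / a_n

def pairwise_theta(sfs):
    """Average number of pairwise differences as a weighted sum of the SFS."""
    n = len(sfs) + 1
    # A site with i derived copies differs in i * (n - i) of the C(n, 2) pairs.
    weighted = sum(i * (n - i) * x for i, x in enumerate(sfs, start=1))
    return weighted / comb(n, 2)

# Hypothetical unfolded SFS for n = 5 sequences (made-up counts).
sfs = [4, 2, 1, 1]
theta_w = watterson_theta(sfs)
theta_pi = pairwise_theta(sfs)

# Numerator of Tajima's D; the variance normalization is omitted here.
tajima_numerator = theta_pi - theta_w
```

Both estimators are linear in the entries of the spectrum; only the weights differ. Tajima's D divides the difference above by an estimate of its standard deviation, and that normalization is why the test statistic itself is no longer a linear function of the SFS.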

Preprint, 2015, arXiv

The SMC' is a highly accurate approximation to the ancestral recombination graph

Two sequentially Markov coalescent models (SMC and SMC') are available as tractable approximations to the ancestral recombination graph (ARG). We present a Markov process describing coalescence at two fixed points along a pair of sequences evolving under the SMC'. Using our Markov process, we derive a number of new quantities related to the pairwise SMC', thereby analytically quantifying for the first time the similarity between the SMC' and ARG. We use our process to show that the joint distribution of pairwise coalescence times at recombination sites under the SMC' is the same as it is marginally under the ARG, which demonstrates that the SMC' is, in a particular well-defined, intuitive sense, the most appropriate first-order sequentially Markov approximation to the ARG. Finally, we use these results to show that population size estimates under the pairwise SMC are asymptotically biased, while under the pairwise SMC' they are approximately asymptotically unbiased.

Preprint, 2014, arXiv

Strong selective sweeps associated with ampliconic regions in great ape X chromosomes

The unique inheritance pattern of X chromosomes makes them preferential targets of adaptive evolution. We here investigate natural selection on the X chromosome in all species of great apes. We find that diversity is more strongly reduced around genes on the X compared with autosomes, and that a higher proportion of substitutions results from positive selection. Strikingly, the X exhibits several megabase-long regions where diversity is reduced more than fivefold. These regions overlap significantly among species, and have a higher singleton proportion, population differentiation, and nonsynonymous to synonymous substitution ratio. We rule out background selection and soft selective sweeps as explanations for these observations, and conclude that several strong selective sweeps have occurred independently in similar regions in several species. Since these regions are strongly associated with ampliconic sequences, we propose that intra-genomic conflict between the X and the Y chromosomes is a major driver of X chromosome evolution.