Source author record

Leo Lahti

Leo Lahti appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Applications; Computational Engineering, Finance, and Science; Machine Learning; Genomics; Quantitative Methods; Molecular Networks; Methodology; Data Structures and Algorithms; nlin.AO; Populations and Evolution

Catalog footprint

What is connected

10 works

10 topics

4 close collaborators


preprint, 2022, arXiv

Probabilistic multivariate early warning signals

A broad range of natural and social systems from the human microbiome to financial markets can go through critical transitions, where the system suddenly collapses to another stable configuration. Critical transitions can be unexpected, with potentially catastrophic consequences. Anticipating them early and accurately can facilitate controlled system manipulation and mitigation of undesired outcomes. Obtaining reliable predictions has been difficult, however, as often only a small fraction of the relevant variables can be monitored, and even minor perturbations can induce drastic changes in fragile states of a complex system. Data-driven indicators have been proposed as an alternative to prediction and signal an increasing risk of forthcoming transitions. Autocorrelation and variance are examples of generic indicators that tend to increase in the vicinity of an approaching tipping point across a range of systems. An important shortcoming in these and other widely studied indicators is that they deal with simplified one-dimensional representations of complex systems. Here, we demonstrate that a probabilistic data aggregation strategy can provide new ways to improve early warning detection by more efficiently utilizing the available information from multivariate time series. In particular, we consider a probabilistic variant of a vector autoregression model as a novel early warning indicator and argue that it has theoretical advantages related to model regularization, treatment of uncertainties, and parameter interpretation. We evaluate the performance against alternatives in a simulation benchmark and show improved sensitivity in EWS detection in a common ecological model encompassing multiple interacting species.
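The generic one-dimensional indicators mentioned in this abstract (variance and lag-1 autocorrelation) can be illustrated with a rolling-window sketch. This is not the probabilistic vector autoregression indicator proposed in the paper. The function names, the window width, and the toy AR(1) simulation with a slowly rising coefficient are all assumptions made here for illustration.

```python
import random
from statistics import mean, pvariance


def lag1_autocorrelation(window):
    """Lag-1 autocorrelation of a sequence (0.0 if it has no variation)."""
    m = mean(window)
    denom = sum((x - m) ** 2 for x in window)
    if denom == 0:
        return 0.0
    num = sum((a - m) * (b - m) for a, b in zip(window[:-1], window[1:]))
    return num / denom


def rolling_indicators(series, width):
    """Return (variance, lag-1 autocorrelation) for each sliding window."""
    out = []
    for end in range(width, len(series) + 1):
        w = series[end - width:end]
        out.append((pvariance(w), lag1_autocorrelation(w)))
    return out


def simulate_ar1(phis, noise_sd=1.0, seed=0):
    """Toy AR(1) process x[t] = phi[t] * x[t-1] + noise (hypothetical test data)."""
    rng = random.Random(seed)
    x, series = 0.0, []
    for phi in phis:
        x = phi * x + rng.gauss(0.0, noise_sd)
        series.append(x)
    return series


# The AR coefficient drifts from 0.1 towards 0.95, mimicking the loss of
# resilience that precedes a tipping point in simple models.
n = 2000
phis = [0.1 + 0.85 * t / (n - 1) for t in range(n)]
series = simulate_ar1(phis)
indicators = rolling_indicators(series, width=400)
first_var, first_ac = indicators[0]
last_var, last_ac = indicators[-1]
print(f"autocorrelation: {first_ac:.2f} -> {last_ac:.2f}")
print(f"variance:        {first_var:.2f} -> {last_var:.2f}")
```

In this toy setting both indicators would be expected to rise as the coefficient approaches 1. Each indicator looks at a single variable, which is the limitation the abstract raises for multivariate systems.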

preprint, 2020, arXiv

Linking statistical and ecological theory: Hubbell's unified neutral theory of biodiversity as a hierarchical Dirichlet process

Neutral models which assume ecological equivalence between species provide null models for community assembly. In Hubbell's Unified Neutral Theory of Biodiversity (UNTB), many local communities are connected to a single metacommunity through differing immigration rates. Our ability to fit the full multi-site UNTB has hitherto been limited by the lack of a computationally tractable and accurate algorithm. We show that a large class of neutral models with this mainland-island structure but differing local community dynamics converge in the large population limit to the hierarchical Dirichlet process. Using this approximation we developed an efficient Bayesian fitting strategy for the multi-site UNTB. We can also use this approach to distinguish between neutral local community assembly given a non-neutral metacommunity distribution and the full UNTB where the metacommunity too assembles neutrally. We applied this fitting strategy to both tropical trees and a data set comprising 570,851 sequences from 278 human gut microbiomes. The tropical tree data set was consistent with the UNTB, but for the human gut, neutrality was rejected at the whole community level. However, when we applied the algorithm to gut microbial species within the same taxon at different levels of taxonomic resolution, we found that species abundances within some genera were almost consistent with local community assembly. This was not true at higher taxonomic ranks. This suggests that the gut microbiota is more strongly niche constrained than macroscopic organisms, with different groups adopting different functional roles, but within those groups diversity may at least partially be maintained by neutrality. We also observed a negative correlation between body mass index and immigration rates within the family Ruminococcaceae.

preprint, 2014, arXiv

Tipping Elements in the Human Intestinal Ecosystem

Recent studies show that the microbial communities inhabiting the human intestine can have profound impact on our well-being and health. However, we have limited understanding of the mechanisms that control this complex ecosystem. Based on a deep phylogenetic analysis of the intestinal microbiota in a thousand western adults we identified groups of bacteria that tend to be either nearly absent, or abundant in most individuals. The abundances of these bimodally distributed bacteria vary independently, and their contrasting alternative states are associated with host factors such as ageing and overweight. We propose that such bimodal groups represent independent tipping elements of the intestinal microbiota. These reflect the overall state of the intestinal ecosystem whose critical transitions can have profound health implications and diagnostic potential.

preprint, 2013, arXiv

RPA: Probabilistic analysis of probe performance and robust summarization

Probe-level models have led to improved performance in microarray studies but the various sources of probe-level contamination are still poorly understood. Data-driven analysis of probe performance can be used to quantify the uncertainty in individual probes and to highlight the relative contribution of different noise sources. Improved understanding of the probe-level effects can lead to improved preprocessing techniques and microarray design. We have implemented probabilistic tools for probe performance analysis and summarization on short oligonucleotide arrays. In contrast to standard preprocessing approaches, the methods provide quantitative estimates of probe-specific noise and affinity terms and tools to investigate these parameters. Tools to incorporate prior information about the probes in the analysis are provided as well. Comparisons to known probe-level error sources and spike-in data sets validate the approach. Implementation is freely available in R/Bioconductor: http://www.bioconductor.org/packages/release/bioc/html/RPA.html

preprint, 2012, arXiv

Fully scalable online-preprocessing algorithm for short oligonucleotide microarray atlases

Accumulation of standardized data collections is opening up novel opportunities for holistic characterization of genome function. The limited scalability of current preprocessing techniques has, however, formed a bottleneck for full utilization of contemporary microarray collections. While short oligonucleotide arrays constitute a major source of genome-wide profiling data, scalable probe-level preprocessing algorithms have been available only for a few measurement platforms based on pre-calculated model parameters from restricted reference training sets. To overcome these key limitations, we introduce a fully scalable online-learning algorithm that provides tools to process large microarray atlases including tens of thousands of arrays. Unlike the alternatives, the proposed algorithm scales up in linear time with respect to sample size and is readily applicable to all short oligonucleotide platforms. This is the only available preprocessing algorithm that can learn probe-level parameters based on sequential hyperparameter updates on small, consecutive batches of data, thus circumventing the extensive memory requirements of the standard approaches and opening up novel opportunities to take full advantage of contemporary microarray data collections. Moreover, using the most comprehensive data collections to estimate probe-level effects can assist in pinpointing individual probes affected by various biases and provide new tools to guide array design and quality control. The implementation is freely available in R/Bioconductor at http://www.bioconductor.org/packages/devel/bioc/html/RPA.html
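The idea of sequential hyperparameter updates over small batches can be sketched with a textbook conjugate model. This is not the RPA implementation. The sketch assumes zero-mean Gaussian residuals for a single probe, with an inverse-gamma prior on the noise variance. The function names, prior values, and residuals are hypothetical.

```python
def update_inverse_gamma(alpha, beta, residuals):
    """Conjugate update of InvGamma(alpha, beta) on a Gaussian variance.

    Assumes residuals ~ N(0, sigma^2) with known zero mean; the posterior
    is InvGamma(alpha + n/2, beta + sum(r^2)/2).
    """
    n = len(residuals)
    return alpha + n / 2.0, beta + sum(r * r for r in residuals) / 2.0


def posterior_mean_variance(alpha, beta):
    """Posterior mean of sigma^2, defined only for alpha > 1."""
    if alpha <= 1:
        raise ValueError("posterior mean requires alpha > 1")
    return beta / (alpha - 1.0)


# Hypothetical residuals for one probe, processed in batches of two.
residuals = [0.5, -1.0, 2.0, 0.0, -0.5, 1.5]
alpha, beta = 2.0, 1.0  # assumed prior hyperparameters
for start in range(0, len(residuals), 2):
    batch = residuals[start:start + 2]
    alpha, beta = update_inverse_gamma(alpha, beta, batch)

# Processing everything at once gives the same posterior, so only the two
# hyperparameters need to be kept in memory between batches.
alpha_full, beta_full = update_inverse_gamma(2.0, 1.0, residuals)
print(alpha, beta, posterior_mean_variance(alpha, beta))
```

Because the posterior from one batch serves as the prior for the next, memory use in a scheme like this depends on the batch size rather than on the total number of arrays.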

preprint, 2012, arXiv

Global modeling of transcriptional responses in interaction networks

Motivation: Cell-biological processes are regulated through a complex network of interactions between genes and their products. The processes, their activating conditions, and the associated transcriptional responses are often unknown. Organism-wide modeling of network activation can reveal unique and shared mechanisms between physiological conditions, and potentially as yet unknown processes. We introduce a novel approach for organism-wide discovery and analysis of transcriptional responses in interaction networks. The method searches for local, connected regions in a network that exhibit coordinated transcriptional response in a subset of conditions. Known interactions between genes are used to limit the search space and to guide the analysis. Validation on a human pathway network reveals physiologically coherent responses, functional relatedness between physiological conditions, and coordinated, context-specific regulation of the genes. Availability: Implementation is freely available in R and Matlab at http://netpro.r-forge.r-project.org

preprint, 2011, arXiv

A brief overview on the BioPAX and SBML standards for formal presentation of complex biological knowledge

A brief informal overview on the BioPAX and SBML standards for formal presentation of complex biological knowledge.

preprint, 2011, arXiv

Cancer gene prioritization by integrative analysis of mRNA expression and DNA copy number data: a comparative review

A variety of genome-wide profiling techniques are available to probe complementary aspects of genome structure and function. Integrative analysis of heterogeneous data sources can reveal higher-level interactions that cannot be detected based on individual observations. A standard integration task in cancer studies is to identify altered genomic regions that induce changes in the expression of the associated genes based on joint analysis of genome-wide gene expression and copy number profiling measurements. In this review, we provide a comparison among various modeling procedures for integrating genome-wide profiling data of gene copy number and transcriptional alterations and highlight common approaches to genomic data integration. A transparent benchmarking procedure is introduced to quantitatively compare the cancer gene prioritization performance of the alternative methods. The benchmarking algorithms and data sets are available at http://intcomp.r-forge.r-project.org

preprint, 2011, arXiv

Dependency detection with similarity constraints

Unsupervised two-view learning, or detection of dependencies between two paired data sets, is typically done by some variant of canonical correlation analysis (CCA). CCA searches for a linear projection for each view, such that the correlations between the projections are maximized. The solution is invariant to any linear transformation of either or both of the views; for tasks with small sample size such flexibility implies overfitting, which is even worse for more flexible nonparametric or kernel-based dependency discovery methods. We develop variants which reduce the degrees of freedom by assuming constraints on similarity of the projections in the two views. A particular example is provided by a cancer gene discovery application where chromosomal distance affects the dependencies between gene copy number and activity levels. Similarity constraints are shown to improve detection performance of known cancer genes.

preprint, 2011, arXiv

Probabilistic analysis of the human transcriptome with side information

Understanding functional organization of genetic information is a major challenge in modern biology. Following the initial publication of the human genome sequence in 2001, advances in high-throughput measurement technologies and efficient sharing of research material through community databases have opened up new views to the study of living organisms and the structure of life. In this thesis, novel computational strategies have been developed to investigate a key functional layer of genetic information, the human transcriptome, which regulates the function of living cells through protein synthesis. The key contributions of the thesis are general exploratory tools for high-throughput data analysis that have provided new insights to cell-biological networks, cancer mechanisms and other aspects of genome function. A central challenge in functional genomics is that high-dimensional genomic observations are associated with high levels of complex and largely unknown sources of variation. By combining statistical evidence across multiple measurement sources and the wealth of background information in genomic data repositories it has been possible to solve some of the uncertainties associated with individual observations and to identify functional mechanisms that could not be detected based on individual measurement sources. Statistical learning and probabilistic models provide a natural framework for such modeling tasks. Open source implementations of the key methodological contributions have been released to facilitate further adoption of the developed methods by the research community.
