Source author record

Sohan Seth

Sohan Seth appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning; Information Retrieval; Applications; Computational Engineering, Finance, and Science; Genomics; Neurons and Cognition

Catalog footprint

What is connected

8 works

6 topics

4 close collaborators

preprint · 2026 · arXiv

Modeling mental health trajectories during the COVID-19 pandemic using UK-wide data in the presence of sociodemographic variables

Background: The negative effects of the COVID-19 pandemic on the mental health and well-being of populations are an important public health issue. Our study aims to determine the underlying factors shaping mental health trajectories during the COVID-19 pandemic in the UK. Methods: Data from the Understanding Society COVID-19 Study were utilized and the core analysis focussed on GHQ36 scores as the outcome variable. We used generalized additive models (GAMs) to evaluate trends over time and the role of sociodemographic variables, i.e., age, sex, ethnicity, country of residence (in UK), job status (employment), household income, living with a partner, living with children under age 16, and living with a long-term illness, on the variation of mental health during the study period. Results: Statistically significant differences in mental health were observed for age, sex, ethnicity, country of residence (in UK), job status (employment), household income, living with a partner, living with children under age 16, and living with a long-term illness. Women had higher GHQ36 scores than men, with the GHQ36 score expected to be 1.260 higher (95% CI: 1.176, 1.345). Individuals living without a partner were expected to have GHQ36 scores 1.050 (95% CI: 0.949, 1.148) higher than those living with a partner, and age groups 16-34, 35-44, 45-54, 55-64 experienced higher GHQ36 scores relative to those who were 65+. Individuals with relatively lower household income were likely to have poorer mental health relative to those who were more well off. Conclusion: This study identifies key demographic determinants shaping mental health trajectories during the COVID-19 pandemic in the UK. Policies aiming to reduce mental health inequalities should target women, youth, individuals living without a partner, individuals living with children under 16, individuals with a long-term illness, and lower income families.

preprint · 2016 · arXiv

Modelling-based experiment retrieval: A case study with gene expression clustering

Motivation: Public and private repositories of experimental data are growing to sizes that require dedicated methods for finding relevant data. To improve on the state of the art of keyword searches from annotations, methods for content-based retrieval have been proposed. In the context of gene expression experiments, most methods retrieve gene expression profiles, requiring each experiment to be expressed as a single profile, typically of case vs. control. A more general, recently suggested alternative is to retrieve experiments whose models are good for modelling the query dataset. However, for very noisy and high-dimensional query data, this retrieval criterion turns out to be very noisy as well. Results: We propose doing retrieval using a denoised model of the query dataset, instead of the original noisy dataset itself. To this end, we introduce a general probabilistic framework, where each experiment is modelled separately and the retrieval is done by finding related models. For retrieval of gene expression experiments, we use a probabilistic model called the product partition model, which induces a clustering of genes that show similar expression patterns across a number of samples. The suggested metric for retrieval using clusterings is the normalized information distance. Finally, empirical results suggest that inference for the full probabilistic model can be approximated with good performance using computationally faster heuristic clustering approaches (e.g. $k$-means). The method is highly scalable and straightforward to apply to construct a general-purpose gene expression experiment retrieval method. Availability: The method can be implemented using standard clustering algorithms and normalized information distance, available in many statistical software packages.
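The abstract names the normalized information distance as the metric for comparing clusterings. The sketch below illustrates one common definition of that distance, 1 - I(U;V) / max(H(U), H(V)), for two label assignments over the same genes. The abstract does not spell out a formula, so this choice is an assumption, and the function names are hypothetical rather than taken from the paper's software.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (in nats) of a cluster label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (in nats) between two label assignments."""
    n = len(a)
    joint = Counter(zip(a, b))
    count_a, count_b = Counter(a), Counter(b)
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def normalized_information_distance(a, b):
    """Assumed definition: 1 - I(a;b) / max(H(a), H(b))."""
    if len(a) != len(b):
        raise ValueError("both clusterings must label the same items")
    h_max = max(entropy(a), entropy(b))
    if h_max == 0.0:
        # Both clusterings put every item in one cluster, so they agree.
        return 0.0
    return 1.0 - mutual_information(a, b) / h_max
```

Under this definition, two clusterings that differ only in how the clusters are named are at distance 0, and statistically independent clusterings are at distance 1. A retrieval step could then rank stored experiments by this distance to the query clustering.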

preprint · 2014 · arXiv

Exploration and retrieval of whole-metagenome sequencing samples

In recent years, the field of whole metagenome shotgun sequencing has witnessed significant growth due to high-throughput sequencing technologies that allow genomic samples to be sequenced more cheaply, faster, and with better coverage than before. This technical advancement has initiated the trend of sequencing multiple samples in different conditions or environments to explore the similarities and dissimilarities of the microbial communities. Examples include the human microbiome project and various studies of the human intestinal tract. With the availability of ever larger databases of such measurements, finding samples similar to a given query sample is becoming a central operation. In this paper, we develop a content-based exploration and retrieval method for whole metagenome sequencing samples. We apply a distributed string mining framework to efficiently extract all informative sequence $k$-mers from a pool of metagenomic samples and use them to measure the dissimilarity between two samples. We evaluate the performance of the proposed approach on two human gut metagenome data sets as well as human microbiome project metagenomic samples. We observe significant enrichment for diseased gut samples in results of queries with another diseased sample and very high accuracy in discriminating between different body sites even though the method is unsupervised. A software implementation of the DSM framework is available at https://github.com/HIITMetagenomics/dsm-framework
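As a rough illustration of comparing samples through their $k$-mer content, the sketch below counts all $k$-mers in each sample and compares the resulting relative-frequency profiles. It is a toy under stated assumptions: it does not reproduce the distributed string mining framework or its selection of informative $k$-mers, the total variation distance is a stand-in for whatever dissimilarity the paper uses, and the function names are hypothetical.

```python
from collections import Counter


def kmer_profile(reads, k):
    """Relative frequencies of all k-mers across a sample's reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no k-mers found; reads are shorter than k")
    return {kmer: c / total for kmer, c in counts.items()}


def profile_dissimilarity(p, q):
    """Total variation distance between two k-mer profiles (0 to 1)."""
    kmers = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in kmers)


# Toy usage: two samples given as lists of read strings.
sample_a = kmer_profile(["ACGTAC", "GTACGT"], k=3)
sample_b = kmer_profile(["ACGTAC", "TTTTTT"], k=3)
distance = profile_dissimilarity(sample_a, sample_b)
```

Identical profiles give a distance of 0 and profiles with no shared $k$-mers give 1, so a query sample can be compared against a database by sorting on this value.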

preprint · 2014 · arXiv

Probabilistic Archetypal Analysis

Archetypal analysis represents a set of observations as convex combinations of pure patterns, or archetypes. The original geometric formulation of finding archetypes by approximating the convex hull of the observations assumes them to be real valued. This, unfortunately, is not compatible with many practical situations. In this paper we revisit archetypal analysis from the basic principles, and propose a probabilistic framework that accommodates other observation types such as integers, binary, and probability vectors. We corroborate the proposed methodology with convincing real-world applications on finding archetypal winter tourists based on binary survey data, archetypal disaster-affected countries based on disaster count data, and document archetypes based on term-frequency data. We also present an appropriate visualization tool to better summarize the archetypal analysis solution.
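The phrase "convex combinations of archetypes" can be made concrete with a small sketch: an observation is approximated by a weighted average of archetypes whose weights are non-negative and sum to one. The code below covers only this reconstruction step for real-valued data. It is not the probabilistic framework proposed in the paper, it does not fit archetypes, and the function name and toy archetypes are hypothetical.

```python
def convex_combination(weights, archetypes, tol=1e-9):
    """Reconstruct a point as a convex combination of archetypes.

    weights: one non-negative weight per archetype, summing to 1.
    archetypes: equal-length sequences of coordinates.
    """
    if len(weights) != len(archetypes):
        raise ValueError("need exactly one weight per archetype")
    if any(w < -tol for w in weights) or abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must be non-negative and sum to 1")
    dim = len(archetypes[0])
    return [
        sum(w * a[d] for w, a in zip(weights, archetypes))
        for d in range(dim)
    ]


# Toy usage: three archetypes at the corners of a triangle.
corners = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
point = convex_combination([0.2, 0.3, 0.5], corners)
```

Because the weights lie on the simplex, every reconstructed point stays inside the convex hull of the archetypes, which is what makes the archetypes readable as extreme, "pure" patterns.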

preprint · 2014 · arXiv

Retrieval of Experiments by Efficient Estimation of Marginal Likelihood

We study the task of retrieving relevant experiments given a query experiment. By experiment, we mean a collection of measurements from a set of 'covariates' and the associated 'outcomes'. While similar experiments can be retrieved by comparing available 'annotations', this approach ignores the valuable information available in the measurements themselves. To incorporate this information in the retrieval task, we suggest employing a retrieval metric that utilizes probabilistic models learned from the measurements. We argue that such a metric is a sensible measure of similarity between two experiments since it permits inclusion of experiment-specific prior knowledge. However, accurate models are often not analytical, and one must resort to storing posterior samples, which demands considerable resources. Therefore, we study strategies to select informative posterior samples to reduce the computational load while maintaining the retrieval performance. We demonstrate the efficacy of our approach on simulated data with simple linear regression as the models, and on real-world datasets.
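One way to read the retrieval metric described above is as the likelihood of the query measurements under the model learned from a stored experiment, averaged over that experiment's posterior samples. The sketch below shows this Monte Carlo average for a simple linear regression with Gaussian noise. The exact metric, the known noise level, and the function names are assumptions made for illustration, and the sample-selection strategies studied in the paper are not shown.

```python
from math import exp, log, pi


def log_likelihood(query, slope, intercept, noise_std):
    """Gaussian log-likelihood of (x, y) pairs under y = slope*x + intercept."""
    total = 0.0
    for x, y in query:
        residual = y - (slope * x + intercept)
        total += -0.5 * log(2 * pi * noise_std ** 2)
        total -= residual ** 2 / (2 * noise_std ** 2)
    return total


def log_predictive(query, posterior_samples, noise_std=1.0):
    """Log of the average query likelihood over posterior samples.

    posterior_samples: list of (slope, intercept) draws from a stored
    experiment's posterior. Uses log-sum-exp for numerical stability.
    """
    values = [log_likelihood(query, s, b, noise_std) for s, b in posterior_samples]
    peak = max(values)
    return peak + log(sum(exp(v - peak) for v in values) / len(values))
```

A stored experiment whose posterior samples explain the query well receives a higher score, so ranking experiments by `log_predictive` gives a retrieval order. The cost grows with the number of stored samples, which is the load the abstract proposes to reduce.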

preprint · 2014 · arXiv

Retrieval of Experiments with Sequential Dirichlet Process Mixtures in Model Space

We address the problem of retrieving relevant experiments given a query experiment, motivated by the public databases of datasets in molecular biology and other experimental sciences, and the need of scientists to relate to earlier work on the level of actual measurement data. Since experiments are inherently noisy and databases ever accumulating, we argue that a retrieval engine should possess two particular characteristics. First, it should compare models learnt from the experiments rather than the raw measurements themselves: this allows incorporating experiment-specific prior knowledge to suppress noise effects and focus on what is important. Second, it should be updated sequentially from newly published experiments, without explicitly storing either the measurements or the models, which is critical for saving storage space and protecting data privacy: this promotes lifelong learning. We formulate the retrieval as a "supermodelling" problem, of sequentially learning a model of the set of posterior distributions, represented as sets of MCMC samples, and suggest the use of Particle-Learning-based sequential Dirichlet process mixture (DPM) for this purpose. The relevance measure for retrieval is derived from the supermodel through the mixture representation. We demonstrate the performance of the proposed retrieval method on simulated data and molecular biological experiments.

preprint · 2013 · arXiv

Bayesian Extensions of Kernel Least Mean Squares

The kernel least mean squares (KLMS) algorithm is a computationally efficient nonlinear adaptive filtering method that "kernelizes" the celebrated (linear) least mean squares algorithm. We demonstrate that the least mean squares algorithm is closely related to Kalman filtering, and thus, the KLMS can be interpreted as an approximate Bayesian filtering method. This allows us to systematically develop extensions of the KLMS by modifying the underlying state-space and observation models. The resulting extensions introduce many desirable properties such as "forgetting", and the ability to learn from discrete data, while retaining the computational simplicity and time complexity of the original algorithm.
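For readers unfamiliar with the baseline algorithm, the sketch below shows the standard KLMS update: each new input becomes a kernel center whose coefficient is the step size times the current prediction error. It covers only the original algorithm, not the Bayesian extensions the abstract describes. The Gaussian kernel, the default parameter values, and the class name are assumptions chosen for illustration.

```python
from math import exp


def gaussian_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-squared_distance / (2 * bandwidth ** 2))


class KLMS:
    """Minimal kernel least mean squares filter (no sparsification)."""

    def __init__(self, step_size=0.5, bandwidth=1.0):
        self.step_size = step_size
        self.bandwidth = bandwidth
        self.centers = []  # stored inputs
        self.coeffs = []   # one coefficient per stored input

    def predict(self, x):
        """Current estimate f(x) as a weighted sum of kernels."""
        return sum(
            a * gaussian_kernel(c, x, self.bandwidth)
            for a, c in zip(self.coeffs, self.centers)
        )

    def update(self, x, target):
        """Process one (input, target) pair and return the prior error."""
        error = target - self.predict(x)
        self.centers.append(tuple(x))
        self.coeffs.append(self.step_size * error)
        return error
```

Each update costs time linear in the number of stored centers, and nothing is ever forgotten. The "forgetting" property mentioned in the abstract is one of the behaviours this plain version lacks.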

preprint · 2013 · arXiv

Kernel methods on spike train space for neuroscience: a tutorial

Over the last decade several positive definite kernels have been proposed to treat spike trains as objects in Hilbert space. However, for the most part, such attempts still remain a mere curiosity for both computational neuroscientists and signal processing experts. This tutorial illustrates why kernel methods can, and have already started to, change the way spike trains are analyzed and processed. The presentation incorporates simple mathematical analogies and convincing practical examples in an attempt to show the yet unexplored potential of positive definite functions to quantify point processes. It also provides a detailed overview of the current state of the art and future challenges with the hope of engaging the readers in active participation.
