Source author record

Subharup Guha

Subharup Guha appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Methodology, Applications, Computation

Catalog footprint

What is connected

6 works

3 topics

4 close collaborators

preprint · 2022 · arXiv

A Clustering Approach to Integrative Analysis of Multiomic Cancer Data

Rapid technological advances have allowed for molecular profiling across multiple omics domains from a single sample for clinical decision making in many diseases, especially cancer. As tumor development and progression are dynamic biological processes involving composite genomic aberrations, key challenges are to effectively assimilate information from these domains to identify genomic signatures and biological entities that are druggable, develop accurate risk prediction profiles for future patients, and identify novel patient subgroups for tailored therapy and monitoring. We propose integrative probabilistic frameworks for high-dimensional multiple-domain cancer data that coherently incorporate dependence within and between domains to accurately detect tumor subtypes, thus providing a catalogue of genomic aberrations associated with cancer taxonomy. We propose an innovative, flexible and scalable Bayesian nonparametric framework for simultaneous clustering of both tumor samples and genomic probes. We describe an efficient variable selection procedure to identify relevant genomic aberrations that can potentially reveal underlying drivers of a disease. Although the work is motivated by several investigations related to lung cancer, the proposed methods are broadly applicable in a variety of contexts involving high-dimensional data. The success of the methodology is demonstrated using artificial data and lung cancer omics profiles publicly available from The Cancer Genome Atlas.

preprint · 2022 · arXiv

Nonparametric Bayes Differential Analysis for Dependent Multigroup Data with Application to DNA Methylation Analyses in Cancer

Modern cancer genomics datasets involve widely varying sizes and scales, measurement variables, and correlation structures. A fundamental analytical goal in these high-throughput studies is the development of general statistical techniques that can cleanly sift the signal from noise in identifying disease-specific genomic signatures across a set of experimental or biological conditions. We propose BayesDiff, a nonparametric Bayesian approach based on a novel class of first order mixture models, called the Sticky Poisson-Dirichlet process or multicuisine restaurant franchise. The BayesDiff methodology flexibly utilizes information from all the measurements and adaptively accommodates any serial dependence in the data, accounting for the inter-probe distances, to perform simultaneous inferences on the variables. The technique is applied to analyze a DNA methylation gastrointestinal (GI) cancer dataset, which displays both serial correlations and complex interaction patterns. Our analyses and results both support and complement known aspects of DNA methylation and gene association in upper GI cancers. In simulation studies, we demonstrate the effectiveness of the BayesDiff procedure relative to existing techniques for differential DNA methylation.

preprint · 2022 · arXiv

Predicting Phenotypes from Brain Connection Structure

This article focuses on the problem of predicting a response variable based on a network-valued predictor. Our motivation is the development of interpretable and accurate predictive models for cognitive traits and neuro-psychiatric disorders based on an individual's brain connection network (connectome). Current methods reduce the complex, high-dimensional brain network into low-dimensional pre-specified features prior to applying standard predictive algorithms. These methods are sensitive to feature choice and inevitably discard important information. Instead, we propose a nonparametric Bayes class of models that utilize the entire adjacency matrix defining brain region connections to adaptively detect predictive algorithms, while maintaining interpretability. The Bayesian Connectomics (BaCon) model class utilizes Poisson-Dirichlet processes to find a lower-dimensional, bidirectional (covariate, subject) pattern in the adjacency matrix. The "small n, large p" problem is transformed into a "small n, small q" problem, facilitating an effective stochastic search of the predictors. A spike-and-slab prior for the cluster predictors strikes a balance between regression model parsimony and flexibility, resulting in improved inferences and test case predictions. We describe basic properties of the BaCon model and develop efficient algorithms for posterior computation. The resulting methods are found to outperform existing approaches and are applied to a creative reasoning data set.

preprint · 2020 · arXiv

Probabilistic Detection and Estimation of Conic Sections from Noisy Data

Inferring unknown conic sections on the basis of noisy data is a challenging problem with applications in computer vision. A major limitation of the currently available estimation methods is that they rely on the underlying shape of the conic being known (ellipse, parabola or hyperbola). A general purpose Bayesian hierarchical model is proposed for conic sections, and the corresponding estimation method based on noisy data is shown to work even when the specific nature of the conic section is unknown. The model thus provides probabilistic detection of the underlying conic section and inference about the associated parameters of the conic section. Through extensive simulation studies where the true conics may not be known, the methodology is demonstrated to have practical and methodological advantages relative to many existing techniques. In addition, the proposed method provides probabilistic measures of uncertainty of the estimated parameters. Furthermore, we observe high fidelity to the true conics even in challenging situations, such as data arising from partial conics in arbitrarily rotated and non-standard form, and where a visual inspection is unable to correctly identify the type of conic section underlying the data.
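For orientation, the "type" of a conic mentioned in the abstract above can be read off the coefficients of the general quadratic form A x^2 + B xy + C y^2 + D x + E y + F = 0 through the discriminant B^2 - 4AC, which does not change under rotation. The sketch below is only this classical deterministic check, not the paper's Bayesian hierarchical model; the function name `conic_type` and the tolerance `tol` are illustrative assumptions, and degenerate conics (points, line pairs) are not handled.

```python
def conic_type(a: float, b: float, c: float, tol: float = 1e-12) -> str:
    """Classify a non-degenerate conic a*x^2 + b*x*y + c*y^2 + ... = 0.

    Only the quadratic coefficients matter for the type. This is the
    textbook discriminant rule, not the probabilistic detection method
    described in the abstract; `tol` is an assumed numerical tolerance.
    """
    disc = b * b - 4.0 * a * c
    if abs(disc) <= tol:
        return "parabola"
    return "ellipse" if disc < 0 else "hyperbola"


if __name__ == "__main__":
    print(conic_type(1, 0, 1))   # unit circle x^2 + y^2 = 1
    print(conic_type(0, 1, 0))   # rotated hyperbola x*y = 1
```

With noisy data the coefficients themselves are uncertain, which is why a probabilistic statement about the type, as the abstract proposes, is more informative than this point classification.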

preprint · 2016 · arXiv

Nonparametric Variable Selection, Clustering and Prediction for High-Dimensional Regression

The development of parsimonious models for reliable inference and prediction of responses in high-dimensional regression settings is often challenging due to relatively small sample sizes and the presence of complex interaction patterns between a large number of covariates. We propose an efficient, nonparametric framework for simultaneous variable selection, clustering and prediction in high-throughput regression settings with continuous or discrete outcomes, called VariScan. The VariScan model utilizes the sparsity induced by Poisson-Dirichlet processes (PDPs) to group the covariates into lower-dimensional latent clusters consisting of covariates with similar patterns among the samples. The data are permitted to direct the choice of a suitable cluster allocation scheme, choosing between PDPs and their special case, a Dirichlet process. Subsequently, the latent clusters are used to build a nonlinear prediction model for the responses using an adaptive mixture of linear and nonlinear elements, thus achieving a balance between model parsimony and flexibility. We investigate theoretical properties of the VariScan procedure that differentiate the allocation patterns of PDPs and Dirichlet processes both in terms of the number and relative sizes of their clusters. Additional theoretical results guarantee the high accuracy of the model-based clustering procedure, and establish model selection and prediction consistency. Through simulation studies and analyses of benchmark data sets, we demonstrate the reliability of VariScan's clustering mechanism and show that the technique compares favorably to, and often outperforms, existing methodologies in terms of the prediction accuracies of the subject-specific responses.
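The contrast between PDP and Dirichlet process allocations mentioned above can be illustrated with the standard two-parameter Chinese restaurant process, in which a discount of zero recovers the Dirichlet process. This is a minimal sketch of that generic sampling scheme under stated assumptions, not the VariScan implementation; the function name `sample_pdp_partition` and its parameters `alpha`, `discount` and `seed` are hypothetical.

```python
import random


def sample_pdp_partition(n, alpha, discount, seed=0):
    """Draw a random partition of n items from a two-parameter CRP.

    With k clusters among i seated items, the next item joins cluster j
    with weight (size_j - discount) and opens a new cluster with weight
    (alpha + discount * k). Setting discount = 0 gives the Dirichlet
    process special case. Returns (labels, cluster_sizes).
    """
    if not (0 <= discount < 1) or alpha <= -discount:
        raise ValueError("need 0 <= discount < 1 and alpha > -discount")
    rng = random.Random(seed)
    sizes, labels = [], []
    for _ in range(n):
        k = len(sizes)
        if k == 0:
            choice = 0  # the first item always opens the first cluster
        else:
            weights = [s - discount for s in sizes] + [alpha + discount * k]
            choice = rng.choices(range(k + 1), weights=weights)[0]
        if choice == k:
            sizes.append(1)
        else:
            sizes[choice] += 1
        labels.append(choice)
    return labels, sizes


if __name__ == "__main__":
    for d in (0.0, 0.5):
        _, sizes = sample_pdp_partition(500, alpha=1.0, discount=d, seed=1)
        print(f"discount={d}: {len(sizes)} clusters")
```

Running the loop with several seeds gives a feel for how the discount parameter changes the number and relative sizes of clusters, the property the abstract says is studied theoretically.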

preprint · 2015 · arXiv

hmmSeq: A hidden Markov model for detecting differentially expressed genes from RNA-seq data

We introduce hmmSeq, a model-based hierarchical Bayesian technique for detecting differentially expressed genes from RNA-seq data. Our novel hmmSeq methodology uses hidden Markov models to account for potential co-expression of neighboring genes. In addition, hmmSeq employs an integrated approach to studies with technical or biological replicates, automatically adjusting for any extra-Poisson variability. Moreover, for cases when paired data are available, hmmSeq includes a paired structure between treatments that incorporates subject-specific effects. To perform parameter estimation for the hmmSeq model, we develop an efficient Markov chain Monte Carlo algorithm. Further, we develop a procedure for detection of differentially expressed genes that automatically controls false discovery rate. A simulation study shows that the hmmSeq methodology performs better than competitors in terms of receiver operating characteristic curves. Finally, the analyses of three publicly available RNA-seq data sets demonstrate the power and flexibility of the hmmSeq methodology. An R package implementing the hmmSeq framework will be submitted to CRAN upon publication of the manuscript.
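The abstract does not spell out how the false discovery rate is controlled. One common approach in Bayesian differential-expression analyses, assumed here purely for illustration, is to rank genes by their posterior probability of differential expression and flag the largest set whose average posterior probability of being a false call stays below a target. The function name `bayesian_fdr_select` and the example probabilities are hypothetical and are not taken from hmmSeq.

```python
def bayesian_fdr_select(post_probs, target_fdr=0.05):
    """Return indices of genes flagged at a target Bayesian FDR.

    post_probs[i] is an (assumed) posterior probability that gene i is
    differentially expressed. Genes are taken in decreasing order of
    that probability; the estimated FDR of the top-k set is the mean of
    (1 - post_probs) over the set, and the largest k meeting the target
    is kept.
    """
    order = sorted(range(len(post_probs)),
                   key=lambda i: post_probs[i], reverse=True)
    running_error = 0.0
    best_k = 0
    for k, idx in enumerate(order, start=1):
        running_error += 1.0 - post_probs[idx]
        if running_error / k <= target_fdr:
            best_k = k
    return sorted(order[:best_k])


if __name__ == "__main__":
    probs = [0.99, 0.20, 0.97, 0.50, 0.96]  # made-up illustrative values
    print(bayesian_fdr_select(probs, target_fdr=0.05))
```

In practice the posterior probabilities would come from MCMC output such as the sampler the abstract describes; here they are simply supplied as a list.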
