Source author record

Sach Mukherjee

Sach Mukherjee appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Methodology, Machine Learning, Applications, Molecular Networks, Quantitative Methods

Catalog footprint

What is connected

17 works

5 topics

4 close collaborators

preprint, 2022, arXiv

On unsupervised projections and second order signals

Linear projections are widely used in the analysis of high-dimensional data. In unsupervised settings where the data harbour latent classes/clusters, the question of whether class discriminatory signals are retained under projection is crucial. In the case of mean differences between classes, this question has been well studied. However, in many contemporary applications, notably in biomedicine, group differences at the level of covariance or graphical model structure are important. Motivated by such applications, in this paper we ask whether linear projections can preserve differences in second order structure between latent groups. We focus on unsupervised projections, which can be computed without knowledge of class labels. We discuss a simple theoretical framework to study the behaviour of such projections which we use to inform an analysis via quasi-exhaustive enumeration. This allows us to consider the performance, over more than a hundred thousand sets of data-generating population parameters, of two popular projections, namely random projections (RP) and Principal Component Analysis (PCA). Across this broad range of regimes, PCA turns out to be more effective at retaining second order signals than RP and is often even competitive with supervised projection. We complement these results with fully empirical experiments showing 0-1 loss using simulated and real data. We also study the effect of projection dimension, drawing attention to a bias-variance trade-off in this respect. Our results show that PCA can indeed be a suitable first step for unsupervised analysis, including in cases where differential covariance or graphical model structure are of interest.
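As a rough illustration of the question this abstract poses, the following toy sketch compares how much of a covariance difference between two latent groups survives a PCA projection versus a random projection. It works directly with population parameters rather than samples. The setup is an assumption made for illustration and is not the paper's framework: two equally weighted groups with equal means, a single-direction covariance difference of size `a`, a Frobenius-norm measure of retained signal, and an orthonormalised Gaussian random projection. All function names are hypothetical.

```python
import numpy as np

def second_order_signal(W, sigma1, sigma2):
    """Frobenius norm of the covariance difference after projecting onto W's columns."""
    return np.linalg.norm(W.T @ (sigma1 - sigma2) @ W, ord="fro")

def pca_projection(sigma_pooled, k):
    """Top-k eigenvectors of the pooled covariance (no class labels needed)."""
    _, vecs = np.linalg.eigh(sigma_pooled)  # eigenvalues come back in ascending order
    return vecs[:, -k:]

def random_projection(p, k, rng):
    """Gaussian random matrix, orthonormalised so both projections share a scale."""
    q, _ = np.linalg.qr(rng.standard_normal((p, k)))
    return q

# Toy population parameters (assumed): the groups differ only along the first axis.
p, k, a = 20, 3, 4.0
e1 = np.zeros(p)
e1[0] = 1.0
sigma1 = np.eye(p) + a * np.outer(e1, e1)
sigma2 = np.eye(p)
sigma_pooled = 0.5 * (sigma1 + sigma2)  # equal mixing weights, equal means

rng = np.random.default_rng(0)
pca_signal = second_order_signal(pca_projection(sigma_pooled, k), sigma1, sigma2)
rp_signal = second_order_signal(random_projection(p, k, rng), sigma1, sigma2)
print(pca_signal, rp_signal)
```

In this particular construction the direction carrying the covariance difference is also the leading principal direction, so PCA retains the full difference `a`, while an orthonormal random projection retains at most `a`. Other parameter regimes can behave differently, which is the reason the abstract describes an enumeration over many regimes.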

preprint, 2022, arXiv

Scalable Regularised Joint Mixture Models

In many applications, data can be heterogeneous in the sense of spanning latent groups with different underlying distributions. When predictive models are applied to such data the heterogeneity can affect both predictive performance and interpretability. Building on developments at the intersection of unsupervised learning and regularised regression, we propose an approach for heterogeneous data that allows joint learning of (i) explicit multivariate feature distributions, (ii) high-dimensional regression models and (iii) latent group labels, with both (i) and (ii) specific to latent groups and both elements informing (iii). The approach is demonstrably effective in high dimensions, combining data reduction for computational efficiency with a re-weighting scheme that retains key signals even when the number of features is large. We discuss in detail these aspects and their impact on modelling and computation, including EM convergence. The approach is modular and allows incorporation of data reductions and high-dimensional estimators that are suitable for specific applications. We show results from extensive simulations and real data experiments, including highly non-Gaussian data. Our results allow efficient, effective analysis of high-dimensional data in settings, such as biomedicine, where both interpretable prediction and explicit feature space models are needed but hidden heterogeneity may be a concern.

preprint, 2020, arXiv

Evaluation of Causal Structure Learning Algorithms via Risk Estimation

Recent years have seen many advances in methods for causal structure learning from data. The empirical assessment of such methods, however, is much less developed. Motivated by this gap, we pose the following question: how can one assess, in a given problem setting, the practical efficacy of one or more causal structure learning methods? We formalize the problem in a decision-theoretic framework, via a notion of expected loss or risk for the causal setting. We introduce a theoretical notion of causal risk as well as sample quantities that can be computed from data, and study the relationship between the two, both theoretically and through an extensive simulation study. Our results provide an assumptions-light framework for assessing causal structure learning methods that can be applied in a range of practical use-cases.

preprint, 2020, arXiv

High-dimensional regression in practice: an empirical study of finite-sample prediction, variable selection and ranking

Penalized likelihood approaches are widely used for high-dimensional regression. Although many methods have been proposed and the associated theory is now well-developed, the relative efficacy of different approaches in finite-sample settings, as encountered in practice, remains incompletely understood. There is therefore a need for empirical investigations in this area that can offer practical insight and guidance to users. In this paper we present a large-scale comparison of penalized regression methods. We distinguish between three related goals: prediction, variable selection and variable ranking. Our results span more than 2,300 data-generating scenarios, including both synthetic and semi-synthetic data (real covariates and simulated responses), allowing us to systematically consider the influence of various factors (sample size, dimensionality, sparsity, signal strength and multicollinearity). We consider several widely-used approaches (Lasso, Adaptive Lasso, Elastic Net, Ridge Regression, SCAD, the Dantzig Selector and Stability Selection). We find considerable variation in performance between methods. Our results support a 'no panacea' view, with no unambiguous winner across all scenarios or goals, even in this restricted setting where all data align well with the assumptions underlying the methods. The study allows us to make some recommendations as to which approaches may be most (or least) suitable given the goal and some data characteristics. Our empirical results complement existing theory and provide a resource to compare methods across a range of scenarios and metrics.

preprint, 2016, arXiv

Discussion of "Causal inference using invariant prediction: identification and confidence intervals" by Peters, Bühlmann and Meinshausen

Contribution to the discussion of the paper "Causal inference using invariant prediction: identification and confidence intervals" by Peters, Bühlmann and Meinshausen, to appear in the Journal of the Royal Statistical Society, Series B.

preprint, 2015, arXiv

Inferring network structure from interventional time-course experiments

Graphical models are widely used to study biological networks. Interventions on network nodes are an important feature of many experimental designs for the study of biological networks. In this paper we put forward a causal variant of dynamic Bayesian networks (DBNs) for the purpose of modeling time-course data with interventions. The models inherit the simplicity and computational efficiency of DBNs but allow interventional data to be integrated into network inference. We show empirical results, on both simulated and experimental data, that demonstrate the need to appropriately handle interventions when interventions form part of the design.

preprint, 2014, arXiv

Estimating causal structure using conditional DAG models

This paper considers inference of causal structure in a class of graphical models called "conditional DAGs". These are directed acyclic graph (DAG) models with two kinds of variables, primary and secondary. The secondary variables are used to aid in estimation of causal relationships between the primary variables. We give causal semantics for this model class and prove that, under certain assumptions, the direction of causal influence is identifiable from the joint observational distribution of the primary and secondary variables. A score-based approach is developed for estimation of causal structure using these models and consistency results are established. Empirical results demonstrate gains compared with formulations that treat all variables on an equal footing, or that ignore secondary variables. The methodology is motivated by applications in molecular biology and is illustrated here using simulated data and in an analysis of proteomic data from the Cancer Genome Atlas.

preprint, 2014, arXiv

Exact Estimation of Multiple Directed Acyclic Graphs

This paper considers the problem of estimating the structure of multiple related directed acyclic graph (DAG) models. Building on recent developments in exact estimation of DAGs using integer linear programming (ILP), we present an ILP approach for joint estimation over multiple DAGs, that does not require that the vertices in each DAG share a common ordering. Furthermore, we allow also for (potentially unknown) dependency structure between the DAGs. Results are presented on both simulated data and fMRI data obtained from multiple subjects.

preprint, 2014, arXiv

Joint estimation of multiple related biological networks

Graphical models are widely used to make inferences concerning interplay in multivariate systems. In many applications, data are collected from multiple related but nonidentical units whose underlying networks may differ but are likely to share features. Here we present a hierarchical Bayesian formulation for joint estimation of multiple networks in this nonidentically distributed setting. The approach is general: given a suitable class of graphical models, it uses an exchangeability assumption on networks to provide a corresponding joint formulation. Motivated by emerging experimental designs in molecular biology, we focus on time-course data with interventions, using dynamic Bayesian networks as the graphical models. We introduce a computationally efficient, deterministic algorithm for exact joint inference in this setting. We provide an upper bound on the gains that joint estimation offers relative to separate estimation for each network and empirical results that support and extend the theory, including an extensive simulation study and an application to proteomic data from human cancer cell lines. Finally, we describe approximations that are still more computationally efficient than the exact algorithm and that also demonstrate good empirical performance.

preprint, 2014, arXiv

Joint Structure Learning of Multiple Non-Exchangeable Networks

Several methods have recently been developed for joint structure learning of multiple (related) graphical models or networks. These methods treat individual networks as exchangeable, such that each pair of networks is equally encouraged to have similar structures. However, in many practical applications, exchangeability in this sense may not hold, as some pairs of networks may be more closely related than others, for example due to group and sub-group structure in the data. Here we present a novel Bayesian formulation that generalises joint structure learning beyond the exchangeable case. In addition to a general framework for joint learning, we (i) provide a novel default prior over the joint structure space that requires no user input; (ii) allow for latent networks; (iii) give an efficient, exact algorithm for the case of time series data and dynamic Bayesian networks. We present empirical results on non-exchangeable populations, including a real data example from biology, where cell-line-specific networks are related according to genomic features.

preprint, 2014, arXiv

Penalized estimation in high-dimensional hidden Markov models with state-specific graphical models

We consider penalized estimation in hidden Markov models (HMMs) with multivariate Normal observations. In the moderate-to-large dimensional setting, estimation for HMMs remains challenging in practice, due to several concerns arising from the hidden nature of the states. We address these concerns by $\ell_1$-penalization of state-specific inverse covariance matrices. Penalized estimation leads to sparse inverse covariance matrices which can be interpreted as state-specific conditional independence graphs. Penalization is nontrivial in this latent variable setting; we propose a penalty that automatically adapts to the number of states $K$ and the state-specific sample sizes and can cope with scaling issues arising from the unknown states. The methodology is adaptive and very general, applying in particular to both low- and high-dimensional settings without requiring hand tuning. Furthermore, our approach facilitates exploration of the number of states $K$ by coupling estimation for successive candidate values $K$. Empirical results on simulated examples demonstrate the effectiveness of the proposed approach. In a challenging real data example from genome biology, we demonstrate the ability of our approach to yield gains in predictive power and to deliver richer estimates than existing methods.
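The abstract's remark that a sparse inverse covariance matrix can be read as a conditional independence graph can be made concrete with a small sketch. Under a multivariate Normal model, a zero off-diagonal entry of the inverse covariance (precision) matrix corresponds to conditional independence of the two variables given the rest. The matrix below is a made-up three-variable example and `precision_to_edges` is a hypothetical helper. Neither is taken from the paper, and the sketch does not perform any penalized estimation.

```python
import numpy as np

def precision_to_edges(theta, tol=1e-8):
    """Undirected edges (i, j), i < j, where the precision matrix entry is nonzero."""
    p = theta.shape[0]
    return {(i, j) for i in range(p) for j in range(i + 1, p) if abs(theta[i, j]) > tol}

# Made-up precision matrix for a chain 0 - 1 - 2 (assumed for illustration).
theta = np.array([
    [2.0, -1.0, 0.0],
    [-1.0, 2.0, -1.0],
    [0.0, -1.0, 2.0],
])

edges = precision_to_edges(theta)
sigma = np.linalg.inv(theta)  # the implied covariance matrix is dense

print(edges)
print(sigma[0, 2])
```

Variables 0 and 2 have no edge in the graph even though their covariance is nonzero, which is why sparsity is sought in the inverse covariance rather than in the covariance itself.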

preprint, 2013, arXiv

Network Inference Using Steady State Data and Goldbeter-Koshland Kinetics

Network inference approaches are widely used to shed light on regulatory interplay between molecular players such as genes and proteins. Biochemical processes underlying networks of interest (e.g. gene regulatory or protein signalling networks) are generally nonlinear. In many settings, knowledge is available concerning relevant chemical kinetics. However, existing network inference methods for continuous, steady-state data are typically rooted in statistical formulations, which do not exploit chemical kinetics to guide inference. Herein, we present an approach to network inference for steady-state data that is rooted in non-linear descriptions of biochemical mechanism. We use equilibrium analysis of chemical kinetics to obtain functional forms that are in turn used to infer networks using steady-state data. The approach we propose is directly applicable to conventional steady-state gene expression or proteomic data and does not require knowledge of either network topology or any kinetic parameters. We illustrate the approach in the context of protein phosphorylation networks, using data simulated from a recent mechanistic model and proteomic data from cancer cell lines. In the former, the true network is known and used for assessment, whereas in the latter, results are compared against known biochemistry. We find that the proposed methodology is more effective at estimating network topology than methods based on linear models.

preprint, 2013, arXiv

Network-based clustering with mixtures of L1-penalized Gaussian graphical models: an empirical investigation

In many applications, multivariate samples may harbor previously unrecognized heterogeneity at the level of conditional independence or network structure. For example, in cancer biology, disease subtypes may differ with respect to subtype-specific interplay between molecular components. Then, both subtype discovery and estimation of subtype-specific networks present important and related challenges. To enable such analyses, we put forward a mixture model whose components are sparse Gaussian graphical models. This brings together model-based clustering and graphical modeling to permit simultaneous estimation of cluster assignments and cluster-specific networks. We carry out estimation within an L1-penalized framework, and investigate several specific penalization regimes. We present empirical results on simulated data and provide general recommendations for the formulation and use of mixtures of L1-penalized Gaussian graphical models.

preprint, 2013, arXiv

Network-based multivariate gene-set testing

The identification of predefined groups of genes ("gene-sets") which are differentially expressed between two conditions ("gene-set analysis", or GSA) is a very popular analysis in bioinformatics. GSA incorporates biological knowledge by aggregating over genes that are believed to be functionally related. This can enhance statistical power over analyses that consider only one gene at a time. However, currently available GSA approaches are all based on univariate two-sample comparison of single genes. This means that they cannot test for differences in covariance structure between the two conditions. Yet interplay between genes is a central aspect of biological investigation and it is likely that such interplay may differ between conditions. This paper proposes a novel approach for gene-set analysis that allows for truly multivariate hypotheses, in particular differences in gene-gene networks between conditions. Testing hypotheses concerning networks is challenging due to the nature of the underlying estimation problem. Our starting point is a recent, general approach for high-dimensional two-sample testing. We refine the approach and show how it can be used to perform multivariate, network-based gene-set testing. We validate the approach in simulated examples and show results using high-throughput data from several studies in cancer biology.

preprint, 2013, arXiv

Two-Sample Testing in High-Dimensional Models

We propose novel methodology for testing equality of model parameters between two high-dimensional populations. The technique is very general and applicable to a wide range of models. The method is based on sample splitting: the data is split into two parts; on the first part we reduce the dimensionality of the model to a manageable size; on the second part we perform significance testing (p-value calculation) based on a restricted likelihood ratio statistic. Assuming that both populations arise from the same distribution, we show that the restricted likelihood ratio statistic is asymptotically distributed as a weighted sum of chi-squares with weights which can be efficiently estimated from the data. In high-dimensional problems, a single data split can result in a "p-value lottery". To ameliorate this effect, we iterate the splitting process and aggregate the resulting p-values. This multi-split approach provides improved p-values. We illustrate the use of our general approach in two-sample comparisons of high-dimensional regression models ("differential regression") and graphical models ("differential network"). In both cases we show results on simulated data as well as real data from recent, high-throughput cancer studies.
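The multi-split step described in this abstract, where p-values from repeated data splits are aggregated, can be sketched as follows. The abstract does not say which aggregation rule is used. The quantile rule below (with `gamma=0.5`, twice the median p-value, capped at 1) is a common choice from the multi-sample-splitting literature and is an assumption here. The function name and the example p-values are hypothetical.

```python
import numpy as np

def aggregate_pvalues(pvals, gamma=0.5):
    """Quantile aggregation of per-split p-values.

    Returns min(1, gamma-quantile of {p_b / gamma}); gamma=0.5 gives
    twice the median p-value, capped at 1.
    """
    pvals = np.asarray(pvals, dtype=float)
    return float(min(1.0, np.quantile(pvals / gamma, gamma)))

# Hypothetical p-values obtained from three different random splits.
split_pvalues = [0.01, 0.02, 0.03]
aggregated = aggregate_pvalues(split_pvalues)
print(aggregated)
```

Aggregating in this way trades some power for stability: the reported p-value no longer depends on one arbitrary split of the data.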

preprint, 2012, arXiv

Network inference and biological dynamics

Network inference approaches are now widely used in biological applications to probe regulatory relationships between molecular components such as genes or proteins. Many methods have been proposed for this setting, but the connections and differences between their statistical formulations have received less attention. In this paper, we show how a broad class of statistical network inference methods, including a number of existing approaches, can be described in terms of variable selection for the linear model. This reveals some subtle but important differences between the methods, including the treatment of time intervals in discretely observed data. In developing a general formulation, we also explore the relationship between single-cell stochastic dynamics and network inference on averages over cells. This clarifies the link between biochemical networks as they operate at the cellular level and network inference as carried out on data that are averages over populations of cells. We present empirical results, comparing thirty-two network inference methods that are instances of the general formulation we describe, using two published dynamical models. Our investigation sheds light on the applicability and limitations of network inference and provides guidance for practitioners and suggestions for experimental design.

preprint, 2012, arXiv

On the relationship between ODEs and DBNs

Recently, Li et al. (Bioinformatics 27(19), 2686-91, 2011) proposed a method, called Differential Equation-based Local Dynamic Bayesian Network (DELDBN), for reverse engineering gene regulatory networks from time-course data. We commend the authors for an interesting paper that draws attention to the close relationship between dynamic Bayesian networks (DBNs) and differential equations (DEs). Their central claim is that modifying a DBN to model Euler approximations to the gradient rather than expression levels themselves is beneficial for network inference. The empirical evidence provided is based on time-course data with equally-spaced observations. However, as we discuss below, in the particular case of equally-spaced observations, Euler approximations and conventional DBNs lead to equivalent statistical models that, absent artefacts due to the estimation procedure, yield networks with identical inter-gene edge sets. Here, we discuss further the relationship between DEs and conventional DBNs and present new empirical results on unequally spaced data which demonstrate that modelling Euler approximations in a DBN can lead to improved network reconstruction.
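The equivalence claimed for equally spaced observations can be checked numerically in a simplified setting. The sketch below assumes a linear model fitted by ordinary least squares with no variable selection, which is simpler than the estimation procedures discussed in the paper. The toy data, the step size `dt` and all variable names are illustrative. Regressing the Euler gradient approximation on the current state gives coefficients equal to those from regressing the next state on the current state, minus the identity, divided by `dt`. The off-diagonal (inter-gene) coefficients therefore differ only by a constant factor.

```python
import numpy as np

rng = np.random.default_rng(1)
T, p, dt = 50, 4, 0.5  # time points, genes, equal spacing (all assumed)

# Toy time series; any full-rank data would do for this algebraic check.
X = rng.standard_normal((T, p)).cumsum(axis=0)
X0, X1 = X[:-1], X[1:]

# Conventional formulation: regress x(t + dt) on x(t).
C_level, *_ = np.linalg.lstsq(X0, X1, rcond=None)

# Euler formulation: regress (x(t + dt) - x(t)) / dt on x(t).
C_grad, *_ = np.linalg.lstsq(X0, (X1 - X0) / dt, rcond=None)

off_diag = ~np.eye(p, dtype=bool)
print(np.allclose(C_grad, (C_level - np.eye(p)) / dt))
print(np.allclose(C_grad[off_diag], C_level[off_diag] / dt))
```

With unequal spacing the divisor varies from row to row, so the two regressions no longer differ by a fixed linear map, which is the setting in which the abstract reports that the Euler formulation can help.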
