Source author record

Sumanta Basu

Sumanta Basu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Methodology q-fin.ST Machine Learning math.ST Statistics Theory Computation Genomics q-fin.CP q-fin.TR

Catalog footprint

What is connected

10works

9topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

LARGE: A Locally Adaptive Regularization Approach for Estimating Gaussian Graphical Models

The graphical Lasso (GLASSO) is a widely used algorithm for learning high-dimensional undirected Gaussian graphical models (GGM). Given i.i.d. observations from a multivariate normal distribution, GLASSO estimates the precision matrix by maximizing the log-likelihood with an \ell_1-penalty on the off-diagonal entries. However, selecting an optimal regularization parameter λin this unsupervised setting remains a significant challenge. A well-known issue is that existing methods, such as out-of-sample likelihood maximization, select a single global λand do not account for heterogeneity in variable scaling or partial variances. Standardizing the data to unit variances, although a common workaround, has been shown to negatively affect graph recovery. Addressing the problem of nodewise adaptive tuning in graph estimation is crucial for applications like computational neuroscience, where brain networks are constructed from highly heterogeneous, region-specific fMRI data. In this work, we develop Locally Adaptive Regularization for Graph Estimation (LARGE), an approach to adaptively learn nodewise tuning parameters to improve graph estimation and selection. In each block coordinate descent step of GLASSO, we augment the nodewise Lasso regression to jointly estimate the regression coefficients and error variance, which in turn guides the adaptive learning of nodewise penalties. In simulations, LARGE consistently outperforms benchmark methods in graph recovery, demonstrates greater stability across replications, and achieves the best estimation accuracy in the most difficult simulation settings. We demonstrate the practical utility of our method by estimating brain functional connectivity from a real fMRI data set.

preprint2022arXiv

Exploring Financial Networks Using Quantile Regression and Granger Causality

In the post-crisis era, financial regulators and policymakers are increasingly interested in data-driven tools to measure systemic risk and to identify systemically important firms. Granger Causality (GC) based techniques to build networks among financial firms using time series of their stock returns have received significant attention in recent years. Existing GC network methods model conditional means, and do not distinguish between connectivity in lower and upper tails of the return distribution - an aspect crucial for systemic risk analysis. We propose statistical methods that measure connectivity in the financial sector using system-wide tail-based analysis and is able to distinguish between connectivity in lower and upper tails of the return distribution. This is achieved using bivariate and multivariate GC analysis based on regular and Lasso penalized quantile regressions, an approach we call quantile Granger causality (QGC). By considering centrality measures of these financial networks, we can assess the build-up of systemic risk and identify risk propagation channels. We provide an asymptotic theory of QGC estimators under a quantile vector autoregressive model, and show its benefit over regular GC analysis on simulated data. We apply our method to the monthly stock returns of large U.S. firms and demonstrate that lower tail based networks can detect systemically risky periods in historical data with higher accuracy than mean-based networks. In a similar analysis of large Indian banks, we find that upper and lower tail networks convey different information and have the potential to distinguish between periods of high connectivity that are governed by positive vs negative news in the market.

preprint2022arXiv

Graphical models for nonstationary time series

We propose NonStGM, a general nonparametric graphical modeling framework for studying dynamic associations among the components of a nonstationary multivariate time series. It builds on the framework of Gaussian Graphical Models (GGM) and stationary time series Gaussian Graphical model (StGM), and complements existing works on parametric graphical models based on change point vector autoregressions (VAR). Analogous to StGM, the proposed framework captures conditional noncorrelations (both intertemporal and contemporaneous) in the form of an undirected graph. In addition, to describe the more nuanced nonstationary relationships among the components of the time series, we introduce the new notion of conditional nonstationarity/stationarity and incorporate it within the graph architecture. This allows one to distinguish between direct and indirect nonstationary relationships among system components, and can be used to search for small subnetworks that serve as the "source" of nonstationarity in a large system. Together, the two concepts of conditional noncorrelation and nonstationarity/stationarity provide a parsimonious description of the dependence structure of the time series.

preprint2022arXiv

Learning Financial Networks with High-frequency Trade Data

Financial networks are typically estimated by applying standard time series analyses to price-based economic variables collected at low-frequency (e.g., daily or monthly stock returns or realized volatility). These networks are used for risk monitoring and for studying information flows in financial markets. High-frequency intraday trade data sets may provide additional insights into network linkages by leveraging high-resolution information. However, such data sets pose significant modeling challenges due to their asynchronous nature, nonlinear dynamics, and nonstationarity. To tackle these challenges, we estimate financial networks using random forests. The edges in our network are determined by using microstructure measures of one firm to forecast the sign of the change in a market measure (either realized volatility or returns kurtosis) of another firm. We first investigate the evolution of network connectivity in the period leading up to the U.S. financial crisis of 2007-09. We find that the networks have the highest density in 2007, with high degree connectivity associated with Lehman Brothers in 2006. A second analysis into the nature of linkages among firms suggests that larger firms tend to offer better predictive power than smaller firms, a finding qualitatively consistent with prior works in the market microstructure literature.

preprint2022arXiv

Modeling Multivariate Positive-Valued Time Series Using R-INLA

In this paper we describe fast Bayesian statistical analysis of vector positive-valued time series, with application to interesting financial data streams. We discuss a flexible level correlated model (LCM) framework for building hierarchical models for vector positive-valued time series. The LCM allows us to combine marginal gamma distributions for the positive-valued component responses, while accounting for association among the components at a latent level. We use integrated nested Laplace approximation (INLA) for fast approximate Bayesian modeling via the R-INLA package, building custom functions to handle this setup. We use the proposed method to model interdependencies between realized volatility measures from several stock indexes.

preprint2021arXiv

An empirical Bayes approach to estimating dynamic models of co-regulated gene expression

Time-course gene expression datasets provide insight into the dynamics of complex biological processes, such as immune response and organ development. It is of interest to identify genes with similar temporal expression patterns because such genes are often biologically related. However, this task is challenging due to the high dimensionality of these datasets and the nonlinearity of gene expression time dynamics. We propose an empirical Bayes approach to estimating ordinary differential equation (ODE) models of gene expression, from which we derive a similarity metric between genes called the Bayesian lead-lag $R^2$ (LLR2). Importantly, the calculation of the LLR2 leverages biological databases that document known interactions amongst genes; this information is automatically used to define informative prior distributions on the ODE model's parameters. As a result, the LLR2 is a biologically-informed metric that can be used to identify clusters or networks of functionally-related genes with co-moving or time-delayed expression patterns. We then derive data-driven shrinkage parameters from Stein's unbiased risk estimate that optimally balance the ODE model's fit to both data and external biological information. Using real gene expression data, we demonstrate that our methodology allows us to recover interpretable gene clusters and sparse networks. These results reveal new insights about the dynamics of biological systems.

preprint2017arXiv

Iterative Random Forests to detect predictive and stable high-order interactions

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on Random Forests (RF), Random Intersection Trees (RITs), and through extensive, biologically inspired simulations, we developed the iterative Random Forest algorithm (iRF). iRF trains a feature-weighted ensemble of decision trees to detect stable, high-order interactions with same order of computational cost as RF. We demonstrate the utility of iRF for high-order interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and alternative splicing of primary transcripts in human derived cell lines. In Drosophila, among the 20 pairwise transcription factor interactions iRF identifies as stable (returned in more than half of bootstrap replicates), 80% have been previously reported as physical interactions. Moreover, novel third-order interactions, e.g. between Zelda (Zld), Giant (Gt), and Twist (Twi), suggest high-order relationships that are candidates for follow-up experiments. In human-derived cells, iRF re-discovered a central role of H3K36me3 in chromatin-mediated splicing regulation, and identified novel 5th and 6th order interactions, indicative of multi-valent nucleosomes with specific roles in splicing regulation. By decoupling the order of interactions from the computational cost of identification, iRF opens new avenues of inquiry into the molecular mechanisms underlying genome biology.

preprint2016arXiv

Penalized Maximum Likelihood Estimation of Multi-layered Gaussian Graphical Models

Analyzing multi-layered graphical models provides insight into understanding the conditional relationships among nodes within layers after adjusting for and quantifying the effects of nodes from other layers. We obtain the penalized maximum likelihood estimator for Gaussian multi-layered graphical models, based on a computational approach involving screening of variables, iterative estimation of the directed edges between layers and undirected edges within layers and a final refitting and stability selection step that provides improved performance in finite sample settings. We establish the consistency of the estimator in a high-dimensional setting. To obtain this result, we develop a strategy that leverages the biconvexity of the likelihood function to ensure convergence of the developed iterative algorithm to a stationary point, as well as careful uniform error control of the estimates over iterations. The performance of the maximum likelihood estimator is illustrated on synthetic data.

preprint2015arXiv

Regularized estimation in sparse high-dimensional time series models

Many scientific and economic problems involve the analysis of high-dimensional time series datasets. However, theoretical studies in high-dimensional statistics to date rely primarily on the assumption of independent and identically distributed (i.i.d.) samples. In this work, we focus on stable Gaussian processes and investigate the theoretical properties of $\ell _1$-regularized estimates in two important statistical problems in the context of high-dimensional time series: (a) stochastic regression with serially correlated errors and (b) transition matrix estimation in vector autoregressive (VAR) models. We derive nonasymptotic upper bounds on the estimation errors of the regularized estimates and establish that consistent estimation under high-dimensional scaling is possible via $\ell_1$-regularization for a large class of stable processes under sparsity constraints. A key technical contribution of the work is to introduce a measure of stability for stationary processes using their spectral properties that provides insight into the effect of dependence on the accuracy of the regularized estimates. With this proposed stability measure, we establish some useful deviation bounds for dependent data, which can be used to study several important regularized estimates in a time series setting.

preprint2013arXiv

Network Granger Causality with Inherent Grouping Structure

The problem of estimating high-dimensional network models arises naturally in the analysis of many physical, biological and socio-economic systems. Examples include stock price fluctuations in financial markets and gene regulatory networks representing effects of regulators (transcription factors) on regulated genes in genetics. We aim to learn the structure of the network over time employing the framework of Granger causal models under the assumptions of sparsity of its edges and inherent grouping structure among its nodes. We introduce a thresholded variant of the Group Lasso estimator for discovering Granger causal interactions among the nodes of the network. Asymptotic results on the consistency of the new estimation procedure are developed. The performance of the proposed methodology is assessed through an extensive set of simulation studies and comparisons with existing techniques.

Sumanta Basu

What is connected

Connect this record

See the researcher in context

Building this map preview

10 published item(s)

LARGE: A Locally Adaptive Regularization Approach for Estimating Gaussian Graphical Models

Exploring Financial Networks Using Quantile Regression and Granger Causality

Graphical models for nonstationary time series

Learning Financial Networks with High-frequency Trade Data

Modeling Multivariate Positive-Valued Time Series Using R-INLA

An empirical Bayes approach to estimating dynamic models of co-regulated gene expression

Iterative Random Forests to detect predictive and stable high-order interactions

Penalized Maximum Likelihood Estimation of Multi-layered Gaussian Graphical Models

Regularized estimation in sparse high-dimensional time series models

Network Granger Causality with Inherent Grouping Structure