Source author record

Tanujit Chakraborty

Tanujit Chakraborty appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Applications, Machine Learning, math.ST, Statistics Theory, Artificial Intelligence, Computation, Databases, econ.EM, Methodology, Quantitative Methods

Catalog footprint

What is connected

10 works

10 topics

4 close collaborators

preprint, 2026, arXiv

EpiCastBench: Datasets and Benchmarks for Multivariate Epidemic Forecasting

The increasing adoption of data-driven decision-making in public health has established epidemic forecasting as a critical area of research. Recent advances in multivariate forecasting models better capture complex temporal dependencies than conventional univariate approaches, which model individual series independently. Despite this potential, the development of robust epidemic forecasting methods is constrained by the lack of high-quality benchmarks comprising diverse multivariate datasets across infectious diseases and geographical regions. To address this gap, we present EpiCastBench, a large-scale benchmarking framework featuring 40 curated (correlated) multivariate epidemic datasets. These publicly available datasets span a wide range of infectious diseases and exhibit diverse characteristics in terms of temporal granularity, series length, and sparsity. We analyze these datasets to identify their global features and structural patterns. To ensure reproducibility and fair comparison, we establish standardized evaluation settings, including a unified forecasting horizon, consistent preprocessing pipelines, diverse performance metrics, and statistical significance testing. By leveraging this framework, we conduct a comprehensive evaluation of 15 multivariate forecasting models spanning statistical baselines to state-of-the-art deep learning and foundation models. All datasets and code are publicly available on Kaggle (https://www.kaggle.com/datasets/aimltsf/epicastbench) and GitHub (https://github.com/aimltsf/EpiCastBench).

preprint, 2026, arXiv

Graph Convolutional Support Vector Regression for Robust Spatiotemporal Forecasting of Urban Air Pollution

Urban air quality forecasting is challenging because pollutant concentrations are nonlinear, nonstationary, spatiotemporally dependent, and often affected by anomalous observations caused by traffic congestion, industrial emissions, and seasonal meteorological variability. This study proposes a Graph Convolutional Support Vector Regression (GCSVR) framework for robust spatiotemporal forecasting of urban air pollution. The model combines graph convolutional learning to capture inter-station spatial dependence with support vector regression to model nonlinear temporal dynamics while reducing sensitivity to outlier observations. The proposed framework is evaluated using air quality records from 37 monitoring stations in Delhi and 18 stations in Mumbai, representing inland and coastal metropolitan environments in India. Forecasting performance is assessed across multiple horizons and compared with established temporal and spatiotemporal benchmarks. The results show that GCSVR consistently improves predictive accuracy and maintains stable performance across seasons and outlier-prone pollution episodes. Statistical testing further confirms the reliability of the proposed approach across the two cities. Finally, conformal prediction is integrated with GCSVR to generate calibrated prediction intervals, enhancing its practical value for uncertainty-aware air quality monitoring and public health decision-making.
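The abstract does not say which conformal variant is paired with GCSVR, so the following is only a minimal sketch of one common choice, split conformal prediction with absolute-residual scores, wrapped around an arbitrary point forecaster. The function name `split_conformal_interval` and all numbers are illustrative assumptions, not taken from the paper.

```python
import math

def split_conformal_interval(cal_true, cal_pred, new_pred, alpha=0.1):
    """Return (lower, upper) around one new point forecast.

    cal_true / cal_pred: held-out calibration observations and the model's
    point forecasts for them. alpha is the target miscoverage rate.
    Assumption: split conformal with absolute residuals; the paper may
    use a different conformal procedure.
    """
    scores = sorted(abs(y - p) for y, p in zip(cal_true, cal_pred))
    n = len(scores)
    # Finite-sample rank: the ceil((n + 1) * (1 - alpha))-th smallest score.
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: no finite guarantee.
        return (-math.inf, math.inf)
    q = scores[rank - 1]
    return (new_pred - q, new_pred + q)

# Hypothetical pollutant readings and forecasts (made-up values).
cal_true = [52.0, 61.0, 48.0, 75.0, 66.0, 58.0, 90.0, 43.0, 70.0]
cal_pred = [50.0, 64.0, 49.0, 70.0, 66.5, 55.0, 80.0, 45.0, 71.0]
lo, hi = split_conformal_interval(cal_true, cal_pred, new_pred=60.0, alpha=0.25)
print(lo, hi)
```

The interval width is set by a single calibration quantile, so it is the same for every new forecast; the coverage guarantee of this variant relies on exchangeability between calibration and test points, which time series data only approximately satisfy.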

preprint, 2026, arXiv

MDAS: A Diagnostic Approach to Assess the Quality of Data Splitting in Machine Learning

In the field of machine learning, model performance is usually assessed by randomly splitting data into training and test sets. Different random splits, however, can yield markedly different performance estimates, so a genuinely good model may be discarded or a poor one selected purely due to an unlucky partition. This motivates a principled way to diagnose the quality of a given data split. We propose a diagnostic framework based on a new discrepancy measure, the Mahalanobis Distribution Alignment Score (MDAS). MDAS is a symmetric dissimilarity measure between two multivariate samples, rather than a strict metric. MDAS captures both mean and covariance differences and is affine invariant. Building on this, we construct a Monte Carlo test that evaluates whether an observed split is statistically compatible with typical random splits, yielding an interpretable p-value for split quality. Using several real data sets, we study the relationship between MDAS and model robustness, including its association with the normalized Akaike information criterion. Finally, we apply MDAS to compare existing state-of-the-art deterministic data-splitting strategies with standard random splitting. The experimental results show that MDAS provides a simple, model-agnostic tool for auditing data splits and improving the reliability of empirical model evaluation.
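The Monte Carlo test described above can be sketched as follows. The abstract does not give the MDAS formula, so `toy_discrepancy` is a hypothetical stand-in (a Mahalanobis distance between means plus a covariance mismatch term), and `split_p_value`, the sample sizes, and the number of draws are likewise assumptions made for illustration.

```python
import numpy as np

def toy_discrepancy(a, b):
    """Symmetric mean-and-covariance discrepancy between two samples.

    NOT the paper's MDAS: a hypothetical stand-in used only to show the
    shape of the test. Expects 2-D arrays with at least two columns.
    """
    pooled = np.cov(np.vstack([a, b]), rowvar=False)
    inv = np.linalg.pinv(pooled)
    diff = a.mean(axis=0) - b.mean(axis=0)
    mean_term = float(diff @ inv @ diff)  # squared Mahalanobis distance
    cov_gap = np.cov(a, rowvar=False) - np.cov(b, rowvar=False)
    cov_term = float(np.linalg.norm(inv @ cov_gap, ord="fro"))
    return mean_term + cov_term

def split_p_value(X, test_idx, n_draws=500, seed=0):
    """Share of random splits at least as discrepant as the observed one."""
    rng = np.random.default_rng(seed)
    n, n_test = len(X), len(test_idx)
    mask = np.zeros(n, dtype=bool)
    mask[test_idx] = True
    observed = toy_discrepancy(X[~mask], X[mask])
    exceed = 0
    for _ in range(n_draws):
        perm = rng.permutation(n)
        draw = toy_discrepancy(X[perm[n_test:]], X[perm[:n_test]])
        exceed += int(draw >= observed)
    # Add-one correction keeps the Monte Carlo p-value strictly positive.
    return (exceed + 1) / (n_draws + 1)

# Synthetic data: one ordinary random split and one deliberately skewed split.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
random_test = rng.choice(200, size=40, replace=False)
skewed_test = np.argsort(X[:, 0])[-40:]  # the 40 largest values of feature 0
p_random = split_p_value(X, random_test)
p_skewed = split_p_value(X, skewed_test)
print(p_random, p_skewed)
```

A small p-value flags a split whose training and test parts differ more than typical random splits of the same size, which is the auditing use the abstract describes.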

preprint, 2026, arXiv

Probabilistic Forecasting of Climate Policy Uncertainty: The Role of Macro-financial Variables and Google Search Data

Accurately forecasting Climate Policy Uncertainty (CPU) is essential for designing climate strategies that balance economic growth with environmental objectives. Elevated CPU levels can delay regulatory implementation, hinder investment in green technologies, and amplify public resistance to policy reforms, particularly during periods of economic stress. Despite the growing literature documenting the economic relevance of CPU, forecasting its evolution and understanding the role of macro-financial drivers in shaping its fluctuations have not been explored. This study addresses this gap by presenting the first effort to forecast CPU and identify its key drivers. We employ various statistical tools to identify macro-financial exogenous drivers, alongside Google search data to capture early public attention to climate policy. Local projection impulse response analysis quantifies the dynamic effects of these variables, revealing that household financial vulnerability, housing market activity, business confidence, credit conditions, and financial market sentiment exert the most substantial impacts. These predictors are incorporated into a Bayesian Structural Time Series (BSTS) framework to produce probabilistic forecasts for both US and Global CPU indices. Extensive experiments and statistical validation demonstrate that BSTS with time-invariant regression coefficients achieves superior forecasting performance. We demonstrate that this performance stems from its variable selection mechanism, which identifies exogenous predictors that are empirically significant and theoretically grounded, as confirmed by the feature importance analysis. From a policy perspective, the findings underscore the importance of adaptive climate policies that remain effective across shifting economic conditions while supporting long-term environmental and growth objectives.

preprint, 2020, arXiv

Modified Lomax Model: A heavy-tailed distribution for fitting large-scale real-world complex networks

Real-world networks are generally claimed to be scale-free, meaning that the degree distributions follow the classical power-law, at least asymptotically. Yet, closer observation shows that the classical power-law distribution is often inadequate to meet the data characteristics due to the existence of a clearly identifiable non-linearity in the entire degree distribution in the log-log scale. The present paper proposes a new variant of the popular heavy-tailed Lomax distribution, which we name the Modified Lomax (MLM) distribution, that can efficiently capture the crucial aspect of heavy-tailed behavior of the entire degree distribution of real-world complex networks. The proposed MLM model, derived from a hierarchical family of Lomax distributions, can efficiently fit the entire degree distribution of real-world networks without removing lower degree nodes, as opposed to the classical power-law-based fitting. The MLM distribution belongs to the maximum domain of attraction of the Fréchet distribution and is right tail equivalent to the Pareto distribution. Various statistical properties including characteristics of the maximum likelihood estimates and asymptotic distributions have also been derived for the proposed MLM model. Finally, the effectiveness of the proposed MLM model is demonstrated through rigorous experiments over fifty real-world complex networks from diverse applied domains.
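To illustrate the log-log non-linearity mentioned above, here is a small sketch using the standard Lomax (Pareto type II) distribution, whose survival function is S(x) = (1 + x/λ)^(-α). This is the classical Lomax, not the paper's Modified Lomax, whose form the abstract does not state; the parameter values are arbitrary assumptions.

```python
import math

def lomax_log_survival(x, alpha, lam):
    """log P(X > x) for the standard Lomax distribution."""
    return -alpha * math.log1p(x / lam)

def local_loglog_slope(x, alpha, lam):
    """Slope of log S(x) against log x, i.e. -alpha * x / (lam + x)."""
    return -alpha * x / (lam + x)

alpha, lam = 2.5, 10.0  # illustrative shape and scale, not fitted values
for degree in (1, 10, 100, 10_000):
    slope = local_loglog_slope(degree, alpha, lam)
    print(degree, round(slope, 3))
```

The slope is shallow at low degrees and approaches -α only in the far tail, so a straight power-law line fits just the upper end of such a distribution, while a Lomax-type curve can follow the whole range.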

preprint, 2020, arXiv

Real-time forecasts and risk assessment of novel coronavirus (COVID-19) cases: A data-driven analysis

The coronavirus disease 2019 (COVID-19) has become a public health emergency of international concern affecting 201 countries and territories around the globe. As of April 4, 2020, it has caused a pandemic outbreak with more than 1,116,643 confirmed infections and more than 59,170 reported deaths worldwide. The main focus of this paper is two-fold: (a) generating short-term (real-time) forecasts of the future COVID-19 cases for multiple countries; (b) risk assessment (in terms of case fatality rate) of the novel COVID-19 for some profoundly affected countries by finding various important demographic characteristics of the countries along with some disease characteristics. To solve the first problem, we presented a hybrid approach based on an autoregressive integrated moving average (ARIMA) model and a wavelet-based forecasting model that can generate short-term (ten days ahead) forecasts of the number of daily confirmed cases for Canada, France, India, South Korea, and the UK. The predictions of the future outbreak for different countries will be useful for the effective allocation of health care resources and will act as an early-warning system for government policymakers. In the second problem, we applied an optimal regression tree algorithm to find essential causal variables that significantly affect the case fatality rates for different countries. This data-driven analysis will necessarily provide deep insights into the study of early risk assessments for 50 immensely affected countries.
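As a worked example of the risk measure named above, the crude case fatality rate is reported deaths divided by confirmed cases. The sketch below applies it to the worldwide counts quoted in the abstract; both are stated as lower bounds ("more than"), so the result is only a rough figure, and the helper name `case_fatality_rate` is an assumption.

```python
def case_fatality_rate(deaths, confirmed):
    """Crude case fatality rate: reported deaths over confirmed cases."""
    if confirmed <= 0:
        raise ValueError("confirmed must be positive")
    return deaths / confirmed

# Worldwide counts as of April 4, 2020, as quoted in the abstract
# (the confirmed count is 11,16,643 in Indian digit grouping).
confirmed = 1_116_643
deaths = 59_170
cfr = case_fatality_rate(deaths, confirmed)
print(f"{cfr:.2%}")  # prints 5.30%
```

This crude ratio ignores reporting delays and undetected infections, which is one reason country-level rates differ and why the paper relates them to demographic and disease characteristics.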

preprint, 2018, arXiv

A Nonparametric Ensemble Binary Classifier and its Statistical Properties

In this work, we propose an ensemble of classification trees (CT) and artificial neural networks (ANN). Several statistical properties, including universal consistency and an upper bound on an important parameter of the proposed classifier, are shown. Numerical evidence is also provided using various real-life data sets to assess the performance of the model. Our proposed nonparametric ensemble classifier doesn't suffer from the 'curse of dimensionality' and can be used in a wide variety of feature selection cum classification problems. The proposed model performs noticeably better than many other state-of-the-art models used in similar situations.

preprint, 2018, arXiv

A novel distribution-free hybrid regression model for manufacturing process efficiency improvement

This work is motivated by a particular problem of a modern paper manufacturing industry, in which maximum efficiency of the fiber-filler recovery process is desired. Large amounts of unwanted material, along with valuable fibers and fillers, come out as by-products of the paper manufacturing process and mostly go to waste. The job of an efficient Krofta supracell is to separate the unwanted materials from the valuable ones so that fibers and fillers can be collected from the waste materials and reused in the manufacturing process. The efficiency of Krofta depends on several crucial process parameters, and monitoring them is difficult. To address the low recovery percentage of the supracell, we propose a novel hybridization of regression trees (RT) and artificial neural networks (ANN), the hybrid RT-ANN model. This model is used to achieve the goal of improving supracell efficiency, viz., gain in percentage recovery. In addition, theoretical results for the universal consistency of the proposed model are given with the optimal value of a vital model parameter. Experimental findings show that the proposed hybrid RT-ANN model achieves higher accuracy in predicting Krofta recovery percentage than other conventional regression models for solving the Krofta efficiency problem. This work will help the paper manufacturing company to become environmentally friendly with minimal ecological damage and improved waste recovery.

preprint, 2018, arXiv

Imbalanced Ensemble Classifier for learning from imbalanced business school data set

Private business schools in India face a common problem of selecting quality students for their MBA programs to achieve the desired placement percentage. Generally, such data sets are biased towards one class, i.e., imbalanced in nature, and learning from an imbalanced dataset is difficult. This paper proposes an imbalanced ensemble classifier that handles the imbalanced nature of the dataset and achieves higher accuracy in the feature selection (selection of important characteristics of students) cum classification problem (prediction of placements based on the students' characteristics) for an Indian business school dataset. The optimal value of an important model parameter is found. Numerical evidence is also provided using the Indian business school dataset to assess the performance of the proposed classifier.

preprint, 2018, arXiv

Superensemble Classifier for Improving Predictions in Imbalanced Datasets

Learning from an imbalanced dataset is a tricky proposition. Because these datasets are biased towards one class, most existing classifiers tend not to perform well on minority class examples. Conventional classifiers usually aim to optimize the overall accuracy without considering the relative distribution of each class. This article presents a superensemble classifier that maps Hellinger distance decision trees (HDDT) into a radial basis function network (RBFN) framework to improve predictions in imbalanced classification problems. Regularity conditions for universal consistency and the idea of parameter optimization of the proposed model are provided. The proposed distribution-free model can be applied to feature selection cum imbalanced classification problems. We have also provided numerical evidence using various real-life data sets to assess the performance of the proposed model. Its effectiveness and competitiveness with respect to different state-of-the-art models are shown.
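For readers unfamiliar with HDDT, the sketch below shows the Hellinger distance splitting criterion for a binary split, in the form commonly given for Hellinger distance decision trees. It covers only that criterion, not the superensemble or its RBFN stage, and the function name and class counts are illustrative assumptions.

```python
import math

def hellinger_split_value(pos_left, pos_right, neg_left, neg_right):
    """Hellinger distance between the two classes' branch distributions.

    Each argument is a count of positive or negative examples sent to the
    left or right branch of a candidate binary split.
    """
    pos_total = pos_left + pos_right
    neg_total = neg_left + neg_right
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both classes must be present at the node")
    left = (math.sqrt(pos_left / pos_total)
            - math.sqrt(neg_left / neg_total)) ** 2
    right = (math.sqrt(pos_right / pos_total)
             - math.sqrt(neg_right / neg_total)) ** 2
    return math.sqrt(left + right)

# Hypothetical node: 10 minority (positive) and 990 majority examples.
skewed = hellinger_split_value(9, 1, 99, 891)
# Same within-class proportions with ten times as many majority examples.
more_skewed = hellinger_split_value(9, 1, 990, 8910)
print(round(skewed, 4), round(more_skewed, 4))
```

Because the value depends only on the fraction of each class sent to each branch, it is unchanged when the majority class grows, which is why this criterion is regarded as insensitive to class skew. A split that separates the classes perfectly reaches the maximum value of √2.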
