Researcher profile

Arnab Bhattacharya

Arnab Bhattacharya contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
12 works
0 followers
16 topics
4 close collaborators

Published work

12 published items

preprint | 2026 | arXiv

INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects

Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served, especially in Indian scenarios. In India, the issue is particularly important: despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets exist that contain standard Hindi and Odia, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs like GPT-4o and Gemini 2.5 perform poorly on the classification task, while fine-tuned transformer-based models pretrained on Indian languages substantially improve performance, e.g., improving F1 from 19.6% to 89.8% on dialect classification. For dialect-to-language translation, we find that a hybrid AI model achieves the highest BLEU score of 61.32, compared to the baseline score of 23.36. Interestingly, due to the complexity of generating dialect sentences, we observe that for language-to-dialect translation the "rule-based followed by AI" approach achieves the best BLEU score of 48.44, compared to the baseline score of 27.59. INDIC-DIALECT is thus a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.

preprint | 2025 | arXiv

Lexical and Statistical Analysis of Bangla Newspaper and Literature: A Corpus-Driven Study on Diversity, Readability, and NLP Adaptation

In this paper, we present a comprehensive corpus-driven analysis of Bangla literary and newspaper texts to investigate their lexical diversity, structural complexity and readability. We use Vacaspati and IndicCorp, which are the most extensive literature-only and newspaper-only corpora for Bangla, respectively. We examine key linguistic properties, including the type-token ratio (TTR), hapax legomena ratio (HLR), bigram diversity, average syllable and word lengths, and adherence to Zipf's law, for both the newspaper corpus (IndicCorp) and the literary corpus (Vacaspati). Across all features, including bigram diversity and HLR, the literary corpus exhibits significantly higher lexical richness and structural variation despite its smaller size. Additionally, we examine the diversity of the corpora by building n-gram models and measuring perplexity. Our findings reveal that literary corpora have higher perplexity than newspaper corpora, even for similar sentence sizes. This trend can also be observed for English newspaper and literature corpora, indicating its generalizability. We also examine how the performance of models on downstream tasks is influenced by the inclusion of literary data alongside newspaper data. Our findings suggest that integrating literary data with newspaper data improves the performance of models on various downstream tasks. We also demonstrate that a literary corpus adheres more closely to global word distribution properties, such as Zipf's law, than a newspaper corpus or a merged corpus of both literary and newspaper texts. Literary corpora also have higher entropy and lower redundancy values than a newspaper corpus. We further assess readability using the Flesch and Coleman-Liau indices, showing that literary texts are more complex.
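As a rough illustration of the lexical measures named in this abstract, the following Python sketch computes a type-token ratio, a hapax legomena ratio and a bigram diversity score for a tokenized text. The function name `lexical_stats` is hypothetical, and the exact normalizations are assumptions (hapax count divided by total tokens, distinct bigrams divided by total bigrams); the paper may define these quantities differently.

```python
from collections import Counter


def lexical_stats(tokens):
    """Return (TTR, HLR, bigram diversity) for a non-empty list of tokens.

    Assumed definitions (not taken from the paper):
      TTR              = distinct word types / total tokens
      HLR              = types occurring exactly once / total tokens
      bigram diversity = distinct adjacent pairs / total adjacent pairs
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")

    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    n_hapax = sum(1 for c in counts.values() if c == 1)

    ttr = n_types / n_tokens
    hlr = n_hapax / n_tokens

    # Adjacent token pairs; a single-token text has no bigrams.
    bigrams = list(zip(tokens, tokens[1:]))
    bigram_div = len(set(bigrams)) / len(bigrams) if bigrams else 0.0

    return ttr, hlr, bigram_div


if __name__ == "__main__":
    sample = "the cat sat on the mat".split()
    print(lexical_stats(sample))
```

Whitespace tokenization is used here only to keep the sketch short; a real Bangla corpus would need a proper tokenizer, and TTR in particular is sensitive to corpus size, so comparisons across corpora of different sizes need care.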

preprint | 2022 | arXiv

Koopman-based Differentiable Predictive Control for the Dynamics-Aware Economic Dispatch Problem

The dynamics-aware economic dispatch (DED) problem embeds low-level generator dynamics and operational constraints to enable near real-time scheduling of generation units in a power network. DED produces a more dynamic supervisory control policy than traditional economic dispatch (T-ED) that leads to reduced overall generation costs. However, the incorporation of differential equations that govern the system dynamics makes DED an optimization problem that is computationally prohibitive to solve. In this work, we present a new data-driven approach based on differentiable programming to efficiently obtain parametric solutions to the underlying DED problem. In particular, we employ the recently proposed differentiable predictive control (DPC) for offline learning of explicit neural control policies using an identified Koopman operator (KO) model of the power system dynamics. We demonstrate the high solution quality and five orders of magnitude computational-time savings of the DPC method over the original online optimization-based DED approach on a 9-bus test power grid network.

preprint | 2022 | arXiv

Predictions of Reynolds and Nusselt numbers in turbulent convection using machine-learning models

In this paper, we develop a multivariate regression model and a neural network model to predict the Reynolds number (Re) and Nusselt number in turbulent thermal convection. We compare their predictions with those of earlier models of convection: the Grossmann-Lohse [Phys. Rev. Lett. 86, 3316 (2001)], revised Grossmann-Lohse [Phys. Fluids 33, 015113 (2021)], and Pandey-Verma [Phys. Rev. E 94, 053106 (2016)] models. We observe that although the predictions of all the models are quite close to each other, the machine learning models developed in this work provide the best match with the experimental and numerical results.

preprint | 2022 | arXiv

Recommendation of Compatible Outfits Conditioned on Style

Recommendation in the fashion domain has seen a recent surge in research in various areas, for example, shop-the-look, context-aware outfit creation, and personalized outfit creation. The majority of state-of-the-art approaches in outfit recommendation seek to improve compatibility among items so as to produce high-quality outfits. Some recent works have recognized that style is an important factor in fashion and have incorporated it in compatibility learning and outfit generation. These methods often depend on the availability of fine-grained product categories or the presence of rich item attributes (e.g., long-skirt, mini-skirt, etc.). In this work, we aim to generate outfits conditional on styles or themes, as one would dress in real life, operating under the practical assumption that each item is mapped to an image and to a high-level category driven by the taxonomy of an online portal (like outdoor, formal, etc.). We use a novel style encoder network that renders outfit styles in a smooth latent space. We present an extensive analysis of different aspects of our method and demonstrate its superiority over existing state-of-the-art baselines through rigorous experiments.

preprint | 2022 | arXiv

Terahertz Optical Properties and Birefringence in Single Crystal Vanadium doped [100] β-Ga2O3

We report the terahertz optical properties of vanadium-doped [100] β-Ga2O3 using Terahertz Time-Domain Spectroscopy (THz-TDS). The V-doped β-Ga2O3 crystal shows strong birefringence in the 0.2-2.4 THz range. Further, phase retardation by the V-doped β-Ga2O3 has been measured over the whole THz range by Terahertz Time-Domain Polarimetry (THz-TDP). It is observed that the V-doped β-Ga2O3 crystal behaves both as a quarter waveplate (QWP) at 0.38, 1.08, 1.71 and 2.28 THz, and as a half waveplate (HWP) at 0.74 and 1.94 THz.

preprint | 2022 | arXiv

Vanadium doped beta-Ga2O3 single crystals: Growth, Optical and Terahertz characterization

We report the growth of electrically-resistive vanadium-doped beta-Ga2O3 single crystals via the optical floating zone technique. By carefully controlling the growth parameters, V-doped crystals with very high electrical resistivity compared to the usual n-type V-doped beta-Ga2O3 (n_e ~ 10^18 cm^-3) can be synthesized. The optical properties of such highly resistive V-doped beta-Ga2O3 are significantly different from those of the undoped and n-doped crystals. We study the polarization-dependent Raman spectra, polarization-dependent transmission, temperature-dependent photoluminescence in the optical wavelength range and the THz transmission properties in the 0.2-2.6 THz range. The V-doped insulating Ga2O3 crystals show strong birefringence with a refractive index contrast Δn of 0.3 ± 0.02 at 1 THz, suggesting it to be an ideal material for optical applications in the THz region.

preprint | 2020 | arXiv

C-MI-GAN: Estimation of Conditional Mutual Information using MinMax formulation

Estimation of information-theoretic quantities such as mutual information and its conditional variant has drawn interest in recent times owing to their multifaceted applications. Newly proposed neural estimators for these quantities have overcome severe drawbacks of classical kNN-based estimators in high dimensions. In this work, we focus on conditional mutual information (CMI) estimation by utilizing its formulation as a minmax optimization problem. Such a formulation leads to a joint training procedure similar to that of generative adversarial networks. We find that our proposed estimator provides better estimates than the existing approaches on a variety of simulated data sets comprising linear and non-linear relations between variables. As an application of CMI estimation, we deploy our estimator for conditional independence (CI) testing on real data and obtain better results than state-of-the-art CI testers.

preprint | 2020 | arXiv

How and Why is An Answer (Still) Correct? Maintaining Provenance in Dynamic Knowledge Graphs

Knowledge graphs (KGs) have increasingly become the backbone of many critical knowledge-centric applications. Most large-scale KGs used in practice are automatically constructed based on an ensemble of extraction techniques applied over diverse data sources. Therefore, it is important to establish the provenance of results for a query to determine how these were computed. Provenance is shown to be useful for assigning confidence scores to the results, for debugging the KG generation itself, and for providing answer explanations. In many such applications, certain queries are registered as standing queries since their answers are needed often. However, KGs change continuously for reasons such as changes in the source data, improvements to the extraction techniques, refinement/enrichment of information, and so on. This brings us to the issue of efficiently maintaining the provenance polynomials of complex graph pattern queries for dynamic and large KGs instead of having to recompute them from scratch each time the KG is updated. Addressing these issues, we present HUKA, which uses provenance polynomials for tracking the derivation of query results over knowledge graphs by encoding the edges involved in generating the answer. More importantly, HUKA also maintains these provenance polynomials in the face of updates (insertions as well as deletions of facts) to the underlying KG. Experimental results over large real-world KGs such as YAGO and DBpedia with various benchmark SPARQL query workloads reveal that HUKA can be almost 50 times faster than existing systems for provenance computation on dynamic KGs.

preprint | 2020 | arXiv

Learning Koopman Representations for Hybrid Systems

The Koopman operator lifts nonlinear dynamical systems into a functional space of observables, where the dynamics are linear. In this paper, we provide three different Koopman representations for hybrid systems. The first is specific to switched systems, and the second and third preserve the original hybrid dynamics while eliminating the discrete state variables; the second approach is straightforward, and we provide conditions under which the transformation associated with the third holds. Eliminating discrete state variables provides computational benefits when using data-driven methods to learn the Koopman operator and its observables. Following this, we use deep learning to implement each representation on two test cases, discuss the challenges associated with those implementations, and propose areas of future work.
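To make the first sentence of this abstract concrete, here is a minimal Python sketch of Koopman lifting on a textbook-style toy system. It is not one of the hybrid-system representations from the paper: the system, the parameter values `a`, `b`, `c`, and the function names are all assumptions chosen because the lifted dynamics happen to be exactly linear. The discrete-time map x1' = a*x1, x2' = b*x2 + c*x1^2 is nonlinear in the state, but linear in the observables (x1, x2, x1^2).

```python
def step(x1, x2, a=0.9, b=0.5, c=0.3):
    """One step of the (assumed) nonlinear toy system."""
    return a * x1, b * x2 + c * x1 * x1


def lift(x1, x2):
    """Map the state to the observables z = (x1, x2, x1^2)."""
    return [x1, x2, x1 * x1]


def koopman_matrix(a=0.9, b=0.5, c=0.3):
    """Finite matrix K with lift(step(x)) == K @ lift(x) for this system."""
    return [
        [a, 0.0, 0.0],    # x1'      = a * x1
        [0.0, b, c],      # x2'      = b * x2 + c * x1^2
        [0.0, 0.0, a * a] # (x1')^2  = a^2 * x1^2
    ]


def matvec(K, z):
    """Plain-Python matrix-vector product."""
    return [sum(k * v for k, v in zip(row, z)) for row in K]


if __name__ == "__main__":
    x = (1.2, -0.7)
    nonlinear = lift(*step(*x))              # evolve, then lift
    linear = matvec(koopman_matrix(), lift(*x))  # lift, then evolve linearly
    print(nonlinear)
    print(linear)
```

The two printed vectors agree up to floating-point rounding. Most systems do not admit such a small closed set of observables, which is one reason data-driven methods are used to learn the observables and the operator.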

preprint | 2020 | arXiv

Model Predictive Control of Discrete-Continuous Energy Systems via Generalized Disjunctive Programming

Generalized Disjunctive Programming (GDP) provides an alternative framework to model optimization problems with both discrete and continuous variables. The key idea behind GDP involves the use of logical disjunctions to represent discrete decisions in the continuous space, and logical propositions to denote algebraic constraints in the discrete space. Compared to traditional mixed-integer programming (MIP), the inherent logic structure in GDP yields tighter relaxations that are exploited by global branch and bound algorithms to improve solution quality. In this paper, we present a general GDP model for optimal control of hybrid systems that exhibit both discrete and continuous dynamics. Specifically, we use GDP to formulate a model predictive control (MPC) model for piecewise-affine systems with implicit switching logic. As an example, the GDP-based MPC approach is used as a supervisory control to improve energy efficiency in residential buildings with binary on/off, relay-based thermostats. A simulation study is used to demonstrate the validity of the proposed approach, and the improved solution quality compared to existing MIP-based control approaches.

preprint | 2020 | arXiv

Recharging and rejuvenation of decontaminated N95 masks

N95 respirators comprise a critical part of the personal protective equipment used by frontline health-care workers, and are typically meant for one-time usage. However, the recent COVID-19 pandemic has resulted in a serious shortage of these masks leading to a worldwide effort to develop decontamination and re-use procedures. A major factor contributing to the filtration efficiency of N95 masks is the presence of an intermediate layer of charged polypropylene electret fibers that trap particles through electrostatic or electrophoretic effects. This charge can degrade when the mask is used. Moreover, simple decontamination procedures (e.g. use of alcohol) can degrade any remaining charge from the polypropylene, thus severely impacting the filtration efficiency post decontamination. In this report, we summarize our results on the development of a simple laboratory setup allowing measurement of charge and filtration efficiency in N95 masks. In particular, we propose and show that it is possible to recharge the masks post-decontamination and recover filtration efficiency.