Researcher profile

Tarun Sharma

Tarun Sharma contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
6 works
0 followers
11 topics
4 close collaborators

Published work

6 published items

preprint, 2026, arXiv

INDIC DIALECT: A Multi Task Benchmark to Evaluate and Translate in Indian Language Dialects

Recent NLP advances focus primarily on standardized languages, leaving most low-resource dialects under-served, especially in Indian scenarios. In India, the issue is particularly important: despite Hindi being the third most spoken language globally (over 600 million speakers), its numerous dialects remain underrepresented. The situation is similar for Odia, which has around 45 million speakers. While some datasets cover standard Hindi and Odia, their regional dialects have almost no web presence. We introduce INDIC-DIALECT, a human-curated parallel corpus of 13k sentence pairs spanning 11 dialects and 2 languages: Hindi and Odia. Using this corpus, we construct a multi-task benchmark with three tasks: dialect classification, multiple-choice question (MCQ) answering, and machine translation (MT). Our experiments show that LLMs like GPT-4o and Gemini 2.5 perform poorly on the classification task, while fine-tuned transformer-based models pretrained on Indian languages substantially improve performance, e.g., raising F1 from 19.6% to 89.8% on dialect classification. For dialect-to-language translation, we find that a hybrid AI model achieves the highest BLEU score of 61.32, compared to the baseline score of 23.36. Interestingly, due to the complexity of generating dialect sentences, we observe that for language-to-dialect translation the "rule-based followed by AI" approach achieves the best BLEU score of 48.44, compared to the baseline score of 27.59. INDIC-DIALECT is thus a new benchmark for dialect-aware Indic NLP, and we plan to release it as open source to support further work on low-resource Indian dialects.

preprint, 2026, arXiv

IVF-TQ: Streaming-Robust Approximate Nearest Neighbor Search via a Codebook-Free Residual Layer

We propose IVF-TQ, an IVF index with a codebook-free residual layer: a fixed random rotation followed by precomputed Lloyd-Max scalar quantization depending only on (b, d). Only the IVF coarse partition is trained. Building on TurboQuant (Zandieh et al., 2025), the design substantially reduces a key failure mode of trained-codebook ANN indexes (PQ, OPQ, ScaNN): staleness under streaming ingestion.

Empirical (3 seeds): Per-batch PQ retraining does not recover the streaming gap at any tested bit budget (paired-t p > 0.28 everywhere). On streaming Deep-10M, IVF-TQ holds at 87.4% -> 86.6% (Delta = -0.80 +/- 0.10pp) while IVF-PQ degrades -3.23pp. A shuffled-i.i.d. control on SIFT-1M shows IVF-PQ losing -3.9pp without distribution shift. At higher PQ bit budgets (~1.5x IVF-TQ memory), absolute recall favors PQ as expected from rate-distortion (+6.1pp Deep-10M; +2.0pp SIFT-10M); the durable IVF-TQ benefit is operational (no codebook to retrain), robust across memory regimes.

Prior art: IVF around a codebook-free residual quantizer is architecturally not new -- IVF-RaBitQ ships in Milvus, cuVS, LanceDB, Weaviate; Shi et al. (2026) is concurrent GPU work. TurboQuant itself tests only flat-rotation ANN.

Contributions: (i) A multi-seed streaming-operational story for codebook-free IVF: 10M-scale evidence across PQ memory budgets. (ii) A uniform-over-sphere IP-error bound for the TQ residual quantizer with one fixed rotation (proof sketch in v1; rigorous in v2). (iii) Adaptive IVF-TQ: a partition-only refresh recovering 67% -> 97.8% under worst-case rotation shift with re-ranking (90.3% without).

Code, data: https://github.com/tarun-ks/turboquant_search
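
The residual layer described in this abstract (a fixed random rotation followed by a scalar quantizer whose levels depend only on the bit width b and dimension d) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the Gaussian N(0, 1/d) approximation used to fit the levels, and the choice to store the residual norm separately are all assumptions made for illustration; the IVF coarse partition is omitted.

```python
import numpy as np

def random_rotation(d, seed=0):
    """Fixed random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((d, d)))
    # Flip column signs using diag(r) so the matrix is Haar-distributed.
    return q * np.sign(np.diag(r))

def lloyd_max_levels(b, d, n_samples=100_000, iters=30, seed=0):
    """Lloyd-Max reconstruction levels for one coordinate of a rotated unit vector.

    Assumption: each coordinate is approximated as N(0, 1/d), so the levels
    depend only on (b, d) and never on the indexed data.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples) / np.sqrt(d)
    # Start from evenly spaced quantiles, then alternate assignment and centroid updates.
    levels = np.quantile(x, (np.arange(2 ** b) + 0.5) / 2 ** b)
    for _ in range(iters):
        edges = (levels[:-1] + levels[1:]) / 2   # nearest-level decision boundaries
        idx = np.searchsorted(edges, x)          # cell index of every sample
        for j in range(2 ** b):
            members = x[idx == j]
            if members.size:
                levels[j] = members.mean()
    return levels

def encode(residual, rotation, levels):
    """Return (norm, codes): the norm is kept as a float (an assumption), codes use b bits each."""
    norm = np.linalg.norm(residual)
    unit = residual / norm if norm > 0 else residual
    rotated = rotation @ unit
    edges = (levels[:-1] + levels[1:]) / 2
    # uint8 codes are sufficient for b <= 8.
    return norm, np.searchsorted(edges, rotated).astype(np.uint8)

def decode(norm, codes, rotation, levels):
    """Approximate reconstruction: look up levels, undo the rotation, rescale."""
    return norm * (rotation.T @ levels[codes])

# Hypothetical usage on one random residual vector.
d, b = 64, 4
rotation = random_rotation(d)
levels = lloyd_max_levels(b, d)
residual = np.random.default_rng(1).standard_normal(d)
norm, codes = encode(residual, rotation, levels)
approx = decode(norm, codes, rotation, levels)
rel_err = np.linalg.norm(residual - approx) / np.linalg.norm(residual)
print(f"relative reconstruction error: {rel_err:.3f}")
```

In this sketch nothing in `rotation` or `levels` is fitted to the indexed vectors, so there is no codebook that could go stale as new data streams in, which is the operational property the abstract emphasizes.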

preprint, 2023, arXiv

Teaching Computer Vision for Ecology

Computer vision can accelerate ecology research by automating the analysis of raw imagery from sensors like camera traps, drones, and satellites. However, computer vision is an emerging discipline that is rarely taught to ecologists. This work discusses our experience teaching a diverse group of ecologists to prototype and evaluate computer vision systems in the context of an intensive hands-on summer workshop. We explain the workshop structure, discuss common challenges, and propose best practices. This document is intended for computer scientists who teach computer vision across disciplines, but it may also be useful to ecologists or other domain experts who are learning to use computer vision themselves.

preprint, 2022, arXiv

Epidemic Control Modeling using Parsimonious Models and Markov Decision Processes

Many countries have experienced at least two waves of the COVID-19 pandemic. The second wave is far more dangerous, as distinct strains appear more harmful to human health, but it stems from complacency about the first wave. This paper introduces a parsimonious yet representative stochastic epidemic model that simulates the uncertain spread of the disease regardless of the latency and recovery time distributions. We also propose a Markov decision process to seek an optimal trade-off between the usage of the healthcare system and the economic costs of an epidemic. We apply the model to COVID-19 data from New Delhi, India and simulate the epidemic spread with different policy review times. The results show that the optimal policy acts swiftly to curb the epidemic in the first wave, thus avoiding the collapse of the healthcare system and the future costs of posterior outbreaks. An analysis of the recent collapse of the healthcare system of India during the second COVID-19 wave suggests that many lives could have been preserved if swift mitigation had been promoted after the first wave.

preprint, 2022, arXiv

The Hilbert Space of large $N$ Chern-Simons matter theories

We demonstrate that the known expressions for the thermal partition function of large $N$ Chern-Simons matter theories admit a simple Hilbert space interpretation as the partition function of an associated ungauged large $N$ matter theory with one additional condition: the Fock space of this associated theory is projected down to the subspace of its quantum singlets, i.e., singlets under the Gauss law for Chern-Simons gauge theory. Via the Chern-Simons / WZW correspondence, the space of quantum singlets is equivalent to the space of WZW conformal blocks. One step in our demonstration involves recasting the Verlinde formula for the dimension of the space of conformal blocks in $SU(N)_k$ and $U(N)_{k,k'}$ WZW theories into a simple and physically transparent form, which we also rederive by evaluating the partition function and superconformal index of pure Chern-Simons theory in the presence of Wilson lines. A particular consequence of the projection of the Fock space of Chern-Simons matter theories to quantum (or WZW) singlets is the "Bosonic Exclusion Principle": the number of bosons occupying any single particle state is bounded above by the Chern-Simons level. The quantum singlet condition (unlike its Yang-Mills Gauss law counterpart) has a nontrivial impact on thermodynamics even in the infinite volume limit. In this limit the projected Fock space partition function reduces to a product of partition functions, one for each single particle state. These single particle state partition functions are $q$-deformations of their free boson and free fermion counterparts and interpolate between these two special cases. We also propose a formula for the large $N$ partition function that is valid for arbitrary finite volume of the spatial $S^2$ and not only at large volume.

preprint, 2019, arXiv

Correlation functions in ${\cal N}=2$ Supersymmetric vector matter Chern-Simons theory

We compute the two- and three-point functions of the operators in the spin-zero multiplet of ${\cal N}=2$ supersymmetric vector matter Chern-Simons theory at large $N$ and to all orders in the 't Hooft coupling by solving the Schwinger-Dyson equation. The Schwinger-Dyson method for computing four-point functions becomes extremely complicated, and hence we use the bootstrap method to solve for the four-point functions of the scalar operators $J_0^{f}=\bar{\psi}\psi$ and $J_0^{b}=\bar{\phi}\phi$. Interestingly, because $\langle J_0^{f}J_0^{f}J_0^{b} \rangle$ is a contact term, the four-point function of the $J_0^{f}$ operator looks like that of the free theory up to overall coupling-constant-dependent factors and up to some bulk AdS contact terms. On the other hand, the $J_0^{b}$ four-point function receives an additional contribution compared to the free theory expression due to the $J_0^{f}$ exchange. Interestingly, the double discontinuity of this single-trace operator $J_0^{f}$ vanishes, and hence it only contributes to the AdS contact term.