Researcher profile

Vivek Gupta

Vivek Gupta contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging) | Verification L1 | Unclaimed author
15 works
0 followers
9 topics
4 close collaborators

Published work

15 published items

preprint | 2026 | arXiv

Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities

The rapid integration of large language models (LLMs) into high-stakes legal work has exposed a critical gap: no benchmark exists to systematically stress-test their reliability against the nuanced, adversarial, and often subtle flaws present in real-world contracts. To address this, we introduce CLAUSE, a first-of-its-kind benchmark designed to evaluate the fragility of an LLM's legal reasoning. We study the capabilities of LLMs to detect and reason about fine-grained discrepancies by producing over 7500 real-world perturbed contracts from foundational datasets like CUAD and ContractNLI. Our novel, persona-driven pipeline generates 10 distinct anomaly categories, which are then validated against official statutes using a Retrieval-Augmented Generation (RAG) system to ensure legal fidelity. We use CLAUSE to evaluate leading LLMs' ability to detect embedded legal flaws and explain their significance. Our analysis shows a key weakness: these models often miss subtle errors and struggle even more to justify them legally. Our work outlines a path to identify and correct such reasoning failures in legal AI.

preprint | 2026 | arXiv

DIAGRAMS: A Review Framework for Reasoning-Level Attribution in Diagram QA

Diagram question answering (Diagram QA) requires reasoning-level attribution that links each question-answer pair to all visual regions needed to derive the answer, rather than only the region containing the final response. Creating such structured evidence across diagrams, charts, maps, circuits, and infographics is time-consuming, and existing annotation tools tightly couple their interfaces to dataset-specific formats. We present DIAGRAMS, a lightweight, schema-driven review framework that decouples interface logic from dataset-specific JSON structures through an internal meta-schema and dataset adapters. Given an image and QA pair with optional candidate regions, the system performs QA-conditioned evidence selection and proposes the regions required for reasoning. When QA pairs or candidate regions are missing, it generates them and supports human verification and refinement. Across six Diagram QA datasets, model-suggested evidence achieves 85.39% precision and 75.30% recall against reviewer-final selections (micro-averaged). These results indicate that the review-first framework reduces manual region creation while maintaining high agreement with final reasoning-level attributions. We release a public demo and installable package to support dataset auditing, grounded supervision creation, and grounded evaluation.
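As a reading aid for the "micro-averaged" figures above, the following is a minimal sketch of how micro-averaged precision and recall are conventionally computed: true positives, false positives, and false negatives are pooled across all datasets before the ratios are taken. The function name and the example counts are hypothetical and are not taken from the paper or its released package.

```python
from typing import Iterable, Tuple


def micro_precision_recall(
    counts: Iterable[Tuple[int, int, int]],
) -> Tuple[float, float]:
    """Pool (true_pos, false_pos, false_neg) triples, one per dataset,
    and return (precision, recall) computed on the pooled totals."""
    tp = fp = fn = 0
    for true_pos, false_pos, false_neg in counts:
        tp += true_pos
        fp += false_pos
        fn += false_neg
    # Guard against empty input so the ratios are always defined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up counts for two datasets (illustrative only, not the paper's data).
example_counts = [(8, 2, 2), (1, 0, 4)]
precision, recall = micro_precision_recall(example_counts)
```

Because the counts are pooled first, larger datasets weigh more heavily in the result than they would under a macro average, which averages per-dataset scores instead.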

preprint | 2026 | arXiv

Digitally Controlled Mechatronic Metamaterials for Actively Induced Targeted Bandgaps

This paper presents an experimental framework for inducing and tuning vibration bandgaps in digitally controlled mechatronic metamaterials. A slender-beam structure instrumented with collocated piezoelectric sensor-actuator pairs distributed periodically along the length is used as the host medium, with decentralized second-order low-pass resonant filter with negative position feedback controllers implemented in real time on an FPGA platform. Unlike conventional approaches that assess bandgap formation through tip displacement, this study relies on bending strain minimization of piezoelectric sensors as the principal indicator of control-induced bandgaps. This reflects more accurately the moment-based phase cancellation dynamics similar to resonator behavior. We derive analytical expressions for transmissibility in an n x n decentralized feedback architecture and verify them experimentally using a 7 x 7 unit-cell configuration. The findings show that resonant controllers with negative feedback applied at the unit-cell level can be systematically tuned through controller gain and damping to open targeted low-frequency bandgaps and significantly improve vibration attenuation. By shifting the focus to localized dynamics, this work deepens the understanding of how control-induced bandgaps emerge and demonstrates a scalable pathway for designing programmable mechatronic metamaterials based on unconventional resonator behavior.

preprint | 2026 | arXiv

DoPE: Decoy Oriented Perturbation Encapsulation Human-Readable, AI-Hostile Documents for Academic Integrity

Multimodal Large Language Models (MLLMs) can directly consume exam documents, threatening conventional assessments and academic integrity. We present DoPE (Decoy-Oriented Perturbation Encapsulation), a document-layer defense framework that embeds semantic decoys into PDF/HTML assessments to exploit render-parse discrepancies in MLLM pipelines. By instrumenting exams at authoring time, DoPE provides model-agnostic prevention (stop or confound automated solving) and detection (flag blind AI reliance) without relying on conventional one-shot classifiers. We formalize prevention and detection tasks, and introduce FewSoRT-Q, an LLM-guided pipeline that generates question-level semantic decoys and FewSoRT-D to encapsulate them into watermarked documents. We evaluate on Integrity-Bench, a novel benchmark of 1826 exams (PDF+HTML) derived from public QA datasets and OpenCourseWare. Against black-box MLLMs from OpenAI and Anthropic, DoPE yields strong empirical gains: a 91.4% detection rate at an 8.7% false-positive rate using an LLM-as-Judge verifier, and prevents successful completion or induces decoy-aligned failures in 96.3% of attempts. We release Integrity-Bench, our toolkit, and evaluation code to enable reproducible study of document-layer defenses for academic integrity.

preprint | 2026 | arXiv

Integrity Shield A System for Ethical AI Use & Authorship Transparency in Assessments

Large Language Models (LLMs) can now solve entire exams directly from uploaded PDF assessments, raising urgent concerns about academic integrity and the reliability of grades and credentials. Existing watermarking techniques either operate at the token level or assume control over the model's decoding process, making them ineffective when students query proprietary black-box systems with instructor-provided documents. We present Integrity Shield, a document-layer watermarking system that embeds schema-aware, item-level watermarks into assessment PDFs while keeping their human-visible appearance unchanged. These watermarks consistently prevent MLLMs from answering shielded exam PDFs and encode stable, item-level signatures that can be reliably recovered from model or student responses. Across 30 exams spanning STEM, humanities, and medical reasoning, Integrity Shield achieves exceptionally high prevention (91-94% exam-level blocking) and strong detection reliability (89-93% signature retrieval) across four commercial MLLMs. Our demo showcases an interactive interface where instructors upload an exam, preview watermark behavior, and inspect pre/post AI performance & authorship evidence.

preprint | 2025 | arXiv

Ultra-Wideband Polarimetry of the April 2021 Profile Change Event in PSR J1713+0747

The millisecond pulsar PSR J1713+0747 is a high-priority target for pulsar timing array experiments due to its long-term timing stability, and bright, narrow pulse profile. In April 2021, PSR J1713+0747 underwent a significant profile change event, observed by several telescopes worldwide. Using the broad-bandwidth and polarimetric fidelity of the Ultra-Wideband Low-frequency receiver on Murriyang, CSIRO's Parkes radio telescope, we investigated the long-term spectro-polarimetric behaviour of this profile change in detail. We highlight the broad-bandwidth nature of the event, which exhibits frequency dependence that is inconsistent with cold-plasma propagation effects. We also find that spectral and temporal variations are stronger in one of the orthogonal polarisation modes than the other, and observe mild variations ($\sim 3$-$5\,\sigma$ significance) in circular polarisation above 1400 MHz following the event. However, the linear polarisation position angle remained remarkably stable in the profile leading edge throughout the event. With over three years of data post-event, we find that the profile has not yet recovered back to its original state, indicating a long-term asymptotic recovery, or a potential reconfiguration of the pulsar's magnetic field. These findings favour a magnetospheric origin of the profile change event over a line-of-sight propagation effect in the interstellar medium.

preprint | 2022 | arXiv

IndicXNLI: Evaluating Multilingual Inference for Indian Languages

While Indic NLP has made rapid advances recently in terms of the availability of corpora and pre-trained models, benchmark datasets on standard NLU tasks are limited. To this end, we introduce IndicXNLI, an NLI dataset for 11 Indic languages. It has been created by high-quality machine translation of the original English XNLI dataset and our analysis attests to the quality of IndicXNLI. By finetuning different pre-trained LMs on this IndicXNLI, we analyze various cross-lingual transfer techniques with respect to the impact of the choice of language models, languages, multi-linguality, mix-language input, etc. These experiments provide us with useful insights into the behaviour of pre-trained models for a diverse set of languages.

preprint | 2022 | arXiv

Is My Model Using The Right Evidence? Systematic Probes for Examining Evidence-Based Tabular Reasoning

Neural models command state-of-the-art performance across NLP tasks, including ones involving "reasoning". Models claiming to reason about the evidence presented to them should attend to the correct parts of the input avoiding spurious patterns therein, be self-consistent in their predictions across inputs, and be immune to biases derived from their pre-training in a nuanced, context-sensitive fashion. Do the prevalent *BERT-family of models do so? In this paper, we study this question using the problem of reasoning on tabular data. Tabular inputs are especially well-suited for the study -- they admit systematic probes targeting the properties listed above. Our experiments demonstrate that a RoBERTa-based model, representative of the current state-of-the-art, fails at reasoning on the following counts: it (a) ignores relevant parts of the evidence, (b) is over-sensitive to annotation artifacts, and (c) relies on the knowledge encoded in the pre-trained language model rather than the evidence presented in its tabular inputs. Finally, through inoculation experiments, we show that fine-tuning the model on perturbed data does not help it overcome the above challenges.

preprint | 2022 | arXiv

The ultra narrow FRB20191107B, and the origins of FRB scattering

We report the detection of FRB20191107B with the UTMOST radio telescope at a dispersion measure (DM) of 714.9 ${\rm pc~cm^{-3}}$. The burst consists of three components, the brightest of which has an intrinsic width of only 11.3 $\mu$s and a scattering tail with an exponentially decaying time-scale of 21.4 $\mu$s measured at 835 MHz. We model the sensitivity of UTMOST and other major FRB surveys to such narrow events. We find that $>60\%$ of FRBs like FRB20191107B are being missed, and that a significant population of very narrow FRBs probably exists and remains underrepresented in these surveys. The high DM and small scattering timescale of FRB20191107B allow us to place an upper limit on the strength of turbulence in the Intergalactic Medium (IGM), quantified as scattering measure (SM), of ${\rm SM_{IGM} < 8.4 \times 10^{-7} ~kpc~m^{-20/3}}$. Almost all UTMOST FRBs have full phase information due to real-time voltage capture which provides us with the largest sample of coherently dedispersed single burst FRBs. Our 10.24 $\mu$s time resolution data yields accurately measured FRB scattering timescales. We combine the UTMOST FRBs with 10 FRBs from the literature and find no obvious evidence for a DM-scattering relation, suggesting that IGM is not the dominant source of scattering in FRBs. We support the results of previous studies and identify the local environment of the source in the host galaxy as the most likely region which dominates the observed scattering of our FRBs.

preprint | 2021 | arXiv

Fast radio bursts as probes of feedback from active galactic nuclei

Fast Radio Bursts (FRBs) are a promising tool for studying the low-density universe as their dispersion measures (DM) are extremely sensitive probes of electron column density. Active Galactic Nuclei (AGN) inject energy into the intergalactic medium, affecting the DM and its scatter. To determine the effectiveness of FRBs as a probe of AGN feedback, we analysed three different AGN models from the EAGLE simulation series. We measured the mean DM-redshift relation, and the scatter around it, using $2.56 \times 10^8$ sightlines at 131 redshift ($z$) bins between $0 \leq z \leq 3$. While the DM-redshift relation itself is highly robust against different AGN feedback models, significant differences are detected in the scatter around the mean: weaker feedback leads to more scatter. We find $\sim 10^4$ localised FRBs are needed to discriminate between the scatter in standard feedback and stronger, more intermittent feedback models. The number of FRBs required is dependent on the redshift distribution of the detected population. A log-normal redshift distribution at $z=0.5$ requires approximately 50% fewer localised FRBs than a distribution centred at $z=1$. With the Square Kilometre Array expected to detect $>10^3$ FRBs per day, in the future, FRBs will be able to provide constraints on AGN feedback.

preprint | 2020 | arXiv

DeepSumm -- Deep Code Summaries using Neural Transformer Architecture

Source code summarization is the task of writing short natural-language descriptions of what source code does at run time. Such summaries are extremely useful for software development and maintenance but are expensive to author manually, so this is done for only a small fraction of the code that is produced and is often ignored. Automatic code documentation could solve this at low cost. It is thus an emerging research field with further applications to program comprehension and software maintenance. Traditional methods often relied on cognitive models built from templates and heuristics, and saw varying degrees of adoption by the developer community. With recent advances, end-to-end data-driven approaches based on neural techniques have largely overtaken the traditional techniques. Much of the current landscape employs neural-translation-based architectures with recurrence and attention, which are resource- and time-intensive to train. In this paper, we employ neural techniques to solve the task of source code summarization and specifically compare NMT-based techniques to the simpler and more appealing Transformer architecture on a dataset of Java methods and comments. We argue that recurrence can be dispensed with in the training procedure. To the best of our knowledge, Transformer-based models have not been used for the task before. With more than 2.1m supervised samples of comments and code, we reduce the training time by more than 50% and achieve a BLEU score of 17.99 on the test set.

preprint | 2020 | arXiv

INFOTABS: Inference on Tables as Semi-structured Data

In this paper, we observe that semi-structured tabulated text is ubiquitous; understanding it requires not only comprehending the meaning of text fragments, but also the implicit relationships between them. We argue that such data can serve as a testing ground for understanding how we reason about information. To study this, we introduce a new dataset called INFOTABS, comprising human-written textual hypotheses based on premises that are tables extracted from Wikipedia info-boxes. Our analysis shows that the semi-structured, multi-domain and heterogeneous nature of the premises admits complex, multi-faceted reasoning. Experiments reveal that, while human annotators agree on the relationships between a table-hypothesis pair, several standard modeling strategies are unsuccessful at the task, suggesting that reasoning about tables can pose a difficult modeling challenge.

preprint | 2020 | arXiv

On Dimensional Linguistic Properties of the Word Embedding Space

Word embeddings have become a staple of several natural language processing tasks, yet much remains to be understood about their properties. In this work, we analyze word embeddings in terms of their principal components and arrive at a number of novel and counterintuitive observations. In particular, we characterize the utility of variance explained by the principal components as a proxy for downstream performance. Furthermore, through syntactic probing of the principal embedding space, we show that the syntactic information captured by a principal component does not correlate with the amount of variance it explains. Consequently, we investigate the limitations of variance based embedding post-processing and demonstrate that such post-processing is counter-productive in sentence classification and machine translation tasks. Finally, we offer a few precautionary guidelines on applying variance based embedding post-processing and explain why non-isotropic geometry might be integral to word embedding performance.

preprint | 2020 | arXiv

P-SIF: Document Embeddings Using Partition Averaging

Simple weighted averaging of word vectors often yields effective representations for sentences which outperform sophisticated seq2seq neural models in many tasks. While it is desirable to use the same method to represent documents as well, unfortunately, the effectiveness is lost when representing long documents involving multiple sentences. One of the key reasons is that a longer document is likely to contain words from many different topics; hence, creating a single vector while ignoring all the topical structure is unlikely to yield an effective document representation. This problem is less acute in single sentences and other short text fragments where the presence of a single topic is most likely. To alleviate this problem, we present P-SIF, a partitioned word averaging model to represent long documents. P-SIF retains the simplicity of simple weighted word averaging while taking a document's topical structure into account. In particular, P-SIF learns topic-specific vectors from a document and finally concatenates them all to represent the overall document. We provide theoretical justifications on the correctness of P-SIF. Through a comprehensive set of experiments, we demonstrate P-SIF's effectiveness compared to simple weighted averaging and many other baselines.
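The idea of averaging per topic and then concatenating can be illustrated with a small sketch. This is not the authors' implementation: it assumes the per-word topic weights and per-word importance weights are already given (the abstract does not say how they are learned), and the function and parameter names are hypothetical.

```python
from typing import Dict, List


def partition_average(
    tokens: List[str],
    word_vectors: Dict[str, List[float]],
    topic_weights: Dict[str, List[float]],
    word_weights: Dict[str, float],
) -> List[float]:
    """Build one weighted word-vector average per topic and concatenate them.

    topic_weights[w][k] is an assumed, precomputed weight of word w in topic k;
    word_weights[w] is an assumed per-word importance weight (default 1.0).
    """
    dim = len(next(iter(word_vectors.values())))
    num_topics = len(next(iter(topic_weights.values())))
    # Skip tokens that lack either a vector or topic weights.
    known = [t for t in tokens if t in word_vectors and t in topic_weights]
    count = max(len(known), 1)

    doc_vector: List[float] = []
    for k in range(num_topics):
        topic_sum = [0.0] * dim
        for tok in known:
            scale = word_weights.get(tok, 1.0) * topic_weights[tok][k]
            for d in range(dim):
                topic_sum[d] += scale * word_vectors[tok][d]
        # Append this topic's average; the result has num_topics * dim entries.
        doc_vector.extend(x / count for x in topic_sum)
    return doc_vector


# Toy inputs (illustrative only): two words, each belonging to one topic.
vectors = {"pulsar": [1.0, 0.0], "tensor": [0.0, 1.0]}
topics = {"pulsar": [1.0, 0.0], "tensor": [0.0, 1.0]}
weights = {"pulsar": 1.0, "tensor": 1.0}
doc = partition_average(["pulsar", "tensor"], vectors, topics, weights)
# doc == [0.5, 0.0, 0.0, 0.5]
```

A plain average of the same two words would give `[0.5, 0.5]`, which no longer shows which topic each direction came from; the concatenated form keeps the two topics in separate blocks.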

preprint | 2020 | arXiv

The UTMOST pulsar timing programme II: Timing noise across the pulsar population

While pulsars possess exceptional rotational stability, large scale timing studies have revealed at least two distinct types of irregularities in their rotation: red timing noise and glitches. Using modern Bayesian techniques, we investigated the timing noise properties of 300 bright southern-sky radio pulsars that have been observed over 1.0-4.8 years by the upgraded Molonglo Observatory Synthesis Telescope (MOST). We reanalysed the spin and spin-down changes associated with nine previously reported pulsar glitches, report the discovery of three new glitches and four unusual glitch-like events in the rotational evolution of PSR J1825$-$0935. We develop a refined Bayesian framework for determining how red noise strength scales with pulsar spin frequency ($\nu$) and spin-down frequency ($\dot{\nu}$), which we apply to a sample of 280 non-recycled pulsars. With this new method and a simple power-law scaling relation, we show that red noise strength scales across the non-recycled pulsar population as $\nu^{a} |\dot{\nu}|^{b}$, where $a = -0.84^{+0.47}_{-0.49}$ and $b = 0.97^{+0.16}_{-0.19}$. This method can be easily adapted to utilise more complex, astrophysically motivated red noise models. Lastly, we highlight our timing of the double neutron star PSR J0737$-$3039, and the rediscovery of a bright radio pulsar originally found during the first Molonglo pulsar surveys with an incorrectly catalogued position.