Researcher profile

Rishabh Jain

Rishabh Jain contributes to research discovery and scholarly infrastructure.

Researcher - Affiliation not imported - Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging - Verification L1 - Unclaimed author
9 works
0 followers
11 topics
4 close collaborators

Published work

9 published items

preprint - 2026 - arXiv

From Hype to Insight: Rethinking Large Language Model Integration in Visual Speech Recognition

Advances in self-supervised encoders have improved Visual Speech Recognition (VSR). Recent approaches that integrate these encoders with LLM decoders improve transcription accuracy; however, it remains unclear whether these gains stem from visual understanding or stronger language modeling. In this work, we systematically evaluate LLM decoders by freezing or selectively updating the visual encoder, scaling decoder size, comparing adaptation strategies and architectures, and varying training data across LRS2, LRS3, and their combination. Evaluation on LRS2, LRS3, and WildVSR shows that scaling and adaptation yield limited improvements, while combining datasets enhances generalization. Semantic analysis reveals that gains arise primarily from lexical rather than semantic processing. Our Llama-2-13B model trained on the combined set achieves 24.7% WER on LRS3 and 47.0% on WildVSR, establishing SOTA among models trained without additional supervision. Our findings indicate LLM decoders refine contextual reasoning rather than visual features, emphasizing the need for stronger visual encoders to drive meaningful progress.

preprint - 2025 - arXiv

A Super-Learner with Large Language Models for Medical Emergency Advising

Medical decision-support and advising systems are critical for emergency physicians to quickly and accurately assess patients' conditions and make diagnoses. Artificial Intelligence (AI) has emerged as a transformative force in healthcare in recent years, and Large Language Models (LLMs) have been employed in various fields of medical decision-support systems. We studied the responses of a group of different LLMs to real cases in emergency medicine. The results of our study of five of the most renowned LLMs showed significant differences in the capabilities of Large Language Models for diagnosing acute diseases in medical emergencies, with accuracy ranging between 58% and 65%. This accuracy significantly exceeds the reported accuracy of human doctors. We built a super-learner, MEDAS (Medical Emergency Diagnostic Advising System), from five major LLMs (Gemini, Llama, Grok, GPT, and Claude). The super-learner produces higher diagnostic accuracy, 70%, even with a quite basic meta-learner. However, at least one of the integrated LLMs in the same super-learner produces 85% correct diagnoses. The super-learner integrates a cluster of LLMs using a meta-learner capable of learning the different capabilities of each LLM, so that the diagnostic accuracy of the model draws on the collective capabilities of all LLMs in the cluster. The results of our study showed that the aggregated diagnostic accuracy provided by a meta-learning approach exceeds that of any individual LLM, suggesting that the super-learner can take advantage of the combined knowledge of the medical datasets used to train the group of LLMs.
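The abstract does not say which meta-learner MEDAS uses, so the following is only a minimal sketch of the general idea, assuming an accuracy-weighted vote as a stand-in for the meta-learner. The model names, diagnoses, and helper functions (`fit_weights`, `super_learner_predict`) are hypothetical and the data is made up for illustration.

```python
from collections import defaultdict


def fit_weights(train_preds, train_labels):
    """Assumed meta-learner: weight each model by its accuracy on labeled cases."""
    weights = {}
    for model, preds in train_preds.items():
        correct = sum(p == y for p, y in zip(preds, train_labels))
        weights[model] = correct / len(train_labels)
    return weights


def super_learner_predict(case_preds, weights):
    """Return the diagnosis whose supporting models carry the most total weight."""
    scores = defaultdict(float)
    for model, diagnosis in case_preds.items():
        scores[diagnosis] += weights[model]
    return max(scores, key=scores.get)


# Toy, invented predictions from three hypothetical models on four labeled cases.
train_preds = {
    "model_a": ["flu", "mi", "pe", "flu"],
    "model_b": ["flu", "pe", "pe", "mi"],
    "model_c": ["mi", "mi", "flu", "pe"],
}
train_labels = ["flu", "mi", "pe", "flu"]

weights = fit_weights(train_preds, train_labels)

# On a new case the two weaker models agree, but the stronger model outweighs them.
new_case = {"model_a": "pe", "model_b": "mi", "model_c": "mi"}
prediction = super_learner_predict(new_case, weights)
```

In this toy case a plain majority vote would pick "mi", while the weighted vote follows the model with the better track record. A real super-learner would typically fit the meta-learner with cross-validation rather than on the same cases it is scored on.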

preprint - 2022 - arXiv

A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis

Speech synthesis has come a long way as current text-to-speech (TTS) models can now generate natural human-sounding speech. However, most TTS research focuses on using adult speech data and there has been very limited work done on child speech synthesis. This study developed and validated a training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models using child speech datasets. This approach adopts a multi-speaker TTS retuning workflow to provide a transfer-learning pipeline. A publicly available child speech dataset was cleaned to provide a smaller subset of approximately 19 hours, which formed the basis of our fine-tuning experiments. Both subjective and objective evaluations were performed using a pretrained MOSNet for objective evaluation and a novel subjective framework for mean opinion score (MOS) evaluations. Subjective evaluations achieved a MOS of 3.95 for speech intelligibility, 3.89 for voice naturalness, and 3.96 for voice consistency. Objective evaluation using a pretrained MOSNet showed a strong correlation between real and synthetic child voices. Speaker similarity was also verified by calculating the cosine similarity between the embeddings of utterances. An automatic speech recognition (ASR) model was also used to provide a word error rate (WER) comparison between the real and synthetic child voices. The final trained TTS model was able to synthesize child-like speech from reference audio samples as short as 5 seconds.
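The two objective checks named in this abstract, cosine similarity between utterance embeddings and word error rate, can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the embeddings would come from a speaker encoder that the abstract does not specify, and the WER here uses plain whitespace tokenization without any text normalization.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
# Parallel vectors have similarity 1, orthogonal vectors have similarity 0.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Comparing the WER of an ASR model on real versus synthetic utterances of the same text gives a rough intelligibility check, while the cosine score indicates how close a synthetic utterance is to the reference speaker in embedding space.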

preprint - 2022 - arXiv

Detecting Heavy Higgs Bosons from Natural SUSY at a 100 TeV Hadron Collider

Supersymmetric models with radiatively-driven naturalness (RNS) enjoy low electroweak fine-tuning whilst respecting LHC search limits on gluinos and top squarks and allowing for $m_h\simeq 125$ GeV. While the heavier Higgs bosons $H,\ A$ may have TeV-scale masses, the SUSY-conserving $\mu$ parameter must lie in the few hundred GeV range. Thus, in natural SUSY models there should occur large heavy Higgs boson branching fractions to electroweakinos, with Higgs boson decays to higgsino plus gaugino dominating when they are kinematically accessible. These SUSY decays can open up new avenues for discovery. We investigate the prospects of discovering heavy neutral Higgs bosons $H$ and $A$ decaying into light plus heavy chargino pairs which can yield a four isolated lepton plus missing transverse energy signature at the LHC and at a future 100 TeV $pp$ collider. We find that discovery of heavy Higgs decay to electroweakinos via its $4\ell$ decay mode is very difficult at HL-LHC. For FCC-hh or SPPC, we study the $H,\ A \to $ SUSY reaction along with dominant physics backgrounds from the Standard Model and devise suitable selection requirements to extract a clean signal for FCC-hh or SPPC with $\sqrt{s}=100$ TeV, assuming an integrated luminosity of 15 $ab^{-1}$. We find that while a conventional cut-and-count analysis yields a signal statistical significance greater than $5\sigma$ for $m_{A,H}\sim 1.1-1.65$ TeV, a boosted-decision-tree analysis allows for heavy Higgs signal discovery at FCC-hh or SPPC for $m_{A,H}\sim 1-2$ TeV.

preprint - 2022 - arXiv

On Hybrid Quantum and Classical Computing Algorithms for Mixed-Integer Programming

Quantum computing is emerging as a new computing resource that could be superior to conventional computing for certain classes of optimization problems. However, in principle, most existing approaches to quantum optimization are intended to solve unconstrained binary programming problems, while mixed-integer linear programming is of most interest in practice. We attempt to bridge the gap between the capability of quantum computing and real-world applications by developing a new approach for mixed-integer programming. The approach applies Benders decomposition to decompose the mixed-integer programming problem into binary programming and linear programming sub-problems, which are solved by a noisy intermediate-scale quantum processor and a conventional processor, respectively. The algorithm is provably able to reach the optimal solution of the original mixed-integer programming problem. The algorithm is tested on a D-Wave 2000Q quantum processing unit and is shown to be effective for small-scale test cases. We also test the algorithm on a mixed-integer programming problem inspired by power system applications. Many insights are drawn from the numerical results for both the capabilities and limitations of the proposed algorithm.

preprint - 2022 - arXiv

Searching for Charged Higgs Bosons via $e^+ e^- \to H^+ H^- \to c\bar{b} \bar{c}b $ at Linear Colliders

We study a search for the charged Higgs boson via $e^+e^- \to H^+H^- \to c\bar{b}\bar{c}b$ at the 500 GeV ILC. In a general two Higgs doublet model without $Z_2$ symmetry, extra Yukawa couplings $\rho_{tt}$ and $\rho_{tc}$ can drive baryogenesis, but searches at the HL-LHC may still come up empty-handed if the couplings are relatively weak. Taking $m_{H^+ } \simeq m_H \simeq m_A \simeq 200$ GeV, with $\rho_{tt}$, $\rho_{tc}\sim 0.1$ and no $h(125)$-$H$ mixing, $H^+ \to c\bar b$ decay is dominant, and the $c\bar{b}\bar{c}b$ final state is likely overwhelmed by QCD background at the LHC. We show that the electroweak production of $H^+ H^-$ at the ILC is discoverable with integrated luminosity of 1 ab$^{-1}$. Furthermore, we show that $m_{H^+}$ can be extracted by requiring the two pairs of $b$ and light jets be roughly equal in mass, without assuming the mass value. Thus, the ILC can probe low mass Higgs bosons in multijet final states to complement HL-LHC in the future.

preprint - 2020 - arXiv

A Search for Technosignatures Around 31 Sun-like Stars with the Green Bank Telescope at 1.15-1.73 GHz

We conducted a search for technosignatures in April of 2018 and 2019 with the L-band receiver (1.15-1.73 GHz) of the 100 m diameter Green Bank Telescope. These observations focused on regions surrounding 31 Sun-like stars near the plane of the Galaxy. We present the results of our search for narrowband signals in this data set as well as improvements to our data processing pipeline. Specifically, we applied an improved candidate signal detection procedure that relies on the topographic prominence of the signal power, which nearly doubles the signal detection count of some previously analyzed data sets. We also improved the direction-of-origin filters that remove most radio frequency interference (RFI) to ensure that they uniquely link signals observed in separate scans. We performed a preliminary signal injection and recovery analysis to test the performance of our pipeline. We found that our pipeline recovers 93% of the injected signals over the usable frequency range of the receiver and 98% if we exclude regions with dense RFI. In this analysis, 99.73% of the recovered signals were correctly classified as technosignature candidates. Our improved data processing pipeline classified over 99.84% of the ~26 million signals detected in our data as RFI. Of the remaining candidates, 4539 were detected outside of known RFI frequency regions. The remaining candidates were visually inspected and verified to be of anthropogenic nature. Our search compares favorably to other recent searches in terms of end-to-end sensitivity, frequency drift rate coverage, and signal detection count per unit bandwidth per unit integration time.
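The candidate-detection step above relies on the topographic prominence of the signal power. The sketch below illustrates that concept only and is not the authors' pipeline: it assumes a one-dimensional list of power values, and the function names and the threshold are hypothetical. Prominence is computed as the peak height minus the higher of the two lowest points reached when walking left and right until a taller sample or the edge of the data.

```python
def peak_prominences(power):
    """Return {index: prominence} for every strict local maximum in `power`."""
    prominences = {}
    n = len(power)
    for i in range(1, n - 1):
        if not (power[i] > power[i - 1] and power[i] > power[i + 1]):
            continue  # not a strict local maximum
        # Walk left until a taller sample or the edge, tracking the lowest value seen.
        left_min = power[i]
        j = i - 1
        while j >= 0 and power[j] <= power[i]:
            left_min = min(left_min, power[j])
            j -= 1
        # Same walk to the right.
        right_min = power[i]
        k = i + 1
        while k < n and power[k] <= power[i]:
            right_min = min(right_min, power[k])
            k += 1
        # The higher of the two bases sets how far the peak stands out.
        prominences[i] = power[i] - max(left_min, right_min)
    return prominences


def detect_candidates(power, min_prominence):
    """Indices of peaks whose prominence meets the (assumed) threshold."""
    return [i for i, p in peak_prominences(power).items() if p >= min_prominence]


# Toy spectrum with three peaks: a moderate one, a shoulder, and a strong one.
spectrum = [1, 5, 2, 3, 2, 9, 1, 1]
all_peaks = peak_prominences(spectrum)
candidates = detect_candidates(spectrum, min_prominence=2)
```

Thresholding on prominence rather than raw power keeps a weak peak that stands clear of its surroundings while rejecting a shoulder that sits on the flank of a stronger signal, such as the peak at index 3 in the toy spectrum.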

preprint - 2020 - arXiv

Dialog without Dialog Data: Learning Visual Dialog Agents from VQA Data

Can we develop visually grounded dialog agents that can efficiently adapt to new tasks without forgetting how to talk to people? Such agents could leverage a larger variety of existing data to generalize to new tasks, minimizing expensive data collection and annotation. In this work, we study a setting we call "Dialog without Dialog", which requires agents to develop visually grounded dialog models that can adapt to new tasks without language level supervision. By factorizing intention and language, our model minimizes linguistic drift after fine-tuning for new tasks. We present qualitative results, automated metrics, and human studies that all show our model can adapt to new tasks and maintain language quality. Baselines either fail to perform well at new tasks or experience language drift, becoming unintelligible to humans. Code has been made available at https://github.com/mcogswell/dialog_without_dialog

preprint - 2019 - arXiv

nocaps: novel object captioning at scale

Image captioning models have achieved impressive results on datasets containing limited visual concepts and large amounts of paired image-caption training data. However, if these models are to ever function in the wild, a much larger variety of visual concepts must be learned, ideally from less supervision. To encourage the development of image captioning models that can learn visual concepts from alternative data sources, such as object detection datasets, we present the first large-scale benchmark for this task. Dubbed 'nocaps', for novel object captioning at scale, our benchmark consists of 166,100 human-generated captions describing 15,100 images from the OpenImages validation and test sets. The associated training data consists of COCO image-caption pairs, plus OpenImages image-level labels and object bounding boxes. Since OpenImages contains many more classes than COCO, nearly 400 object classes seen in test images have no or very few associated training captions (hence, nocaps). We extend existing novel object captioning models to establish strong baselines for this benchmark and provide analysis to guide future work on this task.