Researcher profile

Vikas Yadav

Vikas Yadav contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 17 - Unverified
Verification L1
Unclaimed author
4 works
0 followers
5 topics
4 close collaborators

Published work

4 published items

Preprint, 2026, arXiv

R2V Agent: Teaching SLMs When to Ask for Help

Efficient agentic systems should incur expensive frontier-model costs only on decisions where a cheaper local model is likely to fail. Existing LLM cascades usually route whole queries before execution, but task difficulty shifts mid-trajectory - after flaky tool calls, truncated observations, or compounding local errors - making pre-execution routing brittle. We introduce R2V-Agent, a risk-calibrated SLM-LLM routing framework for interactive agents. R2V combines four components: a distilled small language model (SLM) policy, a stronger teacher LLM, a lightweight process verifier that scores candidate actions at each step, and a calibrated step-level router. The router is our central contribution: after the SLM is trained, it estimates residual failure risk at each step and escalates only when teacher intervention is warranted. To make the routing problem well-defined, we first train a stable local SLM using a standard offline pipeline: behavioral cloning (BC) on teacher trajectories, followed by verifier-guided Direct Preference Optimization (DPO) with consistency regularization. The router is then trained on this fixed policy's residual failures using Brier-calibrated probability estimation and a Conditional Value-at-Risk (CVaR)-constrained objective that penalizes worst-case failures across perturbation seeds. Across HumanEval+, TextWorld, and TerminalBench with four SLM backbones, R2V improves the reliability-cost frontier: it achieves 94.3% HumanEval+ success with 0.60% LLM escalation, recovers TextWorld from 64.6% SLM-only success to 98.2% at 41.7% escalation, and reaches 93.3% TerminalBench success at 33.9% LLM calls, roughly half the heuristic-router cost.
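The abstract above names three quantities without defining them: a Brier score for calibration, a CVaR over perturbation seeds, and a step-level escalation decision. The following is a minimal sketch of what each one computes, not the authors' implementation. The function names, the default `alpha`, and the fixed `threshold` are assumptions made for illustration.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared gap between predicted failure probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def cvar(losses: Sequence[float], alpha: float = 0.8) -> float:
    """Average of roughly the worst (1 - alpha) fraction of losses (at least one)."""
    ordered = sorted(losses, reverse=True)
    k = max(1, round(len(ordered) * (1 - alpha)))
    return sum(ordered[:k]) / k


def should_escalate(failure_risk: float, threshold: float = 0.5) -> bool:
    """Hypothetical step-level rule: call the teacher LLM only when the
    estimated risk of the local model failing reaches the threshold."""
    return failure_risk >= threshold


# Per-seed failure rates of a fixed policy (made-up numbers for illustration).
seed_losses = [0.1, 0.2, 0.9, 0.4, 0.3]
worst_case = cvar(seed_losses, alpha=0.8)
```

A lower Brier score means the router's risk estimates track actual failures more closely. CVaR looks only at the tail of the per-seed losses, so a policy that is good on average but occasionally fails badly is still penalized.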

Preprint, 2022, arXiv

${\mathscr {M}}$cTEQ (${\mathscr {M}}$ ${\bf c}$hiral perturbation theory-compatible deconfinement ${\bf T}$emperature and ${\bf E}$ntanglement Entropy up to terms ${\bf Q}$uartic in curvature) and FM (${\bf F}$lavor ${\bf M}$emory)

A holographic computation of $T_c$ at ${\it intermediate\ coupling}$ from M-theory dual of thermal QCD-like theories, has been missing in the literature. Filling this gap, we demonstrate a novel UV-IR mixing, (conjecture and provide evidence for) a non-renormalization beyond 1 loop of ${\bf M}-{\bf c}$hiral perturbation theory arXiv:2011.04660[hep-th]-compatible deconfinement ${\bf T}$emperature, and show equivalence with an ${\bf E}$ntanglement (as well as Wald) entropy arXiv:0709.2140[hep-th] computation, up to terms ${\bf Q}$uartic in curvature. We demonstrate a ${\bf F}$lavor-${\bf M}$emory (FM) effect in the M-theory uplifts of the gravity duals, wherein the no-braner M-theory uplift retains the "memory" of the flavor D7-branes of the parent type IIB dual in the sense that a specific combination of the aforementioned quartic corrections to the metric components precisely along the compact part of the non-compact four-cycle "wrapped" by the flavor D7-branes, is what determines, e.g., the Einstein-Hilbert action at O$(R^4)$. The same linear combination of O$(R^4)$ metric corrections, upon matching the phenomenological value of the coupling constant of one of the SU(3) NLO ChPT Lagrangian, is required to have a definite sign. Interestingly, in the decompactification limit of the spatial circle, we ${\it derive}$ this, and obtain the values of the relevant O$(R^4)$ metric corrections. Further, equivalence with Wald entropy for the black hole at ${\cal O}(R^4)$ imposes a linear constraint on the same linear combination of metric corrections. Remarkably, when evaluating $T_c$ from an entanglement entropy computation in the thermal gravity dual, due to a delicate cancelation between the ${\cal O}(R^4)$ corrections from a subset of the abovementioned metric components, one sees that there are no corrections to $T_c$ at quartic order supporting the conjecture referred to above.

Preprint, 2020, arXiv

Quick and (not so) Dirty: Unsupervised Selection of Justification Sentences for Multi-hop Question Answering

We propose an unsupervised strategy for the selection of justification sentences for multi-hop question answering (QA) that (a) maximizes the relevance of the selected sentences, (b) minimizes the overlap between the selected facts, and (c) maximizes the coverage of both question and answer. This unsupervised sentence selection method can be coupled with any supervised QA approach. We show that the sentences selected by our method improve the performance of a state-of-the-art supervised QA model on two multi-hop QA datasets: AI2's Reasoning Challenge (ARC) and Multi-Sentence Reading Comprehension (MultiRC). We obtain new state-of-the-art performance on both datasets among approaches that do not use external resources for training the QA system: 56.82% F1 on ARC (41.24% on Challenge and 64.49% on Easy) and 26.1% EM0 on MultiRC. Our justification sentences have higher quality than the justifications selected by a strong information retrieval baseline, e.g., by 5.4% F1 in MultiRC. We also show that our unsupervised selection of justification sentences is more stable across domains than a state-of-the-art supervised sentence selection method.
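The three criteria in the abstract above (relevance, overlap, coverage) can be combined into one score per candidate set of sentences. The sketch below is one possible instance, under loud assumptions: relevance is approximated by plain token overlap with the question and answer rather than a retrieval score, and the way the terms are multiplied and divided is a guess, not the formula from the paper. All function names are hypothetical.

```python
from itertools import combinations


def tokens(text: str) -> set:
    """Lowercased whitespace tokens; a stand-in for real preprocessing."""
    return set(text.lower().split())


def score_set(sentences, question, answer):
    """Reward relevance and coverage, penalize overlap between sentences."""
    q, a = tokens(question), tokens(answer)
    sets = [tokens(s) for s in sentences]
    # (a) relevance: query tokens matched by each sentence (assumed proxy).
    relevance = sum(len(s & (q | a)) for s in sets)
    # (b) overlap: mean pairwise Jaccard similarity among selected sentences.
    pairs = list(combinations(sets, 2))
    overlap = sum(len(x & y) / len(x | y) for x, y in pairs) / len(pairs) if pairs else 0.0
    # (c) coverage of the question and of the answer by the selected set.
    covered = set().union(*sets)
    cov_q = len(covered & q) / max(1, len(q))
    cov_a = len(covered & a) / max(1, len(a))
    return relevance * cov_q * cov_a / (1.0 + overlap)


def select_justifications(candidates, question, answer, k=2):
    """Exhaustively score every size-k subset and return the best one."""
    best = max(combinations(candidates, k), key=lambda c: score_set(c, question, answer))
    return list(best)


candidates = [
    "plants need sunlight to grow",
    "plants grow",
    "water helps roots absorb nutrients",
    "cats sleep a lot",
]
picked = select_justifications(candidates, "what do plants need to grow", "sunlight and water")
```

In this toy example the redundant sentence "plants grow" loses to the sentence about water, because the latter adds answer coverage without overlapping the first pick.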

Preprint, 2020, arXiv

Unsupervised Alignment-based Iterative Evidence Retrieval for Multi-hop Question Answering

Evidence retrieval is a critical stage of question answering (QA), necessary not only to improve performance, but also to explain the decisions of the corresponding QA method. We introduce a simple, fast, and unsupervised iterative evidence retrieval method, which relies on three ideas: (a) an unsupervised alignment approach to soft-align questions and answers with justification sentences using only GloVe embeddings, (b) an iterative process that reformulates queries focusing on terms that are not covered by existing justifications, and (c) a stopping criterion that terminates retrieval when the terms in the given question and candidate answers are covered by the retrieved justifications. Despite its simplicity, our approach outperforms all the previous methods (including supervised methods) on the evidence selection task on two datasets: MultiRC and QASC. When these evidence sentences are fed into a RoBERTa answer classification component, we achieve state-of-the-art QA performance on these two datasets.
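The iterative loop described in the abstract above (retrieve, reformulate the query around uncovered terms, stop once everything is covered) can be sketched in a few lines. This sketch swaps the GloVe-based soft alignment for exact token matching, which is a deliberate simplification. The function names and the `max_hops` cap are assumptions for illustration.

```python
def tokens(text: str) -> set:
    """Lowercased whitespace tokens; a stand-in for real preprocessing."""
    return set(text.lower().split())


def retrieve(query: str, corpus: list, max_hops: int = 5) -> list:
    """Greedy iterative retrieval driven by query terms not yet covered."""
    remaining = tokens(query)  # terms from the question and candidate answer
    chosen = []
    for _ in range(max_hops):
        if not remaining:  # stopping criterion: every query term is covered
            break
        pool = [s for s in corpus if s not in chosen]
        # Exact-match alignment (assumed simplification of soft alignment).
        best = max(pool, key=lambda s: len(tokens(s) & remaining), default=None)
        if best is None or not tokens(best) & remaining:
            break  # nothing left that covers any uncovered term
        chosen.append(best)
        remaining -= tokens(best)  # reformulate: keep only uncovered terms
    return chosen


corpus = [
    "sunlight helps plants grow",
    "water is absorbed by roots",
    "cats sleep a lot",
]
evidence = retrieve("plants sunlight water", corpus)
```

Here the first hop picks the sentence covering "plants" and "sunlight", the query shrinks to "water", and the second hop retrieves the sentence about roots before the loop stops.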