Researcher profile

Vipin Chaudhary

Vipin Chaudhary contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
10works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

10 published item(s)

preprint2026arXiv

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)

Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM and DM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy in the BFloat16 weight representation of LLMs, which reveals significant inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) compact, hierarchical lookup tables (LUTs) that fit within GPU SRAM for efficient decoding, (ii) a two-phase GPU kernel for coordinating thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on Llama 3.3, Qwen 3, Mistral 3, FLUX.1, and others validate our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit identical outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 2.3--46.2x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.7--14.9x longer generation lengths than uncompressed models. Notably, our method enables lossless inference of Llama 3.1 405B, an 810GB model, on a single node equipped with 8x80GB GPUs.

preprint2026arXiv

AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models

Reasoning-capable large language models (LLMs) achieve strong performance on complex tasks but often exhibit overthinking after distillation, generating unnecessarily long chain-of-thought (CoT) reasoning even for simple inputs and incurring high inference cost. However, naively shortening reasoning length can degrade reasoning accuracy, as concise reasoning may be insufficient for certain inputs and lacks explicit supervision. We propose Auto Long-Short Reasoning (AutoL2S), a distillation framework that empowers non-reasoning LLMs to think thoroughly but only when necessary. AutoL2S first learns a lightweight switching token with verified long-short CoTs to enable instance-wise long-short reasoning selection. Then it leverages long-short reasoning rollouts induced by a switching token in a GRPO-style loss to improve reasoning efficiency while maintaining accuracy. Experiments demonstrate that AutoL2S effectively reduces reasoning length up to 71% with minimal accuracy loss, yielding markedly better trade-off in token length and inference time while preserving accuracy.

preprint2026arXiv

Mid-Think: Training-Free Intermediate-Budget Reasoning via Token-Level Triggers

Hybrid reasoning language models are commonly controlled through high-level Think/No-think instructions to regulate reasoning behavior, yet we found that such mode switching is largely driven by a small set of trigger tokens rather than the instructions themselves. Through attention analysis and controlled prompting experiments, we show that a leading ``Okay&#39;&#39; token induces reasoning behavior, while the newline pattern following ``</think>&#39;&#39; suppresses it. Based on this observation, we propose Mid-Think, a simple training-free prompting format that combines these triggers to achieve intermediate-budget reasoning, consistently outperforming fixed-token and prompt-based baselines in terms of the accuracy-length trade-off. Furthermore, applying Mid-Think to RL training after SFT reduces training time by approximately 15% while improving final performance of Qwen3-8B on AIME from 69.8% to 72.4% and on GPQA from 58.5% to 61.1%, demonstrating its effectiveness for both inference-time control and RL-based reasoning training.

preprint2026arXiv

Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.

preprint2026arXiv

Path-Lock Expert: Separating Reasoning Mode in Hybrid Thinking via Architecture-Level Separation

Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Existing work reduces this issue through better data curation and multi-stage training, yet leakage remains because both modes are still encoded in the same feed-forward parameters. We propose Path-Lock Expert (PLE), an architecture-level solution that replaces the single MLP in each decoder layer with two semantically locked experts, one for think and one for no-think, while keeping attention, embeddings, normalization, and the language-model head shared. A deterministic control-token router selects exactly one expert path for the entire sequence, so inference preserves the dense model's per-token computation pattern and each expert receives mode-pure updates during supervised fine-tuning. Across math and science reasoning benchmarks, PLE maintains strong think performance while producing a substantially stronger no-think mode that is more accurate, more concise, and far less prone to reasoning leakage. On Qwen3-4B, for example, PLE reduces no-think reflective tokens on AIME24 from 2.54 to 0.39 and improves no-think accuracy from 20.67% to 40.00%, all while preserving think-mode performance. These results suggest that controllable hybrid thinking is fundamentally an architectural problem, and separating mode-specific feed-forward pathways is a simple and effective solution.

preprint2026arXiv

Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation

Standard PII filters often miss contextual data leakage in RAG systems, such as non-regulated attribute clusters that collectively identify individuals. We introduce a Privacy Policy Enforcement (PPE) framework using dual one-class density estimators with fused text embeddings and a calibrated abstain region for out-of-distribution inputs. Using an axis-stratified, multi-LLM synthetic data pipeline across medicine, finance, and law, we found that traditional Gaussian Mixture baselines fail on borderline-safe stress tests by focusing on linguistic register rather than content. Our proposed T3+OCSVM detector, trained on safe and borderline-safe data, achieves a borderline AUROC of 0.93+ while reducing false positives by 44-55 percentage points and maintaining millisecond latency. Compared to supervised MLP classifiers or 14B-parameter LLM judges, our framework offers superior operational suitability, as the former suffers from high abstention rates and the latter from latency and calibration issues. This methodology provides a robust stress-testing standard for any synthetic-data-trained classifier.

preprint2026arXiv

Reliability-Gated Source Anchoring for Continual Test-Time Adaptation

Continual test-time adaptation (CTTA) updates a pretrained model online on an unlabeled, non-stationary stream while anchoring it to a frozen source checkpoint. This anchor is useful only when the source remains reliable. On CCC-Hard, however, a ResNet-50 source falls to approximately $1.3\%$ top-$1$ accuracy, while existing source-anchored CTTA methods continue applying the same anchor strength. We call this failure mode blind anchoring and propose RMemSafe, a reliability-gated extension of ROID that uses the frozen source's normalized predictive entropy to attenuate all explicit source-coupled uses in the objective. When the source posterior approaches uniformity, the gate closes: the source anchor and agreement filter vanish, and the objective reduces to a source-agnostic fallback comprising ROID's base losses plus marginal calibration. Combined with ASR, RMemSafe achieves the lowest error on $8$ of $9$ matched-split continual-corruption cells and is the best reset-based method on all $9$, improving ROID+ASR by $1.05$~pp on ResNet-50 and $0.48$~pp on ViT-B/16. A controlled source-degradation sweep shows a $1.13{\times}$ shallower harm slope than ROID+ASR, consistent with the graceful-decay prediction. The entropy gate detects high-entropy source collapse, not confidently wrong low-entropy sources; this scope is explicitly evaluated and discussed.

preprint2023arXiv

Report on 2023 CyberTraining PI Meeting, 26-27 September 2023

This document describes a two-day meeting held for the Principal Investigators (PIs) of NSF CyberTraining grants. The report covers invited talks, panels, and six breakout sessions. The meeting involved over 80 PIs and NSF program managers (PMs). The lessons recorded in detail in the report are a wealth of information that could help current and future PIs, as well as NSF PMs, understand the future directions suggested by the PI community. The meeting was held simultaneously with that of the PIs of the NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) program. This co-location led to two joint sessions: one with NSF speakers and the other on broader impact. Further, the joint poster and refreshment sessions benefited from the interactions between CSSI and CyberTraining PIs.

preprint2022arXiv

Give me a knee radiograph, I will tell you where the knee joint area is: a deep convolutional neural network adventure

Knee pain is undoubtedly the most common musculoskeletal symptom that impairs quality of life, confines mobility and functionality across all ages. Knee pain is clinically evaluated by routine radiographs, where the widespread adoption of radiographic images and their availability at low cost, make them the principle component in the assessment of knee pain and knee pathologies, such as arthritis, trauma, and sport injuries. However, interpretation of the knee radiographs is still highly subjective, and overlapping structures within the radiographs and the large volume of images needing to be analyzed on a daily basis, make interpretation challenging for both naive and experienced practitioners. There is thus a need to implement an artificial intelligence strategy to objectively and automatically interpret knee radiographs, facilitating triage of abnormal radiographs in a timely fashion. The current work proposes an accurate and effective pipeline for autonomous detection, localization, and classification of knee joint area in plain radiographs combining the You Only Look Once (YOLO v3) deep convolutional neural network with a large and fully-annotated knee radiographs dataset. The present work is expected to stimulate more interest from the deep learning computer vision community to this pragmatic and clinical application.

preprint2022arXiv

Irrelevant Pixels are Everywhere: Find and Exclude Them for More Efficient Computer Vision

Computer vision is often performed using Convolutional Neural Networks (CNNs). CNNs are compute-intensive and challenging to deploy on power-contrained systems such as mobile and Internet-of-Things (IoT) devices. CNNs are compute-intensive because they indiscriminately compute many features on all pixels of the input image. We observe that, given a computer vision task, images often contain pixels that are irrelevant to the task. For example, if the task is looking for cars, pixels in the sky are not very useful. Therefore, we propose that a CNN be modified to only operate on relevant pixels to save computation and energy. We propose a method to study three popular computer vision datasets, finding that 48% of pixels are irrelevant. We also propose the focused convolution to modify a CNN&#39;s convolutional layers to reject the pixels that are marked irrelevant. On an embedded device, we observe no loss in accuracy, while inference latency, energy consumption, and multiply-add count are all reduced by about 45%.