Source author record

Vipin Chaudhary

Vipin Chaudhary appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Machine Learning; Artificial Intelligence; Computation and Language; Computer Vision; Networking and Internet Architecture; Cryptography and Security; cs.CY; Distributed, Parallel, and Cluster Computing; eess.IV; Robotics

Catalog footprint

What is connected

14 works

10 topics

4 close collaborators


preprint · 2026 · arXiv

70% Size, 100% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float (DFloat11)

Large-scale AI models, such as Large Language Models (LLMs) and Diffusion Models (DMs), have grown rapidly in size, creating significant challenges for efficient deployment on resource-constrained hardware. In this paper, we introduce Dynamic-Length Float (DFloat11), a lossless compression framework that reduces LLM and DM size by 30% while preserving outputs that are bit-for-bit identical to the original model. DFloat11 is motivated by the low entropy in the BFloat16 weight representation of LLMs, which reveals significant inefficiency in the existing storage format. By applying entropy coding, DFloat11 assigns dynamic-length encodings to weights based on frequency, achieving near information-optimal compression without any loss of precision. To facilitate efficient inference with dynamic-length encodings, we develop a custom GPU kernel for fast online decompression. Our design incorporates the following: (i) compact, hierarchical lookup tables (LUTs) that fit within GPU SRAM for efficient decoding, (ii) a two-phase GPU kernel for coordinating thread read/write positions using lightweight auxiliary variables, and (iii) transformer-block-level decompression to minimize latency. Experiments on Llama 3.3, Qwen 3, Mistral 3, FLUX.1, and others validate our hypothesis that DFloat11 achieves around 30% model size reduction while preserving bit-for-bit identical outputs. Compared to a potential alternative of offloading parts of an uncompressed model to the CPU to meet memory constraints, DFloat11 achieves 2.3-46.2x higher throughput in token generation. With a fixed GPU memory budget, DFloat11 enables 5.7-14.9x longer generation lengths than uncompressed models. Notably, our method enables lossless inference of Llama 3.1 405B, an 810GB model, on a single node equipped with 8x80GB GPUs.
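The low-entropy observation in this abstract can be illustrated with a small sketch. It is not the authors' GPU kernel or their exact encoding: it assumes that the redundancy sits mainly in the 8-bit exponent field, uses a plain Huffman code as the entropy coder, truncates float32 to BFloat16, and draws synthetic Gaussian weights in place of a real checkpoint. All function names are hypothetical.

```python
import heapq
import math
import random
import struct
from collections import Counter


def bf16_exponent(x: float) -> int:
    """Return the 8-bit exponent field of x after truncating float32 to BFloat16."""
    bits32 = struct.unpack(">I", struct.pack(">f", x))[0]
    bits16 = bits32 >> 16  # keep sign (1) + exponent (8) + mantissa (7)
    return (bits16 >> 7) & 0xFF


def shannon_entropy(counts: Counter) -> float:
    """Entropy in bits of the empirical symbol distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


def huffman_code_lengths(counts: Counter) -> dict:
    """Code length in bits assigned to each symbol by a Huffman code."""
    if len(counts) == 1:
        return {next(iter(counts)): 1}
    # Heap items: (frequency, unique tie-breaker, symbols in this subtree).
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    tie = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:  # every merge adds one bit to each contained symbol
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, tie, s1 + s2))
        tie += 1
    return lengths


random.seed(0)
# Assumption: synthetic weights standing in for a trained model's parameters.
weights = [random.gauss(0.0, 0.02) for _ in range(50_000)]
counts = Counter(bf16_exponent(w) for w in weights)

entropy_bits = shannon_entropy(counts)
lengths = huffman_code_lengths(counts)
total = sum(counts.values())
avg_exponent_bits = sum(counts[s] * lengths[s] for s in counts) / total

# Sign (1 bit) and mantissa (7 bits) are stored as-is in this sketch.
bits_per_weight = 1 + 7 + avg_exponent_bits
print(f"exponent entropy: {entropy_bits:.2f} of 8 bits")
print(f"average coded size: {bits_per_weight:.2f} of 16 bits per weight")
```

Because only frequently occurring exponent values get short codes and every weight still decodes to its original bits, a scheme of this kind is lossless by construction; the decoding speed on GPU is the part the abstract's kernel design addresses and is outside this sketch.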

preprint · 2026 · arXiv

AutoL2S: Auto Long-Short Reasoning for Efficient Large Language Models

Reasoning-capable large language models (LLMs) achieve strong performance on complex tasks but often exhibit overthinking after distillation, generating unnecessarily long chain-of-thought (CoT) reasoning even for simple inputs and incurring high inference cost. However, naively shortening reasoning length can degrade reasoning accuracy, as concise reasoning may be insufficient for certain inputs and lacks explicit supervision. We propose Auto Long-Short Reasoning (AutoL2S), a distillation framework that empowers non-reasoning LLMs to think thoroughly but only when necessary. AutoL2S first learns a lightweight switching token with verified long-short CoTs to enable instance-wise long-short reasoning selection. Then it leverages long-short reasoning rollouts induced by a switching token in a GRPO-style loss to improve reasoning efficiency while maintaining accuracy. Experiments demonstrate that AutoL2S effectively reduces reasoning length by up to 71% with minimal accuracy loss, yielding a markedly better trade-off in token length and inference time while preserving accuracy.

preprint · 2026 · arXiv

Mid-Think: Training-Free Intermediate-Budget Reasoning via Token-Level Triggers

Hybrid reasoning language models are commonly controlled through high-level Think/No-think instructions to regulate reasoning behavior, yet we found that such mode switching is largely driven by a small set of trigger tokens rather than the instructions themselves. Through attention analysis and controlled prompting experiments, we show that a leading "Okay" token induces reasoning behavior, while the newline pattern following "</think>" suppresses it. Based on this observation, we propose Mid-Think, a simple training-free prompting format that combines these triggers to achieve intermediate-budget reasoning, consistently outperforming fixed-token and prompt-based baselines in terms of the accuracy-length trade-off. Furthermore, applying Mid-Think to RL training after SFT reduces training time by approximately 15% while improving final performance of Qwen3-8B on AIME from 69.8% to 72.4% and on GPQA from 58.5% to 61.1%, demonstrating its effectiveness for both inference-time control and RL-based reasoning training.

preprint · 2026 · arXiv

Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

Vision-Language-Action (VLA) models achieve remarkable flexibility and generalization beyond classical control paradigms. However, most prevailing VLAs are trained under a single-frame observation paradigm, which leaves them structurally blind to temporal dynamics. Consequently, these models degrade severely in non-stationary scenarios, even when trained or finetuned on dynamic datasets. Existing approaches either require expensive retraining or suffer from latency bottlenecks and poor temporal consistency across action chunks. We propose Pace-and-Path Correction, a training-free, closed-form inference-time operator that wraps any chunked-action VLA. From a single quadratic cost, joint minimization yields a unified solution that decomposes orthogonally into two distinct channels. The pace channel compresses execution along the planned direction, while the path channel applies an orthogonal spatial offset, jointly absorbing the perceived dynamics within the chunk window. We evaluate our approach on a comprehensive diagnostic benchmark MoveBench designed to isolate motion as the sole controlled variable. Empirical results demonstrate that our framework consistently outperforms state-of-the-art training-free wrappers and dynamic-adaptive methods and improves success rates by up to 28.8% and 25.9% in absolute terms over foundational VLA models in dynamic-only and static-dynamic mixed environments, respectively.
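The orthogonal split into a pace channel and a path channel can be pictured with an ordinary vector projection. The sketch below is only that geometric picture, not the paper's closed-form operator or its quadratic cost: the function name, the example vectors, and the idea of feeding it a "perceived offset" are all assumptions made for illustration.

```python
import math


def decompose(offset, direction):
    """Split `offset` into parts parallel and orthogonal to `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    unit = [d / norm for d in direction]
    along = sum(o * u for o, u in zip(offset, unit))  # signed length along the plan
    parallel = [along * u for u in unit]
    orthogonal = [o - p for o, p in zip(offset, parallel)]
    return parallel, orthogonal


# Hypothetical values: the planned motion direction of an action chunk and
# the displacement of the target perceived during that chunk.
planned_direction = [1.0, 0.0, 0.0]
perceived_offset = [0.3, -0.2, 0.1]

pace_part, path_part = decompose(perceived_offset, planned_direction)
print("along the plan (pace-like):", pace_part)
print("across the plan (path-like):", path_part)
```

The two parts sum back to the original offset and the second is perpendicular to the planned direction, which is the property that lets the two channels be adjusted independently.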

preprint · 2026 · arXiv

Path-Lock Expert: Separating Reasoning Mode in Hybrid Thinking via Architecture-Level Separation

Hybrid-thinking language models expose explicit think and no-think modes, but current designs do not separate them cleanly. Even in no-think mode, models often emit long and self-reflective responses, causing reasoning leakage. Existing work reduces this issue through better data curation and multi-stage training, yet leakage remains because both modes are still encoded in the same feed-forward parameters. We propose Path-Lock Expert (PLE), an architecture-level solution that replaces the single MLP in each decoder layer with two semantically locked experts, one for think and one for no-think, while keeping attention, embeddings, normalization, and the language-model head shared. A deterministic control-token router selects exactly one expert path for the entire sequence, so inference preserves the dense model's per-token computation pattern and each expert receives mode-pure updates during supervised fine-tuning. Across math and science reasoning benchmarks, PLE maintains strong think performance while producing a substantially stronger no-think mode that is more accurate, more concise, and far less prone to reasoning leakage. On Qwen3-4B, for example, PLE reduces no-think reflective tokens on AIME24 from 2.54 to 0.39 and improves no-think accuracy from 20.67% to 40.00%, all while preserving think-mode performance. These results suggest that controllable hybrid thinking is fundamentally an architectural problem, and separating mode-specific feed-forward pathways is a simple and effective solution.
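The routing idea described here, one expert path chosen by a control token and held for the whole sequence, can be sketched in a few lines. This is a toy under stated assumptions, not the PLE implementation: the control-token strings, the class name, and the two stand-in "experts" (simple arithmetic functions instead of MLPs) are all hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]

# Hypothetical control tokens; the abstract does not name the actual ones.
THINK, NO_THINK = "<think>", "<no_think>"


def think_expert(h: Vector) -> Vector:
    """Stand-in for the think-mode MLP."""
    return [2.0 * x for x in h]


def no_think_expert(h: Vector) -> Vector:
    """Stand-in for the no-think-mode MLP."""
    return [x + 1.0 for x in h]


class PathLockedLayer:
    """Feed-forward slot with two experts and a deterministic router."""

    def __init__(self, experts: Dict[str, Callable[[Vector], Vector]]):
        self.experts = experts

    def forward(self, control_token: str, hidden_states: List[Vector]) -> List[Vector]:
        # The expert is selected once per sequence, never per token,
        # so each token still passes through exactly one MLP.
        expert = self.experts[control_token]
        return [expert(h) for h in hidden_states]


layer = PathLockedLayer({THINK: think_expert, NO_THINK: no_think_expert})
sequence = [[1.0, 2.0], [3.0, 4.0]]
think_out = layer.forward(THINK, sequence)
no_think_out = layer.forward(NO_THINK, sequence)
```

Since the router is a dictionary lookup rather than a learned gate, a sequence in one mode never touches the other expert, which is what the abstract means by mode-pure updates during fine-tuning.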

preprint · 2026 · arXiv

Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation

Standard PII filters often miss contextual data leakage in RAG systems, such as non-regulated attribute clusters that collectively identify individuals. We introduce a Privacy Policy Enforcement (PPE) framework using dual one-class density estimators with fused text embeddings and a calibrated abstain region for out-of-distribution inputs. Using an axis-stratified, multi-LLM synthetic data pipeline across medicine, finance, and law, we found that traditional Gaussian Mixture baselines fail on borderline-safe stress tests by focusing on linguistic register rather than content. Our proposed T3+OCSVM detector, trained on safe and borderline-safe data, achieves a borderline AUROC of 0.93+ while reducing false positives by 44-55 percentage points and maintaining millisecond latency. Compared to supervised MLP classifiers or 14B-parameter LLM judges, our framework offers superior operational suitability, as the former suffers from high abstention rates and the latter from latency and calibration issues. This methodology provides a robust stress-testing standard for any synthetic-data-trained classifier.

preprint · 2026 · arXiv

Reliability-Gated Source Anchoring for Continual Test-Time Adaptation

Continual test-time adaptation (CTTA) updates a pretrained model online on an unlabeled, non-stationary stream while anchoring it to a frozen source checkpoint. This anchor is useful only when the source remains reliable. On CCC-Hard, however, a ResNet-50 source falls to approximately 1.3% top-1 accuracy, while existing source-anchored CTTA methods continue applying the same anchor strength. We call this failure mode blind anchoring and propose RMemSafe, a reliability-gated extension of ROID that uses the frozen source's normalized predictive entropy to attenuate all explicit source-coupled uses in the objective. When the source posterior approaches uniformity, the gate closes: the source anchor and agreement filter vanish, and the objective reduces to a source-agnostic fallback comprising ROID's base losses plus marginal calibration. Combined with ASR, RMemSafe achieves the lowest error on 8 of 9 matched-split continual-corruption cells and is the best reset-based method on all 9, improving ROID+ASR by 1.05 pp on ResNet-50 and 0.48 pp on ViT-B/16. A controlled source-degradation sweep shows a 1.13× shallower harm slope than ROID+ASR, consistent with the graceful-decay prediction. The entropy gate detects high-entropy source collapse, not confidently wrong low-entropy sources; this scope is explicitly evaluated and discussed.
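A gate driven by normalized predictive entropy can be sketched as follows. The abstract states only that the gate closes as the source posterior approaches uniformity; the specific form used here, one minus the normalized entropy, is an assumption for illustration, and the function names are hypothetical.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy of `probs` divided by its maximum, log(K)."""
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two classes")
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(k)


def source_gate(probs):
    """Assumed gate: 1 when the source is fully confident, 0 when uniform."""
    return 1.0 - normalized_entropy(probs)


uniform = [0.25, 0.25, 0.25, 0.25]    # collapsed source: gate closes
confident = [1.0, 0.0, 0.0, 0.0]      # confident source: gate stays open
in_between = [0.7, 0.1, 0.1, 0.1]

for name, p in [("uniform", uniform), ("confident", confident), ("in between", in_between)]:
    print(f"{name}: gate = {source_gate(p):.3f}")
```

A confident prediction keeps the gate open whether or not it is correct, which matches the limitation the abstract itself notes: the gate detects high-entropy collapse, not confidently wrong sources.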

preprint · 2023 · arXiv

Report on 2023 CyberTraining PI Meeting, 26-27 September 2023

This document describes a two-day meeting held for the Principal Investigators (PIs) of NSF CyberTraining grants. The report covers invited talks, panels, and six breakout sessions. The meeting involved over 80 PIs and NSF program managers (PMs). The lessons recorded in detail in the report are a wealth of information that could help current and future PIs, as well as NSF PMs, understand the future directions suggested by the PI community. The meeting was held simultaneously with that of the PIs of the NSF Cyberinfrastructure for Sustained Scientific Innovation (CSSI) program. This co-location led to two joint sessions: one with NSF speakers and the other on broader impact. Further, the joint poster and refreshment sessions benefited from the interactions between CSSI and CyberTraining PIs.

preprint · 2022 · arXiv

Give me a knee radiograph, I will tell you where the knee joint area is: a deep convolutional neural network adventure

Knee pain is undoubtedly the most common musculoskeletal symptom that impairs quality of life and limits mobility and functionality across all ages. Knee pain is clinically evaluated by routine radiographs, where the widespread adoption of radiographic images and their availability at low cost make them the principal component in the assessment of knee pain and knee pathologies, such as arthritis, trauma, and sports injuries. However, interpretation of knee radiographs is still highly subjective, and overlapping structures within the radiographs and the large volume of images needing to be analyzed on a daily basis make interpretation challenging for both naive and experienced practitioners. There is thus a need to implement an artificial intelligence strategy to objectively and automatically interpret knee radiographs, facilitating triage of abnormal radiographs in a timely fashion. The current work proposes an accurate and effective pipeline for autonomous detection, localization, and classification of the knee joint area in plain radiographs, combining the You Only Look Once (YOLO v3) deep convolutional neural network with a large and fully annotated knee radiograph dataset. The present work is expected to stimulate more interest from the deep learning computer vision community in this pragmatic and clinical application.

preprint · 2022 · arXiv

Irrelevant Pixels are Everywhere: Find and Exclude Them for More Efficient Computer Vision

Computer vision is often performed using Convolutional Neural Networks (CNNs). CNNs are compute-intensive and challenging to deploy on power-constrained systems such as mobile and Internet-of-Things (IoT) devices. CNNs are compute-intensive because they indiscriminately compute many features on all pixels of the input image. We observe that, given a computer vision task, images often contain pixels that are irrelevant to the task. For example, if the task is looking for cars, pixels in the sky are not very useful. Therefore, we propose that a CNN be modified to only operate on relevant pixels to save computation and energy. We propose a method to study three popular computer vision datasets, finding that 48% of pixels are irrelevant. We also propose the focused convolution to modify a CNN's convolutional layers to reject the pixels that are marked irrelevant. On an embedded device, we observe no loss in accuracy, while inference latency, energy consumption, and multiply-add count are all reduced by about 45%.
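The saving from skipping irrelevant pixels can be made concrete with a toy masked convolution. This is one simple reading of the idea, not the paper's focused convolution: the function name, the zero-padding choice, the rule of writing zeros at skipped pixels, and the tiny example image are all assumptions.

```python
def masked_conv2d(image, kernel, relevant):
    """Zero-padded 2D convolution evaluated only where `relevant` is True.

    Returns the output map and the number of multiply-adds performed.
    """
    height, width = len(image), len(image[0])
    k = len(kernel)
    pad = k // 2
    output = [[0.0] * width for _ in range(height)]
    macs = 0
    for r in range(height):
        for c in range(width):
            if not relevant[r][c]:
                continue  # irrelevant pixel: no work, output stays zero
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr, cc = r + i - pad, c + j - pad
                    if 0 <= rr < height and 0 <= cc < width:
                        acc += image[rr][cc] * kernel[i][j]
                        macs += 1
            output[r][c] = acc
    return output, macs


image = [[1.0] * 4 for _ in range(4)]
kernel = [[1.0] * 3 for _ in range(3)]
everything = [[True] * 4 for _ in range(4)]
# Hypothetical mask: the top two rows are "sky" and marked irrelevant.
road_only = [[False] * 4, [False] * 4, [True] * 4, [True] * 4]

dense_out, dense_macs = masked_conv2d(image, kernel, everything)
focused_out, focused_macs = masked_conv2d(image, kernel, road_only)
print(f"multiply-adds: dense={dense_macs}, focused={focused_macs}")
```

On the relevant rows the focused result matches the dense one exactly, so any accuracy impact depends only on whether the mask wrongly excludes pixels the task needs.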

preprint · 2016 · arXiv

Quantified Spectrum Sharing: Motivation, Approach, and Benefits

A significant portion of the radio frequency spectrum remains underutilized with exclusive and static allocation of spectrum. The growing demand for spectrum has spurred a need for a dynamic spectrum sharing paradigm. While the new dynamic spectrum sharing paradigm helps to improve utilization of the precious spectrum resource, there exist several obstacles on the technical, regulatory, and business fronts for the adoption of the new paradigm. In this paper, we investigate the limitations of the existing techniques and argue for a quantified approach to dynamic spectrum sharing and management. We introduce a quantified approach to spectrum sharing based on defining and enforcing quantified spectrum-access rights. By discretizing the spectrum-space in the time, space, and frequency dimensions, this approach enables quantifying the spectrum consumed by individual transceivers. It enables defining and enforcing a quantified spectrum-access policy in real time. The proposed quantified approach brings simplicity, precision, and efficiency to spectrum commerce and operations while addressing the key technical and regulatory challenges.

preprint · 2015 · arXiv

MUSE: A Methodology for Characterizing and Quantifying the Use of Spectrum

The dynamic spectrum sharing paradigm is envisaged to meet the growing demand for the Radio Frequency (RF) spectrum. There exist several technical, regulatory, and business impediments to adopting the new paradigm. In this regard, we underscore the need for characterizing and quantifying the use of spectrum by each of the individual transmitters and receivers. We propose MUSE, a methodology to characterize and quantify the use of spectrum in the space, time, and frequency dimensions. MUSE characterizes the use of spectrum by a transmitter at a point in terms of the RF power occupied by the transmitter. It characterizes the use of spectrum by a receiver at a point in terms of the constraints on the RF power that can be occupied by any of the transmitters in the system in order to ensure successful reception. It divides the spectrum-space into discrete unit-spectrum-spaces and quantifies the spectrum used by the individual transceivers in the discretized spectrum-space. We characterize the performance of the spectrum management functions in the discretized spectrum-space and illustrate maximizing the use of spectrum. In order to address the challenges for the dynamic spectrum sharing paradigm, we emphasize articulating, defining, and enforcing the spectrum-access rights in the discretized spectrum-space.

preprint · 2014 · arXiv

Going Towards Discretized Spectrum Space: Quantification of Spectrum Consumption Spaces and a Quantified Spectrum Access Paradigm

The dynamic spectrum sharing approach is a paradigm shift from the conventional static and exclusive approach to spectrum allocation. The existing methodologies to define use of the spectrum and quantify its efficiency are based on the static spectrum assignment paradigm and are not suitable for the dynamic spectrum sharing paradigm. There is a need to separately quantify the spectrum consumed by the individual transmitters and receivers when multiple heterogeneous wireless networks are sharing the spectrum in the time, space, and frequency dimensions. By discretizing the spectrum dimensions, we define a methodology for quantifying the spectrum consumption spaces. This is an attempt to adopt the discretized signal processing principle and apply it to spectrum management functions, which would bring simplicity, flexibility, and precision, among other advantages.

preprint · 2014 · arXiv

Maximizing Spectrum Availability and Exploitation: How to Maximize Spectrum Sharing Benefits to the Incumbents?

A significant portion of the radio frequency spectrum remains underutilized due to exclusive and static allocation of spectrum. Provisioning secondary access to the underutilized spectrum could be beneficial to the incumbents if they could gain significant value out of the fallow spectrum while ensuring protection of their primary services. From an incumbent perspective, the spectrum sharing approach needs to be non-harmful as well as efficient. In order to make spectrum sharing efficient, it is necessary to maximize the spectrum available for secondary access as well as maximize its exploitation. We examine the impact of the conservative assumptions that reduce the availability of spectrum for secondary access. The problem of joint scheduling and spectrum-access footprint allocation is at the heart of maximizing the exploitation. This problem is NP-hard, and we present a suboptimal approach based on the minimal spectrum consumption cost of a spectrum-access request. In order to improve the spectrum sharing potential, we investigate the impact of the various design choices for a spectrum access mechanism. The experiments demonstrate the significance of the active role of the incumbents, the benefits of fine-grained spectrum access, and the need for transceiver standards for accomplishing efficient usage of the spectrum.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint

Fields this researcher appears in

Source provenance

Where this author record came from

arxiv · confidence 95%

external id: arxiv:2504.11651:author:4:vipin-chaudhary

Imported May 21, 2026 · Synced May 21, 2026

arxiv · confidence 95%

external id: arxiv:2605.14063:author:11:vipin-chaudhary

Imported May 20, 2026 · Synced May 21, 2026

arxiv · confidence 95%

external id: arxiv:2605.11459:author:9:vipin-chaudhary

Imported May 20, 2026 · Synced May 21, 2026

arxiv · confidence 95%

external id: arxiv:2605.17034:author:7:vipin-chaudhary

Imported May 20, 2026 · Synced May 21, 2026

arxiv · confidence 95%

external id: arxiv:2604.27201:author:9:vipin-chaudhary

Imported May 20, 2026 · Synced May 20, 2026

5 works

Vikash Singh

Researcher

Vikash Singh contributes to research discovery and scholarly infrastructure.


4 works

Debargha Ganguly

Researcher

Debargha Ganguly contributes to research discovery and scholarly infrastructure.


4 works

Nilesh Khambekar

Researcher

Nilesh Khambekar contributes to research discovery and scholarly infrastructure.


3 works

Chad M. Spooner

Researcher

Chad M. Spooner contributes to research discovery and scholarly infrastructure.

