Source author record

Nirmal Patel

Nirmal Patel appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher | Unclaimed source record

Artificial Intelligence, gr-qc, Machine Learning, physics.comp-ph

Catalog footprint

What is connected

2 works

4 topics

4 close collaborators

preprint | 2026 | arXiv

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

The alignment of Large Language Models (LLMs) utilizes Reinforcement Learning from AI Feedback (RLAIF) for non-verifiable domains such as long-form question answering and open-ended instruction following. These domains often rely on LLM-based auto-raters to provide granular, multi-tier discrete rewards (e.g., 1-10 rubrics) that are inherently stochastic due to prompt sensitivity and sampling randomness. We empirically verify the stochasticity of auto-raters, which can propagate to and corrupt standard advantage estimators such as GRPO and MaxRL, because noisy reward samples can skew normalization statistics and degrade the global learning signal. Empirically, sampling more rewards and taking a majority vote may reduce the noise and improve performance, but this approach is computationally expensive. To address this bottleneck, we introduce Ordinal Decomposition for Robust Policy Optimization (ODRPO), a framework that structurally isolates evaluation noise by decomposing discrete rewards into a sequence of ordinal binary indicators. By independently computing and accumulating advantages across these progressively challenging success thresholds, ODRPO prevents outlier evaluations from corrupting the global update while establishing an implicit, variance-aware learning curriculum. Empirically, ODRPO achieves robust performance on Qwen2.5-7B and Qwen3-4B models, outperforming baselines with relative improvements of up to 14.8% on FACTS-grounding-v2 and 7.5% on Alpaca-Evals. Critically, these gains are achieved with negligible training-time overhead, as ODRPO requires no additional compute per step compared to standard estimators. Supported by theoretical analysis confirming its optimization stability, ODRPO provides a scalable and robust framework for aligning models within the noisy, discrete evaluation landscape of modern RLAIF.
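
The decomposition described in the abstract above can be illustrated with a minimal Python sketch. The abstract does not give the exact estimator, so several choices here are assumptions: the indicator is taken to be "reward >= threshold", each threshold uses a GRPO-style mean-and-standard-deviation normalization within the sampled group, thresholds where every sample agrees contribute zero, and per-threshold advantages are accumulated by a plain sum. The function names `group_normalized_advantages` and `ordinal_decomposed_advantages` are hypothetical and do not come from the paper.

```python
from statistics import mean, pstdev


def group_normalized_advantages(values, eps=1e-8):
    """GRPO-style advantage: center by the group mean, scale by the group std.

    Assumption: a group with (near-)zero spread carries no signal, so it
    returns all zeros instead of dividing by a tiny number.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma < eps:
        return [0.0] * len(values)
    return [(v - mu) / sigma for v in values]


def ordinal_decomposed_advantages(rewards, thresholds):
    """Sum per-threshold advantages computed on binary indicators.

    rewards:    discrete scores for one group of sampled responses
    thresholds: ordered success thresholds, e.g. range(2, 11) for a 1-10 rubric
    """
    total = [0.0] * len(rewards)
    for k in thresholds:
        # Ordinal binary indicator: did this response reach level k?
        indicators = [1.0 if r >= k else 0.0 for r in rewards]
        advantages = group_normalized_advantages(indicators)
        # Assumed accumulation rule: unweighted sum across thresholds.
        total = [t + a for t, a in zip(total, advantages)]
    return total


if __name__ == "__main__":
    # Four sampled responses scored on a 1-10 rubric (made-up numbers).
    rewards = [3, 7, 7, 8]
    print(ordinal_decomposed_advantages(rewards, range(2, 11)))
```

In this sketch each threshold is normalized on its own, so a single extreme score can only flip the indicators of the thresholds it crosses, and its effect on any one threshold is bounded. That is one plausible reading of how the decomposition "structurally isolates" noise; the paper's actual weighting and accumulation may differ.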

preprint | 2024 | arXiv

Calculating Quasi-Normal Modes of Schwarzschild Black Holes with Physics Informed Neural Networks

Machine learning, particularly neural networks, has rapidly permeated most activities and work where data has a story to tell. Recently, deep learning has begun to be used to solve differential equations with input from physics, an approach known as Physics Informed Neural Networks (PINNs). We present a study showing the efficacy of PINNs for solving the Zerilli and the Regge-Wheeler equations in the time domain to calculate the quasi-normal oscillation modes of a Schwarzschild black hole. We compare the extracted modes with those obtained with finite difference methods. Although the PINN results are competitive, with differences of a few percent in the quasi-normal mode estimates relative to those computed with finite difference methods, the real power of PINNs will emerge when applied to high-dimensional problems.
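
For reference, the time-domain equations named in the abstract above take the standard textbook form below, written in geometric units ($G = c = 1$) for a black hole of mass $M$, multipole index $\ell$, and tortoise coordinate $r_*$. The last line is only an illustrative assumption about how a PINN typically penalizes the equation residual at $N$ collocation points; the network output $\psi_\theta$, the count $N$, and the points $(t_i, r_{*,i})$ are hypothetical symbols, and the paper's actual loss (including initial and boundary terms) may differ.

```latex
\begin{align}
\frac{\partial^2 \psi}{\partial t^2} - \frac{\partial^2 \psi}{\partial r_*^2} + V_\ell(r)\,\psi &= 0,
\qquad r_* = r + 2M \ln\!\left(\frac{r}{2M} - 1\right), \\
V_\ell^{\mathrm{RW}}(r) &= \left(1 - \frac{2M}{r}\right)
  \left[\frac{\ell(\ell+1)}{r^2} - \frac{6M}{r^3}\right], \\
V_\ell^{\mathrm{Z}}(r) &= \left(1 - \frac{2M}{r}\right)
  \frac{2\lambda^2(\lambda+1)\,r^3 + 6\lambda^2 M r^2 + 18\lambda M^2 r + 18M^3}
       {r^3\,(\lambda r + 3M)^2},
\qquad \lambda = \frac{(\ell-1)(\ell+2)}{2}, \\
\mathcal{L}_{\mathrm{PDE}}(\theta) &= \frac{1}{N}\sum_{i=1}^{N}
  \left[\partial_t^2 \psi_\theta - \partial_{r_*}^2 \psi_\theta + V_\ell\,\psi_\theta\right]^2_{(t_i,\,r_{*,i})}.
\end{align}
```

Here $V_\ell^{\mathrm{RW}}$ is the Regge-Wheeler potential for gravitational (axial) perturbations and $V_\ell^{\mathrm{Z}}$ is the Zerilli potential for polar perturbations; the quasi-normal modes are then read off from the damped oscillation of $\psi$ at a fixed observer location.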