Researcher profile

Dileep Kalathil

Dileep Kalathil contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
17works
0followers
13topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

17 published item(s)

preprint2026arXiv

Adaptive Outer-Loop Control of Quadrotors via Reinforcement Learning

Deep Reinforcement Learning (DRL) for quadrotor flight control typically relies on Domain Randomization (DR) for sim-to-real transfer, resulting in overly conservative policies that struggle with dynamic disturbances. To overcome this, we propose a novel adaptive control architecture that actively perceives and reacts to instantaneous perturbations. First, we train an optimal outer-loop policy, then replace its reliance on ground-truth disturbance data with a Residual Dynamics Predictor (RDP). The RDP estimates the external forces and moments acting on the aircraft in flight online using only the history of states and control actions. For seamless hardware transfer, we introduce a data-efficient linear calibration bridge and an online thrust correction mechanism that align the simulated latent space with reality using mere seconds of flight data. Real-world validations on a Crazyflie micro-quadrotor demonstrate that our adaptive controller significantly outperforms baselines, maintaining precise trajectory tracking under severe uncertainties including mass variations, asymmetric payloads, and dynamic slung loads

preprint2026arXiv

Robust LLM Alignment via Distributionally Robust Direct Preference Optimization

A major challenge in aligning large language models (LLMs) with human preferences is the issue of distribution shift. LLM alignment algorithms rely on static preference datasets, assuming that they accurately represent real-world user preferences. However, user preferences vary significantly across geographical regions, demographics, linguistic patterns, and evolving cultural trends. This preference distribution shift leads to catastrophic alignment failures in many real-world applications. We address this problem using the principled framework of distributionally robust optimization, and develop two novel distributionally robust direct preference optimization (DPO) algorithms, namely, Wasserstein DPO (WDPO) and Kullback-Leibler DPO (KLDPO). We characterize the sample complexity of learning the optimal policy parameters for WDPO and KLDPO. Moreover, we propose scalable gradient descent-style learning algorithms by developing suitable approximations for the challenging minimax loss functions of WDPO and KLDPO. Our empirical experiments using benchmark data sets and LLMs demonstrate the superior performance of WDPO and KLDPO in substantially improving the alignment when there is a preference distribution shift.

preprint2022arXiv

DETERRENT: Detecting Trojans using Reinforcement Learning

Insertion of hardware Trojans (HTs) in integrated circuits is a pernicious threat. Since HTs are activated under rare trigger conditions, detecting them using random logic simulations is infeasible. In this work, we design a reinforcement learning (RL) agent that circumvents the exponential search space and returns a minimal set of patterns that is most likely to detect HTs. Experimental results on a variety of benchmarks demonstrate the efficacy and scalability of our RL agent, which obtains a significant reduction ($169\times$) in the number of test patterns required while maintaining or improving coverage ($95.75\%$) compared to the state-of-the-art techniques.

preprint2022arXiv

Distributed Learning of Neural Lyapunov Functions for Large-Scale Networked Dissipative Systems

This paper considers the problem of characterizing the stability region of a large-scale networked system comprised of dissipative nonlinear subsystems, in a distributed and computationally tractable way. One standard approach to estimate the stability region of a general nonlinear system is to first find a Lyapunov function for the system and characterize its region of attraction as the stability region. However, classical approaches, such as sum-of-squares methods and quadratic approximation, for finding a Lyapunov function either do not scale to large systems or give very conservative estimates for the stability region. In this context, we propose a new distributed learning based approach by exploiting the dissipativity structure of the subsystems. Our approach has two parts: the first part is a distributed approach to learn the storage functions (similar to the Lyapunov functions) for all the subsystems, and the second part is a distributed optimization approach to find the Lyapunov function for the networked system using the learned storage functions of the subsystems. We demonstrate the superior performance of our proposed approach through extensive case studies in microgrid networks.

preprint2022arXiv

Meta-Learning Online Control for Linear Dynamical Systems

In this paper, we consider the problem of finding a meta-learning online control algorithm that can learn across the tasks when faced with a sequence of $N$ (similar) control tasks. Each task involves controlling a linear dynamical system for a finite horizon of $T$ time steps. The cost function and system noise at each time step are adversarial and unknown to the controller before taking the control action. Meta-learning is a broad approach where the goal is to prescribe an online policy for any new unseen task exploiting the information from other tasks and the similarity between the tasks. We propose a meta-learning online control algorithm for the control setting and characterize its performance by \textit{meta-regret}, the average cumulative regret across the tasks. We show that when the number of tasks are sufficiently large, our proposed approach achieves a meta-regret that is smaller by a factor $D/D^{*}$ compared to an independent-learning online control algorithm which does not perform learning across the tasks, where $D$ is a problem constant and $D^{*}$ is a scalar that decreases with increase in the similarity between tasks. Thus, when the sequence of tasks are similar the regret of the proposed meta-learning online control is significantly lower than that of the naive approaches without meta-learning. We also present experiment results to demonstrate the superior performance achieved by our meta-learning algorithm.

preprint2022arXiv

Policy Optimization for Constrained MDPs with Provable Fast Global Convergence

We address the problem of finding the optimal policy of a constrained Markov decision process (CMDP) using a gradient descent-based algorithm. Previous results have shown that a primal-dual approach can achieve an $\mathcal{O}(1/\sqrt{T})$ global convergence rate for both the optimality gap and the constraint violation. We propose a new algorithm called policy mirror descent-primal dual (PMD-PD) algorithm that can provably achieve a faster $\mathcal{O}(\log(T)/T)$ convergence rate for both the optimality gap and the constraint violation. For the primal (policy) update, the PMD-PD algorithm utilizes a modified value function and performs natural policy gradient steps, which is equivalent to a mirror descent step with appropriate regularization. For the dual update, the PMD-PD algorithm uses modified Lagrange multipliers to ensure a faster convergence rate. We also present two extensions of this approach to the settings with zero constraint violation and sample-based estimation. Experimental results demonstrate the faster convergence rate and the better performance of the PMD-PD algorithm compared with existing policy gradient-based algorithms.

preprint2022arXiv

Reinforcement Learning with Sparse Rewards using Guidance from Offline Demonstration

A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine grain feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame. This is because of the large number of exploration actions that the policy has to perform before it gets any useful feedback that it can learn from. In this work, we address this challenging problem by developing an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy for faster and efficient online RL in such sparse reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step by using the offline demonstration data. The key idea is that by obtaining guidance from - not imitating - the offline data, LOGO orients its policy in the manner of the sub-optimal policy, while yet being able to learn beyond and approach optimality. We provide a theoretical analysis of our algorithm, and provide a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored state. Further, we demonstrate the value of our approach via implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance.

preprint2022arXiv

Sample Complexity of Robust Reinforcement Learning with a Generative Model

The Robust Markov Decision Process (RMDP) framework focuses on designing control policies that are robust against the parameter uncertainties due to the mismatches between the simulator model and real-world settings. An RMDP problem is typically formulated as a max-min problem, where the objective is to find the policy that maximizes the value function for the worst possible model that lies in an uncertainty set around a nominal model. The standard robust dynamic programming approach requires the knowledge of the nominal model for computing the optimal robust policy. In this work, we propose a model-based reinforcement learning (RL) algorithm for learning an $ε$-optimal robust policy when the nominal model is unknown. We consider three different forms of uncertainty sets, characterized by the total variation distance, chi-square divergence, and KL divergence. For each of these uncertainty sets, we give a precise characterization of the sample complexity of our proposed algorithm. In addition to the sample complexity results, we also present a formal analytical argument on the benefit of using robust policies. Finally, we demonstrate the performance of our algorithm on two benchmark problems.

preprint2022arXiv

The Impact of Heavy-Duty Vehicle Electrification on Large Power Grids: a Synthetic Texas Case Study

The electrification of heavy-duty vehicles (HDEVs) is a nascent and rapidly emerging avenue for decarbonization of the transportation sector. In this paper, we examine the impacts of increased vehicle electrification on the power grid infrastructure, with particular focus on HDEVs. We utilize a synthetic representation of the 2000-bus Texas transmission grid, and realistic representations of multiple distribution grids in Travis county, Texas, as well as transit data pertaining to HDEVs, to uncover the consequences of HDEV electrification, and expose the limitations imposed by existing electric grid infrastructure. Our analysis reveals that grid-wide voltage problems that are spatiotemporally correlated with the mobility of HDEVs may occur even at modest penetration levels. In fact, we find that as little as 11% of heavy duty vehicles in Texas charging simultaneously can lead to significant voltage violations on the transmission network that compromise grid reliability. Furthermore, we find that just a few dozen EVs charging simultaneously can lead to voltage violations at the distribution level.

preprint2021arXiv

Learning with Safety Constraints: Sample Complexity of Reinforcement Learning for Constrained MDPs

Many physical systems have underlying safety considerations that require that the policy employed ensures the satisfaction of a set of constraints. The analytical formulation usually takes the form of a Constrained Markov Decision Process (CMDP). We focus on the case where the CMDP is unknown, and RL algorithms obtain samples to discover the model and compute an optimal constrained policy. Our goal is to characterize the relationship between safety constraints and the number of samples needed to ensure a desired level of accuracy -- both objective maximization and constraint satisfaction -- in a PAC sense. We explore two classes of RL algorithms, namely, (i) a generative model based approach, wherein samples are taken initially to estimate a model, and (ii) an online approach, wherein the model is updated as samples are obtained. Our main finding is that compared to the best known bounds of the unconstrained regime, the sample complexity of constrained RL algorithms are increased by a factor that is logarithmic in the number of constraints, which suggests that the approach may be easily utilized in real systems.

preprint2021arXiv

Reinforcement Learning for Mean Field Games with Strategic Complementarities

Mean Field Games (MFG) are the class of games with a very large number of agents and the standard equilibrium concept is a Mean Field Equilibrium (MFE). Algorithms for learning MFE in dynamic MFGs are unknown in general. Our focus is on an important subclass that possess a monotonicity property called Strategic Complementarities (MFG-SC). We introduce a natural refinement to the equilibrium concept that we call Trembling-Hand-Perfect MFE (T-MFE), which allows agents to employ a measure of randomization while accounting for the impact of such randomization on their payoffs. We propose a simple algorithm for computing T-MFE under a known model. We also introduce a model-free and a model-based approach to learning T-MFE and provide sample complexities of both algorithms. We also develop a fully online learning scheme that obviates the need for a simulator. Finally, we empirically evaluate the performance of the proposed algorithms via examples motivated by real-world applications.

preprint2021arXiv

Robust Reinforcement Learning using Least Squares Policy Iteration with Provable Performance Guarantees

This paper addresses the problem of model-free reinforcement learning for Robust Markov Decision Process (RMDP) with large state spaces. The goal of the RMDP framework is to find a policy that is robust against the parameter uncertainties due to the mismatch between the simulator model and real-world settings. We first propose the Robust Least Squares Policy Evaluation algorithm, which is a multi-step online model-free learning algorithm for policy evaluation. We prove the convergence of this algorithm using stochastic approximation techniques. We then propose Robust Least Squares Policy Iteration (RLSPI) algorithm for learning the optimal robust policy. We also give a general weighted Euclidean norm bound on the error (closeness to optimality) of the resulting policy. Finally, we demonstrate the performance of our RLSPI algorithm on some standard benchmark problems.

preprint2020arXiv

A Vision-Based Control Method for Autonomous Landing of Vertical Flight Aircraft On a Moving Platform Without Using GPS

The paper discusses a novel vision-based estimation and control approach to enable fully autonomous tracking and landing of vertical take-off and landing (VTOL) capable unmanned aerial vehicles (UAVs) on moving platforms without relying on a GPS signal. A unique feature of the present method is that it accomplishes this task without tracking the landing pad itself; however, by utilizing a standardized visual cue installed normal to the landing pad and parallel to the pilot's/vehicle's line of sight. A computer vision system using a single monocular camera is developed to detect the visual cue and then accurately estimate the heading of the UAV and its relative distances in all three directions to the landing pad. Through comparison with a Vicon-based motion capture system, the capability of the present vision system to measure distances in real-time within an accuracy of less than a centimeter and heading within a degree with the right visual cue, is demonstrated. A gain-scheduled proportional integral derivative (PID) control system is integrated with the vision system and then implemented on a quad-rotor-UAV dynamic model in a realistic simulation program called Gazebo. Extensive simulations are conducted to demonstrate the ability of the controller to achieve robust tracking and landing on platforms moving in arbitrary trajectories. Repeated flight tests, using both stationary and moving platforms are successfully conducted with less than 5 centimeters of landing error.

preprint2020arXiv

Creation of Synthetic Networked PMU Data: A Generative Adversarial Network Approach

This paper introduces a machine learning-based approach to synthetically creating realistic phasor measurement unit (PMU) data streams of multiple transient types. In contrast to the existing literature of transient simulation-based data generation methods, we propose a generative adversarial network (GAN) based approach to learning directly from the historical data and simultaneously reproduce multiple PMU data streams. The synthetic PMU data streams reflect meaningful dynamic characteristics which observe first principles such as Kirchhoff's laws. The efficacy of this approach is demonstrated by numerical studies on the IEEE 39-bus system. We validate the fidelity and flexibility of the synthetic data via statistical resemblance and modal analysis approaches. Finally we illustrate a practical application scenario for the usage of the synthetic PMU data, i.e. leverage the synthetic data to improve the performance of the event classification algorithms.

preprint2020arXiv

D2C 2.0: Decoupled Data-Based Approach for Learning to Control Stochastic Nonlinear Systems via Model-Free ILQR

In this paper, we propose a structured linear parameterization of a feedback policy to solve the model-free stochastic optimal control problem. This parametrization is corroborated by a decoupling principle that is shown to be near-optimal under a small noise assumption, both in theory and by empirical analyses. Further, we incorporate a model-free version of the Iterative Linear Quadratic Regulator (ILQR) in a sample-efficient manner into our framework. Simulations on systems over a range of complexities reveal that the resulting algorithm is able to harness the superior second-order convergence properties of ILQR. As a result, it is fast and is scalable to a wide variety of higher dimensional systems. Comparisons are made with a state-of-the-art reinforcement learning algorithm, the Deep Deterministic Policy Gradient (DDPG) technique, in order to demonstrate the significant merits of our approach in terms of training-efficiency.

preprint2020arXiv

Learning to Cache and Caching to Learn: Regret Analysis of Caching Algorithms

Crucial performance metrics of a caching algorithm include its ability to quickly and accurately learn a popularity distribution of requests. However, a majority of work on analytical performance analysis focuses on hit probability after an asymptotically large time has elapsed. We consider an online learning viewpoint, and characterize the "regret" in terms of the finite time difference between the hits achieved by a candidate caching algorithm with respect to a genie-aided scheme that places the most popular items in the cache. We first consider the Full Observation regime wherein all requests are seen by the cache. We show that the Least Frequently Used (LFU) algorithm is able to achieve order optimal regret, which is matched by an efficient counting algorithm design that we call LFU-Lite. We then consider the Partial Observation regime wherein only requests for items currently cached are seen by the cache, making it similar to an online learning problem related to the multi-armed bandit problem. We show how approaching this "caching bandit" using traditional approaches yields either high complexity or regret, but a simple algorithm design that exploits the structure of the distribution can ensure order optimal regret. We conclude by illustrating our insights using numerical simulations.

preprint2020arXiv

QFlow: A Learning Approach to High QoE Video Streaming at the Wireless Edge

The predominant use of wireless access networks is for media streaming applications, which are only gaining popularity as ever more devices become available for this purpose. However, current access networks treat all packets identically, and lack the agility to determine which clients are most in need of service at a given time. Software reconfigurability of networking devices has seen wide adoption, and this in turn implies that agile control policies can be now instantiated on access networks. The goal of this work is to design, develop and demonstrate QFlow, a learning approach to create a value chain from the application on one side, to algorithms operating over reconfigurable infrastructure on the other, so that applications are able to obtain necessary resources for optimal performance. Using YouTube video streaming as an example, we illustrate how QFlow is able to adaptively provide such resources and attain a high QoE for all clients at a wireless access point.