Researcher profile

Siva Theja Maguluri

Siva Theja Maguluri contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
16 works
0 followers
6 topics
4 close collaborators

Published work

16 published item(s)

preprint 2026 arXiv

Higher-Order Approximations of Sojourn Times in M/G/1 Queues via Stein's Method

We study the stationary sojourn time distribution in an M/G/1 queue operating under heavy traffic. It is known that the sojourn time converges to an exponential distribution in the limit. Our focus is on obtaining pre-asymptotic, higher-order approximations that go beyond the classical exponential limit. Using Stein's method, we develop an approach based on higher-order expansions of the generator of the underlying Markov process. The key technical step is to represent higher-order derivatives in terms of lower-order ones and control the resulting error via derivative bounds of the Stein equation. Under suitable moment-matching conditions on the service distribution, we show that the approximation error decays as a high-order power of the slack parameter $\varepsilon=1-ρ$. Error bounds are established in the Zolotarev metric, which further imply bounds on the Wasserstein distance as well as the moments. Our results demonstrate that the accuracy of the exponential approximation can be systematically improved by matching progressively more moments of the service distribution.
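For orientation, the first-order behavior behind the exponential limit can be checked with the classical Pollaczek-Khinchine mean-value formula. The sketch below is not the Stein's-method expansion from the abstract; it only evaluates the mean sojourn time and shows that the scaled mean $\varepsilon\,E[T]$ approaches $E[S^2]/(2E[S])$ as $\varepsilon \to 0$. The function names and the choice to fix the service moments while varying the arrival rate are illustrative assumptions.

```python
def mg1_mean_sojourn(lam: float, es: float, es2: float) -> float:
    """Mean stationary sojourn time of an M/G/1 FCFS queue (Pollaczek-Khinchine).

    lam: Poisson arrival rate; es, es2: first and second moments of service time.
    """
    rho = lam * es
    if not 0.0 <= rho < 1.0:
        raise ValueError("stability requires rho = lam * E[S] < 1")
    # Mean service time plus mean waiting time in queue.
    return es + lam * es2 / (2.0 * (1.0 - rho))


def scaled_mean_sojourn(eps: float, es: float, es2: float) -> float:
    """eps * E[T] when the load is rho = 1 - eps (hypothetical helper)."""
    lam = (1.0 - eps) / es  # arrival rate that gives slack eps
    return eps * mg1_mean_sojourn(lam, es, es2)


if __name__ == "__main__":
    # Exponential service with mean 1: E[S] = 1, E[S^2] = 2.
    for eps in (0.1, 0.01, 0.001):
        print(eps, scaled_mean_sojourn(eps, 1.0, 2.0))
```

With exponential service the queue is M/M/1, so the formula reduces to $E[T] = 1/(\mu - \lambda)$, which gives a quick sanity check.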

preprint 2023 arXiv

An Approximate Policy Iteration Viewpoint of Actor-Critic Algorithms

In this work, we consider policy-based methods for solving the reinforcement learning problem, and establish the sample complexity guarantees. A policy-based algorithm typically consists of an actor and a critic. We consider using various policy update rules for the actor, including the celebrated natural policy gradient. In contrast to the gradient ascent approach taken in the literature, we view natural policy gradient as an approximate way of implementing policy iteration, and show that natural policy gradient (without any regularization) enjoys geometric convergence when using increasing stepsizes. As for the critic, we consider using TD-learning with linear function approximation and off-policy sampling. Since it is well-known that in this setting TD-learning can be unstable, we propose a stable generic algorithm (including two specific algorithms: the $λ$-averaged $Q$-trace and the two-sided $Q$-trace) that uses multi-step return and generalized importance sampling factors, and provide the finite-sample analysis. Combining the geometric convergence of the actor with the finite-sample analysis of the critic, we establish for the first time an overall $\mathcal{O}(ε^{-2})$ sample complexity for finding an optimal policy (up to a function approximation error) using policy-based methods under off-policy sampling and linear function approximation.

preprint 2023 arXiv

Transportation Polytope and its Applications in Parallel Server Systems

A parallel server system is a stochastic processing network with applications in manufacturing, supply chains, ride-hailing, call centers, etc. Heterogeneous customers arrive in the system, and only a subset of servers, given by the flexibility graph, can serve any customer type. The goal of the system operator is to minimize the delay, which depends on the scheduling policy and the flexibility graph. A long line of literature focuses on designing near-optimal scheduling policies given a flexibility graph. In contrast, we fix the scheduling policy to be the so-called MaxWeight scheduling, given its superior delay performance, and focus on designing near-optimal, sparse flexibility graphs. Our contributions are threefold. First, we analyze the expected delay in the heavy-traffic asymptotic regime in terms of the properties of the flexibility graph and use this result to translate the design question in terms of the transportation polytope, the deterministic equivalent of parallel server queues. Second, we design the sparsest flexibility graph that achieves a given delay performance and show the robustness of the design to demand uncertainty. Third, given that the budget to add edges arrives sequentially in time, we present the optimal schedule for adding them to the flexibility graph. These results are obtained by proving new results for transportation polytopes, which are of independent interest. In particular, translating the difficulties to a simpler model, i.e., the transportation polytope, allows us to develop a unified framework to answer several design questions.

preprint 2022 arXiv

Finite Sample Analysis of Two-Time-Scale Natural Actor-Critic Algorithm

Actor-critic style two-time-scale algorithms are among the most popular methods in reinforcement learning, and have seen great empirical success. However, their performance is not completely understood theoretically. In this paper, we characterize the global convergence of an online natural actor-critic algorithm in the tabular setting using a single trajectory of samples. Our analysis applies to very general settings, as we only assume ergodicity of the underlying Markov decision process. In order to ensure enough exploration, we employ an $ε$-greedy sampling of the trajectory. For a fixed and small enough exploration parameter $ε$, we show that the two-time-scale natural actor-critic algorithm has a rate of convergence of $\tilde{\mathcal{O}}(1/T^{1/4})$, where $T$ is the number of samples, and this leads to a sample complexity of $\tilde{\mathcal{O}}(1/δ^{8})$ samples to find a policy that is within an error of $δ$ from the global optimum. Moreover, by carefully decreasing the exploration parameter $ε$ as the iterations proceed, we present an improved sample complexity of $\tilde{\mathcal{O}}(1/δ^{6})$ for convergence to the global optimum.

preprint 2022 arXiv

Finite-Sample Analysis of Nonlinear Stochastic Approximation with Applications in Reinforcement Learning

Motivated by applications in reinforcement learning (RL), we study a nonlinear stochastic approximation (SA) algorithm under Markovian noise, and establish its finite-sample convergence bounds under various stepsizes. Specifically, we show that when using a constant stepsize (i.e., $α_k\equiv α$), the algorithm achieves exponentially fast convergence to a neighborhood (with radius $O(α\log(1/α))$) around the desired limit point. When using diminishing stepsizes with an appropriate decay rate, the algorithm converges with rate $O(\log(k)/k)$. Our proof is based on Lyapunov drift arguments, and to handle the Markovian noise, we exploit the fast mixing of the underlying Markov chain. To demonstrate the generality of our theoretical results on Markovian SA, we use them to derive the finite-sample bounds of the popular $Q$-learning with linear function approximation algorithm, under a condition on the behavior policy. Importantly, we do not need to make the assumption that the samples are i.i.d., and do not require an artificial projection step in the algorithm to maintain the boundedness of the iterates. Numerical simulations corroborate our theoretical results.

preprint 2022 arXiv

Finite-Sample Analysis of Off-Policy Natural Actor-Critic with Linear Function Approximation

In this paper, we develop a novel variant of the off-policy natural actor-critic algorithm with linear function approximation and we establish a sample complexity of $\mathcal{O}(ε^{-3})$, outperforming all the previously known convergence bounds of such algorithms. In order to overcome the divergence due to the deadly triad in off-policy policy evaluation under function approximation, we develop a critic that employs an $n$-step TD-learning algorithm with a properly chosen $n$. We present finite-sample convergence bounds on this critic under both constant and diminishing step sizes, which are of independent interest. Furthermore, we develop a variant of natural policy gradient under function approximation, with an improved convergence rate of $\mathcal{O}(1/T)$ after $T$ iterations. Combining the finite-sample error bounds of the actor and the critic, we obtain the $\mathcal{O}(ε^{-3})$ sample complexity. We derive our sample complexity bounds solely based on the assumption that the behavior policy sufficiently explores all the states and actions, which is a much lighter assumption compared to the related literature.
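The $n$-step idea in the critic can be illustrated with the textbook $n$-step bootstrapped return. This is a simplified on-policy sketch for a value estimate: it omits the importance sampling ratios and the linear function approximation that the off-policy critic in the abstract would need, and the function name and signature are hypothetical.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float], bootstrap_value: float, gamma: float) -> float:
    """Textbook n-step return: sum_{i<n} gamma^i * r_i + gamma^n * bootstrap_value.

    rewards: the n rewards observed along the trajectory segment.
    bootstrap_value: current value estimate at the state reached after n steps.
    """
    g = bootstrap_value
    # Fold from the last reward backwards so each step applies one discount.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


if __name__ == "__main__":
    print(n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5))
```

A larger $n$ puts more weight on observed rewards and less on the bootstrapped estimate, which is the lever the abstract refers to when it speaks of a properly chosen $n$.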

preprint 2022 arXiv

Target Network and Truncation Overcome The Deadly Triad in $Q$-Learning

$Q$-learning with function approximation is one of the most empirically successful yet theoretically mysterious reinforcement learning (RL) algorithms, and was identified in Sutton (1999) as one of the most important theoretical open problems in the RL community. Even in the basic linear function approximation setting, there are well-known divergent examples. In this work, we show that target network and truncation together are enough to provably stabilize $Q$-learning with linear function approximation, and we establish the finite-sample guarantees. The result implies an $O(ε^{-2})$ sample complexity up to a function approximation error. Moreover, our results do not require strong assumptions or modifying the problem parameters as in existing literature.

preprint 2021 arXiv

Heavy-Traffic Insensitive Bounds for Weighted Proportionally Fair Bandwidth Sharing Policies

We consider a connection-level model proposed by Massoulié and Roberts for bandwidth sharing among file transfer flows in a communication network. We study weighted proportionally fair sharing policies and establish explicit-form bounds on the weighted sum of the expected numbers of flows on different routes in heavy traffic. The bounds are linear in the number of critically loaded links in the network, and they hold for a class of phase-type file-size distributions; i.e., the bounds are heavy-traffic insensitive to the distributions in this class. Our approach is Lyapunov-drift based, which is different from the widely used diffusion approximation approach. A key technique we develop is to construct a novel inner product in the state space, which then allows us to obtain a multiplicative type of state-space collapse in steady state. Furthermore, this state-space collapse result implies the interchange of limits as a by-product for the diffusion approximation of the equal-weight case under phase-type file-size distributions, demonstrating the heavy-traffic insensitivity of the stationary distribution.

preprint 2021 arXiv

Load balancing system under Join the Shortest Queue: Many-Server-Heavy-Traffic Asymptotics

We study the load balancing system operating under Join the Shortest Queue (JSQ) in the many-server heavy-traffic regime. If $N$ is the number of servers, we let the difference between the total service rate and the total arrival rate be $N^{1-α}$ with $α>0$. We show that for $α>4$ the average queue length behaves similarly to the classical heavy-traffic regime. Specifically, we prove that the distribution of the average queue length multiplied by $N^{1-α}$ converges to an exponential random variable. Moreover, we show a result analogous to state space collapse. We provide two proofs for our result: one using the one-sided Laplace transform, and one using Stein's method. We additionally obtain the rate of convergence in the Wasserstein distance.

preprint 2021 arXiv

Optimal Pricing in Multi Server Systems

We study optimal service pricing in server farms where customers arrive according to a renewal process and have independent and identically distributed (i.i.d.) exponential service times and i.i.d. valuations of the service. The service provider charges a time-varying service fee, aiming to maximize its revenue rate. Customers who find a free server and a service fee lower than their valuation join the service; otherwise, they leave without waiting. We consider both finite-server and infinite-server farms. We solve the optimal pricing problems using the framework of Markov decision problems. We show that the optimal prices depend on the number of free servers. We propose algorithms to compute the optimal prices. We also establish several properties of the optimal prices and the corresponding revenue rates in the case of Poisson customer arrivals. We illustrate all our findings via numerical results.

preprint 2020 arXiv

Finite-Time Performance of Distributed Temporal Difference Learning with Linear Function Approximation

We study the policy evaluation problem in multi-agent reinforcement learning, modeled by a Markov decision process. In this problem, the agents operate in a common environment under a fixed control policy, working together to discover the value (global discounted accumulative reward) associated with each environmental state. Over a series of time steps, the agents act, get rewarded, update their local estimate of the value function, then communicate with their neighbors. The local update at each agent can be interpreted as a distributed variant of the popular temporal difference learning methods TD($λ$). Our main contribution is to provide a finite-time analysis of the performance of this distributed TD($λ$) algorithm for both constant and time-varying step sizes. The key idea in our analysis is to use the geometric mixing time $τ$ of the underlying Markov chain; that is, although the "noise" in our algorithm is Markovian, its dependence is very weak between samples spaced $τ$ steps apart. We provide an explicit upper bound on the convergence rate of the proposed method as a function of the network topology, the discount factor, the constant $λ$, and the mixing time $τ$. Our results also provide a mathematical explanation for observations that have appeared previously in the literature about the choice of $λ$. Our upper bound illustrates the trade-off between approximation accuracy and convergence speed implicit in the choice of $λ$. When $λ=1$, the solution will correspond to the best possible approximation of the value function, while choosing $λ=0$ leads to faster convergence when the noise in the algorithm has large variance.
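The local update described above can be sketched as a standard single-agent TD($λ$) step with linear features and an accumulating eligibility trace. This is a minimal illustration under stated assumptions: the communication step with neighbors is omitted, and the function name, argument names, and feature vectors are hypothetical rather than taken from the paper.

```python
import numpy as np


def td_lambda_step(theta, z, phi_s, phi_next, reward, gamma, lam, alpha):
    """One TD(lambda) update with linear value estimate V(s) = phi(s) @ theta.

    theta: weight vector; z: eligibility trace; phi_s, phi_next: feature
    vectors of the current and next state; alpha: step size.
    Returns the updated (theta, z) without modifying the inputs.
    """
    z = gamma * lam * z + phi_s                              # decay and accumulate trace
    delta = reward + gamma * phi_next @ theta - phi_s @ theta  # TD error
    theta = theta + alpha * delta * z                        # move weights along trace
    return theta, z


if __name__ == "__main__":
    theta, z = np.zeros(2), np.zeros(2)
    theta, z = td_lambda_step(theta, z, np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                              reward=1.0, gamma=0.9, lam=0.5, alpha=0.1)
    theta, z = td_lambda_step(theta, z, np.array([0.0, 1.0]), np.array([0.0, 0.0]),
                              reward=2.0, gamma=0.9, lam=0.5, alpha=0.1)
    print(theta)
```

Setting `lam=0` drops the trace memory so only the current state's features are updated, while `lam=1` keeps the full discounted history; this is the knob behind the accuracy versus speed trade-off mentioned in the abstract.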

preprint 2020 arXiv

Logarithmic Heavy Traffic Error Bounds in Generalized Switch and Load Balancing Systems

Motivated by applications in wireless networks, cloud computing, data centers, etc., Stochastic Processing Networks have been studied in the literature under various asymptotic regimes. In the heavy-traffic regime, the steady-state mean queue length is proved to be $O\left(\frac{1}{ε}\right)$, where $ε$ is the heavy-traffic parameter that goes to zero in the limit. The focus of this paper is on obtaining queue length bounds on prelimit systems, thus establishing the rate of convergence to the heavy-traffic limit. In particular, we study the generalized switch model operating under the MaxWeight algorithm, and we show that the mean queue length of the prelimit system is only $O\left(\log \left(\frac{1}{ε}\right)\right)$ away from its heavy-traffic limit. We do this even when the so-called complete resource pooling (CRP) condition is not satisfied. When the CRP condition is satisfied, in addition, we show that the MaxWeight algorithm is within $O\left(\log \left(\frac{1}{ε}\right)\right)$ of the optimal. Finally, we obtain similar results in load balancing systems operating under the join-the-shortest-queue routing algorithm.
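As a reminder of what the MaxWeight algorithm does in each time slot, the sketch below picks, from a finite set of feasible schedules, the one with the largest queue-length-weighted service. It is a generic illustration: the fixed schedule set (no time-varying channel state), the tie-breaking by lowest index, and the function name are assumptions, not details from the paper.

```python
import numpy as np


def maxweight_schedule(queue_lengths, schedules) -> int:
    """Return the index of the schedule s maximizing the weight sum_i q_i * s_i.

    queue_lengths: length-n vector of current queue lengths.
    schedules: m-by-n array; row k gives the service offered to each queue
    under feasible schedule k. Ties go to the lowest index.
    """
    q = np.asarray(queue_lengths, dtype=float)
    s = np.asarray(schedules, dtype=float)
    weights = s @ q  # weight of every feasible schedule
    return int(np.argmax(weights))


if __name__ == "__main__":
    q = [3, 1, 5]
    feasible = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
    print(maxweight_schedule(q, feasible))
```

In the example, the three schedules have weights 3, 6, and 4, so the second schedule is selected.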

preprint 2020 arXiv

QPS-r: A Cost-Effective Crossbar Scheduling Algorithm and Its Stability and Delay Analysis

In an input-queued switch, a crossbar schedule, or a matching between the input ports and the output ports, needs to be computed in each switching cycle, or time slot. Designing switching algorithms with very low computational complexity that lead to high throughput and small delay is a challenging problem. There appears to be a fundamental tradeoff between the computational complexity of the switching algorithm and the resulting throughput and delay. Parallel maximal matching algorithms (adapted for switching) appear to have struck a sweet spot in this tradeoff, and prior work has shown the following performance guarantees. Using maximal matchings in every time slot results in at least 50% switch throughput and order-optimal (i.e., independent of the switch size $N$) average delay bounds for various traffic arrival processes. On the other hand, their computational complexity can be as low as $O(\log^2 N)$ per port/processor, which is much lower than that of algorithms such as maximum weighted matching, which ensure better throughput performance. In this work, we propose QPS-r, a parallel iterative switching algorithm that has the lowest possible computational complexity: $O(1)$ per port. Using Lyapunov stability analysis, we show that the throughput and delay performance is identical to that of maximal matching algorithms. Although QPS-r builds upon an existing technique called Queue-Proportional Sampling (QPS), in this paper we provide analytical guarantees on its throughput and delay under i.i.d. traffic as well as a Markovian traffic model, which can model many realistic traffic patterns. We also demonstrate that QPS-3 (running 3 iterations) has empirical throughput and delay performance comparable to that of iSLIP (running $\log_2 N$ iterations), a refined and optimized representative maximal matching algorithm adapted for switching.

preprint 2020 arXiv

Throughput and Delay Optimality of Power-of-d Choices in Inhomogeneous Load Balancing Systems

It is well-known that the power-of-d choices routing algorithm maximizes throughput and is heavy-traffic optimal in load balancing systems with homogeneous servers. However, if the servers are heterogeneous, throughput optimality does not hold in general. We find necessary and sufficient conditions for throughput optimality of power-of-d choices when the servers are heterogeneous, and we prove that almost the same conditions are sufficient to show heavy-traffic optimality. Additionally, we generalize the sufficient condition for throughput optimality to a larger class of routing policies.
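The power-of-d routing rule itself is short enough to sketch: sample d servers uniformly at random without replacement and send the arriving job to the one with the shortest queue. The code below is an illustrative assumption-laden version (sampling without replacement, ties broken by sampling order, hypothetical function name); it does not model the heterogeneous service rates that the abstract's conditions concern.

```python
import random
from typing import Sequence


def power_of_d_route(queue_lengths: Sequence[int], d: int, rng: random.Random) -> int:
    """Return the index of the server that receives the arriving job.

    Samples d distinct servers uniformly at random and picks the one with
    the shortest queue among them. With d equal to the number of servers
    this reduces to join-the-shortest-queue.
    """
    if not 1 <= d <= len(queue_lengths):
        raise ValueError("d must be between 1 and the number of servers")
    candidates = rng.sample(range(len(queue_lengths)), d)
    return min(candidates, key=lambda i: queue_lengths[i])


if __name__ == "__main__":
    rng = random.Random(0)
    queues = [4, 2, 7, 1]
    print(power_of_d_route(queues, d=2, rng=rng))
```

Passing an explicit `random.Random` instance keeps the routing reproducible in simulations.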

preprint 2020 arXiv

Transform Methods for Heavy-Traffic Analysis

The drift method was recently developed to study queueing systems in steady state. It was successfully used to obtain bounds on the moments of the scaled queue lengths that are asymptotically tight in heavy traffic, in a wide variety of systems including generalized switches, input-queued switches, bandwidth sharing networks, etc. In this paper we develop the use of transform techniques for heavy-traffic analysis, with a special focus on the use of moment generating functions (MGFs). This approach simplifies the proofs of the drift method, and provides a new perspective on the drift method. We present a general framework and then use the MGF method to obtain the stationary distribution of queue lengths in heavy traffic in queueing systems that satisfy the Complete Resource Pooling condition. In particular, we study load balancing systems and generalized switches under general settings.