Source author record

Nicholas Bambos

Nicholas Bambos appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Topics: Computer Science and Game Theory, math.OC, Networking and Internet Architecture, Systems and Control, Information Theory, Machine Learning, math.IT, math.PR, Performance, Applications, Artificial Intelligence, cond-mat.stat-mech, math.DS, Multiagent Systems, Multimedia, Social and Information Networks

Catalog footprint

What is connected

14 works

16 topics

4 close collaborators


preprint, 2026, arXiv

Policy Gradient Methods for Non-Markovian Reinforcement Learning

We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn it via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising an agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.

preprint, 2022, arXiv

Learning in Games with Quantized Payoff Observations

This paper investigates the impact of feedback quantization on multi-agent learning. In particular, we analyze the equilibrium convergence properties of the well-known "follow the regularized leader" (FTRL) class of algorithms when players can only observe a quantized (and possibly noisy) version of their payoffs. In this information-constrained setting, we show that coarser quantization triggers a qualitative shift in the convergence behavior of FTRL schemes. Specifically, if the quantization error lies below a threshold value (which depends only on the underlying game and not on the level of uncertainty entering the process or the specific FTRL variant under study), then (i) FTRL is attracted to the game's strict Nash equilibria with arbitrarily high probability; and (ii) the algorithm's asymptotic rate of convergence remains the same as in the non-quantized case. Otherwise, for larger quantization levels, these convergence properties are lost altogether: players may fail to learn anything beyond their initial state, even with full information on their payoff vectors. This is in contrast to the impact of quantization in continuous optimization problems, where the quality of the obtained solution degrades smoothly with the quantization level.
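The stalling effect described in this abstract can be illustrated with a minimal sketch. It assumes a single learner facing a fixed, noise-free payoff vector (a hypothetical stand-in for a game), uses exponential weights as one member of the FTRL class, and uses a uniform rounding quantizer; the function names and parameter values are illustrative and are not taken from the paper.

```python
import math


def quantize(value, step):
    """Uniform quantizer: round value to the nearest multiple of step."""
    return step * round(value / step)


def entropic_ftrl(payoffs, step, rounds=200, eta=0.5):
    """Exponential weights (FTRL with an entropic regularizer) on quantized payoffs.

    payoffs: fixed payoff of each action (assumed toy setting, not a full game).
    step:    quantization step; a larger step means coarser feedback.
    """
    scores = [0.0] * len(payoffs)
    for _ in range(rounds):
        for action, payoff in enumerate(payoffs):
            # The learner only ever sees the quantized payoff.
            scores[action] += eta * quantize(payoff, step)
    # Softmax of the cumulative scores, shifted by the max for stability.
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


fine = entropic_ftrl([0.2, 0.3], step=0.05)   # quantizer still separates the actions
coarse = entropic_ftrl([0.2, 0.3], step=1.0)  # both payoffs round to 0.0
```

With the fine quantizer the learner concentrates on the better action, while with the coarse one it never leaves its uniform initial strategy. In this toy the stall happens simply because the quantizer erases the payoff gap; the game-dependent threshold and the noisy setting analyzed in the paper are not reproduced here.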

preprint, 2022, arXiv

No Weighted-Regret Learning in Adversarial Bandits with Delays

Consider a scenario where a player chooses an action in each round $t$ out of $T$ rounds and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that in a non-cooperative game, the expected weighted ergodic distribution of play converges to the set of coarse correlated equilibria if players use algorithms that have "no weighted-regret" in the above scenario, even if they have linear regret due to too large delays. For a two-player zero-sum game, we show that no weighted-regret is sufficient for the weighted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves an expected regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the EXP3 algorithm with $K$ arms achieves an expected regret of $O\left(\sqrt{\log K\left(KT+D\right)}\right)$ even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds use a novel doubling trick that, under mild assumptions, provably retains the regret bound for when $D$ and $T$ are known. Using these bounds, we show that FKM and EXP3 have no weighted-regret even for $d_{t}=O\left(t\log t\right)$. Therefore, algorithms with no weighted-regret can be used to approximate a CCE of a finite or convex unknown game that can only be simulated with bandit feedback, even if the simulation involves significant delays.
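The delayed-feedback setting in this abstract can be sketched as follows. This is a simplified illustration, assuming a fixed learning rate `eta`, fixed per-arm costs, and a delay sequence known to the simulator; it does not implement the doubling trick or the weighted-regret analysis from the paper, and every name in it is hypothetical.

```python
import math
import random


def softmin(losses, eta):
    """Probabilities proportional to exp(-eta * loss), shifted for stability."""
    low = min(losses)
    weights = [math.exp(-eta * (loss - low)) for loss in losses]
    total = sum(weights)
    return [w / total for w in weights]


def exp3_delayed(costs, delays, eta=0.05, seed=0):
    """EXP3 where the cost of round t is revealed only after delays[t] rounds."""
    rng = random.Random(seed)
    arms = len(costs)
    est_loss = [0.0] * arms  # cumulative importance-weighted loss estimates
    pending = {}             # arrival round -> [(arm, cost, prob when played)]
    for t, delay in enumerate(delays):
        probs = softmin(est_loss, eta)
        arm = rng.choices(range(arms), weights=probs)[0]
        pending.setdefault(t + delay, []).append((arm, costs[arm], probs[arm]))
        # Apply all feedback that arrives at the end of round t.
        for played, cost, prob in pending.pop(t, []):
            est_loss[played] += cost / prob
    return softmin(est_loss, eta)


final_probs = exp3_delayed(costs=[0.9, 0.1], delays=[5] * 2000)
```

Each observation is weighted by the probability the arm had when it was played, not when the feedback arrives, which keeps the loss estimate unbiased under delay. Feedback scheduled after the last round is simply never applied in this sketch.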

preprint, 2020, arXiv

My Fair Bandit: Distributed Learning of Max-Min Fairness with Multi-player Bandits

Consider N cooperative but non-communicating players where each plays one out of M arms for T turns. Players have different utilities for each arm, representable as an NxM matrix. These utilities are unknown to the players. In each turn players select an arm and receive a noisy observation of their utility for it. However, if any other player selects the same arm in that turn, all colliding players receive zero utility due to the conflict. No other communication or coordination between the players is possible. Our goal is to design a distributed algorithm that learns the matching between players and arms that achieves max-min fairness while minimizing the regret. We present an algorithm and prove that it is regret optimal up to a $\log\log T$ factor. This is the first max-min fairness multi-player bandit algorithm with (near) order optimal regret.
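The collision rule and the max-min fairness target from this abstract can be made concrete with a small sketch. It assumes the utility matrix is known and noise-free and finds the target matching by centralized brute force, so it only shows what the distributed algorithm is trying to learn, not how the paper learns it. The function names are hypothetical, and the search assumes there are at least as many arms as players.

```python
from itertools import permutations


def collision_rewards(choices, utilities):
    """Noise-free reward of each player given one arm choice per player."""
    counts = {}
    for arm in choices:
        counts[arm] = counts.get(arm, 0) + 1
    # A player is paid only if nobody else picked the same arm.
    return [
        utilities[player][arm] if counts[arm] == 1 else 0.0
        for player, arm in enumerate(choices)
    ]


def max_min_matching(utilities):
    """Brute-force the collision-free assignment with the largest minimum utility."""
    players = len(utilities)
    arms = len(utilities[0])
    return max(
        permutations(range(arms), players),
        key=lambda assignment: min(
            utilities[player][arm] for player, arm in enumerate(assignment)
        ),
    )


utilities = [[0.9, 0.2], [0.8, 0.7]]  # rows are players, columns are arms
best = max_min_matching(utilities)
```

Here both players prefer arm 0, but if both pick it they collide and earn nothing; the max-min matching sends player 1 to arm 1 so that the worst-off player still earns 0.7.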

preprint, 2016, arXiv

Infinite Server Queueing Networks with Deadline Based Routing

Motivated by timeouts in Internet services, we consider networks of infinite server queues in which routing decisions are based on deadlines. Specifically, at each node in the network, the total service time equals the minimum of several independent service times (e.g. the minimum of the amount of time required to complete a transaction and a deadline). Furthermore, routing decisions depend on which of the independent service times achieves the minimum (e.g. exceeding a deadline will require the customer to be routed so they can re-attempt the transaction). Because current routing decisions are dependent on past service times, much of the existing theory on product-form queueing networks does not apply. In spite of this, we are able to show that such networks have product-form equilibrium distributions. We verify our analytic characterization with a simulation of a simple network. We also discuss extensions of this work to more general settings.

preprint, 2016, arXiv

Myopic Policies for Non-Preemptive Scheduling of Jobs with Decaying Value

In many scheduling applications, minimizing delays is of high importance. One adverse effect of such delays is that the reward for completion of a job may decay over time. Indeed in healthcare settings, delays in access to care can result in worse outcomes, such as an increase in mortality risk. Motivated by managing hospital operations in disaster scenarios, as well as other applications in perishable inventory control and information services, we consider non-preemptive scheduling of jobs whose internal value decays over time. Because solving for the optimal scheduling policy is computationally intractable, we focus our attention on the performance of three intuitive heuristics: (1) a policy which maximizes the expected immediate reward, (2) a policy which maximizes the expected immediate reward rate, and (3) a policy which prioritizes jobs with imminent deadlines. We provide performance guarantees for all three policies and show that many of these performance bounds are tight. In addition, we provide numerical experiments and simulations to compare how the policies perform in a variety of scenarios. Our theoretical and numerical results allow us to establish rules-of-thumb for applying these heuristics in a variety of situations, including patient scheduling scenarios.
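The three heuristics named in this abstract can be sketched as selection rules over a set of waiting jobs. The job model below is an assumption made for illustration (deterministic reward, known duration, hard deadline, and reward rate taken as reward divided by duration); the paper's stochastic model and decay functions are not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    reward: float    # value collected if the job is served next (assumed known)
    duration: float  # service time (assumed deterministic)
    deadline: float  # time after which the job is worthless


def pick_max_reward(jobs):
    """Heuristic 1: serve the job with the largest immediate reward."""
    return max(jobs, key=lambda job: job.reward)


def pick_max_reward_rate(jobs):
    """Heuristic 2: serve the job with the largest reward per unit of service time."""
    return max(jobs, key=lambda job: job.reward / job.duration)


def pick_earliest_deadline(jobs):
    """Heuristic 3: serve the job whose deadline is most imminent."""
    return min(jobs, key=lambda job: job.deadline)


jobs = [
    Job("A", reward=10.0, duration=10.0, deadline=8.0),
    Job("B", reward=6.0, duration=2.0, deadline=9.0),
    Job("C", reward=2.0, duration=1.0, deadline=3.0),
]
```

On this example the three rules disagree: the first picks A, the second picks B, and the third picks C, which is the kind of situation where the choice of heuristic matters.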

preprint, 2016, arXiv

Power Control for Packet Streaming with Head-of-Line Deadlines

We consider a mathematical model for streaming media packets (as the motivating key example) from a transmitter buffer to a receiver over a wireless link while controlling the transmitter power (hence, the packet/job processing rate). When each packet comes to the head-of-line (HOL) in the buffer, it is given a deadline $D$ which is the maximum number of times the transmitter can attempt retransmission in order to successfully transmit the packet. If this number of transmission attempts is exhausted, the packet is ejected from the buffer and the next packet comes to the HOL. Costs are incurred in each time slot for holding packets in the buffer, expending transmitter power, and ejecting packets which exceed their deadlines. We investigate how transmission power should be chosen so as to minimize the total cost of transmitting the items in the buffer. We formulate the optimal power control problem in a dynamic programming framework and then home in on the special case of fixed interference. For this special case, we are able to provide a precise analytic characterization of how the power control should vary with the backlog and how the power control should react to approaching deadlines. In particular, we show monotonicity results for how the transmitter should adapt power levels to the backlog and approaching deadlines. We leverage these analytic results from the special case to build a power control scheme for the general case. Monte Carlo simulations are used to evaluate the performance of the resulting power control scheme as compared to the optimal scheme. The resulting power control scheme is sub-optimal but it provides a low-complexity approximation of the optimal power control. Simulations show that our proposed schemes outperform benchmark algorithms. We also discuss applications of the model to other practical operational scenarios.

preprint, 2016, arXiv

Predicting Pediatric Surgical Durations

Effective management of operating room resources relies on accurate predictions of surgical case durations. This prediction problem is known to be particularly difficult in pediatric hospitals due to the extreme variation in pediatric patient populations. We propose a novel metric for measuring accuracy of predictions which captures key issues relevant to hospital operations. With this metric in mind we propose several tree-based prediction models. Some are automated (they do not require input from surgeons) while others are semi-automated (they do require input from surgeons). We see that many of our automated methods generally outperform currently used algorithms and even achieve the same performance as surgeons. Our semi-automated methods can outperform surgeons by a significant margin. We gain insights into the predictive value of different features and suggest avenues of future work.

preprint, 2016, arXiv

Service Rate Control For Jobs with Decaying Value

The task of completing jobs with decaying value arises in a number of application areas including healthcare operations, communications engineering, and perishable inventory control. We consider a system in which a single server completes a finite sequence of jobs in discrete time while a controller dynamically adjusts the service rate. During service, the value of the job decays so that a greater reward is received for having shorter service times. We incorporate a non-decreasing cost for holding jobs and a non-decreasing cost on the service rate. The controller aims to minimize the total cost of servicing the set of jobs. We show that the optimal policy is non-decreasing in the number of jobs remaining -- when there are more jobs in the system the controller should use a higher service rate. The optimal policy does not necessarily vary monotonically with the residual job value, but we give algebraic conditions which can be used to determine when it does. These conditions are then simplified in the case that the reward for completion is constant when the job has positive value and zero otherwise. These algebraic conditions are interesting because they can be verified without using algorithms like value iteration and policy iteration to explicitly compute the optimal policy. We also discuss some future modeling extensions.

preprint, 2013, arXiv

Power Optimization in Random Wireless Networks

Consider a wireless network of transmitter-receiver pairs where the transmitters adjust their powers to maintain a target SINR level in the presence of interference. In this paper, we analyze the optimal power vector that achieves this target in large, random networks obtained by "erasing" a finite fraction of nodes from a regular lattice of transmitter-receiver pairs. We show that this problem is equivalent to the so-called Anderson model of electron motion in dirty metals which has been used extensively in the analysis of diffusion in random environments. A standard approximation to this model is the so-called coherent potential approximation (CPA) method which we apply to evaluate the first and second order intra-sample statistics of the optimal power vector in one- and two-dimensional systems. This approach is equivalent to traditional techniques from random matrix theory and free probability, but while generally accurate (and in agreement with numerical simulations), it fails to fully describe the system: in particular, results obtained in this way fail to predict when power control becomes infeasible. In this regard, we find that the infinite system is always unstable beyond a certain value of the target SINR, but any finite system only has a small probability of becoming unstable. This instability probability is proportional to the tails of the eigenvalue distribution of the system which are calculated to exponential accuracy using methodologies developed within the Anderson model and its ties with random walks in random media. Finally, using these techniques, we also calculate the tails of the system's power distribution under power control and the rate of convergence of the Foschini-Miljanic power control algorithm in the presence of random erasures. Overall, in the paper we try to strike a balance between intuitive arguments and formal proofs.
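The Foschini-Miljanic power control algorithm mentioned at the end of this abstract has a simple standard form: each link scales its power by the ratio of the target SINR to its current SINR. The sketch below applies that update synchronously on a two-link example; the gain matrix, noise level, and target are made-up values, and `gains[i][j]` is assumed to be the gain from transmitter `j` to receiver `i`. The random-lattice analysis of the paper is not reproduced.

```python
def sinr(gains, powers, noise, i):
    """SINR at receiver i: own signal over interference plus noise."""
    interference = sum(
        gains[i][j] * powers[j] for j in range(len(powers)) if j != i
    )
    return gains[i][i] * powers[i] / (interference + noise)


def foschini_miljanic(gains, noise, target, steps=100):
    """Iterate p_i <- (target / SINR_i) * p_i for all links simultaneously."""
    powers = [1.0] * len(gains)
    for _ in range(steps):
        current = [sinr(gains, powers, noise, i) for i in range(len(powers))]
        powers = [target / current[i] * powers[i] for i in range(len(powers))]
    return powers


gains = [[1.0, 0.1], [0.2, 1.0]]  # hypothetical two-link network
powers = foschini_miljanic(gains, noise=0.1, target=2.0)
achieved = [sinr(gains, powers, 0.1, i) for i in range(2)]
```

For this example the target is feasible, so the iteration settles on a power vector at which both links meet the target SINR. When the target is too high for the given gains, the same update makes the powers grow without bound, which corresponds to the infeasibility discussed in the abstract.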

preprint, 2011, arXiv

Cone Schedules for Processing Systems in Fluctuating Environments

We consider a generalized processing system having several queues, where the available service rate combinations are fluctuating over time due to reliability and availability variations. The objective is to allocate the available resources, and corresponding service rates, in response to both workload and service capacity considerations, in order to maintain the long term stability of the system. The service configurations are completely arbitrary, including negative service rates which represent forwarding and service-induced cross traffic. We employ a trace-based trajectory asymptotic technique, which requires minimal assumptions about the arrival dynamics of the system. We prove that cone schedules, which leverage the geometry of the queueing dynamics, maximize the system throughput for a broad class of processing systems, even under adversarial arrival processes. We study the impact of fluctuating service availability, where resources are available only some of the time, and the schedule must dynamically respond to the changing available service rates, establishing both the capacity of such systems and the class of schedules which will stabilize the system at full capacity. The rich geometry of the system dynamics leads to important insights for stability, performance and scalability, and substantially generalizes previous findings. The processing system studied here models a broad variety of computer, communication and service networks, including varying channel conditions and cross-traffic in wireless networking, and call centers with fluctuating capacity. The findings have implications for bandwidth and processor allocation in communication networks and workforce scheduling in congested call centers.

preprint, 2011, arXiv

Fairness in overloaded parallel queues

Maximizing throughput for heterogeneous parallel server queues has received quite a bit of attention from the research community and the stability region for such systems is well understood. However, many real-world systems have periods where they are temporarily overloaded. Under such scenarios, the unstable queues often starve limited resources. This work examines what happens during periods of temporary overload. Specifically, we look at how to fairly distribute stress. We explore the dynamics of the queue workloads under the MaxWeight scheduling policy during long periods of stress and discuss how to tune this policy in order to achieve a target fairness ratio across these workloads.
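The MaxWeight policy referred to in this abstract picks, in every slot, the service configuration with the largest backlog-weighted service rate. The sketch below is a minimal discrete-time version on a hypothetical two-queue system that is overloaded (two arrivals per slot, one unit of service); the arrival pattern, the configuration set, and the update order are assumptions, and the fairness tuning studied in the paper is not included.

```python
def max_weight(queues, configurations):
    """Pick the service vector maximizing the sum of backlog times service rate."""
    return max(
        configurations,
        key=lambda rates: sum(q * r for q, r in zip(queues, rates)),
    )


def step(queues, arrivals, configurations):
    """One slot: choose a configuration, add arrivals, remove served work."""
    rates = max_weight(queues, configurations)
    return [max(q + a - r, 0) for q, a, r in zip(queues, arrivals, rates)]


configurations = [(1, 0), (0, 1)]  # the server helps one queue per slot
queues = [0, 0]
for _ in range(100):
    queues = step(queues, arrivals=[1, 1], configurations=configurations)
```

In this symmetric toy case the policy alternates between the two queues, so both backlogs grow at the same rate under overload. With asymmetric arrivals or service rates the growth is no longer evenly split, which is the situation the abstract's target fairness ratio addresses.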

preprint, 2010, arXiv

Decentralized Admission Control for Power-Controlled Wireless Links

This paper deals with the problem of admission control/channel access in power-controlled decentralized wireless networks, in which the quality-of-service (QoS) is expressed in terms of the signal-to-interference ratio (SIR). We analyze a previously proposed admission control algorithm, which was designed to maintain the SIR of operational (active) links above some given threshold at all times (protection of active links). This protection property ensures that as new users attempt to join the network, the already established links sustain their quality. The considered scheme may be thus applicable in some cognitive radio networks, where the fundamental premise is that secondary users may be granted channel access only if it does not cause disturbance to primary users. The admission control algorithm was previously analyzed under the assumption of affine interference functions. This paper extends all the previous results to arbitrary standard interference functions, which capture many important receiver designs, including optimal linear reception in the sense of maximizing the SIR and the worst-case receiver design. Furthermore, we provide novel conditions for protection of active users under the considered control scheme when individual power constraints are imposed on each link. Finally, we consider the possibility of a joint optimization of transmitters and receivers in networks with linear transceivers, which includes linear beamforming in multiple antenna systems. Transmitter optimization is performed alternately with receiver optimization to generate non-decreasing sequences of SIRs. Numerical evaluations show that additional transmitter side optimization has potential for significant performance gains.

preprint, 2010, arXiv

Packet Scheduling in Switches with Target Outflow Profiles

The problem of packet scheduling for traffic streams with target outflow profiles traversing input queued switches is formulated in this paper. Target outflow profiles specify the desirable inter-departure times of packets leaving the switch from each traffic stream. The goal of the switch scheduler is to dynamically select service configurations of the switch, so that actual outflow streams ("pulled" through the switch) adhere to their desired target profiles as accurately as possible. Dynamic service controls (schedules) are developed to minimize deviation of actual outflow streams from their targets and suppress stream "distortion". Using appropriately selected subsets of service configurations of the switch, efficient schedules are designed, which deliver high performance at relatively low complexity. Some of these schedules are provably shown to achieve 100% pull-throughput. Moreover, simulations demonstrate that for even substantial contention of streams through the switch, due to stringent/intense target outflow profiles, the proposed schedules achieve closely their target profiles and suppress stream distortion. The switch model investigated here deviates from the classical switching paradigm. In the latter, the goal of packet scheduling is primarily to "push" as much traffic load through the switch as possible, while controlling delay to traverse the switch and keeping congestion/backlogs from exploding. In the model presented here, however, the goal of packet scheduling is to "pull" traffic streams through the switch, maintaining desirable (target) outflow profiles.
