Researcher profile

Nicholas Bambos

Nicholas Bambos contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
8works
0followers
11topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

8 published item(s)

preprint2026arXiv

Policy Gradient Methods for Non-Markovian Reinforcement Learning

We study policy gradient methods for reinforcement learning in non-Markovian decision processes (NMDPs), where observations and rewards depend on the entire interaction history. To handle this dependence, the agent maintains an internal state that is recursively updated to provide a compact summary of past observations and actions. In contrast to approaches that treat the agent state dynamics as fixed or learn it via predictive objectives, we propose a reward-centric formulation that jointly optimizes the agent state dynamics and the control policy to maximize the expected cumulative reward. To this end, we consider a class of Agent State-Markov (ASM) policies, comprising an agent state dynamics and a control policy that maps the agent state to actions. We establish a novel policy gradient theorem for ASM policies, extending the classical policy gradient results from the Markovian setting to episodic and infinite-horizon discounted NMDPs. Building on this gradient expression, we propose the Agent State-Markov Policy Gradient (ASMPG) algorithm, which leverages the recursive structure of the agent state dynamics for efficient optimization. We establish finite-time and almost sure convergence guarantees, and empirically demonstrate that, on a range of non-Markovian tasks, ASMPG outperforms baselines that learn state representations via predictive objectives.

preprint2022arXiv

Learning in Games with Quantized Payoff Observations

This paper investigates the impact of feedback quantization on multi-agent learning. In particular, we analyze the equilibrium convergence properties of the well-known "follow the regularized leader" (FTRL) class of algorithms when players can only observe a quantized (and possibly noisy) version of their payoffs. In this information-constrained setting, we show that coarser quantization triggers a qualitative shift in the convergence behavior of FTRL schemes. Specifically, if the quantization error lies below a threshold value (which depends only on the underlying game and not on the level of uncertainty entering the process or the specific FTRL variant under study), then (i) FTRL is attracted to the game's strict Nash equilibria with arbitrarily high probability; and (ii) the algorithm's asymptotic rate of convergence remains the same as in the non-quantized case. Otherwise, for larger quantization levels, these convergence properties are lost altogether: players may fail to learn anything beyond their initial state, even with full information on their payoff vectors. This is in contrast to the impact of quantization in continuous optimization problems, where the quality of the obtained solution degrades smoothly with the quantization level.

preprint2022arXiv

No Weighted-Regret Learning in Adversarial Bandits with Delays

Consider a scenario where a player chooses an action in each round $t$ out of $T$ rounds and observes the incurred cost after a delay of $d_{t}$ rounds. The cost functions and the delay sequence are chosen by an adversary. We show that in a non-cooperative game, the expected weighted ergodic distribution of play converges to the set of coarse correlated equilibria if players use algorithms that have "no weighted-regret" in the above scenario, even if they have linear regret due to too large delays. For a two-player zero-sum game, we show that no weighted-regret is sufficient for the weighted ergodic average of play to converge to the set of Nash equilibria. We prove that the FKM algorithm with $n$ dimensions achieves an expected regret of $O\left(nT^{\frac{3}{4}}+\sqrt{n}T^{\frac{1}{3}}D^{\frac{1}{3}}\right)$ and the EXP3 algorithm with $K$ arms achieves an expected regret of $O\left(\sqrt{\log K\left(KT+D\right)}\right)$ even when $D=\sum_{t=1}^{T}d_{t}$ and $T$ are unknown. These bounds use a novel doubling trick that, under mild assumptions, provably retains the regret bound for when $D$ and $T$ are known. Using these bounds, we show that FKM and EXP3 have no weighted-regret even for $d_{t}=O\left(t\log t\right)$. Therefore, algorithms with no weighted-regret can be used to approximate a CCE of a finite or convex unknown game that can only be simulated with bandit feedback, even if the simulation involves significant delays.

preprint2020arXiv

My Fair Bandit: Distributed Learning of Max-Min Fairness with Multi-player Bandits

Consider N cooperative but non-communicating players where each plays one out of M arms for T turns. Players have different utilities for each arm, representable as an NxM matrix. These utilities are unknown to the players. In each turn players select an arm and receive a noisy observation of their utility for it. However, if any other players selected the same arm that turn, all colliding players will all receive zero utility due to the conflict. No other communication or coordination between the players is possible. Our goal is to design a distributed algorithm that learns the matching between players and arms that achieves max-min fairness while minimizing the regret. We present an algorithm and prove that it is regret optimal up to a $\log\log T$ factor. This is the first max-min fairness multi-player bandit algorithm with (near) order optimal regret.

preprint2011arXiv

Cone Schedules for Processing Systems in Fluctuating Environments

We consider a generalized processing system having several queues, where the available service rate combinations are fluctuating over time due to reliability and availability variations. The objective is to allocate the available resources, and corresponding service rates, in response to both workload and service capacity considerations, in order to maintain the long term stability of the system. The service configurations are completely arbitrary, including negative service rates which represent forwarding and service-induced cross traffic. We employ a trace-based trajectory asymptotic technique, which requires minimal assumptions about the arrival dynamics of the system. We prove that cone schedules, which leverage the geometry of the queueing dynamics, maximize the system throughput for a broad class of processing systems, even under adversarial arrival processes. We study the impact of fluctuating service availability, where resources are available only some of the time, and the schedule must dynamically respond to the changing available service rates, establishing both the capacity of such systems and the class of schedules which will stabilize the system at full capacity. The rich geometry of the system dynamics leads to important insights for stability, performance and scalability, and substantially generalizes previous findings. The processing system studied here models a broad variety of computer, communication and service networks, including varying channel conditions and cross-traffic in wireless networking, and call centers with fluctuating capacity. The findings have implications for bandwidth and processor allocation in communication networks and workforce scheduling in congested call centers.

preprint2011arXiv

Fairness in overloaded parallel queues

Maximizing throughput for heterogeneous parallel server queues has received quite a bit of attention from the research community and the stability region for such systems is well understood. However, many real-world systems have periods where they are temporarily overloaded. Under such scenarios, the unstable queues often starve limited resources. This work examines what happens during periods of temporary overload. Specifically, we look at how to fairly distribute stress. We explore the dynamics of the queue workloads under the MaxWeight scheduling policy during long periods of stress and discuss how to tune this policy in order to achieve a target fairness ratio across these workloads.

preprint2010arXiv

Decentralized Admission Control for Power-Controlled Wireless Links

This paper deals with the problem of admission control/channel access in power-controlled decentralized wireless networks, in which the quality-of-service (QoS) is expressed in terms of the signal-to-interference ratio (SIR). We analyze a previously proposed admission control algorithm, which was designed to maintain the SIR of operational (active) links above some given threshold at all times (protection of active links). This protection property ensures that as new users attempt to join the network, the already established links sustain their quality. The considered scheme may be thus applicable in some cognitive radio networks, where the fundamental premise is that secondary users may be granted channel access only if it does not cause disturbance to primary users. The admission control algorithm was previously analyzed under the assumption of affine interference functions. This paper extends all the previous results to arbitrary standard interference functions, which capture many important receiver designs, including optimal linear reception in the sense of maximizing the SIR and the worst-case receiver design. Furthermore, we provide novel conditions for protection of active users under the considered control scheme when individual power constraints are imposed on each link. Finally, we consider the possibility of a joint optimization of transmitters and receivers in networks with linear transceivers, which includes linear beamforming in multiple antenna systems. Transmitter optimization is performed alternately with receiver optimization to generate non-decreasing sequences of SIRs. Numerical evaluations show that additional transmitter side optimization has potential for significant performance gains.

preprint2010arXiv

Packet Scheduling in Switches with Target Outflow Profiles

The problem of packet scheduling for traffic streams with target outflow profiles traversing input queued switches is formulated in this paper. Target outflow profiles specify the desirable inter-departure times of packets leaving the switch from each traffic stream. The goal of the switch scheduler is to dynamically select service configurations of the switch, so that actual outflow streams ("pulled" through the switch) adhere to their desired target profiles as accurately as possible. Dynamic service controls (schedules) are developed to minimize deviation of actual outflow streams from their targets and suppress stream "distortion". Using appropriately selected subsets of service configurations of the switch, efficient schedules are designed, which deliver high performance at relatively low complexity. Some of these schedules are provably shown to achieve 100% pull-throughput. Moreover, simulations demonstrate that for even substantial contention of streams through the switch, due to stringent/intense target outflow profiles, the proposed schedules achieve closely their target profiles and suppress stream distortion. The switch model investigated here deviates from the classical switching paradigm. In the latter, the goal of packet scheduling is primarily to "push" as much traffic load through the switch as possible, while controlling delay to traverse the switch and keeping congestion/backlogs from exploding. In the model presented here, however, the goal of packet scheduling is to "pull" traffic streams through the switch, maintaining desirable (target) outflow profiles.