Source author record

Bruno Gaujal

Bruno Gaujal appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

math.OC Computer Science and Game Theory Artificial Intelligence Machine Learning math.PR Performance Discrete Mathematics Information Theory math.IT Multiagent Systems Networking and Internet Architecture Systems and Control

Catalog footprint

What is connected

6works

12topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2023arXiv

Decentralized model-free reinforcement learning in stochastic games with average-reward objective

We propose the first model-free algorithm that achieves low regret performance for decentralized learning in two-player zero-sum tabular stochastic games with infinite-horizon average-reward objective. In decentralized learning, the learning agent controls only one player and tries to achieve low regret performances against an arbitrary opponent. This contrasts with centralized learning where the agent tries to approximate the Nash equilibrium by controlling both players. In our infinite-horizon undiscounted setting, additional structure assumptions is needed to provide good behaviors of learning processes : here we assume for every strategy of the opponent, the agent has a way to go from any state to any other. This assumption is the analogous to the "communicating" assumption in the MDP setting. We show that our Decentralized Optimistic Nash Q-Learning (DONQ-learning) algorithm achieves both sublinear high probability regret of order $T^{3/4}$ and sublinear expected regret of order $T^{2/3}$. Moreover, our algorithm enjoys a low computational complexity and low memory space requirement compared to the previous works of (Wei et al. 2017) and (Jafarnia-Jahromi et al. 2021) in the same setting.

preprint2015arXiv

A stochastic approximation algorithm for stochastic semidefinite programming

Motivated by applications to multi-antenna wireless networks, we propose a distributed and asynchronous algorithm for stochastic semidefinite programming. This algorithm is a stochastic approximation of a continous- time matrix exponential scheme regularized by the addition of an entropy-like term to the problem's objective function. We show that the resulting algorithm converges almost surely to an $\varepsilon$-approximation of the optimal solution requiring only an unbiased estimate of the gradient of the problem's stochastic objective. When applied to throughput maximization in wireless multiple-input and multiple-output (MIMO) systems, the proposed algorithm retains its convergence properties under a wide array of mobility impediments such as user update asynchronicities, random delays and/or ergodically changing channels. Our theoretical analysis is complemented by extensive numerical simulations which illustrate the robustness and scalability of the proposed method in realistic network conditions.

preprint2014arXiv

Penalty-regulated dynamics and robust learning procedures in games

Starting from a heuristic learning scheme for N-person games, we derive a new class of continuous-time learning dynamics consisting of a replicator-like drift adjusted by a penalty term that renders the boundary of the game's strategy space repelling. These penalty-regulated dynamics are equivalent to players keeping an exponentially discounted aggregate of their on-going payoffs and then using a smooth best response to pick an action based on these performance scores. Owing to this inherent duality, the proposed dynamics satisfy a variant of the folk theorem of evolutionary game theory and they converge to (arbitrarily precise) approximations of Nash equilibria in potential games. Motivated by applications to traffic engineering, we exploit this duality further to design a discrete-time, payoff-based learning algorithm which retains these convergence properties and only requires players to observe their in-game payoffs: moreover, the algorithm remains robust in the presence of stochastic perturbations and observation errors, and it does not require any synchronization between players.

preprint2011arXiv

Mean field for Markov Decision Processes: from Discrete to Continuous Optimization

We study the convergence of Markov Decision Processes made of a large number of objects to optimization problems on ordinary differential equations (ODE). We show that the optimal reward of such a Markov Decision Process, satisfying a Bellman equation, converges to the solution of a continuous Hamilton-Jacobi-Bellman (HJB) equation based on the mean field approximation of the Markov Decision Process. We give bounds on the difference of the rewards, and a constructive algorithm for deriving an approximating solution to the Markov Decision Process from a solution of the HJB equations. We illustrate the method on three examples pertaining respectively to investment strategies, population dynamics control and scheduling in queues are developed. They are used to illustrate and justify the construction of the controlled ODE and to show the gain obtained by solving a continuous HJB equation rather than a large discrete Bellman equation.

preprint2011arXiv

Perfect Sampling of Markov Chains with Piecewise Homogeneous Events

Perfect sampling is a technique that uses coupling arguments to provide a sample from the stationary distribution of a Markov chain in a finite time without ever computing the distribution. This technique is very efficient if all the events in the system have monotonicity property. However, in the general (non-monotone) case, this technique needs to consider the whole state space, which limits its application only to chains with a state space of small cardinality. We propose here a new approach for the general case that only needs to consider two trajectories. Instead of the original chain, we use two bounding processes (envelopes) and we show that, whenever they couple, one obtains a sample under the stationary distribution of the original chain. We show that this new approach is particularly effective when the state space can be partitioned into pieces where envelopes can be easily computed. We further show that most Markovian queueing networks have this property and we propose efficient algorithms for some of them.

preprint2010arXiv

Performance bounds in wormhole routing, a network calculus approach

We present a model of performance bound calculus on feedforward networks where data packets are routed under wormhole routing discipline. We are interested in determining maximum end-to-end delays and backlogs of messages or packets going from a source node to a destination node, through a given virtual path in the network. Our objective here is to give a network calculus approach for calculating the performance bounds. First we propose a new concept of curves that we call packet curves. The curves permit to model constraints on packet lengths of a given data flow, when the lengths are allowed to be different. Second, we use this new concept to propose an approach for calculating residual services for data flows served under non preemptive service disciplines. Third, we model a binary switch (with two input ports and two output ports), where data is served under wormhole discipline. We present our approach for computing the residual services and deduce the worst case bounds for flows passing through a wormhole binary switch. Finally, we illustrate this approach in numerical examples, and show how to extend it to feedforward networks.