Researcher profile

Haifeng Zhang

Haifeng Zhang contributes to research discovery and scholarly infrastructure.

Researcher; affiliation not imported; open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
9 works
0 followers
7 topics
4 close collaborators

Published work

9 published items

preprint, 2022, arXiv

A Game-Theoretic Approach for Improving Generalization Ability of TSP Solvers

In this paper, we introduce a two-player zero-sum framework between a trainable Solver and a Data Generator to improve the generalization ability of deep learning-based solvers for the Traveling Salesman Problem (TSP). Grounded in Policy Space Response Oracle (PSRO) methods, our two-player framework outputs a population of best-responding Solvers, over which we can mix and output a combined model that achieves the least exploitability against the Generator, and thereby the most generalizable performance on different TSP tasks. We conduct experiments on a variety of TSP instances with different types and sizes. Results suggest that our Solvers achieve state-of-the-art performance even on tasks the Solver has never encountered, whilst the performance of other deep learning-based Solvers drops sharply due to over-fitting. To demonstrate the principle of our framework, we study the learning outcome of the proposed two-player game and demonstrate that the exploitability of the Solver population decreases during training, and it eventually approximates the Nash equilibrium along with the Generator.
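The abstract above measures progress by exploitability. As a rough illustration only, the sketch below computes exploitability for a small zero-sum matrix game. It is not the paper's Solver/Generator game: the function name `exploitability`, the rock-paper-scissors payoff table, and the convention of summing both players' deviation gains (some authors halve this quantity) are all assumptions made for the example.

```python
def exploitability(payoff, row_mix, col_mix):
    """Sum of both players' best-response gains in a zero-sum matrix game.

    payoff[i][j] is the row player's payoff; the column player receives
    its negative. The result is zero exactly at a Nash equilibrium.
    (Hypothetical helper, not taken from the paper.)
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Value of each pure row strategy against the column mixture.
    row_values = [
        sum(payoff[i][j] * col_mix[j] for j in range(n_cols))
        for i in range(n_rows)
    ]
    # Value (to the row player) of each pure column against the row mixture.
    col_values = [
        sum(row_mix[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]
    # Row player deviates to maximise, column player deviates to minimise.
    return max(row_values) - min(col_values)


# Rock-paper-scissors from the row player's point of view (assumed example).
RPS = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]
UNIFORM = [1 / 3, 1 / 3, 1 / 3]
PURE_ROCK = [1.0, 0.0, 0.0]
```

Under these assumptions the uniform mixture is unexploitable, while a row player who always plays rock can be exploited by a column player who switches to paper.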

preprint, 2022, arXiv

GCS: Graph-based Coordination Strategy for Multi-Agent Reinforcement Learning

Many real-world scenarios involve a team of agents that have to coordinate their policies to achieve a shared goal. Previous studies mainly focus on decentralized control to maximize a common reward and barely consider the coordination among control policies, which is critical in dynamic and complicated environments. In this work, we propose factorizing the joint team policy into a graph generator and graph-based coordinated policy to enable coordinated behaviours among agents. The graph generator adopts an encoder-decoder framework that outputs directed acyclic graphs (DAGs) to capture the underlying dynamic decision structure. We also apply the DAGness-constrained and DAG depth-constrained optimization in the graph generator to balance efficiency and performance. The graph-based coordinated policy exploits the generated decision structure. The graph generator and coordinated policy are trained simultaneously to maximize the discounted return. Empirical evaluations on Collaborative Gaussian Squeeze, Cooperative Navigation, and Google Research Football demonstrate the superiority of the proposed method.

preprint, 2022, arXiv

Joint Caching and Transmission in the Mobile Edge Network: A Multi-Agent Learning Approach

The joint caching and transmission optimization problem is challenging due to the deep coupling between decisions. This paper proposes an iterative distributed multi-agent learning approach to jointly optimize caching and transmission. The goal of this approach is to minimize the total transmission delay of all users. In this iterative approach, each iteration includes caching optimization and transmission optimization. A multi-agent reinforcement learning (MARL)-based caching network is developed to cache popular tasks, deciding which files to evict from the cache and which files to store. Based on the cached files of the caching network, the transmission network transmits cached files for users by single transmission (ST) or joint transmission (JT) using the multi-agent Bayesian learning automaton (MABLA) method. Users then access the edge servers with the minimum transmission delay. The experimental results demonstrate the performance of the proposed multi-agent learning approach.

preprint, 2022, arXiv

Learning to Identify Top Elo Ratings: A Dueling Bandits Approach

The Elo rating system is widely adopted to evaluate the skills of (chess) game and sports players. Recently it has also been integrated into machine learning algorithms for evaluating the performance of computerised AI agents. However, an accurate estimation of the Elo rating (for the top players) often requires many rounds of competitions, which can be expensive to carry out. In this paper, to improve the sample efficiency of the Elo evaluation (for top players), we propose an efficient online match scheduling algorithm. Specifically, we identify and match the top players through a dueling bandits framework and tailor the bandit algorithm to the gradient-based update of Elo. We show that it reduces the per-step memory and time complexity to constant, compared to the traditional likelihood maximization approaches requiring $O(t)$ time. Our algorithm has a regret guarantee of $\tilde{O}(\sqrt{T})$, sublinear in the number of competition rounds, and has been extended to multidimensional Elo ratings for handling intransitive games. We empirically demonstrate that our method achieves superior convergence speed and time efficiency on a variety of gaming tasks.
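For readers unfamiliar with the gradient-based Elo update mentioned above, here is a minimal sketch of the classical two-player update. It is not the paper's match scheduling algorithm. The function names, the step size `k=32`, and the logistic scale of 400 are conventional choices assumed for illustration. Each update touches only the two ratings involved, which is why the per-step cost is constant.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Probability that player A beats player B under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(rating_a, rating_b, outcome_a, k=32.0, scale=400.0):
    """Apply one Elo update after a single game.

    outcome_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    The step is proportional to (observed - expected), which is a
    stochastic gradient step on the logistic log-likelihood.
    """
    delta = k * (outcome_a - expected_score(rating_a, rating_b, scale))
    # The update is zero-sum: what A gains, B loses.
    return rating_a + delta, rating_b - delta


# Two equally rated players; A wins and gains half of k.
new_a, new_b = elo_update(1500.0, 1500.0, outcome_a=1.0)
```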

preprint, 2022, arXiv

Offline Pre-trained Multi-Agent Decision Transformer: One Big Sequence Model Tackles All SMAC Tasks

Offline reinforcement learning leverages previously collected offline datasets to learn optimal policies without needing access to the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the increased interactions among agents and with the environment. Yet, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate the research by providing large-scale datasets, and use them to examine the usage of the Decision Transformer in the context of MARL. We investigate the generalisation of MARL offline pre-training in the following three aspects: 1) between single agents and multiple agents, 2) from offline pre-training to online fine-tuning, and 3) to multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and then propose the novel architecture of multi-agent decision transformer (MADT) for effective offline learning. MADT leverages the transformer's sequence modelling ability and integrates it seamlessly with both offline and online MARL tasks. A crucial benefit of MADT is that it learns generalisable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms the state-of-the-art offline RL baselines. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency and enjoys strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalisability enhancements in MARL.

preprint, 2022, arXiv

Settling the Variance of Multi-Agent Policy Gradients

Policy gradient (PG) methods are popular reinforcement learning (RL) methods where a baseline is often applied to reduce the variance of gradient estimates. In multi-agent RL (MARL), although the PG theorem can be naturally extended, the effectiveness of multi-agent PG (MAPG) methods degrades as the variance of gradient estimates increases rapidly with the number of agents. In this paper, we offer a rigorous analysis of MAPG methods by, firstly, quantifying the contributions of the number of agents and agents' explorations to the variance of MAPG estimators. Based on this analysis, we derive the optimal baseline (OB) that achieves the minimal variance. In comparison to the OB, we measure the excess variance of existing MARL algorithms such as vanilla MAPG and COMA. Considering using deep neural networks, we also propose a surrogate version of OB, which can be seamlessly plugged into any existing PG methods in MARL. On benchmarks of Multi-Agent MuJoCo and StarCraft challenges, our OB technique effectively stabilises training and improves the performance of multi-agent PPO and COMA algorithms by a significant margin.
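To make the role of a baseline concrete, the sketch below works through the classical single-agent case on a small bandit with a softmax policy. It does not reproduce the paper's multi-agent optimal baseline (OB). The helper names, the three-armed bandit, and the reward values are assumptions chosen for illustration. The constant baseline used here is the well-known variance-minimising one, the reward average weighted by the squared norm of the score function.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(pi, action):
    """Gradient of log pi(action) with respect to the softmax logits."""
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(pi)]


def pg_mean_and_variance(pi, rewards, baseline):
    """Exact mean and total variance of the estimator score * (reward - baseline)."""
    n = len(pi)
    mean = [0.0] * n
    second_moment = 0.0
    for action, prob in enumerate(pi):
        g = [s * (rewards[action] - baseline) for s in score(pi, action)]
        for i in range(n):
            mean[i] += prob * g[i]
        second_moment += prob * sum(x * x for x in g)
    variance = second_moment - sum(m * m for m in mean)
    return mean, variance


def optimal_baseline(pi, rewards):
    """Constant baseline minimising total variance (classical single-agent result)."""
    weights = [p * sum(s * s for s in score(pi, a)) for a, p in enumerate(pi)]
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


# Assumed toy problem: three arms with hypothetical rewards.
pi = softmax([0.0, 1.0, -0.5])
rewards = [1.0, 3.0, 10.0]
mean_plain, var_plain = pg_mean_and_variance(pi, rewards, baseline=0.0)
b_star = optimal_baseline(pi, rewards)
mean_ob, var_ob = pg_mean_and_variance(pi, rewards, baseline=b_star)
```

Because the score function has zero mean under the policy, subtracting any constant baseline leaves the expected gradient unchanged and only alters the variance.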

preprint, 2020, arXiv

Bi-level Actor-Critic for Multi-agent Coordination

Coordination is one of the essential problems in multi-agent systems. Typically, multi-agent reinforcement learning (MARL) methods treat agents equally and the goal is to solve the Markov game to an arbitrary Nash equilibrium (NE) when multiple equilibria exist, thus lacking a solution for NE selection. In this paper, we treat agents unequally and consider Stackelberg equilibrium as a potentially better convergence point than Nash equilibrium in terms of Pareto superiority, especially in cooperative environments. Under Markov games, we formally define the bi-level reinforcement learning problem of finding a Stackelberg equilibrium. We propose a novel bi-level actor-critic learning method that allows agents to have different knowledge bases (thus intelligent), while their actions can still be executed simultaneously and in a distributed manner. A convergence proof is given, and the resulting learning algorithm is tested against the state of the art. We find that the proposed bi-level actor-critic algorithm converges to the Stackelberg equilibria in matrix games and finds an asymmetric solution in a highway merge environment.

preprint, 2020, arXiv

Non-Markovian Majority-Vote model

Non-Markovian dynamics pervades human activity and social networks, and it induces memory effects and burstiness in a wide range of processes including inter-event time distributions, duration of interactions in temporal networks and human mobility. Here we propose a non-Markovian Majority-Vote model (NMMV) that introduces non-Markovian effects in the standard (Markovian) Majority-Vote model (SMV). The SMV model is one of the simplest two-state stochastic models for studying opinion dynamics, and displays a continuous order-disorder phase transition at a critical noise. In the NMMV model we assume that the probability that an agent changes state depends not only on the majority state of his neighbors but also on his age, i.e. how long the agent has been in his current state. The NMMV model has two regimes: in the aging regime the probability that an agent changes state decreases with his age, while in the anti-aging regime the probability that an agent changes state increases with his age. Interestingly, we find that the critical noise at which we observe the order-disorder phase transition is a non-monotonic function of the rate $β$ of the aging (anti-aging) process. In particular, the critical noise in the aging regime displays a maximum as a function of $β$, while in the anti-aging regime it displays a minimum. This implies that the aging/anti-aging dynamics can retard/anticipate the transition and that there is an optimal rate $β$ for maximally perturbing the value of the critical noise. The analytical results obtained in the framework of the heterogeneous mean-field approach are validated by extensive numerical simulations on a large variety of network topologies.
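As background for the abstract above, the following is a minimal sketch of one update of the standard (Markovian) Majority-Vote model only. The age-dependent NMMV rule is deliberately not sketched, because the abstract does not state its functional form. The function names, the adjacency-list representation `neighbors`, and the symbol `noise` for the noise parameter are assumptions made for illustration.

```python
import random


def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def flip_probability(spin, neighbor_spins, noise):
    """Flip probability in the standard Majority-Vote model.

    An agent agreeing with the local majority flips with probability
    `noise`; one disagreeing flips with probability 1 - noise; a tie
    gives probability 1/2.
    """
    majority = sign(sum(neighbor_spins))
    return 0.5 * (1.0 - (1.0 - 2.0 * noise) * spin * majority)


def smv_step(spins, neighbors, noise, rng):
    """Pick one agent uniformly at random and flip it with the SMV probability."""
    i = rng.randrange(len(spins))
    local = [spins[j] for j in neighbors[i]]
    if rng.random() < flip_probability(spins[i], local, noise):
        spins[i] = -spins[i]
    return spins


# Assumed toy network: a ring of four agents, all initially in state +1.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
state = smv_step([1, 1, 1, 1], ring, noise=0.1, rng=random.Random(0))
```

With zero noise, an agent that already agrees with its neighbourhood never flips, so a fully ordered state is absorbing in this sketch.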

preprint, 2019, arXiv

Measuring outcome correlation for spin-s Bell cat-state and geometric phase induced spin parity effect

In terms of quantum probability statistics, the Bell inequality (BI) and its violation are extended to the spin-$s$ entangled Schrödinger cat-state (called the Bell cat-state) with both parallel and antiparallel spin-polarizations. The BI is never violated for the measuring outcome probabilities evaluated over the entire two-spin Hilbert space, except for the spin-$1/2$ entangled states. A universal Bell-type inequality (UBI) denoted by $p_{s}^{lc}\leq0$ is formulated with the local realistic model under the condition that the measuring outcomes are restricted to the subspace of spin coherent states. A spin parity effect is observed: the UBI can be violated only by the Bell cat-states of half-integer spins, not by those of integer spins. The violation of the UBI is seen to be a direct result of the non-trivial Berry phase between the spin coherent states of south- and north-pole gauges for half-integer spin, while the geometric phase is trivial for the integer spins. A maximum violation bound of the UBI is found as $p_{s}^{\max}=1$, which is valid for arbitrary half-integer spin-$s$ states.