Source author record

Haifeng Zhang

Haifeng Zhang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Multiagent Systems, Machine Learning, Artificial Intelligence, cond-mat.stat-mech, physics.soc-ph, astro-ph.IM, Networking and Internet Architecture, physics.space-ph, quant-ph

Catalog footprint

What is connected

11 works

9 topics

4 close collaborators


preprint, 2022, arXiv

A Game-Theoretic Approach for Improving Generalization Ability of TSP Solvers

In this paper, we introduce a two-player zero-sum framework between a trainable Solver and a Data Generator to improve the generalization ability of deep learning-based solvers for the Traveling Salesman Problem (TSP). Grounded in Policy Space Response Oracle (PSRO) methods, our two-player framework outputs a population of best-responding Solvers, over which we can mix to produce a combined model that achieves the least exploitability against the Generator, and thereby the most generalizable performance on different TSP tasks. We conduct experiments on a variety of TSP instances of different types and sizes. Results suggest that our Solvers achieve state-of-the-art performance even on tasks the Solver has never seen, whilst the performance of other deep learning-based Solvers drops sharply due to over-fitting. To demonstrate the principle of our framework, we study the learning outcome of the proposed two-player game and show that the exploitability of the Solver population decreases during training and eventually approximates the Nash equilibrium along with the Generator.
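As a rough illustration of the exploitability measure mentioned in the abstract, the sketch below computes it for a small two-player zero-sum matrix game. It is not the paper's Solver/Generator setup: the function name `exploitability`, the rock-paper-scissors payoff table and the mixed strategies are all assumptions chosen for illustration.

```python
def exploitability(payoff, row_mix, col_mix):
    """Total gain both players could obtain by switching to a best response.

    payoff[i][j] is the row player's payoff; the column player receives
    -payoff[i][j] (zero-sum). row_mix and col_mix are probability lists.
    This is a generic matrix-game sketch, not the paper's implementation.
    """
    n_rows, n_cols = len(payoff), len(payoff[0])
    # Row player's expected payoff for each pure row against col_mix.
    row_values = [
        sum(payoff[i][j] * col_mix[j] for j in range(n_cols))
        for i in range(n_rows)
    ]
    # Row player's expected payoff for each pure column against row_mix.
    col_values = [
        sum(row_mix[i] * payoff[i][j] for i in range(n_rows))
        for j in range(n_cols)
    ]
    # The row player maximizes; the column player minimizes the row payoff.
    return max(row_values) - min(col_values)


# Hypothetical example game: rock-paper-scissors.
RPS = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
UNIFORM = [1 / 3, 1 / 3, 1 / 3]

if __name__ == "__main__":
    print(exploitability(RPS, UNIFORM, UNIFORM))    # close to zero at the equilibrium
    print(exploitability(RPS, [1, 0, 0], UNIFORM))  # a pure strategy can be exploited
```

A strategy pair with exploitability near zero is close to a Nash equilibrium, which is the sense in which a decreasing value during training indicates convergence.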

preprint, 2022, arXiv

GCS: Graph-based Coordination Strategy for Multi-Agent Reinforcement Learning

Many real-world scenarios involve a team of agents that have to coordinate their policies to achieve a shared goal. Previous studies mainly focus on decentralized control to maximize a common reward and barely consider the coordination among control policies, which is critical in dynamic and complicated environments. In this work, we propose factorizing the joint team policy into a graph generator and graph-based coordinated policy to enable coordinated behaviours among agents. The graph generator adopts an encoder-decoder framework that outputs directed acyclic graphs (DAGs) to capture the underlying dynamic decision structure. We also apply the DAGness-constrained and DAG depth-constrained optimization in the graph generator to balance efficiency and performance. The graph-based coordinated policy exploits the generated decision structure. The graph generator and coordinated policy are trained simultaneously to maximize the discounted return. Empirical evaluations on Collaborative Gaussian Squeeze, Cooperative Navigation, and Google Research Football demonstrate the superiority of the proposed method.

preprint, 2022, arXiv

Joint Caching and Transmission in the Mobile Edge Network: A Multi-Agent Learning Approach

The joint caching and transmission optimization problem is challenging due to the deep coupling between decisions. This paper proposes an iterative distributed multi-agent learning approach to jointly optimize caching and transmission. The goal of this approach is to minimize the total transmission delay of all users. Each iteration includes caching optimization and transmission optimization. A multi-agent reinforcement learning (MARL)-based caching network is developed to cache popular tasks, deciding which files to evict from the cache and which files to store. Based on the files cached by the caching network, the transmission network transmits cached files to users by single transmission (ST) or joint transmission (JT) using the multi-agent Bayesian learning automaton (MABLA) method. Users then access the edge servers with the minimum transmission delay. The experimental results demonstrate the performance of the proposed multi-agent learning approach.

preprint, 2022, arXiv

Learning to Identify Top Elo Ratings: A Dueling Bandits Approach

The Elo rating system is widely adopted to evaluate the skills of (chess) game and sports players. Recently it has also been integrated into machine learning algorithms for evaluating the performance of computerised AI agents. However, an accurate estimation of the Elo rating (for the top players) often requires many rounds of competitions, which can be expensive to carry out. In this paper, to improve the sample efficiency of the Elo evaluation (for top players), we propose an efficient online match scheduling algorithm. Specifically, we identify and match the top players through a dueling bandits framework and tailor the bandit algorithm to the gradient-based update of Elo. We show that it reduces the per-step memory and time complexity to constant, compared to the traditional likelihood maximization approaches requiring $O(t)$ time. Our algorithm has a regret guarantee of $\tilde{O}(\sqrt{T})$, sublinear in the number of competition rounds, and has been extended to multidimensional Elo ratings for handling intransitive games. We empirically demonstrate that our method achieves superior convergence speed and time efficiency on a variety of gaming tasks.
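The gradient-based Elo update referred to in the abstract can be sketched with the classic Elo rule, which adjusts two ratings after each match in constant time and memory. This is only the standard update, not the paper's match scheduling algorithm; the function names, the step size `k=16.0` and the player labels are assumptions.

```python
def expected_score(rating_a, rating_b, scale=400.0):
    """Classic Elo win probability of player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


def elo_update(ratings, a, b, score_a, k=16.0):
    """Update both ratings in place after one match.

    score_a is 1.0 for a win by `a`, 0.5 for a draw, 0.0 for a loss.
    Each call touches only two ratings, so the per-step cost is constant.
    """
    delta = k * (score_a - expected_score(ratings[a], ratings[b]))
    ratings[a] += delta
    ratings[b] -= delta
    return ratings


if __name__ == "__main__":
    # Hypothetical players starting from equal ratings.
    table = {"x": 1500.0, "y": 1500.0}
    elo_update(table, "x", "y", score_a=1.0)
    print(table)
```

The update moves each rating in proportion to the gap between the observed and predicted result, which is why it can be read as a stochastic gradient step on a logistic likelihood.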

preprint, 2022, arXiv

Offline Pre-trained Multi-Agent Decision Transformer: One Big Sequence Model Tackles All SMAC Tasks

Offline reinforcement learning leverages previously collected offline datasets to learn optimal policies without needing to access the real environment. Such a paradigm is also desirable for multi-agent reinforcement learning (MARL) tasks, given the increased interactions among agents and with the environment. Yet, in MARL, the paradigm of offline pre-training with online fine-tuning has not been studied, nor are datasets or benchmarks for offline MARL research available. In this paper, we facilitate the research by providing large-scale datasets, and use them to examine the usage of the Decision Transformer in the context of MARL. We investigate the generalisation of MARL offline pre-training in the following three aspects: 1) between single agents and multiple agents, 2) from offline pre-training to online fine-tuning, and 3) to multiple downstream tasks with few-shot and zero-shot capabilities. We start by introducing the first offline MARL dataset with diverse quality levels based on the StarCraft II environment, and then propose the novel architecture of multi-agent decision transformer (MADT) for effective offline learning. MADT leverages the transformer's sequence modelling ability and integrates it seamlessly with both offline and online MARL tasks. A crucial benefit of MADT is that it learns generalisable policies that can transfer between different types of agents under different task scenarios. On the StarCraft II offline dataset, MADT outperforms the state-of-the-art offline RL baselines. When applied to online tasks, the pre-trained MADT significantly improves sample efficiency and enjoys strong performance in both few-shot and zero-shot cases. To the best of our knowledge, this is the first work that studies and demonstrates the effectiveness of offline pre-trained models in terms of sample efficiency and generalisability enhancements in MARL.

preprint, 2022, arXiv

Settling the Variance of Multi-Agent Policy Gradients

Policy gradient (PG) methods are popular reinforcement learning (RL) methods where a baseline is often applied to reduce the variance of gradient estimates. In multi-agent RL (MARL), although the PG theorem can be naturally extended, the effectiveness of multi-agent PG (MAPG) methods degrades as the variance of gradient estimates increases rapidly with the number of agents. In this paper, we offer a rigorous analysis of MAPG methods by, firstly, quantifying the contributions of the number of agents and agents' explorations to the variance of MAPG estimators. Based on this analysis, we derive the optimal baseline (OB) that achieves the minimal variance. In comparison to the OB, we measure the excess variance of existing MARL algorithms such as vanilla MAPG and COMA. For use with deep neural networks, we also propose a surrogate version of OB, which can be seamlessly plugged into any existing PG methods in MARL. On benchmarks of Multi-Agent MuJoCo and StarCraft challenges, our OB technique effectively stabilises training and improves the performance of multi-agent PPO and COMA algorithms by a significant margin.
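To make the role of a baseline concrete, the sketch below computes the exact mean and variance of a score-function gradient estimator for a single agent with a two-action Bernoulli policy. It is a textbook single-agent toy, not the multi-agent optimal baseline derived in the paper; the function names, the parameter value and the rewards are assumptions.

```python
import math


def _policy(theta):
    """Bernoulli policy: action 1 with probability sigmoid(theta)."""
    p = 1.0 / (1.0 + math.exp(-theta))
    probs = [1.0 - p, p]   # probs[a] for a in {0, 1}
    score = [-p, 1.0 - p]  # d/dtheta log pi(a)
    return probs, score


def pg_estimator_stats(theta, rewards, baseline):
    """Exact mean and variance of (R(a) - b) * d/dtheta log pi(a)."""
    probs, score = _policy(theta)
    grads = [(rewards[a] - baseline) * score[a] for a in (0, 1)]
    mean = sum(probs[a] * grads[a] for a in (0, 1))
    var = sum(probs[a] * (grads[a] - mean) ** 2 for a in (0, 1))
    return mean, var


def min_variance_baseline(theta, rewards):
    """Single-agent minimum-variance baseline E[s^2 R] / E[s^2]."""
    probs, score = _policy(theta)
    num = sum(probs[a] * score[a] ** 2 * rewards[a] for a in (0, 1))
    den = sum(probs[a] * score[a] ** 2 for a in (0, 1))
    return num / den


if __name__ == "__main__":
    theta, rewards = 0.3, [1.0, 3.0]  # hypothetical values
    b_star = min_variance_baseline(theta, rewards)
    print(pg_estimator_stats(theta, rewards, baseline=0.0))
    print(pg_estimator_stats(theta, rewards, baseline=b_star))
```

Both calls return the same mean, because subtracting an action-independent baseline leaves the gradient estimate unbiased, while the variance shrinks. In this two-action toy the variance happens to vanish entirely, which is a property of the setup rather than of baselines in general.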

preprint, 2020, arXiv

Bi-level Actor-Critic for Multi-agent Coordination

Coordination is one of the essential problems in multi-agent systems. Typically, multi-agent reinforcement learning (MARL) methods treat agents equally and the goal is to solve the Markov game to an arbitrary Nash equilibrium (NE) when multiple equilibria exist, thus lacking a solution for NE selection. In this paper, we treat agents unequally and consider Stackelberg equilibrium as a potentially better convergence point than Nash equilibrium in terms of Pareto superiority, especially in cooperative environments. Under Markov games, we formally define the bi-level reinforcement learning problem of finding a Stackelberg equilibrium. We propose a novel bi-level actor-critic learning method that allows agents to have different knowledge bases (thus intelligent), while their actions can still be executed simultaneously and in a distributed manner. A convergence proof is given, and the resulting learning algorithm is tested against the state of the art. We found that the proposed bi-level actor-critic algorithm successfully converged to the Stackelberg equilibria in matrix games and found an asymmetric solution in a highway merge environment.
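The equilibrium selection issue described in the abstract can be seen in a small matrix game. The sketch below lists the pure Nash equilibria of a cooperative 2x2 game and then the pure-strategy Stackelberg outcome, where a leader commits first and the follower best-responds. It does not reproduce the bi-level actor-critic method; the function names, the payoff tables and the tie-breaking rule (lowest action index) are assumptions.

```python
def pure_nash(leader_payoff, follower_payoff):
    """All pure-strategy Nash equilibria (a, b) of a bimatrix game."""
    n, m = len(leader_payoff), len(leader_payoff[0])
    equilibria = []
    for a in range(n):
        for b in range(m):
            leader_ok = all(
                leader_payoff[a][b] >= leader_payoff[a2][b] for a2 in range(n)
            )
            follower_ok = all(
                follower_payoff[a][b] >= follower_payoff[a][b2] for b2 in range(m)
            )
            if leader_ok and follower_ok:
                equilibria.append((a, b))
    return equilibria


def stackelberg_pure(leader_payoff, follower_payoff):
    """Leader commits to a pure action; follower best-responds to it."""
    best = None
    for a, row in enumerate(follower_payoff):
        # Follower's best response to action a (ties go to the lowest index).
        b = max(range(len(row)), key=lambda j: row[j])
        value = leader_payoff[a][b]
        if best is None or value > best[0]:
            best = (value, a, b)
    return best[1], best[2]


# Hypothetical cooperative game with two pure Nash equilibria.
LEADER = [[2, 0], [0, 1]]
FOLLOWER = [[2, 0], [0, 1]]

if __name__ == "__main__":
    print(pure_nash(LEADER, FOLLOWER))         # two candidate equilibria
    print(stackelberg_pure(LEADER, FOLLOWER))  # a single selected outcome
```

In this example the leader's commitment singles out the Pareto-superior outcome, whereas the Nash concept alone leaves both equilibria on the table.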

preprint, 2020, arXiv

Non-Markovian Majority-Vote model

Non-Markovian dynamics pervades human activity and social networks, and it induces memory effects and burstiness in a wide range of processes including inter-event time distributions, duration of interactions in temporal networks and human mobility. Here we propose a non-Markovian Majority-Vote model (NMMV) that introduces non-Markovian effects in the standard (Markovian) Majority-Vote model (SMV). The SMV model is one of the simplest two-state stochastic models for studying opinion dynamics, and displays a continuous order-disorder phase transition at a critical noise. In the NMMV model we assume that the probability that an agent changes state depends not only on the majority state of his neighbors but also on his age, i.e. how long the agent has been in his current state. The NMMV model has two regimes: in the aging regime the probability that an agent changes state decreases with his age, while in the anti-aging regime the probability that an agent changes state increases with his age. Interestingly, we find that the critical noise at which we observe the order-disorder phase transition is a non-monotonic function of the rate $β$ of the aging (anti-aging) process. In particular, the critical noise in the aging regime displays a maximum as a function of $β$, while in the anti-aging regime it displays a minimum. This implies that the aging/anti-aging dynamics can retard/anticipate the transition and that there is an optimal rate $β$ for maximally perturbing the value of the critical noise. The analytical results obtained in the framework of the heterogeneous mean-field approach are validated by extensive numerical simulations on a large variety of network topologies.

preprint, 2019, arXiv

Measuring outcome correlation for spin-s Bell cat-state and geometric phase induced spin parity effect

In terms of quantum probability statistics, the Bell inequality (BI) and its violation are extended to the spin-$s$ entangled Schrödinger cat-state (called the Bell cat-state) with both parallel and antiparallel spin-polarizations. The BI is never violated for the measuring outcome probabilities evaluated over the entire two-spin Hilbert space, except for the spin-$1/2$ entangled states. A universal Bell-type inequality (UBI) denoted by $p_{s}^{lc}\leq 0$ is formulated with the local realistic model under the condition that the measuring outcomes are restricted to the subspace of spin coherent states. A spin parity effect is observed: the UBI can be violated only by the Bell cat-states of half-integer spins, not integer spins. The violation of the UBI is seen to be a direct result of the non-trivial Berry phase between the spin coherent states of south- and north-pole gauges for half-integer spin, while the geometric phase is trivial for integer spins. A maximum violation bound of the UBI is found as $p_{s}^{\max}=1$, which is valid for arbitrary half-integer spin-$s$ states.

preprint, 2016, arXiv

Critical noise of majority-vote model on complex networks

The majority-vote model with noise is one of the simplest nonequilibrium statistical models and has been extensively studied in the context of complex networks. However, an understanding of the relationship between the critical noise, where the order-disorder phase transition takes place, and the topology of the underlying networks is still lacking. In this paper, we use the heterogeneous mean-field theory to derive the rate equation governing the model's dynamics, which analytically determines the critical noise $f_c$ in the limit of infinite network size $N\rightarrow \infty$. The result shows that $f_c$ depends on the ratio of ${\left\langle k \right\rangle }$ to ${\left\langle k^{3/2} \right\rangle }$, where ${\left\langle k \right\rangle }$ and ${\left\langle k^{3/2} \right\rangle }$ are the average degree and the $3/2$ order moment of the degree distribution, respectively. Furthermore, we consider the finite size effect, where stochastic fluctuations must be taken into account. To this end, we derive the Langevin equation and obtain the potential of the corresponding Fokker-Planck equation. This allows us to calculate the effective critical noise $f_c(N)$ at which the susceptibility is maximal in finite size networks. We find that the difference $f_c-f_c(N)$ decays with $N$ as a power law and vanishes for $N\rightarrow \infty$. All the theoretical results are confirmed by extensive Monte Carlo simulations on random $k$-regular networks, Erdős-Rényi random networks and scale-free networks.
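The two ingredients named in the abstract can be sketched as follows: the standard flip probability of the majority-vote model with noise $f$, and the degree-moment ratio $\langle k \rangle / \langle k^{3/2} \rangle$ on which $f_c$ is said to depend. The closed form of $f_c$ is not given in the abstract, so only the ratio is computed here; the function names and the example inputs are assumptions.

```python
def flip_probability(spin, neighbor_spins, noise):
    """Standard majority-vote rule for a spin in {+1, -1}.

    An agent aligned with its neighbors' majority flips with probability
    `noise`; an agent opposed to it flips with probability 1 - noise;
    on a tie the flip probability is 1/2.
    """
    total = sum(neighbor_spins)
    sign = (total > 0) - (total < 0)  # sign of the local field, 0 on a tie
    return 0.5 * (1.0 - (1.0 - 2.0 * noise) * spin * sign)


def degree_moment_ratio(degrees):
    """Ratio <k> / <k^(3/2)> for a given degree sequence."""
    mean_k = sum(degrees) / len(degrees)
    mean_k32 = sum(k ** 1.5 for k in degrees) / len(degrees)
    return mean_k / mean_k32


if __name__ == "__main__":
    # Hypothetical local configuration: two neighbors up, one down.
    print(flip_probability(+1, [1, 1, -1], noise=0.1))
    print(flip_probability(-1, [1, 1, -1], noise=0.1))
    # Hypothetical 4-regular degree sequence.
    print(degree_moment_ratio([4, 4, 4]))
```

For a regular network every degree is equal, so the ratio reduces to $k^{-1/2}$; heterogeneous degree sequences change the ratio through the higher moment.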

preprint, 2014, arXiv

Experiment of Diffuse Reflection Laser Ranging to Space Debris and Data Analysis

Space debris poses a serious threat to human space activities and needs to be measured and cataloged. As a new technology for space target surveillance, DRLR (Diffuse Reflection Laser Ranging) offers much higher measurement accuracy than microwave radar and electro-optical measurement. Based on laser ranging data of space debris collected by the DRLR system at SHAO (Shanghai Astronomical Observatory) in March-April 2013, this paper analyzes the characteristics and precision of the laser ranging data and discusses its application to OD (Orbit Determination) of space debris, which is implemented for the first time in China. The experiment indicates that the precision of the laser ranging data can reach 39 cm to 228 cm. When the data is sufficient (4 arcs over 3 days), the orbit accuracy of space debris can reach 50 m.

