Researcher profile

Baoxiang Wang

Baoxiang Wang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
12works
0followers
7topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

12 published item(s)

preprint2026arXiv

The Reciprocity Gradient

Communication is fundamental to sustaining reciprocity and cooperation in strategic interactions. We identify and formulate the influence attribution problem as the central optimization difficulty inherent in such dynamics for a learning agent: any action or signal the agent emits reshapes the reputations of many third parties along combinatorially branching paths before feeding back into its own future rewards, forcing the agent to account for all of these indirect channels at once when choosing every action. To address this, we introduce the reciprocity gradient, which explicitly backpropagates reward gradients through private estimators of opponents' policies trained from public observations. The gradient flows through the reputation chain itself analytically, rather than being estimated from sampled returns. It jointly optimizes actions and evaluative signals without intrinsic rewards or reward shaping. Empirically, the method recovers near-optimal context-sensitive policies, while sample-based baselines collapse into constant-output policies.

preprint2022arXiv

CGIBNet: Bandwidth-constrained Communication with Graph Information Bottleneck in Multi-Agent Reinforcement Learning

Communication is one of the core components for cooperative multi-agent reinforcement learning (MARL). The communication bandwidth, in many real applications, is always subject to certain constraints. To improve communication efficiency, in this article, we propose to simultaneously optimize whom to communicate with and what to communicate for each agent in MARL. By initiating the communication between agents with a directed complete graph, we propose a novel communication model, named Communicative Graph Information Bottleneck Network (CGIBNet), to simultaneously compress the graph structure and the node information with the graph information bottleneck principle. The graph structure compression is designed to cut the redundant edges for determining whom to communicate with. The node information compression aims to address the problem of what to communicate via learning compact node representations. Moreover, CGIBNet is the first universal module for bandwidth-constrained communication, which can be applied to various training frameworks (i.e., policy-based and value-based MARL frameworks) and communication modes (i.e., single-round and multi-round communication). Extensive experiments are conducted in Traffic Control and StarCraft II environments. The results indicate that our method can achieve better performance in bandwidth-constrained settings compared with state-of-the-art algorithms.

preprint2022arXiv

Complex valued semi-linear heat equations in super-critical spaces $E^s_σ$

We consider the Cauchy problem for the complex valued semi-linear heat equation $$ \partial_t u - Δu - u^m =0, \ \ u (0,x) = u_0(x), $$ where $m\geq 2$ is an integer and the initial data belong to super-critical spaces $E^s_σ$ for which the norms are defined by $$ \|f\|_{E^s_σ} = \|\langle ξ\rangle^σ2^{s|ξ|}\hat{f}(ξ)\|_{L^2}, \ \ σ\in \mathbb{R}, \ s<0. $$ If $s<0$, then any Sobolev space $H^{r}$ is a subspace of $E^s_σ$, i.e., $\cup_{r \in \mathbb{R}} H^r \subset E^s_σ$. We obtain the global existence and uniqueness of the solutions if the initial data belong to $E^s_σ$ ($s<0, \ σ\geq d/2-2/(m-1)$) and their Fourier transforms are supported in the first octant, the smallness conditions on the initial data in $E^s_σ$ are not required for the global solutions. Moreover, we show that the error between the solution $u$ and the iteration solution $u^{(j)}$ is $C^j/(j\,!)^2$. Similar results also hold if the nonlinearity $u^m$ is replaced by an exponential function $e^u-1$.

preprint2022arXiv

Differentially Private Temporal Difference Learning with Stochastic Nonconvex-Strongly-Concave Optimization

Temporal difference (TD) learning is a widely used method to evaluate policies in reinforcement learning. While many TD learning methods have been developed in recent years, little attention has been paid to preserving privacy and most of the existing approaches might face the concerns of data privacy from users. To enable complex representative abilities of policies, in this paper, we consider preserving privacy in TD learning with nonlinear value function approximation. This is challenging because such a nonlinear problem is usually studied in the formulation of stochastic nonconvex-strongly-concave optimization to gain finite-sample analysis, which would require simultaneously preserving the privacy on primal and dual sides. To this end, we employ a momentum-based stochastic gradient descent ascent to achieve a single-timescale algorithm, and achieve a good trade-off between meaningful privacy and utility guarantees of both the primal and dual sides by perturbing the gradients on both sides using well-calibrated Gaussian noises. As a result, our DPTD algorithm could provide $(ε,δ)$-differential privacy (DP) guarantee for the sensitive information encoded in transitions and retain the original power of TD learning, with the utility upper bounded by $\widetilde{\mathcal{O}}(\frac{(d\log(1/δ))^{1/8}}{(nε)^{1/4}})$ (The tilde in this paper hides the log factor.), where $n$ is the trajectory length and $d$ is the dimension. Extensive experiments conducted in OpenAI Gym show the advantages of our proposed algorithm.

preprint2022arXiv

Edge Rewiring Goes Neural: Boosting Network Resilience without Rich Features

Improving the resilience of a network is a fundamental problem in network science, which protects the underlying system from natural disasters and malicious attacks. This is traditionally achieved via successive degree-preserving edge rewiring operations, with the major limitation of being transductive. Inductively solving graph-related tasks with sequential actions is accomplished by adopting graph neural networks (GNNs) coupled with reinforcement learning under the scenario with rich graph features. However, such frameworks cannot be directly applied to resilience tasks where only pure topological structure is available. In this case, GNNs can barely learn useful information, resulting in prohibitive difficulty in making actions for successively rewiring edges under a reinforcement learning context. In this paper, we study in depth the reasons why typical GNNs cause such failure. Based on this investigation, we propose ResiNet, the first end-to-end trainable inductive framework to discover resilient network topologies while balancing network utility. To this end, we reformulate resilience optimization as an MDP equipped with edge rewiring action space, and propose a pure topology-oriented variant of GNN called filtration enhanced graph neural network (FireGNN), which can learn from graphs without rich features. Extensive experiments demonstrate that ResiNet achieves a near-optimal resilience gain on various graphs while balancing the utility, and outperforms existing approaches by a large margin.

preprint2022arXiv

Global Cauchy problems for the nonlocal (derivative) NLS in $E^s_σ$

We consider the Cauchy problem for the (derivative) nonlocal NLS in super-critical function spaces $E^s_σ$ for which the norms are defined by $$ \|f\|_{E^s_σ} = \|\langleξ\rangle^σ2^{s|ξ|}\hat{f}(ξ)\|_{L^2}, \ s<0, \ σ\in \mathbb{R}. $$ Any Sobolev space $H^{r}$ is a subspace of $E^s_σ$, i.e., $H^r \subset E^s_σ$ for any $ r,σ\in \mathbb{R}$ and $s<0$. Let $s<0$ and $σ>-1/2$ ($σ>0$) for the nonlocal NLS (for the nonlocal derivative NLS). We show the global existence and uniqueness of the solutions if the initial data belong to $E^s_σ$ and their Fourier transforms are supported in $(0, \infty)$, the smallness conditions on the initial data in $E^s_σ$ are not required for the global solutions.

preprint2022arXiv

Provably Efficient Convergence of Primal-Dual Actor-Critic with Nonlinear Function Approximation

We study the convergence of the actor-critic algorithm with nonlinear function approximation under a nonconvex-nonconcave primal-dual formulation. Stochastic gradient descent ascent is applied with an adaptive proximal term for robust learning rates. We show the first efficient convergence result with primal-dual actor-critic with a convergence rate of $\mathcal{O}\left(\sqrt{\frac{\ln \left(N d G^2 \right)}{N}}\right)$ under Markovian sampling, where $G$ is the element-wise maximum of the gradient, $N$ is the number of iterations, and $d$ is the dimension of the gradient. Our result is presented with only the Polyak-Łojasiewicz condition for the dual variables, which is easy to verify and applicable to a wide range of reinforcement learning (RL) scenarios. The algorithm and analysis are general enough to be applied to other RL settings, like multi-agent RL. Empirical results on OpenAI Gym continuous control tasks corroborate our theoretical findings.

preprint2022arXiv

Shapley Counterfactual Credits for Multi-Agent Reinforcement Learning

Centralized Training with Decentralized Execution (CTDE) has been a popular paradigm in cooperative Multi-Agent Reinforcement Learning (MARL) settings and is widely used in many real applications. One of the major challenges in the training process is credit assignment, which aims to deduce the contributions of each agent according to the global rewards. Existing credit assignment methods focus on either decomposing the joint value function into individual value functions or measuring the impact of local observations and actions on the global value function. These approaches lack a thorough consideration of the complicated interactions among multiple agents, leading to an unsuitable assignment of credit and subsequently mediocre results on MARL. We propose Shapley Counterfactual Credit Assignment, a novel method for explicit credit assignment which accounts for the coalition of agents. Specifically, Shapley Value and its desired properties are leveraged in deep MARL to credit any combinations of agents, which grants us the capability to estimate the individual credit for each agent. Despite this capability, the main technical difficulty lies in the computational complexity of Shapley Value who grows factorially as the number of agents. We instead utilize an approximation method via Monte Carlo sampling, which reduces the sample complexity while maintaining its effectiveness. We evaluate our method on StarCraft II benchmarks across different scenarios. Our method outperforms existing cooperative MARL algorithms significantly and achieves the state-of-the-art, with especially large margins on tasks with more severe difficulties.

preprint2020arXiv

Learning and Testing Variable Partitions

$ $Let $F$ be a multivariate function from a product set $Σ^n$ to an Abelian group $G$. A $k$-partition of $F$ with cost $δ$ is a partition of the set of variables $\mathbf{V}$ into $k$ non-empty subsets $(\mathbf{X}_1, \dots, \mathbf{X}_k)$ such that $F(\mathbf{V})$ is $δ$-close to $F_1(\mathbf{X}_1)+\dots+F_k(\mathbf{X}_k)$ for some $F_1, \dots, F_k$ with respect to a given error metric. We study algorithms for agnostically learning $k$ partitions and testing $k$-partitionability over various groups and error metrics given query access to $F$. In particular we show that $1.$ Given a function that has a $k$-partition of cost $δ$, a partition of cost $\mathcal{O}(k n^2)(δ+ ε)$ can be learned in time $\tilde{\mathcal{O}}(n^2 \mathrm{poly} (1/ε))$ for any $ε> 0$. In contrast, for $k = 2$ and $n = 3$ learning a partition of cost $δ+ ε$ is NP-hard. $2.$ When $F$ is real-valued and the error metric is the 2-norm, a 2-partition of cost $\sqrt{δ^2 + ε}$ can be learned in time $\tilde{\mathcal{O}}(n^5/ε^2)$. $3.$ When $F$ is $\mathbb{Z}_q$-valued and the error metric is Hamming weight, $k$-partitionability is testable with one-sided error and $\mathcal{O}(kn^3/ε)$ non-adaptive queries. We also show that even two-sided testers require $Ω(n)$ queries when $k = 2$. This work was motivated by reinforcement learning control tasks in which the set of control variables can be partitioned. The partitioning reduces the task into multiple lower-dimensional ones that are relatively easier to learn. Our second algorithm empirically increases the scores attained over previous heuristic partitioning methods applied in this context.

preprint2020arXiv

Resonant Decompositions and Global Well-posedness for 2D Zakharov-Kuznetsov Equation in Sobolev spaces of Negative Indices

The Cauchy problem for Zakharov-Kuznetsov equation on $\mathbb{R}^2$ is shown to be global well-posed for the initial date in $H^{s}$ provided $s>-\frac{1}{13}$. As conservation laws are invalid in Sobolev spaces below $L^2$, we construct an almost conserved quantity using multilinear correction term following the $I$-method introduced by Colliander, Keel, Staffilani, Takaoka and Tao. In contrast to KdV equation, the main difficulty is to handle the resonant interactions which are significant due to the multidimensional and multilinear setting of the problem. The proof relies upon the bilinear Strichartz estimate and the nonlinear Loomis-Whitney inequality.

preprint2020arXiv

Scaling limit of Modulation Spaces and Their Applications

Modulation spaces $M^s_{p,q}$ were introduced by Feichtinger \cite{Fei83} in 1983. By resorting to the wavelet basis, Bényi and Oh \cite{BeOh20} defined a modified version to Feichtinger&#39;s modulation spaces for which the symmetry scalings are emphasized for its possible applications in PDE. By carefully investigating the scaling properties of modulation spaces and their connections with the wavelet basis, we will introduce a class of generalized modulation spaces, which contain both Feichtinger&#39;s and Bényi and Oh&#39;s modulation spaces. As their applications, we will give a local well-posedness and a (small data) global well-posedness results for NLS in some rougher generalized modulation spaces, which generalize the well posedness results of \cite{BeOk09} and \cite{WaHud07}, and certain super-critical initial data in $H^s$ or in $L^p$ are involved in these spaces.

preprint2020arXiv

The Gambler&#39;s Problem and Beyond

We analyze the Gambler&#39;s problem, a simple reinforcement learning problem where the gambler has the chance to double or lose the bets until the target is reached. This is an early example introduced in the reinforcement learning textbook by Sutton and Barto (2018), where they mention an interesting pattern of the optimal value function with high-frequency components and repeating non-smooth points. It is however without further investigation. We provide the exact formula for the optimal value function for both the discrete and the continuous cases. Though simple as it might seem, the value function is pathological: fractal, self-similar, derivative taking either zero or infinity, and not written as elementary functions. It is in fact one of the generalized Cantor functions, where it holds a complexity that has been uncharted thus far. Our analyses could provide insights into improving value function approximation, gradient-based algorithms, and Q-learning, in real applications and implementations.