Researcher profile

Longbo Huang

Longbo Huang contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
10 works
0 followers
8 topics
4 close collaborators

Published work

10 published items

preprint · 2026 · arXiv

Real-Time Parallel Counterfactual Regret Minimization

Counterfactual Regret Minimization (CFR) is the dominant algorithmic family for solving large imperfect-information games, underpinning breakthroughs such as Libratus and Pluribus in No-Limit Texas Hold'em poker. In real-time game-playing systems, the solver must compute a near-equilibrium strategy within a strict time budget of only a few seconds per decision, and the number of CFR iterations completed in this window directly determines play strength. We present \textbf{Parallel CFR}, the first parallelization framework for real-time depth-limited CFR solving that seamlessly integrates pruning, abstraction, and advanced CFR variants. We decompose each CFR iteration into a pipeline of seven stages and identify two orthogonal dimensions of parallelism: \emph{by information set} and \emph{by tree node}. Leaf node evaluation is offloaded to GPUs via batched neural network inference, creating a heterogeneous CPU--GPU pipeline. Experiments on Heads-Up No-Limit Texas Hold'em demonstrate that Parallel CFR achieves $3.3$--$3.4\times$ speedup over the single-threaded baseline on postflop streets, with per-iteration time of ${\sim}47$--$54$~ms on a depth-limited game tree with over $1$ billion histories. All experiments run on a single desktop-class device (NVIDIA DGX Spark), enabling hundreds of CFR iterations within a typical real-time decision budget without requiring datacenter-scale infrastructure.
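As a rough illustration of the per-information-set work that the abstract says can be parallelized, the sketch below applies regret matching, the standard strategy update inside CFR, independently to each information set. It is not the paper's seven-stage pipeline: the names `regret_matching` and `strategies_by_infoset` and the dictionary layout are assumptions made for this example, and a CPython thread pool only shows the structure of the decomposition, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def regret_matching(cumulative_regrets: List[float]) -> List[float]:
    """Turn cumulative regrets at one information set into a strategy."""
    positive = [max(r, 0.0) for r in cumulative_regrets]
    total = sum(positive)
    if total > 0.0:
        # Play each action in proportion to its positive regret.
        return [p / total for p in positive]
    # No action has positive regret: fall back to the uniform strategy.
    n = len(cumulative_regrets)
    return [1.0 / n] * n


def strategies_by_infoset(
    regret_table: Dict[str, List[float]], workers: int = 4
) -> Dict[str, List[float]]:
    """Update every information set independently (hypothetical layout)."""
    keys = list(regret_table)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        strategies = list(pool.map(regret_matching, (regret_table[k] for k in keys)))
    return dict(zip(keys, strategies))
```

Because each information set reads and writes only its own regrets in this step, the updates do not conflict, which is what makes this dimension of parallelism natural.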

preprint · 2022 · arXiv

Adaptive Best-of-Both-Worlds Algorithm for Heavy-Tailed Multi-Armed Bandits

In this paper, we generalize the concept of heavy-tailed multi-armed bandits to adversarial environments, and develop robust best-of-both-worlds algorithms for heavy-tailed multi-armed bandits (MAB), where losses have $α$-th ($1<α\le 2$) moments bounded by $σ^α$, while the variances may not exist. Specifically, we design an algorithm, \texttt{HTINF}. When the heavy-tail parameters $α$ and $σ$ are known to the agent, \texttt{HTINF} simultaneously achieves the optimal regret for both stochastic and adversarial environments, without knowing the actual environment type a priori. When $α,σ$ are unknown, \texttt{HTINF} achieves a $\log T$-style instance-dependent regret in stochastic cases and an $o(T)$ no-regret guarantee in adversarial cases. We further develop an algorithm, \texttt{AdaTINF}, achieving $\mathcal O(σK^{1-\nicefrac{1}{α}}T^{\nicefrac{1}{α}})$ minimax optimal regret even in adversarial settings, without prior knowledge of $α$ and $σ$. This result matches the known regret lower bound (Bubeck et al., 2013), which assumes a stochastic environment with both $α$ and $σ$ known. To our knowledge, the proposed \texttt{HTINF} algorithm is the first to enjoy a best-of-both-worlds regret guarantee, and \texttt{AdaTINF} is the first algorithm that can adapt to both $α$ and $σ$ to achieve the optimal gap-independent regret bound in the classical heavy-tailed stochastic MAB setting and in our novel adversarial formulation.
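To get a feel for the minimax rate quoted above, the sketch below evaluates $σK^{1-1/α}T^{1/α}$ for given values. The function name and arguments are assumptions for illustration, and the constant hidden by the $\mathcal O(\cdot)$ is simply dropped, so the output is a growth rate rather than an actual regret value.

```python
def minimax_regret_rate(sigma: float, num_arms: int, horizon: int, alpha: float) -> float:
    """Evaluate sigma * K^(1 - 1/alpha) * T^(1/alpha), ignoring constants."""
    if not 1.0 < alpha <= 2.0:
        raise ValueError("alpha must satisfy 1 < alpha <= 2")
    return sigma * num_arms ** (1.0 - 1.0 / alpha) * horizon ** (1.0 / alpha)


# With alpha = 2 the rate reduces to sigma * sqrt(K * T).
rate_finite_variance = minimax_regret_rate(1.0, 10, 1000, 2.0)
# A heavier tail (smaller alpha) gives a larger rate when T > K.
rate_heavy_tail = minimax_regret_rate(1.0, 10, 1000, 1.5)
```

At $α = 2$ the expression becomes the familiar $σ\sqrt{KT}$ rate, and lowering $α$ moves the exponent of $T$ from $1/2$ toward $1$.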

preprint · 2022 · arXiv

Effective Multi-User Delay-Constrained Scheduling with Deep Recurrent Reinforcement Learning

Multi-user delay-constrained scheduling is important in many real-world applications including wireless communication, live streaming, and cloud computing. Yet, it poses a critical challenge since the scheduler needs to make real-time decisions to guarantee the delay and resource constraints simultaneously without prior information of system dynamics, which can be time-varying and hard to estimate. Moreover, many practical scenarios suffer from partial observability issues, e.g., due to sensing noise or hidden correlation. To tackle these challenges, we propose a deep reinforcement learning (DRL) algorithm, named Recurrent Softmax Delayed Deep Double Deterministic Policy Gradient ($\mathtt{RSD4}$), which is a data-driven method based on a Partially Observed Markov Decision Process (POMDP) formulation. $\mathtt{RSD4}$ guarantees resource and delay constraints by Lagrangian dual and delay-sensitive queues, respectively. It also efficiently tackles partial observability with a memory mechanism enabled by the recurrent neural network (RNN) and introduces user-level decomposition and node-level merging to ensure scalability. Extensive experiments on simulated/real-world datasets demonstrate that $\mathtt{RSD4}$ is robust to system dynamics and partially observable environments, and achieves superior performance over existing DRL and non-DRL-based methods.
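The abstract does not spell out how the Lagrangian dual is updated, so the following is only a generic projected dual-ascent step of the kind commonly used to enforce an average resource constraint. It should not be read as the update used by $\mathtt{RSD4}$; the function name, arguments, and step size are all assumptions.

```python
def dual_ascent_step(
    multiplier: float, resource_used: float, resource_budget: float, step_size: float
) -> float:
    """One projected gradient-ascent step on a Lagrange multiplier.

    The multiplier grows when usage exceeds the budget and shrinks otherwise,
    and is clipped at zero so that it stays a valid dual variable.
    """
    return max(0.0, multiplier + step_size * (resource_used - resource_budget))


# Usage above the budget raises the price of the resource.
raised = dual_ascent_step(0.5, resource_used=3.0, resource_budget=2.0, step_size=0.1)
# Usage below the budget lowers it, but never below zero.
clipped = dual_ascent_step(0.05, resource_used=1.0, resource_budget=2.0, step_size=0.1)
```

In schemes of this kind the multiplier acts as a price on the resource that is subtracted from the reward seen by the policy.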

preprint · 2022 · arXiv

Modality Competition: What Makes Joint Training of Multi-modal Network Fail in Deep Learning? (Provably)

Despite the remarkable success of deep multi-modal learning in practice, it has not been well-explained in theory. Recently, it has been observed that the best uni-modal network outperforms the jointly trained multi-modal network, which is counter-intuitive since multiple signals generally bring more information. This work provides a theoretical explanation for the emergence of such a performance gap in neural networks for the prevalent joint training framework. Based on a simplified data distribution that captures the realistic property of multi-modal data, we prove that for the multi-modal late-fusion network with (smoothed) ReLU activation trained jointly by gradient descent, different modalities will compete with each other. The encoder networks will learn only a subset of modalities. We refer to this phenomenon as modality competition. The losing modalities, which fail to be discovered, are the source of the sub-optimality of joint training. Experimentally, we illustrate that modality competition matches the intrinsic behavior of late-fusion joint training.

preprint · 2022 · arXiv

Plan Better Amid Conservatism: Offline Multi-Agent Reinforcement Learning with Actor Rectification

Conservatism has led to significant progress in offline reinforcement learning (RL) where an agent learns from pre-collected datasets. However, as many real-world scenarios involve interaction among multiple agents, it is important to resolve offline RL in the multi-agent setting. Given the recent success of transferring online RL algorithms to the multi-agent setting, one may expect that offline RL algorithms will also transfer to multi-agent settings directly. Surprisingly, we empirically observe that conservative offline RL algorithms do not work well in the multi-agent setting -- the performance degrades significantly with an increasing number of agents. Towards mitigating the degradation, we identify a key issue that non-concavity of the value function makes the policy gradient improvements prone to local optima. Multiple agents exacerbate the problem severely, since the suboptimal policy by any agent can lead to uncoordinated global failure. Following this intuition, we propose a simple yet effective method, Offline Multi-Agent RL with Actor Rectification (OMAR), which combines the first-order policy gradients and zeroth-order optimization methods to better optimize the conservative value functions over the actor parameters. Despite the simplicity, OMAR achieves state-of-the-art results in a variety of multi-agent control tasks.

preprint · 2022 · arXiv

Provable Generalization of Overparameterized Meta-learning Trained with SGD

Despite the superior empirical success of deep meta-learning, theoretical understanding of overparameterized meta-learning is still limited. This paper studies the generalization of a widely used meta-learning approach, Model-Agnostic Meta-Learning (MAML), which aims to find a good initialization for fast adaptation to new tasks. Under a mixed linear regression model, we analyze the generalization properties of MAML trained with SGD in the overparameterized regime. We provide both upper and lower bounds for the excess risk of MAML, which captures how SGD dynamics affect these generalization bounds. With such sharp characterizations, we further explore how various learning parameters impact the generalization capability of overparameterized MAML, including explicitly identifying typical data and task distributions that can achieve diminishing generalization error with overparameterization, and characterizing the impact of adaptation learning rate on both excess risk and the early stopping time. Our theoretical findings are further validated by experiments.

preprint · 2020 · arXiv

Combinatorial Pure Exploration of Dueling Bandit

In this paper, we study combinatorial pure exploration for dueling bandits (CPE-DB): we have multiple candidates for multiple positions as modeled by a bipartite graph, and in each round we sample a duel of two candidates on one position and observe who wins in the duel, with the goal of finding the best candidate-position matching with high probability after multiple rounds of samples. CPE-DB is an adaptation of the original combinatorial pure exploration for multi-armed bandit (CPE-MAB) problem to the dueling bandit setting. We consider both the Borda winner and the Condorcet winner cases. For Borda winner, we establish a reduction of the problem to the original CPE-MAB setting and design PAC and exact algorithms that achieve both the sample complexity similar to that in the CPE-MAB setting (which is nearly optimal for a subclass of problems) and polynomial running time per round. For Condorcet winner, we first design a fully polynomial time approximation scheme (FPTAS) for the offline problem of finding the Condorcet winner with known winning probabilities, and then use the FPTAS as an oracle to design a novel pure exploration algorithm ${\sf CAR}$-${\sf Cond}$ with sample complexity analysis. ${\sf CAR}$-${\sf Cond}$ is the first algorithm with polynomial running time per round for identifying the Condorcet winner in CPE-DB.
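The two winner notions named above can be illustrated in the simpler single-position dueling-bandit setting with a known preference matrix, where `p[i][j]` is the probability that candidate `i` beats candidate `j`. This is a deliberate simplification of the matching problem in the abstract, and the function names are assumptions for the example.

```python
from typing import List, Optional


def borda_winner(p: List[List[float]]) -> int:
    """Candidate with the highest average probability of beating a rival."""
    n = len(p)
    scores = [sum(p[i][j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
    return max(range(n), key=scores.__getitem__)


def condorcet_winner(p: List[List[float]]) -> Optional[int]:
    """Candidate that beats every rival with probability above 1/2, if any."""
    n = len(p)
    for i in range(n):
        if all(p[i][j] > 0.5 for j in range(n) if j != i):
            return i
    return None  # A Condorcet winner need not exist (e.g. cyclic preferences).


# Candidate 0 narrowly beats both rivals; candidate 1 crushes candidate 2.
prefs = [
    [0.5, 0.6, 0.6],
    [0.4, 0.5, 0.9],
    [0.4, 0.1, 0.5],
]
```

On `prefs`, candidate 0 is the Condorcet winner while candidate 1 has the highest Borda score, which shows that the two notions can disagree.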

preprint · 2020 · arXiv

Multi-Path Policy Optimization

Recent years have witnessed a tremendous improvement of deep reinforcement learning. However, a challenging problem is that an agent may suffer from inefficient exploration, particularly for on-policy methods. Previous exploration methods either rely on complex structure to estimate the novelty of states, or incur sensitive hyper-parameters causing instability. We propose an efficient exploration method, Multi-Path Policy Optimization (MPPO), which does not incur high computation cost and ensures stability. MPPO maintains an efficient mechanism that effectively utilizes a population of diverse policies to enable better exploration, especially in sparse environments. We also give a theoretical guarantee of the stable performance. We build our scheme upon two widely-adopted on-policy methods, the Trust-Region Policy Optimization algorithm and Proximal Policy Optimization algorithm. We conduct extensive experiments on several MuJoCo tasks and their sparsified variants to fairly evaluate the proposed method. Results show that MPPO significantly outperforms state-of-the-art exploration methods in terms of both sample efficiency and final performance.

preprint · 2010 · arXiv

Dynamic Product Assembly and Inventory Control for Maximum Profit

We consider a manufacturing plant that purchases raw materials for product assembly and then sells the final products to customers. There are M types of raw materials and K types of products, and each product uses a certain subset of raw materials for assembly. The plant operates in slotted time, and every slot it makes decisions about re-stocking materials and pricing the existing products in reaction to (possibly time-varying) material costs and consumer demands. We develop a dynamic purchasing and pricing policy that yields time average profit within epsilon of optimality, for any given epsilon>0, with a worst case storage buffer requirement that is O(1/epsilon). The policy can be implemented easily for large M, K, yields fast convergence times, and is robust to non-ergodic system dynamics.

preprint · 2010 · arXiv

Max-Weight Achieves the Exact $[O(1/V), O(V)]$ Utility-Delay Tradeoff Under Markov Dynamics

In this paper, we show that the Quadratic Lyapunov function based Algorithm (QLA, also known as MaxWeight or Backpressure) achieves an exact $[O(1/V), O(V)]$ utility-delay tradeoff in stochastic network optimization problems with Markovian network dynamics. Note that though the QLA algorithm has been extensively studied, most of the performance results are obtained under i.i.d. network randomness, and it has not been formally proven that QLA achieves the exact $[O(1/V), O(V)]$ utility-delay tradeoff under Markov dynamics. Our analysis uses a combination of duality theory and a variable multi-slot Lyapunov drift argument. The variable multi-slot Lyapunov drift argument here is different from previous multi-slot drift analysis, in that the slot number is a random variable corresponding to the renewal time of the network randomness. This variable multi-slot drift argument not only allows us to obtain an exact $[O(1/V), O(V)]$ tradeoff, but also allows us to state the performance of QLA in terms of explicit parameters of the network dynamic process.
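For readers unfamiliar with MaxWeight, the sketch below shows its basic decision rule in a toy single-server setting: serve the queue with the largest backlog-rate product, then update the backlogs. It omits the utility term weighted by $V$ that the tradeoff above refers to, and the function names and the single-server model are assumptions for illustration.

```python
from typing import List


def max_weight_choice(backlogs: List[float], rates: List[float]) -> int:
    """Index of the queue with the largest backlog * service-rate product."""
    return max(range(len(backlogs)), key=lambda i: backlogs[i] * rates[i])


def queue_step(backlogs: List[float], rates: List[float], arrivals: List[float]) -> List[float]:
    """Serve the max-weight queue for one slot, then add new arrivals."""
    served = max_weight_choice(backlogs, rates)
    next_backlogs = []
    for i, (backlog, arrival) in enumerate(zip(backlogs, arrivals)):
        service = rates[i] if i == served else 0.0
        # Standard queue update: Q(t+1) = max(Q(t) - service, 0) + arrival.
        next_backlogs.append(max(backlog - service, 0.0) + arrival)
    return next_backlogs
```

The rule needs only the current backlogs and rates in each slot, not the statistics of the arrival or channel process.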