Researcher profile

Xiaohan Wei

Xiaohan Wei contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
10works
0followers
11topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

10 published item(s)

preprint2026arXiv

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.

preprint2022arXiv

DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction

Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, we observe that the practical performance of those designs can vary from dataset to dataset, even when the order of interactions claimed to be captured is the same. That indicates different designs may have different advantages and the interactions captured by them have non-overlapping information. Motivated by this observation, we propose DHEN - a deep and hierarchical ensemble architecture that can leverage strengths of heterogeneous interaction modules and learn a hierarchy of the interactions under different orders. To overcome the challenge brought by DHEN's deeper and multi-layer structure in training, we propose a novel co-designed training system that can further improve the training efficiency of DHEN. Experiments of DHEN on large-scale dataset from CTR prediction tasks attained 0.27\% improvement on the Normalized Entropy (NE) of prediction and 1.2x better training throughput than state-of-the-art baseline, demonstrating their effectiveness in practice.

preprint2022arXiv

Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions

While single-agent policy optimization in a fixed environment has attracted a lot of research attention recently in the reinforcement learning community, much less is known theoretically when there are multiple agents playing in a potentially competitive environment. We take steps forward by proposing and analyzing new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures: factored independent transition and single-controller transition. For both scenarios, we prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario. The regret of each agent is measured against a potentially adversarial opponent who can choose a single best policy in hindsight after observing the full policy sequence. Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization in a non-stationary environment. When both players adopt the proposed algorithms, their overall optimality gap is $\widetilde{\mathcal{O}}(\sqrt{K})$.

preprint2021arXiv

I. Asynchronous Optimization over weakly Coupled Renewal Systems

A renewal system divides the slotted timeline into back to back time periods called renewal frames. At the beginning of each frame, it chooses a policy from a set of options for that frame. The policy determines the duration of the frame, the penalty incurred during the frame (such as energy expenditure), and a vector of performance metrics (such as instantaneous number of jobs served). The starting points of this line of research are Chapter 7 of the book [Nee10a], the seminal work [Nee13a], and Chapter 5 of the PhD thesis of Chih-ping Li [Li11]. These works consider stochastic optimization over a single renewal system. By way of contrast, this thesis considers optimization over multiple parallel renewal systems, which is computationally more challenging and yields much more applications. The goal is to minimize the time average overall penalty subject to time average overall constraints on the corresponding performance metrics. The main difficulty, which is not present in earlier works, is that these systems act asynchronously due to the fact that the renewal frames of different renewal systems are not aligned. The goal of the thesis is to resolve this difficulty head-on via a new asynchronous algorithm and a novel supermartingale stopping time analysis which shows our algorithms not only converge to the optimal solution but also enjoy fast convergence rates. Based on this general theory, we further develop novel algorithms for data center server provision problems with performance guarantees as well as new heuristics for the multi-user file downloading problems.

preprint2020arXiv

II. High Dimensional Estimation under Weak Moment Assumptions: Structured Recovery and Matrix Estimation

The purpose of this thesis is to develop new theories on high-dimensional structured signal recovery under a rather weak assumption on the measurements that only a finite number of moments exists. High-dimensional recovery has been one of the emerging topics in the last decade partly due to the celebrated work of Candes, Romberg and Tao (e.g. [CRT06, CRT04]). The original analysis there (and the works thereafter) necessitates a strong concentration argument (namely, the restricted isometry property), which only holds for a rather restricted class of measurements with light-tailed distributions. It had long been conjectured that high-dimensional recovery is possible even if restricted isometry type conditions do not hold, but the general theory was beyond the grasp until very recently, when the works [Men14a, KM15] propose a new small-ball method. In these two papers, the authors initiated a new analysis framework for general empirical risk minimization (ERM) problems with respect to the square loss, which is robust and can potentially allow heavy-tailed loss functions. The materials in this thesis are partly inspired by [Men14a], but are of a different mindset: rather than directly analyzing the existing ERMs for signal recovery for which it is difficult to avoid strong moment assumptions, we show that, in many circumstances, by carefully re-designing the ERMs to start with, one can still achieve the minimax optimal statistical rate of signal recovery with very high probability under much weaker assumptions than existing works.

preprint2020arXiv

Robust One-Bit Recovery via ReLU Generative Networks: Near-Optimal Statistical Rate and Global Landscape Analysis

We study the robust one-bit compressed sensing problem whose goal is to design an algorithm that faithfully recovers any sparse target vector $θ_0\in\mathbb{R}^d$ \textit{uniformly} via $m$ quantized noisy measurements. Specifically, we consider a new framework for this problem where the sparsity is implicitly enforced via mapping a low dimensional representation $x_0 \in \mathbb{R}^k$ through a known $n$-layer ReLU generative network $G:\mathbb{R}^k\rightarrow\mathbb{R}^d$ such that $θ_0 = G(x_0)$. Such a framework poses low-dimensional priors on $θ_0$ without a known sparsity basis. We propose to recover the target $G(x_0)$ solving an unconstrained empirical risk minimization (ERM). Under a weak \textit{sub-exponential measurement assumption}, we establish a joint statistical and computational analysis. In particular, we prove that the ERM estimator in this new framework achieves a statistical rate of $m=\widetilde{\mathcal{O}}(kn \log d /\varepsilon^2)$ recovering any $G(x_0)$ uniformly up to an error $\varepsilon$. When the network is shallow (i.e., $n$ is small), we show this rate matches the information-theoretic lower bound up to logarithm factors of $\varepsilon^{-1}$. From the lens of computation, we prove that under proper conditions on the network weights, our proposed empirical risk, despite non-convexity, has no stationary point outside of small neighborhoods around the true representation $x_0$ and its negative multiple; furthermore, we show that the global minimizer of the empirical risk stays within the neighborhood around $x_0$ rather than its negative multiple under further assumptions on the network weights.

preprint2020arXiv

Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning

Temporal-Difference (TD) learning with nonlinear smooth function approximation for policy evaluation has achieved great success in modern reinforcement learning. It is shown that such a problem can be reformulated as a stochastic nonconvex-strongly-concave optimization problem, which is challenging as naive stochastic gradient descent-ascent algorithm suffers from slow convergence. Existing approaches for this problem are based on two-timescale or double-loop stochastic gradient algorithms, which may also require sampling large-batch data. However, in practice, a single-timescale single-loop stochastic algorithm is preferred due to its simplicity and also because its step-size is easier to tune. In this paper, we propose two single-timescale single-loop algorithms which require only one data point each step. Our first algorithm implements momentum updates on both primal and dual variables achieving an $O(\varepsilon^{-4})$ sample complexity, which shows the important role of momentum in obtaining a single-timescale algorithm. Our second algorithm improves upon the first one by applying variance reduction on top of momentum, which matches the best known $O(\varepsilon^{-3})$ sample complexity in existing works. Furthermore, our variance-reduction algorithm does not require a large-batch checkpoint. Moreover, our theoretical results for both algorithms are expressed in a tighter form of simultaneous primal and dual side convergence.

preprint2015arXiv

Delay Optimal Power Aware Opportunistic Scheduling with Mutual Information Accumulation

This paper considers optimization of power and delay in a time-varying wireless link using rateless codes. The link serves a sequence of variable-length packets. Each packet is coded and transmitted over multiple slots. Channel conditions can change from slot to slot and are unknown to the transmitter. The amount of mutual information accumulated on each slot depends on the random channel realization and the power used. The goal is to minimize average service delay subject to an average power constraint. We formulate this problem as a frame-based stochastic optimization problem and solve it via an online algorithm. We show that the subproblem within each frame is a simple integer program which can be effectively solved using a dynamic program. The optimality of this online algorithm is proved using the frame-based Lyapunov drift analysis.

preprint2015arXiv

Power Aware Wireless File Downloading: A Lyapunov Indexing Approach to A Constrained Restless Bandit Problem

This paper treats power-aware throughput maxi-mization in a multi-user file downloading system. Each user can receive a new file only after its previous file is finished. The file state processes for each user act as coupled Markov chains that form a generalized restless bandit system. First, an optimal algorithm is derived for the case of one user. The algorithm maximizes throughput subject to an average power constraint. Next, the one-user algorithm is extended to a low complexity heuristic for the multi-user problem. The heuristic uses a simple online index policy. In a special case with no power-constraint, the multi-user heuristic is shown to be throughput optimal. Simulations are used to demonstrate effectiveness of the heuristic in the general case. For simple cases where the optimal solution can be computed offline, the heuristic is shown to be near-optimal for a wide range of parameters.

preprint2014arXiv

Power Aware Wireless File Downloading: A Constrained Restless Bandit Approach

This paper treats power-aware throughput maximization in a multi-user file downloading system. Each user can receive a new file only after its previous file is finished. The file state processes for each user act as coupled Markov chains that form a generalized restless bandit system. First, an optimal algorithm is derived for the case of one user. The algorithm maximizes throughput subject to an average power constraint. Next, the one-user algorithm is extended to a low complexity heuristic for the multi-user problem. The heuristic uses a simple online index policy and its effectiveness is shown via simulation. For simple 3-user cases where the optimal solution can be computed offline, the heuristic is shown to be near-optimal for a wide range of parameters.