Source author record

Xiaohan Wei

Xiaohan Wei appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning math.OC Performance Artificial Intelligence math.ST Statistics Theory Computer Science and Game Theory Distributed, Parallel, and Cluster Computing Information Retrieval Information Theory math.IT math.PR

Catalog footprint

What is connected

13works

12topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.

preprint2022arXiv

DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction

Learning feature interactions is important to the model performance of online advertising services. As a result, extensive efforts have been devoted to designing effective architectures to learn feature interactions. However, we observe that the practical performance of those designs can vary from dataset to dataset, even when the order of interactions claimed to be captured is the same. That indicates different designs may have different advantages and the interactions captured by them have non-overlapping information. Motivated by this observation, we propose DHEN - a deep and hierarchical ensemble architecture that can leverage strengths of heterogeneous interaction modules and learn a hierarchy of the interactions under different orders. To overcome the challenge brought by DHEN's deeper and multi-layer structure in training, we propose a novel co-designed training system that can further improve the training efficiency of DHEN. Experiments of DHEN on large-scale dataset from CTR prediction tasks attained 0.27\% improvement on the Normalized Entropy (NE) of prediction and 1.2x better training throughput than state-of-the-art baseline, demonstrating their effectiveness in practice.

preprint2022arXiv

Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions

While single-agent policy optimization in a fixed environment has attracted a lot of research attention recently in the reinforcement learning community, much less is known theoretically when there are multiple agents playing in a potentially competitive environment. We take steps forward by proposing and analyzing new fictitious play policy optimization algorithms for zero-sum Markov games with structured but unknown transitions. We consider two classes of transition structures: factored independent transition and single-controller transition. For both scenarios, we prove tight $\widetilde{\mathcal{O}}(\sqrt{K})$ regret bounds after $K$ episodes in a two-agent competitive game scenario. The regret of each agent is measured against a potentially adversarial opponent who can choose a single best policy in hindsight after observing the full policy sequence. Our algorithms feature a combination of Upper Confidence Bound (UCB)-type optimism and fictitious play under the scope of simultaneous policy optimization in a non-stationary environment. When both players adopt the proposed algorithms, their overall optimality gap is $\widetilde{\mathcal{O}}(\sqrt{K})$.

preprint2021arXiv

I. Asynchronous Optimization over weakly Coupled Renewal Systems

A renewal system divides the slotted timeline into back to back time periods called renewal frames. At the beginning of each frame, it chooses a policy from a set of options for that frame. The policy determines the duration of the frame, the penalty incurred during the frame (such as energy expenditure), and a vector of performance metrics (such as instantaneous number of jobs served). The starting points of this line of research are Chapter 7 of the book [Nee10a], the seminal work [Nee13a], and Chapter 5 of the PhD thesis of Chih-ping Li [Li11]. These works consider stochastic optimization over a single renewal system. By way of contrast, this thesis considers optimization over multiple parallel renewal systems, which is computationally more challenging and yields much more applications. The goal is to minimize the time average overall penalty subject to time average overall constraints on the corresponding performance metrics. The main difficulty, which is not present in earlier works, is that these systems act asynchronously due to the fact that the renewal frames of different renewal systems are not aligned. The goal of the thesis is to resolve this difficulty head-on via a new asynchronous algorithm and a novel supermartingale stopping time analysis which shows our algorithms not only converge to the optimal solution but also enjoy fast convergence rates. Based on this general theory, we further develop novel algorithms for data center server provision problems with performance guarantees as well as new heuristics for the multi-user file downloading problems.

preprint2020arXiv

II. High Dimensional Estimation under Weak Moment Assumptions: Structured Recovery and Matrix Estimation

The purpose of this thesis is to develop new theories on high-dimensional structured signal recovery under a rather weak assumption on the measurements that only a finite number of moments exists. High-dimensional recovery has been one of the emerging topics in the last decade partly due to the celebrated work of Candes, Romberg and Tao (e.g. [CRT06, CRT04]). The original analysis there (and the works thereafter) necessitates a strong concentration argument (namely, the restricted isometry property), which only holds for a rather restricted class of measurements with light-tailed distributions. It had long been conjectured that high-dimensional recovery is possible even if restricted isometry type conditions do not hold, but the general theory was beyond the grasp until very recently, when the works [Men14a, KM15] propose a new small-ball method. In these two papers, the authors initiated a new analysis framework for general empirical risk minimization (ERM) problems with respect to the square loss, which is robust and can potentially allow heavy-tailed loss functions. The materials in this thesis are partly inspired by [Men14a], but are of a different mindset: rather than directly analyzing the existing ERMs for signal recovery for which it is difficult to avoid strong moment assumptions, we show that, in many circumstances, by carefully re-designing the ERMs to start with, one can still achieve the minimax optimal statistical rate of signal recovery with very high probability under much weaker assumptions than existing works.

preprint2020arXiv

Robust One-Bit Recovery via ReLU Generative Networks: Near-Optimal Statistical Rate and Global Landscape Analysis

We study the robust one-bit compressed sensing problem whose goal is to design an algorithm that faithfully recovers any sparse target vector $θ_0\in\mathbb{R}^d$ \textit{uniformly} via $m$ quantized noisy measurements. Specifically, we consider a new framework for this problem where the sparsity is implicitly enforced via mapping a low dimensional representation $x_0 \in \mathbb{R}^k$ through a known $n$-layer ReLU generative network $G:\mathbb{R}^k\rightarrow\mathbb{R}^d$ such that $θ_0 = G(x_0)$. Such a framework poses low-dimensional priors on $θ_0$ without a known sparsity basis. We propose to recover the target $G(x_0)$ solving an unconstrained empirical risk minimization (ERM). Under a weak \textit{sub-exponential measurement assumption}, we establish a joint statistical and computational analysis. In particular, we prove that the ERM estimator in this new framework achieves a statistical rate of $m=\widetilde{\mathcal{O}}(kn \log d /\varepsilon^2)$ recovering any $G(x_0)$ uniformly up to an error $\varepsilon$. When the network is shallow (i.e., $n$ is small), we show this rate matches the information-theoretic lower bound up to logarithm factors of $\varepsilon^{-1}$. From the lens of computation, we prove that under proper conditions on the network weights, our proposed empirical risk, despite non-convexity, has no stationary point outside of small neighborhoods around the true representation $x_0$ and its negative multiple; furthermore, we show that the global minimizer of the empirical risk stays within the neighborhood around $x_0$ rather than its negative multiple under further assumptions on the network weights.

preprint2020arXiv

Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning

Temporal-Difference (TD) learning with nonlinear smooth function approximation for policy evaluation has achieved great success in modern reinforcement learning. It is shown that such a problem can be reformulated as a stochastic nonconvex-strongly-concave optimization problem, which is challenging as naive stochastic gradient descent-ascent algorithm suffers from slow convergence. Existing approaches for this problem are based on two-timescale or double-loop stochastic gradient algorithms, which may also require sampling large-batch data. However, in practice, a single-timescale single-loop stochastic algorithm is preferred due to its simplicity and also because its step-size is easier to tune. In this paper, we propose two single-timescale single-loop algorithms which require only one data point each step. Our first algorithm implements momentum updates on both primal and dual variables achieving an $O(\varepsilon^{-4})$ sample complexity, which shows the important role of momentum in obtaining a single-timescale algorithm. Our second algorithm improves upon the first one by applying variance reduction on top of momentum, which matches the best known $O(\varepsilon^{-3})$ sample complexity in existing works. Furthermore, our variance-reduction algorithm does not require a large-batch checkpoint. Moreover, our theoretical results for both algorithms are expressed in a tighter form of simultaneous primal and dual side convergence.

preprint2016arXiv

A Probabilistic Sample Path Convergence Time Analysis of Drift-Plus-Penalty Algorithm for Stochastic Optimization

This paper considers the problem of minimizing the time average of a controlled stochastic process subject to multiple time average constraints on other related processes. The probability distribution of the random events in the system is unknown to the controller. A typical application is time average power minimization subject to network throughput constraints for different users in a network with time varying channel conditions. We show that with probability at least $1-2δ$, the classical drift-plus-penalty algorithm provides a sample path $\mathcal{O}(\varepsilon)$ approximation to optimality with a convergence time $\mathcal{O}(\frac{1}{\varepsilon^2}\max\left\{\log^2\frac1\varepsilon\log\frac2δ,~\log^3\frac2δ\right\})$, where $\varepsilon>0$ is a parameter related to the algorithm. When there is only one constraint, we further show that the convergence time can be improved to $\mathcal{O}\left(\frac{1}{\varepsilon^2}\log^2\frac1δ\right)$.

preprint2016arXiv

Data Center Server Provision: Distributed Asynchronous Control for Coupled Renewal Systems

This paper considers a cost minimization problem for data centers with N servers and randomly arriving service requests. A central router decides which server to use for each new request. Each server has three types of states (active, idle, setup) with different costs and time durations. The servers operate asynchronously over their own states and can choose one of multiple sleep modes when idle. We develop an online distributed control algorithm so that each server makes its own decisions, the request queues are bounded and the overall time average cost is near optimal with probability 1. The algorithm does not need probability information for the arrival rate or job sizes. Next, an improved algorithm that uses a single queue is developed via a "virtualization" technique which is shown to provide the same (near optimal) costs. Simulation experiments on a real data center traffic trace demonstrate the efficiency of our algorithm compared to other existing algorithms.

preprint2016arXiv

Structured signal recovery from non-linear and heavy-tailed measurements

We study high-dimensional signal recovery from non-linear measurements with design vectors having elliptically symmetric distribution. Special attention is devoted to the situation when the unknown signal belongs to a set of low statistical complexity, while both the measurements and the design vectors are heavy-tailed. We propose and analyze a new estimator that adapts to the structure of the problem, while being robust both to the possible model misspecification characterized by arbitrary non-linearity of the measurements as well as to data corruption modeled by the heavy-tailed distributions. Moreover, this estimator has low computational complexity. Our results are expressed in the form of exponential concentration inequalities for the error of the proposed estimator. On the technical side, our proofs rely on the generic chaining methods, and illustrate the power of this approach for statistical applications. Theory is supported by numerical experiments demonstrating that our estimator outperforms existing alternatives when data is heavy-tailed.

preprint2015arXiv

Delay Optimal Power Aware Opportunistic Scheduling with Mutual Information Accumulation

This paper considers optimization of power and delay in a time-varying wireless link using rateless codes. The link serves a sequence of variable-length packets. Each packet is coded and transmitted over multiple slots. Channel conditions can change from slot to slot and are unknown to the transmitter. The amount of mutual information accumulated on each slot depends on the random channel realization and the power used. The goal is to minimize average service delay subject to an average power constraint. We formulate this problem as a frame-based stochastic optimization problem and solve it via an online algorithm. We show that the subproblem within each frame is a simple integer program which can be effectively solved using a dynamic program. The optimality of this online algorithm is proved using the frame-based Lyapunov drift analysis.

preprint2015arXiv

Power Aware Wireless File Downloading: A Lyapunov Indexing Approach to A Constrained Restless Bandit Problem

This paper treats power-aware throughput maxi-mization in a multi-user file downloading system. Each user can receive a new file only after its previous file is finished. The file state processes for each user act as coupled Markov chains that form a generalized restless bandit system. First, an optimal algorithm is derived for the case of one user. The algorithm maximizes throughput subject to an average power constraint. Next, the one-user algorithm is extended to a low complexity heuristic for the multi-user problem. The heuristic uses a simple online index policy. In a special case with no power-constraint, the multi-user heuristic is shown to be throughput optimal. Simulations are used to demonstrate effectiveness of the heuristic in the general case. For simple cases where the optimal solution can be computed offline, the heuristic is shown to be near-optimal for a wide range of parameters.

preprint2014arXiv

Power Aware Wireless File Downloading: A Constrained Restless Bandit Approach

This paper treats power-aware throughput maximization in a multi-user file downloading system. Each user can receive a new file only after its previous file is finished. The file state processes for each user act as coupled Markov chains that form a generalized restless bandit system. First, an optimal algorithm is derived for the case of one user. The algorithm maximizes throughput subject to an average power constraint. Next, the one-user algorithm is extended to a low complexity heuristic for the multi-user problem. The heuristic uses a simple online index policy and its effectiveness is shown via simulation. For simple 3-user cases where the optimal solution can be computed offline, the heuristic is shown to be near-optimal for a wide range of parameters.

Xiaohan Wei

What is connected

Connect this record

See the researcher in context

Building this map preview

13 published item(s)

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

DHEN: A Deep and Hierarchical Ensemble Network for Large-Scale Click-Through Rate Prediction

Provably Efficient Fictitious Play Policy Optimization for Zero-Sum Markov Games with Structured Transitions

I. Asynchronous Optimization over weakly Coupled Renewal Systems

II. High Dimensional Estimation under Weak Moment Assumptions: Structured Recovery and Matrix Estimation

Robust One-Bit Recovery via ReLU Generative Networks: Near-Optimal Statistical Rate and Global Landscape Analysis

Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning

A Probabilistic Sample Path Convergence Time Analysis of Drift-Plus-Penalty Algorithm for Stochastic Optimization

Data Center Server Provision: Distributed Asynchronous Control for Coupled Renewal Systems

Structured signal recovery from non-linear and heavy-tailed measurements

Delay Optimal Power Aware Opportunistic Scheduling with Mutual Information Accumulation

Power Aware Wireless File Downloading: A Lyapunov Indexing Approach to A Constrained Restless Bandit Problem

Power Aware Wireless File Downloading: A Constrained Restless Bandit Approach