Source author record

Wenjun Yu

Wenjun Yu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Information Retrieval Machine Learning quant-ph

Catalog footprint

What is connected

2works

4topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

One Pool, Two Caches: Adaptive HBM Partitioning for Accelerating Generative Recommender Serving

Generative Recommender (GR) inference places embedding hot caches (EMB) and KV caches in direct competition for limited GPU HBM: allocating more memory to one improves its efficiency but degrades the other. Existing systems optimize them in isolation, overlooking that the optimal EMB-KV allocation ratio can shift by up to 0.35 across workload regimes, leaving 20-30\% latency improvement unrealized. While online reallocation is required to close this gap, naive approaches introduce H2D refill traffic on the critical path, causing P99 SLO violations. To address this, we present HELM, which jointly manages HBM allocation and request routing at runtime through two key components: (1) Adaptive Memory Allocation, a three-layer PPO-based controller (frozen base policy, online residual adapter, and burst-aware recovery controller) that achieves $32\,\mathrm{μs}$ decision latency while staying within 0.024-0.029 of the offline-optimal ratio; and (2) EMB-KV-Aware Scheduling, which routes requests by jointly considering KV residency, embedding locality, and node load to avoid routing inefficiencies under heterogeneous allocations. Evaluations on three production-scale datasets over a 32-node A100 cluster show that HELM reduces P99 latency by 24-38\% over the best static policy and achieves 93.5-99.6\% SLO satisfaction across Steady, Trend, and Burst workloads, significantly outperforming state-of-the-art baselines without sacrificing throughput.

preprint2020arXiv

Fast Estimation of Sparse Quantum Noise

As quantum computers approach the fault tolerance threshold, diagnosing and characterizing the noise on large scale quantum devices is increasingly important. One of the most important classes of noise channels is the class of Pauli channels, for reasons of both theoretical tractability and experimental relevance. Here we present a practical algorithm for estimating the $s$ nonzero Pauli error rates in an $s$-sparse, $n$-qubit Pauli noise channel, or more generally the $s$ largest Pauli error rates. The algorithm comes with rigorous recovery guarantees and uses only $O(n^2)$ measurements, $O(s n^2)$ classical processing time, and Clifford quantum circuits. We experimentally validate a heuristic version of the algorithm that uses simplified Clifford circuits on data from an IBM 14-qubit superconducting device and our open source implementation. These data show that accurate and precise estimation of the probability of arbitrary-weight Pauli errors is possible even when the signal is two orders of magnitude below the measurement noise floor.