Source author record

Yinbin Ma

Yinbin Ma appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Artificial Intelligence Information Theory Machine Learning math.IT Information Retrieval

Catalog footprint

What is connected

4works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While successfully applied to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics, and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.

preprint2022arXiv

AutoShard: Automated Embedding Table Sharding for Recommender Systems

Embedding learning is an important technique in deep recommendation models to map categorical features to dense vectors. However, the embedding tables often demand an extremely large number of parameters, which become the storage and efficiency bottlenecks. Distributed training solutions have been adopted to partition the embedding tables into multiple devices. However, the embedding tables can easily lead to imbalances if not carefully partitioned. This is a significant design challenge of distributed systems named embedding table sharding, i.e., how we should partition the embedding tables to balance the costs across devices, which is a non-trivial task because 1) it is hard to efficiently and precisely measure the cost, and 2) the partition problem is known to be NP-hard. In this work, we introduce our novel practice in Meta, namely AutoShard, which uses a neural cost model to directly predict the multi-table costs and leverages deep reinforcement learning to solve the partition problem. Experimental results on an open-sourced large-scale synthetic dataset and Meta's production dataset demonstrate the superiority of AutoShard over the heuristics. Moreover, the learned policy of AutoShard can transfer to sharding tasks with various numbers of tables and different ratios of the unseen tables without any fine-tuning. Furthermore, AutoShard can efficiently shard hundreds of tables in seconds. The effectiveness, transferability, and efficiency of AutoShard make it desirable for production use. Our algorithms have been deployed in Meta production environment. A prototype is available at https://github.com/daochenzha/autoshard

preprint2022arXiv

On Coded Caching Systems with Offline Users

Coded caching is a technique that leverages locally cached contents at the users to reduce the network's peak-time communication load. Coded caching achieves significant performance gains compared to uncoded caching schemes and is thus a promising technique to boost performance in future networks. In the original model introduced by Maddah-Ali and Niesen (MAN), a server stores multiple files and is connected to multiple cache-aided users through an error-free shared link; once the local caches have been filled and all users have sent their demand to the server, the server can start sending coded multicast messages to satisfy all users' demands. A practical limitation of the original MAN model is that it halts if the server does not receive all users' demands, which is the limiting case of asynchronous coded caching when the requests of some users arrive with infinite delay. In this paper we formally define a coded caching system where some users are offline. We propose achievable and converse bounds for this novel setting and show under which conditions they meet, thus providing an optimal solution, and when they are to within a constant multiplicative gap of two. Interestingly, when optimality can be be shown, the optimal load-memory tradeoff only depends on the number active users, and not on the total (active plus offline) number of users.

preprint2021arXiv

A General Coded Caching Scheme for Scalar Linear Function Retrieval

Coded caching aims to minimize the network's peak-time communication load by leveraging the information pre-stored in the local caches at the users. The original single file retrieval setting by Maddah-Ali and Niesen has been recently extended to general Scalar Linear Function Retrieval (SLFR) by Wan et al., who proposed a linear scheme that surprisingly achieves the same optimal load (under the constraint of uncoded cache placement) as in single file retrieval. This paper's goal is to characterize the conditions under which a general SLFR linear scheme is optimal and gain practical insights into why the specific choices made by Wan et al. work. This paper shows that the optimal decoding coefficients are necessarily the product of two terms, one only involving the encoding coefficients and the other only the demands. In addition, the relationships among the encoding coefficients are shown to be captured by the cycles of certain graphs. Thus, a general linear scheme for SLFR can be found by solving a spanning tree problem.