Researcher profile

Yinbin Ma

Yinbin Ma contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 17 (Unverified) | Verification L1 | Unclaimed author
4 works
0 followers
5 topics
4 close collaborators

Published work

4 published items

Preprint, 2026, arXiv

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

Recent GPU generations deliver significantly higher FLOPs using lower-precision arithmetic, such as FP8. While FP8 has been applied successfully to large language models (LLMs), its adoption in large recommendation models (LRMs) has been limited. This is because LRMs are numerically sensitive, dominated by small matrix multiplications (GEMMs) followed by normalization, and trained in communication-intensive environments. Applying FP8 directly to LRMs often degrades model quality and prolongs training time. These challenges are inherent to LRM workloads and cannot be resolved merely by introducing better FP8 kernels. Instead, a system-model co-design approach is needed to successfully integrate FP8. We present LoKA (Low-precision Kernel Applications), a framework that makes FP8 practical for LRMs through three principles: profile under realistic distributions to know where low precision is safe, co-design model components with hardware to expand where it is safe, and orchestrate across kernel libraries to maximize the gains. Concretely, LoKA Probe is a statistically grounded, online benchmarking method that learns activation and weight statistics and quantifies per-layer errors. This process pinpoints safe and unsafe, fast and slow sites for FP8 adoption. LoKA Mods is a set of reusable model adaptations that improve both numerical stability and execution efficiency with FP8. LoKA Dispatch is a runtime that leverages the statistical insights from LoKA Probe to select the fastest FP8 kernel that satisfies the accuracy requirements.
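
The per-layer error measurement that the abstract attributes to LoKA Probe can be illustrated with a small simulation. The sketch below is not LoKA's implementation: it assumes the E4M3 variant of FP8 (4 exponent bits, 3 mantissa bits, largest finite value 448), simple per-tensor scaling, and synthetic inputs. All function names (`round_to_e4m3`, `fake_quantize`, `layer_relative_error`) are hypothetical.

```python
import math
import random

E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def round_to_e4m3(x):
    """Round a float to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    e = max(math.floor(math.log2(a)), -6)  # -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)                  # spacing implied by 3 mantissa bits
    return sign * min(round(a / step) * step, E4M3_MAX)


def fake_quantize(values):
    """Scale a tensor into the E4M3 range, round each entry, then scale back."""
    amax = max(abs(v) for v in values)
    if amax == 0.0:
        return list(values)
    scale = E4M3_MAX / amax
    return [round_to_e4m3(v * scale) / scale for v in values]


def matvec(w, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def layer_relative_error(w, x):
    """Relative L2 error of a simulated-FP8 matrix-vector product vs. full precision."""
    ref = matvec(w, x)
    n = len(x)
    flat = fake_quantize([v for row in w for v in row])
    wq = [flat[i * n:(i + 1) * n] for i in range(len(w))]
    out = matvec(wq, fake_quantize(x))
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(out, ref)))
    den = math.sqrt(sum(b * b for b in ref))
    return num / den if den else 0.0


# Synthetic "layer": the distributions here are assumptions, not production data.
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
x_gauss = [rng.gauss(0, 1) for _ in range(16)]
x_outlier = x_gauss[:-1] + [1000.0]  # a single large outlier changes the per-tensor scale
print("gaussian input:", layer_relative_error(w, x_gauss))
print("outlier input: ", layer_relative_error(w, x_outlier))
```

Running the same probe over inputs drawn from different distributions gives one error figure per layer, which is the kind of signal a dispatcher could compare against an accuracy threshold.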

Preprint, 2022, arXiv

AutoShard: Automated Embedding Table Sharding for Recommender Systems

Embedding learning is an important technique in deep recommendation models to map categorical features to dense vectors. However, the embedding tables often demand an extremely large number of parameters, which become the storage and efficiency bottlenecks. Distributed training solutions have been adopted to partition the embedding tables across multiple devices. However, the embedding tables can easily lead to imbalances if not carefully partitioned. This is a significant design challenge of distributed systems named embedding table sharding, i.e., how we should partition the embedding tables to balance the costs across devices, which is a non-trivial task because 1) it is hard to efficiently and precisely measure the cost, and 2) the partition problem is known to be NP-hard. In this work, we introduce our novel practice at Meta, namely AutoShard, which uses a neural cost model to directly predict the multi-table costs and leverages deep reinforcement learning to solve the partition problem. Experimental results on an open-sourced large-scale synthetic dataset and Meta's production dataset demonstrate the superiority of AutoShard over the heuristics. Moreover, the learned policy of AutoShard can transfer to sharding tasks with various numbers of tables and different ratios of the unseen tables without any fine-tuning. Furthermore, AutoShard can efficiently shard hundreds of tables in seconds. The effectiveness, transferability, and efficiency of AutoShard make it desirable for production use. Our algorithms have been deployed in Meta's production environment. A prototype is available at https://github.com/daochenzha/autoshard
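
For contrast with the learned approach, the kind of heuristic baseline the abstract mentions can be sketched as a greedy balancer. This is not AutoShard itself: it assumes each table has a known scalar cost and that costs simply add up on a device, which is a simplification given that the abstract notes cost is hard to measure precisely. The name `greedy_shard` and the example costs are made up for illustration.

```python
import heapq


def greedy_shard(table_costs, num_devices):
    """Place tables largest-cost-first, each on the currently least-loaded device."""
    heap = [(0.0, d) for d in range(num_devices)]  # entries are (current load, device id)
    heapq.heapify(heap)
    placement = {d: [] for d in range(num_devices)}
    order = sorted(range(len(table_costs)), key=lambda t: table_costs[t], reverse=True)
    for t in order:
        load, d = heapq.heappop(heap)              # least-loaded device so far
        placement[d].append(t)
        heapq.heappush(heap, (load + table_costs[t], d))
    loads = {d: sum(table_costs[t] for t in placement[d]) for d in placement}
    return placement, loads


costs = [5.0, 3.0, 3.0, 2.0, 2.0, 1.0]  # hypothetical per-table costs
placement, loads = greedy_shard(costs, num_devices=2)
print(placement)
print(loads)
```

A learned policy such as the one the abstract describes would replace both the additive cost assumption and the fixed largest-first ordering.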

Preprint, 2022, arXiv

On Coded Caching Systems with Offline Users

Coded caching is a technique that leverages locally cached contents at the users to reduce the network's peak-time communication load. Coded caching achieves significant performance gains compared to uncoded caching schemes and is thus a promising technique to boost performance in future networks. In the original model introduced by Maddah-Ali and Niesen (MAN), a server stores multiple files and is connected to multiple cache-aided users through an error-free shared link; once the local caches have been filled and all users have sent their demand to the server, the server can start sending coded multicast messages to satisfy all users' demands. A practical limitation of the original MAN model is that it halts if the server does not receive all users' demands, which is the limiting case of asynchronous coded caching when the requests of some users arrive with infinite delay. In this paper we formally define a coded caching system where some users are offline. We propose achievable and converse bounds for this novel setting and show under which conditions they meet, thus providing an optimal solution, and when they are within a constant multiplicative gap of two. Interestingly, when optimality can be shown, the optimal load-memory tradeoff only depends on the number of active users, and not on the total (active plus offline) number of users.
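
As background for the load-memory tradeoff discussed here, the achievable worst-case load of the original MAN scheme at its corner points can be computed directly. With K users, N files, and a per-user cache of M = tN/K files for an integer t between 0 and K, the MAN load is (K - t)/(t + 1) file units, compared with K - t for uncoded delivery. The sketch below only evaluates this classic formula. It does not reproduce the bounds for offline users proposed in the paper, and the function names are illustrative.

```python
from fractions import Fraction


def man_load(num_users, t):
    """MAN delivery load (in file units) at the corner point M = t * N / K."""
    if not 0 <= t <= num_users:
        raise ValueError("t must be an integer between 0 and the number of users")
    return Fraction(num_users - t, t + 1)


def uncoded_load(num_users, t):
    """Load when each user's missing fraction (1 - t/K) is sent separately."""
    if not 0 <= t <= num_users:
        raise ValueError("t must be an integer between 0 and the number of users")
    return Fraction(num_users - t)


# Compare the two loads for K = 4 users at every corner point.
for t in range(5):
    print(t, man_load(4, t), uncoded_load(4, t))
```

The ratio between the two loads is t + 1, which is the multicast gain that coded delivery adds on top of local caching.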

Preprint, 2021, arXiv

A General Coded Caching Scheme for Scalar Linear Function Retrieval

Coded caching aims to minimize the network's peak-time communication load by leveraging the information pre-stored in the local caches at the users. The original single file retrieval setting by Maddah-Ali and Niesen has recently been extended to general Scalar Linear Function Retrieval (SLFR) by Wan et al., who proposed a linear scheme that surprisingly achieves the same optimal load (under the constraint of uncoded cache placement) as in single file retrieval. This paper's goal is to characterize the conditions under which a general SLFR linear scheme is optimal and to gain practical insights into why the specific choices made by Wan et al. work. This paper shows that the optimal decoding coefficients are necessarily the product of two terms, one involving only the encoding coefficients and the other only the demands. In addition, the relationships among the encoding coefficients are shown to be captured by the cycles of certain graphs. Thus, a general linear scheme for SLFR can be found by solving a spanning tree problem.