Source author record

Kezhao Huang

Kezhao Huang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Artificial Intelligence Computation and Language Databases Distributed, Parallel, and Cluster Computing Machine Learning Performance

Catalog footprint

What is connected

3works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models

While Mixture-of-Experts (MoE) scales capacity via conditional computation, Transformers lack a native primitive for knowledge lookup, forcing them to inefficiently simulate retrieval through computation. To address this, we introduce conditional memory as a complementary sparsity axis, instantiated via Engram, a module that modernizes classic $N$-gram embedding for O(1) lookup. By formulating the Sparsity Allocation problem, we uncover a U-shaped scaling law that optimizes the trade-off between neural computation (MoE) and static memory (Engram). Guided by this law, we scale Engram to 27B parameters, achieving superior performance over a strictly iso-parameter and iso-FLOPs MoE baseline. Most notably, while the memory module is expected to aid knowledge retrieval (e.g., MMLU +3.4; CMMLU +4.0), we observe even larger gains in general reasoning (e.g., BBH +5.0; ARC-Challenge +3.7) and code/math domains~(HumanEval +3.0; MATH +2.4). Mechanistic analyses reveal that Engram relieves the backbone's early layers from static reconstruction, effectively deepening the network for complex reasoning. Furthermore, by delegating local dependencies to lookups, it frees up attention capacity for global context, substantially boosting long-context retrieval (e.g., Multi-Query NIAH: 84.2 to 97.0). Finally, Engram establishes infrastructure-aware efficiency: its deterministic addressing enables runtime prefetching from host memory, incurring negligible overhead. We envision conditional memory as an indispensable modeling primitive for next-generation sparse models.

preprint2022arXiv

OLLIE: Derivation-based Tensor Program Optimizer

Boosting the runtime performance of deep neural networks (DNNs) is critical due to their wide adoption in real-world tasks. Existing approaches to optimizing the tensor algebra expression of a DNN only consider expressions representable by a fixed set of predefined operators, missing possible optimization opportunities between general expressions. We propose OLLIE, the first derivation-based tensor program optimizer. OLLIE optimizes tensor programs by leveraging transformations between general tensor algebra expressions, enabling a significantly larger expression search space that includes those supported by prior work as special cases. OLLIE uses a hybrid derivation-based optimizer that effectively combines explorative and guided derivations to quickly discover highly optimized expressions. Evaluation on seven DNNs shows that OLLIE can outperform existing optimizers by up to 2.73$\times$ (1.46$\times$ on average) on an A100 GPU and up to 2.68$\times$ (1.51$\times$) on a V100 GPU, respectively.

preprint2021arXiv

Comprehensive Framework of RDMA-enabled Concurrency Control Protocols

In this paper, we develop RCC, the first unified and comprehensive RDMA-enabled distributed transaction processing framework supporting six serializable concurrency control protocols: not only the classical protocols NOWAIT, WAITDIE, and OCC, but also more advanced MVCC and SUNDIAL, and even CALVIN, the deterministic concurrency control protocol. Our goal is to unbiasedly compare the protocols in a common execution environment with the concurrency control protocol being the only changeable component. We focus on the correct and efficient implementation using key techniques, such as co-routines, outstanding requests, and doorbell batching, with two-sided and one-sided communication primitives. Based on RCC, we get the deep insights that cannot be obtained by any existing systems. Most importantly, we obtain the execution stage latency breakdowns with one-sided and two-sided primitive for each protocol, which are analyzed to develop more efficient hybrid implementations. Our results show that three hybrid designs are indeed better than both one-sided and two-sided implementations by up to 17.8%. We believe that RCC is a significant advance over the state-of-the-art; it can both provide performance insights and be used as the common infrastructure for fast prototyping new implementations.