Source author record

Mao Yang

Mao Yang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Artificial Intelligence · Computation and Language · Networking and Internet Architecture · Hardware Architecture · Information Retrieval · Performance

Catalog footprint

What is connected

4 works

6 topics

4 close collaborators


preprint · 2026 · arXiv

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache

The growth of long-context Large Language Models (LLMs) significantly increases memory and bandwidth pressure during autoregressive decoding due to the expanding Key-Value (KV) cache. While accuracy-preserving KV-cache quantization (e.g., 4-bit or 2-bit) reduces memory footprint, existing systems decode inefficiently by relying solely on CUDA cores, underutilizing Tensor Cores, the dominant compute resource on GPUs. We present BitDecoding, the first inference system to efficiently decode low-bit KV caches by cooperatively leveraging CUDA cores and Tensor Cores. BitDecoding induces Tensor-Core-friendly layouts, introduces warp-level dequantization parallelism, and provides unified system support through query transformation, high-performance tensor- and channel-wise quantization, and a software-pipelined dequantization kernel enabling mixed-precision execution. Architecture-aware optimizations further leverage Hopper's warpgroup tensor instructions and Blackwell's NVFP4 (MXFP4) tensor formats. Evaluated on Blackwell, Hopper, and Ampere GPUs, BitDecoding achieves an average 7.5x decoding speedup over FP16 FlashDecoding-v2, up to 8.6x on Blackwell with NVFP4, and up to 4.3x over state-of-the-art approaches. On LLaMA-3.1-8B with a 128K context, BitDecoding reduces single-batch decoding latency by 3x. BitDecoding is open-sourced at https://github.com/OpenBitSys/BitDecoding.
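As a rough illustration of what a 4-bit KV cache stores, the sketch below quantizes one group of values to 4-bit codes, packs two codes per byte, and reconstructs approximate values. It is a generic asymmetric quantizer written in plain Python, not BitDecoding's kernels or layouts; the function names and the per-group scale/minimum scheme are assumptions made for this example.

```python
from typing import List, Tuple


def quantize_4bit(values: List[float]) -> Tuple[bytes, float, float]:
    """Quantize one group to 4-bit codes (hypothetical helper, not from BitDecoding).

    Returns the packed codes, the scale, and the group minimum.
    """
    lo, hi = min(values), max(values)
    # 4 bits give 16 levels (codes 0..15); fall back to 1.0 for a constant group.
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((v - lo) / scale))) for v in values]
    if len(codes) % 2:
        codes.append(0)  # pad so codes can be packed in pairs
    # Low nibble holds the even-indexed code, high nibble the odd-indexed one.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return packed, scale, lo


def dequantize_4bit(packed: bytes, scale: float, lo: float, n: int) -> List[float]:
    """Unpack the nibbles and map the first n codes back to approximate floats."""
    codes: List[int] = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return [lo + c * scale for c in codes[:n]]


group = [0.0, 1.5, -2.0, 3.0, 0.25, -0.75, 2.2, 1.0]
packed, scale, lo = quantize_4bit(group)
restored = dequantize_4bit(packed, scale, lo, len(group))
```

Eight values occupy four bytes after packing, and each restored value differs from the original by at most half a quantization step. The unpack-and-rescale step is the dequantization work that a decoding system has to perform before or during attention.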

preprint · 2022 · arXiv

SwiftPruner: Reinforced Evolutionary Pruning for Efficient Ad Relevance

Ad relevance modeling plays a critical role in online advertising systems including Microsoft Bing. To leverage powerful transformers like BERT in this low-latency setting, many existing approaches perform ad-side computations offline. While efficient, these approaches are unable to serve cold-start ads, resulting in poor relevance predictions for such ads. This work aims to design a new, low-latency BERT via structured pruning to enable real-time online inference for cold-start ad relevance on a CPU platform. The challenge is that previous methods typically prune all layers of the transformer to a high, uniform sparsity, producing models that cannot achieve satisfactory inference speed with acceptable accuracy. In this paper, we propose SwiftPruner, an efficient framework that leverages evolution-based search to automatically find the best-performing layer-wise sparse BERT model under the desired latency constraint. Unlike existing evolution algorithms that conduct random mutations, we propose a reinforced mutator with a latency-aware multi-objective reward to conduct better mutations for efficiently searching the large space of layer-wise sparse models. Extensive experiments demonstrate that our method consistently achieves higher ROC AUC and lower latency than the uniform sparse baseline and state-of-the-art search methods. Remarkably, under our latency requirement of 1900 µs on CPU, SwiftPruner achieves a 0.86% higher AUC than the state-of-the-art uniform sparse baseline for BERT-Mini on a large-scale real-world dataset. Online A/B testing shows that our model also achieves a significant 11.7% cut in the ratio of defective cold-start ads with satisfactory real-time serving latency.
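The idea of searching layer-wise sparsity under a latency budget can be sketched with a toy evolutionary loop. This example uses plain random mutation, which is the baseline the abstract contrasts against, not SwiftPruner's reinforced mutator. The latency model, the accuracy proxy, the sparsity choices, and all names are hypothetical stand-ins for real measurements.

```python
import random
from typing import List

SPARSITY_CHOICES = [0.0, 0.25, 0.5, 0.75]  # hypothetical per-layer options


def latency_us(cfg: List[float]) -> float:
    """Toy latency model: each layer costs 100 us scaled by its kept fraction."""
    return sum(100.0 * (1.0 - s) for s in cfg)


def accuracy_proxy(cfg: List[float]) -> float:
    """Toy quality score (higher is better); pruning later layers hurts more."""
    return -sum((i + 1) * s * s for i, s in enumerate(cfg))


def evolve(num_layers: int, budget_us: float, generations: int = 30,
           pop_size: int = 20, seed: int = 0) -> List[float]:
    """Random-mutation evolutionary search over layer-wise sparsity."""
    rng = random.Random(seed)
    start = [max(SPARSITY_CHOICES)] * num_layers  # sparsest uniform config
    if latency_us(start) > budget_us:
        raise ValueError("budget is infeasible even at maximum sparsity")
    population = [start]
    for _ in range(generations):
        children = []
        for _ in range(pop_size):
            child = list(rng.choice(population))
            # Mutate the sparsity of one randomly chosen layer.
            child[rng.randrange(num_layers)] = rng.choice(SPARSITY_CHOICES)
            if latency_us(child) <= budget_us:  # keep only feasible children
                children.append(child)
        # Elitist selection: keep the best-scoring feasible configurations.
        population = sorted(population + children,
                            key=accuracy_proxy, reverse=True)[:pop_size]
    return population[0]


best = evolve(num_layers=4, budget_us=250.0)
```

Because infeasible children are discarded and the best configurations always survive, the returned configuration respects the latency budget and scores at least as well as the uniform starting point under the toy proxy.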

preprint · 2014 · arXiv

Cross-Layer Software-Defined 5G Network

In the past few decades, the world has witnessed rapid growth in mobile communication and reaped great benefits from it. Even though the fourth generation (4G) mobile communication system is just being deployed worldwide, proliferating mobile demands call for newer wireless communication technologies with even better performance. Consequently, the fifth generation (5G) system is already emerging in the research field. However, simply evolving the current mobile networks can hardly meet such great expectations, because over the years the infrastructures have generally become ossified, closed, and vertically constructed. Aiming to establish a new paradigm for 5G mobile networks, in this article we propose a cross-layer software-defined 5G network architecture. By jointly considering the network layer and the physical layer, we establish two software-defined programmable components, the control plane and the cloud computing pool, which enable effective control of the mobile network from a global perspective and benefit technological innovation. Specifically, through the cross-layer software-defined design, the logically centralized and programmable control plane abstracts the control functions from the network layer down to the physical layer, through which we achieve fine-grained control of the mobile network, while the cloud computing pool provides powerful computing capability to implement the baseband data processing of multiple heterogeneous networks. We discuss the main challenges of our architecture, including fine-grained control strategies, network virtualization, and programmability. The architecture significantly benefits the convergence towards heterogeneous networks and enables much more controllable, programmable and evolvable mobile networks.

preprint · 2014 · arXiv

Software-Defined and Virtualized Future Mobile and Wireless Networks: A Survey

With the proliferation of mobile demands and increasingly varied services and applications, the mobile Internet has become an irreversible trend. Unfortunately, the current mobile and wireless network (MWN) faces a series of pressing challenges caused by its inherent design. In this paper, we extend two recent and promising Internet innovations, software-defined networking and network virtualization, to mobile and wireless scenarios. We first describe the challenges and expectations of MWN, and analyze the opportunities provided by the software-defined wireless network (SDWN) and wireless network virtualization (WNV). Then, this paper focuses on SDWN and WNV by presenting, for each, the main ideas, advantages, ongoing research and key technologies, and open issues. Moreover, we argue that these two technologies strongly complement each other, and further investigate efficient joint design between them. This paper confirms that SDWN and WNV may efficiently address the crucial challenges of MWN and significantly benefit the future mobile and wireless network.
