Source author record

Luis Ceze

Luis Ceze appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing cs.CY Artificial Intelligence Hardware Architecture Machine Learning Networking and Internet Architecture Neural and Evolutionary Computing Programming Languages

Catalog footprint

What is connected

9works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Recent advances show that large language models (LLMs) can act as autonomous agents capable of generating GPU kernels, but integrating these AI-generated kernels into real-world inference systems remains challenging. FlashInfer-Bench addresses this gap by establishing a standardized, closed-loop framework that connects kernel generation, benchmarking, and deployment. At its core, FlashInfer Trace provides a unified schema describing kernel definitions, workloads, implementations, and evaluations, enabling consistent communication between agents and systems. Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, a public leaderboard to track LLM agents' GPU programming capabilities, and a dynamic substitution mechanism (apply()) that seamlessly injects the best-performing kernels into production LLM engines such as SGLang and vLLM. Using FlashInfer-Bench, we further evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design. FlashInfer-Bench thus establishes a practical, reproducible pathway for continuously improving AI-generated kernels and deploying them into large-scale LLM inference.

preprint2022arXiv

Srifty: Swift and Thrifty Distributed Training on the Cloud

Finding the best VM configuration is key to achieve lower cost and higher throughput, two primary concerns in cloud-based distributed neural network (NN) training today. Optimal VM selection that meets user constraints requires efficiently navigating a large search space while controlling for the performance variance associated with sharing cloud instances and networks. In this work, we characterize this variance in the context of distributed NN training and present results of a comprehensive throughput and cost-efficiency study we conducted across a wide array of instances to prune for the optimal VM search space. Using insights from these studies, we built Srifty, a system that combines runtime profiling with learned performance models to accurately predict training performance and find the best VM choice that satisfies user constraints, potentially leveraging both heterogeneous setups and spot instances. We integrated Srifty with PyTorch and evaluated it on Amazon EC2. We conducted a large-scale generalization study of Srifty across more than 2K training setups on EC2. Our results show that Srifty achieves an iteration latency prediction error of 8%, and its VM instance recommendations offer significant throughput gain and cost reduction while satisfying user constraints compared to existing solutions in complex, real-world scenarios.

preprint2020arXiv

Enumerating Hardware-Software Splits with Program Rewriting

A core problem in hardware-software codesign is in the sheer size of the design space. Without a set ISA to constrain the hardware-software interface, the design space explodes. This work presents a strategy for managing the massive hardware-software design space within the domain of machine learning inference workloads and accelerators. We first propose EngineIR, a new language for representing machine learning hardware and software in a single program. Then, using equality graphs -- a data structure from the compilers literature -- we suggest a method for efficiently enumerating the design space by performing rewrites over our representation.

preprint2020arXiv

Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training

Most work in the deep learning systems community has focused on faster inference, but arriving at a trained model requires lengthy experiments. Accelerating training lets developers iterate faster and come up with better models. DNN training is often seen as a compute-bound problem, best done in a single large compute node with many GPUs. As DNNs get bigger, training requires going distributed. Distributed deep neural network (DDNN) training constitutes an important workload on the cloud. Larger DNN models and faster compute engines shift the training performance bottleneck from computation to communication. Our experiments show existing DNN training frameworks do not scale in a typical cloud environment due to insufficient bandwidth and inefficient parameter server software stacks.We propose PBox, a balanced, scalable central PS hardware that balances compute and communication resources, and PHub, a high performance parameter server (PS) software design that provides an optimized network stack and a streamlined gradient processing pipeline to benefit common PS setups to utilize PBox. We show that in a typical cloud environment, PBox can achieve up to 3.8x speedup over state-of-the-art designs when training ImageNet. We discuss future directions of integrating PBox with programmable switches for in-network aggregation during training, leveraging the datacenter network topology to reduce bandwidth usage and localize data movement.

preprint2020arXiv

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

Distributed deep neural network (DDNN) training constitutes an increasingly important workload that frequently runs in the cloud. Larger DNN models and faster compute engines are shifting DDNN training bottlenecks from computation to communication. This paper characterizes DDNN training to precisely pinpoint these bottlenecks. We found that timely training requires high performance parameter servers (PSs) with optimized network stacks and gradient processing pipelines, as well as server and network hardware with balanced computation and communication resources. We therefore propose PHub, a high performance multi-tenant, rack-scale PS design. PHub co-designs the PS software and hardware to accelerate rack-level and hierarchical cross-rack parameter exchange, with an API compatible with many DDNN training frameworks. PHub provides a performance improvement of up to 2.7x compared to state-of-the-art distributed training techniques for cloud-based ImageNet workloads, with 25% better throughput per dollar.

preprint2016arXiv

21st Century Computer Architecture

Because most technology and computer architecture innovations were (intentionally) invisible to higher layers, application and other software developers could reap the benefits of this progress without engaging in it. Higher performance has both made more computationally demanding applications feasible (e.g., virtual assistants, computer vision) and made less demanding applications easier to develop by enabling higher-level programming abstractions (e.g., scripting languages and reusable components). Improvements in computer system cost-effectiveness enabled value creation that could never have been imagined by the field's founders (e.g., distributed web search sufficiently inexpensive so as to be covered by advertising links). The wide benefits of computer performance growth are clear. Recently, Danowitz et al. apportioned computer performance growth roughly equally between technology and architecture, with architecture credited with ~80x improvement since 1985. As semiconductor technology approaches its "end-of-the-road" (see below), computer architecture will need to play an increasing role in enabling future ICT innovation. But instead of asking, "How can I make my chip run faster?," architects must now ask, "How can I enable the 21st century infrastructure, from sensors to clouds, adding value from performance to privacy, but without the benefit of near-perfect technology scaling?". The challenges are many, but with appropriate investment, opportunities abound. Underlying these opportunities is a common theme that future architecture innovations will require the engagement of and investments from innovators in other ICT layers.

preprint2016arXiv

Arch2030: A Vision of Computer Architecture Research over the Next 15 Years

Application trends, device technologies and the architecture of systems drive progress in information technologies. However, the former engines of such progress - Moore's Law and Dennard Scaling - are rapidly reaching the point of diminishing returns. The time has come for the computing community to boldly confront a new challenge: how to secure a foundational future for information technology's continued progress. The computer architecture community engaged in several visioning exercises over the years. Five years ago, we released a white paper, 21st Century Computer Architecture, which influenced funding programs in both academia and industry. More recently, the IEEE Rebooting Computing Initiative explored the future of computing systems in the architecture, device, and circuit domains. This report stems from an effort to continue this dialogue, reach out to the applications and devices/circuits communities, and understand their trends and vision. We aim to identify opportunities where architecture research can bridge the gap between the application and device domains.

preprint2015arXiv

SAP: an Architecture for Selectively Approximate Wireless Communication

Integrity checking is ubiquitous in data networks, but not all network traffic needs integrity protection. Many applications can tolerate slightly damaged data while still working acceptably, trading accuracy versus efficiency to save time and energy. Such applications should be able to receive damaged data if they so desire. In today's network stacks, lower-layer integrity checks discard damaged data regardless of the application's wishes, violating the End-to-End Principle. This paper argues for optional integrity checking and gently redesigns a commodity network architecture to support integrity-unprotected data. Our scheme, called Selective Approximate Protocol (SAP), allows applications to coordinate multiple network layers to accept potentially damaged data. Unlike previous schemes that targeted video or media streaming, SAP is generic. SAP's improved throughput and decreased retransmission rate is a good match for applications in the domain of approximate computing. Implemented atop WiFi as a case study, SAP works with existing physical layers and requires no hardware changes. SAP's benefits increase as channel conditions degrade. In tests of an error-tolerant file-transfer application over WiFi, SAP sped up transmission by about 30% on average.

preprint2011arXiv

The Impact of Memory Models on Software Reliability in Multiprocessors

The memory consistency model is a fundamental system property characterizing a multiprocessor. The relative merits of strict versus relaxed memory models have been widely debated in terms of their impact on performance, hardware complexity and programmability. This paper adds a new dimension to this discussion: the impact of memory models on software reliability. By allowing some instructions to reorder, weak memory models may expand the window between critical memory operations. This can increase the chance of an undesirable thread-interleaving, thus allowing an otherwise-unlikely concurrency bug to manifest. To explore this phenomenon, we define and study a probabilistic model of shared-memory parallel programs that takes into account such reordering. We use this model to formally derive bounds on the \emph{vulnerability} to concurrency bugs of different memory models. Our results show that for 2 (or a small constant number of) concurrent threads, weaker memory models do indeed have a higher likelihood of allowing bugs. On the other hand, we show that as the number of parallel threads increases, the gap between the different memory models becomes proportionally insignificant. This suggests the counter-intuitive rule that \emph{as the number of parallel threads in the system increases, the importance of using a strict memory model diminishes}; which potentially has major implications on the choice of memory consistency models in future multi-core systems.

Luis Ceze

What is connected

Connect this record

See the researcher in context

Building this map preview

9 published item(s)

FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems

Srifty: Swift and Thrifty Distributed Training on the Cloud

Enumerating Hardware-Software Splits with Program Rewriting

Parameter Box: High Performance Parameter Servers for Efficient Distributed Deep Neural Network Training

Parameter Hub: a Rack-Scale Parameter Server for Distributed Deep Neural Network Training

21st Century Computer Architecture

Arch2030: A Vision of Computer Architecture Research over the Next 15 Years

SAP: an Architecture for Selectively Approximate Wireless Communication

The Impact of Memory Models on Software Reliability in Multiprocessors