Source author record

Philip Levis

Philip Levis appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Distributed, Parallel, and Cluster Computing; Networking and Internet Architecture; Databases; Graphics; Hardware Architecture; Machine Learning; physics.med-ph; physics.soc-ph; Quantitative Methods

Catalog footprint

What is connected

10 works

9 topics

4 close collaborators

Preprint, 2026, arXiv

Federation of Experts: Communication Efficient Distributed Inference for Large Language Models

Mixture of experts (MoE) has emerged as the primary mechanism for making Large Language Models (LLMs) computationally efficient. However, in distributed settings, communicating token embeddings between experts is a significant bottleneck. We present the novel Federation of Experts (FoE) architecture. FoE restructures the MoE block of a transformer layer into multiple MoE clusters. Each cluster is responsible for only one of the KV heads, and expert parallelism is applied between those experts. Between clusters, a sum synchronizes the post-attention residuals, which then drives routing and dispatch for the next MoE block. In a single-node setting, FoE completely eliminates all-to-all communication, as all experts within a group are contained on the same GPU. In multi-node settings, FoE confines all-to-all communication to the intra-node fabric, thus significantly reducing communication overhead. Evaluated on LongBench, an implementation of FoE significantly improves inference throughput and latency in both single-node and multi-node settings, reducing end-to-end forward-pass latency by up to 5.2x, time to first token (TTFT) by 3.62x, and time between tokens (TBT) by 1.95x. It does so while achieving generation quality comparable to a mixture of experts model of the same size and training configuration.

Preprint, 2020, arXiv

Approximate Partition Selection for Big-Data Workloads using Summary Statistics

Many big-data clusters store data in large partitions that support access at a coarse, partition-level granularity. As a result, approximate query processing via row-level sampling is inefficient, often requiring reads of many partitions. In this work, we seek to answer queries quickly and approximately by reading a subset of the data partitions and combining partial answers in a weighted manner without modifying the data layout. We illustrate how to efficiently perform this query processing using a set of pre-computed summary statistics, which inform the choice of partitions and weights. We develop novel means of using the statistics to assess the similarity and importance of partitions. Our experiments on several datasets and data layouts demonstrate that to achieve the same relative error compared to uniform partition sampling, our techniques offer from 2.7× to 70× reduction in the number of partitions read, and the statistics stored per partition require fewer than 100KB.
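The idea of combining partial answers from a subset of partitions in a weighted manner can be illustrated with a minimal sketch. It shows only the uniform partition sampling baseline that the abstract compares against, not the paper's statistics-driven selection; the function names and the toy data are assumptions made for illustration.

```python
def uniform_weights(num_partitions, chosen):
    """Baseline weighting: each partition read stands in for N/k partitions."""
    scale = num_partitions / len(chosen)
    return {i: scale for i in chosen}


def estimate_sum(partitions, chosen, weights):
    """Approximate a SUM query by reading only the chosen partitions
    and scaling each partial answer by its weight."""
    return sum(weights[i] * sum(partitions[i]) for i in chosen)


# Toy data (hypothetical): four partitions of two rows each.
partitions = [[1, 2], [3, 4], [5, 6], [7, 8]]
chosen = [0, 2]                       # read two of the four partitions
weights = uniform_weights(len(partitions), chosen)
estimate = estimate_sum(partitions, chosen, weights)
exact = sum(sum(p) for p in partitions)
```

A statistics-driven scheme would replace `uniform_weights` and the choice of `chosen` with decisions informed by per-partition summaries; how the paper does that is not specified in the abstract.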

Preprint, 2020, arXiv

Design Considerations for Low Power Internet Protocols

Over the past 10 years, low-power wireless networks have transitioned to supporting IPv6 connectivity through 6LoWPAN, a set of standards which specify how to aggressively compress IPv6 packets over low-power wireless links such as 802.15.4. We find that different low-power IPv6 stacks are unable to communicate using 6LoWPAN, and therefore IP, due to design tradeoffs between code size and energy efficiency. We argue that applying traditional protocol design principles to low-power networks is responsible for these failures, in part because receivers must accommodate a wide range of senders. Based on these findings, we propose three design principles for Internet protocols on low-power networks. These principles are based around the importance of providing flexible tradeoffs between code size and energy efficiency. We apply these principles to 6LoWPAN and show that the resulting design of the protocol provides developers a wide range of tradeoff points while allowing implementations with different choices to seamlessly communicate.

Preprint, 2020, arXiv

GRIP: A Graph Neural Network Accelerator Architecture

We present GRIP, a graph neural network accelerator architecture designed for low-latency inference. Accelerating GNNs is challenging because they combine two distinct types of computation: arithmetic-intensive vertex-centric operations and memory-intensive edge-centric operations. GRIP splits GNN inference into a fixed set of edge- and vertex-centric execution phases that can be implemented in hardware. We then specialize each unit for the unique computational structure found in each phase. For vertex-centric phases, GRIP uses a high performance matrix multiply engine coupled with a dedicated memory subsystem for weights to improve reuse. For edge-centric phases, GRIP uses multiple parallel prefetch and reduction engines to alleviate the irregularity in memory accesses. Finally, GRIP supports several GNN optimizations, including a novel optimization called vertex-tiling which increases the reuse of weight data. We evaluate GRIP by performing synthesis and place and route for a 28nm implementation capable of executing inference for several widely-used GNN models (GCN, GraphSAGE, G-GCN, and GIN). Across several benchmark graphs, it reduces 99th percentile latency by a geometric mean of 17x and 23x compared to a CPU and GPU baseline, respectively, while drawing only 5W.

Preprint, 2019, arXiv

Learning in situ: a randomized experiment in video streaming

We describe the results of a randomized controlled trial of video-streaming algorithms for bitrate selection and network prediction. Over the last eight months, we have streamed 14.2 years of video to 56,000 users across the Internet. Sessions are randomized in blinded fashion among algorithms, and client telemetry is recorded for analysis. We found that in this real-world setting, it is difficult for sophisticated or machine-learned control schemes to outperform a "simple" scheme (buffer-based control), notwithstanding good performance in network emulators or simulators. We performed a statistical analysis and found that the variability and heavy-tailed nature of network and algorithm behavior create hurdles for robust learned algorithms in this area. We developed an ABR algorithm that robustly outperforms other schemes in practice, by combining classical control with a learned network predictor, trained with supervised learning in situ on data from the real deployment environment. To support further investigation, we are publishing an archive of traces and results each day, and will open our ongoing study to the community. We welcome other researchers to use this platform to develop and validate new algorithms for bitrate selection, network prediction, and congestion control.
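For readers unfamiliar with the "simple" scheme mentioned above, the following is a minimal sketch of one common form of buffer-based control: the chosen bitrate depends only on how much video is buffered. The reservoir and cushion thresholds, the bitrate ladder, and the function name are assumptions for illustration; the abstract does not give the parameters used in the study.

```python
def buffer_based_bitrate(buffer_s, ladder_kbps, reservoir_s=5.0, cushion_s=10.0):
    """Pick a bitrate from the playback buffer level alone.

    Below the reservoir, use the lowest rate; above reservoir + cushion,
    use the highest; in between, scale linearly and round down to the
    nearest rate available in the ladder.
    """
    ladder = sorted(ladder_kbps)
    if buffer_s <= reservoir_s:
        return ladder[0]
    if buffer_s >= reservoir_s + cushion_s:
        return ladder[-1]
    frac = (buffer_s - reservoir_s) / cushion_s
    target = ladder[0] + frac * (ladder[-1] - ladder[0])
    return max(rate for rate in ladder if rate <= target)


# Hypothetical bitrate ladder in kbit/s.
LADDER = [300, 750, 1500, 3000]
low = buffer_based_bitrate(2.0, LADDER)
mid = buffer_based_bitrate(10.0, LADDER)
high = buffer_based_bitrate(20.0, LADDER)
```

The scheme needs no throughput prediction at all, which is part of why it serves as a natural baseline for learned controllers.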

Preprint, 2016, arXiv

Canary: A Scheduling Architecture for High Performance Cloud Computing

We present Canary, a scheduling architecture that allows high performance analytics workloads to scale out to run on thousands of cores. Canary is motivated by the observation that a central scheduler is a bottleneck for high performance codes: a handful of multicore workers can execute tasks faster than a controller can schedule them. The key insight in Canary is to reverse the responsibilities between controllers and workers. Rather than dispatch tasks to workers, which then fetch data as necessary, in Canary the controller assigns data partitions to workers, which then spawn and schedule tasks locally. We evaluate three benchmark applications in Canary on up to 64 servers and 1,152 cores on Amazon EC2. Canary achieves up to 9-90X speedup over Spark and up to 4X speedup over GraphX, a highly optimized graph analytics engine. While current centralized schedulers can schedule 2,500 tasks/second, each Canary worker can schedule 136,000 tasks/second per core and experiments show this scales out linearly, with 64 workers scheduling over 120 million tasks per second, allowing Canary to support optimized jobs running on thousands of cores.
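The scheduling figures quoted above can be related to each other with some quick arithmetic. This is only a back-of-the-envelope check using numbers from the abstract; the assumption that every core schedules at the per-core rate is ours, and the variable names are illustrative.

```python
# Figures taken from the abstract.
servers = 64
total_cores = 1152
per_core_rate = 136_000            # tasks/second per core on a Canary worker
central_rate = 2_500               # tasks/second for a centralized scheduler
reported_aggregate = 120_000_000   # "over 120 million" tasks/second

cores_per_server = total_cores // servers

# Upper bound IF every core scheduled at the per-core rate (our assumption).
peak_if_all_cores = per_core_rate * total_cores

# Ratio of the reported aggregate to a single centralized scheduler.
ratio_vs_central = reported_aggregate / central_rate
```

Under that assumption the peak would be about 156.7 million tasks per second, so the reported figure of over 120 million is consistent with near-linear scale-out, and it is roughly 48,000 times the centralized rate.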

Preprint, 2016, arXiv

Distributed Graphical Simulation in the Cloud

Graphical simulations are a cornerstone of modern media and films. But existing software packages are designed to run on HPC nodes, and perform poorly in the computing cloud. These simulations have complex data access patterns over complex data structures, and mutate data arbitrarily, and so are a poor fit for existing cloud computing systems. We describe a software architecture for running graphical simulations in the cloud that decouples control logic, computations and data exchanges. This allows a central controller to balance load by redistributing computations, and recover from failures. Evaluations show that the architecture can run existing, state-of-the-art simulations in the presence of stragglers and failures, thereby enabling this large class of applications to use the computing cloud for the first time.

Preprint, 2016, arXiv

Ebb: A DSL for Physical Simulation on CPUs and GPUs

Designing programming environments for physical simulation is challenging because simulations rely on diverse algorithms and geometric domains. These challenges are compounded when we try to run efficiently on heterogeneous parallel architectures. We present Ebb, a domain-specific language (DSL) for simulation, that runs efficiently on both CPUs and GPUs. Unlike previous DSLs, Ebb uses a three-layer architecture to separate (1) simulation code, (2) definition of data structures for geometric domains, and (3) runtimes supporting parallel architectures. Different geometric domains are implemented as libraries that use a common, unified, relational data model. By structuring the simulation framework in this way, programmers implementing simulations can focus on the physics and algorithms for each simulation without worrying about their implementation on parallel computers. Because the geometric domain libraries are all implemented using a common runtime based on relations, new geometric domains can be added as needed, without specifying the details of memory management, mapping to different parallel architectures, or having to expand the runtime's interface. We evaluate Ebb by comparing it to several widely used simulations, demonstrating comparable performance to hand-written GPU code where available, and surpassing existing CPU performance optimizations by up to 9× when no GPU code exists.

Preprint, 2016, arXiv

Scalable, Fast Cloud Computing with Execution Templates

Large scale cloud data analytics applications are often CPU bound. Most of these cycles are wasted: benchmarks written in C++ run 10-51 times faster than frameworks such as Naiad and Spark. However, calling faster implementations from those frameworks only sees moderate (3-5x) speedups because their control planes cannot schedule work fast enough. This paper presents execution templates, a control plane abstraction for CPU-bound cloud applications, such as machine learning. Execution templates leverage highly repetitive control flow to cache scheduling decisions as templates. Rather than reschedule hundreds of thousands of tasks on every loop execution, nodes instantiate these templates. A controller's template specifies the execution across all worker nodes, which it partitions into per-worker templates. To ensure that templates execute correctly, controllers dynamically patch templates to match program control flow. We have implemented execution templates in Nimbus, a C++ cloud computing framework. Running in Nimbus, analytics benchmarks can run 16-43 times faster than in Naiad and Spark. Nimbus's control plane can scale out to run these faster benchmarks on up to 100 nodes (800 cores).
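The core idea of caching scheduling decisions for a repeated block of work can be sketched in a few lines. This is a toy illustration, not the Nimbus API: the `Controller` class, its round-robin placement, and the block name are all hypothetical, and the dynamic patching step described in the abstract is omitted.

```python
class Controller:
    """Toy controller that schedules a block once and reuses the result."""

    def __init__(self, workers):
        self.workers = workers
        self.templates = {}      # block name -> cached per-worker task lists
        self.schedule_calls = 0  # how often the expensive path ran

    def _schedule(self, tasks):
        """Expensive path: decide a worker for every task (round-robin here)."""
        self.schedule_calls += 1
        plan = {w: [] for w in self.workers}
        for i, task in enumerate(tasks):
            plan[self.workers[i % len(self.workers)]].append(task)
        return plan

    def run_block(self, name, tasks):
        """Schedule the block on first use; instantiate the cached template after."""
        if name not in self.templates:
            self.templates[name] = self._schedule(tasks)
        return self.templates[name]


controller = Controller(["w0", "w1"])
for _ in range(100):  # a loop body that repeats with identical control flow
    plan = controller.run_block("loop_body", ["a", "b", "c"])
```

After 100 iterations the expensive scheduling path has run only once; every later iteration reuses the per-worker task lists stored in the template.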

Preprint, 2010, arXiv

A High-Resolution Human Contact Network for Infectious Disease Transmission

The most frequent infectious diseases in humans, and those with the highest potential for rapid pandemic spread, are usually transmitted via droplets during close proximity interactions (CPIs). Despite the importance of this transmission route, very little is known about the dynamic patterns of CPIs. Using wireless sensor network technology, we obtained high-resolution data of CPIs during a typical day at an American high school, permitting the reconstruction of the social network relevant for infectious disease transmission. At a 94% coverage, we collected 762,868 CPIs at a maximal distance of 3 meters among 788 individuals. The data revealed a high density network with typical small world properties and a relatively homogeneous distribution of both interaction time and interaction partners among subjects. Computer simulations of the spread of an influenza-like disease on the weighted contact graph are in good agreement with absentee data during the most recent influenza season. Analysis of targeted immunization strategies suggested that contact network data are required to design strategies that are significantly more effective than random immunization. Immunization strategies based on contact network data were most effective at high vaccination coverage.
