Source author record

Phillip B. Gibbons

Phillip B. Gibbons appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Machine Learning Data Structures and Algorithms Distributed, Parallel, and Cluster Computing Cryptography and Security Hardware Architecture

Catalog footprint

What is connected

9works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding

There is often variation in the shape and size of input data used for deep learning. In many cases, such data can be represented using tensors with non-uniform shapes, or ragged tensors. Due to limited and non-portable support for efficient execution on ragged tensors, current deep learning frameworks generally use techniques such as padding and masking to make the data shapes uniform and then offload the computations to optimized kernels for dense tensor algebra. Such techniques can, however, lead to a lot of wasted computation and therefore, a loss in performance. This paper presents CoRa, a tensor compiler that allows users to easily generate efficient code for ragged tensor operators targeting a wide range of CPUs and GPUs. Evaluating CoRa on a variety of operators on ragged tensors as well as on an encoder layer of the transformer model, we find that CoRa (i)performs competitively with hand-optimized implementations of the operators and the transformer encoder and (ii) achieves, over PyTorch, a 1.6X geomean speedup for the encoder on an Nvidia GPU and a 1.86X geomean speedup for the multi-head attention module used in transformers on an ARM CPU.

preprint2021arXiv

Advances and Open Problems in Federated Learning

Federated learning (FL) is a machine learning setting where many clients (e.g. mobile devices or whole organizations) collaboratively train a model under the orchestration of a central server (e.g. service provider), while keeping the training data decentralized. FL embodies the principles of focused data collection and minimization, and can mitigate many of the systemic privacy risks and costs resulting from traditional, centralized machine learning and data science approaches. Motivated by the explosive growth in FL research, this paper discusses recent advances and presents an extensive collection of open problems and challenges.

preprint2021arXiv

Cortex: A Compiler for Recursive Deep Learning Models

Optimizing deep learning models is generally performed in two steps: (i) high-level graph optimizations such as kernel fusion and (ii) low level kernel optimizations such as those found in vendor libraries. This approach often leaves significant performance on the table, especially for the case of recursive deep learning models. In this paper, we present Cortex, a compiler-based approach to generate highly-efficient code for recursive models for low latency inference. Our compiler approach and low reliance on vendor libraries enables us to perform end-to-end optimizations, leading to up to 14X lower inference latencies over past work, across different backends.

preprint2020arXiv

DriftSurf: A Risk-competitive Learning Algorithm under Concept Drift

When learning from streaming data, a change in the data distribution, also known as concept drift, can render a previously-learned model inaccurate and require training a new model. We present an adaptive learning algorithm that extends previous drift-detection-based methods by incorporating drift detection into a broader stable-state/reactive-state process. The advantage of our approach is that we can use aggressive drift detection in the stable state to achieve a high detection rate, but mitigate the false positive rate of standalone drift detection via a reactive state that reacts quickly to true drifts while eliminating most false positives. The algorithm is generic in its base learner and can be applied across a variety of supervised learning problems. Our theoretical analysis shows that the risk of the algorithm is competitive to an algorithm with oracle knowledge of when (abrupt) drifts occur. Experiments on synthetic and real datasets with concept drifts confirm our theoretical analysis.

preprint2020arXiv

Sage: Parallel Semi-Asymmetric Graph Algorithms for NVRAMs

Non-volatile main memory (NVRAM) technologies provide an attractive set of features for large-scale graph analytics, including byte-addressability, low idle power, and improved memory-density. NVRAM systems today have an order of magnitude more NVRAM than traditional memory (DRAM). NVRAM systems could therefore potentially allow very large graph problems to be solved on a single machine, at a modest cost. However, a significant challenge in achieving high performance is in accounting for the fact that NVRAM writes can be much more expensive than NVRAM reads. In this paper, we propose an approach to parallel graph analytics using the Parallel Semi-Asymmetric Model (PSAM), in which the graph is stored as a read-only data structure (in NVRAM), and the amount of mutable memory is kept proportional to the number of vertices. Similar to the popular semi-external and semi-streaming models for graph analytics, the PSAM approach assumes that the vertices of the graph fit in a fast read-write memory (DRAM), but the edges do not. In NVRAM systems, our approach eliminates writes to the NVRAM, among other benefits. To experimentally study this new setting, we develop Sage, a parallel semi-asymmetric graph engine with which we implement provably-efficient (and often work-optimal) PSAM algorithms for over a dozen fundamental graph problems. We experimentally study Sage using a 48-core machine on the largest publicly-available real-world graph (the Hyperlink Web graph with over 3.5 billion vertices and 128 billion edges) equipped with Optane DC Persistent Memory, and show that Sage outperforms the fastest prior systems designed for NVRAM. Importantly, we also show that Sage nearly matches the fastest prior systems running solely in DRAM, by effectively hiding the costs of repeatedly accessing NVRAM versus DRAM.

preprint2020arXiv

The Non-IID Data Quagmire of Decentralized Machine Learning

Many large-scale machine learning (ML) applications need to perform decentralized learning over datasets generated at different devices and locations. Such datasets pose a significant challenge to decentralized learning because their different contexts result in significant data distribution skew across devices/locations. In this paper, we take a step toward better understanding this challenge by presenting a detailed experimental study of decentralized DNN training on a common type of data skew: skewed distribution of data labels across devices/locations. Our study shows that: (i) skewed data labels are a fundamental and pervasive problem for decentralized learning, causing significant accuracy loss across many ML applications, DNN models, training datasets, and decentralized learning algorithms; (ii) the problem is particularly challenging for DNN models with batch normalization; and (iii) the degree of data skew is a key determinant of the difficulty of the problem. Based on these findings, we present SkewScout, a system-level approach that adapts the communication frequency of decentralized learning algorithms to the (skew-induced) accuracy loss between data partitions. We also show that group normalization can recover much of the accuracy loss of batch normalization.

preprint2016arXiv

Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM

Bitwise operations are an important component of modern day programming. Many widely-used data structures (e.g., bitmap indices in databases) rely on fast bitwise operations on large bit vectors to achieve high performance. Unfortunately, in existing systems, regardless of the underlying architecture (e.g., CPU, GPU, FPGA), the throughput of such bulk bitwise operations is limited by the available memory bandwidth. We propose Buddy, a new mechanism that exploits the analog operation of DRAM to perform bulk bitwise operations completely inside the DRAM chip. Buddy consists of two components. First, simultaneous activation of three DRAM rows that are connected to the same set of sense amplifiers enables us to perform bitwise AND and OR operations. Second, the inverters present in each sense amplifier enables us to perform bitwise NOT operations, with modest changes to the DRAM array. These two components make Buddy functionally complete. Our implementation of Buddy largely exploits the existing DRAM structure and interface, and incurs low overhead (1% of DRAM chip area). Our evaluations based on SPICE simulations show that, across seven commonly-used bitwise operations, Buddy provides between 10.9X---25.6X improvement in raw throughput and 25.1X---59.5X reduction in energy consumption. We evaluate three real-world data-intensive applications that exploit bitwise operations: 1) bitmap indices, 2) BitWeaving, and 3) bitvector-based implementation of sets. Our evaluations show that Buddy significantly outperforms the state-of-the-art.

preprint2016arXiv

Efficient Algorithms with Asymmetric Read and Write Costs

In several emerging technologies for computer memory (main memory), the cost of reading is significantly cheaper than the cost of writing. Such asymmetry in memory costs poses a fundamentally different model from the RAM for algorithm design. In this paper we study lower and upper bounds for various problems under such asymmetric read and write costs. We consider both the case in which all but $O(1)$ memory has asymmetric cost, and the case of a small cache of symmetric memory. We model both cases using the $(M,ω)$-ARAM, in which there is a small (symmetric) memory of size $M$ and a large unbounded (asymmetric) memory, both random access, and where reading from the large memory has unit cost, but writing has cost $ω\gg 1$. For FFT and sorting networks we show a lower bound cost of $Ω(ωn\log_{ωM} n)$, which indicates that it is not possible to achieve asymptotic improvements with cheaper reads when $ω$ is bounded by a polynomial in $M$. Also, there is an asymptotic gap (of $\min(ω,\log n)/\log(ωM)$) between the cost of sorting networks and comparison sorting in the model. This contrasts with the RAM, and most other models. We also show a lower bound for computations on an $n\times n$ diamond DAG of $Ω(ωn^2/M)$ cost, which indicates no asymptotic improvement is achievable with fast reads. However, we show that for the edit distance problem (and related problems), which would seem to be a diamond DAG, there exists an algorithm with only $O(ωn^2/(M\min(ω^{1/3},M^{1/2})))$ cost. To achieve this we make use of a "path sketch" technique that is forbidden in a strict DAG computation. Finally, we show several interesting upper bounds for shortest path problems, minimum spanning trees, and other problems. A common theme in many of the upper bounds is to have redundant computation to tradeoff between reads and writes.

preprint2016arXiv

Sorting with Asymmetric Read and Write Costs

Emerging memory technologies have a significant gap between the cost, both in time and in energy, of writing to memory versus reading from memory. In this paper we present models and algorithms that account for this difference, with a focus on write-efficient sorting algorithms. First, we consider the PRAM model with asymmetric write cost, and show that sorting can be performed in $O\left(n\right)$ writes, $O\left(n \log n\right)$ reads, and logarithmic depth (parallel time). Next, we consider a variant of the External Memory (EM) model that charges $ω> 1$ for writing a block of size $B$ to the secondary memory, and present variants of three EM sorting algorithms (multi-way mergesort, sample sort, and heapsort using buffer trees) that asymptotically reduce the number of writes over the original algorithms, and perform roughly $ω$ block reads for every block write. Finally, we define a variant of the Ideal-Cache model with asymmetric write costs, and present write-efficient, cache-oblivious parallel algorithms for sorting, FFTs, and matrix multiplication. Adapting prior bounds for work-stealing and parallel-depth-first schedulers to the asymmetric setting, these yield parallel cache complexity bounds for machines with private caches or with a shared cache, respectively.

Phillip B. Gibbons

What is connected

Connect this record

See the researcher in context

Building this map preview

9 published item(s)

The CoRa Tensor Compiler: Compilation for Ragged Tensors with Minimal Padding

Advances and Open Problems in Federated Learning

Cortex: A Compiler for Recursive Deep Learning Models

DriftSurf: A Risk-competitive Learning Algorithm under Concept Drift

Sage: Parallel Semi-Asymmetric Graph Algorithms for NVRAMs

The Non-IID Data Quagmire of Decentralized Machine Learning

Buddy-RAM: Improving the Performance and Efficiency of Bulk Bitwise Operations Using DRAM

Efficient Algorithms with Asymmetric Read and Write Costs

Sorting with Asymmetric Read and Write Costs