Source author record

Brandon Reagen

Brandon Reagen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Cryptography and Security; Distributed, Parallel, and Cluster Computing; Machine Learning; Hardware Architecture

Catalog footprint

What is connected

10 works

4 topics

4 close collaborators

preprint, 2022, arXiv

CryptoNite: Revealing the Pitfalls of End-to-End Private Inference at Scale

The privacy concerns of providing deep learning inference as a service have underscored the need for private inference (PI) protocols that protect users' data and the service provider's model using cryptographic methods. Recently proposed PI protocols have achieved significant reductions in PI latency by moving the computationally heavy homomorphic encryption (HE) parts to an offline/pre-compute phase. Paired with recent optimizations that tailor networks for PI, these protocols have achieved performance levels that are tantalizingly close to being practical. In this paper, we conduct a rigorous end-to-end characterization of PI protocols and optimization techniques and find that the current understanding of PI performance is overly optimistic. Specifically, we find that offline storage costs of garbled circuits (GC), a key cryptographic protocol used in PI, on user/client devices are prohibitively high and force much of the expensive offline HE computation to the online phase, resulting in a 10-1000× increase to PI latency. We propose a modified PI protocol that significantly reduces client-side storage costs for a small increase in online latency. Evaluated end-to-end, the modified protocol outperforms current protocols by reducing the mean PI latency by 4× for ResNet18 on TinyImageNet. We conclude with a discussion of several recently proposed PI optimizations in light of the findings and note many actually increase PI latency when evaluated from an end-to-end perspective.

preprint, 2022, arXiv

Homomorphically Encrypted Computation using Stochastic Encodings

Homomorphic encryption (HE) is a privacy-preserving technique that enables computation directly over ciphertext. Unfortunately, a key challenge for HE is that implementations can be impractically slow and have limits on computation that can be efficiently implemented. For instance, in Boolean constructions of HE like TFHE, arithmetic operations need to be decomposed into their constituent elementary logic gates, so performance depends on logical circuit depth. Even for heavily quantized fixed-point arithmetic operations, these HE circuit implementations can be slow. This paper explores the merit of using stochastic computing (SC) encodings to reduce the logical depth required for HE computation to enable more efficient implementations. Contrary to computation in the plaintext space, where many efficient hardware implementations are available, HE provides support for only a limited number of primitive operators, and their performance may not directly correlate to their plaintext performance. Our results show that by layering SC encodings on top of TFHE, we observe similar challenges and limitations that SC faces in the plaintext space. Additional breakthroughs would require more support from the HE libraries to make SC with HE a viable solution.
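To illustrate why stochastic encodings can reduce logical depth, here is a minimal plaintext sketch in Python. It assumes the standard unipolar SC encoding, in which a value p in [0, 1] becomes a bitstream whose bits are 1 with probability p, so that multiplication needs only one AND gate per bit. The function names are hypothetical, nothing here is encrypted, and it is not the paper's TFHE implementation.

```python
import random

def encode(p, n, rng):
    """Unipolar stochastic encoding: each of n bits is 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def decode(bits):
    """Recover the encoded value as the fraction of 1 bits."""
    return sum(bits) / len(bits)

def sc_multiply(a_bits, b_bits):
    """Multiply two independent unipolar streams with one AND per bit position."""
    return [a & b for a, b in zip(a_bits, b_bits)]

rng = random.Random(0)   # fixed seed so the run is repeatable
n = 4096                 # longer streams give a lower-variance estimate
a = encode(0.5, n, rng)
b = encode(0.25, n, rng)
product = decode(sc_multiply(a, b))
print(product)           # an estimate close to 0.5 * 0.25
```

The trade-off is visible in the stream length: the gate depth is one, but accuracy improves only slowly as n grows, which is consistent with the plaintext-space limitations the abstract mentions.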

preprint, 2022, arXiv

Impala: Low-Latency, Communication-Efficient Private Deep Learning Inference

This paper proposes Impala, a new cryptographic protocol for private inference in the client-cloud setting. Impala builds upon recent solutions that combine the complementary strengths of homomorphic encryption (HE) and secure multi-party computation (MPC). A series of protocol optimizations are developed to reduce both communication and performance bottlenecks. First, we remove MPC's overwhelmingly high communication cost from the client by introducing a proxy server and developing a low-overhead key switching technique. Key switching reduces the client's bandwidth by multiple orders of magnitude; however, the communication between the proxy and cloud is still excessive. Second, we develop an optimized garbled circuit that leverages truncated secret shares for faster evaluation and less proxy-cloud communication. Finally, we propose sparse HE convolution to reduce the computational bottleneck of using HE. Compared to the state-of-the-art, these optimizations provide a bandwidth savings of over 3X and speedup of 4X for private deep learning inference.

preprint, 2022, arXiv

Selective Network Linearization for Efficient Private Inference

Private inference (PI) enables inference directly on cryptographically secure data. While promising to address many privacy issues, it has seen limited use due to extreme runtimes. Unlike plaintext inference, where latency is dominated by FLOPs, in PI non-linear functions (namely ReLU) are the bottleneck. Thus, practical PI demands novel ReLU-aware optimizations. To reduce PI latency we propose a gradient-based algorithm that selectively linearizes ReLUs while maintaining prediction accuracy. We evaluate our algorithm on several standard PI benchmarks. The results demonstrate up to 4.25% more accuracy (iso-ReLU count at 50K) or 2.2× less latency (iso-accuracy at 70%) than the current state of the art and advance the Pareto frontier across the latency-accuracy space. To complement empirical results, we present a "no free lunch" theorem that sheds light on how and when network linearization is possible while maintaining prediction accuracy. Public code is available at https://github.com/NYU-DICE-Lab/selective_network_linearization.
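The idea of linearizing individual ReLUs can be sketched in Python as a per-activation binary mask: positions with mask 1 keep the ReLU, and positions with mask 0 pass the value through unchanged. This is an illustrative assumption about how such a layer might be expressed; the mask below is fixed by hand, whereas the abstract describes a gradient-based algorithm for choosing it, and the function names are hypothetical.

```python
def relu(x):
    """Standard rectifier on a single value."""
    return x if x > 0.0 else 0.0

def masked_relu(xs, mask):
    """Apply ReLU where mask is 1 and the identity (linear) where mask is 0."""
    return [relu(x) if m == 1 else x for x, m in zip(xs, mask)]

def relu_count(mask):
    """Number of ReLUs that remain, the quantity that drives PI latency."""
    return sum(mask)

activations = [-2.0, 0.5, -0.1, 3.0]
mask = [1, 0, 0, 1]          # hand-picked for illustration, not learned
out = masked_relu(activations, mask)
print(out, relu_count(mask))
```

Under this framing, the iso-ReLU comparison in the abstract corresponds to holding relu_count fixed and asking which mask preserves the most accuracy.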

preprint, 2022, arXiv

Verifiable Access Control for Augmented Reality Localization and Mapping

Localization and mapping is a key technology for bridging the virtual and physical worlds in augmented reality (AR). Localization and mapping works by creating and querying maps made of anchor points that enable the overlay of these two worlds. As a result, information about the physical world is captured in the map and naturally gives rise to concerns around who can map physical spaces as well as who can access or modify the virtual ones. This paper discusses how we can provide access controls over virtual maps as a basic building block to enhance security and privacy of AR systems. In particular, we propose VACMaps: an access control system for localization and mapping using formal methods. VACMaps defines a domain-specific language that enables users to specify access control policies for virtual spaces. Access requests to virtual spaces are then evaluated against relevant policies in a way that preserves confidentiality and integrity of virtual spaces owned by the users. The precise semantics of the policies are defined by SMT formulas, which allow VACMaps to reason about properties of access policies automatically. An evaluation of VACMaps is provided using an AR testbed of a single-family home. We show that VACMaps is scalable in that it can run at practical speeds and that it can also reason about access control policies automatically to detect potential policy misconfigurations.

preprint, 2021, arXiv

Porcupine: A Synthesizing Compiler for Vectorized Homomorphic Encryption

Homomorphic encryption (HE) is a privacy-preserving technique that enables computation directly on encrypted data. Despite its promise, HE has seen limited use due to performance overheads and compilation challenges. Recent work has made significant advances to address the performance overheads, but automatic compilation of efficient HE kernels remains relatively unexplored. This paper presents Porcupine, an optimizing compiler, and an HE DSL named Quill to automatically generate HE code using program synthesis. HE poses three major compilation challenges: it only supports a limited set of SIMD-like operators, it uses long-vector operands, and decryption can fail if ciphertext noise growth is not managed properly. Quill captures the underlying HE operator behavior that enables Porcupine to reason about the complex trade-offs imposed by the challenges and generate optimized, verified HE kernels. To improve synthesis time, we propose a series of optimizations including a sketch design tailored to HE and instruction restriction to narrow the program search space. We evaluate Porcupine using a set of kernels and show speedups of up to 51% (11% geometric mean) compared to heuristic-driven hand-optimized kernels. Analysis of Porcupine's synthesized code reveals that optimal solutions are not always intuitive, underscoring the utility of automated reasoning in this domain.
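The constraint of working with only SIMD-like operators on long vectors can be made concrete with a plaintext simulation in Python. The sketch assumes that the available operators are cyclic slot rotation and element-wise addition, and it sums all slots of a vector with the well-known rotate-and-add pattern. It is not code produced by Porcupine or written in Quill, it models no encryption or noise, and the function names are hypothetical.

```python
def rotate(v, k):
    """Cyclic left rotation by k slots, standing in for an HE slot rotation."""
    k %= len(v)
    return v[k:] + v[:k]

def add(a, b):
    """Element-wise addition, standing in for HE ciphertext addition."""
    return [x + y for x, y in zip(a, b)]

def sum_all_slots(v):
    """Sum a power-of-two-length vector in log2(n) rotate-and-add steps.

    Afterwards every slot holds the total, because no operator is assumed
    that reads a single slot directly.
    """
    n = len(v)
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    acc = list(v)
    step = n // 2
    while step >= 1:
        acc = add(acc, rotate(acc, step))
        step //= 2
    return acc

total = sum_all_slots([1, 2, 3, 4])
print(total)
```

Even this small kernel admits several valid rotation schedules, which gives a sense of the search space a synthesizing compiler has to explore.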

preprint, 2020, arXiv

DeepRecSys: A System for Optimizing End-To-End At-scale Neural Recommendation Inference

Neural personalized recommendation is the cornerstone of a wide collection of cloud services and products, constituting significant compute demand of the cloud infrastructure. Thus, improving the execution efficiency of neural recommendation directly translates into infrastructure capacity saving. In this paper, we devise a novel end-to-end modeling infrastructure, DeepRecInfra, that adopts an algorithm and system co-design methodology to custom-design systems for recommendation use cases. Leveraging the insights from the recommendation characterization, a new dynamic scheduler, DeepRecSched, is proposed to maximize latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, recommendation model architectures, and underlying hardware systems. By doing so, system throughput is doubled across the eight industry-representative recommendation models. Finally, design, deployment, and evaluation in an at-scale production datacenter show over 30% latency reduction across a wide variety of recommendation models running on hundreds of machines.

preprint, 2020, arXiv

The Architectural Implications of Facebook's DNN-based Personalized Recommendation

The widespread application of deep learning has changed the landscape of computation in the data center. In particular, personalized recommendation for content ranking is now largely accomplished leveraging deep neural networks. However, despite the importance of these models and the amount of compute cycles they consume, relatively little research attention has been devoted to systems for recommendation. To facilitate research and to advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: Inference latency varies by 60% across three Intel server generations, batching and co-location of inferences can drastically improve latency-bounded throughput, and the diverse composition of recommendation models leads to different optimization strategies.

preprint, 2019, arXiv

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, resulting in up to 9.8x memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2x throughput improvement and 45.8% memory energy savings.
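The memory-bound behavior described above can be pictured with a small Python sketch of a pooled embedding lookup: gather a few rows of a table by index and sum them element-wise. Treating the sparse embedding operation as gather-and-sum is an assumption made for illustration, as are the table contents and the function name; the sketch does not model RecNMP's hardware.

```python
def pooled_lookup(table, indices):
    """Gather the rows named by indices and sum them element-wise."""
    dim = len(table[0])
    pooled = [0.0] * dim
    for i in indices:
        row = table[i]            # data-dependent row access into the table
        for d in range(dim):
            pooled[d] += row[d]   # one add per element fetched
    return pooled

# Toy table: row r holds [r, 10 * r]. Real tables are far larger.
table = [[float(r), float(r) * 10.0] for r in range(8)]
indices = [7, 0, 3]               # scattered rows, known only at run time
result = pooled_lookup(table, indices)
print(result)
```

Each fetched element takes part in a single addition, so the work is dominated by fetching rows rather than by arithmetic, which is the kind of operation near-memory processing targets.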

preprint, 2016, arXiv

Fathom: Reference Workloads for Modern Deep Learning Methods

Deep learning has been popularized by its recent successes on challenging artificial intelligence problems. One of the reasons for its dominance is also an ongoing challenge: the need for immense amounts of computational power. Hardware architects have responded by proposing a wide array of promising ideas, but to date, the majority of the work has focused on specific algorithms in somewhat narrow application domains. While their specificity does not diminish these approaches, there is a clear need for more flexible solutions. We believe the first step is to examine the characteristics of cutting edge models from across the deep learning community. Consequently, we have assembled Fathom: a collection of eight archetypal deep learning workloads for study. Each of these models comes from a seminal work in the deep learning community, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook's AI research group. Fathom has been released online, and this paper focuses on understanding the fundamental performance characteristics of each model. We use a set of application-level modeling tools built around the TensorFlow deep learning framework in order to analyze the behavior of the Fathom workloads. We present a breakdown of where time is spent, the similarities between the performance profiles of our models, an analysis of behavior in inference and training, and the effects of parallelism on scaling.
