Source author record

John Kim

John Kim appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher; unclaimed source record

Hardware Architecture; math.CO; Cryptography and Security; Distributed, Parallel, and Cluster Computing; Machine Learning; Computation and Language; Computational Complexity; Computer Vision; Information Theory; math.IT; math.NT; math.QA; math.RT; Performance

Catalog footprint

What is connected

11 works

14 topics

4 close collaborators

preprint, 2022, arXiv

Accelerating Polynomial Multiplication for Homomorphic Encryption on GPUs

Homomorphic Encryption (HE) enables users to securely outsource both the storage and computation of sensitive data to untrusted servers. Not only does HE offer an attractive solution for security in cloud systems, but lattice-based HE systems are also believed to be resistant to attacks by quantum computers. However, current HE implementations suffer from prohibitively high latency. For lattice-based HE to become viable for real-world systems, it is necessary for the key bottlenecks - particularly polynomial multiplication - to be highly efficient. In this paper, we present a characterization of GPU-based implementations of polynomial multiplication. We begin with a survey of modular reduction techniques and analyze several variants of the widely-used Barrett modular reduction algorithm. We then propose a modular reduction variant optimized for 64-bit integer words on the GPU, obtaining a 1.8x speedup over the existing comparable implementations. Next, we explore the following GPU-specific improvements for polynomial multiplication targeted at optimizing latency and throughput: 1) We present a 2D mixed-radix, multi-block implementation of NTT that results in a 1.85x average speedup over the previous state-of-the-art. 2) We explore shared memory optimizations aimed at reducing redundant memory accesses, further improving speedups by 1.2x. 3) Finally, we fuse the Hadamard product with neighboring stages of the NTT, reducing the twiddle factor memory footprint by 50%. By combining our NTT optimizations, we achieve an overall speedup of 123.13x and 2.37x over the previous state-of-the-art CPU and GPU implementations of NTT kernels, respectively.
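The Barrett reduction that the abstract surveys can be illustrated with a minimal Python sketch. This is the textbook form of the algorithm, not the 64-bit GPU variant proposed in the paper, and the function names (`barrett_setup`, `barrett_reduce`, `mul_mod`) are hypothetical. Python integers are arbitrary precision, so the sketch shows only the arithmetic idea (replacing division by a multiply and a shift), not word-size handling.

```python
def barrett_setup(q):
    """Precompute the shift k and the constant mu = floor(4**k / q)."""
    k = q.bit_length()            # smallest k with q < 2**k
    mu = (1 << (2 * k)) // q      # one-time division, reused for every reduction
    return k, mu


def barrett_reduce(x, q, k, mu):
    """Return x mod q for 0 <= x < q*q without dividing by q."""
    t = (x * mu) >> (2 * k)       # estimate of floor(x / q), low by at most 1
    r = x - t * q                 # therefore r lies in [0, 2q)
    if r >= q:                    # a single conditional subtraction corrects it
        r -= q
    return r


def mul_mod(a, b, q, k, mu):
    """Multiply two residues modulo q using Barrett reduction."""
    return barrett_reduce(a * b, q, k, mu)
```

In a polynomial multiplication kernel, a helper like `mul_mod` would sit in the inner loop of the NTT butterflies, which is why the cost of the reduction matters so much.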

preprint, 2022, arXiv

Answer Fast: Accelerating BERT on the Tensor Streaming Processor

Transformers have become a predominant machine learning workload; they are not only the de facto standard for natural language processing tasks, but are also being deployed in other domains such as vision and speech recognition. Many transformer-based applications are real-time systems such as machine translation and web search. These real-time systems often come with strict end-to-end inference latency requirements. Unfortunately, while the majority of the transformer computation comes from matrix multiplications, transformers also include several nonlinear components that tend to become the bottleneck during inference. In this work, we accelerate the inference of BERT models on the tensor streaming processor. By carefully fusing all the nonlinear components with the matrix multiplication components, we are able to efficiently utilize the on-chip matrix multiplication units, resulting in a deterministic tail latency of 130 $μ$s for a batch-1 inference through BERT-base, which is 6X faster than the current state-of-the-art.

preprint, 2022, arXiv

BTS: An Accelerator for Bootstrappable Fully Homomorphic Encryption

Homomorphic encryption (HE) enables the secure offloading of computations to the cloud by providing computation on encrypted data (ciphertexts). HE is based on noisy encryption schemes in which noise accumulates as more computations are applied to the data. The limited number of operations applicable to the data prevents practical applications from exploiting HE. Bootstrapping enables an unlimited number of operations or fully HE (FHE) by refreshing the ciphertext. Unfortunately, bootstrapping requires a significant amount of additional computation and memory bandwidth as well. Prior works have proposed hardware accelerators for computation primitives of FHE. However, to the best of our knowledge, this is the first to propose a hardware FHE accelerator that supports bootstrapping as a first-class citizen. In particular, we propose BTS - Bootstrappable, Technology-driven, Secure accelerator architecture for FHE. We identify the challenges of supporting bootstrapping in the accelerator and analyze the off-chip memory bandwidth and computation required. In particular, given the limitations of modern memory technology, we identify the HE parameter sets that are efficient for FHE acceleration. Based on the insights gained from our analysis, we propose BTS, which effectively exploits the parallelism innate in HE operations by arranging a massive number of processing elements in a grid. We present the design and microarchitecture of BTS, including a network-on-chip design that exploits a deterministic communication pattern. BTS shows 5,556x and 1,306x improved execution time on ResNet-20 and logistic regression over a CPU, with a chip area of 373.6mm^2 and up to 163.2W of power.

preprint, 2020, arXiv

Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers. Recommendation models present unique challenges in training because they exercise not only compute but also memory capacity as well as memory and network bandwidth. As model size and complexity increase, efficiently scaling training becomes a challenge. To address it we design Zion - Facebook's next-generation large-memory training platform that consists of both CPUs and accelerators. Also, we discuss the design requirements of future scale-out training systems.

preprint, 2020, arXiv

HALCONE : A Hardware-Level Timestamp-based Cache Coherence Scheme for Multi-GPU systems

While multi-GPU (MGPU) systems are extremely popular for compute-intensive workloads, several inefficiencies in the memory hierarchy and data movement result in a waste of GPU resources and difficulties in programming MGPU systems. First, due to the lack of hardware-level coherence, the MGPU programming model requires the programmer to replicate and repeatedly transfer data between the GPUs' memory. This leads to inefficient use of precious GPU memory. Second, to maintain coherency across an MGPU system, transferring data using low-bandwidth and high-latency off-chip links leads to degradation in system performance. Third, since the programmer needs to manually maintain data coherence, the programming of an MGPU system to maximize its throughput is extremely challenging. To address the above issues, we propose a novel lightweight timestamp-based coherence protocol, HALCONE, for MGPU systems and modify the memory hierarchy of the GPUs to support physically shared memory. HALCONE replaces the Compute Unit (CU) level logical time counters with cache level logical time counters to reduce coherence traffic. Furthermore, HALCONE introduces a novel timestamp storage unit (TSU) with no additional performance overhead in the main memory to perform coherence actions. Our proposed HALCONE protocol maintains the data coherence in the memory hierarchy of the MGPU with minimal performance overhead (less than 1%). Using a set of standard MGPU benchmarks, we observe that a 4-GPU MGPU system with shared memory and HALCONE performs, on average, 4.6$\times$ and 3$\times$ better than a 4-GPU MGPU system with existing RDMA and with the recently proposed HMG coherence protocol, respectively. We demonstrate the scalability of HALCONE using different GPU counts (2, 4, 8, and 16) and different CU counts (32, 48, and 64 CUs per GPU) for 11 standard benchmarks.

preprint, 2020, arXiv

MGPU-TSM: A Multi-GPU System with Truly Shared Memory

The sizes of GPU applications are rapidly growing. They are exhausting the compute and memory resources of a single GPU, and are demanding the move to multiple GPUs. However, the performance of these applications scales sub-linearly with GPU count because of the overhead of data movement across multiple GPUs. Moreover, a lack of hardware support for coherency exacerbates the problem because a programmer must either replicate the data across GPUs or fetch the remote data using high-overhead off-chip links. To address these problems, we propose a multi-GPU system with truly shared memory (MGPU-TSM), where the main memory is physically shared across all the GPUs. We eliminate remote accesses and avoid data replication using an MGPU-TSM system, which simplifies the memory hierarchy. Our preliminary analysis shows that MGPU-TSM with 4 GPUs performs, on average, 3.9x better than the current best performing multi-GPU configuration for standard application benchmarks.

preprint, 2019, arXiv

LYTNet: A Convolutional Neural Network for Real-Time Pedestrian Traffic Lights and Zebra Crossing Recognition for the Visually Impaired

Currently, the visually impaired rely on either a sighted human, guide dog, or white cane to safely navigate. However, the training of guide dogs is extremely expensive, and canes cannot provide essential information regarding the color of traffic lights and direction of crosswalks. In this paper, we propose a deep learning based solution that provides information regarding the traffic light mode and the position of the zebra crossing. Previous solutions that utilize machine learning only provide one piece of information and are mostly binary: only detecting red or green lights. The proposed convolutional neural network, LYTNet, is designed for comprehensiveness, accuracy, and computational efficiency. LYTNet delivers both of the two most important pieces of information for the visually impaired to cross the road. We provide five classes of pedestrian traffic lights rather than the commonly seen three or four, and a direction vector representing the midline of the zebra crossing that is converted from the 2D image plane to real-world positions. We created our own dataset of pedestrian traffic lights containing over 5000 photos taken at hundreds of intersections in Shanghai. The experiments carried out achieve a classification accuracy of 94%, average angle error of 6.35 degrees, with a frame rate of 20 frames per second when testing the network on an iPhone 7 with additional post-processing steps.

preprint, 2016, arXiv

Cauchy-Davenport Theorem for linear maps: Simplification and Extension

We give a new proof of the Cauchy-Davenport Theorem for linear maps given by Herdade et al. (2015). This theorem gives a lower bound on the size of the image of a linear map on a grid. Our proof is purely combinatorial and offers a partial insight into the range of parameters not handled previously.

preprint, 2015, arXiv

A Cauchy-Davenport theorem for linear maps

We prove a version of the Cauchy-Davenport theorem for general linear maps. For subsets $A,B$ of the finite field $\mathbb{F}_p$, the classical Cauchy-Davenport theorem gives a lower bound for the size of the sumset $A+B$ in terms of the sizes of the sets $A$ and $B$. Our theorem considers a general linear map $L: \mathbb{F}_p^n \to \mathbb{F}_p^m$, and subsets $A_1, \ldots, A_n \subseteq \mathbb{F}_p$, and gives a lower bound on the size of $L(A_1 \times A_2 \times \ldots \times A_n)$ in terms of the sizes of the sets $A_1, \ldots, A_n$. Our proof uses Alon's Combinatorial Nullstellensatz and a variation of the polynomial method.
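For orientation, the classical Cauchy-Davenport theorem mentioned above states that $|A+B| \ge \min(p, |A|+|B|-1)$ for nonempty $A, B \subseteq \mathbb{F}_p$ with $p$ prime. The sketch below checks that classical bound by brute force for a small modulus. It does not implement the paper's generalization to linear maps, and the function names are hypothetical.

```python
from itertools import combinations


def sumset(a, b, p):
    """Return A + B = {x + y mod p : x in A, y in B}."""
    return {(x + y) % p for x in a for y in b}


def cauchy_davenport_holds(p):
    """Check |A + B| >= min(p, |A| + |B| - 1) for all nonempty A, B in Z/pZ."""
    # Enumerate every nonempty subset of {0, ..., p-1}; feasible only for small p.
    subsets = [
        set(c)
        for size in range(1, p + 1)
        for c in combinations(range(p), size)
    ]
    return all(
        len(sumset(a, b, p)) >= min(p, len(a) + len(b) - 1)
        for a in subsets
        for b in subsets
    )
```

Calling `cauchy_davenport_holds(7)` exercises all pairs of nonempty subsets of $\mathbb{F}_7$. The check fails for the composite modulus 4 (take $A = B = \{0, 2\}$, whose sumset is $\{0\}$), which illustrates why the theorem is stated over a prime field.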

preprint, 2015, arXiv

Decoding Reed-Muller codes over product sets

We give a polynomial time algorithm to decode multivariate polynomial codes of degree $d$ up to half their minimum distance, when the evaluation points are an arbitrary product set $S^m$, for every $d < |S|$. Previously known algorithms can achieve this only if the set $S$ has some very special algebraic structure, or if the degree $d$ is significantly smaller than $|S|$. We also give a near-linear time randomized algorithm, which is based on tools from list-decoding, to decode these codes from nearly half their minimum distance, provided $d < (1-ε)|S|$ for constant $ε > 0$. Our result gives an $m$-dimensional generalization of the well-known decoding algorithms for Reed-Solomon codes, and can be viewed as giving an algorithmic version of the Schwartz-Zippel lemma.
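The Schwartz-Zippel lemma referred to above says that a nonzero polynomial of total degree $d$ in $m$ variables has at most $d\,|S|^{m-1}$ zeros on a product set $S^m$, so two distinct codewords differ in at least $(|S|-d)\,|S|^{m-1}$ positions. The following sketch only counts zeros and evaluates that distance bound for a toy example. It is not the decoding algorithm from the paper, and the polynomial, the set `S`, the prime `P` and all function names are hypothetical choices for illustration.

```python
from itertools import product


def count_zeros(f, S, m, p):
    """Count points of S^m where the polynomial f vanishes modulo p."""
    return sum(1 for point in product(S, repeat=m) if f(*point) % p == 0)


def schwartz_zippel_bound(d, S, m):
    """Upper bound d * |S|^(m-1) on the zeros of a nonzero degree-d polynomial."""
    return d * len(S) ** (m - 1)


def min_distance_lower_bound(d, S, m):
    """Lower bound (|S| - d) * |S|^(m-1) on the distance of the degree-d code."""
    return len(S) ** m - schwartz_zippel_bound(d, S, m)


def example_poly(x, y):
    """Hypothetical bivariate polynomial of total degree 2."""
    return x * y + x + 1


P = 13                   # hypothetical prime field size
S = [0, 1, 2, 3, 4]      # hypothetical evaluation set, a subset of F_13
zeros = count_zeros(example_poly, S, 2, P)
```

Here `zeros` stays well under `schwartz_zippel_bound(2, S, 2)`, and `min_distance_lower_bound(2, S, 2)` gives the number of positions in which any two distinct degree-2 codewords on $S^2$ must differ.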

preprint, 2012, arXiv

On the lower central series of an associative algebra

This paper continues the study of the lower central series quotients of an associative algebra A, regarded as a Lie algebra, which was started in math/0610410 by Feigin and Shoikhet. Namely, it provides a basis for the second quotient in the case when A is the free algebra in n generators (note that the Hilbert series of this quotient was determined earlier in math/0610410). Further, it uses this basis to determine the structure of the second quotient in the case when A is the free algebra modulo the relations saying that the generators have given nilpotency orders. It also determines the structure of the third and fourth quotients in the case of 2 generators, confirming an answer conjectured in math/0610410. Finally, in the appendix, the results of math/0610410 are generalized to the case when A is an arbitrary associative algebra (under certain conditions on A).
