Source author record

Mikhail Smelyanskiy

Mikhail Smelyanskiy appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Distributed, Parallel, and Cluster Computing Machine Learning quant-ph Hardware Architecture hep-lat Performance physics.chem-ph physics.comp-ph

Catalog footprint

What is connected

7works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2021arXiv

FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference

Deep learning models typically use single-precision (FP32) floating point data types for representing activations and weights, but a slew of recent research work has shown that computations with reduced-precision data types (FP16, 16-bit integers, 8-bit integers or even 4- or 2-bit integers) are enough to achieve same accuracy as FP32 and are much more efficient. Therefore, we designed fbgemm, a high-performance kernel library, from ground up to perform high-performance quantized inference on current generation CPUs. fbgemm achieves efficiency by fusing common quantization operations with a high-performance gemm implementation and by shape- and size-specific kernel code generation at runtime. The library has been deployed at Facebook, where it delivers greater than 2x performance gains with respect to our current production baseline.

preprint2020arXiv

Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers. Recommendation models present unique challenges in training because they exercise not only compute but also memory capacity as well as memory and network bandwidth. As model size and complexity increase, efficiently scaling training becomes a challenge. To address it we design Zion - Facebook's next-generation large-memory training platform that consists of both CPUs and accelerators. Also, we discuss the design requirements of future scale-out training systems.

preprint2020arXiv

The Architectural Implications of Facebook's DNN-based Personalized Recommendation

The widespread application of deep learning has changed the landscape of computation in the data center. In particular, personalized recommendation for content ranking is now largely accomplished leveraging deep neural networks. However, despite the importance of these models and the amount of compute cycles they consume, relatively little research attention has been devoted to systems for recommendation. To facilitate research and to advance the understanding of these workloads, this paper presents a set of real-world, production-scale DNNs for personalized recommendation coupled with relevant performance metrics for evaluation. In addition to releasing a set of open-source workloads, we conduct in-depth analysis that underpins future system design and optimization for at-scale recommendation: Inference latency varies by 60% across three Intel server generations, batching and co-location of inferences can drastically improve latency-bounded throughput, and the diverse composition of recommendation models leads to different optimization strategies.

preprint2019arXiv

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

Personalized recommendation systems leverage deep learning models and account for the majority of data center AI cycles. Their performance is dominated by memory-bound sparse embedding operations with unique irregular memory access patterns that pose a fundamental challenge to accelerate. This paper proposes a lightweight, commodity DRAM compliant, near-memory processing solution to accelerate personalized recommendation inference. The in-depth characterization of production-grade recommendation models shows that embedding operations with high model-, operator- and data-level parallelism lead to memory bandwidth saturation, limiting recommendation inference performance. We propose RecNMP which provides a scalable solution to improve system throughput, supporting a broad range of sparse embedding models. RecNMP is specifically tailored to production environments with heavy co-location of operators on a single server. Several hardware/software co-optimization techniques such as memory-side caching, table-aware packet scheduling, and hot entry profiling are studied, resulting in up to 9.8x memory latency speedup over a highly-optimized baseline. Overall, RecNMP offers 4.2x throughput improvement and 45.8% memory energy savings.

preprint2016arXiv

Error Sensitivity to Environmental Noise in Quantum Circuits for Chemical State Preparation

Calculating molecular energies is likely to be one of the first useful applications to achieve quantum supremacy, performing faster on a quantum than a classical computer. However, if future quantum devices are to produce accurate calculations, errors due to environmental noise and algorithmic approximations need to be characterized and reduced. In this study, we use the high performance qHiPSTER software to investigate the effects of environmental noise on the preparation of quantum chemistry states. We simulated eighteen 16-qubit quantum circuits under environmental noise, each corresponding to a unitary coupled cluster state preparation of a different molecule or molecular configuration. Additionally, we analyze the nature of simple gate errors in noise-free circuits of up to 40 qubits. We find that the Jordan-Wigner (JW) encoding produces consistently smaller errors under a noisy environment as compared to the Bravyi-Kitaev (BK) encoding. For the JW encoding, pure-dephasing noise is shown to produce substantially smaller errors than pure relaxation noise of the same magnitude. We report error trends in both molecular energy and electron particle number within a unitary coupled cluster state preparation scheme, against changes in nuclear charge, bond length, number of electrons, noise types, and noise magnitude. These trends may prove to be useful in making algorithmic and hardware-related choices for quantum simulation of molecular energies.

preprint2016arXiv

qHiPSTER: The Quantum High Performance Software Testing Environment

We present qHiPSTER, the Quantum High Performance Software Testing Environment. qHiPSTER is a distributed high-performance implementation of a quantum simulator on a classical computer, that can simulate general single-qubit gates and two-qubit controlled gates. We perform a number of single- and multi-node optimizations, including vectorization, multi-threading, cache blocking, as well as overlapping computation with communication. Using the TACC Stampede supercomputer, we simulate quantum circuits ("quantum software") of up to 40 qubits. We carry out a detailed performance analysis to show that our simulator achieves both high performance and high hardware efficiency, limited only by the sustainable memory and network bandwidth of the machine.

preprint2014arXiv

Lattice QCD with Domain Decomposition on Intel Xeon Phi Co-Processors

The gap between the cost of moving data and the cost of computing continues to grow, making it ever harder to design iterative solvers on extreme-scale architectures. This problem can be alleviated by alternative algorithms that reduce the amount of data movement. We investigate this in the context of Lattice Quantum Chromodynamics and implement such an alternative solver algorithm, based on domain decomposition, on Intel Xeon Phi co-processor (KNC) clusters. We demonstrate close-to-linear on-chip scaling to all 60 cores of the KNC. With a mix of single- and half-precision the domain-decomposition method sustains 400-500 Gflop/s per chip. Compared to an optimized KNC implementation of a standard solver [1], our full multi-node domain-decomposition solver strong-scales to more nodes and reduces the time-to-solution by a factor of 5.

Mikhail Smelyanskiy

What is connected

Connect this record

See the researcher in context

Building this map preview

7 published item(s)

FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference

Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

The Architectural Implications of Facebook's DNN-based Personalized Recommendation

RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing

Error Sensitivity to Environmental Noise in Quantum Circuits for Chemical State Preparation

qHiPSTER: The Quantum High Performance Software Testing Environment

Lattice QCD with Domain Decomposition on Intel Xeon Phi Co-Processors