Researcher profile

Shahar Kvatinsky

Shahar Kvatinsky contributes to research discovery and scholarly infrastructure.

Trust snapshot

Quick read

Trust 21 (Emerging), Verification L1, unclaimed author
13 works
0 followers
10 topics
4 close collaborators

Published work

13 published items

preprint, 2022, arXiv

abstractPIM: A Technology Backward-Compatible Compilation Flow for Processing-In-Memory

The von Neumann architecture, in which the memory and the computation units are separated, demands massive data traffic between the memory and the CPU. To reduce data movement, new technologies and computer architectures have been explored. The use of memristors, which are devices with both memory and computation capabilities, has been considered for different processing-in-memory (PIM) solutions, including using memristive stateful logic for a programmable digital PIM system. Nevertheless, all previous work has focused on a specific stateful logic family, and on optimizing the execution for a certain target machine. These solutions require a new compiler and recompilation when changing the target machine, and provide no backward compatibility with other target machines. In this chapter, we present abstractPIM, a new compilation concept and flow which enables executing any function within the memory, using different stateful logic families and different instruction set architectures (ISAs). By separating the code generation into two independent components, an intermediate representation of the code using a target-independent ISA and then microcode generation for a specific target machine, we provide a flexible flow with backward compatibility and lay foundations for a PIM compiler. Using abstractPIM, we explore various logic technologies and ISAs and how they impact each other, and discuss the associated challenges, such as the increase in execution time.

preprint, 2022, arXiv

C-AND: Mixed Writing Scheme for Disturb Reduction in 1T Ferroelectric FET Memory

Ferroelectric field effect transistor (FeFET) memory has shown the potential to meet the requirements of the growing need for fast, dense, low-power, and non-volatile memories. In this paper, we propose a memory architecture named crossed-AND (C-AND), in which each storage cell consists of a single ferroelectric transistor. The write operation is performed using different write schemes and different absolute voltages, to account for the asymmetric switching voltages of the FeFET. It enables writing an entire wordline in two consecutive cycles and prevents current and power through the channel of the transistor. During the read operation, the current and power are mostly sensed at a single selected device in each column. The read scheme additionally enables reading an entire word without read errors, even along long bitlines. Our simulations demonstrate that, in comparison to the previously proposed AND architecture, the C-AND architecture diminishes read errors, reduces write disturbs, enables the usage of longer bitlines, and saves up to 2.92X in memory cell area.

preprint, 2022, arXiv

Efficient Training of the Memristive Deep Belief Net Immune to Non-Idealities of the Synaptic Devices

The tunability of conductance states of various emerging non-volatile memristive devices emulates the plasticity of biological synapses, making it promising in the hardware realization of large-scale neuromorphic systems. The inference of the neural network can be greatly accelerated by the vector-matrix multiplication (VMM) performed within a crossbar array of memristive devices in one step. Nevertheless, the implementation of the VMM needs complex peripheral circuits, and the complexity further increases since non-idealities of memristive devices prevent precise conductance tuning (especially for online training) and largely degrade the performance of deep neural networks (DNNs). Here, we present an efficient online training method for the memristive deep belief net (DBN). The proposed memristive DBN uses stochastically binarized activations, reducing the complexity of peripheral circuits, and uses the contrastive divergence (CD) based gradient descent learning algorithm. The analog VMM and digital CD are performed separately in a mixed-signal hardware arrangement, making the memristive DBN highly immune to non-idealities of synaptic devices. The number of write operations on memristive devices is reduced by two orders of magnitude. A recognition accuracy of 95% to 97% can be achieved for the MNIST dataset using pulsed synaptic behaviors of various memristive synaptic devices.

preprint, 2022, arXiv

Enhancing Security of Memristor Computing System Through Secure Weight Mapping

Emerging memristor computing systems have demonstrated great promise in improving the energy efficiency of neural network (NN) algorithms. The NN weights stored in memristor crossbars, however, may face potential theft attacks due to the nonvolatility of the memristor devices. In this paper, we propose to protect the NN weights by mapping selected columns of them in the form of 1's complements and leaving the other columns in their original form, preventing the adversary from knowing the exact representation of each weight. The results show that compared with prior work, our method achieves effectiveness comparable to the best of them and reduces the hardware overhead by more than 18X.
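As a rough illustration of the column-wise 1's complement idea described in this abstract, the following sketch stores selected weight columns in complemented form and recovers them with the same operation. The 8-bit precision, the `map_weights` helper, and the boolean `key` list are assumptions made for this example; they are not taken from the paper, whose actual mapping and key handling may differ.

```python
from typing import List

BITS = 8                 # assumed weight precision (hypothetical)
MASK = (1 << BITS) - 1   # keeps the complement within BITS bits


def map_weights(weights: List[List[int]], key: List[bool]) -> List[List[int]]:
    """Store column j as its 1's complement when key[j] is True."""
    return [
        [(~w & MASK) if key[j] else w for j, w in enumerate(row)]
        for row in weights
    ]


# The 1's complement is its own inverse, so the same routine recovers the weights.
unmap_weights = map_weights

weights = [[3, 200, 17], [0, 255, 90]]
key = [True, False, True]   # hypothetical secret column selection
stored = map_weights(weights, key)
recovered = unmap_weights(stored, key)
```

Without knowing `key`, an observer reading `stored` cannot tell which columns hold original values and which hold complements, which is the property the abstract relies on.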

preprint, 2022, arXiv

FiltPIM: In-Memory Filter for DNA Sequencing

Aligning the entire genome of an organism is a compute-intensive task. Pre-alignment filters substantially reduce computational complexity by filtering potential alignment locations. The base-count filter successfully removes over 68% of the potential locations through a histogram-based heuristic. This paper presents FiltPIM, an efficient design of the base-count filter that is based on memristive processing-in-memory. The in-memory design reduces CPU-to-memory data transfer and utilizes both intra-crossbar and inter-crossbar memristive stateful-logic parallelism. The reduction in data transfer and the efficient stateful-logic computation together improve filtering time by 100x compared to a CPU implementation of the filter.
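A minimal software sketch of a histogram-based base-count check follows. It relies on the fact that a single edit (substitution, insertion, or deletion) changes the L1 distance between two base histograms by at most 2, so a candidate location whose histogram distance exceeds twice the edit budget cannot align within that budget. The function name `base_count_pass` and the `max_edits` parameter are hypothetical, and the exact threshold used by FiltPIM may differ.

```python
from collections import Counter


def base_count_pass(read: str, window: str, max_edits: int) -> bool:
    """Return True if the location survives the filter (may still fail alignment)."""
    read_hist = Counter(read)
    window_hist = Counter(window)
    # L1 distance between the A/C/G/T histograms; other symbols are ignored here.
    l1 = sum(abs(read_hist[b] - window_hist[b]) for b in "ACGT")
    # Each edit changes the L1 distance by at most 2, so a larger distance rules the location out.
    return l1 <= 2 * max_edits


keep = base_count_pass("ACGT", "ACGA", max_edits=1)   # one substitution apart
drop = base_count_pass("AAAA", "CCCC", max_edits=1)   # histograms far apart
```

The check is cheap because it needs only counts, not positions, which is why it works as a pre-alignment filter: it can reject locations but never confirms an alignment.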

preprint, 2022, arXiv

HashPIM: High-Throughput SHA-3 via Memristive Digital Processing-in-Memory

Recent research has sought to accelerate cryptographic hash functions as they are at the core of modern cryptography. Traditional designs, however, suffer from the von Neumann bottleneck that originates from the separation of processing and memory units. An emerging solution to overcome this bottleneck is processing-in-memory (PIM): performing logic within the same devices responsible for memory to eliminate data transfer and simultaneously provide massive computational parallelism. In this paper, we seek to vastly accelerate the state-of-the-art SHA-3 cryptographic function using the memristive memory processing unit (mMPU), a general-purpose memristive PIM architecture. To that end, we propose a novel in-memory algorithm for variable rotation, and utilize an efficient mapping of the SHA-3 state vector for memristive crossbar arrays to efficiently exploit PIM parallelism. We demonstrate a massive energy efficiency of 1,422 Gbps/W, improving a state-of-the-art memristive SHA-3 accelerator (SHINE-2) by 4.6x.

preprint, 2022, arXiv

Making Real Memristive Processing-in-Memory Faster and Reliable

Memristive technologies are attractive candidates to replace conventional memory technologies, and can also be used to perform logic and arithmetic operations using a technique called 'stateful logic.' Combining data storage and computation in the memory array enables a novel non-von Neumann architecture, where both operations are performed within a memristive Memory Processing Unit (mMPU). The mMPU relies on adding computing capabilities to the memristive memory cells without changing the basic memory array structure. The use of an mMPU alleviates the primary restriction on performance and energy in a von Neumann machine, which is the data transfer between CPU and memory. Here, the various aspects of the mMPU are discussed, including its architecture and implications for the computing system and software, as well as its microarchitectural aspects. We show how the mMPU can be improved to accelerate different applications and how the poor reliability of memristors can be improved as part of the mMPU operation.

preprint, 2022, arXiv

MatPIM: Accelerating Matrix Operations with Memristive Stateful Logic

The emerging memristive Memory Processing Unit (mMPU) overcomes the memory wall through memristive devices that unite storage and logic for real processing-in-memory (PIM) systems. At the core of the mMPU is stateful logic, which is accelerated with memristive partitions to enable logic with massive inherent parallelism within crossbar arrays. This paper vastly accelerates the fundamental operations of matrix-vector multiplication and convolution in the mMPU, with either full-precision or binary elements. These proposed algorithms establish an efficient foundation for large-scale mMPU applications such as neural networks, image processing, and numerical methods. We overcome the inherent asymmetry limitation in the previous in-memory full-precision matrix-vector multiplication solutions by utilizing techniques from block matrix multiplication and reduction. We present the first fast in-memory binary matrix-vector multiplication algorithm by utilizing memristive partitions with a tree-based popcount reduction (39x faster than previous work). For convolution, we present a novel in-memory input-parallel concept which we utilize for a full-precision algorithm that overcomes the asymmetry limitation in convolution, while also improving latency (2x faster than previous work), and the first fast binary algorithm (12x faster than previous work).
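The tree-based popcount reduction mentioned in this abstract can be pictured in ordinary software as a pairwise, level-by-level sum of single-bit products. The sketch below is only a sequential illustration under the assumption that elements are in {0, 1} and products are bitwise ANDs; the helper names `tree_sum` and `binary_matvec` are hypothetical, and the in-memory algorithm (including its data layout and any XNOR-based encoding) is not reproduced here.

```python
from typing import List


def tree_sum(values: List[int]) -> int:
    """Add values pairwise, level by level (about log2(n) levels)."""
    if not values:
        return 0
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd-length levels so every value has a partner
        # In hardware, all additions of one level could run in parallel.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


def binary_matvec(matrix: List[List[int]], vector: List[int]) -> List[int]:
    """Binary matrix-vector product: popcount of the bitwise AND per row."""
    return [tree_sum([a & x for a, x in zip(row, vector)]) for row in matrix]


result = binary_matvec([[1, 0, 1, 1], [0, 1, 0, 0], [1, 1, 1, 1]], [1, 1, 0, 1])
```

The point of the tree shape is that the number of dependent addition steps grows logarithmically with the row length, whereas a running sum needs one dependent step per element.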

preprint, 2022, arXiv

PartitionPIM: Practical Memristive Partitions for Fast Processing-in-Memory

Digital memristive processing-in-memory overcomes the memory wall through a fundamental storage device capable of stateful logic within crossbar arrays. Dynamically dividing the crossbar arrays by adding memristive partitions further increases parallelism, thereby overcoming an inherent trade-off in memristive processing-in-memory. The algorithmic topology of partitions is highly unique, and was recently exploited to accelerate multiplication (11x with 32 partitions) and sorting (14x with 16 partitions). Yet, the physical implementation of memristive partitions, such as the peripheral decoders and the control message, has never been considered and may lead to vast impracticality. This paper overcomes that challenge with several novel techniques, presenting efficient practical designs of memristive partitions. We begin by formalizing the algorithmic properties of memristive partitions into serial, parallel, and semi-parallel operations. Peripheral overhead is addressed via a novel technique of half-gates that enables efficient decoding with negligible overhead. Control overhead is addressed by carefully reducing the operation set of memristive partitions, while resulting in negligible performance impact, by utilizing techniques such as shared indices and pattern generators. Ultimately, these efficient practical solutions, combined with the vast algorithmic potential, may revolutionize digital memristive processing-in-memory.

preprint, 2022, arXiv

Performing Stateful Logic Using Spin-Orbit Torque (SOT) MRAM

Stateful logic is a promising processing-in-memory (PIM) paradigm to perform logic operations using emerging nonvolatile memory cells. While most stateful logic circuits to date have focused on technologies such as resistive RAM, we propose two approaches to designing stateful logic using spin-orbit torque (SOT) MRAM. The first approach utilizes the separation of read and write paths in SOT devices to perform logic operations. In contrast to previous work, our method utilizes a standard memory structure, and each row can be used as input or output. The second approach uses voltage-gated SOT switching to allow stateful logic in denser memory arrays. We present array structures to support the two approaches and evaluate their functionality using SPICE simulations in the presence of process variation and device mismatch.

preprint, 2022, arXiv

Stateful Logic using Phase Change Memory

Stateful logic is a digital processing-in-memory technique that could address von Neumann memory bottleneck challenges while maintaining backward compatibility with standard von Neumann architectures. In stateful logic, memory cells are used to perform the logic operations without reading or moving any data outside the memory array. Stateful logic has been previously demonstrated using several resistive memory types, mostly resistive RAM (RRAM). Here we present a new method to design stateful logic using a different resistive memory: phase change memory (PCM). We propose and experimentally demonstrate four logic gate types (NOR, IMPLY, OR, NIMP) using commonly used PCM materials. Our stateful logic circuits are different from previously proposed circuits due to the different switching mechanism and functionality of PCM compared to RRAM. Since the proposed stateful logic gates form a functionally complete set, these gates enable sequential execution of any logic function within the memory, paving the way to PCM-based digital processing-in-memory systems.
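To make the functional-completeness claim concrete, the sketch below models the four gate types as plain Boolean functions and shows that NOR alone is enough to build NOT, OR, and AND, from which any logic function can be composed step by step. This is a truth-table illustration only; the function names and the input ordering chosen for IMPLY and NIMP are assumptions, and nothing here models PCM device behavior.

```python
from itertools import product


def nor(a: bool, b: bool) -> bool:
    return not (a or b)


def imply(a: bool, b: bool) -> bool:
    return (not a) or b          # assumed ordering: a IMPLY b


def or_(a: bool, b: bool) -> bool:
    return a or b


def nimp(a: bool, b: bool) -> bool:
    return a and not b           # assumed ordering: a AND NOT b


# NOR by itself is functionally complete: derive NOT, OR, and AND from it.
def not_from_nor(a: bool) -> bool:
    return nor(a, a)


def or_from_nor(a: bool, b: bool) -> bool:
    return not_from_nor(nor(a, b))


def and_from_nor(a: bool, b: bool) -> bool:
    return nor(not_from_nor(a), not_from_nor(b))


# Exhaustively compare the derived gates against Python's own operators.
all_match = all(
    or_from_nor(a, b) == (a or b)
    and and_from_nor(a, b) == (a and b)
    and not_from_nor(a) == (not a)
    for a, b in product([False, True], repeat=2)
)
```

Each derived gate is a short sequence of NOR steps, which mirrors the sequential execution the abstract describes: a larger function is evaluated as a series of gate operations whose intermediate results stay in memory cells.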

preprint, 2022, arXiv

Training of Quantized Deep Neural Networks using a Magnetic Tunnel Junction-Based Synapse

Quantized neural networks (QNNs) are being actively researched as a solution for the computational complexity and memory intensity of deep neural networks. This has sparked efforts to develop algorithms that support both inference and training with quantized weight and activation values, without sacrificing accuracy. A recent example is the GXNOR framework for stochastic training of ternary (TNN) and binary (BNN) neural networks. In this paper, we show how magnetic tunnel junction (MTJ) devices can be used to support QNN training. We introduce a novel hardware synapse circuit that uses the MTJ stochastic behavior to support the quantized update. The proposed circuit enables processing near memory (PNM) of QNN training, which subsequently reduces data movement. We simulated MTJ-based stochastic training of a TNN over the MNIST, SVHN, and CIFAR10 datasets and achieved accuracies of 98.61%, 93.99%, and 82.71%, respectively (less than 1% degradation compared to the GXNOR algorithm). We evaluated the synapse array performance potential and showed that the proposed synapse circuit can train ternary networks in situ, with 18.3 TOPs/W for feedforward and 3 TOPs/W for weight update.

preprint, 2020, arXiv

CONTRA: Area-Constrained Technology Mapping Framework For Memristive Memory Processing Unit

Data-intensive applications are poised to benefit directly from processing-in-memory platforms, such as memristive Memory Processing Units, which allow leveraging data locality and performing stateful logic operations. Developing design automation flows for such platforms is a challenging and highly relevant research problem. In this work, we investigate the problem of minimizing delay under an arbitrary area constraint for MAGIC-based in-memory computing platforms. We propose an end-to-end area-constrained technology mapping framework, CONTRA. CONTRA uses look-up table (LUT) based mapping of the input function on the crossbar array to maximize parallel operations and uses a novel search technique to move data optimally inside the array. CONTRA supports benchmarks in a variety of formats, along with crossbar dimensions as input to generate MAGIC instructions. CONTRA scales for large benchmarks, as demonstrated by our experiments. CONTRA allows mapping benchmarks to smaller crossbar dimensions than achieved by any previous technique, while allowing a wide variety of area-delay trade-offs. CONTRA improves the composite metric of area-delay product by 2.1x to 13.1x compared to seven existing technology mapping approaches.