Source author record

Hyuk-Jae Lee

Hyuk-Jae Lee appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Hardware Architecture Distributed, Parallel, and Cluster Computing hep-th Computer Vision Emerging Technologies Machine Learning

Catalog footprint

What is connected

6works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

An In-Module Disturbance Barrier for Mitigating Write Disturbance in Phase-Change Memory

Write disturbance error (WDE) appears as a serious reliability problem preventing phase-change memory (PCM) from general commercialization, and therefore several studies have been proposed to mitigate WDEs. Verify-and-correction (VnC) eliminates WDEs by always verifying the data correctness on neighbors after programming, but incurs significant performance overhead. Encoding-based schemes mitigate WDEs by reducing the number of WDE-vulnerable data patterns; however, mitigation performance notably fluctuates with applications. Moreover, encoding-based schemes still rely on VnC-based schemes. Cache-based schemes lower WDEs by storing data in a write cache, but it requires several megabytes of SRAM to significantly mitigate WDEs. Despite the efforts of previous studies, these methods incur either significant performance or area overhead. Therefore, a new approach, which does not rely on VnC-based schemes or application data patterns, is highly necessary. Furthermore, the new approach should be transparent to processors (i.e., in-module), because the characteristic of WDEs is determined by manufacturers of PCM products. In this paper, we present an in-module disturbance barrier (IMDB) that mitigates WDEs on demand. IMDB includes a two-level hierarchy comprising two SRAM-based tables, whose entries are managed with a dedicated replacement policy that sufficiently utilizes the characteristics of WDEs. The naive implementation of the replacement policy requires hundreds of read ports on SRAM, which is infeasible in real hardware; hence, an approximate comparator is also designed. We also conduct a rigorous exploration of architecture parameters to obtain a cost-effective design. The proposed method significantly reduces WDEs without noticeable speed degradation or additional energy consumption compared to previous methods.

preprint2022arXiv

ShortcutFusion: From Tensorflow to FPGA-based accelerator with reuse-aware memory allocation for shortcut data

Residual block is a very common component in recent state-of-the art CNNs such as EfficientNet or EfficientDet. Shortcut data accounts for nearly 40% of feature-maps access in ResNet152 [8]. Most of the previous DNN compilers, accelerators ignore the shortcut data optimization. This paper presents ShortcutFusion, an optimization tool for FPGA-based accelerator with a reuse-aware static memory allocation for shortcut data, to maximize on-chip data reuse given resource constraints. From TensorFlow DNN models, the proposed design generates instruction sets for a group of nodes which uses an optimized data reuse for each residual block. The accelerator design implemented on the Xilinx KCU1500 FPGA card 2.8x faster and 9.9x more power efficient than NVIDIA RTX 2080 Ti for 256x256 input size. . Compared to the result from baseline, in which the weights, inputs, and outputs are accessed from the off-chip memory exactly once per each layer, ShortcutFusion reduces the DRAM access by 47.8-84.8% for RetinaNet, Yolov3, ResNet152, and EfficientNet. Given a similar buffer size to ShortcutMining [8], which also mine the shortcut data in hardware, the proposed work reduces off-chip access for feature-maps 5.27x while accessing weight from off-chip memory exactly once.

preprint2021arXiv

GradPIM: A Practical Processing-in-DRAM Architecture for Gradient Descent

In this paper, we present GradPIM, a processing-in-memory architecture which accelerates parameter updates of deep neural networks training. As one of processing-in-memory techniques that could be realized in the near future, we propose an incremental, simple architectural design that does not invade the existing memory protocol. Extending DDR4 SDRAM to utilize bank-group parallelism makes our operation designs in processing-in-memory (PIM) module efficient in terms of hardware cost and performance. Our experimental results show that the proposed architecture can improve the performance of DNN training and greatly reduce memory bandwidth requirement while posing only a minimal amount of overhead to the protocol and DRAM area.

preprint2020arXiv

Layer-specific Optimization for Mixed Data Flow with Mixed Precision in FPGA Design for CNN-based Object Detectors

Convolutional neural networks (CNNs) require both intensive computation and frequent memory access, which lead to a low processing speed and large power dissipation. Although the characteristics of the different layers in a CNN are frequently quite different, previous hardware designs have employed common optimization schemes for them. This paper proposes a layer-specific design that employs different organizations that are optimized for the different layers. The proposed design employs two layer-specific optimizations: layer-specific mixed data flow and layer-specific mixed precision. The mixed data flow aims to minimize the off-chip access while demanding a minimal on-chip memory (BRAM) resource of an FPGA device. The mixed precision quantization is to achieve both a lossless accuracy and an aggressive model compression, thereby further reducing the off-chip access. A Bayesian optimization approach is used to select the best sparsity for each layer, achieving the best trade-off between the accuracy and compression. This mixing scheme allows the entire network model to be stored in BRAMs of the FPGA to aggressively reduce the off-chip access, and thereby achieves a significant performance enhancement. The model size is reduced by 22.66-28.93 times compared to that in a full-precision network with a negligible degradation of accuracy on VOC, COCO, and ImageNet datasets. Furthermore, the combination of mixed dataflow and mixed precision significantly outperforms the previous works in terms of both throughput, off-chip access, and on-chip memory requirement.

preprint1998arXiv

Vortex Solutions of Four-fermion Theory coupled to a Yang-Mills-Chern-Simons Gauge Field

We have constructed a four-fermion theory coupled to a Yang-Mills-Chern-Simons gauge field which admits static multi-vortex solutions. This is achieved through the introduction of an anomalous magnetic interation term in addition to the usual minimal coupling, and the appropriate choice of the fermion quartic coupling constant.

preprint1997arXiv

Vortex solutions of a Maxwell-Chern-Simons field coupled to four-fermion theory

We find the static vortex solutions of the model of Maxwell-Chern-Simons gauge field coupled to a (2+1)-dimensional four-fermion theory. Especially, we introduce two matter currents coupled to the gauge field minimally: the electromagnetic current and a topological current associated with the electromagnetic current. Unlike other Chern-Simons solitons the N-soliton solution of this theory has binding energy and the stability of the solutions is maintained by the charge conservation laws.