Researcher profile

Hyuk-Jae Lee

Hyuk-Jae Lee contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 17 - UnverifiedVerification L1Unclaimed author
4works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

4 published item(s)

preprint2022arXiv

An In-Module Disturbance Barrier for Mitigating Write Disturbance in Phase-Change Memory

Write disturbance error (WDE) appears as a serious reliability problem preventing phase-change memory (PCM) from general commercialization, and therefore several studies have been proposed to mitigate WDEs. Verify-and-correction (VnC) eliminates WDEs by always verifying the data correctness on neighbors after programming, but incurs significant performance overhead. Encoding-based schemes mitigate WDEs by reducing the number of WDE-vulnerable data patterns; however, mitigation performance notably fluctuates with applications. Moreover, encoding-based schemes still rely on VnC-based schemes. Cache-based schemes lower WDEs by storing data in a write cache, but it requires several megabytes of SRAM to significantly mitigate WDEs. Despite the efforts of previous studies, these methods incur either significant performance or area overhead. Therefore, a new approach, which does not rely on VnC-based schemes or application data patterns, is highly necessary. Furthermore, the new approach should be transparent to processors (i.e., in-module), because the characteristic of WDEs is determined by manufacturers of PCM products. In this paper, we present an in-module disturbance barrier (IMDB) that mitigates WDEs on demand. IMDB includes a two-level hierarchy comprising two SRAM-based tables, whose entries are managed with a dedicated replacement policy that sufficiently utilizes the characteristics of WDEs. The naive implementation of the replacement policy requires hundreds of read ports on SRAM, which is infeasible in real hardware; hence, an approximate comparator is also designed. We also conduct a rigorous exploration of architecture parameters to obtain a cost-effective design. The proposed method significantly reduces WDEs without noticeable speed degradation or additional energy consumption compared to previous methods.

preprint2022arXiv

ShortcutFusion: From Tensorflow to FPGA-based accelerator with reuse-aware memory allocation for shortcut data

Residual block is a very common component in recent state-of-the art CNNs such as EfficientNet or EfficientDet. Shortcut data accounts for nearly 40% of feature-maps access in ResNet152 [8]. Most of the previous DNN compilers, accelerators ignore the shortcut data optimization. This paper presents ShortcutFusion, an optimization tool for FPGA-based accelerator with a reuse-aware static memory allocation for shortcut data, to maximize on-chip data reuse given resource constraints. From TensorFlow DNN models, the proposed design generates instruction sets for a group of nodes which uses an optimized data reuse for each residual block. The accelerator design implemented on the Xilinx KCU1500 FPGA card 2.8x faster and 9.9x more power efficient than NVIDIA RTX 2080 Ti for 256x256 input size. . Compared to the result from baseline, in which the weights, inputs, and outputs are accessed from the off-chip memory exactly once per each layer, ShortcutFusion reduces the DRAM access by 47.8-84.8% for RetinaNet, Yolov3, ResNet152, and EfficientNet. Given a similar buffer size to ShortcutMining [8], which also mine the shortcut data in hardware, the proposed work reduces off-chip access for feature-maps 5.27x while accessing weight from off-chip memory exactly once.

preprint2021arXiv

GradPIM: A Practical Processing-in-DRAM Architecture for Gradient Descent

In this paper, we present GradPIM, a processing-in-memory architecture which accelerates parameter updates of deep neural networks training. As one of processing-in-memory techniques that could be realized in the near future, we propose an incremental, simple architectural design that does not invade the existing memory protocol. Extending DDR4 SDRAM to utilize bank-group parallelism makes our operation designs in processing-in-memory (PIM) module efficient in terms of hardware cost and performance. Our experimental results show that the proposed architecture can improve the performance of DNN training and greatly reduce memory bandwidth requirement while posing only a minimal amount of overhead to the protocol and DRAM area.

preprint2020arXiv

Layer-specific Optimization for Mixed Data Flow with Mixed Precision in FPGA Design for CNN-based Object Detectors

Convolutional neural networks (CNNs) require both intensive computation and frequent memory access, which lead to a low processing speed and large power dissipation. Although the characteristics of the different layers in a CNN are frequently quite different, previous hardware designs have employed common optimization schemes for them. This paper proposes a layer-specific design that employs different organizations that are optimized for the different layers. The proposed design employs two layer-specific optimizations: layer-specific mixed data flow and layer-specific mixed precision. The mixed data flow aims to minimize the off-chip access while demanding a minimal on-chip memory (BRAM) resource of an FPGA device. The mixed precision quantization is to achieve both a lossless accuracy and an aggressive model compression, thereby further reducing the off-chip access. A Bayesian optimization approach is used to select the best sparsity for each layer, achieving the best trade-off between the accuracy and compression. This mixing scheme allows the entire network model to be stored in BRAMs of the FPGA to aggressively reduce the off-chip access, and thereby achieves a significant performance enhancement. The model size is reduced by 22.66-28.93 times compared to that in a full-precision network with a negligible degradation of accuracy on VOC, COCO, and ImageNet datasets. Furthermore, the combination of mixed dataflow and mixed precision significantly outperforms the previous works in terms of both throughput, off-chip access, and on-chip memory requirement.