Researcher profile

Jemin Lee

Jemin Lee contributes to research discovery and scholarly infrastructure.

Researcher; Affiliation not imported; Open to collaborate

Trust snapshot

Quick read

Trust 19 - Unverified; Verification L1; Unclaimed author
5 works
0 followers
7 topics
4 close collaborators

Published work

5 published item(s)

preprint, 2026, arXiv

QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention

FlashAttention improves efficiency through tiling, but its online softmax still relies on floating-point arithmetic for numerical stability, making full quantization difficult. We identify three main obstacles to integer-only FlashAttention: (1) scale explosion during tile-wise accumulation, (2) inefficient shift-based exponential operations on GPUs, and (3) quantization granularity constraints requiring uniform scales for integer comparison. To address these challenges, we propose QFlash, an end-to-end integer FlashAttention design that performs softmax entirely in the integer domain and runs as a single Triton kernel. On seven attention workloads from ViT, DeiT, and Swin models, QFlash achieves up to 6.73x speedup over I-ViT and up to 8.69x speedup on Swin, while reducing energy consumption by 18.8% compared to FP16 FlashAttention, without sacrificing Top-1 accuracy on ViT/DeiT and remaining competitive on Swin under per-tensor quantization. Our code is publicly available at https://github.com/EfficientCompLab/qflash.
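For readers unfamiliar with the online softmax mentioned in this abstract, the following is a minimal sketch of the standard floating-point version, not of QFlash's integer design. It scans a row of attention scores tile by tile while keeping a running maximum and a rescaled running sum, which is the tile-wise accumulation step the abstract refers to. The function names and the tile size are illustrative assumptions.

```python
import math


def softmax(scores):
    """Reference softmax computed over the full row at once."""
    row_max = max(scores)
    exps = [math.exp(s - row_max) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def online_softmax_stats(scores, tile_size):
    """Scan the row tile by tile, tracking a running max and running sum.

    Hypothetical helper for illustration; a real kernel would fuse this
    with the attention-value accumulation.
    """
    running_max = -math.inf
    running_sum = 0.0
    for start in range(0, len(scores), tile_size):
        tile = scores[start:start + tile_size]
        new_max = max(running_max, max(tile))
        # Rescale the sum accumulated so far to the new maximum, then add
        # the contributions of the current tile.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(s - new_max) for s in tile)
        running_max = new_max
    return running_max, running_sum


def tiled_softmax(scores, tile_size):
    """Softmax obtained from the tile-wise statistics."""
    row_max, total = online_softmax_stats(scores, tile_size)
    return [math.exp(s - row_max) / total for s in scores]
```

The rescaling by `exp(running_max - new_max)` is the floating-point step that, according to the abstract, makes a fully integer pipeline difficult.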

preprint, 2023, arXiv

Joint Service Caching and Computing Resource Allocation for Edge Computing-Enabled Networks

In this paper, we consider service caching and computing resource allocation in edge computing (EC)-enabled networks. We introduce a random service caching design that accounts for multiple types of latency-sensitive services and the service caching storage of the base stations (BSs). We then derive the successful service probability (SSP) and formulate an SSP maximization problem over the service caching distribution and the computing resource allocation. We show that the optimization problem is nonconvex and develop a novel algorithm that obtains a stationary point of the SSP maximization problem by adopting parallel successive convex approximation (SCA). To further reduce the computational complexity, we also provide a low-complexity algorithm that obtains a near-optimal solution of the SSP maximization problem in the high computing capability region. Finally, numerical simulations show that the proposed solutions achieve higher SSP than baseline schemes and that the near-optimal solution achieves reliable performance in the high computing capability region. We also explore the impact of the target delays, the BSs' service cache size, and the EC servers' computing capability on the SSP.
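As a rough illustration of the successive convex approximation idea named in this abstract, the sketch below applies it to a one-variable toy objective, f(x) = x^4 - 3x^2 + x. This objective, the starting point, and the function names are assumptions chosen for illustration; they are not the SSP problem or the parallel SCA algorithm from the paper. At each step the concave part (-3x^2) is replaced by its tangent at the current point, which gives a convex surrogate whose minimizer is available in closed form.

```python
import math


def objective(x):
    """Toy nonconvex objective (assumed for illustration only)."""
    return x**4 - 3.0 * x**2 + x


def gradient(x):
    """Derivative of the toy objective."""
    return 4.0 * x**3 - 6.0 * x + 1.0


def signed_cbrt(v):
    """Real cube root that also handles negative inputs."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)


def sca_minimize(x0, iterations=60):
    """Iteratively minimize a convex surrogate of the toy objective.

    Surrogate at x_k: x^4 + x - 3*x_k^2 - 6*x_k*(x - x_k).
    Setting its derivative 4x^3 + 1 - 6*x_k to zero gives the update below.
    """
    x = x0
    history = [objective(x)]
    for _ in range(iterations):
        x = signed_cbrt((6.0 * x - 1.0) / 4.0)
        history.append(objective(x))
    return x, history


x_star, values = sca_minimize(1.0)
print(x_star, gradient(x_star))
```

Because each surrogate upper-bounds the objective and touches it at the current point, the objective value does not increase from one iteration to the next, and the iterates settle at a stationary point rather than a guaranteed global optimum.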

preprint, 2022, arXiv

CPrune: Compiler-Informed Model Pruning for Efficient Target-Aware DNN Execution

Mobile devices run deep learning models for various purposes, such as image classification and speech recognition. Due to the resource constraints of mobile devices, researchers have focused on either building a lightweight deep neural network (DNN) model through model pruning or generating efficient code through compiler optimization. Surprisingly, we found that a straightforward integration of model compression and compiler auto-tuning often does not produce the most efficient model for a target device. We propose CPrune, a compiler-informed model pruning method for efficient target-aware DNN execution that supports an application with a required target accuracy. CPrune builds a lightweight DNN model through informed pruning based on the structural information of subgraphs built during the compiler tuning process. Our experimental results show that CPrune increases DNN execution speed by up to 2.73x compared to the state-of-the-art TVM auto-tune while satisfying the accuracy requirement.

preprint, 2022, arXiv

Facing to Latency of Hyperledger Fabric for Blockchain-enabled IoT: Modeling and Analysis

Hyperledger Fabric (HLF), one of the most popular private blockchains, has recently received attention for blockchain-enabled Internet of Things (IoT). However, for IoT applications to handle time-sensitive data, the processing latency in HLF has emerged as a new challenge. In this article, therefore, we establish a practical HLF latency model for HLF-enabled IoT. We first discuss the structure and transaction flow of HLF-enabled IoT. After implementing real HLF, we capture the latencies that each transaction experiences and show that the total latency of HLF can be modeled as a Gamma distribution, which is validated by conducting a goodness-of-fit test (i.e., the Kolmogorov-Smirnov (KS) test). We also provide the parameter values of the modeled latency distribution for various HLF environments. Furthermore, we explore the impacts of three important HLF parameters including transaction generation rate, block size, and block-generation timeout on the HLF latency. As a result, this article provides design insights on minimizing the latency for HLF-enabled IoT.
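The modeling step described in this abstract, fitting a Gamma distribution to measured latencies and checking it with a Kolmogorov-Smirnov test, can be sketched as follows. This is a minimal standard-library sketch under stated assumptions: the latencies are synthetic draws rather than HLF measurements, the parameters are fitted by the method of moments (the abstract does not state the fitting method), and all function names are hypothetical.

```python
import math
import random


def gamma_cdf(x, shape, scale):
    """Gamma CDF via the power series of the regularized incomplete gamma."""
    if x <= 0:
        return 0.0
    z = x / scale
    term = 1.0
    total = 1.0
    n = 0
    while term > 1e-15 * total:
        n += 1
        term *= z / (shape + n)
        total += term
    log_prefactor = shape * math.log(z) - z - math.lgamma(shape + 1.0)
    return min(1.0, math.exp(log_prefactor) * total)


def fit_gamma_moments(samples):
    """Method-of-moments estimates: mean = shape*scale, var = shape*scale^2."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / (n - 1)
    return mean * mean / var, var / mean


def ks_statistic(samples, shape, scale):
    """Largest gap between the empirical CDF and the fitted Gamma CDF."""
    ordered = sorted(samples)
    n = len(ordered)
    gap = 0.0
    for i, x in enumerate(ordered, start=1):
        cdf = gamma_cdf(x, shape, scale)
        gap = max(gap, i / n - cdf, cdf - (i - 1) / n)
    return gap


# Synthetic "latencies" in seconds (assumed values, not measured data).
rng = random.Random(0)
latencies = [rng.gammavariate(4.0, 0.5) for _ in range(2000)]
shape_hat, scale_hat = fit_gamma_moments(latencies)
d_stat = ks_statistic(latencies, shape_hat, scale_hat)
critical = 1.36 / math.sqrt(len(latencies))  # asymptotic 5% critical value
print(shape_hat, scale_hat, d_stat, critical)
```

A statistic below the critical value means the test does not reject the Gamma model at the 5% level. Note that when the parameters are estimated from the same sample, this classical critical value is conservative.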

preprint, 2020, arXiv

Mobile Edge Computing-Enabled Heterogeneous Networks

Mobile edge computing (MEC) has been introduced to provide computing capabilities at the edge of networks and thereby improve the latency performance of wireless networks. In this paper, we provide a novel framework for MEC-enabled heterogeneous networks (HetNets), composed of multi-tier networks with access points (APs) (i.e., MEC servers) that have different transmission powers and different computing capabilities. In this framework, we also consider multiple types of mobile users with different sizes of computation tasks; they offload the tasks to an MEC server and receive the computation results from the server. We derive the successful edge computing probability (SECP), defined as the probability that a user offloads and finishes its computation task at the MEC server within the target latency. We provide a closed-form expression of the approximated SECP for the general case, and closed-form expressions of the exact SECP for special cases. This paper then provides design insights for the optimal configuration of MEC-enabled HetNets by analyzing the effects of network parameters and bias factors, used in MEC server association, on the SECP. Specifically, it shows how the optimal bias factors in terms of SECP change with the numbers of user types and tiers of MEC servers, and how they differ from conventional ones that do not consider the computing capabilities and task sizes.
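The SECP definition in this abstract, the probability that offloading and computing finish within a target latency, can be illustrated with a deliberately simplified toy model. The sketch below assumes an exponentially distributed transmission time and a fixed computing time; these assumptions, the parameter values, and the function names are illustrative only and are not the stochastic-geometry model or the expressions derived in the paper. Under these assumptions the probability has a simple closed form, which the Monte Carlo estimate is compared against.

```python
import math
import random


def simulate_success_probability(tx_rate, compute_time, target_latency,
                                 trials, seed=0):
    """Monte Carlo estimate of P(transmission + computing <= target)."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(trials):
        # Assumed model: transmission time is exponential with rate tx_rate.
        tx_time = rng.expovariate(tx_rate)
        if tx_time + compute_time <= target_latency:
            successes += 1
    return successes / trials


def closed_form_success_probability(tx_rate, compute_time, target_latency):
    """Exact probability for the toy model: 1 - exp(-rate * slack)."""
    slack = target_latency - compute_time
    if slack <= 0:
        return 0.0
    return 1.0 - math.exp(-tx_rate * slack)


estimate = simulate_success_probability(2.0, 0.3, 1.0, trials=100_000)
exact = closed_form_success_probability(2.0, 0.3, 1.0)
print(estimate, exact)
```

Even in this toy setting, a longer computing time shrinks the slack left for transmission, which hints at why the abstract treats computing capability and task size as factors in server association.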