Source author record

Hyesoon Kim

Hyesoon Kim appears in the imported research catalog. Authorship, coauthor, and topic links are available, although the profile has not yet been claimed.

Researcher (unclaimed source record)

Distributed, Parallel, and Cluster Computing; Cryptography and Security; Databases; Hardware Architecture

Catalog footprint

What is connected

4 works

4 topics

4 close collaborators


Preprint, 2026, arXiv

CuFuzz: Hardening CUDA Programs through Transformation and Fuzzing

GPUs have gained significant popularity over the past decade, extending beyond their original role in graphics rendering. This evolution has brought GPU security and reliability to the forefront of concerns. Prior research has shown that CUDA's lack of memory safety can lead to serious vulnerabilities. While fuzzing is effective at finding such bugs on CPUs, equivalent tools for GPUs are missing because of architectural differences and the absence of built-in error detection. In this paper, we propose CuFuzz, a novel compiler-runtime co-design that extends state-of-the-art CPU fuzzing tools to GPU programs. CuFuzz transforms GPU programs into CPU programs using compiler IR-level transformations to enable effective fuzz testing. To the best of our knowledge, CuFuzz is the first mechanism to bring fuzzing support to CUDA, addressing a critical gap in GPU security research. By leveraging CPU memory error detectors such as AddressSanitizer, CuFuzz aims to uncover memory safety bugs and related correctness vulnerabilities in CUDA code, enhancing the security and reliability of GPU-accelerated applications. To ensure high fuzzing throughput, we introduce two compiler-runtime co-optimizations tailored for GPU code: Partial Representative Execution (PREX) and Access-Index Preserving Pruning (AXIPrune). PREX improves throughput by 32x on average, and AXIPrune adds a further 33% on top of PREX-optimized code. Together, these optimizations can yield up to a 224.31x speedup. In our fuzzing campaigns, CuFuzz uncovered 122 security vulnerabilities in widely used benchmarks.
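The basic idea of running a GPU kernel as an ordinary CPU program, so that a CPU-side memory checker can observe every access, can be illustrated with a minimal Python sketch. It assumes a kernel written as a plain function of its block and thread indices. `CheckedBuffer`, `launch_on_cpu`, `buggy_scale_kernel`, and `fuzz` are hypothetical names. The bounds check stands in for AddressSanitizer, and the random search is a toy substitute for a coverage-guided fuzzer. This is not CuFuzz's IR-level transformation, and it does not model PREX or AXIPrune.

```python
import random


class OutOfBoundsError(Exception):
    """Raised when a kernel touches memory outside its allocation."""


class CheckedBuffer:
    """List-backed buffer that validates every index.

    Stands in for a CPU memory error detector such as AddressSanitizer;
    a real detector instruments native loads and stores instead.
    """

    def __init__(self, size, fill=0):
        self._data = [fill] * size

    def _check(self, i):
        if not 0 <= i < len(self._data):
            raise OutOfBoundsError(f"index {i} outside [0, {len(self._data)})")

    def __getitem__(self, i):
        self._check(i)
        return self._data[i]

    def __setitem__(self, i, value):
        self._check(i)
        self._data[i] = value


def buggy_scale_kernel(block_idx, thread_idx, block_dim, out, inp, n):
    # Hypothetical CUDA-style kernel body. The guard uses <= instead of <,
    # so whenever the rounded-up grid contains a thread with global index n,
    # that thread reads one element past the end of inp.
    i = block_idx * block_dim + thread_idx
    if i <= n:
        out[i] = 2 * inp[i]


def launch_on_cpu(kernel, grid_dim, block_dim, *args):
    # Serialize the SIMT launch: each (block, thread) pair becomes one
    # iteration of an ordinary CPU loop nest.
    for b in range(grid_dim):
        for t in range(block_dim):
            kernel(b, t, block_dim, *args)


def fuzz(kernel, trials=200, seed=0):
    """Try random problem sizes; return the first crashing input, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        n = rng.randint(1, 64)
        block_dim = rng.choice([4, 8, 16, 32])
        grid_dim = (n + block_dim - 1) // block_dim  # round up, as CUDA host code typically does
        inp, out = CheckedBuffer(n, fill=1), CheckedBuffer(n)
        try:
            launch_on_cpu(kernel, grid_dim, block_dim, out, inp, n)
        except OutOfBoundsError as err:
            return {"n": n, "block_dim": block_dim, "error": str(err)}
    return None


crash = fuzz(buggy_scale_kernel)
print(crash)
```

The crash can only appear when `n` is not a multiple of `block_dim`, which is exactly the kind of size-dependent edge case a fuzzer is good at reaching.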

Preprint, 2022, arXiv

CuPBoP: CUDA for Parallelized and Broad-range Processors

CUDA is one of the most popular choices for GPU programming, but it can only be executed on NVIDIA GPUs. Executing CUDA on non-NVIDIA devices not only benefits the hardware community but also enables data-parallel computation in heterogeneous systems. To make CUDA programs portable, some researchers have proposed source-to-source translators that convert CUDA into portable programming languages that can run on non-NVIDIA devices. However, most CUDA translators require additional manual modifications to the translated code, which imposes a heavy workload on developers. In this paper, we propose CuPBoP, which executes CUDA on non-NVIDIA devices without relying on any portable programming language. Compared with existing work that executes CUDA on non-NVIDIA devices, CuPBoP requires no manual modification of the CUDA source code, yet it achieves the highest coverage on the Rodinia benchmark (69.6%), well above existing frameworks (56.6%). In particular, for CPU backends, CuPBoP supports several ISAs (e.g., x86, RISC-V, AArch64) and achieves performance close to or even higher than other projects. We also compare and analyze the performance of CuPBoP, manually optimized OpenMP/MPI programs, and CUDA programs on the latest Ampere architecture GPU, and outline future directions for supporting CUDA programs on non-NVIDIA devices with high performance.
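One problem any CPU backend for CUDA must solve is preserving `__syncthreads()` semantics when a block's threads are serialized on one core. A common approach is to split the kernel at each barrier into barrier-free regions and run every region for all threads before starting the next. The Python sketch below illustrates that general technique on a hypothetical tile-reversal kernel. The abstract does not say whether CuPBoP uses exactly this scheme, and all function names here are assumptions.

```python
def reverse_tile_regions(block_dim):
    """Barrier-free regions of a hypothetical CUDA kernel:

        __shared__ int tile[BLOCK];
        tile[t] = in[base + t];              // region 0
        __syncthreads();
        out[base + t] = tile[BLOCK - 1 - t]; // region 1
    """
    def region0(ctx, t):
        ctx["tile"][t] = ctx["inp"][ctx["base"] + t]

    def region1(ctx, t):
        ctx["out"][ctx["base"] + t] = ctx["tile"][block_dim - 1 - t]

    return [region0, region1]


def run_block_split(regions, block_dim, ctx):
    # Correct: every thread finishes a region before any thread starts the
    # next one, which is what __syncthreads() guarantees.
    for region in regions:
        for t in range(block_dim):
            region(ctx, t)


def run_block_naive(regions, block_dim, ctx):
    # Incorrect: each thread runs to completion before the next starts, so
    # early threads read tile slots that nobody has written yet.
    for t in range(block_dim):
        for region in regions:
            region(ctx, t)


def launch(inp, block_dim, run_block):
    assert len(inp) % block_dim == 0, "sketch assumes full blocks"
    out = [0] * len(inp)
    regions = reverse_tile_regions(block_dim)
    for b in range(len(inp) // block_dim):
        # Shared memory is private to a block, so each block gets a fresh tile.
        ctx = {"inp": inp, "out": out, "base": b * block_dim, "tile": [0] * block_dim}
        run_block(regions, block_dim, ctx)
    return out


print(launch(list(range(8)), 4, run_block_split))  # [3, 2, 1, 0, 7, 6, 5, 4]
print(launch(list(range(8)), 4, run_block_naive))  # [0, 0, 1, 0, 0, 0, 5, 4]
```

The naive runner shows why a straightforward "one loop iteration per thread" translation is not enough once a kernel contains barriers.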

Preprint, 2021, arXiv

THIA: Accelerating Video Analytics using Early Inference and Fine-Grained Query Planning

To efficiently process visual data at scale, researchers have proposed two techniques for lowering the computational overhead of the underlying deep learning models. The first approach uses a specialized, lightweight model to answer the query directly. The second approach filters out irrelevant frames using a lightweight model and processes the remaining frames using a heavyweight model. These techniques suffer from two limitations. With the first approach, the specialized model cannot provide accurate results for hard-to-detect events. With the second approach, the system cannot accelerate queries focusing on frequently occurring events, because the filter is unable to eliminate a significant fraction of the frames in the video. In this paper, we present THIA, a video analytics system that tackles these limitations. The design of THIA centers on three techniques. First, instead of using a cascade of models, it uses a single object detection model with multiple exit points for short-circuiting inference. This early inference technique allows it to support a range of throughput-accuracy tradeoffs. Second, it adopts a fine-grained approach to planning, processing different chunks of the video with different exit points to meet the user's requirements. Lastly, it uses a lightweight technique that directly estimates the exit point for a chunk, lowering the optimization time. We empirically show that these techniques enable THIA to outperform two state-of-the-art video analytics systems by up to 6.5x while providing accurate results even on queries focusing on hard-to-detect events.
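Fine-grained planning can be pictured as a separate exit-point choice for each video chunk. The Python sketch below assumes per-chunk, per-exit accuracy estimates are already available and greedily picks the cheapest exit that meets the target. THIA's abstract instead describes estimating the exit point for a chunk directly with a lightweight technique, so this greedy rule is an illustrative simplification. The exit names, costs, and accuracy numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExitPoint:
    name: str
    cost_ms: float  # hypothetical per-frame inference cost at this exit


# Hypothetical exits of a single multi-exit detector, sorted cheapest first.
EXITS = [ExitPoint("exit1", 4.0), ExitPoint("exit2", 9.0), ExitPoint("final", 20.0)]


def plan_chunks(estimated_accuracy, target):
    """For each chunk, pick the cheapest exit whose estimated accuracy meets
    `target`; fall back to the final exit if none does.

    estimated_accuracy: one dict per chunk mapping exit name -> estimate.
    """
    plan = []
    for chunk_est in estimated_accuracy:
        chosen = EXITS[-1]
        for exit_point in EXITS:  # cheapest first, so the first hit is the cheapest
            if chunk_est[exit_point.name] >= target:
                chosen = exit_point
                break
        plan.append(chosen.name)
    return plan


def plan_cost(plan, frames_per_chunk):
    """Total hypothetical inference cost (ms) of a per-chunk plan."""
    cost = {e.name: e.cost_ms for e in EXITS}
    return sum(cost[name] * frames_per_chunk for name in plan)


# An easy chunk, a hard-to-detect chunk, and a medium chunk (made-up estimates).
estimates = [
    {"exit1": 0.95, "exit2": 0.97, "final": 0.99},
    {"exit1": 0.70, "exit2": 0.85, "final": 0.93},
    {"exit1": 0.88, "exit2": 0.92, "final": 0.98},
]
plan = plan_chunks(estimates, target=0.9)
print(plan, plan_cost(plan, frames_per_chunk=30))  # ['exit1', 'final', 'exit2'] 990.0
```

With these made-up numbers, running every chunk to the final exit would cost 1800.0 ms, so mixing exits per chunk saves work on easy chunks while the hard chunk still gets the full model.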

Preprint, 2020, arXiv

Vortex: OpenCL Compatible RISC-V GPGPU

The current challenges in technology scaling are pushing the semiconductor industry towards hardware specialization, creating a proliferation of heterogeneous systems-on-chip that deliver orders-of-magnitude performance and power benefits over traditional general-purpose architectures. This transition is getting a significant boost from the advent of RISC-V, whose unique modular and extensible ISA allows a wide range of low-cost processor designs for various target applications. In addition, OpenCL is currently the most widely adopted programming framework for heterogeneous platforms, available on mainstream CPUs and GPUs as well as FPGAs and custom DSPs. In this work, we present Vortex, a RISC-V general-purpose GPU that supports OpenCL. Vortex implements a SIMT architecture with a minimal ISA extension to RISC-V that enables the execution of OpenCL programs. We also extended the OpenCL runtime framework to use the new ISA. We evaluate this design using 15 nm technology and report performance and energy numbers for a subset of benchmarks from the Rodinia benchmark suite.
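A SIMT core runs a group of threads (a warp) in lockstep and handles a divergent branch by issuing each path under a per-lane active mask, then reconverging. The Python sketch below models that generic behavior, with one list element per lane. It is not Vortex's ISA extension or microarchitecture, which the abstract does not detail, and the function name is hypothetical.

```python
def simt_if_else(values, cond, then_fn, else_fn):
    """Run `x = then_fn(x) if cond(x) else else_fn(x)` on one warp.

    Each list element is one lane's register value. The warp issues each
    path at most once; a per-lane active mask decides which lanes commit.
    Returns the new register values and the paths that were issued.
    """
    taken = [bool(cond(v)) for v in values]  # per-lane branch predicate
    masks = {"then": taken, "else": [not t for t in taken]}
    fns = {"then": then_fn, "else": else_fn}
    result = list(values)
    issued = []
    for path in ("then", "else"):
        mask = masks[path]
        if not any(mask):  # no lane takes this path: skip issuing it
            continue
        issued.append(path)
        for lane, active in enumerate(mask):
            if active:  # inactive lanes keep their old value
                result[lane] = fns[path](values[lane])
    # Reconvergence point: from here on all lanes are active again.
    return result, issued


print(simt_if_else([1, 2, 3, 4], lambda x: x % 2 == 0, lambda x: x * 10, lambda x: -x))
# ([-1, 20, -3, 40], ['then', 'else'])
```

A divergent warp pays for both paths, while a warp whose lanes all agree issues only one, which is why branch divergence matters for SIMT performance.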