Source author record

Minhyeok Lee

Minhyeok Lee appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision · Machine Learning · Artificial Intelligence · eess.IV · Neural and Evolutionary Computing

Catalog footprint

What is connected

9 works

5 topics

4 close collaborators

preprint · 2026 · arXiv

OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.
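The transport step described in this abstract can be illustrated with a small sketch. It is not the authors' implementation: the entropic (Sinkhorn) solver, the squared-distance cost with a weight `lam` on the spatial term, the toy masses, and the proportional budget rule in `allocate_budgets` are all assumptions chosen for illustration, and every function name is hypothetical.

```python
import math

def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(u, v))

def locality_cost(feats_a, pos_a, feats_b, pos_b, lam=0.5):
    """cost[i][j] = feature disparity + lam * spatial disparity (assumed form)."""
    return [[sq_dist(fa, fb) + lam * sq_dist(pa, pb)
             for fb, pb in zip(feats_b, pos_b)]
            for fa, pa in zip(feats_a, pos_a)]

def sinkhorn(cost, mass_a, mass_b, eps=0.5, iters=200):
    """Entropic OT via Sinkhorn scaling; returns (plan, total transport cost).

    Assumes small costs relative to eps so that exp(-cost / eps) does not underflow.
    """
    n, m = len(mass_a), len(mass_b)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [mass_a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [mass_b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    total = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total

def allocate_budgets(difficulties, total_tokens):
    """Split a token budget in proportion to transport difficulty (assumed rule)."""
    norm = sum(difficulties)
    return [total_tokens * d / norm for d in difficulties]

# Toy data: two tokens per frame, 2-D features, 2-D grid positions.
feats_a = [[1.0, 0.0], [0.0, 1.0]]
pos = [[0.0, 0.0], [1.0, 0.0]]
feats_same = [[1.0, 0.0], [0.0, 1.0]]  # next frame unchanged
feats_diff = [[0.0, 1.0], [1.0, 0.0]]  # next frame with content swapped
mass = [0.7, 0.3]                      # non-uniform mass: token 0 counts as more important

plan_same, cost_same = sinkhorn(locality_cost(feats_a, pos, feats_same, pos), mass, mass)
plan_diff, cost_diff = sinkhorn(locality_cost(feats_a, pos, feats_diff, pos), mass, mass)
budgets = allocate_budgets([cost_same, cost_diff], total_tokens=10)
```

In this toy setting the unchanged frame pair yields a much lower total transport cost than the swapped pair, so the assumed proportional rule gives the harder pair the larger share of the token budget.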

preprint · 2024 · arXiv

Class-Continuous Conditional Generative Neural Radiance Field

3D-aware image synthesis focuses on preserving spatial consistency while also generating high-resolution images with fine details. Recently, Neural Radiance Field (NeRF) has been introduced for synthesizing novel views with low computational cost and superior performance. While several works investigate generative NeRFs and show remarkable results, they cannot handle conditional and continuous feature manipulation in the generation procedure. In this work, we introduce a novel model, called Class-Continuous Conditional Generative NeRF ($\text{C}^{3}$G-NeRF), which can synthesize conditionally manipulated photorealistic 3D-consistent images by projecting conditional features to the generator and the discriminator. The proposed $\text{C}^{3}$G-NeRF is evaluated on three image datasets: AFHQ, CelebA, and Cars. Our model shows strong 3D consistency with fine details and smooth interpolation in conditional feature manipulation. For instance, $\text{C}^{3}$G-NeRF achieves a Fréchet Inception Distance (FID) of 7.64 in 3D-aware face image synthesis at a $128^{2}$ resolution. Additionally, we report per-class FIDs of generated 3D-aware images for each dataset, since $\text{C}^{3}$G-NeRF can synthesize class-conditional images.

preprint · 2022 · arXiv

RandomSEMO: Normality Learning Of Moving Objects For Video Anomaly Detection

Recent anomaly detection algorithms have shown strong performance by adopting frame-predicting autoencoders. However, these methods face two challenges. First, they are likely to be trained to be excessively powerful, generating even abnormal frames well, which leads to failure in detecting anomalies. Second, they are distracted by the large number of objects captured in both foreground and background. To solve these problems, we propose a novel superpixel-based video data transformation technique named Random Superpixel Erasing on Moving Objects (RandomSEMO) and Moving Object Loss (MOLoss), built on top of a simple lightweight autoencoder. RandomSEMO is applied to the moving object regions by randomly erasing their superpixels. It forces the network to pay attention to the foreground objects and learn the normal features more effectively, rather than simply predicting the future frame. Moreover, MOLoss urges the model to focus on learning normal objects captured within RandomSEMO by amplifying the loss on the pixels near the moving objects. The experimental results show that our model outperforms state-of-the-art methods on three benchmarks.
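The two ideas in this abstract, erasing superpixels on moving objects and amplifying the loss near them, can be sketched on plain nested lists. This is an illustration only: erasing to zero, treating any superpixel that overlaps the motion mask as a candidate, using the mask itself as the "near moving objects" region, and the `erase_prob` and `amplify` values are assumptions not stated in the abstract, and the function names are hypothetical.

```python
import random

def random_superpixel_erase(frame, superpixels, moving_mask, erase_prob=0.5, rng=None):
    """Zero out randomly chosen superpixels that overlap the moving-object mask."""
    rng = rng or random.Random(0)
    height, width = len(frame), len(frame[0])
    # Superpixel ids that contain at least one moving pixel.
    candidates = sorted({superpixels[y][x]
                         for y in range(height) for x in range(width)
                         if moving_mask[y][x]})
    erased = {s for s in candidates if rng.random() < erase_prob}
    out = [[0.0 if superpixels[y][x] in erased else frame[y][x]
            for x in range(width)] for y in range(height)]
    return out, erased

def moving_object_loss(pred, target, moving_mask, amplify=2.0):
    """Mean squared error with a larger weight on moving-object pixels."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, moving_mask):
        for p, t, m in zip(p_row, t_row, m_row):
            weight = amplify if m else 1.0
            total += weight * (p - t) ** 2
            count += 1
    return total / count

# Toy 2x4 grayscale frame with three superpixels; motion covers superpixel 1.
frame = [[0.2, 0.4, 0.6, 0.8],
         [0.1, 0.3, 0.5, 0.7]]
superpixels = [[0, 0, 1, 1],
               [0, 0, 2, 2]]
moving_mask = [[0, 0, 1, 1],
               [0, 0, 0, 0]]

# erase_prob=1.0 makes the toy example deterministic: every candidate is erased.
erased_frame, erased_ids = random_superpixel_erase(
    frame, superpixels, moving_mask, erase_prob=1.0)
loss = moving_object_loss(erased_frame, frame, moving_mask, amplify=2.0)
```

In training, the erased frame would be the network input and the original frame the prediction target; here the loss is evaluated directly between the two only to show how the weighting behaves.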

preprint · 2022 · arXiv

SPSN: Superpixel Prototype Sampling Network for RGB-D Salient Object Detection

RGB-D salient object detection (SOD) has been in the spotlight recently because it is an important preprocessing operation for various vision tasks. However, despite advances in deep learning-based methods, RGB-D SOD remains challenging because of the large domain gap between RGB images and depth maps and because depth maps are often of low quality. To solve this problem, we propose a novel superpixel prototype sampling network (SPSN) architecture. The proposed model splits the input RGB image and depth map into component superpixels to generate component prototypes. We design a prototype sampling network so that the network only samples prototypes corresponding to salient objects. In addition, we propose a reliance selection module to recognize the quality of each RGB and depth feature map and adaptively weight them in proportion to their reliability. The proposed method makes the model robust to inconsistencies between RGB images and depth maps and eliminates the influence of non-salient objects. Our method is evaluated on five popular datasets, achieving state-of-the-art performance. We demonstrate the effectiveness of the proposed method through comparative experiments.

preprint · 2022 · arXiv

Tackling Background Distraction in Video Object Segmentation

Semi-supervised video object segmentation (VOS) aims to densely track certain designated objects in videos. One of the main challenges in this task is the existence of background distractors that appear similar to the target objects. We propose three novel strategies to suppress such distractors: 1) a spatio-temporally diversified template construction scheme to obtain generalized properties of the target objects; 2) a learnable distance-scoring function to exclude spatially distant distractors by exploiting the temporal consistency between two consecutive frames; 3) swap-and-attach augmentation to force each object to have unique features by providing training samples containing entangled objects. On all public benchmark datasets, our model achieves performance comparable to contemporary state-of-the-art approaches while running in real time. Qualitative results also demonstrate the superiority of our approach over existing methods. We believe our approach will be widely used for future VOS research.

preprint · 2022 · arXiv

Unsupervised Video Object Segmentation via Prototype Memory Network

Unsupervised video object segmentation aims to segment a target object in the video without a ground truth mask in the initial frame. This challenging task requires extracting features for the most salient common objects within a video sequence. This difficulty can be addressed by using motion information such as optical flow, but using only the information between adjacent frames results in poor connectivity between distant frames and poor performance. To solve this problem, we propose a novel prototype memory network architecture. The proposed model effectively extracts the RGB and motion information by extracting superpixel-based component prototypes from the input RGB images and optical flow maps. In addition, the model scores the usefulness of the component prototypes in each frame based on a self-learning algorithm, adaptively stores the most useful prototypes in memory, and discards obsolete prototypes. We use the prototypes in the memory bank to predict the next query frame's mask, which enhances the association between distant frames to help with accurate mask prediction. Our method is evaluated on three datasets, achieving state-of-the-art performance. We demonstrate the effectiveness of the proposed model with various ablation studies.
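The store-and-discard behavior of the memory bank can be pictured with a small sketch. It assumes that usefulness scores arrive as plain numbers (the abstract obtains them from a self-learning algorithm that is not reproduced here) and that "obsolete" simply means falling outside the top `capacity` scores; the `PrototypeMemory` class and its methods are hypothetical names.

```python
class PrototypeMemory:
    """Fixed-capacity memory that keeps the highest-scoring prototypes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []  # list of (usefulness score, prototype vector)

    def update(self, prototypes, scores):
        """Add one frame's prototypes, then drop everything beyond capacity."""
        self.items.extend(zip(scores, prototypes))
        # Sort by score only, highest first; ties keep insertion order.
        self.items.sort(key=lambda item: item[0], reverse=True)
        self.items = self.items[:self.capacity]

    def read(self):
        """Return stored prototypes, most useful first."""
        return [proto for _, proto in self.items]

memory = PrototypeMemory(capacity=2)
# Frame 1 contributes three prototypes; only the two most useful are kept.
memory.update([[1.0], [2.0], [3.0]], [0.1, 0.9, 0.5])
after_frame_1 = memory.read()
# Frame 2 contributes one prototype that displaces the weakest stored one.
memory.update([[4.0]], [0.7])
after_frame_2 = memory.read()
```

Because the memory persists across frames, a prototype stored early in the sequence can still be read when predicting a much later frame, which is the long-range association the abstract refers to.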

preprint · 2020 · arXiv

Estimation with Uncertainty via Conditional Generative Adversarial Networks

Conventional predictive Artificial Neural Networks (ANNs) commonly employ deterministic weight matrices; therefore, their prediction is a point estimate. This deterministic nature limits the use of ANNs in medical diagnosis, legal problems, and portfolio management, where knowing not only the prediction but also its uncertainty is essential. To address this problem, we propose a predictive probabilistic neural network model, which uses the generator of a conditional Generative Adversarial Network (cGAN) in a different manner from its routine use for conditional sample generation. By reversing the input and output of an ordinary cGAN, the model can be successfully used as a predictive model; in addition, the model is robust against noise since adversarial training is employed. To measure the uncertainty of predictions, we introduce the entropy and relative entropy for regression problems and classification problems, respectively. The proposed framework is applied to stock market data and an image classification task. As a result, the proposed framework shows superior estimation performance, especially on noisy data; moreover, it is demonstrated that the proposed framework can properly estimate the uncertainty of predictions.
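The uncertainty measures named in this abstract can be illustrated on repeated samples from a stochastic predictor. The sketch below is an assumption-laden illustration, not the paper's definitions: it fits a Gaussian to regression samples to obtain a differential entropy, and for classification it measures relative entropy against a uniform reference distribution, which the abstract does not specify. The sample lists stand in for generator outputs drawn with different noise inputs.

```python
import math
from collections import Counter

def gaussian_entropy(samples, var_floor=1e-12):
    """Differential entropy of a Gaussian fitted to the samples (assumed estimator)."""
    n = len(samples)
    mean = sum(samples) / n
    var = max(sum((s - mean) ** 2 for s in samples) / n, var_floor)
    return 0.5 * math.log(2 * math.pi * math.e * var)

def kl_from_uniform(labels, num_classes):
    """KL(empirical label distribution || uniform); the uniform reference is an assumption."""
    counts = Counter(labels)
    n = len(labels)
    kl = 0.0
    for c in range(num_classes):
        p = counts.get(c, 0) / n
        if p > 0:  # terms with p == 0 contribute nothing
            kl += p * math.log(p * num_classes)
    return kl

# Regression: repeated predictions for one input under different noise vectors.
confident_reg = [1.00, 1.01, 0.99, 1.00]
uncertain_reg = [0.0, 1.0, 2.0, 1.0]
h_confident = gaussian_entropy(confident_reg)
h_uncertain = gaussian_entropy(uncertain_reg)

# Classification: repeated predicted labels for one input (4 classes).
kl_confident = kl_from_uniform([2] * 10, num_classes=4)
kl_uncertain = kl_from_uniform([0, 1, 2, 3], num_classes=4)
```

Under these assumptions, a lower entropy marks a more certain regression output, while a larger divergence from uniform marks a more certain classification.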

preprint · 2020 · arXiv

Regularization Methods for Generative Adversarial Networks: An Overview of Recent Studies

Despite its short history, the Generative Adversarial Network (GAN) has been extensively studied and used for various tasks, including its original purpose, i.e., synthetic sample generation. However, applying GANs to different data types with diverse neural network architectures has been hindered by a limitation in training, where the model easily diverges. This notorious training instability of GANs is well known and has been addressed in numerous studies. Consequently, to make GAN training stable, numerous regularization methods have been proposed in recent years. This paper reviews the regularization methods that have been recently introduced, most of which have been published in the last three years. Specifically, we focus on general methods that can be commonly used regardless of neural network architectures. To explore the latest research trends in regularization for GANs, the methods are classified into several groups by their operation principles, and the differences between the methods are analyzed. Furthermore, to provide practical knowledge of using these methods, we investigate popular methods that have been frequently employed in state-of-the-art GANs. In addition, we discuss the limitations of existing methods and propose future research directions.

preprint · 2020 · arXiv

Score-Guided Generative Adversarial Networks

We propose a Generative Adversarial Network (GAN) that introduces an evaluator module using pre-trained networks. The proposed model, called score-guided GAN (ScoreGAN), is trained with an evaluation metric for GANs, i.e., the Inception score, as a rough guide for the training of the generator. By using another pre-trained network instead of the Inception network, ScoreGAN avoids overfitting to the Inception network, so that generated samples do not correspond to adversarial examples of the Inception network. Also, to prevent overfitting, the evaluation metrics are employed only in an auxiliary role, while the conventional target of GANs is mainly used. Evaluated on the CIFAR-10 dataset, ScoreGAN demonstrated an Inception score of 10.36$\pm$0.15, which corresponds to state-of-the-art performance. Furthermore, to generalize the effectiveness of ScoreGAN, the model was further evaluated on another dataset, i.e., CIFAR-100; as a result, ScoreGAN outperformed the other existing methods, with a Fréchet Inception Distance (FID) of 13.98.
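For readers unfamiliar with the metric, the Inception score is the exponential of the average KL divergence between each sample's class distribution p(y|x) and the marginal p(y). The sketch below computes it from class probabilities supplied as plain lists; the classifier that would produce them is out of scope. The `generator_loss` helper is a hypothetical illustration of using the score in an auxiliary role, with an assumed weight, and is not the paper's objective.

```python
import math

def inception_score(cond_probs):
    """exp(mean over samples of KL(p(y|x) || p(y))) for a list of class distributions."""
    n = len(cond_probs)
    k = len(cond_probs[0])
    # Marginal class distribution p(y), averaged over all samples.
    marginal = [sum(p[c] for p in cond_probs) / n for c in range(k)]
    mean_kl = 0.0
    for p in cond_probs:
        mean_kl += sum(pc * math.log(pc / marginal[c])
                       for c, pc in enumerate(p) if pc > 0) / n
    return math.exp(mean_kl)

def generator_loss(adversarial_loss, score, weight=0.1):
    """Adversarial loss with a small score bonus (assumed form and weight)."""
    return adversarial_loss - weight * score

# Confident predictions spread evenly over 4 classes: the score equals 4.
diverse = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0],
           [0.0, 0.0, 0.0, 1.0]]
# Every sample gets the same uninformative distribution: the score equals 1.
collapsed = [[0.25, 0.25, 0.25, 0.25]] * 4

score_diverse = inception_score(diverse)
score_collapsed = inception_score(collapsed)
loss = generator_loss(adversarial_loss=1.0, score=score_diverse, weight=0.1)
```

The score ranges from 1 to the number of classes, which is why a small weight keeps it from dominating the adversarial term in this sketch.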
