Source author record

Ig-jae Kim

Ig-jae Kim appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Artificial Intelligence

Catalog footprint

What is connected

5works

2topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

As Video Large Language Models (Video-LLMs) scale to longer and more complex videos, their inference cost grows rapidly due to the large volume of visual tokens accumulated across frames. Training-free token compression has emerged as a practical solution to this bottleneck. However, existing temporal compression methods rely primarily on cross-frame token similarity or segmentation heuristics, overlooking each token's semantic role within its frame and failing to adapt compression strength to the compressibility of each frame pair. In this work, we propose OTT-Vid, a transport-derived allocation framework for temporal token compression. Our approach consists of two stages: spatial pruning identifies representative content within each frame, and optimal transport (OT) is then solved between neighboring frames to estimate temporal compressibility. We formulate this OT with non-uniform token mass, which protects semantically important tokens from aggressive compression, and a locality-aware cost that captures both feature and spatial disparities. The resulting transport plan jointly balances token importance and matching cost, while its total cost defines the transport difficulty of each frame pair, which we use to allocate compression budgets dynamically. Experiments on six benchmarks spanning video question answering and temporal grounding show that OTT-Vid preserves 95.8% of VQA and 73.9% of VTG performance while retaining only 10% of tokens, consistently outperforming existing state-of-the-art training-free compression methods.

preprint2022arXiv

Probabilistic Representations for Video Contrastive Learning

This paper presents Probabilistic Video Contrastive Learning, a self-supervised representation learning method that bridges contrastive learning with probabilistic representation. We hypothesize that the clips composing the video have different distributions in short-term duration, but can represent the complicated and sophisticated video distribution through combination in a common embedding space. Thus, the proposed method represents video clips as normal distributions and combines them into a Mixture of Gaussians to model the whole video distribution. By sampling embeddings from the whole video distribution, we can circumvent the careful sampling strategy or transformations to generate augmented views of the clips, unlike previous deterministic methods that have mainly focused on such sample generation strategies for contrastive learning. We further propose a stochastic contrastive loss to learn proper video distributions and handle the inherent uncertainty from the nature of the raw video. Experimental results verify that our probabilistic embedding stands as a state-of-the-art video representation learning for action recognition and video retrieval on the most popular benchmarks, including UCF101 and HMDB51.

preprint2020arXiv

Cylindrical Convolutional Networks for Joint Object Detection and Viewpoint Estimation

Existing techniques to encode spatial invariance within deep convolutional neural networks only model 2D transformation fields. This does not account for the fact that objects in a 2D space are a projection of 3D ones, and thus they have limited ability to severe object viewpoint changes. To overcome this limitation, we introduce a learnable module, cylindrical convolutional networks (CCNs), that exploit cylindrical representation of a convolutional kernel defined in the 3D space. CCNs extract a view-specific feature through a view-specific convolutional kernel to predict object category scores at each viewpoint. With the view-specific feature, we simultaneously determine objective category and viewpoints using the proposed sinusoidal soft-argmax module. Our experiments demonstrate the effectiveness of the cylindrical convolutional networks on joint object detection and viewpoint estimation.

preprint2020arXiv

Relational Deep Feature Learning for Heterogeneous Face Recognition

Heterogeneous Face Recognition (HFR) is a task that matches faces across two different domains such as visible light (VIS), near-infrared (NIR), or the sketch domain. Due to the lack of databases, HFR methods usually exploit the pre-trained features on a large-scale visual database that contain general facial information. However, these pre-trained features cause performance degradation due to the texture discrepancy with the visual domain. With this motivation, we propose a graph-structured module called Relational Graph Module (RGM) that extracts global relational information in addition to general facial features. Because each identity's relational information between intra-facial parts is similar in any modality, the modeling relationship between features can help cross-domain matching. Through the RGM, relation propagation diminishes texture dependency without losing its advantages from the pre-trained features. Furthermore, the RGM captures global facial geometrics from locally correlated convolutional features to identify long-range relationships. In addition, we propose a Node Attention Unit (NAU) that performs node-wise recalibration to concentrate on the more informative nodes arising from relation-based propagation. Furthermore, we suggest a novel conditional-margin loss function (C-softmax) for the efficient projection learning of the embedding vector in HFR. The proposed method outperforms other state-of-the-art methods on five HFR databases. Furthermore, we demonstrate performance improvement on three backbones because our module can be plugged into any pre-trained face recognition backbone to overcome the limitations of a small HFR database.

preprint2020arXiv

SumGraph: Video Summarization via Recursive Graph Modeling

The goal of video summarization is to select keyframes that are visually diverse and can represent a whole story of an input video. State-of-the-art approaches for video summarization have mostly regarded the task as a frame-wise keyframe selection problem by aggregating all frames with equal weight. However, to find informative parts of the video, it is necessary to consider how all the frames of the video are related to each other. To this end, we cast video summarization as a graph modeling problem. We propose recursive graph modeling networks for video summarization, termed SumGraph, to represent a relation graph, where frames are regarded as nodes and nodes are connected by semantic relationships among frames. Our networks accomplish this through a recursive approach to refine an initially estimated graph to correctly classify each node as a keyframe by reasoning the graph representation via graph convolutional networks. To leverage SumGraph in a more practical environment, we also present a way to adapt our graph modeling in an unsupervised fashion. With SumGraph, we achieved state-of-the-art performance on several benchmarks for video summarization in both supervised and unsupervised manners.

Ig-jae Kim

What is connected

Connect this record

See the researcher in context

Building this map preview

5 published item(s)

OTT-Vid: Optimal Transport Temporal Token Compression for Video Large Language Models

Probabilistic Representations for Video Contrastive Learning

Cylindrical Convolutional Networks for Joint Object Detection and Viewpoint Estimation

Relational Deep Feature Learning for Heterogeneous Face Recognition

SumGraph: Video Summarization via Recursive Graph Modeling