Researcher profile

Donghyun Kim

Donghyun Kim contributes to research discovery and scholarly infrastructure.

Researcher. Affiliation not imported. Open to collaborate.

Trust snapshot

Quick read

Trust 21 (Emerging), Verification L1, Unclaimed author
24 works
0 followers
11 topics
4 close collaborators

Published work

24 published items

preprint, 2026, arXiv

Rethinking Graph Convolution for 2D-to-3D Hand Pose Lifting

Graph convolutional networks (GCNs) are widely used for 3D hand pose estimation, where the hand skeleton is encoded as a fixed adjacency graph. We revisit whether this is the most effective way to incorporate hand topology in 2D-to-3D lifting. In this paper, we perform controlled, parameter-matched ablations on the FPHA benchmark and show that standard multi-head self-attention consistently outperforms GCN baselines. Even when the GCN is strengthened with multi-hop adjacency and matched parameter count, self-attention reduces MPJPE from 12.36 mm to 10.09 mm. A skeleton-constrained graph attention network recovers most of this gap, indicating that input-dependent aggregation is a major source of improvement, while fully connected attention yields additional gains. We further show that hand topology is most effective when introduced as a soft structural prior through graph-distance positional encoding, rather than as a hard adjacency constraint. These results suggest that, for hand pose lifting, adaptive spatial attention is a more effective inductive bias than fixed graph convolution.
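The graph-distance positional encoding mentioned above relies on hop distances between joints along the skeleton. The following is a minimal sketch of that one ingredient only, not the authors' implementation: the 21-joint layout, the joint numbering, and the names `EDGES` and `graph_distances` are assumptions made for illustration. In an attention model, each distance would typically index a learned bias or embedding; that part is omitted here.

```python
from collections import deque

# Hypothetical 21-joint hand skeleton (an assumption, not taken from the paper):
# joint 0 is the wrist and each finger is a chain of four joints
# (1-4 thumb, 5-8 index, 9-12 middle, 13-16 ring, 17-20 pinky).
FINGER_BASES = (1, 5, 9, 13, 17)
EDGES = [(0, base) for base in FINGER_BASES]
EDGES += [(j, j + 1) for base in FINGER_BASES for j in range(base, base + 3)]


def graph_distances(num_joints, edges):
    """Return the all-pairs hop-distance matrix of an undirected skeleton graph."""
    neighbors = [[] for _ in range(num_joints)]
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)

    dist = [[-1] * num_joints for _ in range(num_joints)]
    for source in range(num_joints):
        # Breadth-first search from each joint gives shortest hop counts.
        dist[source][source] = 0
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if dist[source][nxt] < 0:
                    dist[source][nxt] = dist[source][node] + 1
                    queue.append(nxt)
    return dist


DIST = graph_distances(21, EDGES)
```

Under this numbering, `DIST[4][8]` is the hop distance between the thumb tip and the index fingertip, which passes through the wrist.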

preprint, 2026, arXiv

Think as Needed: Geometry-Driven Adaptive Perception for Autonomous Driving

Autonomous driving scenes range from empty highways to dense intersections with dozens of interacting road users, yet current 3D detection models apply a fixed computation budget to every frame, wasting resources on simple scenes while lacking capacity for complex ones. Existing approaches compound this problem: Transformer-based interaction models scale quadratically with the number of detected objects, and frame-by-frame processing causes the system to immediately forget objects the moment they become occluded. We propose Enhanced HOPE, an adaptive perception architecture that measures the geometric complexity of each incoming LiDAR frame using an unsupervised statistical estimator and routes it through a shallow or deep processing path accordingly, requiring no manual scene labels. To keep interaction modeling efficient, we replace quadratic pairwise attention with a linear-time subspace-based network that groups nearby objects into clusters and processes them jointly. The computational savings from these two mechanisms free up resources for a persistent temporal memory module that retains previously detected objects and traffic rules across frames, enabling the system to recall occluded objects seconds after they disappear from view. On the nuScenes and CARLA benchmarks, Enhanced HOPE reduces latency by 38% on simple scenes with no accuracy loss, improves mean Average Precision by 2.7 points on rare long-tail scenarios, and tracks objects through occlusions lasting over 5 seconds, where all tested baselines fail.

preprint, 2022, arXiv

A Broad Study of Pre-training for Domain Generalization and Adaptation

Deep models must learn robust and transferable representations in order to perform well on new domains. While domain transfer methods (e.g., domain adaptation, domain generalization) have been proposed to learn transferable representations across domains, they are typically applied to ResNet backbones pre-trained on ImageNet. Thus, existing works pay little attention to the effects of pre-training on domain transfer tasks. In this paper, we provide a broad study and in-depth analysis of pre-training for domain adaptation and generalization, covering network architectures, size, pre-training loss, and datasets. We observe that simply using a state-of-the-art backbone outperforms existing state-of-the-art domain adaptation baselines and sets new baselines on Office-Home and DomainNet, improving by 10.7% and 5.5%. We hope that this work can provide more insights for future domain transfer research.

preprint, 2022, arXiv

A Unified Framework for Domain Adaptive Pose Estimation

While pose estimation is an important computer vision task, it requires expensive annotation and suffers from domain shift. In this paper, we investigate the problem of domain adaptive 2D pose estimation that transfers knowledge learned on a synthetic source domain to a target domain without supervision. While several domain adaptive pose estimation models have been proposed recently, they are not generic but only focus on either human pose or animal pose estimation, and thus their effectiveness is somewhat limited to specific scenarios. In this work, we propose a unified framework that generalizes well on various domain adaptive pose estimation problems. We propose to align representations using both input-level and output-level cues (pixels and pose labels, respectively), which facilitates the knowledge transfer from the source domain to the unlabeled target domain. Our experiments show that our method achieves state-of-the-art performance under various domain shifts. Our method outperforms existing baselines on human pose estimation by up to 4.5 percentage points (pp), hand pose estimation by up to 7.4 pp, and animal pose estimation by up to 4.8 pp for dogs and 3.3 pp for sheep. These results suggest that our method is able to mitigate domain shift on diverse tasks and even unseen domains and objects (e.g., trained on horse and tested on dog). Our code will be publicly available at: https://github.com/VisionLearningGroup/UDA_PoseEstimation.

preprint, 2022, arXiv

BROS: A Pre-trained Language Model Focusing on Text and Layout for Better Key Information Extraction from Documents

Key information extraction (KIE) from document images requires understanding the contextual and spatial semantics of texts in two-dimensional (2D) space. Many recent studies try to solve the task by developing pre-trained language models focusing on combining visual features from document images with texts and their layout. On the other hand, this paper tackles the problem by going back to the basics: effective combination of text and layout. Specifically, we propose a pre-trained language model, named BROS (BERT Relying On Spatiality), that encodes relative positions of texts in 2D space and learns from unlabeled documents with an area-masking strategy. With this optimized training scheme for understanding texts in 2D space, BROS shows comparable or better performance compared to previous methods on four KIE benchmarks (FUNSD, SROIE*, CORD, and SciTSR) without relying on visual features. This paper also reveals two real-world challenges in KIE tasks, (1) minimizing the error from incorrect text ordering and (2) efficient learning from fewer downstream examples, and demonstrates the superiority of BROS over previous methods. Code is available at https://github.com/clovaai/bros.

preprint, 2022, arXiv

Emp-RFT: Empathetic Response Generation via Recognizing Feature Transitions between Utterances

Each utterance in multi-turn empathetic dialogues has features such as emotion, keywords, and utterance-level meaning. Feature transitions between utterances occur naturally. However, existing approaches fail to perceive the transitions because they extract features for the context at the coarse-grained level. To solve the above issue, we propose a novel approach of recognizing feature transitions between utterances, which helps understand the dialogue flow and better grasp the features of utterance that needs attention. Also, we introduce a response generation strategy to help focus on emotion and keywords related to appropriate features when generating responses. Experimental results show that our approach outperforms baselines and especially, achieves significant improvements on multi-turn dialogues.

preprint, 2022, arXiv

Label-free detection of single nanoparticles with disordered nanoisland surface plasmon sensor

We report sensing of single nanoparticles using disordered metallic nanoisland substrates supporting surface plasmon polaritons (SPPs). Speckle patterns arising from leakage radiation of elastically scattered SPPs provide a unique fingerprint of the scattering microstructure at the sensor surface. Experimental measurements of the speckle decorrelation are presented and shown to enable detection of sorption of individual gold nanoparticles and polystyrene beads. Our approach is verified through bright-field and fluorescence imaging of particles adhering to the nanoisland substrate.

preprint, 2022, arXiv

Schubert polynomials, the inhomogeneous TASEP, and evil-avoiding permutations

Consider a lattice of $n$ sites arranged around a ring, with the $n$ sites occupied by particles of weights $\{1,2,\dots,n\}$; the possible arrangements of particles in sites thus correspond to the $n!$ permutations in $S_n$. The inhomogeneous totally asymmetric simple exclusion process (or TASEP) is a Markov chain on the set of permutations, in which two adjacent particles of weights $i<j$ swap places at rate $x_i - y_{n+1-j}$ if the particle of weight $j$ is to the right of the particle of weight $i$. (Otherwise nothing happens.) In the case that $y_i=0$ for all $i$, the stationary distribution was conjecturally linked to Schubert polynomials by Lam-Williams, and explicit formulas for steady state probabilities were subsequently given in terms of multiline queues by Ayyer-Linusson and Arita-Mallick. In the case of general $y_i$, Cantini showed that $n$ of the $n!$ states have probabilities proportional to products of double Schubert polynomials. In this paper we introduce the class of evil-avoiding permutations, which are the permutations avoiding the patterns $2413, 4132, 4213$ and $3214$. We show that there are $\frac{(2+\sqrt{2})^{n-1}+(2-\sqrt{2})^{n-1}}{2}$ evil-avoiding permutations in $S_n$, and for each evil-avoiding permutation $w$, we give an explicit formula for the steady state probability $ψ_w$ as a product of double Schubert polynomials. We also show that the Schubert polynomials that arise in these formulas are flagged Schur functions, and give a bijection in this case between multiline queues and semistandard Young tableaux.
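The enumeration stated in this abstract can be checked by brute force for small $n$. The sketch below counts permutations avoiding the four patterns $2413, 4132, 4213, 3214$ and compares the count with the closed form; the function names are illustrative and the exhaustive search is only practical for small $n$. The closed form is evaluated in exact integer arithmetic by writing $(2+\sqrt{2})^m = a_m + b_m\sqrt{2}$, so that the formula equals $a_{n-1}$.

```python
from itertools import combinations, permutations

# The four forbidden patterns named in the abstract.
EVIL_PATTERNS = {(2, 4, 1, 3), (4, 1, 3, 2), (4, 2, 1, 3), (3, 2, 1, 4)}


def standardize(values):
    """Replace distinct values by their ranks 1..k, keeping relative order."""
    ranks = {v: r + 1 for r, v in enumerate(sorted(values))}
    return tuple(ranks[v] for v in values)


def is_evil_avoiding(perm):
    """True if no length-4 subsequence of perm matches a forbidden pattern."""
    return all(
        standardize([perm[i] for i in idx]) not in EVIL_PATTERNS
        for idx in combinations(range(len(perm)), 4)
    )


def count_evil_avoiding(n):
    """Exhaustively count evil-avoiding permutations in S_n (small n only)."""
    return sum(1 for p in permutations(range(1, n + 1)) if is_evil_avoiding(p))


def closed_form(n):
    """Evaluate ((2+sqrt2)^(n-1) + (2-sqrt2)^(n-1)) / 2 with integers only."""
    a, b = 1, 0  # (2+sqrt2)^0 = 1 + 0*sqrt2
    for _ in range(n - 1):
        # (2+sqrt2)(a + b*sqrt2) = (2a + 2b) + (a + 2b)*sqrt2
        a, b = 2 * a + 2 * b, a + 2 * b
    return a


COUNTS = [count_evil_avoiding(n) for n in range(1, 7)]
FORMULA = [closed_form(n) for n in range(1, 7)]
```

For $n \le 3$ there are no length-4 subsequences, so every permutation is evil-avoiding; for $n = 4$ exactly the four patterns themselves are excluded, leaving 20.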

preprint, 2022, arXiv

Techniques in equivariant Ehrhart theory

Equivariant Ehrhart theory generalizes the study of lattice point enumeration to also account for the symmetries of a polytope under a linear group action. We present a catalogue of techniques with applications in this field, including zonotopal decompositions, symmetric triangulations, combinatorial interpretation of the $h^\ast$-polynomial, and certificates for the (non)existence of invariant non-degenerate hypersurfaces. We apply these methods to several families of examples including hypersimplices, orbit polytopes, and graphic zonotopes, expanding the library of polytopes for which their equivariant Ehrhart theory is known.

preprint, 2022, arXiv

Temporal Relevance Analysis for Video Action Models

In this paper, we provide a deep analysis of temporal modeling for action recognition, an important but underexplored problem in the literature. We first propose a new approach to quantify the temporal relationships between frames captured by CNN-based action models based on layer-wise relevance propagation. We then conduct comprehensive experiments and in-depth analysis to provide a better understanding of how temporal modeling is affected by various factors such as dataset, network architecture, and input frames. With this, we further study some important questions for action recognition that lead to interesting findings. Our analysis shows that there is no strong correlation between temporal relevance and model performance; and action models tend to capture local temporal information, but less long-range dependencies. Our codes and models will be publicly available.

preprint, 2021, arXiv

Predicting Participation in Cancer Screening Programs with Machine Learning

In this paper, we present machine learning models based on random forest classifiers, support vector machines, gradient boosted decision trees, and artificial neural networks to predict participation in cancer screening programs in South Korea. The top performing model was based on gradient boosted decision trees and achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.8706 and average precision of 0.8776. The results of this study are encouraging and suggest that with further research, these models can be directly applied to Korea's healthcare system, thus increasing participation in Korea's National Cancer Screening Program.
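For readers unfamiliar with the two metrics reported here, the following is a small dependency-free sketch of how AUC-ROC and average precision can be computed from binary labels and predicted scores. It is not the study's evaluation code; the function names and the toy data are assumptions, and the average-precision routine does not treat tied scores specially.

```python
def auc_roc(labels, scores):
    """AUC-ROC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("need at least one positive example")
    return sum(precisions) / len(precisions)


# Toy data (hypothetical): 1 = participated, 0 = did not participate.
LABELS = [0, 0, 1, 1]
SCORES = [0.1, 0.4, 0.35, 0.8]
AUC = auc_roc(LABELS, SCORES)
AP = average_precision(LABELS, SCORES)
```

On the toy data, three of the four positive-negative pairs are ordered correctly, which is what the AUC value reflects.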

preprint, 2021, arXiv

Schubert polynomials and the inhomogeneous TASEP on a ring

Consider a lattice of $n$ sites arranged around a ring, with the $n$ sites occupied by particles of weights $\{1,2,\dots,n\}$; the possible arrangements of particles in sites thus correspond to the $n!$ permutations in $S_n$. The inhomogeneous totally asymmetric simple exclusion process (or TASEP) is a Markov chain on the set of permutations, in which two adjacent particles of weights $i<j$ swap places at rate $x_i - y_{n+1-j}$ if the particle of weight $j$ is to the right of the particle of weight $i$. (Otherwise nothing happens.) In the case that $y_i=0$ for all $i$, the stationary distribution was conjecturally linked to Schubert polynomials by Lam-Williams, and explicit formulas for steady state probabilities were subsequently given in terms of multiline queues by Ayyer-Linusson and Arita-Mallick. In the case of general $y_i$, Cantini showed that $n$ of the $n!$ states have probabilities proportional to double Schubert polynomials. In this paper we introduce the class of evil-avoiding permutations, which are the permutations avoiding the patterns $2413, 4132, 4213$ and $3214$. We show that there are $\frac{(2+\sqrt{2})^{n-1}+(2-\sqrt{2})^{n-1}}{2}$ evil-avoiding permutations in $S_n$, and for each evil-avoiding permutation $w$, we give an explicit formula for the steady state probability $ψ_w$ as a product of double Schubert polynomials. We also show that the Schubert polynomials that arise in these formulas are flagged Schur functions, and give a bijection in this case between multiline queues and semistandard Young tableaux.

preprint, 2020, arXiv

A combinatorial formula for the Ehrhart $h^{*}$-vector of the hypersimplex

We give a combinatorial formula for the Ehrhart $h^*$-vector of the hypersimplex. In particular, we show that $h^{*}_{d}(Δ_{k,n})$ is the number of hypersimplicial decorated ordered set partitions of type $(k,n)$ with winding number $d$, thereby proving a conjecture of Nick Early. We do this by proving a more general conjecture of Nick Early on the Ehrhart $h^*$-vector of a generic cross-section of a hypercube.
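As a point of reference for the quantity being counted, the $h^*$-vector of a small hypersimplex can be computed directly from lattice-point counts. The sketch below does this by brute force and does not implement the paper's combinatorial formula. It assumes the standard description $Δ_{k,n} = \{x \in [0,1]^n : \sum x_i = k\}$ with $0 < k < n$ (so the dimension is $n-1$), and the function names are illustrative.

```python
from itertools import product
from math import comb


def ehrhart_count(k, n, t):
    """Count integer points x in [0, t]^n with x_1 + ... + x_n = k * t."""
    return sum(1 for x in product(range(t + 1), repeat=n) if sum(x) == k * t)


def h_star_vector(k, n):
    """h*-vector of the hypersimplex, read off from its Ehrhart series.

    With d = n - 1, the series sum_t i(t) z^t equals h*(z) / (1 - z)^(d + 1),
    so h*_j = sum_{i=0..j} (-1)^i * C(d + 1, i) * i(j - i).
    """
    d = n - 1
    counts = [ehrhart_count(k, n, t) for t in range(d + 1)]
    return [
        sum((-1) ** i * comb(d + 1, i) * counts[j - i] for i in range(j + 1))
        for j in range(d + 1)
    ]


H_STAR_2_4 = h_star_vector(2, 4)  # the octahedron
```

The entries of the result sum to the normalized volume of the polytope, which gives a quick sanity check on small cases.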

preprint, 2020, arXiv

Click-aware purchase prediction with push at the top

Eliciting user preferences from purchase records for performing purchase prediction is challenging because negative feedback is not explicitly observed, and because treating all non-purchased items equally as negative feedback is unrealistic. Therefore, in this study, we present a framework that leverages the past click records of users to compensate for the missing user-item interactions of purchase records, i.e., non-purchased items. We begin by formulating various model assumptions, each one assuming a different order of user preferences among purchased, clicked-but-not-purchased, and non-clicked items, to study the usefulness of leveraging click records. We implement the model assumptions using the Bayesian personalized ranking model, which maximizes the area under the curve for bipartite ranking. However, we argue that using click records for bipartite ranking needs a meticulously designed model because of the relative unreliability of click records compared with that of purchase records. Therefore, we ultimately propose a novel learning-to-rank method, called P3STop, for performing purchase prediction. The proposed model is customized to be robust to relatively unreliable click records by particularly focusing on the accuracy of top-ranked items. Experimental results on two real-world e-commerce datasets demonstrate that P3STop considerably outperforms the state-of-the-art implicit-feedback-based recommendation methods, especially for top-ranked items.
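To make the Bayesian personalized ranking step concrete, here is a minimal sketch of the standard pairwise BPR objective applied to one possible preference ordering (purchased over clicked-but-not-purchased over non-clicked). This is plain BPR under one assumed ordering, not the proposed top-focused method; the function names, item identifiers, and the choice of ordering are assumptions made for illustration.

```python
import math


def preference_pairs(purchased, clicked_only, non_clicked):
    """Pairs (i, j) meaning 'i is preferred over j' under the assumed ordering
    purchased > clicked-but-not-purchased > non-clicked."""
    pairs = [(i, j) for i in purchased for j in clicked_only + non_clicked]
    pairs += [(i, j) for i in clicked_only for j in non_clicked]
    return pairs


def bpr_pair_loss(score_preferred, score_other):
    """Standard BPR term -log(sigmoid(s_i - s_j)), computed stably as softplus(s_j - s_i)."""
    diff = score_preferred - score_other
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))


def bpr_loss(scores, pairs):
    """Average BPR loss of a score table over a list of preference pairs."""
    return sum(bpr_pair_loss(scores[i], scores[j]) for i, j in pairs) / len(pairs)


# Hypothetical items and model scores for a single user.
PAIRS = preference_pairs(["bought"], ["clicked"], ["ignored"])
GOOD = bpr_loss({"bought": 2.0, "clicked": 1.0, "ignored": 0.0}, PAIRS)
BAD = bpr_loss({"bought": 0.0, "clicked": 1.0, "ignored": 2.0}, PAIRS)
```

A score table that respects the assumed ordering yields a lower loss than one that reverses it, which is the behavior gradient descent on this objective exploits.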

preprint, 2020, arXiv

Combinatorial formulas for the coefficients of the Al-Salam-Chihara polynomials

The Al-Salam-Chihara polynomials are an important family of orthogonal polynomials in one variable $x$ depending on 3 parameters $α$, $β$ and $q$. They are closely connected to a model from statistical mechanics called the partially asymmetric simple exclusion process (PASEP) and they can be obtained as a specialization of the Askey-Wilson polynomials. We give two different combinatorial formulas for the coefficients of the (transformed) Al-Salam-Chihara polynomials. Our formulas make manifest the fact that the coefficients are polynomials in $α$, $β$ and $q$ with positive coefficients.

preprint, 2020, arXiv

Cross-domain Self-supervised Learning for Domain Adaptation with Few Source Labels

Existing unsupervised domain adaptation methods aim to transfer knowledge from a label-rich source domain to an unlabeled target domain. However, obtaining labels for some source domains may be very expensive, making complete labeling as used in prior work impractical. In this work, we investigate a new domain adaptation scenario with sparsely labeled source data, where only a few examples in the source domain have been labeled, while the target domain is unlabeled. We show that when labeled source examples are limited, existing methods often fail to learn discriminative features applicable for both source and target domains. We propose a novel Cross-Domain Self-supervised (CDS) learning approach for domain adaptation, which learns features that are not only domain-invariant but also class-discriminative. Our self-supervised learning method captures apparent visual similarity with in-domain self-supervision in a domain adaptive manner and performs cross-domain feature matching with across-domain self-supervision. In extensive experiments with three standard benchmark datasets, our method significantly boosts performance of target accuracy in the new target domain with few source labels and is even helpful on classical domain adaptation scenarios.

preprint, 2020, arXiv

Factors involved in Cancer Screening Participation: Multilevel Mediation Model

In this paper, we identify the factors associated with cancer screening participation in Korea. We expand upon previous studies through a multilevel mediation model and a composite regional socioeconomic status index which combines education level and income level. Results of the model indicate that education level, nutritional education status and income level are significantly associated with cancer screening participation. With our findings in mind, we recommend that health authorities increase promotional health campaigns toward certain at-risk groups and expand the availability of nutrition education programs.

preprint, 2020, arXiv

Learning to Scale Multilingual Representations for Vision-Language Tasks

Current multilingual vision-language models either require a large number of additional parameters for each supported language, or suffer performance degradation as languages are added. In this paper, we propose a Scalable Multilingual Aligned Language Representation (SMALR) that supports many languages with few model parameters without sacrificing downstream task performance. SMALR learns a fixed size language-agnostic representation for most words in a multilingual vocabulary, keeping language-specific features for just a few. We use a masked cross-language modeling loss to align features with context from other languages. Additionally, we propose a cross-lingual consistency module that ensures predictions made for a query and its machine translation are comparable. The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date. We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.

preprint, 2020, arXiv

Multi-way Encoding for Robustness

Deep models are state-of-the-art for many computer vision tasks including image classification and object detection. However, it has been shown that deep models are vulnerable to adversarial examples. We highlight how one-hot encoding directly contributes to this vulnerability and propose breaking away from this widely-used, but highly-vulnerable mapping. We demonstrate that by leveraging a different output encoding, multi-way encoding, we decorrelate source and target models, making target models more secure. Our approach makes it more difficult for adversaries to find useful gradients for generating adversarial attacks. We present robustness for black-box and white-box attacks on four benchmark datasets: MNIST, CIFAR-10, CIFAR-100, and SVHN. The strength of our approach is also presented in the form of an attack for model watermarking, raising challenges in detecting stolen models.

preprint, 2020, arXiv

Sharing Nim and Enumeration of Nim Characteristics

In this paper, we introduce and examine a variant of the game of Nim (Sharing Nim), where players can either remove or transfer objects from one pile to another. The only restriction is that players may not transfer objects from a pile of greater size to a pile of smaller size. We also find new methods of enumerating characteristics of Nim and Sharing Nim, including the number of zero nim positions.
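As background for the enumeration mentioned here, the sketch below computes the nim-sum for ordinary Nim and counts positions whose nim-sum is zero by exhaustive search. It does not implement the Sharing Nim rules. Reading "zero nim positions" as positions with nim-sum zero, treating piles as ordered, and allowing empty piles are all assumptions of this example.

```python
from functools import reduce
from itertools import product
from operator import xor


def nim_sum(piles):
    """Bitwise XOR of the pile sizes; zero means the player to move loses ordinary Nim."""
    return reduce(xor, piles, 0)


def count_zero_nim_positions(num_piles, max_size):
    """Count ordered positions with pile sizes in 0..max_size whose nim-sum is 0."""
    return sum(
        1
        for piles in product(range(max_size + 1), repeat=num_piles)
        if nim_sum(piles) == 0
    )


TWO_PILES = count_zero_nim_positions(2, 3)    # only equal piles cancel
THREE_PILES = count_zero_nim_positions(3, 3)  # third pile is forced by the first two
```

With sizes ranging over 0..3, any two piles determine a unique third pile in the same range that zeroes the nim-sum, which explains the three-pile count.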

preprint, 2020, arXiv

Unsupervised Attributed Multiplex Network Embedding

Nodes in a multiplex network are connected by multiple types of relations. However, most existing network embedding methods assume that only a single type of relation exists between nodes. Even for those that consider the multiplexity of a network, they overlook node attributes, resort to node labels for training, and fail to model the global properties of a graph. We present a simple yet effective unsupervised network embedding method for attributed multiplex network called DMGI, inspired by Deep Graph Infomax (DGI) that maximizes the mutual information between local patches of a graph, and the global representation of the entire graph. We devise a systematic way to jointly integrate the node embeddings from multiple graphs by introducing 1) the consensus regularization framework that minimizes the disagreements among the relation-type specific node embeddings, and 2) the universal discriminator that discriminates true samples regardless of the relation types. We also show that the attention mechanism infers the importance of each relation type, and thus can be useful for filtering unnecessary relation types as a preprocessing step. Extensive experiments on various downstream tasks demonstrate that DMGI outperforms the state-of-the-art methods, even though DMGI is fully unsupervised.

preprint, 2020, arXiv

Unsupervised Differentiable Multi-aspect Network Embedding

Network embedding is an influential graph mining technique for representing nodes in a graph as distributed vectors. However, the majority of network embedding methods focus on learning a single vector representation for each node, which has been recently criticized for not being capable of modeling multiple aspects of a node. To capture the multiple aspects of each node, existing studies mainly rely on offline graph clustering performed prior to the actual embedding, which results in the cluster membership of each node (i.e., node aspect distribution) fixed throughout training of the embedding model. We argue that this not only makes each node always have the same aspect distribution regardless of its dynamic context, but also hinders the end-to-end training of the model that eventually leads to the final embedding quality largely dependent on the clustering. In this paper, we propose a novel end-to-end framework for multi-aspect network embedding, called asp2vec, in which the aspects of each node are dynamically assigned based on its local context. More precisely, among multiple aspects, we dynamically assign a single aspect to each node based on its current context, and our aspect selection module is end-to-end differentiable via the Gumbel-Softmax trick. We also introduce the aspect regularization framework to capture the interactions among the multiple aspects in terms of relatedness and diversity. We further demonstrate that our proposed framework can be readily extended to heterogeneous networks. Extensive experiments towards various downstream tasks on various types of homogeneous networks and a heterogeneous network demonstrate the superiority of asp2vec.
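The Gumbel-Softmax trick referred to above replaces a hard categorical choice with a smooth, temperature-controlled sample. The following is a minimal standard-library sketch of the sampling step only, not the asp2vec code: the function name, the example logits, and the temperature are assumptions, and a real model would use an automatic-differentiation framework so that gradients flow through the sample.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp the uniform draw away from 0 and 1 so both logarithms are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over three aspects for one node.
RNG = random.Random(0)
SAMPLE = gumbel_softmax([1.0, 0.5, -0.2], temperature=0.5, rng=RNG)
```

Lower temperatures push the sample toward a one-hot vector, while higher temperatures spread the mass more evenly across aspects.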

preprint, 2019, arXiv

Control of A High Performance Bipedal Robot using Viscoelastic Liquid Cooled Actuators

This paper describes the control and evaluation of a new human-scaled biped robot with liquid cooled viscoelastic actuators (VLCA). Based on the lessons learned from previous work from our team on VLCA [1], we present a new system design embodying a Reaction Force Sensing Series Elastic Actuator (RFSEA) and a Force Sensing Series Elastic Actuator (FSEA). These designs are aimed at reducing the size and weight of the robot's actuation system while inheriting the advantages of our designs such as energy efficiency, torque density, impact resistance and position/force controllability. The system design takes into consideration human-inspired kinematics and range-of-motion (ROM), while relying on foot placement to balance. In terms of actuator control, we perform a stability analysis on a Disturbance Observer (DOB) designed for force control. We then evaluate various position control algorithms both in the time and frequency domains for our VLCA actuators. Having the low level baseline established, we first perform a controller evaluation on the legs using Operational Space Control (OSC) [2]. Finally, we move on to evaluating the full bipedal robot by accomplishing unsupported dynamic walking by means of the algorithms to appear in [3].

preprint, 2019, arXiv

MULE: Multimodal Universal Language Embedding

Existing vision-language methods typically support two languages at a time at most. In this paper, we present a modular approach which can easily be incorporated into existing vision-language methods in order to support many languages. We accomplish this by learning a single shared Multimodal Universal Language Embedding (MULE) which has been visually-semantically aligned across all languages. Then we learn to relate MULE to visual data as if it were a single language. Our method is not architecture specific, unlike prior work which typically learned separate branches for each language, enabling our approach to easily be adapted to many vision-language methods and tasks. Since MULE learns a single language branch in the multimodal model, we can also scale to support many languages, and languages with fewer annotations can take advantage of the good representation learned from other (more abundant) language data. We demonstrate the effectiveness of MULE on the bidirectional image-sentence retrieval task, supporting up to four languages in a single model. In addition, we show that Machine Translation can be used for data augmentation in multilingual learning, which, combined with MULE, improves mean recall by up to 21.9% on a single language compared to prior work, with the most significant gains seen on languages with relatively few annotations. Our code is publicly available.