Source author record

Jaeyoung Choi

Jaeyoung Choi appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Topics: Computer Vision, Computers and Society (cs.CY), Signal Processing (eess.SP), Information Retrieval, Multimedia, Social and Information Networks, Sound

Catalog footprint

What is connected

6 works

7 topics

4 close collaborators


Preprint, 2026, arXiv

EggHand: A Multimodal Foundation Model for Egocentric Hand Pose Forecasting

Forecasting future 3D hand pose sequences from egocentric video is essential for understanding human intention and enabling embodied applications such as AR/VR assistance and human-robot interaction. However, this task remains highly challenging because egocentric hand motion is driven by complex human intent, exhibits highly dexterous articulations, and is observed under drastic viewpoint shifts induced by ego-motion. In this work, we introduce EggHand, a foundation-model-based framework for egocentric hand pose forecasting that unifies multimodal semantic reasoning with dynamic motion modeling. Our approach couples an action decoder from a Vision-Language-Action (VLA) model, which captures the structured temporal dynamics of hand motion, with an egocentric video-text encoder that provides viewpoint-aware contextual information learned from large-scale first-person video. Together, these components overcome the brittleness of generic visual encoders under ego-motion and enable joint reasoning over motion, context, and high-level intent, without relying on body pose or external tracking. Experiments on the EgoExo4D dataset show that EggHand sets a new state of the art in forecasting accuracy, remains robust under severe ego-motion, and further enables controllable prediction via language-based task prompts. Project page: https://jyoun9.github.io/EggHand

Preprint, 2022, arXiv

Dynamic RF Beam Codebook Reduction for Cost-Efficient mmWave Full-Duplex Systems

Recent attempts to realize full-duplex (FD) communications in millimeter-wave (mmWave) systems have garnered significant interest because of their potential. In this paper, we present a cost-efficient design of mmWave FD systems in which the system dynamically reduces the RF beam codebook, in a computationally efficient manner, so that it consists only of RF beams that prevent the Rx chain from saturating due to self-interference (SI). The analog beamformer suppresses the SI to a level at which the residual SI can be completely removed by digital SI cancellation, allowing the digital beamformer to concentrate on the desired channel, free of SI. To reduce the computation required by the proposed method, we propose two sufficient conditions for preventing Rx-side saturation, which are tight in practice. Through performance evaluations conducted in realistically modeled mmWave FD scenarios, we demonstrate that the proposed design achieves performance comparable to the ideal FD system and other benchmarks at significantly lower cost.
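As a rough illustration of what reducing the codebook means, the sketch below exhaustively checks every Tx/Rx analog beam pair and keeps only the pairs whose self-interference power stays below a saturation threshold. This brute-force check is precisely the computation that the paper's two sufficient conditions are meant to avoid, so it is not the proposed method. The half-wavelength uniform linear array, the SI channel matrix `h_si`, the `saturation_level` parameter, and all function names are hypothetical assumptions for illustration.

```python
import cmath
import math
from typing import List, Tuple

Vector = List[complex]
Matrix = List[List[complex]]


def steering_vector(n: int, sin_theta: float) -> Vector:
    """Unit-norm steering vector of an n-element half-wavelength ULA (assumed array model)."""
    return [cmath.exp(1j * math.pi * k * sin_theta) / math.sqrt(n) for k in range(n)]


def si_power(w_rx: Vector, h_si: Matrix, f_tx: Vector, tx_power: float) -> float:
    """SI power tx_power * |w_rx^H H_si f_tx|^2 seen after analog beamforming."""
    # Apply the (hypothetical) SI channel to the Tx beam: H_si @ f_tx.
    hf = [sum(row[t] * f_tx[t] for t in range(len(f_tx))) for row in h_si]
    # Combine with the Rx beam: w_rx^H (H_si f_tx).
    combined = sum(w.conjugate() * x for w, x in zip(w_rx, hf))
    return tx_power * abs(combined) ** 2


def reduce_codebook(
    tx_codebook: List[Vector],
    rx_codebook: List[Vector],
    h_si: Matrix,
    tx_power: float,
    saturation_level: float,
) -> List[Tuple[int, int]]:
    """Brute force: keep (tx_index, rx_index) pairs whose SI stays below the saturation level."""
    kept = []
    for i, f in enumerate(tx_codebook):
        for j, w in enumerate(rx_codebook):
            if si_power(w, h_si, f, tx_power) < saturation_level:
                kept.append((i, j))
    return kept


if __name__ == "__main__":
    n = 4
    a0 = steering_vector(n, 0.0)
    a1 = steering_vector(n, 0.5)
    # Hypothetical rank-one SI channel that couples Tx and Rx only along broadside.
    h = [[a0[r] * a0[t].conjugate() for t in range(n)] for r in range(n)]
    print(reduce_codebook([a0, a1], [a0, a1], h, tx_power=1.0, saturation_level=0.5))
    # prints [(0, 1), (1, 0), (1, 1)]
```

In this toy setup only the pair in which both beams point along the coupling direction saturates, so it is the only pair removed from the reduced codebook.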

Preprint, 2022, arXiv

FontNet: Closing the gap to font designer performance in font synthesis

Font synthesis has been a very active topic in recent years because manual font design requires domain expertise and is labor-intensive and time-consuming. While remarkably successful, existing font synthesis methods have major shortcomings: they require fine-tuning with a large number of reference images for each unobserved font style, and the recent few-shot font synthesis methods are either designed for specific language systems or operate on low-resolution images, which limits their use. In this paper, we tackle the font synthesis problem by learning font style in an embedding space. To this end, we propose a model called FontNet that simultaneously learns to separate font styles in an embedding space, where distances directly correspond to a measure of font similarity, and to translate input images into a given observed or unobserved font style. Additionally, we design a network architecture and training procedure that can be adopted for any language system and can produce high-resolution font images. With this approach, our proposed method outperforms existing state-of-the-art font generation methods in both qualitative and quantitative experiments.
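One common way to obtain an embedding space in which distances reflect style similarity is a triplet margin loss: an anchor glyph is pulled toward a glyph of the same style and pushed away from a glyph of a different style. The abstract does not state FontNet's exact objective, so the sketch below is an illustrative assumption rather than the paper's loss; the function names and the default `margin` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Embedding = Sequence[float]


def euclidean(a: Embedding, b: Embedding) -> float:
    """Euclidean distance between two style embeddings."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(
    anchor: Embedding, positive: Embedding, negative: Embedding, margin: float = 1.0
) -> float:
    """Zero once the different-style glyph is at least `margin` farther away than the same-style one."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def batch_loss(
    triplets: Sequence[Tuple[Embedding, Embedding, Embedding]], margin: float = 1.0
) -> float:
    """Mean triplet loss over (anchor, same-style, different-style) embedding triplets."""
    return sum(triplet_margin_loss(a, p, n, margin) for a, p, n in triplets) / len(triplets)
```

For example, with an anchor at (0, 0), a same-style glyph at (0, 1) and a different-style glyph at (3, 0), the loss is 0; moving the different-style glyph to (1, 0) raises it to 1, which is the gradient signal that spreads styles apart.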

Preprint, 2022, arXiv

Sequential Movie Genre Prediction using Average Transition Probability with Clustering

In recent movie recommendation, predicting a user's sequential behavior and suggesting the next movie to watch is one of the most important problems. However, capturing such sequential behavior is not easy because each user's short-term and long-term behavior must both be taken into account. For this reason, many studies report that the performance of recommending a specific movie in sequential recommendation is not very high. In this paper, we propose a cluster-based method that groups users with similar movie purchase patterns, together with an algorithm that predicts the movie genre, rather than the movie itself, while considering users' short-term and long-term behavior. Genre prediction does not recommend a specific movie; instead, it predicts the genre of the next movie to watch based on each user's genre preferences, which are derived from the genres associated with each movie. This makes it possible to provide appropriate guidance for recommending movies of a given genre to users who tend to prefer that genre. In particular, users with similar genre preferences are grouped into clusters for genre recommendation, and in clusters without relatively specific tendencies, genre prediction is performed after trimming genres that are unnecessary for recommendation, in order to improve performance. We evaluate our method on well-known movie datasets and show qualitatively that it captures personalized dynamics and makes meaningful recommendations.
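The title names average transition probability as the core of genre prediction. The sketch below is one plausible reading, assuming a first-order Markov model with a single genre per watched movie and clusters that have already been formed; the clustering step and the genre trimming described in the abstract are not reproduced, and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence

# genre -> {next_genre: probability}
TransitionTable = Dict[str, Dict[str, float]]


def transition_probabilities(genres: Sequence[str]) -> TransitionTable:
    """First-order genre-to-genre transition probabilities for one user's watch history."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for prev, nxt in zip(genres, genres[1:]):
        counts[prev][nxt] += 1
    table: TransitionTable = {}
    for prev, row in counts.items():
        total = sum(row.values())
        table[prev] = {genre: c / total for genre, c in row.items()}
    return table


def average_transition_probabilities(cluster: List[Sequence[str]]) -> TransitionTable:
    """Average each transition row over the users in one cluster who have that row."""
    sums: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    support: Dict[str, int] = defaultdict(int)
    for genres in cluster:
        for prev, row in transition_probabilities(genres).items():
            support[prev] += 1
            for genre, p in row.items():
                sums[prev][genre] += p
    return {
        prev: {genre: total / support[prev] for genre, total in row.items()}
        for prev, row in sums.items()
    }


def predict_next_genre(table: TransitionTable, current: str) -> Optional[str]:
    """Most probable next genre given the genre of the movie just watched."""
    row = table.get(current)
    if not row:
        return None
    return max(row, key=row.get)
```

For example, with the hypothetical cluster histories `["Drama", "Comedy", "Drama", "Comedy"]` and `["Drama", "Comedy", "Drama", "Action"]`, the averaged Drama row is `{"Comedy": 0.75, "Action": 0.25}`, so the predicted genre after a drama is Comedy. Averaging per-user rows, rather than pooling raw counts, keeps one heavy user from dominating the cluster's transitions.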

Preprint, 2016, arXiv

DCAR: A Discriminative and Compact Audio Representation to Improve Event Detection

This paper presents a novel two-phase method for audio representation, Discriminative and Compact Audio Representation (DCAR), and evaluates its performance at detecting events in consumer-produced videos. In the first phase of DCAR, each audio track is modeled using a Gaussian mixture model (GMM) that includes several components to capture the variability within that track. The second phase takes into account both global structure and local structure. In this phase, the components are rendered more discriminative and compact by formulating an optimization problem on Grassmannian manifolds, which we found represents the structure of audio effectively. Our experiments used the YLI-MED dataset (an open TRECVID-style video corpus based on YFCC100M), which includes ten events. The results show that the proposed DCAR representation consistently outperforms state-of-the-art audio representations. DCAR's advantage over i-vector, mv-vector, and GMM representations is significant for both easier and harder discrimination tasks. We discuss how these performance differences across easy and hard cases follow from how each type of model leverages (or doesn't leverage) the intrinsic structure of the data. Furthermore, DCAR shows a particularly notable accuracy advantage on events where humans have more difficulty classifying the videos, i.e., events with lower mean annotator confidence.

Preprint, 2016, arXiv

Where to be wary: The impact of widespread photo-taking and image enhancement practices on users' geo-privacy

Today's geo-location estimation approaches can infer the location of a target image from its visual content alone. These approaches apply visual matching techniques to a large collection of background images with known geo-locations. Users who are unaware that visual retrieval approaches can compromise their geo-privacy unwittingly expose themselves to the risk of crime or other unintended consequences. Private photo sharing cannot protect users effectively, since its inconvenience is a barrier to consistent use, and photos can still fall into the wrong hands if they are re-shared. This paper lays the groundwork for a new approach to the geo-privacy of social images: instead of requiring a complete change in user behavior, we investigate the protection potential latent in users' existing practices. We carry out a series of retrieval experiments using a large collection of social images (8.5M) to systematically analyze where users should be wary, and how both photo-taking and editing practices affect the performance of geo-location estimation. We find that currently widespread practices are already sufficient, on their own, to protect ('geo-cloak') the geo-location of more than 50% of the images whose location would otherwise be automatically predictable. We conclude that protecting users against the unwanted effects of visual retrieval is a viable research field, and that it should take existing user practices as its starting point.
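To make the measured quantity concrete, the sketch below estimates a photo's location by copying the location of its most visually similar background image, then computes the share of originally predictable photos that an edit "geo-cloaks". It assumes precomputed visual feature vectors, 1-nearest-neighbour retrieval by cosine similarity, a hypothetical 1 km radius for counting a location as predictable, and edited photos that keep their true location; the paper's actual features, retrieval pipeline and distance thresholds are not given in the abstract, and all names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

LatLon = Tuple[float, float]
# (visual feature vector, known or true (lat, lon)) -- assumed representation
Photo = Tuple[Sequence[float], LatLon]

EARTH_RADIUS_KM = 6371.0


def haversine_km(a: LatLon, b: LatLon) -> float:
    """Great-circle distance in kilometres between two (lat, lon) points given in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def estimate_location(feature: Sequence[float], background: List[Photo]) -> LatLon:
    """1-nearest-neighbour geo-location: copy the location of the most similar background image."""
    _, location = max(background, key=lambda item: cosine_similarity(feature, item[0]))
    return location


def is_predictable(photo: Photo, background: List[Photo], radius_km: float) -> bool:
    feature, true_location = photo
    return haversine_km(estimate_location(feature, background), true_location) <= radius_km


def cloaked_fraction(
    original: List[Photo], edited: List[Photo], background: List[Photo], radius_km: float = 1.0
) -> float:
    """Share of originally predictable photos whose edited version is no longer predictable."""
    predictable = [i for i, photo in enumerate(original) if is_predictable(photo, background, radius_km)]
    if not predictable:
        return 0.0
    cloaked = sum(1 for i in predictable if not is_predictable(edited[i], background, radius_km))
    return cloaked / len(predictable)
```

With two background images and two originally predictable photos, an edit that shifts one photo's features toward the wrong background image yields a cloaked fraction of 0.5.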
