Researcher profile

Jae Sung Park

Jae Sung Park contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
8 works
0 followers
9 topics
4 close collaborators

Published work

8 published items

preprint | 2026 | arXiv

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition

Videos are unique in their ability to capture actions which transcend multiple frames. Accordingly, for many years action recognition was the quintessential task for video understanding. Unfortunately, due to a lack of sufficiently diverse and challenging data, modern vision-language models (VLMs) are no longer evaluated on their action recognition capabilities. To revitalize action recognition in the era of VLMs, we advocate for a returned focus on domain-specific actions. To this end, we introduce VideoNet, a domain-specific action recognition benchmark covering 1,000 distinct actions from 37 domains. We begin with a multiple-choice evaluation setting, where the difference between closed and open models is stark: Gemini 3.1 Pro attains 69.9% accuracy while Qwen3-VL-8B gets a mere 45.0%. To understand why VLMs struggle on VideoNet, we relax the questions into a binary setting, where random chance is 50%. Still, Qwen achieves only 59.2% accuracy. Further relaxing the evaluation setup, we provide $k\in\{1,2,3\}$ in-context examples of the action. Some models excel in the few-shot setting, while others falter: Qwen improves by $7.0\%$, while Gemini declines by $4.8\%$. Notably, these gains fall short of the $+13.6\%$ improvement in non-expert humans when given few-shot examples. Finding that VLMs struggle to fully exploit in-context examples, we shift from test-time improvements to the training side. We collect the first large-scale training dataset for domain-specific actions, totaling nearly 500k video question-answer pairs. Fine-tuning a Molmo2-4B model on our data, we surpass all open-weight 8B models on the VideoNet benchmark.

preprint | 2022 | arXiv

The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

Humans have a remarkable capacity to reason abductively and hypothesize about what lies beyond the literal content of an image. By identifying concrete visual clues scattered throughout a scene, we almost can't help but draw probable inferences beyond the literal scene based on our everyday experience and knowledge about the world. For example, if we see a "20 mph" sign alongside a road, we might assume the street sits in a residential area (rather than on a highway), even if no houses are pictured. Can machines perform similar visual reasoning? We present Sherlock, an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents. We adopt a free-viewing paradigm: participants first observe and identify salient clues within images (e.g., objects, actions) and then provide a plausible inference about the scene, given the clue. In total, we collect 363K (clue, inference) pairs, which form a first-of-its-kind abductive visual reasoning dataset. Using our corpus, we test three complementary axes of abductive reasoning. We evaluate the capacity of models to: i) retrieve relevant inferences from a large candidate corpus; ii) localize evidence for inferences via bounding boxes; and iii) compare plausible inferences to match human judgments on a newly collected diagnostic corpus of 19K Likert-scale judgments. While we find that fine-tuning CLIP-RN50x64 with a multitask objective outperforms strong baselines, significant headroom exists between model performance and human agreement. Data, models, and leaderboard are available at http://visualabduction.com/

preprint | 2021 | arXiv

Exact coherent structures and phase space geometry of pre-turbulent 2D active nematic channel flow

Confined active nematics exhibit rich dynamical behavior, including spontaneous flows, periodic defect dynamics, and chaotic 'active turbulence'. Here, we study these phenomena using the framework of Exact Coherent Structures, which has been successful in characterizing the routes to high Reynolds number turbulence of passive fluids. Exact Coherent Structures are stationary, periodic, quasiperiodic, or traveling wave solutions of the hydrodynamic equations that, together with their invariant manifolds, serve as an organizing template of the dynamics. We compute the dominant Exact Coherent Structures and connecting orbits in a pre-turbulent active nematic channel flow, which enables a fully nonlinear but highly reduced-order description in terms of a directed graph. Using this reduced representation, we compute instantaneous perturbations that switch the system between disparate spatiotemporal states occupying distant regions of the infinite-dimensional phase space. Our results lay the groundwork for a systematic means of understanding and controlling active nematic flows in the moderate to high activity regime.

preprint | 2020 | arXiv

Dynamics of laminar and transitional flows over slip surfaces: effects on the laminar-turbulent separatrix

The effect of slip surfaces on the laminar-turbulent separatrix of plane Poiseuille flow is studied by direct numerical simulation. Turbulence lifetimes, which measure the likelihood that turbulence is sustained, are investigated for transitional flows with various slip lengths. Slip surfaces decrease the likelihood of sustained turbulence compared to the no-slip case, and the likelihood decreases further as the slip length increases. A deterministic analysis of the effects of slip surfaces on transition to turbulence is performed using nonlinear traveling wave solutions to the Navier-Stokes equations, also known as exact coherent solutions. Two solution families, dubbed P3 and P4, are used since their lower-branch solutions are embedded on the boundary of the basin of attraction of laminar and turbulent flows (Park & Graham 2015). Additionally, they exhibit distinct flow structures: P3 and P4 are denoted the core mode and the critical layer mode, respectively. Distinct effects of slip surfaces on the solutions are observed in the skin friction evolution, linear growth rate, and phase-space projection of transitional trajectories. The slip surface modifies transition dynamics little for the core mode, but considerably for the critical layer mode. Most importantly, the slip surface promotes different transition dynamics: early, bypass-like transition for the core mode and delayed, H-/K-type-like transition for the critical layer mode. Based on spatiotemporal and quadrant analyses, it is found that slip surfaces promote the prevalence of strong wall-toward motions (sweep-like events) near vortex cores close to the channel centre, inducing an early transition, while sustained ejection events are present in the region of the $\Lambda$-shaped vortex cores close to the critical layer, resulting in a delayed transition.

preprint | 2020 | arXiv

HMPO: Human Motion Prediction in Occluded Environments for Safe Motion Planning

We present a novel approach to generate collision-free trajectories for a robot operating in close proximity with a human obstacle in an occluded environment. The self-occlusions of the robot can significantly reduce the accuracy of human motion prediction, and we present a novel deep learning-based prediction algorithm. Our formulation uses CNNs and LSTMs, and we augment human-action datasets with synthetically generated occlusion information for training. We also present an occlusion-aware planner that uses our motion prediction algorithm to compute collision-free trajectories. We highlight the performance of the overall approach (HMPO) in complex scenarios and observe up to 68% improvement in motion prediction accuracy, and 38% improvement in terms of error distance between the ground-truth and the predicted human joint positions.

preprint | 2020 | arXiv

Identity-Aware Multi-Sentence Video Description

Standard video and movie description tasks abstract away from person identities, thus failing to link identities across sentences. We propose a multi-sentence Identity-Aware Video Description task, which overcomes this limitation and requires re-identifying persons locally within a set of consecutive clips. We introduce an auxiliary task, Fill-in the Identity, which aims to predict persons' IDs consistently within a set of clips when the video descriptions are given. Our proposed approach to this task leverages a Transformer architecture allowing for coherent joint prediction of multiple IDs. One of the key components is a gender-aware textual representation as well as an additional gender prediction objective in the main model. This auxiliary task allows us to propose a two-stage approach to Identity-Aware Video Description. We first generate multi-sentence video descriptions, and then apply our Fill-in the Identity model to establish links between the predicted person entities. To be able to tackle both tasks, we augment the Large Scale Movie Description Challenge (LSMDC) benchmark with new annotations suited for our problem statement. Experiments show that our proposed Fill-in the Identity model is superior to several baselines and recent works, and allows us to generate descriptions with locally re-identified people.

preprint | 2020 | arXiv

LSTM-based Anomaly Detection for Non-linear Dynamical System

Anomaly detection for non-linear dynamical systems plays an important role in ensuring system stability. However, it is usually complex and has to be solved by large-scale simulation, which requires extensive computing resources. In this paper, we propose a novel anomaly detection scheme for non-linear dynamical systems based on Long Short-Term Memory (LSTM) to capture complex temporal changes of the time sequence and make multi-step predictions. Specifically, we first present the framework of LSTM-based anomaly detection in non-linear dynamical systems, including data preprocessing, multi-step prediction and anomaly detection. According to the prediction requirement, two types of training modes are explored in multi-step prediction, where samples in a wall shear stress dataset are collected by an adaptive sliding window. On the basis of the multi-step prediction result, a Local Average with Adaptive Parameters (LAAP) algorithm is proposed to extract local numerical features of the time sequence and estimate the upcoming anomaly. The experimental results show that our proposed multi-step prediction method achieves a higher prediction accuracy than the traditional method on the wall shear stress dataset, and the LAAP algorithm performs better than the absolute value-based method in the anomaly detection task.

preprint | 2020 | arXiv

VisualCOMET: Reasoning about the Dynamic Context of a Still Image

Even from a single frame of a still image, people can reason about the dynamic story of the image before, after, and beyond the frame. For example, given an image of a man struggling to stay afloat in water, we can reason that the man fell into the water sometime in the past, the intent of that man at the moment is to stay alive, and he will need help in the near future or else he will get washed away. We propose VisualCOMET, a novel framework of visual commonsense reasoning tasks to predict events that might have happened before, events that might happen next, and the intents of the people at present. To support research toward visual commonsense reasoning, we introduce the first large-scale repository of Visual Commonsense Graphs that consists of over 1.4 million textual descriptions of visual commonsense inferences carefully annotated over a diverse set of 60,000 images, each paired with short video summaries of before and after. In addition, we provide person-grounding (i.e., co-reference links) between people appearing in the image and people mentioned in the textual commonsense descriptions, allowing for tighter integration between images and text. We establish strong baseline performances on this task and demonstrate that integration between visual and textual commonsense reasoning is the key and wins over non-integrative alternatives.