Source author record

Sheng Jin

Sheng Jin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Topics: Computer Vision, astro-ph.EP, astro-ph.SR, eess.AS, Human-Computer Interaction, Machine Learning

Catalog footprint

What is connected: 14 works, 6 topics, 4 close collaborators


preprint · 2026 · arXiv

JavisGPT: A Unified Multi-modal LLM for Sounding-Video Comprehension and Generation

This paper presents JavisGPT, the first unified multimodal large language model (MLLM) for joint audio-video (JAV) comprehension and generation. JavisGPT has a concise encoder-LLM-decoder architecture, which has a SyncFusion module for spatio-temporal audio-video fusion and synchrony-aware learnable queries to bridge a pretrained JAV-DiT generator. This design enables temporally coherent video-audio understanding and generation from multimodal instructions. We design an effective three-stage training pipeline consisting of multimodal pretraining, audio-video fine-tuning, and large-scale instruction-tuning, to progressively build multimodal comprehension and generation from existing vision-language models. For instruction tuning, we construct JavisInst-Omni, a high-quality instruction dataset with over 200K GPT-4o-curated audio-video-text dialogues that cover diverse and multi-level comprehension and generation scenarios. On JAV comprehension and generation benchmarks, our experiments show that JavisGPT outperforms existing MLLMs, particularly in complex and temporally synchronized settings.

preprint · 2022 · arXiv

Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer

Vision transformers have achieved great successes in many computer vision tasks. Most methods generate vision tokens by splitting an image into a regular and fixed grid and treating each cell as a token. However, not all regions are equally important in human-centric vision tasks, e.g., the human body needs a fine representation with many tokens, while the image background can be modeled by a few tokens. To address this problem, we propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer), which merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes. The tokens in TCFormer can not only focus on important areas but also adjust the token shapes to fit the semantic concept and adopt a fine resolution for regions containing critical details, which is beneficial to capturing detailed information. Extensive experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets, including whole-body pose estimation on COCO-WholeBody and 3D human mesh reconstruction on 3DPW. Code is available at https://github.com/zengwang430521/TCFormer.git

preprint · 2022 · arXiv

Pose for Everything: Towards Category-Agnostic Pose Estimation

Existing works on 2D pose estimation mainly focus on a certain category, e.g. human, animal, and vehicle. However, many application scenarios require detecting the poses/keypoints of unseen classes of objects. In this paper, we introduce the task of Category-Agnostic Pose Estimation (CAPE), which aims to create a pose estimation model capable of detecting the pose of any class of object given only a few samples with keypoint definition. To achieve this goal, we formulate the pose estimation problem as a keypoint matching problem and design a novel CAPE framework, termed POse Matching Network (POMNet). A transformer-based Keypoint Interaction Module (KIM) is proposed to capture both the interactions among different keypoints and the relationship between the support and query images. We also introduce the Multi-category Pose (MP-100) dataset, a 2D pose dataset of 100 object categories containing over 20K instances that is well-designed for developing CAPE algorithms. Experiments show that our method outperforms other baseline approaches by a large margin. Code and data are available at https://github.com/luminxu/Pose-for-Everything.

preprint · 2022 · arXiv

PoseTrans: A Simple Yet Effective Pose Transformation Augmentation for Human Pose Estimation

Human pose estimation aims to accurately estimate a wide variety of human poses. However, existing datasets often follow a long-tailed distribution in which unusual poses occupy only a small portion, which further leads to a lack of diversity among rare poses. These issues result in the inferior generalization ability of current pose estimators. In this paper, we present a simple yet effective data augmentation method, termed Pose Transformation (PoseTrans), to alleviate the aforementioned problems. Specifically, we propose a Pose Transformation Module (PTM) to create new training samples that have diverse poses and adopt a pose discriminator to ensure the plausibility of the augmented poses. Besides, we propose a Pose Clustering Module (PCM) to measure pose rarity and select the "rarest" poses to help balance the long-tailed distribution. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our method, especially on rare poses. Also, our method is efficient and simple to implement, and can be easily integrated into the training pipeline of existing pose estimation models.

preprint · 2022 · arXiv

Pseudo-Labeled Auto-Curriculum Learning for Semi-Supervised Keypoint Localization

Localizing keypoints of an object is a basic visual problem. However, supervised learning of a keypoint localization network often requires a large amount of data, which is expensive and time-consuming to obtain. To remedy this, there is an ever-growing interest in semi-supervised learning (SSL), which leverages a small set of labeled data along with a large set of unlabeled data. Among these SSL approaches, pseudo-labeling (PL) is one of the most popular. PL approaches apply pseudo-labels to unlabeled data, and then train the model with a combination of the labeled and pseudo-labeled data iteratively. The key to the success of PL is the selection of high-quality pseudo-labeled samples. Previous works mostly select training samples by manually setting a single confidence threshold. We propose to automatically select reliable pseudo-labeled samples with a series of dynamic thresholds, which constitutes a learning curriculum. Extensive experiments on six keypoint localization benchmark datasets demonstrate that the proposed approach significantly outperforms the previous state-of-the-art SSL approaches.
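The selection step described in this abstract can be illustrated with a minimal sketch. It only shows how a different confidence threshold per training round filters pseudo-labeled samples; the threshold values are supplied by hand here, whereas the abstract states that the paper selects them automatically, which this sketch does not attempt. The function names, the `(sample_id, keypoints, confidence)` tuple layout, and the numbers in the usage lines are all hypothetical.

```python
def select_pseudo_labels(predictions, threshold):
    """Keep pseudo-labeled samples whose confidence reaches the threshold.

    predictions: list of (sample_id, keypoints, confidence) tuples
    (a hypothetical layout chosen for this sketch).
    """
    return [(sid, kps) for sid, kps, conf in predictions if conf >= threshold]


def run_curriculum(predictions_per_round, thresholds):
    """Apply one threshold per round; the sequence of thresholds is the curriculum.

    In the paper the thresholds are chosen automatically; here they are
    passed in by the caller purely for illustration.
    """
    selected_per_round = []
    for predictions, threshold in zip(predictions_per_round, thresholds):
        selected_per_round.append(select_pseudo_labels(predictions, threshold))
    return selected_per_round


# Usage with made-up confidences and thresholds
rounds = [
    [("a", [(10, 12)], 0.95), ("b", [(4, 7)], 0.60)],
    [("b", [(4, 8)], 0.70), ("c", [(1, 1)], 0.40)],
]
picked = run_curriculum(rounds, thresholds=[0.9, 0.65])
print(picked)
```

With a single fixed threshold of 0.9, sample "b" would never be used; a lower threshold in the second round admits it once the model's confidence has risen.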

preprint · 2022 · arXiv

ZoomNAS: Searching for Whole-body Human Pose Estimation in the Wild

This paper investigates the task of 2D whole-body human pose estimation, which aims to localize dense landmarks on the entire human body including body, feet, face, and hands. We propose a single-network approach, termed ZoomNet, to take into account the hierarchical structure of the full human body and solve the scale variation of different body parts. We further propose a neural architecture search framework, termed ZoomNAS, to promote both the accuracy and efficiency of whole-body pose estimation. ZoomNAS jointly searches the model architecture and the connections between different sub-modules, and automatically allocates computational complexity for searched sub-modules. To train and evaluate ZoomNAS, we introduce the first large-scale 2D human whole-body dataset, namely COCO-WholeBody V1.0, which annotates 133 keypoints for in-the-wild images. Extensive experiments demonstrate the effectiveness of ZoomNAS and the significance of COCO-WholeBody V1.0.

preprint · 2021 · arXiv

Relative occurrence rates of terrestrial planets orbiting FGK stars

This paper aims to derive a map of relative planet occurrence rates that can provide constraints on the overall distribution of terrestrial planets around FGK stars. Based on the planet candidates in the Kepler DR25 data release, I first generate a continuous density map of planet distribution using a Gaussian kernel model and correct for the geometric effect that the discovery space of a transit event shrinks as planetary orbital distance increases. Then I fit two exponential decay functions describing how detection efficiency falls with increasing planetary orbital distance and with decreasing planetary radius. Finally, the density map of planet distribution is compensated using the fitted exponential decay functions of detection efficiency to obtain a relative occurrence rate distribution of terrestrial planets. The result shows two regions with planet abundance: one corresponds to planets with radii between 0.5 and 1.5 R_Earth within 0.2 AU, the other corresponds to planets with radii between 1.5 and 3 R_Earth beyond 0.5 AU. It also confirms the features that may be caused by atmospheric evaporation: there is a vacancy of planets of sizes between 2.0 and 4.0 R_Earth inside of ~0.5 AU, and a valley with relatively low occurrence rates between 0.2 and 0.5 AU for planets with radii between 1.5 and 3.0 R_Earth.
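The first step of this abstract, a Gaussian-kernel density map with a geometric transit correction, can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: it uses the standard approximation that a circular orbit transits with probability of about R_star / a, smooths in log10 space, and uses kernel bandwidths of 0.1 dex, none of which are specified in the abstract. The fitted detection-efficiency functions from the later steps are not modeled. All function names and input values are hypothetical.

```python
import math

R_SUN_AU = 0.00465047  # one solar radius expressed in AU


def transit_probability(a_au, r_star_rsun=1.0):
    """Approximate geometric transit probability for a circular orbit: R_star / a."""
    return min(1.0, r_star_rsun * R_SUN_AU / a_au)


def gaussian_kernel(dx, dy, hx, hy):
    """Separable 2D Gaussian kernel with bandwidths hx and hy."""
    norm = 2.0 * math.pi * hx * hy
    return math.exp(-0.5 * ((dx / hx) ** 2 + (dy / hy) ** 2)) / norm


def corrected_density(candidates, a_grid, r_grid, h_log_a=0.1, h_log_r=0.1):
    """Kernel density on a (radius, distance) grid, weighted by 1 / transit probability.

    candidates: list of (orbital_distance_au, radius_rearth) pairs.
    Returns density[i][j] for r_grid[i] and a_grid[j].
    Smoothing in log10 space is an assumption of this sketch.
    """
    density = [[0.0 for _ in a_grid] for _ in r_grid]
    for a, r in candidates:
        weight = 1.0 / transit_probability(a)  # undo the geometric bias
        for i, rg in enumerate(r_grid):
            for j, ag in enumerate(a_grid):
                dx = math.log10(ag) - math.log10(a)
                dy = math.log10(rg) - math.log10(r)
                density[i][j] += weight * gaussian_kernel(dx, dy, h_log_a, h_log_r)
    return density


# Usage with two made-up candidates of equal radius
grid = corrected_density([(0.1, 1.0), (1.0, 1.0)], a_grid=[0.1, 1.0], r_grid=[1.0])
print(grid[0][1] / grid[0][0])  # the distant candidate is up-weighted
```

Because the transit probability falls as 1/a, a single detected candidate at 1 AU stands in for about ten times as many planets as one detected at 0.1 AU around the same kind of star, which is what the weighting expresses.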

preprint · 2020 · arXiv

Differentiable Hierarchical Graph Grouping for Multi-Person Pose Estimation

Multi-person pose estimation is challenging because it localizes body keypoints for multiple persons simultaneously. Previous methods can be divided into two streams, i.e. top-down and bottom-up methods. The top-down methods localize keypoints after human detection, while the bottom-up methods, which are generally more efficient, localize keypoints directly and then cluster/group them for different persons. However, in existing bottom-up methods, the keypoint grouping is usually solved independently from keypoint detection, so they are not end-to-end trainable and perform sub-optimally. In this paper, we investigate a new perspective of human part grouping and reformulate it as a graph clustering task. Specifically, we propose a novel differentiable Hierarchical Graph Grouping (HGG) method to learn the graph grouping in the bottom-up multi-person pose estimation task. Moreover, HGG is easily embedded into main-stream bottom-up methods. It takes human keypoint candidates as graph nodes and clusters keypoints in a multi-layer graph neural network model. The modules of HGG can be trained end-to-end with the keypoint detection network and are able to supervise the grouping process in a hierarchical manner. To improve the discrimination of the clustering, we add a set of edge discriminators and macro-node discriminators. Extensive experiments on both COCO and OCHuman datasets demonstrate that the proposed method improves the performance of bottom-up pose estimation methods.

preprint · 2020 · arXiv

RL-Duet: Online Music Accompaniment Generation Using Deep Reinforcement Learning

This paper presents a deep reinforcement learning algorithm for online accompaniment generation, with potential for real-time interactive human-machine duet improvisation. Different from offline music generation and harmonization, online music accompaniment requires the algorithm to respond to human input and generate the machine counterpart in sequential order. We cast this as a reinforcement learning problem, where the generation agent learns a policy to generate a musical note (action) based on previously generated context (state). The key to this algorithm is a well-functioning reward model. Instead of defining it using music composition rules, we learn this model from monophonic and polyphonic training data. This model considers the compatibility of the machine-generated note with both the machine-generated context and the human-generated context. Experiments show that this algorithm is able to respond to the human part and generate a melodic, harmonic and diverse machine part. Subjective evaluations on preferences show that the proposed algorithm generates music pieces of higher quality than the baseline method.

preprint · 2020 · arXiv

Whole-Body Human Pose Estimation in the Wild

This paper investigates the task of 2D human whole-body pose estimation, which aims to localize dense landmarks on the entire human body including face, hands, body, and feet. As existing datasets do not have whole-body annotations, previous methods have to assemble different deep models trained independently on different datasets of the human face, hand, and body, struggling with dataset biases and large model complexity. To fill this gap, we introduce COCO-WholeBody, which extends the COCO dataset with whole-body annotations. To the best of our knowledge, it is the first benchmark that has manual annotations on the entire human body, including 133 dense landmarks with 68 on the face, 42 on hands and 23 on the body and feet. A single-network model, named ZoomNet, is devised to take into account the hierarchical structure of the full human body to solve the scale variation of different body parts of the same person. ZoomNet is able to significantly outperform existing methods on the proposed COCO-WholeBody dataset. Extensive experiments show that COCO-WholeBody not only can be used to train deep models from scratch for whole-body pose estimation but also can serve as a powerful pre-training dataset for many different tasks such as facial landmark detection and hand keypoint estimation. The dataset is publicly available at https://github.com/jin-s13/COCO-WholeBody.

preprint · 2016 · arXiv

Modeling Dust Emission of HL Tau Disk Based on Planet-Disk Interactions

We use extensive global two-dimensional hydrodynamic disk gas+dust simulations with embedded planets, coupled with three-dimensional radiative transfer calculations, to model the dust ring and gap structures in the HL Tau protoplanetary disk observed with the Atacama Large Millimeter/Submillimeter Array (ALMA). We include the self-gravity of disk gas and dust components and make reasonable choices of disk parameters, assuming an already settled dust distribution and no planet migration. We can obtain quite adequate fits to the observed dust emission using three planets with masses 0.35, 0.17, and 0.26 $M_{Jup}$ at 13.1, 33.0, and 68.6 AU, respectively. Implications for planet formation as well as the limitations of this scenario are discussed.

preprint · 2014 · arXiv

Planetary population synthesis coupled with atmospheric escape: a statistical view of evaporation

We apply hydrodynamic evaporation models to different synthetic planet populations that were obtained from a planet formation code based on a core-accretion paradigm. We investigate the evolution of the planet populations using several evaporation models, which are distinguished by the driving force of the escape flow (X-ray or EUV), the heating efficiency in energy-limited evaporation regimes, or both. Although the mass distribution of the planet populations is barely affected by evaporation, the radius distribution clearly shows a break at approximately 2 $R_{\oplus}$. We find that evaporation can lead to a bimodal distribution of planetary sizes (Owen & Wu 2013) and to an "evaporation valley" running diagonally downwards in the orbital distance - planetary radius plane, separating bare cores from low-mass planets that have kept some primordial H/He. Furthermore, this bimodal distribution is related to the initial characteristics of the planetary populations because low-mass planetary cores can only accrete small primordial H/He envelopes and their envelope masses are proportional to their core masses. We also find that the population-wide effect of evaporation is not sensitive to the heating efficiency of the energy-limited description. However, in two extreme cases, namely without evaporation or with a 100% heating efficiency in an evaporation model, the final size distributions show significant differences; these two scenarios can be ruled out from the size distribution of Kepler candidates.

preprint · 2011 · arXiv

Terrestrial Planet Formation in the Inclined Systems: Application to OGLE-2006-BLG-109L System

In this work, we extensively investigate late-stage terrestrial planet formation in inclined planetary systems, taking the OGLE-2006-BLG-109L system as a prototype. The simulations show that the occurrence of terrestrial planets is quite common in the final assembly stage. Moreover, we find that 40% of the runs end with one planet in the habitable zone (HZ). The numerical results also indicate that the inner region of the planetesimal disk, ranging from ~0.1 to 0.3 AU, plays an important role in building up terrestrial planets. Examining all simulations, we note that the survivors are located either between 0.1 and 1.0 AU or beyond 7 AU, or at the 1:1 mean motion resonance of OGLE-2006-BLG-109Lb at ~2.20 AU. The outcomes suggest a moderate possibility that inclined systems harbor terrestrial planets, even planets in the HZs.

preprint · 2010 · arXiv

Forming Close-in Earth-like Planets via a Collision-Merger Mechanism in Late-stage Planet Formation

The large number of exoplanets found to orbit their host stars in very close orbits has significantly advanced our understanding of the planetary formation process. It is now widely accepted that such short-period planets cannot have formed in situ, but rather must have migrated to their current orbits from a formation location much farther from their host star. In the late stages of planetary formation, once the gas in the proto-planetary disk has dissipated and migration has halted, gas giants orbiting in the inner disk regions will excite planetesimals and planetary embryos, resulting in an increased rate of orbital crossings and large impacts. We present the results of dynamical simulations for planetesimal evolution in this later stage of planet formation. The simulations reveal a mechanism by which the collision-merger of planetary embryos can kick terrestrial planets directly into orbits extremely close to their parent stars.
