Source author record

Jianqin Yin

Jianqin Yin appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Computer Vision, Artificial Intelligence, Robotics, Multimedia

Catalog footprint

What is connected

10 works

4 topics

4 close collaborators


preprint 2026 arXiv

EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Faced with the challenging task settings, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, achieving accurate video parsing fundamentally relies on precise perception of uni-modal events. Yet these multi-modal focused strategies excessively emphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also employ a soft-constrained manner to refine modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus boosting the localization performance for events. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.

preprint 2023 arXiv

A Two-stream Hybrid CNN-Transformer Network for Skeleton-based Human Interaction Recognition

Human Interaction Recognition is the process of identifying interactive actions between multiple participants in a specific situation. The aim is to recognise the action interactions between multiple entities and their meaning. Many single Convolutional Neural Networks have issues, such as the inability to capture global instance interaction features or difficulty in training, leading to ambiguity in action semantics. In addition, the computational complexity of the Transformer cannot be ignored, and its ability to capture local information and motion features in the image is poor. In this work, we propose a Two-stream Hybrid CNN-Transformer Network (THCT-Net), which exploits the local specificity of CNN and models global dependencies through the Transformer. CNN and Transformer simultaneously model the entity, time and space relationships between interactive entities, respectively. Specifically, the Transformer-based stream integrates 3D convolutions with multi-head self-attention to learn inter-token correlations; for the CNN-based stream, we propose a new multi-branch CNN framework that automatically learns joint spatio-temporal features from skeleton sequences. The convolutional layer independently learns the local features of each joint neighborhood and aggregates the features of all joints. The raw skeleton coordinates and their temporal differences are integrated with a dual-branch paradigm to fuse the motion features of the skeleton. In addition, a residual structure is added to speed up training convergence. Finally, the recognition results of the two branches are fused using parallel splicing. Experimental results on diverse and challenging datasets demonstrate that the proposed method can better comprehend and infer the meaning and context of various actions, outperforming state-of-the-art methods.

preprint 2022 arXiv

An Attractor-Guided Neural Networks for Skeleton-Based Human Motion Prediction

Joint relation modeling is a crucial component in human motion prediction. Most existing methods tend to design skeletal-based graphs to build the relations among joints, where local interactions between joint pairs are well learned. However, the global coordination of all joints, which reflects human motion's balance property, is usually weakened because it is learned from part to whole progressively and asynchronously. Thus, the final predicted motions are sometimes unnatural. To tackle this issue, we learn a medium, called balance attractor (BA), from the spatiotemporal features of motion to characterize the global motion features, which is subsequently used to build new joint relations. Through the BA, all joints are related synchronously, and thus the global coordination of all joints can be better learned. Based on the BA, we propose our framework, referred to as the Attractor-Guided Neural Network, which mainly includes an Attractor-Based Joint Relation Extractor (AJRE) and a Multi-timescale Dynamics Extractor (MTDE). The AJRE mainly includes a Global Coordination Extractor (GCE) and a Local Interaction Extractor (LIE). The former presents the global coordination of all joints, and the latter encodes local interactions between joint pairs. The MTDE is designed to extract dynamic information from raw position information for effective prediction. Extensive experiments show that the proposed framework outperforms state-of-the-art methods in both short-term and long-term predictions on H3.6M, CMU-Mocap, and 3DPW.

preprint 2022 arXiv

An end-to-end multi-scale network for action prediction in videos

In this paper, we develop an efficient multi-scale network to predict action classes in partial videos in an end-to-end manner. Unlike most existing methods with offline feature generation, our method directly takes frames as input and further models motion evolution on two different temporal scales. Therefore, we solve the complexity problems of two-stage modeling and the problem of insufficient temporal and spatial information at a single scale. Our proposed End-to-End MultiScale Network (E2EMSNet) is composed of two scales, named the segment scale and the observed global scale. The segment scale leverages temporal difference over consecutive frames for finer motion patterns by supplying 2D convolutions. For the observed global scale, a Long Short-Term Memory (LSTM) is incorporated to capture motion features of observed frames. Our model provides a simple and efficient modeling framework with a small computational cost. Our E2EMSNet is evaluated on three challenging datasets: BIT, HMDB51, and UCF101. The extensive experiments demonstrate the effectiveness of our method for action prediction in videos.
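The temporal difference over consecutive frames mentioned for the segment scale can be illustrated with a minimal NumPy sketch. This is not E2EMSNet itself: the function name `temporal_difference` and the `(T, H, W, C)` frame layout are assumptions made for illustration only.

```python
import numpy as np

def temporal_difference(frames: np.ndarray) -> np.ndarray:
    """Return frame-to-frame differences for a clip.

    frames: array of shape (T, H, W, C) (assumed layout).
    Result: array of shape (T - 1, H, W, C), where entry t is
    frames[t + 1] - frames[t].
    """
    # Cast to float first so uint8 pixel values do not wrap around
    # when a later frame is darker than the earlier one.
    frames = frames.astype(np.float32)
    return frames[1:] - frames[:-1]
```

A static clip yields all-zero differences, so only moving regions carry signal into whatever layers consume the result.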

preprint 2022 arXiv

Past and Future Motion Guided Network for Audio Visual Event Localization

In recent years, audio-visual event localization has attracted much attention. Its purpose is to detect the segment containing audio-visual events and recognize the event category from untrimmed videos. Existing methods use audio-guided visual attention to lead the model to pay attention to the spatial area of the ongoing event, focusing on the correlation between audio and visual information but ignoring the correlation between audio and spatial motion. We propose a past and future motion extraction (pf-ME) module to mine the visual motion from videos, embedded into the past and future motion guided network (PFAGN), and a motion guided audio attention (MGAA) module that focuses on the information related to events of interest in the audio modality through the past and future visual motion. We choose AVE as the experimental verification dataset, and the experiments show that our method outperforms the state of the art in both supervised and weakly-supervised settings.

preprint 2022 arXiv

Real-World Semantic Grasp Detection Based on Attention Mechanism

Recognizing the category of the object and using the features of the object itself to predict grasp configuration is of great significance for improving the accuracy of the grasp detection model and expanding its application. Researchers have been trying to combine these capabilities in an end-to-end network to grasp specific objects in a cluttered scene efficiently. In this paper, we propose an end-to-end semantic grasp detection model, which can accomplish both semantic recognition and grasp detection. We also design a target feature attention mechanism to guide the model to focus on the features of the target object itself for grasp prediction according to the semantic information. This method effectively reduces the background features that are weakly correlated to the target object, thus making the features more unique and guaranteeing the accuracy and efficiency of grasp detection. Experimental results show that the proposed method can achieve 98.38% accuracy on the Cornell Grasp Dataset. Furthermore, our results on complex multi-object scenarios or more rigorous evaluation metrics show the domain adaptability of our method over the state-of-the-art.

preprint 2021 arXiv

Mask-GD Segmentation Based Robotic Grasp Detection

Reliable grasp detection for target objects in complex scenes is a challenging task and a critical problem that urgently needs to be solved in practical applications. At present, the grasp detection location comes from searching the feature space of the whole image. However, the cluttered background information in the image impairs the accuracy of grasp detection. In this paper, a robotic grasp detection algorithm named MASK-GD is proposed, which provides a feasible solution to this problem. MASK is a segmented image that only contains the pixels of the target object. MASK-GD for grasp detection only uses MASK features rather than the features of the entire image in the scene. It has two stages: the first stage provides the MASK of the target object as the input image, and the second stage is a grasp detector based on the MASK feature. Experimental results demonstrate that MASK-GD's performance is comparable with state-of-the-art grasp detection algorithms on the Cornell and Jacquard datasets. Meanwhile, MASK-GD performs much better in complex scenes.
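The notion of a MASK image that contains only the target object's pixels can be sketched as follows. This is an illustrative assumption about the data layout, not the MASK-GD pipeline: the function name `apply_mask`, the `(H, W, C)` image shape, and the boolean `(H, W)` mask are hypothetical, and how the mask is obtained (the segmentation stage) is left out.

```python
import numpy as np

def apply_mask(image: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep only the pixels selected by `mask`; set the rest to zero.

    image: (H, W, C) array (assumed layout).
    mask:  (H, W) boolean array, True where the target object is.
    """
    masked = np.zeros_like(image)
    # Boolean indexing copies every channel of the selected pixels.
    masked[mask] = image[mask]
    return masked
```

A downstream grasp detector would then receive `masked` in place of the full scene image, so background clutter contributes no features.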

preprint 2021 arXiv

Neighborhood Spatial Aggregation MC Dropout for Efficient Uncertainty-aware Semantic Segmentation in Point Clouds

Uncertainty-aware semantic segmentation of point clouds includes predictive uncertainty estimation and uncertainty-guided model optimization. One key challenge in the task is the efficiency of establishing point-wise predictive distributions. The widely-used MC dropout establishes the distribution by computing the standard deviation of samples using multiple stochastic forward propagations, which is time-consuming for tasks based on point clouds containing massive numbers of points. Hence, a framework embedded with NSA-MC dropout, a variant of MC dropout, is proposed to establish distributions in just one forward pass. Specifically, the NSA-MC dropout samples the model many times in a space-dependent way, outputting a point-wise distribution by aggregating stochastic inference results of neighbors. Based on this, aleatoric and predictive uncertainties are acquired from the predictive distribution. The aleatoric uncertainty is integrated into the loss function to penalize noisy points, avoiding over-fitting of the model to some degree. In addition, the predictive uncertainty quantifies the confidence of predictions. Experimental results show that our framework obtains better segmentation results on real-world point clouds and efficiently quantifies the credibility of results. Our NSA-MC dropout is several times faster than MC dropout, and the inference time is not coupled to the number of samples. The code will be available if the paper is accepted.
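For context, the baseline that the abstract calls time-consuming, plain MC dropout with multiple stochastic forward passes, can be sketched in a few lines of NumPy. This is not NSA-MC dropout: the two-layer network, the weight arguments `w1` and `w2`, and the function name `mc_dropout_predict` are hypothetical stand-ins chosen only to show where the repeated passes come from.

```python
import numpy as np

def mc_dropout_predict(x, w1, w2, passes=50, p_drop=0.5, seed=0):
    """Run `passes` stochastic forward passes with dropout kept active.

    x:  (n_points, n_features) inputs.
    w1: (n_features, n_hidden) and w2: (n_hidden, n_out) toy weights.
    Returns the per-point mean and standard deviation of the outputs.
    """
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(passes):
        hidden = np.maximum(x @ w1, 0.0)            # ReLU hidden layer
        keep = rng.random(hidden.shape) >= p_drop   # keep with prob 1 - p_drop
        hidden = hidden * keep / (1.0 - p_drop)     # inverted-dropout scaling
        outputs.append(hidden @ w2)
    outputs = np.stack(outputs)                     # (passes, n_points, n_out)
    return outputs.mean(axis=0), outputs.std(axis=0)
```

The cost grows linearly with `passes`, which is the overhead the abstract says NSA-MC dropout avoids by aggregating stochastic results from neighboring points within a single pass.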

preprint 2020 arXiv

Energy-based Periodicity Mining with Deep Features for Action Repetition Counting in Unconstrained Videos

Action repetition counting estimates the number of times a repetitive motion occurs in one action, which is a relatively new, important but challenging measurement problem. To solve this problem, we propose a new method superior to traditional approaches in two respects: it requires no preprocessing and it applies to actions of arbitrary periodicity. Requiring no preprocessing makes our method convenient for real applications, and handling actions of arbitrary periodicity makes it better suited to real-world conditions. In terms of methodology, first, we analyze the movement patterns of the repetitive actions based on the spatial and temporal features of actions extracted by deep ConvNets; second, the Principal Component Analysis algorithm is used to generate the intuitive periodic information from the chaotic high-dimensional deep features; third, the periodicity is mined based on the high-energy rule using the Fourier transform; finally, the inverse Fourier transform with a multi-stage threshold filter is proposed to improve the quality of the mined periodicity, and peak detection is introduced to finish the repetition counting. Our contribution is two-fold: 1) An important insight that deep features extracted for action recognition can well model the self-similarity periodicity of the repetitive action is presented. 2) A high-energy based periodicity mining rule using deep features is presented, which can process arbitrary actions without preprocessing. Experimental results show that our method achieves comparable results on the public datasets YT Segments and QUVA.
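The Fourier-based steps in this abstract (keep the high-energy frequencies, invert the transform, count peaks) can be sketched on a 1D signal. This is a simplified illustration under stated assumptions, not the paper's method: the input is assumed to already be a 1D trace such as a first principal component of per-frame features, the multi-stage threshold filter is replaced by keeping the `keep` highest-energy bins, and the function name `count_repetitions` is hypothetical.

```python
import numpy as np

def count_repetitions(signal: np.ndarray, keep: int = 1) -> int:
    """Count cycles in a 1D signal using its highest-energy frequencies."""
    x = signal - signal.mean()                 # remove the DC component
    spectrum = np.fft.rfft(x)
    energy = np.abs(spectrum) ** 2
    # Keep only the `keep` highest-energy frequency bins (simplified rule).
    top = np.argsort(energy)[-keep:]
    filtered = np.zeros_like(spectrum)
    filtered[top] = spectrum[top]
    smooth = np.fft.irfft(filtered, n=len(x))  # back to the time domain
    # A peak is an interior sample above its left neighbor and not
    # below its right neighbor.
    mid = smooth[1:-1]
    peaks = (mid > smooth[:-2]) & (mid >= smooth[2:])
    return int(peaks.sum())

# Usage: a noisy sinusoid with 5 cycles across the clip.
t = np.linspace(0.0, 1.0, 200, endpoint=False)
noise = 0.1 * np.random.default_rng(0).standard_normal(t.size)
print(count_repetitions(np.sin(2 * np.pi * 5 * t) + noise))
```

Filtering in the frequency domain before peak detection matters because counting peaks on the raw noisy trace would register many spurious local maxima.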

preprint 2020 arXiv

TrajectoryNet: a new spatio-temporal feature learning network for human motion prediction

Human motion prediction is an increasingly interesting topic in computer vision and robotics. In this paper, we propose a new 2D CNN based network, TrajectoryNet, to predict future poses in the trajectory space. Compared with most existing methods, our model focuses on modeling the motion dynamics with coupled spatio-temporal features, local-global spatial features and global temporal co-occurrence features of the previous pose sequence. Specifically, the coupled spatio-temporal features describe the spatial and temporal structure information hidden in the natural human motion sequence, which can be mined by covering the space and time dimensions of the input pose sequence with the convolutional filters. The local-global spatial features that encode different correlations of different joints of the human body (e.g. strong correlations between joints of one limb, weak correlations between joints of different limbs) are captured hierarchically by enlarging the receptive field layer by layer and by residual connections from the lower layers to the deeper layers in our proposed convolutional network. The global temporal co-occurrence features represent the co-occurrence relationship in which different subsequences of a complex motion sequence appear simultaneously, which can be obtained automatically with our proposed TrajectoryNet by reorganizing the temporal information as the depth dimension of the input tensor. Finally, future poses are approximated based on the captured motion dynamics features. Extensive experiments show that our method achieves state-of-the-art performance on three challenging benchmarks (Human3.6M, G3D, and FNTU), which demonstrates the effectiveness of our proposed method. The code will be available if the paper is accepted.
