Source author record

Jiabao Wang

Jiabao Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Artificial Intelligence eess.SP Machine Learning Robotics Sound

Catalog footprint

What is connected

8works

6topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

123D: Unifying Multi-Modal Autonomous Driving Data at Scale

The pursuit of autonomous driving has produced one of the richest sensor data collections in all of robotics. However, its scale and diversity remain largely untapped. Each dataset adopts different 2D and 3D modalities, such as cameras, lidar, ego states, annotations, traffic lights, and HD maps, with different rates and synchronization schemes. They come in fragmented formats requiring complex dependencies that cannot natively coexist in the same development environment. Further, major inconsistencies in annotation conventions prevent training or measuring generalization across multiple datasets. We present 123D, an open-source framework that unifies such multi-modal driving data through a single API. To handle synchronization, we store each modality as an independent timestamped event stream with no prescribed rate, enabling synchronous or asynchronous access across arbitrary datasets. Using 123D, we consolidate eight real-world driving datasets spanning 3,300 hours and 90,000 kilometers, together with a synthetic dataset with configurable collection scripts, and provide tools for data analysis and visualization. We conduct a systematic study comparing annotation statistics and assessing each dataset's pose and calibration accuracy. Further, we showcase two applications 123D enables: cross-dataset 3D object detection transfer and reinforcement learning for planning, and offer recommendations for future directions. Code and documentation are available at https://github.com/kesai-labs/py123d.

preprint2026arXiv

Mutual Forcing: Dual-Mode Self-Evolution for Fast Autoregressive Audio-Video Character Generation

In this work, we propose Mutual Forcing, a framework for fast autoregressive audio-video generation with long-horizon audio-video synchronization. Our approach addresses two key challenges: joint audio-video modeling and fast autoregressive generation. To ease joint audio-video optimization, we adopt a two-stage training strategy: we first train uni-modal generators and then couple them into a unified audio-video model for joint training on paired data. For streaming generation, we ask whether a native fast causal audio-video model can be trained directly, instead of following existing streaming distillation pipelines that typically train a bidirectional model first and then convert it into a causal generator through multiple distillation stages. Our answer is Mutual Forcing, which builds directly on native autoregressive model and integrates few-step and multi-step generation within a single weight-shared model, enabling self-distillation and improved training-inference consistency. The multi-step mode improves the few-step mode via self-distillation, while the few-step mode generates historical context during training to improve training-inference consistency; because the two modes share parameters, these two effects reinforce each other within a single model. Compared with prior approaches such as Self-Forcing, Mutual Forcing removes the need for an additional bidirectional teacher model, supports more flexible training sequence lengths, reduces training overhead, and allows the model to improve directly from real paired data rather than a fixed teacher. Experiments show that Mutual Forcing matches or surpasses strong baselines that require around 50 sampling steps while using only 4 to 8 steps, demonstrating substantial advantages in both efficiency and quality. The project page is available at https://mutualforcing.github.io.

preprint2022arXiv

Anchor-free Oriented Proposal Generator for Object Detection

Oriented object detection is a practical and challenging task in remote sensing image interpretation. Nowadays, oriented detectors mostly use horizontal boxes as intermedium to derive oriented boxes from them. However, the horizontal boxes are inclined to get small Intersection-over-Unions (IoUs) with ground truths, which may have some undesirable effects, such as introducing redundant noise, mismatching with ground truths, detracting from the robustness of detectors, etc. In this paper, we propose a novel Anchor-free Oriented Proposal Generator (AOPG) that abandons horizontal box-related operations from the network architecture. AOPG first produces coarse oriented boxes by a Coarse Location Module (CLM) in an anchor-free manner and then refines them into high-quality oriented proposals. After AOPG, we apply a Fast R-CNN head to produce the final detection results. Furthermore, the shortage of large-scale datasets is also a hindrance to the development of oriented object detection. To alleviate the data insufficiency, we release a new dataset on the basis of our DIOR dataset and name it DIOR-R. Massive experiments demonstrate the effectiveness of AOPG. Particularly, without bells and whistles, we achieve the accuracy of 64.41%, 75.24% and 96.22% mAP on the DIOR-R, DOTA and HRSC2016 datasets respectively. Code and models are available at https://github.com/jbwang1997/AOPG.

preprint2022arXiv

Bridge the Gap between Supervised and Unsupervised Learning for Fine-Grained Classification

Unsupervised learning technology has caught up with or even surpassed supervised learning technology in general object classification (GOC) and person re-identification (re-ID). However, it is found that the unsupervised learning of fine-grained visual classification (FGVC) is more challenging than GOC and person re-ID. In order to bridge the gap between unsupervised and supervised learning for FGVC, we investigate the essential factors (including feature extraction, clustering, and contrastive learning) for the performance gap between supervised and unsupervised FGVC. Furthermore, we propose a simple, effective, and practical method, termed as UFCL, to alleviate the gap. Three key issues are concerned and improved: First, we introduce a robust and powerful backbone, ResNet50-IBN, which has an ability of domain adaptation when we transfer ImageNet pre-trained models to FGVC tasks. Next, we propose to introduce HDBSCAN instead of DBSCAN to do clustering, which can generate better clusters for adjacent categories with fewer hyper-parameters. Finally, we propose a weighted feature agent and its updating mechanism to do contrastive learning by using the pseudo labels with inevitable noise, which can improve the optimization process of learning the parameters of the network. The effectiveness of our UFCL is verified on CUB-200-2011, Oxford-Flowers, Oxford-Pets, Stanford-Dogs, Stanford-Cars and FGVC-Aircraft datasets. Under the unsupervised FGVC setting, we achieve state-of-the-art results, and analyze the key factors and the important parameters to provide a practical guidance.

preprint2022arXiv

DAGAM: A Domain Adversarial Graph Attention Model for Subject Independent EEG-Based Emotion Recognition

One of the most significant challenges of EEG-based emotion recognition is the cross-subject EEG variations, leading to poor performance and generalizability. This paper proposes a novel EEG-based emotion recognition model called the domain adversarial graph attention model (DAGAM). The basic idea is to generate a graph to model multichannel EEG signals using biological topology. Graph theory can topologically describe and analyze relationships and mutual dependency between channels of EEG. Then, unlike other graph convolutional networks, self-attention pooling is applied to benefit salient EEG feature extraction from the graph, which effectively improves the performance. Finally, after graph pooling, the domain adversarial based on the graph is employed to identify and handle EEG variation across subjects, efficiently reaching good generalizability. We conduct extensive evaluations on two benchmark datasets (SEED and SEED IV) and obtain state-of-the-art results in subject-independent emotion recognition. Our model boosts the SEED accuracy to 92.59% (4.69% improvement) with the lowest standard deviation of 3.21% (2.92% decrements) and SEED IV accuracy to 80.74% (6.90% improvement) with the lowest standard deviation of 4.14% (3.88% decrements) respectively.

preprint2022arXiv

MMRotate: A Rotated Object Detection Benchmark using PyTorch

We present an open-source toolbox, named MMRotate, which provides a coherent algorithm framework of training, inferring, and evaluation for the popular rotated object detection algorithm based on deep learning. MMRotate implements 18 state-of-the-art algorithms and supports the three most frequently used angle definition methods. To facilitate future research and industrial applications of rotated object detection-related problems, we also provide a large number of trained models and detailed benchmarks to give insights into the performance of rotated object detection. MMRotate is publicly released at https://github.com/open-mmlab/mmrotate.

preprint2020arXiv

A heterogeneous branch and multi-level classification network for person re-identification

Convolutional neural networks with multiple branches have recently been proved highly effective in person re-identification (re-ID). Researchers design multi-branch networks using part models, yet they always attribute the effectiveness to multiple parts. In addition, existing multi-branch networks always have isomorphic branches, which lack structural diversity. In order to improve this problem, we propose a novel Heterogeneous Branch and Multi-level Classification Network (HBMCN), which is designed based on the pre-trained ResNet-50 model. A new heterogeneous branch, SE-Res-Branch, is proposed based on the SE-Res module, which consists of the Squeeze-and-Excitation block and the residual block. Furthermore, a new multi-level classification joint objective function is proposed for the supervised learning of HBMCN, whereby multi-level features are extracted from multiple high-level layers and concatenated to represent a person. Based on three public person re-ID benchmarks (Market1501, DukeMTMC-reID and CUHK03), experimental results show that the proposed HBMCN reaches 94.4%, 85.7% and 73.8% in Rank-1, and 85.7%, 74.6% and 69.0% in mAP, achieving a state-of-the-art performance. Further analysis demonstrates that the specially designed heterogeneous branch performs better than an isomorphic branch, and multi-level classification provides more discriminative features compared to single-level classification. As a result, HBMCN provides substantial further improvements in person re-ID tasks.

preprint2020arXiv

Grafted network for person re-identification

Convolutional neural networks have shown outstanding effectiveness in person re-identification (re-ID). However, the models always have large number of parameters and much computation for mobile application. In order to relieve this problem, we propose a novel grafted network (GraftedNet), which is designed by grafting a high-accuracy rootstock and a light-weighted scion. The rootstock is based on the former parts of ResNet-50 to provide a strong baseline, while the scion is a new designed module, composed of the latter parts of SqueezeNet, to compress the parameters. To extract more discriminative feature representation, a joint multi-level and part-based feature is proposed. In addition, to train GraftedNet efficiently, we propose an accompanying learning method, by adding an accompanying branch to train the model in training and removing it in testing for saving parameters and computation. On three public person re-ID benchmarks (Market1501, DukeMTMC-reID and CUHK03), the effectiveness of GraftedNet are evaluated and its components are analyzed. Experimental results show that the proposed GraftedNet achieves 93.02%, 85.3% and 76.2% in Rank-1 and 81.6%, 74.7% and 71.6% in mAP, with only 4.6M parameters.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint