Researcher profile

Wu Liu

Wu Liu contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
20works
0followers
8topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

20 published item(s)

preprint2026arXiv

ELMM: Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion

Multimodal Knowledge Graphs (MKGs) extend traditional knowledge graphs by incorporating visual and textual modalities, enabling richer and more expressive entity representations. However, existing MKGs often suffer from incompleteness, which hinder their effectiveness in downstream tasks. Therefore, multimodal knowledge graph completion (MKGC) task is receiving increasing attention. While large language models (LLMs) have shown promise for knowledge graph completion (KGC), their application to the multimodal setting remains underexplored. Moreover, applying Multimodal Large Language Models (MLLMs) to the task of MKGC introduces significant challenges: (1) the large number of image tokens per entity leads to semantic noise and modality conflicts, and (2) the high computational cost of processing large token inputs. To address these issues, we propose Efficient Lightweight Multimodal Large Language Models (ELMM) for MKGC. ELMM proposes a Multi-view Visual Token Compressor (MVTC) based on multi-head attention mechanism, which adaptively compresses image tokens from both textual and visual views, thereby effectively reducing redundancy while retaining necessary information and avoiding modality conflicts. Additionally, we design an attention pruning strategy to remove redundant attention layers from MLLMs, thereby significantly reducing the inference cost. We further introduce a linear projection to compensate for the performance degradation caused by pruning. Extensive experiments on four benchmark datasets demonstrate that ELMM achieves state-of-the-art performance.

preprint2026arXiv

GASim: A Graph-Accelerated Hybrid Framework for Social Simulation

Large-scale social simulators are essential for studying complex social patterns. Prior work explores hybrid methods to scale up simulations, combining large language models (LLM)-based agents with numerical agent-based models (ABM). However, this incurs high latency due to expensive memory retrieval and sequential ABM execution. To address this challenge, we propose GASim, a graph-accelerated hybrid multi-agent framework for large-scale social simulations. For core agents driven by LLM, GASim introduces Graph-Optimized Memory (GOM) to replace intensive LLM-based retrieval pipelines with lightweight propagation over a sparse memory graph. For the majority of ordinary agents, GASim employs Graph Message Passing (GMP), substituting sequential ABM execution with parallel updates by fine-grained feature aggregation and Graph Attention Network. We further introduce Entropy-Driven Grouping (EDG) that coordinates this hybrid partitioning, leveraging information entropy to dynamically identify emergent core agents situated in information-diverse neighborhoods. Extensive experiments show that GASim not only delivers a substantial 9.94-fold end-to-end speedup over the traditional hybrid framework but also consumes less than 20% of baseline tokens, significantly reducing costs while preserving strong alignment with real-world public opinion trends. Our code is available at https://github.com/Jasmine0201/GASim.

preprint2026arXiv

GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

Recent advances in vision-language models (VLMs) and reinforcement learning (RL) have driven progress in GUI automation. However, most existing methods rely on static, one-shot visual inputs and passive perception, lacking the ability to adaptively determine when, whether, and how to observe the interface. We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. To acquire more informative observations, the agent learns to make strategic decisions on both whether and how to invoke visual tools, such as cropping or zooming, within a two-stage reasoning process. To support this behavior, we introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding, coordinated by a two-level policy. In addition, we design a spatially continuous reward function tailored to tool usage, which integrates both location proximity and region overlap to provide dense supervision and alleviate the reward sparsity common in GUI environments. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples, significantly outperforming both supervised and RL-based baselines. These results highlight that tool-aware active perception, enabled by staged policy reasoning and fine-grained reward feedback, is critical for building robust and data-efficient GUI agents.

preprint2026arXiv

RumorSphere: A Framework for Million-scale Agent-based Dynamic Simulation of Rumor Propagation

Rumor propagation modeling is critical for understanding and mitigating misinformation. Existing approaches combining rule-based regular agents with LLM-driven core agents provide a promising paradigm for large-scale rumor simulation. However, overlooking the dynamic nature of core agents and the importance of network topology on rumor spread significantly undermines the simulation performance. To address these issues, we present RumorSphere, a dynamic and hierarchical resonance framework for effective rumor simulation at the million-agent scale. Considering the dynamic role of core agents in rumor evolution, we propose a multi-agent dynamic interaction strategy based on the information cocoon theory, which adaptively identifies and activates critical core agents at conflict boundaries using LLMs, effectively supporting simulations with millions of agents. In addition, we design a hierarchical resonance network that integrates opinion leaders and localized community structures, enabling more realistic modeling of explosive rumor spread in real-world scenarios. Experiments on real-world datasets show that RumorSphere outperforms state-of-the-art methods, reducing simulation bias by an average of 26.5%.

preprint2022arXiv

A Baseline Framework for Part-level Action Parsing and Action Recognition

This technical report introduces our 2nd place solution to Kinetics-TPS Track on Part-level Action Parsing in ICCV DeeperAction Workshop 2021. Our entry is mainly based on YOLOF for instance and part detection, HRNet for human pose estimation, and CSN for video-level action recognition and frame-level part state parsing. We describe technical details for the Kinetics-TPS dataset, together with some experimental results. In the competition, we achieved 61.37% mAP on the test set of Kinetics-TPS.

preprint2022arXiv

Delving into the Frequency: Temporally Consistent Human Motion Transfer in the Fourier Space

Human motion transfer refers to synthesizing photo-realistic and temporally coherent videos that enable one person to imitate the motion of others. However, current synthetic videos suffer from the temporal inconsistency in sequential frames that significantly degrades the video quality, yet is far from solved by existing methods in the pixel domain. Recently, some works on DeepFake detection try to distinguish the natural and synthetic images in the frequency domain because of the frequency insufficiency of image synthesizing methods. Nonetheless, there is no work to study the temporal inconsistency of synthetic videos from the aspects of the frequency-domain gap between natural and synthetic videos. In this paper, we propose to delve into the frequency space for temporally consistent human motion transfer. First of all, we make the first comprehensive analysis of natural and synthetic videos in the frequency domain to reveal the frequency gap in both the spatial dimension of individual frames and the temporal dimension of the video. To close the frequency gap between the natural and synthetic videos, we propose a novel Frequency-based human MOtion TRansfer framework, named FreMOTR, which can effectively mitigate the spatial artifacts and the temporal inconsistency of the synthesized videos. FreMOTR explores two novel frequency-based regularization modules: 1) the Frequency-domain Appearance Regularization (FAR) to improve the appearance of the person in individual frames and 2) Temporal Frequency Regularization (TFR) to guarantee the temporal consistency between adjacent frames. Finally, comprehensive experiments demonstrate that the FreMOTR not only yields superior performance in temporal consistency metrics but also improves the frame-level visual quality of synthetic videos. In particular, the temporal consistency metrics are improved by nearly 30% than the state-of-the-art model.

preprint2022arXiv

Development of a compact high-resolution absolute gravity gradiometer based on atom interferometers

We present a compact high-resolution gravity gradiometer based on dual Rb-85 atom interferometers using stimulated Raman transitions. A baseline L=44.5 cm and an interrogation time T=130 ms are realized in a sensor head with volume of less than 100 liters. Experimental parameters are optimized to improve the short-term sensitivity while a rejection algorithm relying on inversion of the Raman wave vector is implemented to improve the long-term stability. After an averaging time of 17000 s, a phase resolution of 104 μrad is achieved, which corresponds to a gravity gradient resolution of 0.86 E. As far as we know, this is the sub-E atom gravity gradiometer with the highest level of compactness to date. After the evaluation and correction of system errors induced by light shift, residual Zeeman shift, Coriolis effect and self-attraction effect, the instrument serves as an absolute gravity gradiometer and with it the local gravity gradient is measured to be 3114 (53) E.

preprint2022arXiv

Gait Recognition in the Wild with Dense 3D Representations and A Benchmark

Existing studies for gait recognition are dominated by 2D representations like the silhouette or skeleton of the human body in constrained scenes. However, humans live and walk in the unconstrained 3D space, so projecting the 3D human body onto the 2D plane will discard a lot of crucial information like the viewpoint, shape, and dynamics for gait recognition. Therefore, this paper aims to explore dense 3D representations for gait recognition in the wild, which is a practical yet neglected problem. In particular, we propose a novel framework to explore the 3D Skinned Multi-Person Linear (SMPL) model of the human body for gait recognition, named SMPLGait. Our framework has two elaborately-designed branches of which one extracts appearance features from silhouettes, the other learns knowledge of 3D viewpoints and shapes from the 3D SMPL model. In addition, due to the lack of suitable datasets, we build the first large-scale 3D representation-based gait recognition dataset, named Gait3D. It contains 4,000 subjects and over 25,000 sequences extracted from 39 cameras in an unconstrained indoor scene. More importantly, it provides 3D SMPL models recovered from video frames which can provide dense 3D information of body shape, viewpoint, and dynamics. Based on Gait3D, we comprehensively compare our method with existing gait recognition approaches, which reflects the superior performance of our framework and the potential of 3D representations for gait recognition in the wild. The code and dataset are available at https://gait3d.github.io.

preprint2022arXiv

Gait Recognition in the Wild with Multi-hop Temporal Switch

Existing studies for gait recognition are dominated by in-the-lab scenarios. Since people live in real-world senses, gait recognition in the wild is a more practical problem that has recently attracted the attention of the community of multimedia and computer vision. Current methods that obtain state-of-the-art performance on in-the-lab benchmarks achieve much worse accuracy on the recently proposed in-the-wild datasets because these methods can hardly model the varied temporal dynamics of gait sequences in unconstrained scenes. Therefore, this paper presents a novel multi-hop temporal switch method to achieve effective temporal modeling of gait patterns in real-world scenes. Concretely, we design a novel gait recognition network, named Multi-hop Temporal Switch Network (MTSGait), to learn spatial features and multi-scale temporal features simultaneously. Different from existing methods that use 3D convolutions for temporal modeling, our MTSGait models the temporal dynamics of gait sequences by 2D convolutions. By this means, it achieves high efficiency with fewer model parameters and reduces the difficulty in optimization compared with 3D convolution-based models. Based on the specific design of the 2D convolution kernels, our method can eliminate the misalignment of features among adjacent frames. In addition, a new sampling strategy, i.e., non-cyclic continuous sampling, is proposed to make the model learn more robust temporal features. Finally, the proposed method achieves superior performance on two public gait in-the-wild datasets, i.e., GREW and Gait3D, compared with state-of-the-art methods.

preprint2022arXiv

MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised Point Cloud Action Recognition

Recognizing human actions from point cloud videos has attracted tremendous attention from both academia and industry due to its wide applications like automatic driving, robotics, and so on. However, current methods for point cloud action recognition usually require a huge amount of data with manual annotations and a complex backbone network with high computation costs, which makes it impractical for real-world applications. Therefore, this paper considers the task of semi-supervised point cloud action recognition. We propose a Masked Pseudo-Labeling autoEncoder (\textbf{MAPLE}) framework to learn effective representations with much fewer annotations for point cloud action recognition. In particular, we design a novel and efficient \textbf{De}coupled \textbf{s}patial-\textbf{t}emporal Trans\textbf{Former} (\textbf{DestFormer}) as the backbone of MAPLE. In DestFormer, the spatial and temporal dimensions of the 4D point cloud videos are decoupled to achieve efficient self-attention for learning both long-term and short-term features. Moreover, to learn discriminative features from fewer annotations, we design a masked pseudo-labeling autoencoder structure to guide the DestFormer to reconstruct features of masked frames from the available frames. More importantly, for unlabeled data, we exploit the pseudo-labels from the classification head as the supervision signal for the reconstruction of features from the masked frames. Finally, comprehensive experiments demonstrate that MAPLE achieves superior results on three public benchmarks and outperforms the state-of-the-art method by 8.08\% accuracy on the MSR-Action3D dataset.

preprint2022arXiv

Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework

Action recognition from videos, i.e., classifying a video into one of the pre-defined action types, has been a popular topic in the communities of artificial intelligence, multimedia, and signal processing. However, existing methods usually consider an input video as a whole and learn models, e.g., Convolutional Neural Networks (CNNs), with coarse video-level class labels. These methods can only output an action class for the video, but cannot provide fine-grained and explainable cues to answer why the video shows a specific action. Therefore, researchers start to focus on a new task, Part-level Action Parsing (PAP), which aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video. To this end, we propose a coarse-to-fine framework for this challenging task. In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action. Moreover, to balance the accuracy and computation in part-level action parsing, we propose to recognize the part-level actions by segment-level features. Furthermore, to overcome the ambiguity of body parts, we propose a pose-guided positional embedding method to accurately localize body parts. Through comprehensive experiments on a large-scale dataset, i.e., Kinetics-TPS, our framework achieves state-of-the-art performance and outperforms existing methods over a 31.10% ROC score.

preprint2022arXiv

Patient-specific mean teacher UNet for enhancing PET image and low-dose PET reconstruction on RefleXion X1 biology-guided radiotherapy system

The RefleXion X1 is the first biology-guided radiotherapy (BgRT) system. Its dual 90-degree PET detector collects fewer pair production events compared to a full-ring diagnostic PET system. In the proposed BgRT workflow, a short scan is acquired before treatment delivery to ensure image quality and consistency. The shorter scan time, a quarter of the simulation scan time, also leads to fewer coincidence events and hence reduced image quality. In this study, we proposed a patient-specific mean teacher UNet (MT-UNet) to enhance PET image quality and low-dose PET reconstruction on RefleXion X1. PET/CT scans of nine cancer patients were acquired using RefleXion X1. Every patient had one simulation scan. Five patients had additional scans acquired during the first and the final treatment fractions. Treatment scans were acquired using the same imaging protocol as the simulation scan. For each scan, we reconstructed a full-dose image and evenly split coincidence events into four sessions to reconstruct four quarter-dose PET images. For each patient, our proposed MT-UNet was trained using quarter-dose and full-dose images of the simulation scan. For the image quality enhancement task, we applied nine trained MT-UNets to full-dose simulation PET images of the nine patients to generate enhanced images, respectively. The enhanced images were compared with the original full-dose images using CNR and SNR. For the low-dose image reconstruction task, we applied five trained MT-UNets to ten quarter-dose treatment images of five patients to predict full-dose images, respectively. The predicted and ground truth full-dose images were compared using SSIM and PSNR. We also trained and evaluated patient-specific UNets for model comparison. Our proposed patient-specific MT-UNet achieved better performance in improving the quality of RefleXion low-dose and full-dose images compared to the patient-specific UNet.

preprint2022arXiv

Putting People in their Place: Monocular Regression of 3D People in Depth

Given an image with multiple people, our goal is to directly regress the pose and shape of all the people as well as their relative depth. Inferring the depth of a person in an image, however, is fundamentally ambiguous without knowing their height. This is particularly problematic when the scene contains people of very different sizes, e.g. from infants to adults. To solve this, we need several things. First, we develop a novel method to infer the poses and depth of multiple people in a single image. While previous work that estimates multiple people does so by reasoning in the image plane, our method, called BEV, adds an additional imaginary Bird's-Eye-View representation to explicitly reason about depth. BEV reasons simultaneously about body centers in the image and in depth and, by combing these, estimates 3D body position. Unlike prior work, BEV is a single-shot method that is end-to-end differentiable. Second, height varies with age, making it impossible to resolve depth without also estimating the age of people in the image. To do so, we exploit a 3D body model space that lets BEV infer shapes from infants to adults. Third, to train BEV, we need a new dataset. Specifically, we create a "Relative Human" (RH) dataset that includes age labels and relative depth relationships between the people in the images. Extensive experiments on RH and AGORA demonstrate the effectiveness of the model and training scheme. BEV outperforms existing methods on depth reasoning, child shape estimation, and robustness to occlusion. The code and dataset are released for research purposes.

preprint2022arXiv

REMOT: A Region-to-Whole Framework for Realistic Human Motion Transfer

Human Video Motion Transfer (HVMT) aims to, given an image of a source person, generate his/her video that imitates the motion of the driving person. Existing methods for HVMT mainly exploit Generative Adversarial Networks (GANs) to perform the warping operation based on the flow estimated from the source person image and each driving video frame. However, these methods always generate obvious artifacts due to the dramatic differences in poses, scales, and shifts between the source person and the driving person. To overcome these challenges, this paper presents a novel REgionto-whole human MOtion Transfer (REMOT) framework based on GANs. To generate realistic motions, the REMOT adopts a progressive generation paradigm: it first generates each body part in the driving pose without flow-based warping, then composites all parts into a complete person of the driving motion. Moreover, to preserve the natural global appearance, we design a Global Alignment Module to align the scale and position of the source person with those of the driving person based on their layouts. Furthermore, we propose a Texture Alignment Module to keep each part of the person aligned according to the similarity of the texture. Finally, through extensive quantitative and qualitative experiments, our REMOT achieves state-of-the-art results on two public benchmarks.

preprint2022arXiv

SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding

In this paper, we investigate how to achieve better visual grounding with modern vision-language transformers, and propose a simple yet powerful Selective Retraining (SiRi) mechanism for this challenging task. Particularly, SiRi conveys a significant principle to the research of visual grounding, i.e., a better initialized vision-language encoder would help the model converge to a better local minimum, advancing the performance accordingly. In specific, we continually update the parameters of the encoder as the training goes on, while periodically re-initialize rest of the parameters to compel the model to be better optimized based on an enhanced encoder. SiRi can significantly outperform previous approaches on three popular benchmarks. Specifically, our method achieves 83.04% Top1 accuracy on RefCOCO+ testA, outperforming the state-of-the-art approaches (training from scratch) by more than 10.21%. Additionally, we reveal that SiRi performs surprisingly superior even with limited training data. We also extend it to transformer-based visual grounding models and other vision-language tasks to verify the validity.

preprint2022arXiv

Structured Two-stream Attention Network for Video Question Answering

To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in vision and language understanding, especially for video QA. Compared with image QA that focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to jointly reason across both spatial and long-range temporal structures of a video as well as text to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important visual instance, reduces the influence of background video and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates different segments of query and video aware context representation and infers the answers. Experiments on the large-scale video QA dataset \textit{TGIF-QA} show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by 13.0%, 13.5%, 11.0% and 0.3 for Action, Trans., TrameQA and Count tasks. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans., TrameQA tasks by 4.1%, 4.7%, and 5.1%.

preprint2021arXiv

TraND: Transferable Neighborhood Discovery for Unsupervised Cross-domain Gait Recognition

Gait, i.e., the movement pattern of human limbs during locomotion, is a promising biometric for the identification of persons. Despite significant improvement in gait recognition with deep learning, existing studies still neglect a more practical but challenging scenario -- unsupervised cross-domain gait recognition which aims to learn a model on a labeled dataset then adapts it to an unlabeled dataset. Due to the domain shift and class gap, directly applying a model trained on one source dataset to other target datasets usually obtains very poor results. Therefore, this paper proposes a Transferable Neighborhood Discovery (TraND) framework to bridge the domain gap for unsupervised cross-domain gait recognition. To learn effective prior knowledge for gait representation, we first adopt a backbone network pre-trained on the labeled source data in a supervised manner. Then we design an end-to-end trainable approach to automatically discover the confident neighborhoods of unlabeled samples in the latent space. During training, the class consistency indicator is adopted to select confident neighborhoods of samples based on their entropy measurements. Moreover, we explore a high-entropy-first neighbor selection strategy, which can effectively transfer prior knowledge to the target domain. Our method achieves state-of-the-art results on two public datasets, i.e., CASIA-B and OU-LP.

preprint2020arXiv

A Real-time Action Representation with Temporal Encoding and Deep Compression

Deep neural networks have achieved remarkable success for video-based action recognition. However, most of existing approaches cannot be deployed in practice due to the high computational cost. To address this challenge, we propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation. T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed. Specifically, we propose a residual 3D Convolutional Neural Network (CNN) to capture complementary information on the appearance of a single frame and the motion between consecutive frames. Based on this CNN, we develop a new temporal encoding method to explore the temporal dynamics of the whole video. Furthermore, we integrate deep compression techniques with T-C3D to further accelerate the deployment of models via reducing the size of the model. By these means, heavy calculations can be avoided when doing the inference, which enables the method to deal with videos beyond real-time speed while keeping promising performance. Our method achieves clear improvements on UCF101 action recognition benchmark against state-of-the-art real-time methods by 5.4% in terms of accuracy and 2 times faster in terms of inference speed with a less than 5MB storage model. We validate our approach by studying its action representation performance on four different benchmarks over three different tasks. Extensive experiments demonstrate comparable recognition performance to the state-of-the-art methods. The source code and the pre-trained models are publicly available at https://github.com/tc3d.

preprint2020arXiv

Black Re-ID: A Head-shoulder Descriptor for the Challenging Problem of Person Re-Identification

Person re-identification (Re-ID) aims at retrieving an input person image from a set of images captured by multiple cameras. Although recent Re-ID methods have made great success, most of them extract features in terms of the attributes of clothing (e.g., color, texture). However, it is common for people to wear black clothes or be captured by surveillance systems in low light illumination, in which cases the attributes of the clothing are severely missing. We call this problem the Black Re-ID problem. To solve this problem, rather than relying on the clothing information, we propose to exploit head-shoulder features to assist person Re-ID. The head-shoulder adaptive attention network (HAA) is proposed to learn the head-shoulder feature and an innovative ensemble method is designed to enhance the generalization of our model. Given the input person image, the ensemble method would focus on the head-shoulder feature by assigning a larger weight if the individual insides the image is in black clothing. Due to the lack of a suitable benchmark dataset for studying the Black Re-ID problem, we also contribute the first Black-reID dataset, which contains 1274 identities in training set. Extensive evaluations on the Black-reID, Market1501 and DukeMTMC-reID datasets show that our model achieves the best result compared with the state-of-the-art Re-ID methods on both Black and conventional Re-ID problems. Furthermore, our method is also proved to be effective in dealing with person Re-ID in similar clothing. Our code and dataset are avaliable on https://github.com/xbq1994/.

preprint2020arXiv

FastReID: A Pytorch Toolbox for General Instance Re-identification

General Instance Re-identification is a very important task in the computer vision, which can be widely used in many practical applications, such as person/vehicle re-identification, face recognition, wildlife protection, commodity tracing, and snapshop, etc.. To meet the increasing application demand for general instance re-identification, we present FastReID as a widely used software system in JD AI Research. In FastReID, highly modular and extensible design makes it easy for the researcher to achieve new research ideas. Friendly manageable system configuration and engineering deployment functions allow practitioners to quickly deploy models into productions. We have implemented some state-of-the-art projects, including person re-id, partial re-id, cross-domain re-id and vehicle re-id, and plan to release these pre-trained models on multiple benchmark datasets. FastReID is by far the most general and high-performance toolbox that supports single and multiple GPU servers, you can reproduce our project results very easily and are very welcome to use it, the code and models are available at https://github.com/JDAI-CV/fast-reid.