Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
49works
0followers
22topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

49 published item(s)

preprint2026arXiv

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

The landscape of high-performance image generation models is currently shifting from the inefficient multi-step ones to the efficient few-step counterparts (e.g, Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying the commonly used fine-tuning technique would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that the modern diffusion models, where the LLM/VLM serves as the encoder, can inherit its encoder's in-context capabilities. This enables us to formulate the training as an on-policy self-distillation process. Specifically, during training, we make the model act as both the teacher and the student with different contexts, where the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc., without sacrificing the original few-step capacity.

preprint2026arXiv

REBENCH: A Procedural, Fair-by-Construction Benchmark for LLMs on Stripped-Binary Types and Names (Extended Version)

Large Language Models (LLMs) have achieved remarkable progress in recent years, driving their adoption across a wide range of domains, including computer security. In reverse engineering, LLMs are increasingly applied to critical tasks such as function and variable name recovery and type inference. However, despite the rapid growth of research in this area, progress has been hindered by the absence of a standardized dataset. Existing studies rely on disparate datasets, preprocessing pipelines, and evaluation metrics, making fair comparisons between approaches difficult and obscuring a clear understanding of LLM capabilities in binary analysis. To address these challenges, we present REBench, a comprehensive benchmark dataset for evaluating LLMs on binary reverse engineering tasks. REBench consolidates a superset of existing datasets, comprising hundreds of millions of lines of source code and a diverse collection of binaries spanning multiple architectures and optimization levels. REBench adopts a knowledge-base-driven methodology that stores byte-level stack information to generate ground truth, ensuring that task difficulty is preserved while maintaining universal applicability. This design enables fair evaluation across tasks while avoiding simplifications that could bias results. As a use case, we apply REBench to measure the reverse engineering performance of LLMs and the result demonstrates difficulties in complex tasks.

preprint2025arXiv

Hierarchical Context Alignment with Disentangled Geometric and Temporal Modeling for Semantic Occupancy Prediction

Camera-based 3D Semantic Occupancy Prediction (SOP) is crucial for understanding complex 3D scenes from limited 2D image observations. Existing SOP methods typically aggregate contextual features to assist the occupancy representation learning, alleviating issues like occlusion or ambiguity. However, these solutions often face misalignment issues wherein the corresponding features at the same position across different frames may have different semantic meanings during the aggregation process, which leads to unreliable contextual fusion results and an unstable representation learning process. To address this problem, we introduce a new Hierarchical context alignment paradigm for a more accurate SOP (Hi-SOP). Hi-SOP first disentangles the geometric and temporal context for separate alignment, which two branches are then composed to enhance the reliability of SOP. This parsing of the visual input into a local-global alignment hierarchy includes: (I) disentangled geometric and temporal separate alignment, within each leverages depth confidence and camera pose as prior for relevant feature matching respectively; (II) global alignment and composition of the transformed geometric and temporal volumes based on semantics consistency. Our method outperforms SOTAs for semantic scene completion on the SemanticKITTI & NuScenes-Occupancy datasets and LiDAR semantic segmentation on the NuScenes dataset. The project website is available at https://arlo0o.github.io/hisop.github.io/.

preprint2024arXiv

MusicAOG: an Energy-Based Model for Learning and Sampling a Hierarchical Representation of Symbolic Music

In addressing the challenge of interpretability and generalizability of artificial music intelligence, this paper introduces a novel symbolic representation that amalgamates both explicit and implicit musical information across diverse traditions and granularities. Utilizing a hierarchical and-or graph representation, the model employs nodes and edges to encapsulate a broad spectrum of musical elements, including structures, textures, rhythms, and harmonies. This hierarchical approach expands the representability across various scales of music. This representation serves as the foundation for an energy-based model, uniquely tailored to learn musical concepts through a flexible algorithm framework relying on the minimax entropy principle. Utilizing an adapted Metropolis-Hastings sampling technique, the model enables fine-grained control over music generation. A comprehensive empirical evaluation, contrasting this novel approach with existing methodologies, manifests considerable advancements in interpretability and controllability. This study marks a substantial contribution to the fields of music analysis, composition, and computational musicology.

preprint2023arXiv

An Order-Complexity Model for Aesthetic Quality Assessment of Symbolic Homophony Music Scores

Computational aesthetics evaluation has made great achievements in the field of visual arts, but the research work on music still needs to be explored. Although the existing work of music generation is very substantial, the quality of music score generated by AI is relatively poor compared with that created by human composers. The music scores created by AI are usually monotonous and devoid of emotion. Based on Birkhoff's aesthetic measure, this paper proposes an objective quantitative evaluation method for homophony music score aesthetic quality assessment. The main contributions of our work are as follows: first, we put forward a homophony music score aesthetic model to objectively evaluate the quality of music score as a baseline model; second, we put forward eight basic music features and four music aesthetic features.

preprint2022arXiv

Aesthetic Attribute Assessment of Images Numerically on Mixed Multi-attribute Datasets

With the continuous development of social software and multimedia technology, images have become a kind of important carrier for spreading information and socializing. How to evaluate an image comprehensively has become the focus of recent researches. The traditional image aesthetic assessment methods often adopt single numerical overall assessment scores, which has certain subjectivity and can no longer meet the higher aesthetic requirements. In this paper, we construct an new image attribute dataset called aesthetic mixed dataset with attributes(AMD-A) and design external attribute features for fusion. Besides, we propose a efficient method for image aesthetic attribute assessment on mixed multi-attribute dataset and construct a multitasking network architecture by using the EfficientNet-B0 as the backbone network. Our model can achieve aesthetic classification, overall scoring and attribute scoring. In each sub-network, we improve the feature extraction through ECA channel attention module. As for the final overall scoring, we adopt the idea of the teacher-student network and use the classification sub-network to guide the aesthetic overall fine-grain regression. Experimental results, using the MindSpore, show that our proposed method can effectively improve the performance of the aesthetic overall and attribute assessment.

preprint2022arXiv

Aesthetic Attributes Assessment of Images with AMANv2 and DPC-CaptionsV2

Image aesthetic quality assessment is popular during the last decade. Besides numerical assessment, nature language assessment (aesthetic captioning) has been proposed to describe the generally aesthetic impression of an image. In this paper, we propose aesthetic attribute assessment, which is the aesthetic attributes captioning, i.e., to assess the aesthetic attributes such as composition, lighting usage and color arrangement. It is a non-trivial task to label the comments of aesthetic attributes, which limit the scale of the corresponding datasets. We construct a novel dataset, named DPC-CaptionsV2, by a semi-automatic way. The knowledge is transferred from a small-scale dataset with full annotations to large-scale professional comments from a photography website. Images of DPC-CaptionsV2 contain comments up to 4 aesthetic attributes: composition, lighting, color, and subject. Then, we propose a new version of Aesthetic Multi-Attributes Networks (AMANv2) based on the BUTD model and the VLPSA model. AMANv2 fuses features of a mixture of small-scale PCCD dataset with full annotations and large-scale DPCCaptionsV2 dataset with full annotations. The experimental results of DPCCaptionsV2 show that our method can predict the comments on 4 aesthetic attributes, which are closer to aesthetic topics than those produced by the previous AMAN model. Through the evaluation criteria of image captioning, the specially designed AMANv2 model is better to the CNN-LSTM model and the AMAN model.

preprint2022arXiv

Aesthetic Language Guidance Generation of Images Using Attribute Comparison

With the vigorous development of mobile photography technology, major mobile phone manufacturers are scrambling to improve the shooting ability of equipments and the photo beautification algorithm of software. However, the improvement of intelligent equipments and algorithms cannot replace human subjective photography technology. In this paper, we propose the aesthetic language guidance of image (ALG). We divide ALG into ALG-T and ALG-I according to whether the guiding rules are based on photography templates or guidance images. Whether it is ALG-T or ALG-I, we guide photography from three attributes of color, lighting and composition of the images. The differences of the three attributes between the input images and the photography templates or the guidance images are described in natural language, which is aesthetic natural language guidance (ALG). Also, because of the differences in lighting and composition between landscape images and portrait images, we divide the input images into landscape images and portrait images. Both ALG-T and ALG-I conduct aesthetic language guidance respectively for the two types of input images (landscape images and portrait images).

preprint2022arXiv

Aesthetic Visual Question Answering of Photographs

Aesthetic assessment of images can be categorized into two main forms: numerical assessment and language assessment. Aesthetics caption of photographs is the only task of aesthetic language assessment that has been addressed. In this paper, we propose a new task of aesthetic language assessment: aesthetic visual question and answering (AVQA) of images. If we give a question of images aesthetics, model can predict the answer. We use images from \textit{www.flickr.com}. The objective QA pairs are generated by the proposed aesthetic attributes analysis algorithms. Moreover, we introduce subjective QA pairs that are converted from aesthetic numerical labels and sentiment analysis from large-scale pre-train models. We build the first aesthetic visual question answering dataset, AesVQA, that contains 72,168 high-quality images and 324,756 pairs of aesthetic questions. Two methods for adjusting the data distribution have been proposed and proved to improve the accuracy of existing models. This is the first work that both addresses the task of aesthetic VQA and introduces subjectiveness into VQA tasks. The experimental results reveal that our methods outperform other VQA models on this new task.

preprint2022arXiv

Aesthetics Driven Autonomous Time-Lapse Photography Generation by Virtual and Real Robots

Time-lapse photography is employed in movies and promotional films because it can reflect the passage of time in a short time and strengthen the visual attraction. However, since it takes a long time and requires the stable shooting, it is a great challenge for the photographer. In this article, we propose a time-lapse photography system with virtual and real robots. To help users shoot time-lapse videos efficiently, we first parameterize the time-lapse photography and propose a parameter optimization method. For different parameters, different aesthetic models, including image and video aesthetic quality assessment networks, are used to generate optimal parameters. Then we propose a time-lapse photography interface to facilitate users to view and adjust parameters and use virtual robots to conduct virtual photography in a three-dimensional scene. The system can also export the parameters and provide them to real robots so that the time-lapse videos can be filmed in the real world. In addition, we propose a time-lapse photography aesthetic assessment method that can automatically evaluate the aesthetic quality of time-lapse video. The experimental results show that our method can efficiently obtain the time-lapse videos. We also conduct a user study. The results show that our system has the similar effect as professional photographers and is more efficient.

preprint2022arXiv

Attribute Controllable Beautiful Caucasian Face Generation by Aesthetics Driven Reinforcement Learning

In recent years, image generation has made great strides in improving the quality of images, producing high-fidelity ones. Also, quite recently, there are architecture designs, which enable GAN to unsupervisedly learn the semantic attributes represented in different layers. However, there is still a lack of research on generating face images more consistent with human aesthetics. Based on EigenGAN [He et al., ICCV 2021], we build the techniques of reinforcement learning into the generator of EigenGAN. The agent tries to figure out how to alter the semantic attributes of the generated human faces towards more preferable ones. To accomplish this, we trained an aesthetics scoring model that can conduct facial beauty prediction. We also can utilize this scoring model to analyze the correlation between face attributes and aesthetics scores. Empirically, using off-the-shelf techniques from reinforcement learning would not work well. So instead, we present a new variant incorporating the ingredients emerging in the reinforcement learning communities in recent years. Compared to the original generated images, the adjusted ones show clear distinctions concerning various attributes. Experimental results using the MindSpore, show the effectiveness of the proposed method. Altered facial images are commonly more attractive, with significantly improved aesthetic levels.

preprint2022arXiv

Attribute Group Editing for Reliable Few-shot Image Generation

Few-shot image generation is a challenging task even using the state-of-the-art Generative Adversarial Networks (GANs). Due to the unstable GAN training process and the limited training data, the generated images are often of low quality and low diversity. In this work, we propose a new editing-based method, i.e., Attribute Group Editing (AGE), for few-shot image generation. The basic assumption is that any image is a collection of attributes and the editing direction for a specific attribute is shared across all categories. AGE examines the internal representation learned in GANs and identifies semantically meaningful directions. Specifically, the class embedding, i.e., the mean vector of the latent codes from a specific category, is used to represent the category-relevant attributes, and the category-irrelevant attributes are learned globally by Sparse Dictionary Learning on the difference between the sample embedding and the class embedding. Given a GAN well trained on seen categories, diverse images of unseen categories can be synthesized through editing category-irrelevant attributes while keeping category-relevant attributes unchanged. Without re-training the GAN, AGE is capable of not only producing more realistic and diverse images for downstream visual applications with limited data but achieving controllable image editing with interpretable category-irrelevant directions.

preprint2022arXiv

Can we imitate the principal investor's behavior to learn option price?

This paper presents a framework of imitating the principal investor's behavior for optimal pricing and hedging options. We construct a non-deterministic Markov decision process for modeling stock price change driven by the principal investor's decision making. However, low signal-to-noise ratio and instability that are inherent in equity markets pose challenges to determine the state transition (stock price change) after executing an action (the principal investor's decision) as well as decide an action based on current state (spot price). In order to conquer these challenges, we resort to a Bayesian deep neural network for computing the predictive distribution of the state transition led by an action. Additionally, instead of exploring a state-action relationship to formulate a policy, we seek for an episode based visible-hidden state-action relationship to probabilistically imitate the principal investor's successive decision making. Unlike conventional option pricing that employs analytical stochastic processes or utilizes time series analysis to model and sample underlying stock price movements, our algorithm simulates stock price paths by imitating the principal investor's behavior which requires no preset probability distribution and fewer predetermined parameters. Eventually the optimal option price is learned by reinforcement learning to maximize the cumulative risk-adjusted return of a dynamically hedged portfolio over simulated price paths.

preprint2022arXiv

Cloth-Changing Person Re-identification from A Single Image with Gait Prediction and Regularization

Cloth-Changing person re-identification (CC-ReID) aims at matching the same person across different locations over a long-duration, e.g., over days, and therefore inevitably meets challenge of changing clothing. In this paper, we focus on handling well the CC-ReID problem under a more challenging setting, i.e., just from a single image, which enables high-efficiency and latency-free pedestrian identify for real-time surveillance applications. Specifically, we introduce Gait recognition as an auxiliary task to drive the Image ReID model to learn cloth-agnostic representations by leveraging personal unique and cloth-independent gait information, we name this framework as GI-ReID. GI-ReID adopts a two-stream architecture that consists of a image ReID-Stream and an auxiliary gait recognition stream (Gait-Stream). The Gait-Stream, that is discarded in the inference for high computational efficiency, acts as a regulator to encourage the ReID-Stream to capture cloth-invariant biometric motion features during the training. To get temporal continuous motion cues from a single image, we design a Gait Sequence Prediction (GSP) module for Gait-Stream to enrich gait information. Finally, a high-level semantics consistency over two streams is enforced for effective knowledge regularization. Experiments on multiple image-based Cloth-Changing ReID benchmarks, e.g., LTCC, PRCC, Real28, and VC-Clothes, demonstrate that GI-ReID performs favorably against the state-of-the-arts. Codes are available at https://github.com/jinx-USTC/GI-ReID.

preprint2022arXiv

Design of Switchable Frequency-Selective Rasorber with A-R-A-T or A-T-A-R Operating Modes

This work presents a switchable frequency-selective rasorber (SFSR) with two operating modes. Switching from transmission to reflection can be achieved by appropriately adding feeders and PIN diodes based on cascaded two-dimensional lossy and lossless frequency-selective surface (FSS) screens. The proposed SFSR can realize out-of-band absorption. Analysis of the equivalent circuit model (ECM) can be useful for achieving a switchable rasorber. As the state of the PIN diodes changes, the working state of the SFSR can be switched from low-frequency reflection and high-frequency transmission to low-frequency transmission and high-frequency reflection, respectively. In the final-revision simulation of the working band of the SFSR, in the ON state, reflection and transmission peaks of -0.55 dB at 4.06 GHz and -0.52 dB at 5.97 GHz, respectively, are achieved; in the OFF state, the transmission and reflection peaks of -0.31 dB at 4.04GHz and -0.26 dB at 6.01 GHz, respectively, are obtained. A prototype sample is developed and validated. The results are in good agreement with those of the full-wave simulation. The proposed design can be used in intelligent anti-jamming communication.

preprint2022arXiv

Edge Security: Challenges and Issues

Edge computing is a paradigm that shifts data processing services to the network edge, where data are generated. While such an architecture provides faster processing and response, among other benefits, it also raises critical security issues and challenges that must be addressed. This paper discusses the security threats and vulnerabilities emerging from the edge network architecture spanning from the hardware layer to the system layer. We further discuss privacy and regulatory compliance challenges in such networks. Finally, we argue the need for a holistic approach to analyze edge network security posture, which must consider knowledge from each layer.

preprint2022arXiv

Event-Triggered Optimal Attitude Consensus of Multiple Rigid Body Networks with Unknown Dynamics

In this paper, an event-triggered Reinforcement Learning (RL) method is proposed for the optimal attitude consensus of multiple rigid body networks with unknown dynamics. Firstly, the consensus error is constructed through the attitude dynamics. According to the Bellman optimality principle, the implicit form of the optimal controller and the corresponding Hamilton-Jacobi-Bellman (HJB) equations are obtained. Because of the augmented system, the optimal controller can be obtained directly without relying on the system dynamics. Secondly, the self-triggered mechanism is applied to reduce the computing and communication burden when updating the controller. In order to address the problem that the HJB equations are difficult to solve analytically, a RL method which only requires measurement data at the event-triggered instants is proposed. For each agent, only one neural network is designed to approximate the optimal value function. Each neural network is updated only at the event triggered instants. Meanwhile, the Uniformly Ultimately Bounded (UUB) of the closed-loop system is obtained, and the Zeno behavior is also avoided. Finally, the simulation results on a multiple rigid body network demonstrate the validity of the proposed method.

preprint2022arXiv

Image Coding for Machines with Omnipotent Feature Learning

Image Coding for Machines (ICM) aims to compress images for AI tasks analysis rather than meeting human perception. Learning a kind of feature that is both general (for AI tasks) and compact (for compression) is pivotal for its success. In this paper, we attempt to develop an ICM framework by learning universal features while also considering compression. We name such features as omnipotent features and the corresponding framework as Omni-ICM. Considering self-supervised learning (SSL) improves feature generalization, we integrate it with the compression task into the Omni-ICM framework to learn omnipotent features. However, it is non-trivial to coordinate semantics modeling in SSL and redundancy removing in compression, so we design a novel information filtering (IF) module between them by co-optimization of instance distinguishment and entropy minimization to adaptively drop information that is weakly related to AI tasks (e.g., some texture redundancy). Different from previous task-specific solutions, Omni-ICM could directly support AI tasks analysis based on the learned omnipotent features without joint training or extra transformation. Albeit simple and intuitive, Omni-ICM significantly outperforms existing traditional and learning-based codecs on multiple fundamental vision tasks.

preprint2022arXiv

Learning with Recoverable Forgetting

Life-long learning aims at learning a sequence of tasks without forgetting the previously acquired knowledge. However, the involved training data may not be life-long legitimate due to privacy or copyright reasons. In practical scenarios, for instance, the model owner may wish to enable or disable the knowledge of specific tasks or specific samples from time to time. Such flexible control over knowledge transfer, unfortunately, has been largely overlooked in previous incremental or decremental learning methods, even at a problem-setup level. In this paper, we explore a novel learning scheme, termed as Learning wIth Recoverable Forgetting (LIRF), that explicitly handles the task- or sample-specific knowledge removal and recovery. Specifically, LIRF brings in two innovative schemes, namely knowledge deposit and withdrawal, which allow for isolating user-designated knowledge from a pre-trained network and injecting it back when necessary. During the knowledge deposit process, the specified knowledge is extracted from the target network and stored in a deposit module, while the insensitive or general knowledge of the target network is preserved and further augmented. During knowledge withdrawal, the taken-off knowledge is added back to the target network. The deposit and withdraw processes only demand for a few epochs of finetuning on the removal data, ensuring both data and time efficiency. We conduct experiments on several datasets, and demonstrate that the proposed LIRF strategy yields encouraging results with gratifying generalization capability.

preprint2022arXiv

Mandheling: Mixed-Precision On-Device DNN Training with DSP Offloading

This paper proposes Mandheling, the first system that enables highly resource-efficient on-device training by orchestrating the mixed-precision training with on-chip Digital Signal Processing (DSP) offloading. Mandheling fully explores the advantages of DSP in integer-based numerical calculation by four novel techniques: (1) a CPU-DSP co-scheduling scheme to mitigate the overhead from DSP-unfriendly operators; (2) a self-adaptive rescaling algorithm to reduce the overhead of dynamic rescaling in backward propagation; (3) a batch-splitting algorithm to improve the DSP cache efficiency; (4) a DSP-compute subgraph reusing mechanism to eliminate the preparation overhead on DSP. We have fully implemented Mandheling and demonstrate its effectiveness through extensive experiments. The results show that, compared to the state-of-the-art DNN engines from TFLite and MNN, Mandheling reduces the per-batch training time by 5.5$\times$ and the energy consumption by 8.9$\times$ on average. In end-to-end training tasks, Mandheling reduces up to 10.7$\times$ convergence time and 13.1$\times$ energy consumption, with only 1.9%-2.7% accuracy loss compared to the FP32 precision setting.

preprint2022arXiv

MC-LCR: Multi-modal contrastive classification by locally correlated representations for effective face forgery detection

As the remarkable development of facial manipulation technologies is accompanied by severe security concerns, face forgery detection has become a recent research hotspot. Most existing detection methods train a binary classifier under global supervision to judge real or fake. However, advanced manipulations only perform small-scale tampering, posing challenges to comprehensively capture subtle and local forgery artifacts, especially in high compression settings and cross-dataset scenarios. To address such limitations, we propose a novel framework named Multi-modal Contrastive Classification by Locally Correlated Representations(MC-LCR), for effective face forgery detection. Instead of specific appearance features, our MC-LCR aims to amplify implicit local discrepancies between authentic and forged faces from both spatial and frequency domains. Specifically, we design the shallow style representation block that measures the pairwise correlation of shallow feature maps, which encodes local style information to extract more discriminative features in the spatial domain. Moreover, we make a key observation that subtle forgery artifacts can be further exposed in the patch-wise phase and amplitude spectrum and exhibit different clues. According to the complementarity of amplitude and phase information, we develop a patch-wise amplitude and phase dual attention module to capture locally correlated inconsistencies with each other in the frequency domain. Besides the above two modules, we further introduce the collaboration of supervised contrastive loss with cross-entropy loss. It helps the network learn more discriminative and generalized representations. Through extensive experiments and comprehensive studies, we achieve state-of-the-art performance and demonstrate the robustness and generalization of our method.

preprint2022arXiv

Meta Clustering Learning for Large-scale Unsupervised Person Re-identification

Unsupervised Person Re-identification (U-ReID) with pseudo labeling recently reaches a competitive performance compared to fully-supervised ReID methods based on modern clustering algorithms. However, such clustering-based scheme becomes computationally prohibitive for large-scale datasets. How to efficiently leverage endless unlabeled data with limited computing resources for better U-ReID is under-explored. In this paper, we make the first attempt to the large-scale U-ReID and propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL). MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training. After that, the learned cluster centroids, termed as meta-prototypes in our MCL, are regarded as a proxy annotator to softly annotate the rest unlabeled data for further polishing the model. To alleviate the potential noisy labeling issue in the polishment phase, we enforce two well-designed loss constraints to promise intra-identity consistency and inter-identity strong correlation. For multiple widely-used U-ReID benchmarks, our method significantly saves computational cost while achieving a comparable or even better performance compared to prior works.

preprint2022arXiv

Orloj: Predictably Serving Unpredictable DNNs

Existing DNN serving solutions can provide tight latency SLOs while maintaining high throughput via careful scheduling of incoming requests, whose execution times are assumed to be highly predictable and data-independent. However, inference requests to emerging dynamic DNNs -- e.g., popular natural language processing (NLP) models and computer vision (CV) models that skip layers -- are data-dependent. They exhibit poor performance when served using existing solutions because they experience large variance in request execution times depending on the input -- the longest request in a batch inflates the execution times of the smaller ones, causing SLO misses in the absence of careful batching. In this paper, we present Orloj, a dynamic DNN serving system, that captures this variance in dynamic DNNs using empirical distributions of expected request execution times, and then efficiently batches and schedules them without knowing a request's precise execution time. Orloj significantly outperforms state-of-the-art serving solutions for high variance dynamic DNN workloads by 51--80% in finish rate under tight SLO constraints, and over 100% under more relaxed SLO settings. For well-studied static DNN workloads, Orloj keeps comparable performance with the state-of-the-art.

preprint2022arXiv

Pseudo-labelling and Meta Reweighting Learning for Image Aesthetic Quality Assessment

In the tasks of image aesthetic quality evaluation, it is difficult to reach both the high score area and low score area due to the normal distribution of aesthetic datasets. To reduce the error in labeling and solve the problem of normal data distribution, we propose a new aesthetic mixed dataset with classification and regression called AMD-CR, and we train a meta reweighting network to reweight the loss of training data differently. In addition, we provide a training strategy acccording to different stages, based on pseudo labels of the binary classification task, and then we use it for aesthetic training acccording to different stages in classification and regression tasks. In the construction of the network structure, we construct an aesthetic adaptive block (AAB) structure that can adapt to any size of the input images. Besides, we also use the efficient channel attention (ECA) to strengthen the feature extracting ability of each task. The experimental result shows that our method improves 0.1112 compared with the conventional methods in SROCC. The method can also help to find best aesthetic path planning for unmanned aerial vehicles (UAV) and vehicles.

preprint2022arXiv

Reusing the Task-specific Classifier as a Discriminator: Discriminator-free Adversarial Domain Adaptation

Adversarial learning has achieved remarkable performances for unsupervised domain adaptation (UDA). Existing adversarial UDA methods typically adopt an additional discriminator to play the min-max game with a feature extractor. However, most of these methods failed to effectively leverage the predicted discriminative information, and thus cause mode collapse for generator. In this work, we address this problem from a different perspective and design a simple yet effective adversarial paradigm in the form of a discriminator-free adversarial learning network (DALN), wherein the category classifier is reused as a discriminator, which achieves explicit domain alignment and category distinguishment through a unified objective, enabling the DALN to leverage the predicted discriminative information for sufficient feature alignment. Basically, we introduce a Nuclear-norm Wasserstein discrepancy (NWD) that has definite guidance meaning for performing discrimination. Such NWD can be coupled with the classifier to serve as a discriminator satisfying the K-Lipschitz constraint without the requirements of additional weight clipping or gradient penalty strategy. Without bells and whistles, DALN compares favorably against the existing state-of-the-art (SOTA) methods on a variety of public datasets. Moreover, as a plug-and-play technique, NWD can be directly used as a generic regularizer to benefit existing UDA algorithms. Code is available at https://github.com/xiaoachen98/DALN.

preprint2022arXiv

Robust Event Triggering Control for Lateral Dynamics of Intelligent Vehicles with Designable Inter-event Times

In this brief, an improved event-triggered update mechanism (ETM) for the linear quadratic regulator is proposed to solve the lateral motion control problem of intelligent vehicle under bounded disturbances. Based on a novel event function using a clock-like variable to determine the triggering time, we further introduce two new design parameters to improve control performance. Distinct from existing event-based control mechanisms, the inter-event times (IETs) derived from the above control framework are designable, meaning that the proposed ETM can be deployed on practical vehicle more easily and effectively. In addition, the improved IETs-designable ETM features a global robust event-separation property that is extremely required for practical lateral motion control of vehicle subject to diverse disturbances. Theoretical analysis proves the feasibility and stability of the proposed control strategy for trajectory tracking under bounded disturbances. Finally, simulation results verify the theoretical results and show the advantages of the proposed control strategy.

preprint2022arXiv

SADN: Learned Light Field Image Compression with Spatial-Angular Decorrelation

Light field image becomes one of the most promising media types for immersive video applications. In this paper, we propose a novel end-to-end spatial-angular-decorrelated network (SADN) for high-efficiency light field image compression. Different from the existing methods that exploit either spatial or angular consistency in the light field image, SADN decouples the angular and spatial information by dilation convolution and stride convolution in spatial-angular interaction, and performs feature fusion to compress spatial and angular information jointly. To train a stable and robust algorithm, a large-scale dataset consisting of 7549 light field images is proposed and built. The proposed method provides 2.137 times and 2.849 times higher compression efficiency relative to H.266/VVC and H.265/HEVC inter coding, respectively. It also outperforms the end-to-end image compression networks by an average of 79.6% bitrate saving with much higher subjective quality and light field consistency.

preprint2022arXiv

Semantically Video Coding: Instill Static-Dynamic Clues into Structured Bitstream for AI Tasks

Traditional media coding schemes typically encode image/video into a semantic-unknown binary stream, which fails to directly support downstream intelligent tasks at the bitstream level. Semantically Structured Image Coding (SSIC) framework makes the first attempt to enable decoding-free or partial-decoding image intelligent task analysis via a Semantically Structured Bitstream (SSB). However, the SSIC only considers image coding and its generated SSB only contains the static object information. In this paper, we extend the idea of semantically structured coding from video coding perspective and propose an advanced Semantically Structured Video Coding (SSVC) framework to support heterogeneous intelligent applications. Video signals contain more rich dynamic motion information and exist more redundancy due to the similarity between adjacent frames. Thus, we present a reformulation of semantically structured bitstream (SSB) in SSVC which contains both static object characteristics and dynamic motion clues. Specifically, we introduce optical flow to encode continuous motion information and reduce cross-frame redundancy via a predictive coding architecture, then the optical flow and residual information are reorganized into SSB, which enables the proposed SSVC could better adaptively support video-based downstream intelligent applications. Extensive experiments demonstrate that the proposed SSVC framework could directly support multiple intelligent tasks just depending on a partially decoded bitstream. This avoids the full bitstream decompression and thus significantly saves bitrate/bandwidth consumption for intelligent analytics. We verify this point on the tasks of image object detection, pose estimation, video action recognition, video object segmentation, etc.

preprint2022arXiv

Short Video Uprising: How #BlackLivesMatter Content on TikTok Challenges the Protest Paradigm

This study uses TikTok (N = 8,173) to examine how short-form video platforms challenge the protest paradigm in the recent Black Lives Matter movement. A computer-mediated visual analysis, computer vision, is employed to identify the presence of four visual frames of protest (riot, confrontation, spectacle, and debate) in multimedia content. Results of descriptive statistics and the t-test indicate that the three delegitimizing frames - riot, confrontation, and spectacle - are rarely found on TikTok, whereas the debate frame, that empowers marginalized communities, dominates the public sphere. However, although the three delegitimizing frames receive lower social media visibility, as measured by views, likes, shares, followers, and durations, legitimizing elements, such as the debate frame, minority identities, and unofficial sources, are not generally favored by TikTok audiences. This study concludes that while short-form video platforms could potentially challenge the protest paradigm on the content creators' side, the audiences' preference as measured by social media visibility might still be moderately associated with the protest paradigm.

preprint2022arXiv

Style Normalization and Restitution for Domain Generalization and Adaptation

For many practical computer vision applications, the learned models usually have high performance on the datasets used for training but suffer from significant performance degradation when deployed in new environments, where there are usually style differences between the training images and the testing images. An effective domain generalizable model is expected to be able to learn feature representations that are both generalizable and discriminative. In this paper, we design a novel Style Normalization and Restitution module (SNR) to simultaneously ensure both high generalization and discrimination capability of the networks. In the SNR module, particularly, we filter out the style variations (e.g, illumination, color contrast) by performing Instance Normalization (IN) to obtain style normalized features, where the discrepancy among different samples and domains is reduced. However, such a process is task-ignorant and inevitably removes some task-relevant discriminative information, which could hurt the performance. To remedy this, we propose to distill task-relevant discriminative features from the residual (i.e, the difference between the original feature and the style normalized feature) and add them back to the network to ensure high discrimination. Moreover, for better disentanglement, we enforce a dual causality loss constraint in the restitution step to encourage the better separation of task-relevant and task-irrelevant features. We validate the effectiveness of our SNR on different computer vision tasks, including classification, semantic segmentation, and object detection. Experiments demonstrate that our SNR module is capable of improving the performance of networks for domain generalization (DG) and unsupervised domain adaptation (UDA) on many tasks. Code are available at https://github.com/microsoft/SNR.

preprint2022arXiv

Unsupervised Coherent Video Cartoonization with Perceptual Motion Consistency

In recent years, creative content generations like style transfer and neural photo editing have attracted more and more attention. Among these, cartoonization of real-world scenes has promising applications in entertainment and industry. Different from image translations focusing on improving the style effect of generated images, video cartoonization has additional requirements on the temporal consistency. In this paper, we propose a spatially-adaptive semantic alignment framework with perceptual motion consistency for coherent video cartoonization in an unsupervised manner. The semantic alignment module is designed to restore deformation of semantic structure caused by spatial information lost in the encoder-decoder architecture. Furthermore, we devise the spatio-temporal correlative map as a style-independent, global-aware regularization on the perceptual motion consistency. Deriving from similarity measurement of high-level features in photo and cartoon frames, it captures global semantic information beyond raw pixel-value in optical flow. Besides, the similarity measurement disentangles temporal relationships from domain-specific style properties, which helps regularize the temporal consistency without hurting style effects of cartoon images. Qualitative and quantitative experiments demonstrate our method is able to generate highly stylistic and temporal consistent cartoon videos.

preprint2021arXiv

Band-Notched Frequency-Selective Absorber with Polarization Rotation Function

This paper presents the theoretical analysis, design, simulations, and experimental verification of a novel band-notched frequency-selective absorber (BFSA) with polarization rotation function and absorption out-of-band. The BFSA consists of multiple layers including one lossy layer, one polarization-rotary layer and an air layer. The lossy layer is loaded with lumped resistances for obtaining a wide absorption band. A new four-port equivalent circuit model (ECM) with lossy layer is developed for providing theoretical analysis. The BFSA realizes -0.41dB cross-polarization reflection at 4.5GHz. In the upper and lower bands, the BFSA realizes a broad absorption function from 2.2GHz to 4.1GHz and 5.03GHz to 6.5GHz. The fractional bandwidth of co-polarization reflection is 98.8% with 10dB reflection reduction. The full-wave simulation, ECM, and experimental measurements are conducted to validate the polarization conversion of the band-notched absorbers, and good agreement between theory and measurement results is observed.

preprint2021arXiv

FAST: FPGA-based Subgraph Matching on Massive Graphs

Subgraph matching is a basic operation widely used in many applications. However, due to its NP-hardness and the explosive growth of graph data, it is challenging to compute subgraph matching, especially in large graphs. In this paper, we aim at scaling up subgraph matching on a single machine using FPGAs. Specifically, we propose a CPU-FPGA co-designed framework. On the CPU side, we first develop a novel auxiliary data structure called candidate search tree (CST) which serves as a complete search space of subgraph matching. CST can be partitioned and fully loaded into FPGAs' on-chip memory. Then, a workload estimation technique is proposed to balance the load between the CPU and FPGA. On the FPGA side, we design and implement the first FPGA-based subgraph matching algorithm, called FAST. To take full advantage of the pipeline mechanism on FPGAs, task parallelism optimization and task generator separation strategy are proposed for FAST, achieving massive parallelism. Moreover, we carefully develop a BRAM-only matching process to fully utilize FPGA's on-chip memory, which avoids the expensive intermediate data transfer between FPGA's BRAM and DRAM. Comprehensive experiments show that FAST achieves up to 462.0x and 150.0x speedup compared with the state-of-the-art algorithm DAF and CECI, respectively. In addition, FAST is the only algorithm that can handle the billion-scale graph using one machine in our experiments.

preprint2021arXiv

FFR_FD: Effective and Fast Detection of DeepFakes Based on Feature Point Defects

The internet is filled with fake face images and videos synthesized by deep generative models. These realistic DeepFakes pose a challenge to determine the authenticity of multimedia content. As countermeasures, artifact-based detection methods suffer from insufficiently fine-grained features that lead to limited detection performance. DNN-based detection methods are not efficient enough, given that a DeepFake can be created easily by mobile apps and DNN-based models require high computational resources. For the first time, we show that DeepFake faces have fewer feature points than real ones, especially in certain facial regions. Inspired by feature point detector-descriptors to extract discriminative features at the pixel level, we propose the Fused Facial Region_Feature Descriptor (FFR_FD) for effective and fast DeepFake detection. FFR_FD is only a vector extracted from the face, and it can be constructed from any feature point detector-descriptors. We train a random forest classifier with FFR_FD and conduct extensive experiments on six large-scale DeepFake datasets, whose results demonstrate that our method is superior to most state of the art DNN-based models.

preprint2021arXiv

Investigation of meson-meson interaction

In the framework of the quark delocalization color screening model, we investigate the meson-meson interaction in the fully light four-quark system . The calculation of the effective potentials of all the $S-$wave states shows that for most states the interaction between two vector mesons is attractive; the one between a pseudoscalar meson and a vector meson is repulsive or weakly attractive; and the one between two pseudoscalar mesons is always repulsive. However, there is still some exception. The interaction of the $IJ=00$ $ππ$ channel is attractive, while the one of the $IJ=02$ $ϕϕ$ channel is repulsive. So it is difficult to use the $S-$wave $ϕϕ$ state to explain the $X(2239)$ at present calculation. The $S-$wave $ρρ$ states are more likely to be resonance states, which are worthy of investigating in future work. The study of the contribution of each interaction term shows that both the one-gluon exchange and the kinetic energy interaction are important in the interaction between two mesons. The study of the variation of the delocalization parameter indicates that the contribution of the kinetic energy relates to the intermediate-range attraction mechanism in QDCSM, which is achieved by the quark delocalization.

preprint2021arXiv

Learning Omni-frequency Region-adaptive Representations for Real Image Super-Resolution

Traditional single image super-resolution (SISR) methods that focus on solving single and uniform degradation (i.e., bicubic down-sampling), typically suffer from poor performance when applied into real-world low-resolution (LR) images due to the complicated realistic degradations. The key to solving this more challenging real image super-resolution (RealSR) problem lies in learning feature representations that are both informative and content-aware. In this paper, we propose an Omni-frequency Region-adaptive Network (ORNet) to address both challenges, here we call features of all low, middle and high frequencies omni-frequency features. Specifically, we start from the frequency perspective and design a Frequency Decomposition (FD) module to separate different frequency components to comprehensively compensate the information lost for real LR image. Then, considering the different regions of real LR image have different frequency information lost, we further design a Region-adaptive Frequency Aggregation (RFA) module by leveraging dynamic convolution and spatial attention to adaptively restore frequency components for different regions. The extensive experiments endorse the effective, and scenario-agnostic nature of our OR-Net for RealSR.

preprint2021arXiv

The MVTec 3D-AD Dataset for Unsupervised 3D Anomaly Detection and Localization

We introduce the first comprehensive 3D dataset for the task of unsupervised anomaly detection and localization. It is inspired by real-world visual inspection scenarios in which a model has to detect various types of defects on manufactured products, even if it is trained only on anomaly-free data. There are defects that manifest themselves as anomalies in the geometric structure of an object. These cause significant deviations in a 3D representation of the data. We employed a high-resolution industrial 3D sensor to acquire depth scans of 10 different object categories. For all object categories, we present a training and validation set, each of which solely consists of scans of anomaly-free samples. The corresponding test sets contain samples showing various defects such as scratches, dents, holes, contaminations, or deformations. Precise ground-truth annotations are provided for every anomalous test sample. An initial benchmark of 3D anomaly detection methods on our dataset indicates a considerable room for improvement.

preprint2020arXiv

Exploring Categorical Regularization for Domain Adaptive Object Detection

In this paper, we tackle the domain adaptive object detection problem, where the main challenge lies in significant domain gaps between source and target domains. Previous work seeks to plainly align image-level and instance-level shifts to eventually minimize the domain discrepancy. However, they still overlook to match crucial image regions and important instances across domains, which will strongly affect domain shift mitigation. In this work, we propose a simple but effective categorical regularization framework for alleviating this issue. It can be applied as a plug-and-play component on a series of Domain Adaptive Faster R-CNN methods which are prominent for dealing with domain adaptive detection. Specifically, by integrating an image-level multi-label classifier upon the detection backbone, we can obtain the sparse but crucial image regions corresponding to categorical information, thanks to the weakly localization ability of the classification manner. Meanwhile, at the instance level, we leverage the categorical consistency between image-level predictions (by the classifier) and instance-level predictions (by the detection head) as a regularization factor to automatically hunt for the hard aligned instances of target domains. Extensive experiments of various domain shift scenarios show that our method obtains a significant performance gain over original Domain Adaptive Faster R-CNN detectors. Furthermore, qualitative visualization and analyses can demonstrate the ability of our method for attending on the key regions/instances targeting on domain adaptation. Our code is open-source and available at \url{https://github.com/Megvii-Nanjing/CR-DA-DET}.

preprint2020arXiv

Feature Alignment and Restoration for Domain Generalization and Adaptation

For domain generalization (DG) and unsupervised domain adaptation (UDA), cross domain feature alignment has been widely explored to pull the feature distributions of different domains in order to learn domain-invariant representations. However, the feature alignment is in general task-ignorant and could result in degradation of the discrimination power of the feature representation and thus hinders the high performance. In this paper, we propose a unified framework termed Feature Alignment and Restoration (FAR) to simultaneously ensure high generalization and discrimination power of the networks for effective DG and UDA. Specifically, we perform feature alignment (FA) across domains by aligning the moments of the distributions of attentively selected features to reduce their discrepancy. To ensure high discrimination, we propose a Feature Restoration (FR) operation to distill task-relevant features from the residual information and use them to compensate for the aligned features. For better disentanglement, we enforce a dual ranking entropy loss constraint in the FR step to encourage the separation of task-relevant and task-irrelevant features. Extensive experiments on multiple classification benchmarks demonstrate the high performance and strong generalization of our FAR framework for both domain generalization and unsupervised domain adaptation.

preprint2020arXiv

Global Distance-distributions Separation for Unsupervised Person Re-identification

Supervised person re-identification (ReID) often has poor scalability and usability in real-world deployments due to domain gaps and the lack of annotations for the target domain data. Unsupervised person ReID through domain adaptation is attractive yet challenging. Existing unsupervised ReID approaches often fail in correctly identifying the positive samples and negative samples through the distance-based matching/ranking. The two distributions of distances for positive sample pairs (Pos-distr) and negative sample pairs (Neg-distr) are often not well separated, having large overlap. To address this problem, we introduce a global distance-distributions separation (GDS) constraint over the two distributions to encourage the clear separation of positive and negative samples from a global view. We model the two global distance distributions as Gaussian distributions and push apart the two distributions while encouraging their sharpness in the unsupervised training process. Particularly, to model the distributions from a global view and facilitate the timely updating of the distributions and the GDS related losses, we leverage a momentum update mechanism for building and maintaining the distribution parameters (mean and variance) and calculate the loss on the fly during the training. Distribution-based hard mining is proposed to further promote the separation of the two distributions. We validate the effectiveness of the GDS constraint in unsupervised ReID networks. Extensive experiments on multiple ReID benchmark datasets show our method leads to significant improvement over the baselines and achieves the state-of-the-art performance.

preprint2020arXiv

Is Network the Bottleneck of Distributed Training?

Recently there has been a surge of research on improving the communication efficiency of distributed training. However, little work has been done to systematically understand whether the network is the bottleneck and to what extent. In this paper, we take a first-principles approach to measure and analyze the network performance of distributed training. As expected, our measurement confirms that communication is the component that blocks distributed training from linear scale-out. However, contrary to the common belief, we find that the network is running at low utilization and that if the network can be fully utilized, distributed training can achieve a scaling factor of close to one. Moreover, while many recent proposals on gradient compression advocate over 100x compression ratio, we show that under full network utilization, there is no need for gradient compression in 100 Gbps network. On the other hand, a lower speed network like 10 Gbps requires only 2x--5x gradients compression ratio to achieve almost linear scale-out. Compared to application-level techniques like gradient compression, network-level optimizations do not require changes to applications and do not hurt the performance of trained models. As such, we advocate that the real challenge of distributed training is for the network community to develop high-performance network transport to fully utilize the network capacity and achieve linear scale-out.

preprint2020arXiv

Learned Video Compression with Feature-level Residuals

In this paper, we present an end-to-end video compression network for P-frame challenge on CLIC. We focus on deep neural network (DNN) based video compression, and improve the current frameworks from three aspects. First, we notice that pixel space residuals is sensitive to the prediction errors of optical flow based motion compensation. To suppress the relative influence, we propose to compress the residuals of image feature rather than the residuals of image pixels. Furthermore, we combine the advantages of both pixel-level and feature-level residual compression methods by model ensembling. Finally, we propose a step-by-step training strategy to improve the training efficiency of the whole framework. Experiment results indicate that our proposed method achieves 0.9968 MS-SSIM on CLIC validation set and 0.9967 MS-SSIM on test set.

preprint2020arXiv

Learning Disentangled Feature Representation for Hybrid-distorted Image Restoration

Hybrid-distorted image restoration (HD-IR) is dedicated to restore real distorted image that is degraded by multiple distortions. Existing HD-IR approaches usually ignore the inherent interference among hybrid distortions which compromises the restoration performance. To decompose such interference, we introduce the concept of Disentangled Feature Learning to achieve the feature-level divide-and-conquer of hybrid distortions. Specifically, we propose the feature disentanglement module (FDM) to distribute feature representations of different distortions into different channels by revising gain-control-based normalization. We also propose a feature aggregation module (FAM) with channel-wise attention to adaptively filter out the distortion representations and aggregate useful content information from different channels for the construction of raw image. The effectiveness of the proposed scheme is verified by visualizing the correlation matrix of features and channel responses of different distortions. Extensive experimental results also prove superior performance of our approach compared with the latest HD-IR schemes.

preprint2020arXiv

Relation-Aware Global Attention for Person Re-identification

For person re-identification (re-id), attention mechanisms have become attractive as they aim at strengthening discriminative features and suppressing irrelevant ones, which matches well the key of re-id, i.e., discriminative feature learning. Previous approaches typically learn attention using local convolutions, ignoring the mining of knowledge from global structure patterns. Intuitively, the affinities among spatial positions/nodes in the feature map provide clustering-like information and are helpful for inferring semantics and thus attention, especially for person images where the feasible human poses are constrained. In this work, we propose an effective Relation-Aware Global Attention (RGA) module which captures the global structural information for better attention learning. Specifically, for each feature position, in order to compactly grasp the structural information of global scope and local appearance information, we propose to stack the relations, i.e., its pairwise correlations/affinities with all the feature positions (e.g., in raster scan order), and the feature itself together to learn the attention with a shallow convolutional model. Extensive ablation studies demonstrate that our RGA can significantly enhance the feature representation power and help achieve the state-of-the-art performance on several popular benchmarks. The source code is available at https://github.com/microsoft/Relation-Aware-Global-Attention-Networks.

preprint2020arXiv

Semantics-Aligned Representation Learning for Person Re-identification

Person re-identification (reID) aims to match person images to retrieve the ones with the same identity. This is a challenging task, as the images to be matched are generally semantically misaligned due to the diversity of human poses and capture viewpoints, incompleteness of the visible bodies (due to occlusion), etc. In this paper, we propose a framework that drives the reID network to learn semantics-aligned feature representation through delicate supervision designs. Specifically, we build a Semantics Aligning Network (SAN) which consists of a base network as encoder (SA-Enc) for re-ID, and a decoder (SA-Dec) for reconstructing/regressing the densely semantics aligned full texture image. We jointly train the SAN under the supervisions of person re-identification and aligned texture generation. Moreover, at the decoder, besides the reconstruction loss, we add Triplet ReID constraints over the feature maps as the perceptual losses. The decoder is discarded in the inference and thus our scheme is computationally efficient. Ablation studies demonstrate the effectiveness of our design. We achieve the state-of-the-art performances on the benchmark datasets CUHK03, Market1501, MSMT17, and the partial person reID dataset Partial REID. Code for our proposed method is available at: https://github.com/microsoft/Semantics-Aligned-Representation-Learning-for-Person-Re-identification.

preprint2020arXiv

Strange hidden-charm tetraquarks in constituent quark models

Inspired by the newly reported $Z_{cs}(3985)^{-}$ by the BESIII Collaboration, we systematically investigate the strange hidden-charm tetraquark systems $cs\bar{c}\bar{u}$ with two structures: meson-meson and diquark-antidiquark. Two quark models: the chiral quark model (ChQM) and the quark delocalization color screening model (QDCSM) are used here. Similar results are obtained in both two quark models. There is no any bound state in either ChQM or QDCSM, which excludes the molecular state explanation ($D_{s}D^{*}/D_{s}^{*}D/D_{s}^{*}D^{*}$) of the reported $Z_{cs}(3985)^{-}$. However, the effective potentials for the diquark-antidiquark $cs\bar{c}\bar{u}$ systems shows the possibility of some resonance states with mass range of $3916.5\sim 3964.6$ MeV for $IJ^{P}=\frac{1}{2} 0^{+}$, $4008.8\sim 4091.2$ MeV for $IJ^{P}=\frac{1}{2} 1^{+}$, $4246.8\sim 4418.1$ MeV for $IJ^{P}=\frac{1}{2} 2^{+}$. So the observed $Z_{cs}(3985)^{-}$ state is possible to be explained as a compact resonance state composed of $cs\bar{c}\bar{u}$ with $IJ^{P}=\frac{1}{2} 0^{+}$ or $IJ^{P}=\frac{1}{2} 1^{+}$. The study of the scattering process of corresponding open channels is under way to check this conclusion.

preprint2020arXiv

Style Normalization and Restitution for Generalizable Person Re-identification

Existing fully-supervised person re-identification (ReID) methods usually suffer from poor generalization capability caused by domain gaps. The key to solving this problem lies in filtering out identity-irrelevant interference and learning domain-invariant person representations. In this paper, we aim to design a generalizable person ReID framework which trains a model on source domains yet is able to generalize/perform well on target domains. To achieve this goal, we propose a simple yet effective Style Normalization and Restitution (SNR) module. Specifically, we filter out style variations (e.g., illumination, color contrast) by Instance Normalization (IN). However, such a process inevitably removes discriminative information. We propose to distill identity-relevant feature from the removed information and restitute it to the network to ensure high discrimination. For better disentanglement, we enforce a dual causal loss constraint in SNR to encourage the separation of identity-relevant features and identity-irrelevant features. Extensive experiments demonstrate the strong generalization capability of our framework. Our models empowered by the SNR modules significantly outperform the state-of-the-art domain generalization approaches on multiple widely-used person ReID benchmarks, and also show superiority on unsupervised domain adaptation.

preprint2020arXiv

Uncertainty-Aware Multi-Shot Knowledge Distillation for Image-Based Object Re-Identification

Object re-identification (re-id) aims to identify a specific object across times or camera views, with the person re-id and vehicle re-id as the most widely studied applications. Re-id is challenging because of the variations in viewpoints, (human) poses, and occlusions. Multi-shots of the same object can cover diverse viewpoints/poses and thus provide more comprehensive information. In this paper, we propose exploiting the multi-shots of the same identity to guide the feature learning of each individual image. Specifically, we design an Uncertainty-aware Multi-shot Teacher-Student (UMTS) Network. It consists of a teacher network (T-net) that learns the comprehensive features from multiple images of the same object, and a student network (S-net) that takes a single image as input. In particular, we take into account the data dependent heteroscedastic uncertainty for effectively transferring the knowledge from the T-net to S-net. To the best of our knowledge, we are the first to make use of multi-shots of an object in a teacher-student learning manner for effectively boosting the single image based re-id. We validate the effectiveness of our approach on the popular vehicle re-id and person re-id datasets. In inference, the S-net alone significantly outperforms the baselines and achieves the state-of-the-art performance.