Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
34works
0followers
14topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

34 published item(s)

preprint2026arXiv

Advancing Aesthetic Image Generation via Composition Transfer

Composition is a cornerstone of visual aesthetics, influencing the appeal of an image. While its principles operate independently of specific content, in practice, composition is often coupled with semantics. As a result, existing methods often enhance composition either through implicit learning or by semantics-based layout control, rather than explicitly modeling composition itself. To address this gap, we introduce Composer, a framework rooted in aesthetic theory, designed to model composition in a semantic-agnostic manner. First, it supports composition transfer by extracting key composition-aware representations from a reference image and leveraging a tailored conditional guidance module to control composition based on pre-trained diffusion models. Second, when users specify only text themes without a composition reference, Composer supports theme-driven composition retrieval by leveraging the in-context learning capabilities of Large Vision-Language Models (LVLMs), achieving explicit composition planning. To enhance composition in a reference-free mode, we conduct text-to-composition fine-tuning on the trained control module to enable implicit composition planning. Furthermore, we curated a high-quality dataset comprising 2 million image-text pairs using state-of-the-art generative models to support model training. Experimental results demonstrate that Composer significantly enhances aesthetic quality in text-to-image tasks and facilitates personalized composition control and transfer, offering users precision and flexibility in the creative process.

preprint2026arXiv

Cross-modal Retrieval Models for Stripped Binary Analysis

Retrieving binary code via natural language queries is a pivotal capability for downstream tasks in the software security domain, such as vulnerability detection and malware analysis. However, it is challenging to identify binary functions semantically relevant to the user query from thousands of candidates, as the absence of symbolic information distinguishes this task from source code retrieval. In this paper, we introduce, BinSeek, a two-stage cross-modal retrieval framework for stripped binary code analysis. It consists of two models: BinSeek-Embedding is trained on large-scale dataset to learn the semantic relevance of the binary code and the natural language description, furthermore, BinSeek-Reranker learns to carefully judge the relevance of the candidate code to the description with context augmentation. To this end, we built an LLM-based data synthesis pipeline to automate training construction, also deriving a domain benchmark for future research. Our evaluation results show that BinSeek achieved the state-of-the-art performance, surpassing the the same scale models by 31.42% in Rec@3 and 27.17% in MRR@3, as well as leading the advanced general-purpose models that have 16 times larger parameters.

preprint2026arXiv

SAPL: Semantic-Agnostic Prompt Learning in CLIP for Weakly Supervised Image Manipulation Localization

Malicious image manipulation threatens public safety and requires efficient localization methods. Existing approaches depend on costly pixel-level annotations which make training expensive. Existing weakly supervised methods rely only on image-level binary labels and focus on global classification, often overlooking local edge cues that are critical for precise localization. We observe that feature variations at manipulated boundaries are substantially larger than in interior regions. To address this gap, we propose Semantic-Agnostic Prompt Learning (SAPL) in CLIP, which learns text prompts that intentionally encode non-semantic, boundary-centric cues so that CLIPs multimodal similarity highlights manipulation edges rather than high-level object semantics. SAPL combines two complementary modules Edge-aware Contextual Prompt Learning (ECPL) and Hierarchical Edge Contrastive Learning (HECL) to exploit edge information in both textual and visual spaces. The proposed ECPL leverages edge-enhanced image features to generate learnable textual prompts via an attention mechanism, embedding semantic-irrelevant information into text features, to guide CLIP focusing on manipulation edges. The proposed HECL extract genuine and manipulated edge patches, and utilize contrastive learning to boost the discrimination between genuine edge patches and manipulated edge patches. Finally, we predict the manipulated regions from the similarity map after processing. Extensive experiments on multiple public benchmarks demonstrate that SAPL significantly outperforms existing approaches, achieving state-of-the-art localization performance.

preprint2023arXiv

VQNet 2.0: A New Generation Machine Learning Framework that Unifies Classical and Quantum

With the rapid development of classical and quantum machine learning, a large number of machine learning frameworks have been proposed. However, existing machine learning frameworks usually only focus on classical or quantum, rather than both. Therefore, based on VQNet 1.0, we further propose VQNet 2.0, a new generation of unified classical and quantum machine learning framework that supports hybrid optimization. The core library of the framework is implemented in C++, and the user level is implemented in Python, and it supports deployment on quantum and classical hardware. In this article, we analyze the development trend of the new generation machine learning framework and introduce the design principles of VQNet 2.0 in detail: unity, practicality, efficiency, and compatibility, as well as full particulars of implementation. We illustrate the functions of VQNet 2.0 through several basic applications, including classical convolutional neural networks, quantum autoencoders, hybrid classical-quantum networks, etc. After that, through extensive experiments, we demonstrate that the operation speed of VQNet 2.0 is higher than the comparison method. Finally, through extensive experiments, we demonstrate that VQNet 2.0 can deploy on different hardware platforms, the overall calculation speed is faster than the comparison method. It also can be mixed and optimized with quantum circuits composed of multiple quantum computing libraries.

preprint2022arXiv

ATDD: Fine-Grained Assured Time-Sensitive Data Deletion Scheme in Cloud Storage

With the rapid development of general cloud services, more and more individuals or collectives use cloud platforms to store data. Assured data deletion deserves investigation in cloud storage. In time-sensitive data storage scenarios, it is necessary for cloud platforms to automatically destroy data after the data owner-specified expiration time. Therefore, assured timesensitive data deletion should be sought. In this paper, a finegrained assured time-sensitive data deletion (ATDD) scheme in cloud storage is proposed by embedding the time trapdoor in Ciphertext-Policy Attribute-Based Encryption (CP-ABE). Timesensitive data is self-destructed after the data owner-specified expiration time so that the authorized users cannot get access to the related data. In addition, a credential is returned to the data owner for data deletion verification. This proposed scheme provides solutions for fine-grained access control and verifiable data self-destruction. Detailed security and performance analysis demonstrate the security and the practicability of the proposed scheme.

preprint2022arXiv

Bootstrapped Masked Autoencoders for Vision BERT Pretraining

We propose bootstrapped masked autoencoders (BootMAE), a new approach for vision BERT pretraining. BootMAE improves the original masked autoencoders (MAE) with two core designs: 1) momentum encoder that provides online feature as extra BERT prediction targets; 2) target-aware decoder that tries to reduce the pressure on the encoder to memorize target-specific information in BERT pretraining. The first design is motivated by the observation that using a pretrained MAE to extract the features as the BERT prediction target for masked tokens can achieve better pretraining performance. Therefore, we add a momentum encoder in parallel with the original MAE encoder, which bootstraps the pretraining performance by using its own representation as the BERT prediction target. In the second design, we introduce target-specific information (e.g., pixel values of unmasked patches) from the encoder directly to the decoder to reduce the pressure on the encoder of memorizing the target-specific information. Thus, the encoder focuses on semantic modeling, which is the goal of BERT pretraining, and does not need to waste its capacity in memorizing the information of unmasked tokens related to the prediction target. Through extensive experiments, our BootMAE achieves $84.2\%$ Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming MAE by $+0.8\%$ under the same pre-training epochs. BootMAE also gets $+1.0$ mIoU improvements on semantic segmentation on ADE20K and $+1.3$ box AP, $+1.4$ mask AP improvement on object detection and segmentation on COCO dataset. Code is released at https://github.com/LightDXY/BootMAE.

preprint2022arXiv

CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows

We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism for computing self-attention in the horizontal and vertical stripes in parallel that form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width. We provide a mathematical analysis of the effect of the stripe width and vary the stripe width for different layers of the Transformer network which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles the local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks. Incorporated with these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4\% Top-1 accuracy on ImageNet-1K without any extra training data or label, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 52.2 mIOU on the ADE20K semantic segmentation task, surpassing previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under the similar FLOPs setting. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and high segmentation performance on ADE20K with 55.7 mIoU. The code and models are available at https://github.com/microsoft/CSWin-Transformer.

preprint2022arXiv

E2Style: Improve the Efficiency and Effectiveness of StyleGAN Inversion

This paper studies the problem of StyleGAN inversion, which plays an essential role in enabling the pretrained StyleGAN to be used for real image editing tasks. The goal of StyleGAN inversion is to find the exact latent code of the given image in the latent space of StyleGAN. This problem has a high demand for quality and efficiency. Existing optimization-based methods can produce high-quality results, but the optimization often takes a long time. On the contrary, forward-based methods are usually faster but the quality of their results is inferior. In this paper, we present a new feed-forward network "E2Style" for StyleGAN inversion, with significant improvement in terms of efficiency and effectiveness. In our inversion network, we introduce: 1) a shallower backbone with multiple efficient heads across scales; 2) multi-layer identity loss and multi-layer face parsing loss to the loss function; and 3) multi-stage refinement. Combining these designs together forms an effective and efficient method that exploits all benefits of optimization-based and forward-based methods. Quantitative and qualitative results show that our E2Style performs better than existing forward-based methods and comparably to state-of-the-art optimization-based methods while maintaining the high efficiency as well as forward-based methods. Moreover, a number of real image editing applications demonstrate the efficacy of our E2Style. Our code is available at \url{https://github.com/wty-ustc/e2style}

preprint2022arXiv

Fidelity-Guarantee Entanglement Routing in Quantum Networks

Entanglement routing establishes remote entanglement connection between two arbitrary nodes, which is one of the most important functions in quantum networks. The existing routing mechanisms mainly improve the robustness and throughput facing the failure of entanglement generations, which, however, rarely include the considerations on the most important metric to evaluate the quality of connection, entanglement fidelity. To solve this problem, we propose purification-enabled entanglement routing designs to provide fidelity guarantee for multiple Source-Destination (SD) pairs in quantum networks. In our proposal, we first consider the single S-D pair scenario and design an iterative routing algorithm, Q-PATH, to find the optimal purification decisions along the routing path with minimum entangled pair cost. Further, a low-complexity routing algorithm using an extended Dijkstra algorithm, Q-LEAP, is designed to reduce the computational complexity by using a simple but effective purification decision method. Then we consider the common scenario with multiple S-D pairs and design a greedy-based algorithm considering resource allocation and rerouting process for multiple routing requests. To verify the effectiveness and superiority of the proposed algorithms, extensive simulations are conducted, and the simulation results show that the proposed algorithms not only can provide fidelity-guarantee routing solutions, but also has superior performance in terms of throughput, fidelity of end-to-end entanglement connection, and resource utilization ratio, compared with the existing routing scheme.

preprint2022arXiv

HairCLIP: Design Your Hair by Text and Reference Image

Hair editing is an interesting and challenging problem in computer vision and graphics. Many existing methods require well-drawn sketches or masks as conditional inputs for editing, however these interactions are neither straightforward nor efficient. In order to free users from the tedious interaction process, this paper proposes a new hair editing interaction mode, which enables manipulating hair attributes individually or jointly based on the texts or reference images provided by users. For this purpose, we encode the image and text conditions in a shared embedding space and propose a unified hair editing framework by leveraging the powerful image text representation capability of the Contrastive Language-Image Pre-Training (CLIP) model. With the carefully designed network structures and loss functions, our framework can perform high-quality hair editing in a disentangled manner. Extensive experiments demonstrate the superiority of our approach in terms of manipulation accuracy, visual realism of editing results, and irrelevant attribute preservation. Project repo is https://github.com/wty-ustc/HairCLIP.

preprint2022arXiv

Invertible Mask Network for Face Privacy-Preserving

Face privacy-preserving is one of the hotspots that arises dramatic interests of research. However, the existing face privacy-preserving methods aim at causing the missing of semantic information of face and cannot preserve the reusability of original facial information. To achieve the naturalness of the processed face and the recoverability of the original protected face, this paper proposes face privacy-preserving method based on Invertible "Mask" Network (IMN). In IMN, we introduce a Mask-net to generate "Mask" face firstly. Then, put the "Mask" face onto the protected face and generate the masked face, in which the masked face is indistinguishable from "Mask" face. Finally, "Mask" face can be put off from the masked face and obtain the recovered face to the authorized users, in which the recovered face is visually indistinguishable from the protected face. The experimental results show that the proposed method can not only effectively protect the privacy of the protected face, but also almost perfectly recover the protected face from the masked face.

preprint2022arXiv

Online Multi-Object Tracking with Unsupervised Re-Identification Learning and Occlusion Estimation

Occlusion between different objects is a typical challenge in Multi-Object Tracking (MOT), which often leads to inferior tracking results due to the missing detected objects. The common practice in multi-object tracking is re-identifying the missed objects after their reappearance. Though tracking performance can be boosted by the re-identification, the annotation of identity is required to train the model. In addition, such practice of re-identification still can not track those highly occluded objects when they are missed by the detector. In this paper, we focus on online multi-object tracking and design two novel modules, the unsupervised re-identification learning module and the occlusion estimation module, to handle these problems. Specifically, the proposed unsupervised re-identification learning module does not require any (pseudo) identity information nor suffer from the scalability issue. The proposed occlusion estimation module tries to predict the locations where occlusions happen, which are used to estimate the positions of missed objects by the detector. Our study shows that, when applied to state-of-the-art MOT methods, the proposed unsupervised re-identification learning is comparable to supervised re-identification learning, and the tracking performance is further improved by the proposed occlusion estimation module.

preprint2022arXiv

Poison Ink: Robust and Invisible Backdoor Attack

Recent research shows deep neural networks are vulnerable to different types of attacks, such as adversarial attack, data poisoning attack and backdoor attack. Among them, backdoor attack is the most cunning one and can occur in almost every stage of deep learning pipeline. Therefore, backdoor attack has attracted lots of interests from both academia and industry. However, most existing backdoor attack methods are either visible or fragile to some effortless pre-processing such as common data transformations. To address these limitations, we propose a robust and invisible backdoor attack called "Poison Ink". Concretely, we first leverage the image structures as target poisoning areas, and fill them with poison ink (information) to generate the trigger pattern. As the image structure can keep its semantic meaning during the data transformation, such trigger pattern is inherently robust to data transformations. Then we leverage a deep injection network to embed such trigger pattern into the cover image to achieve stealthiness. Compared to existing popular backdoor attack methods, Poison Ink outperforms both in stealthiness and robustness. Through extensive experiments, we demonstrate Poison Ink is not only general to different datasets and network architectures, but also flexible for different attack scenarios. Besides, it also has very strong resistance against many state-of-the-art defense techniques.

preprint2022arXiv

Protecting Celebrities from DeepFake with Identity Consistency Transformer

In this work we propose Identity Consistency Transformer, a novel face forgery detection method that focuses on high-level semantics, specifically identity information, and detecting a suspect face by finding identity inconsistency in inner and outer face regions. The Identity Consistency Transformer incorporates a consistency loss for identity consistency determination. We show that Identity Consistency Transformer exhibits superior generalization ability not only across different datasets but also across various types of image degradation forms found in real-world applications including deepfake videos. The Identity Consistency Transformer can be easily enhanced with additional identity information when such information is available, and for this reason it is especially well-suited for detecting face forgeries involving celebrities. Code will be released at \url{https://github.com/LightDXY/ICT_DeepFake}

preprint2022arXiv

Real-time Online Multi-Object Tracking in Compressed Domain

Recent online Multi-Object Tracking (MOT) methods have achieved desirable tracking performance. However, the tracking speed of most existing methods is rather slow. Inspired from the fact that the adjacent frames are highly relevant and redundant, we divide the frames into key and non-key frames respectively and track objects in the compressed domain. For the key frames, the RGB images are restored for detection and data association. To make data association more reliable, an appearance Convolutional Neural Network (CNN) which can be jointly trained with the detector is proposed. For the non-key frames, the objects are directly propagated by a tracking CNN based on the motion information provided in the compressed domain. Compared with the state-of-the-art online MOT methods,our tracker is about 6x faster while maintaining a comparable tracking performance.

preprint2022arXiv

Reduce Information Loss in Transformers for Pluralistic Image Inpainting

Transformers have achieved great success in pluralistic image inpainting recently. However, we find existing transformer based solutions regard each pixel as a token, thus suffer from information loss issue from two aspects: 1) They downsample the input image into much lower resolutions for efficiency consideration, incurring information loss and extra misalignment for the boundaries of masked regions. 2) They quantize $256^3$ RGB pixels to a small number (such as 512) of quantized pixels. The indices of quantized pixels are used as tokens for the inputs and prediction targets of transformer. Although an extra CNN network is used to upsample and refine the low-resolution results, it is difficult to retrieve the lost information back.To keep input information as much as possible, we propose a new transformer based framework "PUT". Specifically, to avoid input downsampling while maintaining the computation efficiency, we design a patch-based auto-encoder P-VQVAE, where the encoder converts the masked image into non-overlapped patch tokens and the decoder recovers the masked regions from inpainted tokens while keeping the unmasked regions unchanged. To eliminate the information loss caused by quantization, an Un-Quantized Transformer (UQ-Transformer) is applied, which directly takes the features from P-VQVAE encoder as input without quantization and regards the quantized tokens only as prediction targets. Extensive experiments show that PUT greatly outperforms state-of-the-art methods on image fidelity, especially for large masked regions and complex large-scale datasets. Code is available at https://github.com/liuqk3/PUT

preprint2022arXiv

Self-supervised Transformer for Deepfake Detection

The fast evolution and widespread of deepfake techniques in real-world scenarios require stronger generalization abilities of face forgery detectors. Some works capture the features that are unrelated to method-specific artifacts, such as clues of blending boundary, accumulated up-sampling, to strengthen the generalization ability. However, the effectiveness of these methods can be easily corrupted by post-processing operations such as compression. Inspired by transfer learning, neural networks pre-trained on other large-scale face-related tasks may provide useful features for deepfake detection. For example, lip movement has been proved to be a kind of robust and good-transferring highlevel semantic feature, which can be learned from the lipreading task. However, the existing method pre-trains the lip feature extraction model in a supervised manner, which requires plenty of human resources in data annotation and increases the difficulty of obtaining training data. In this paper, we propose a self-supervised transformer based audio-visual contrastive learning method. The proposed method learns mouth motion representations by encouraging the paired video and audio representations to be close while unpaired ones to be diverse. After pre-training with our method, the model will then be partially fine-tuned for deepfake detection task. Extensive experiments show that our self-supervised method performs comparably or even better than the supervised pre-training counterpart.

preprint2022arXiv

Shape-invariant 3D Adversarial Point Clouds

Adversary and invisibility are two fundamental but conflict characters of adversarial perturbations. Previous adversarial attacks on 3D point cloud recognition have often been criticized for their noticeable point outliers, since they just involve an "implicit constrain" like global distance loss in the time-consuming optimization to limit the generated noise. While point cloud is a highly structured data format, it is hard to constrain its perturbation with a simple loss or metric properly. In this paper, we propose a novel Point-Cloud Sensitivity Map to boost both the efficiency and imperceptibility of point perturbations. This map reveals the vulnerability of point cloud recognition models when encountering shape-invariant adversarial noises. These noises are designed along the shape surface with an "explicit constrain" instead of extra distance loss. Specifically, we first apply a reversible coordinate transformation on each point of the point cloud input, to reduce one degree of point freedom and limit its movement on the tangent plane. Then we calculate the best attacking direction with the gradients of the transformed point cloud obtained on the white-box model. Finally we assign each point with a non-negative score to construct the sensitivity map, which benefits both white-box adversarial invisibility and black-box query-efficiency extended in our work. Extensive evaluations prove that our method can achieve the superior performance on various point cloud recognition models, with its satisfying adversarial imperceptibility and strong resistance to different point cloud defense settings. Our code is available at: https://github.com/shikiw/SI-Adv.

preprint2022arXiv

Speech Pattern based Black-box Model Watermarking for Automatic Speech Recognition

As an effective method for intellectual property (IP) protection, model watermarking technology has been applied on a wide variety of deep neural networks (DNN), including speech classification models. However, how to design a black-box watermarking scheme for automatic speech recognition (ASR) models is still an unsolved problem, which is a significant demand for protecting remote ASR Application Programming Interface (API) deployed in cloud servers. Due to conditional independence assumption and label-detection-based evasion attack risk of ASR models, the black-box model watermarking scheme for speech classification models cannot apply to ASR models. In this paper, we propose the first black-box model watermarking framework for protecting the IP of ASR models. Specifically, we synthesize trigger audios by spreading the speech clips of model owners over the entire input audios and labeling the trigger audios with the stego texts, which hides the authorship information with linguistic steganography. Experiments on the state-of-the-art open-source ASR system DeepSpeech demonstrate the feasibility of the proposed watermarking scheme, which is robust against five kinds of attacks and has little impact on accuracy.

preprint2022arXiv

Towards Intrinsic Common Discriminative Features Learning for Face Forgery Detection using Adversarial Learning

Existing face forgery detection methods usually treat face forgery detection as a binary classification problem and adopt deep convolution neural networks to learn discriminative features. The ideal discriminative features should be only related to the real/fake labels of facial images. However, we observe that the features learned by vanilla classification networks are correlated to unnecessary properties, such as forgery methods and facial identities. Such phenomenon would limit forgery detection performance especially for the generalization ability. Motivated by this, we propose a novel method which utilizes adversarial learning to eliminate the negative effect of different forgery methods and facial identities, which helps classification network to learn intrinsic common discriminative features for face forgery detection. To leverage data lacking ground truth label of facial identities, we design a special identity discriminator based on similarity information derived from off-the-shelf face recognition model. With the help of adversarial learning, our face forgery detection model learns to extract common discriminative features through eliminating the effect of forgery methods and facial identities. Extensive experiments demonstrate the effectiveness of the proposed method under both intra-dataset and cross-dataset evaluation settings.

preprint2021arXiv

Adversarial Examples Detection beyond Image Space

Deep neural networks have been proved that they are vulnerable to adversarial examples, which are generated by adding human-imperceptible perturbations to images. To defend these adversarial examples, various detection based methods have been proposed. However, most of them perform poorly on detecting adversarial examples with extremely slight perturbations. By exploring these adversarial examples, we find that there exists compliance between perturbations and prediction confidence, which guides us to detect few-perturbation attacks from the aspect of prediction confidence. To detect both few-perturbation attacks and large-perturbation attacks, we propose a method beyond image space by a two-stream architecture, in which the image stream focuses on the pixel artifacts and the gradient stream copes with the confidence artifacts. The experimental results show that the proposed method outperforms the existing methods under oblivious attacks and is verified effective to defend omniscient attacks as well.

preprint2021arXiv

Deep Model Intellectual Property Protection via Deep Watermarking

Despite the tremendous success, deep neural networks are exposed to serious IP infringement risks. Given a target deep model, if the attacker knows its full information, it can be easily stolen by fine-tuning. Even if only its output is accessible, a surrogate model can be trained through student-teacher learning by generating many input-output training pairs. Therefore, deep model IP protection is important and necessary. However, it is still seriously under-researched. In this work, we propose a new model watermarking framework for protecting deep networks trained for low-level computer vision or image processing tasks. Specifically, a special task-agnostic barrier is added after the target model, which embeds a unified and invisible watermark into its outputs. When the attacker trains one surrogate model by using the input-output pairs of the barrier target model, the hidden watermark will be learned and extracted afterwards. To enable watermarks from binary bits to high-resolution images, a deep invisible watermarking mechanism is designed. By jointly training the target model and watermark embedding, the extra barrier can even be absorbed into the target model. Through extensive experiments, we demonstrate the robustness of the proposed framework, which can resist attacks with different network structures and objective functions.

preprint2021arXiv

Improved Image Matting via Real-time User Clicks and Uncertainty Estimation

Image matting is a fundamental and challenging problem in computer vision and graphics. Most existing matting methods leverage a user-supplied trimap as an auxiliary input to produce good alpha matte. However, obtaining high-quality trimap itself is arduous, thus restricting the application of these methods. Recently, some trimap-free methods have emerged, however, the matting quality is still far behind the trimap-based methods. The main reason is that, without the trimap guidance in some cases, the target network is ambiguous about which is the foreground target. In fact, choosing the foreground is a subjective procedure and depends on the user's intention. To this end, this paper proposes an improved deep image matting framework which is trimap-free and only needs several user click interactions to eliminate the ambiguity. Moreover, we introduce a new uncertainty estimation module that can predict which parts need polishing and a following local refinement module. Based on the computation budget, users can choose how many local parts to improve with the uncertainty guidance. Quantitative and qualitative results show that our method performs better than existing trimap-free methods and comparably to state-of-the-art trimap-based methods with minimal user effort.

preprint2021arXiv

Multi-attentional Deepfake Detection

Face forgery by deepfake is widely spread over the internet and has raised severe societal concerns. Recently, how to detect such forgery contents has become a hot research topic and many deepfake detection methods have been proposed. Most of them model deepfake detection as a vanilla binary classification problem, i.e, first use a backbone network to extract a global feature and then feed it into a binary classifier (real/fake). But since the difference between the real and fake images in this task is often subtle and local, we argue this vanilla solution is not optimal. In this paper, we instead formulate deepfake detection as a fine-grained classification problem and propose a new multi-attentional deepfake detection network. Specifically, it consists of three key components: 1) multiple spatial attention heads to make the network attend to different local parts; 2) textural feature enhancement block to zoom in the subtle artifacts in shallow features; 3) aggregate the low-level textural feature and high-level semantic features guided by the attention maps. Moreover, to address the learning difficulty of this network, we further introduce a new regional independence loss and an attention guided data augmentation strategy. Through extensive experiments on different datasets, we demonstrate the superiority of our method over the vanilla binary classifier counterparts, and achieve state-of-the-art performance.

preprint2021arXiv

Return-Based Contrastive Representation Learning for Reinforcement Learning

Recently, various auxiliary tasks have been proposed to accelerate representation learning and improve sample efficiency in deep reinforcement learning (RL). However, existing auxiliary tasks do not take the characteristics of RL problems into consideration and are unsupervised. By leveraging returns, the most important feedback signals in RL, we propose a novel auxiliary task that forces the learnt representations to discriminate state-action pairs with different returns. Our auxiliary loss is theoretically justified to learn representations that capture the structure of a new form of state-action abstraction, under which state-action pairs with similar return distributions are aggregated together. In low data regime, our algorithm outperforms strong baselines on complex tasks in Atari games and DeepMind Control suite, and achieves even better performance when combined with existing auxiliary tasks.

preprint2020arXiv

A Large Scale Urban Surveillance Video Dataset for Multiple-Object Tracking and Behavior Analysis

Multiple-object tracking and behavior analysis have been the essential parts of surveillance video analysis for public security and urban management. With billions of surveillance video captured all over the world, multiple-object tracking and behavior analysis by manual labor are cumbersome and cost expensive. Due to the rapid development of deep learning algorithms in recent years, automatic object tracking and behavior analysis put forward an urgent demand on a large scale well-annotated surveillance video dataset that can reflect the diverse, congested, and complicated scenarios in real applications. This paper introduces an urban surveillance video dataset (USVD) which is by far the largest and most comprehensive. The dataset consists of 16 scenes captured in 7 typical outdoor scenarios: street, crossroads, hospital entrance, school gate, park, pedestrian mall, and public square. Over 200k video frames are annotated carefully, resulting in more than 3:7 million object bounding boxes and about 7:1 thousand trajectories. We further use this dataset to evaluate the performance of typical algorithms for multiple-object tracking and anomaly behavior analysis and explore the robustness of these methods in urban congested scenarios.

preprint2020arXiv

An Enhanced Convolutional Neural Network in Side-Channel Attacks and Its Visualization

In recent years, the convolutional neural networks (CNNs) have received a lot of interest in the side-channel community. The previous work has shown that CNNs have the potential of breaking the cryptographic algorithm protected with masking or desynchronization. Before, several CNN models have been exploited, reaching the same or even better level of performance compared to the traditional side-channel attack (SCA). In this paper, we investigate the architecture of Residual Network and build a new CNN model called attention network. To enhance the power of the attention network, we introduce an attention mechanism - Convolutional Block Attention Module (CBAM) and incorporate CBAM into the CNN architecture. CBAM points out the informative points of the input traces and makes the attention network focus on the relevant leakages of the measurements. It is able to improve the performance of the CNNs. Because the irrelevant points will introduce the extra noises and cause a worse performance of attacks. We compare our attention network with the one designed for the masking AES implementation called ASCAD network in this paper. We show that the attention network has a better performance than the ASCAD network. Finally, a new visualization method, named Class Gradient Visualization (CGV) is proposed to recognize which points of the input traces have a positive influence on the predicted result of the neural networks. In another aspect, it can explain why the attention network is superior to the ASCAD network. We validate the attention network through extensive experiments on four public datasets and demonstrate that the attention network is efficient in different AES implementations.

preprint2020arXiv

Asynchronous Stochastic Gradient Descent with Delay Compensation

With the fast development of deep learning, it has become common to learn big neural networks using massive training data. Asynchronous Stochastic Gradient Descent (ASGD) is widely adopted to fulfill this task for its efficiency, which is, however, known to suffer from the problem of delayed gradients. That is, when a local worker adds its gradient to the global model, the global model may have been updated by other workers and this gradient becomes "delayed". We propose a novel technology to compensate this delay, so as to make the optimization behavior of ASGD closer to that of sequential SGD. This is achieved by leveraging Taylor expansion of the gradient function and efficient approximation to the Hessian matrix of the loss function. We call the new algorithm Delay Compensated ASGD (DC-ASGD). We evaluated the proposed algorithm on CIFAR-10 and ImageNet datasets, and the experimental results demonstrate that DC-ASGD outperforms both synchronous SGD and asynchronous SGD, and nearly approaches the performance of sequential SGD.

preprint2020arXiv

Cross-modality Person re-identification with Shared-Specific Feature Transfer

Cross-modality person re-identification (cm-ReID) is a challenging but key technology for intelligent video analysis. Existing works mainly focus on learning common representation by embedding different modalities into a same feature space. However, only learning the common characteristics means great information loss, lowering the upper bound of feature distinctiveness. In this paper, we tackle the above limitation by proposing a novel cross-modality shared-specific feature transfer algorithm (termed cm-SSFT) to explore the potential of both the modality-shared information and the modality-specific characteristics to boost the re-identification performance. We model the affinities of different modality samples according to the shared features and then transfer both shared and specific features among and across modalities. We also propose a complementary feature learning strategy including modality adaption, project adversarial learning and reconstruction enhancement to learn discriminative and complementary shared and specific features of each modality, respectively. The entire cm-SSFT algorithm can be trained in an end-to-end manner. We conducted comprehensive experiments to validate the superiority of the overall algorithm and the effectiveness of each component. The proposed algorithm significantly outperforms state-of-the-arts by 22.5% and 19.3% mAP on the two mainstream benchmark datasets SYSU-MM01 and RegDB, respectively.

preprint2020arXiv

Density-Aware Graph for Deep Semi-Supervised Visual Recognition

Semi-supervised learning (SSL) has been extensively studied to improve the generalization ability of deep neural networks for visual recognition. To involve the unlabelled data, most existing SSL methods are based on common density-based cluster assumption: samples lying in the same high-density region are likely to belong to the same class, including the methods performing consistency regularization or generating pseudo-labels for the unlabelled images. Despite their impressive performance, we argue three limitations exist: 1) Though the density information is demonstrated to be an important clue, they all use it in an implicit way and have not exploited it in depth. 2) For feature learning, they often learn the feature embedding based on the single data sample and ignore the neighborhood information. 3) For label-propagation based pseudo-label generation, it is often done offline and difficult to be end-to-end trained with feature learning. Motivated by these limitations, this paper proposes to solve the SSL problem by building a novel density-aware graph, based on which the neighborhood information can be easily leveraged and the feature learning and label propagation can also be trained in an end-to-end way. Specifically, we first propose a new Density-aware Neighborhood Aggregation(DNA) module to learn more discriminative features by incorporating the neighborhood information in a density-aware manner. Then a novel Density-ascending Path based Label Propagation(DPLP) module is proposed to generate the pseudo-labels for unlabeled samples more efficiently according to the feature distribution characterized by density. Finally, the DNA module and DPLP module evolve and improve each other end-to-end.

preprint2020arXiv

Enhancing the Performance of Practical Profiling Side-Channel Attacks Using Conditional Generative Adversarial Networks

Recently, many profiling side-channel attacks based on Machine Learning and Deep Learning have been proposed. Most of them focus on reducing the number of traces required for successful attacks by optimizing the modeling algorithms. In previous work, relatively sufficient traces need to be used for training a model. However, in the practical profiling phase, it is difficult or impossible to collect sufficient traces due to the constraint of various resources. In this case, the performance of profiling attacks is inefficient even if proper modeling algorithms are used. In this paper, the main problem we consider is how to conduct more efficient profiling attacks when sufficient profiling traces cannot be obtained. To deal with this problem, we first introduce the Conditional Generative Adversarial Network (CGAN) in the context of side-channel attacks. We show that CGAN can generate new traces to enlarge the size of the profiling set, which improves the performance of profiling attacks. For both unprotected and protected cryptographic algorithms, we find that CGAN can effectively learn the leakage of traces collected in their implementations. We also apply it to different modeling algorithms. In our experiments, the model constructed with the augmented profiling set can reduce the required attack traces by more than half, which means the generated traces can provide useful information as the real traces.

preprint2020arXiv

Model Watermarking for Image Processing Networks

Deep learning has achieved tremendous success in numerous industrial applications. As training a good model often needs massive high-quality data and computation resources, the learned models often have significant business values. However, these valuable deep models are exposed to a huge risk of infringements. For example, if the attacker has the full information of one target model including the network structure and weights, the model can be easily finetuned on new datasets. Even if the attacker can only access the output of the target model, he/she can still train another similar surrogate model by generating a large scale of input-output training pairs. How to protect the intellectual property of deep models is a very important but seriously under-researched problem. There are a few recent attempts at classification network protection only. In this paper, we propose the first model watermarking framework for protecting image processing models. To achieve this goal, we leverage the spatial invisible watermarking mechanism. Specifically, given a black-box target model, a unified and invisible watermark is hidden into its outputs, which can be regarded as a special task-agnostic barrier. In this way, when the attacker trains one surrogate model by using the input-output pairs of the target model, the hidden watermark will be learned and extracted afterward. To enable watermarks from binary bits to high-resolution images, both traditional and deep spatial invisible watermarking mechanism are considered. Experiments demonstrate the robustness of the proposed watermarking mechanism, which can resist surrogate models learned with different network structures and objective functions. Besides deep models, the proposed method is also easy to be extended to protect data and traditional image processing algorithms.

preprint2020arXiv

Rethinking Spatially-Adaptive Normalization

Spatially-adaptive normalization is remarkably successful recently in conditional semantic image synthesis, which modulates the normalized activation with spatially-varying transformations learned from semantic layouts, to preserve the semantic information from being washed away. Despite its impressive performance, a more thorough understanding of the true advantages inside the box is still highly demanded, to help reduce the significant computation and parameter overheads introduced by these new structures. In this paper, from a return-on-investment point of view, we present a deep analysis of the effectiveness of SPADE and observe that its advantages actually come mainly from its semantic-awareness rather than the spatial-adaptiveness. Inspired by this point, we propose class-adaptive normalization (CLADE), a lightweight variant that is not adaptive to spatial positions or layouts. Benefited from this design, CLADE greatly reduces the computation cost while still being able to preserve the semantic information during the generation. Extensive experiments on multiple challenging datasets demonstrate that while the resulting fidelity is on par with SPADE, its overhead is much cheaper than SPADE. Take the generator for ADE20k dataset as an example, the extra parameter and computation cost introduced by CLADE are only 4.57% and 0.07% while that of SPADE are 39.21% and 234.73% respectively.

preprint2020arXiv

Self-supervised Adversarial Training

Recent work has demonstrated that neural networks are vulnerable to adversarial examples. To escape from the predicament, many works try to harden the model in various ways, in which adversarial training is an effective way which learns robust feature representation so as to resist adversarial attacks. Meanwhile, the self-supervised learning aims to learn robust and semantic embedding from data itself. With these views, we introduce self-supervised learning to against adversarial examples in this paper. Specifically, the self-supervised representation coupled with k-Nearest Neighbour is proposed for classification. To further strengthen the defense ability, self-supervised adversarial training is proposed, which maximizes the mutual information between the representations of original examples and the corresponding adversarial examples. Experimental results show that the self-supervised representation outperforms its supervised version in respect of robustness and self-supervised adversarial training can further improve the defense ability efficiently.