Researcher profile

Nannan Wang

Nannan Wang contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
18 works
0 followers
4 topics
4 close collaborators

Published work

18 published items

preprint | 2026 | arXiv

Bridging Data Trials and Task Barriers: A Unified Framework for Sketch Biometric Identification

Unlike existing cross-modality identification tasks (e.g., heterogeneous face recognition and sketch re-identification), we introduce a novel yet practical setting for these related identification tasks, named sketch biometric identification, which aims to continually train a unified model across different data domains and even across diverse identification tasks. Sketch biometric identification faces several challenges, including scarce real sketch data, high annotation costs, privacy risks, and insufficient generalization ability of cross-task models. Existing methods usually rely on limited real data or single-task optimization, making it difficult to address the joint challenges of cross-modality and cross-task identification. This paper proposes a unified framework that integrates efficient synthetic sketch generation and task-sequential continual learning. First, we design an efficient pipeline to generate large-scale, high-quality synthetic person and face sketch data, which significantly reduces costs and avoids privacy risks. Meanwhile, we enhance the model's robustness by fusing real data. Second, we construct a unified framework for sketch biometric identification that adopts a task-sequential training strategy: the model first completes sketch person re-identification learning on the person dataset; subsequently, it maintains the acquired person recognition capability through a trusted sample replay technique and performs incremental training on the face dataset. This enables a single model to handle multiple sketch biometric identification tasks simultaneously. To support the study of sketch biometric identification, we built a new large-scale benchmark, SketchUnified-BioID, with several practical evaluation protocols.

preprint | 2026 | arXiv

Stabilizing, Scaling & Enhancing MeanFlow for Large-scale Diffusion Distillation

Diffusion models exhibit remarkable generative capability, but their high latency limits practical deployment. Many studies have attempted to reduce sampling steps to accelerate inference. Among them, MeanFlow has attracted considerable attention due to its concise formulation and remarkable performance. Nevertheless, the instability of its optimization objective and the "mean-seeking bias" have limited its applicability to distill large-scale industrial models. To stabilize MeanFlow for distilling large-scale models, we first introduce a warm-up technique, in which the original differential solution of MeanFlow is replaced by a discrete solution. This design avoids training collapse caused by the MeanFlow target containing a stop-gradient term from an undertrained model. Once the model acquires a preliminary ability to fit the average velocity field, we switch the optimization objective back to the differential solution, enabling further refinement. Meanwhile, to alleviate the "mean-seeking bias" of MeanFlow under extremely few-step inference with complex target distributions, we incorporate trajectory distribution alignment as an auxiliary objective, encouraging the student model's trajectory distribution to align more closely with that of the teacher model. Our proposed distillation framework achieves superior performance compared to existing distillation approaches when applied to the text-to-image (T2I) model FLUX.1-dev (up to 12B parameters). Furthermore, when extended to the 80B-parameter state-of-the-art (SOTA) T2I model HunyuanImage 3.0, our method continues to demonstrate robust generalization and strong performance.

preprint | 2022 | arXiv

Exploring Set Similarity for Dense Self-supervised Representation Learning

By considering the spatial correspondence, dense self-supervised representation learning has achieved superior performance on various dense prediction tasks. However, the pixel-level correspondence tends to be noisy because of many similar misleading pixels, e.g., backgrounds. To address this issue, in this paper, we propose to explore set similarity (SetSim) for dense self-supervised representation learning. We generalize pixel-wise similarity learning to a set-wise one to improve robustness, because sets contain more semantic and structural information. Specifically, by resorting to attentional features of views, we establish corresponding sets, thus filtering out noisy backgrounds that may cause incorrect correspondences. Meanwhile, these attentional features can keep the coherence of the same image across different views to alleviate semantic inconsistency. We further search the cross-view nearest neighbours of sets and employ the structured neighbourhood information to enhance the robustness. Empirical evaluations demonstrate that SetSim is superior to state-of-the-art methods on object detection, keypoint detection, instance segmentation, and semantic segmentation.

preprint | 2022 | arXiv

Improving Adversarial Robustness via Mutual Information Estimation

Deep neural networks (DNNs) are found to be vulnerable to adversarial noise. They are typically misled by adversarial samples to make wrong predictions. To alleviate this negative effect, in this paper, we investigate the dependence between outputs of the target model and input adversarial samples from the perspective of information theory, and propose an adversarial defense method. Specifically, we first measure the dependence by estimating the mutual information (MI) between outputs and the natural patterns of inputs (called natural MI) and MI between outputs and the adversarial patterns of inputs (called adversarial MI), respectively. We find that adversarial samples usually have larger adversarial MI and smaller natural MI compared with those w.r.t. natural samples. Motivated by this observation, we propose to enhance the adversarial robustness by maximizing the natural MI and minimizing the adversarial MI during the training process. In this way, the target model is expected to pay more attention to the natural pattern that contains objective semantics. Empirical evaluations demonstrate that our method could effectively improve the adversarial accuracy against multiple attacks.

preprint | 2022 | arXiv

Instance-Dependent Label-Noise Learning with Manifold-Regularized Transition Matrix Estimation

In label-noise learning, estimating the transition matrix has attracted increasing attention because the matrix plays an important role in building statistically consistent classifiers. However, it is very challenging to estimate the transition matrix T(x), where x denotes the instance, because it is unidentifiable under instance-dependent noise (IDN). To address this problem, we note that there is psychological and physiological evidence showing that humans are more likely to annotate instances of similar appearance to the same classes, and thus poor-quality or ambiguous instances of similar appearance are more easily mislabeled to the correlated or same noisy classes. Therefore, we propose an assumption on the geometry of T(x): "the closer two instances are, the more similar their corresponding transition matrices should be". More specifically, we formulate this assumption as a manifold embedding to effectively reduce the degrees of freedom of T(x) and make it stably estimable in practice. The proposed manifold-regularized technique works by directly reducing the estimation error without hurting the approximation error of the estimation problem for T(x). Experimental evaluations on four synthetic and two real-world datasets demonstrate that our method is superior to state-of-the-art approaches for label-noise learning under the challenging IDN.
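
The geometric assumption in this abstract ("the closer two instances are, the more similar their corresponding transition matrices should be") can be illustrated with a pairwise smoothness penalty. The sketch below is not the paper's formulation: the Gaussian affinity, the `bandwidth` parameter, the squared Frobenius distance, and all function names are assumptions chosen only to make the idea concrete.

```python
import math


def gaussian_affinity(x_i, x_j, bandwidth=1.0):
    """Weight in (0, 1] that shrinks as two feature vectors move apart (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def frobenius_sq(t_i, t_j):
    """Squared Frobenius distance between two transition matrices given as nested lists."""
    return sum(
        (a - b) ** 2
        for row_i, row_j in zip(t_i, t_j)
        for a, b in zip(row_i, row_j)
    )


def manifold_penalty(features, transitions, bandwidth=1.0):
    """Sum over instance pairs of affinity(x_i, x_j) * ||T(x_i) - T(x_j)||_F^2.

    Nearby instances with very different transition matrices are penalized
    heavily; distant instances contribute almost nothing.
    """
    total = 0.0
    for i in range(len(features)):
        for j in range(i + 1, len(features)):
            weight = gaussian_affinity(features[i], features[j], bandwidth)
            total += weight * frobenius_sq(transitions[i], transitions[j])
    return total
```

A term of this kind would be added to a training loss so that the per-instance matrices vary smoothly over the feature space, which is one way to read the abstract's claim about reducing the degrees of freedom of T(x).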

preprint | 2022 | arXiv

Modeling Adversarial Noise for Adversarial Training

Deep neural networks have been demonstrated to be vulnerable to adversarial noise, promoting the development of defenses against adversarial attacks. Motivated by the fact that adversarial noise contains well-generalizing features and that the relationship between adversarial data and natural data can help infer natural data and make reliable predictions, in this paper, we study how to model adversarial noise by learning the transition relationship between adversarial labels (i.e., the flipped labels used to generate adversarial data) and natural labels (i.e., the ground truth labels of the natural data). Specifically, we introduce an instance-dependent transition matrix to relate adversarial labels and natural labels, which can be seamlessly embedded with the target model (enabling us to model stronger adaptive adversarial noise). Empirical evaluations demonstrate that our method could effectively improve adversarial accuracy.

preprint | 2022 | arXiv

Robust Single Image Dehazing Based on Consistent and Contrast-Assisted Reconstruction

Single image dehazing, as a fundamental low-level vision task, is essential for the development of robust intelligent surveillance systems. In this paper, we make an early effort to consider dehazing robustness under variational haze density, which is a realistic yet under-studied problem in the research field of single image dehazing. To properly address this problem, we propose a novel density-variational learning framework that improves the robustness of the image dehazing model, assisted by a variety of negative hazy images, to better deal with various complex hazy scenarios. Specifically, the dehazing network is optimized under the consistency-regularized framework with the proposed Contrast-Assisted Reconstruction Loss (CARL). CARL can fully exploit the negative information to facilitate the traditional positive-oriented dehazing objective function by squeezing the dehazed image toward its clean target from different directions. Meanwhile, the consistency regularization keeps outputs consistent given multi-level hazy images, thus improving model robustness. Extensive experimental results on two synthetic and three real-world datasets demonstrate that our method significantly surpasses the state-of-the-art approaches.

preprint | 2022 | arXiv

Semi-parametric Makeup Transfer via Semantic-aware Correspondence

The large discrepancy between the source non-makeup image and the reference makeup image is one of the key challenges in makeup transfer. Conventional approaches for makeup transfer either learn disentangled representation or perform pixel-wise correspondence in a parametric way between two images. We argue that non-parametric techniques have a high potential for addressing the pose, expression, and occlusion discrepancies. To this end, this paper proposes a Semi-parametric Makeup Transfer (SpMT) method, which combines the reciprocal strengths of non-parametric and parametric mechanisms. The non-parametric component is a novel Semantic-aware Correspondence (SaC) module that explicitly reconstructs content representation with makeup representation under the strong constraint of component semantics. The reconstructed representation is desired to preserve the spatial and identity information of the source image while "wearing" the makeup of the reference image. The output image is synthesized via a parametric decoder that draws on the reconstructed representation. Extensive experiments demonstrate the superiority of our method in terms of visual quality, robustness, and flexibility. Code and pre-trained model are available at https://github.com/AnonymScholar/SpMT.

preprint | 2022 | arXiv

Spatial-Temporal Frequency Forgery Clue for Video Forgery Detection in VIS and NIR Scenario

In recent years, with the rapid development of face editing and generation, more and more fake videos are circulating on social media, which has caused extreme public concern. Existing face forgery detection methods based on the frequency domain find that GAN-forged images have obvious grid-like visual artifacts in the frequency spectrum compared to real images. For synthesized videos, however, these methods are confined to single frames and pay little attention to the most discriminative parts and the temporal frequency clues across different frames. To take full advantage of the rich information in video sequences, this paper performs video forgery detection in both the spatial and temporal frequency domains and proposes a Discrete Cosine Transform-based Forgery Clue Augmentation Network (FCAN-DCT) to achieve a more comprehensive spatial-temporal feature representation. FCAN-DCT consists of a backbone network and two branches: a Compact Feature Extraction (CFE) module and a Frequency Temporal Attention (FTA) module. We conduct thorough experimental assessments on two visible light (VIS) based datasets, WildDeepfake and Celeb-DF (v2), and on our self-built video forgery dataset DeepfakeNIR, which is the first video forgery dataset in the near-infrared modality. The experimental results demonstrate the effectiveness of our method in detecting forged videos in both VIS and NIR scenarios.
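
To make the notion of a temporal frequency clue concrete, the sketch below applies an unnormalized one-dimensional DCT-II to the intensity of a single pixel tracked across frames. It illustrates only the transform itself, not FCAN-DCT or its modules; the function names and the nested-list frame layout are assumptions made for this example.

```python
import math


def dct_ii(signal):
    """Unnormalized DCT-II: X[k] = sum_t x[t] * cos(pi / N * (t + 0.5) * k)."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi / n * (t + 0.5) * k) for t, x in enumerate(signal))
        for k in range(n)
    ]


def temporal_dct(frames, row, col):
    """DCT over time of one pixel location; frames is a list of 2D nested lists."""
    series = [frame[row][col] for frame in frames]
    return dct_ii(series)
```

For a pixel whose intensity stays constant over time, all the energy lands in the first coefficient; frame-to-frame flicker shows up as energy in the higher coefficients, which is the kind of signal a temporal frequency branch could attend to.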

preprint | 2022 | arXiv

Towards Semi-Supervised Deep Facial Expression Recognition with An Adaptive Confidence Margin

Only parts of unlabeled data are selected to train models for most semi-supervised learning methods, whose confidence scores are usually higher than the pre-defined threshold (i.e., the confidence margin). We argue that the recognition performance should be further improved by making full use of all unlabeled data. In this paper, we learn an Adaptive Confidence Margin (Ada-CM) to fully leverage all unlabeled data for semi-supervised deep facial expression recognition. All unlabeled samples are partitioned into two subsets by comparing their confidence scores with the adaptively learned confidence margin at each training epoch: (1) subset I including samples whose confidence scores are no lower than the margin; (2) subset II including samples whose confidence scores are lower than the margin. For samples in subset I, we constrain their predictions to match pseudo labels. Meanwhile, samples in subset II participate in the feature-level contrastive objective to learn effective facial expression features. We extensively evaluate Ada-CM on four challenging datasets, showing that our method achieves state-of-the-art performance, especially surpassing fully-supervised baselines in a semi-supervised manner. Ablation study further proves the effectiveness of our method. The source code is available at https://github.com/hangyu94/Ada-CM.
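
The partition step described above can be sketched as follows. This is a simplified illustration, not the Ada-CM implementation: the margin is passed in as a plain number rather than learned, the confidence score is taken to be the largest class probability, and the function name is hypothetical.

```python
def partition_by_margin(probabilities, margin):
    """Split unlabeled samples by comparing confidence with a margin.

    probabilities: list of per-sample class-probability lists.
    margin: scalar threshold (learned adaptively in the paper, fixed here).

    Returns (subset_one, subset_two):
      subset_one holds (sample_index, pseudo_label) pairs whose confidence
      is no lower than the margin; subset_two holds the remaining indices.
    """
    subset_one, subset_two = [], []
    for index, probs in enumerate(probabilities):
        confidence = max(probs)
        if confidence >= margin:
            subset_one.append((index, probs.index(confidence)))
        else:
            subset_two.append(index)
    return subset_one, subset_two
```

Under this reading, samples in the first subset would be trained against their pseudo labels, while samples in the second subset would feed a feature-level contrastive objective, so no unlabeled sample is discarded.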

preprint | 2022 | arXiv

TransFA: Transformer-based Representation for Face Attribute Evaluation

Face attribute evaluation plays an important role in video surveillance and face analysis. Although methods based on convolutional neural networks have made great progress, they inevitably deal with only one local neighborhood at a time with convolutions. Besides, existing methods mostly regard face attribute evaluation as an individual multi-label classification task, ignoring the inherent relationship between semantic attributes and face identity information. In this paper, we propose a novel transformer-based representation for face attribute evaluation method (TransFA), which could effectively enhance the attribute discriminative representation learning in the context of the attention mechanism. A multi-branch transformer is employed to explore the inter-correlation between different attributes in similar semantic regions for attribute feature learning. Specifically, a hierarchical identity-constraint attribute loss is designed to train the end-to-end architecture, which could further integrate face identity discriminative information to boost performance. Experimental results on multiple face attribute benchmarks demonstrate that the proposed TransFA achieves superior performance compared with state-of-the-art methods.

preprint | 2021 | arXiv

Hybrid Dynamic Contrast and Probability Distillation for Unsupervised Person Re-Id

Unsupervised person re-identification (Re-Id) has attracted increasing attention due to its practical application in real-world video surveillance systems. Traditional unsupervised Re-Id methods are mostly based on alternating between clustering and fine-tuning with classification or metric learning objectives on the grouped clusters. However, since person Re-Id is an open-set problem, clustering-based methods often leave out many outlier instances or group instances into the wrong clusters, and thus cannot make full use of the training samples as a whole. To solve these problems, we present the hybrid dynamic cluster contrast and probability distillation algorithm. It formulates the unsupervised Re-Id problem as a unified local-to-global dynamic contrastive learning and self-supervised probability distillation framework. Specifically, the proposed method makes the most of the self-supervised signals of all the clustered and un-clustered instances, from both the instances' self-contrastive level and the probability distillation perspective, in a memory-based non-parametric manner. Besides, the proposed hybrid local-to-global contrastive learning can take full advantage of the informative and valuable training examples for effective and robust training. Extensive experimental results show that the proposed method achieves superior performance to state-of-the-art methods under both the purely unsupervised and unsupervised domain adaptation experimental settings.

preprint | 2020 | arXiv

Facial Attribute Capsules for Noise Face Super Resolution

Existing face super-resolution (SR) methods mainly assume the input image to be noise-free. Their performance degrades drastically when applied to real-world scenarios where the input image is always contaminated by noise. In this paper, we propose a Facial Attribute Capsules Network (FACN) to deal with the problem of high-scale super-resolution of noisy face images. A capsule is a group of neurons whose activity vector models different properties of the same entity. Inspired by the concept of the capsule, we propose an integrated representation model of facial information, named the Facial Attribute Capsule (FAC). In the SR process, we first generate a group of FACs from the input LR face and then reconstruct the HR face from this group of FACs. To effectively improve the robustness of FACs to noise, we generate FACs in semantic, probabilistic and facial-attribute manners by means of an integrated learning strategy. Each FAC can be divided into two sub-capsules: a Semantic Capsule (SC) and a Probabilistic Capsule (PC). They describe an explicit facial attribute in detail from the two aspects of semantic representation and probability distribution. The group of FACs models an image as a combination of facial attribute information in the semantic space and probabilistic space in an attribute-disentangling way. The diverse FACs can better combine the face prior information to generate face images with fine-grained semantic attributes. Extensive benchmark experiments show that our method achieves superior hallucination results and outperforms the state of the art for very low resolution (LR) noisy face image super-resolution.

preprint | 2020 | arXiv

Multi-Class Classification from Noisy-Similarity-Labeled Data

A similarity label indicates whether two instances belong to the same class while a class label shows the class of the instance. Without class labels, a multi-class classifier could be learned from similarity-labeled pairwise data by meta classification learning. However, since the similarity label is less informative than the class label, it is more likely to be noisy. Deep neural networks can easily remember noisy data, leading to overfitting in classification. In this paper, we propose a method for learning from only noisy-similarity-labeled data. Specifically, to model the noise, we employ a noise transition matrix to bridge the class-posterior probability between clean and noisy data. We further estimate the transition matrix from only noisy data and build a novel learning system to learn a classifier which can assign noise-free class labels for instances. Moreover, we theoretically justify how our proposed method generalizes for learning classifiers. Experimental results demonstrate the superiority of the proposed method over the state-of-the-art method on benchmark-simulated and real-world noisy-label datasets.
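
The role of a noise transition matrix can be shown with a small worked example. The sketch below uses the standard label-noise convention in which `transition[i][j]` is the probability that a clean label i is observed as noisy label j, and treats the probability that two instances share a class as the inner product of their class posteriors. Both conventions and all names are assumptions for illustration; the paper's exact formulation may differ.

```python
def noisy_posterior(clean_posterior, transition):
    """Map a clean class posterior to a noisy one: q[j] = sum_i p[i] * T[i][j]."""
    num_classes = len(transition)
    return [
        sum(clean_posterior[i] * transition[i][j] for i in range(num_classes))
        for j in range(num_classes)
    ]


def similarity_probability(posterior_a, posterior_b):
    """Probability that two instances share a class, assuming independent labels."""
    return sum(a * b for a, b in zip(posterior_a, posterior_b))


# Assumed two-class example: class 0 flips to class 1 with probability 0.2,
# class 1 flips to class 0 with probability 0.3.
transition = [[0.8, 0.2], [0.3, 0.7]]
clean = [1.0, 0.0]
noisy = noisy_posterior(clean, transition)
pair_similarity = similarity_probability(noisy, noisy)
```

Training against noisy similarity labels through a mapping of this kind lets the underlying classifier keep predicting clean class posteriors, which is the motivation the abstract gives for estimating the transition matrix.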

preprint | 2020 | arXiv

Multi-Margin based Decorrelation Learning for Heterogeneous Face Recognition

Heterogeneous face recognition (HFR) refers to matching face images acquired from different domains with wide applications in security scenarios. This paper presents a deep neural network approach, named Multi-Margin based Decorrelation Learning (MMDL), to extract decorrelation representations in a hyperspherical space for cross-domain face images. The proposed framework can be divided into two components: a heterogeneous representation network and decorrelation representation learning. First, we employ a large number of accessible visual face images to train the heterogeneous representation network. The decorrelation layer projects the output of the first component into a decorrelation latent subspace and obtains the decorrelation representation. In addition, we design a multi-margin loss (MML), which consists of a quadruplet margin loss (QML) and a heterogeneous angular margin loss (HAML), to constrain the proposed framework. Experimental results on two challenging heterogeneous face databases show that our approach achieves superior performance on both verification and recognition tasks compared with state-of-the-art methods.

preprint | 2020 | arXiv

Performance Analysis of RSU-based Multihomed Multilane Vehicular Networks

Motivated by the potentially high downlink traffic demands of commuters in future autonomous vehicles, we study a network architecture where vehicles use Vehicle-to-Vehicle (V2V) links to form relay network clusters, which in turn use Vehicle-to-Infrastructure (V2I) links to connect to one or more Road Side Units (RSUs). Such cluster-based multihoming offers improved performance, e.g., in coverage and per user shared rate, but depends on the penetration of V2V-V2I capable vehicles and possible blockage, by legacy vehicles, of line of sight based V2V links, such as those based on millimeter-wave and visible light technologies. This paper provides a performance analysis of a typical vehicle's connectivity and throughput on a highway in the free-flow regime, exploring its dependence on vehicle density, sensitivity to blockages, number of lanes and heterogeneity across lanes. The results show that, even with moderate vehicle densities and penetration of V2V-V2I capable vehicles, such architectures can achieve substantial improvements in connectivity and reduction in per-user rate variability as compared to V2I based networks. The typical vehicle's performance is also shown to improve considerably in the multilane highway setting as compared to a single lane road. This paper also sheds light on how the network performance is affected when vehicles can control their relative positions, by characterizing the connectivity-throughput tradeoff faced by the clusters of vehicles.

preprint | 2020 | arXiv

Video Face Super-Resolution with Motion-Adaptive Feedback Cell

Video super-resolution (VSR) methods have recently achieved remarkable success due to the development of deep convolutional neural networks (CNNs). Current state-of-the-art CNN methods usually treat the VSR problem as a large number of separate multi-frame super-resolution tasks, in which a batch of low-resolution (LR) frames is used to generate a single high-resolution (HR) frame, and running a sliding window to select LR frames over the entire video yields a series of HR frames. However, due to the complex temporal dependency between frames, as the number of LR input frames increases, the quality of the reconstructed HR frames becomes worse. The reason is that these methods lack the ability to model complex temporal dependencies and struggle to give accurate motion estimation and compensation for the VSR process, which makes the performance degrade drastically when the motion in frames is complex. In this paper, we propose a Motion-Adaptive Feedback Cell (MAFC), a simple but effective block, which can efficiently capture the motion compensation and feed it back to the network in an adaptive way. Our approach efficiently utilizes the inter-frame motion information, so the dependence of the network on a motion estimation and compensation method can be avoided. In addition, benefiting from the properties of MAFC, the network can achieve better performance in extremely complex motion scenarios. Extensive evaluations and comparisons validate the strengths of our approach, and the experimental results demonstrate that the proposed framework outperforms state-of-the-art methods.
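
The sliding-window baseline that this abstract contrasts itself with can be sketched as an index computation: each output frame is reconstructed from a fixed-size window of neighbouring LR frames. The function name, the odd window size, and the boundary handling (repeating the first or last frame) are assumptions for illustration and do not describe MAFC itself.

```python
def sliding_windows(num_frames, window_size):
    """Return, for each target frame, the LR frame indices used to reconstruct it.

    window_size must be odd so the target frame sits in the centre.
    Indices that fall outside the video are clamped to the nearest valid
    frame (an assumed boundary rule).
    """
    if window_size % 2 == 0:
        raise ValueError("window_size must be odd")
    radius = window_size // 2
    windows = []
    for center in range(num_frames):
        window = [
            min(max(center + offset, 0), num_frames - 1)
            for offset in range(-radius, radius + 1)
        ]
        windows.append(window)
    return windows
```

Each window would be passed to a multi-frame SR network to produce one HR frame, so a video of N frames yields N independent reconstructions, which is the setup the abstract describes as struggling with complex temporal dependencies.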

preprint | 2020 | arXiv

Weakly Supervised Temporal Action Localization with Segment-Level Labels

Temporal action localization presents a trade-off between test performance and annotation-time cost. Fully supervised methods achieve good performance with time-consuming boundary annotations. Weakly supervised methods with cheaper video-level category label annotations result in worse performance. In this paper, we introduce a new segment-level supervision setting: segments are labeled when annotators observe actions happening in them. We incorporate this segment-level supervision along with a novel localization module in training. Specifically, we devise a partial segment loss, regarded as a loss sampling, to learn integral action parts from labeled segments. Since the labeled segments are only parts of actions, the model tends to overfit as training proceeds. To tackle this problem, we first obtain a similarity matrix from discriminative features guided by a sphere loss. Then, a propagation loss is devised based on the matrix to act as a regularization term, allowing implicit unlabeled segments propagation during training. Experiments validate that our method can outperform video-level supervision methods with almost the same annotation time.