Researcher profile

Lizhuang Ma

Lizhuang Ma contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
25 works
0 followers
7 topics
4 close collaborators

Published work

25 published item(s)

preprint · 2026 · arXiv

From Physical Degradation Models to Task-Aware All-in-One Image Restoration

All-in-one image restoration aims to adaptively handle multiple restoration tasks with a single trained model. Although existing methods achieve promising results by introducing prompt information or leveraging large models, the added learning modules increase system complexity and hinder real-time applicability. In this paper, we adopt a physical degradation modeling perspective and predict a task-aware inverse degradation operator for efficient all-in-one image restoration. The framework consists of two stages. In the first stage, the predicted inverse operator produces an initial restored image together with an uncertainty perception map that highlights regions difficult to reconstruct, ensuring restoration reliability. In the second stage, the restoration is further refined under the guidance of this uncertainty map. The same inverse operator prediction network is used in both stages, with task-aware parameters introduced after operator prediction to adapt to different degradation tasks. Moreover, by accelerating the convolution of the inverse operator, the proposed method achieves efficient all-in-one image restoration. The resulting tightly integrated architecture, termed OPIR, is extensively validated through experiments, demonstrating superior all-in-one restoration performance while remaining highly competitive on task-aligned restoration.

preprint · 2026 · arXiv

Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation

The landscape of joint audio and video generation has been fundamentally transformed by the advent of powerful foundation models. Despite these strides, achieving cohesive multimodal customization for the simultaneous preservation of visual identities and vocal timbres across multiple interacting subjects remains largely underexplored. To bridge this gap, we present Omni-Customizer, an end-to-end framework targeted at the precise binding and seamless fusion of multimodal identity information. Specifically, we introduce an Omni-Context Fusion (OCF) module that effectively enriches the base textual prompt with dense, multimodal identity cues, along with a Masked TTS Cross-Attention (MTP-CA) mechanism explicitly designed to prevent the severe "speech leakage" problem. Within this architecture, we propose Semantic-Anchored Multimodal RoPE (SA-MRoPE) to anchor visual and audio reference tokens, along with TTS embeddings, to their corresponding semantic descriptions, enabling structured multimodal fusion and robust identity binding. Furthermore, we devise a comprehensive training strategy that incorporates interleaved audio-video scheduling to rapidly adapt the audio branch to multilingual scenarios without degrading foundational priors, and a progressive in-pair to cross-pair curriculum to facilitate the learning of high-level and robust identity features. Extensive experiments demonstrate that Omni-Customizer achieves state-of-the-art performance in dual-modal customized generation, excelling across visual identity similarity, timbre consistency, precise audio-video synchronization, and overall video-audio fidelity.

preprint · 2023 · arXiv

DCS-RISR: Dynamic Channel Splitting for Efficient Real-world Image Super-Resolution

Real-world image super-resolution (RISR) has received increased focus for improving the quality of SR images under unknown complex degradation. Existing methods rely on heavy SR models to enhance low-resolution (LR) images of different degradation levels, which significantly restricts their practical deployment on resource-limited devices. In this paper, we propose a novel Dynamic Channel Splitting scheme for efficient Real-world Image Super-Resolution, termed DCS-RISR. Specifically, we first introduce a light degradation prediction network to regress the degradation vector to simulate the real-world degradations, upon which the channel splitting vector is generated as the input for an efficient SR model. Then, a learnable octave convolution block is proposed to adaptively decide the channel splitting scale for low- and high-frequency features at each block, reducing computation overhead and memory cost by offering the large scale to low-frequency features and the small scale to the high-frequency ones. To further improve the RISR performance, non-local regularization is employed to supplement the knowledge of patches from the LR and HR subspaces with computation-free inference. Extensive experiments demonstrate the effectiveness of DCS-RISR on different benchmark datasets. Our DCS-RISR not only achieves the best trade-off between computation/parameters and PSNR/SSIM metrics, but also effectively handles real-world images with different degradation levels.

preprint · 2022 · arXiv

Canonical Voting: Towards Robust Oriented Bounding Box Detection in 3D Scenes

3D object detection has attracted much attention thanks to the advances in sensors and deep learning methods for point clouds. Current state-of-the-art methods like VoteNet regress direct offsets towards object centers and box orientations with an additional Multi-Layer-Perceptron network. Neither their offset nor their orientation predictions are accurate, due to the fundamental difficulty in rotation classification. In this work, we disentangle the direct offset into Local Canonical Coordinates (LCC), box scales and box orientations. Only LCC and box scales are regressed, while box orientations are generated by a canonical voting scheme. Finally, an LCC-aware back-projection checking algorithm iteratively cuts out bounding boxes from the generated vote maps, with the elimination of false positives. Our model achieves state-of-the-art performance on three standard real-world benchmarks: ScanNet, SceneNN and SUN RGB-D. Our code is available at https://github.com/qq456cvb/CanonicalVoting.

preprint · 2022 · arXiv

DMT: Dynamic Mutual Training for Semi-Supervised Learning

Recent semi-supervised learning methods use pseudo supervision as a core idea, especially self-training methods that generate pseudo labels. However, pseudo labels are unreliable. Self-training methods usually rely on single-model prediction confidence to filter low-confidence pseudo labels, thus retaining high-confidence errors and wasting many low-confidence correct labels. In this paper, we point out that it is difficult for a model to counter its own errors. Instead, leveraging inter-model disagreement between different models is key to locating pseudo label errors. With this new viewpoint, we propose mutual training between two different models by a dynamically re-weighted loss function, called Dynamic Mutual Training (DMT). We quantify inter-model disagreement by comparing predictions from two different models to dynamically re-weight the loss in training, where a larger disagreement indicates a possible error and corresponds to a lower loss value. Extensive experiments show that DMT achieves state-of-the-art performance in both image classification and semantic segmentation. Our code is released at https://github.com/voldemortX/DST-CBC.
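The re-weighting idea in this abstract can be illustrated with a small sketch. It is not the formula from the paper: the weighting rule (the student's probability for the peer's pseudo label, raised to a hypothetical exponent `gamma`) and all function names are assumptions chosen only to show how inter-model disagreement can shrink a sample's contribution to the loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reweighted_loss(student_logits, peer_logits, gamma=2.0):
    """Cross-entropy on the peer's pseudo label, scaled by an agreement weight.

    Assumption (not taken from the paper): the weight is the student's own
    probability for the peer's pseudo label, raised to `gamma`, so a larger
    disagreement yields a smaller weight.
    """
    peer_probs = softmax(peer_logits)
    # The peer model supplies the pseudo label (its most likely class).
    pseudo_label = max(range(len(peer_probs)), key=peer_probs.__getitem__)

    student_probs = softmax(student_logits)
    # In a real training loop this weight would typically be treated as a
    # constant, so that no gradient flows through it.
    weight = student_probs[pseudo_label] ** gamma
    cross_entropy = -math.log(student_probs[pseudo_label])
    return weight * cross_entropy, weight


if __name__ == "__main__":
    # The two models agree on class 0: the weight stays close to 1.
    print(reweighted_loss([4.0, 0.0, 0.0], [3.0, 0.0, 0.0]))
    # The student prefers class 1 while the peer says class 0: tiny weight.
    print(reweighted_loss([0.0, 4.0, 0.0], [3.0, 0.0, 0.0]))
```

In this sketch a sample on which the two models disagree is almost ignored, while a sample on which they agree keeps nearly its full cross-entropy term.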

preprint · 2022 · arXiv

Domain Adaptive Semantic Segmentation via Regional Contrastive Consistency Regularization

Unsupervised domain adaptation (UDA) for semantic segmentation has been well studied in recent years. However, most existing works largely neglect the local regional consistency across different domains and are less robust to changes in outdoor environments. In this paper, we propose a novel and fully end-to-end trainable approach, called regional contrastive consistency regularization (RCCR), for domain adaptive semantic segmentation. Our core idea is to pull similar regional features extracted from the same location of different images, i.e., the original image and the augmented image, closer together, and meanwhile push the features from different locations of the two images apart. We propose a region-wise contrastive loss with two sampling strategies to realize effective regional consistency. Besides, we present momentum projection heads, where the teacher projection head is the exponential moving average of the student. Finally, a memory bank mechanism is designed to learn more robust and stable region-wise features under varying environments. Extensive experiments on two common UDA benchmarks, i.e., GTAV to Cityscapes and SYNTHIA to Cityscapes, demonstrate that our approach outperforms the state-of-the-art methods.
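As a rough illustration of the two mechanisms named in this abstract, the sketch below computes an InfoNCE-style loss in which region i of one view is the positive for region i of the other view, and applies an exponential moving average (EMA) update to a teacher's parameters. The function names, the temperature and the momentum value are assumptions; the paper's sampling strategies and memory bank are not modeled here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def region_contrastive_loss(student_regions, teacher_regions, temperature=0.1):
    """InfoNCE over regions.

    Region i from the student view is pulled towards region i from the
    teacher view (same location) and pushed away from all other regions.
    """
    s = [normalize(v) for v in student_regions]
    t = [normalize(v) for v in teacher_regions]
    total = 0.0
    for i, query in enumerate(s):
        logits = [dot(query, key) / temperature for key in t]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]  # -log softmax at the positive
    return total / len(s)


def ema_update(teacher_params, student_params, momentum=0.99):
    """Teacher parameters follow the student as an exponential moving average."""
    return [momentum * tp + (1.0 - momentum) * sp
            for tp, sp in zip(teacher_params, student_params)]


if __name__ == "__main__":
    view_a = [[1.0, 0.0], [0.0, 1.0]]
    print(region_contrastive_loss(view_a, view_a))         # aligned regions
    print(region_contrastive_loss(view_a, view_a[::-1]))   # misaligned regions
    print(ema_update([1.0, 1.0], [0.0, 2.0], momentum=0.9))
```

Aligned regions give a loss near zero, while swapping the regions of one view makes the loss large, which is the behavior the consistency term is meant to penalize.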

preprint · 2022 · arXiv

LAKe-Net: Topology-Aware Point Cloud Completion by Localizing Aligned Keypoints

Point cloud completion aims at completing geometric and topological shapes from a partial observation. However, when some topology of the original shape is missing, existing methods directly predict the location of complete points without predicting structured and topological information of the complete shape, which leads to inferior performance. To better tackle the missing topology, we propose LAKe-Net, a novel topology-aware point cloud completion model that localizes aligned keypoints and follows a novel Keypoints-Skeleton-Shape prediction manner. Specifically, our method completes missing topology using three steps: 1) Aligned Keypoint Localization. An asymmetric keypoint locator, including an unsupervised multi-scale keypoint detector and a complete keypoint generator, is proposed for localizing aligned keypoints from complete and partial point clouds. We theoretically prove that the detector can capture aligned keypoints for objects within a sub-category. 2) Surface-skeleton Generation. A new type of skeleton, named Surface-skeleton, is generated from keypoints based on geometric priors to fully represent the topological information captured from keypoints and better recover the local details. 3) Shape Refinement. We design a refinement subnet where multi-scale surface-skeletons are fed into each recursive skeleton-assisted refinement module to assist the completion process. Experimental results show that our method achieves state-of-the-art performance on point cloud completion.

preprint · 2022 · arXiv

MISSU: 3D Medical Image Segmentation via Self-distilling TransUNet

U-Nets have achieved tremendous success in medical image segmentation. Nevertheless, they may suffer limitations in global (long-range) contextual interactions and edge-detail preservation. In contrast, the Transformer has an excellent ability to capture long-range dependencies by leveraging the self-attention mechanism in the encoder. Although the Transformer was born to model the long-range dependency on the extracted feature maps, it still suffers from extreme computational and spatial complexities in processing high-resolution 3D feature maps. This motivates us to design an efficient Transformer-based UNet model and study the feasibility of Transformer-based network architectures for medical image segmentation tasks. To this end, we propose to self-distill a Transformer-based UNet for medical image segmentation, which simultaneously learns global semantic information and local spatial-detailed features. Meanwhile, a local multi-scale fusion block is first proposed to refine fine-grained details from the skipped connections in the encoder by the main CNN stem through self-distillation, only computed during training and removed at inference with minimal overhead. Extensive experiments on BraTS 2019 and CHAOS datasets show that our MISSU achieves the best performance over previous state-of-the-art methods. Code and models are available at https://github.com/wangn123/MISSU.git.

preprint · 2022 · arXiv

Night-time Scene Parsing with a Large Real Dataset

Although huge progress has been made on scene analysis in recent years, most existing works assume the input images to be in day-time with good lighting conditions. In this work, we aim to address the night-time scene parsing (NTSP) problem, which has two main challenges: 1) labeled night-time data are scarce, and 2) over- and under-exposures may co-occur in the input night-time images and are not explicitly modeled in existing pipelines. To tackle the scarcity of night-time data, we collect a novel labeled dataset, named NightCity, of 4,297 real night-time images with ground truth pixel-level semantic annotations. To our knowledge, NightCity is the largest dataset for NTSP. In addition, we propose an exposure-aware framework to address the NTSP problem through augmenting the segmentation process with explicitly learned exposure features. Extensive experiments show that training on NightCity can significantly improve NTSP performance and that our exposure-aware model outperforms the state-of-the-art methods, yielding top performance on our dataset as well as existing datasets.

preprint · 2022 · arXiv

Variational Distillation for Multi-View Learning

Information Bottleneck (IB) based multi-view learning provides an information theoretic principle for seeking shared information contained in heterogeneous data descriptions. However, its great success is generally attributed to estimating the multivariate mutual information, which is intractable when the network becomes complicated. Moreover, the representation learning tradeoffs, i.e., the prediction-compression and sufficiency-consistency tradeoffs, make it hard for the IB to satisfy both requirements simultaneously. In this paper, we design several variational information bottlenecks to exploit two key characteristics (i.e., sufficiency and consistency) for multi-view representation learning. Specifically, we propose a Multi-View Variational Distillation (MV$^2$D) strategy to provide a scalable, flexible and analytical solution to fitting MI by giving arbitrary input of viewpoints but without explicitly estimating it. Under a rigorous theoretical guarantee, our approach enables IB to grasp the intrinsic correlation between observations and semantic labels, producing predictive and compact representations naturally. Also, our information-theoretic constraint can effectively neutralize the sensitivity to heterogeneous data by eliminating both task-irrelevant and view-specific information, preventing both tradeoffs in multiple view cases. To verify our theoretically grounded strategies, we apply our approaches to various benchmarks under three different applications. Extensive experiments quantitatively and qualitatively demonstrate the effectiveness of our approach against state-of-the-art methods.

preprint · 2021 · arXiv

Boundary-Aware Geometric Encoding for Semantic Segmentation of Point Clouds

Boundary information plays a significant role in 2D image segmentation, while usually being ignored in 3D point cloud segmentation where ambiguous features might be generated in feature extraction, leading to misclassification in the transition area between two objects. In this paper, we first propose a Boundary Prediction Module (BPM) to predict boundary points. Based on the predicted boundary, a boundary-aware Geometric Encoding Module (GEM) is designed to encode geometric information and aggregate features with discrimination in a neighborhood, so that the local features belonging to different categories will not be polluted by each other. To provide extra geometric information for the boundary-aware GEM, we also propose a light-weight Geometric Convolution Operation (GCO), making the extracted features more distinguishing. Built upon the boundary-aware GEM, we build our network and test it on benchmarks such as ScanNet v2 and S3DIS. Results show our methods can significantly improve the baseline and achieve state-of-the-art performance. Code is available at https://github.com/JchenXu/BoundaryAwareGEM.

preprint · 2021 · arXiv

Feature Generation and Hypothesis Verification for Reliable Face Anti-Spoofing

Although existing face anti-spoofing (FAS) methods achieve high accuracy in intra-domain experiments, their effects drop severely in cross-domain scenarios because of poor generalization. Recently, multifarious techniques have been explored, such as domain generalization and representation disentanglement. However, the improvement is still limited by two issues: 1) It is difficult to perfectly map all faces to a shared feature space. If faces from unknown domains are not mapped to the known region in the shared feature space, inaccurate predictions will result. 2) It is hard to completely consider various spoof traces for disentanglement. In this paper, we propose a Feature Generation and Hypothesis Verification framework to alleviate the two issues. First, feature generation networks which generate hypotheses of real faces and known attacks are introduced for the first time in the FAS task. Subsequently, two hypothesis verification modules are applied to judge whether the input face comes from the real-face space and the real-face distribution respectively. Furthermore, some analyses of the relationship between our framework and Bayesian uncertainty estimation are given, which provide theoretical support for reliable defense in unknown domains. Experimental results show our framework achieves promising results and outperforms the state-of-the-art approaches on extensive public datasets.

preprint · 2021 · arXiv

Monocular Human Pose and Shape Reconstruction using Part Differentiable Rendering

Superior human pose and shape reconstruction from monocular images depends on removing the ambiguities caused by occlusions and shape variance. Recent works succeed in regression-based methods which estimate parametric models directly through a deep neural network supervised by 3D ground truth. However, 3D ground truth is neither abundant nor efficiently obtainable. In this paper, we introduce body part segmentation as critical supervision. Part segmentation not only indicates the shape of each body part but helps to infer the occlusions among parts as well. To improve the reconstruction with part segmentation, we propose a part-level differentiable renderer that enables part-based models to be supervised by part segmentation in neural networks or optimization loops. We also introduce a general parametric model engaged in the rendering pipeline as an intermediate representation between skeletons and detailed shapes, which consists of primitive geometries for better interpretability. The proposed approach combines parameter regression, body model optimization, and detailed model registration. Experimental results demonstrate that the proposed method achieves balanced evaluation on pose and shape, and outperforms the state-of-the-art approaches on Human3.6M, UP-3D and LSP datasets.

preprint · 2021 · arXiv

Weakly-Supervised Saliency Detection via Salient Object Subitizing

Salient object detection aims at detecting the most visually distinct objects and producing the corresponding masks. As the cost of pixel-level annotations is high, image tags are usually used as weak supervision. However, an image tag can only be used to annotate one class of objects. In this paper, we introduce saliency subitizing as the weak supervision since it is class-agnostic. This allows the supervision to be aligned with the property of saliency detection, where the salient objects of an image could be from more than one class. To this end, we propose a model with two modules, Saliency Subitizing Module (SSM) and Saliency Updating Module (SUM). While SSM learns to generate the initial saliency masks using the subitizing information, without the need for any unsupervised methods or random seeds, SUM helps iteratively refine the generated saliency masks. We conduct extensive experiments on five benchmark datasets. The experimental results show that our method outperforms other weakly-supervised methods and even performs comparably to some fully-supervised methods.

preprint · 2020 · arXiv

Acoustic anomaly detection via latent regularized Gaussian mixture generative adversarial networks

Acoustic anomaly detection aims at distinguishing abnormal acoustic signals from the normal ones. It suffers from class imbalance and a lack of abnormal instances. In addition, collecting all kinds of abnormal or unknown samples for training purposes is impractical and time-consuming. In this paper, a novel Gaussian Mixture Generative Adversarial Network (GMGAN) is proposed under a semi-supervised learning framework, in which the underlying structure of training data is not only captured in spectrogram reconstruction space, but also can be further restricted in the space of latent representation in a discriminant manner. Experiments show that our model has clear superiority over previous methods, and achieves state-of-the-art results on the DCASE dataset.

preprint · 2020 · arXiv

Anomaly Detection by One Class Latent Regularized Networks

Anomaly detection is a fundamental problem in computer vision with many real-world applications. Given a wide range of images belonging to the normal class, emerging from some distribution, the objective of this task is to construct a model that detects out-of-distribution images belonging to abnormal instances. Semi-supervised Generative Adversarial Network (GAN)-based methods have been gaining popularity in anomaly detection tasks recently. However, the training process of GANs is still unstable and challenging. To solve these issues, a novel adversarial dual autoencoder network is proposed, in which the underlying structure of training data is not only captured in latent feature space, but also can be further restricted in the space of latent representation in a discriminant manner, leading to a more accurate detector. In addition, the auxiliary autoencoder regarded as a discriminator could obtain a more stable training process. Experiments show that our model achieves state-of-the-art results on the MNIST and CIFAR10 datasets as well as the GTSRB stop signs dataset.

preprint · 2020 · arXiv

Brain Tumor Anomaly Detection via Latent Regularized Adversarial Network

With the development of medical imaging technology, medical images have become an important basis for doctors to diagnose patients. The brain structure in the collected data is complicated; hence, doctors must spend considerable effort when diagnosing brain abnormalities. To address the imbalance of brain tumor data and the scarcity of labeled data, we propose an innovative brain tumor abnormality detection algorithm. We propose a semi-supervised anomaly detection model that is trained only on healthy (normal) brain images. The model captures the common pattern of the normal images in the training process and detects anomalies based on the reconstruction error in latent space. Furthermore, the method first uses singular values to constrain the latent space and jointly optimizes the image space through multiple loss functions, which makes normal samples and abnormal samples more separable at the feature level. This paper utilizes the BraTS, HCP, MNIST, and CIFAR-10 datasets to comprehensively evaluate effectiveness and practicability. Extensive experiments on intra- and cross-dataset tests show that our semi-supervised method outperforms or achieves results comparable to state-of-the-art supervised techniques.
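Scoring anomalies by a reconstruction error in latent space can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the encoder and decoder are not shown, latent codes are passed in directly, and the L2 distance, the min-max normalization, the threshold and all function names are hypothetical choices.

```python
import math


def latent_anomaly_score(z_input, z_reconstructed):
    """L2 distance between the latent code of an image and the latent code
    of its reconstruction (both assumed to come from a trained encoder)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_input, z_reconstructed)))


def min_max_normalize(scores):
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def flag_anomalies(scores, threshold=0.5):
    """Mark samples whose normalized score exceeds a hypothetical threshold."""
    return [s > threshold for s in min_max_normalize(scores)]


if __name__ == "__main__":
    pairs = [
        ([0.1, 0.2], [0.1, 0.25]),  # latent codes nearly identical
        ([0.3, 0.1], [0.3, 0.1]),   # identical latent codes
        ([0.1, 0.2], [0.9, 1.0]),   # latent codes far apart
    ]
    scores = [latent_anomaly_score(z, z_rec) for z, z_rec in pairs]
    print(scores)
    print(flag_anomalies(scores))
```

A model trained only on normal images is expected to reproduce their latent codes closely, so a large distance is read as evidence of an abnormal input.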

preprint · 2020 · arXiv

Face Anti-Spoofing Via Disentangled Representation Learning

Face anti-spoofing is crucial to the security of face recognition systems. Previous approaches focus on developing discriminative models based on the features extracted from images, which may still be entangled between spoof patterns and real persons. In this paper, motivated by disentangled representation learning, we propose a novel perspective of face anti-spoofing that disentangles the liveness features and content features from images, and the liveness features are further used for classification. We also put forward a Convolutional Neural Network (CNN) architecture with the process of disentanglement and combination of low-level and high-level supervision to improve the generalization capabilities. We evaluate our method on public benchmark datasets and extensive experimental results demonstrate the effectiveness of our method against the state-of-the-art competitors. Finally, we further visualize some results to help understand the effect and advantage of disentanglement.

preprint · 2020 · arXiv

Fine-Grained Expression Manipulation via Structured Latent Space

Fine-grained facial expression manipulation is a challenging problem, as fine-grained expression details are difficult to capture. Most existing expression manipulation methods resort to discrete expression labels, which mainly edit global expressions and ignore the manipulation of fine details. To tackle this limitation, we propose an end-to-end expression-guided generative adversarial network (EGGAN), which utilizes structured latent codes and continuous expression labels as input to generate images with expected expressions. Specifically, we adopt an adversarial autoencoder to map a source image into a structured latent space. Then, given the source latent code and the target expression label, we employ a conditional GAN to generate a new image with the target expression. Moreover, we introduce a perceptual loss and a multi-scale structural similarity loss to preserve identity and global shape during generation. Extensive experiments show that our method can manipulate fine-grained expressions, and generate continuous intermediate expressions between source and target expressions.

preprint · 2020 · arXiv

KeypointNet: A Large-scale 3D Keypoint Dataset Aggregated from Numerous Human Annotations

Detecting 3D object keypoints is of great interest to the areas of both graphics and computer vision. There have been several 2D and 3D keypoint datasets aiming to address this problem in a data-driven way. These datasets, however, either lack scalability or bring ambiguity to the definition of keypoints. Therefore, we present KeypointNet: the first large-scale and diverse 3D keypoint dataset that contains 103,450 keypoints and 8,234 3D models from 16 object categories, by leveraging numerous human annotations. To handle the inconsistency between annotations from different people, we propose a novel method to aggregate these keypoints automatically, through minimization of a fidelity loss. Finally, ten state-of-the-art methods are benchmarked on our proposed dataset. Our code and data are available at https://github.com/qq456cvb/KeypointNet.

preprint · 2020 · arXiv

Novelty Detection via Non-Adversarial Generative Network

One-class novelty detection is the process of determining if a query example differs from the training examples (the target class). Most previous strategies attempt to learn the real characteristics of target samples by using generative adversarial network (GAN) methods. However, the training process of GANs remains challenging, suffering from instability issues such as mode collapse and vanishing gradients. In this paper, by adopting non-adversarial generative networks, a novel decoder-encoder framework is proposed for the novelty detection task, instead of the classical encoder-decoder style. Under the non-adversarial framework, both latent space and image reconstruction space are jointly optimized, leading to a more stable training process with very fast convergence and lower training losses. During inference, inspired by CycleGAN, we design a new testing scheme to conduct image reconstruction, which reverses the training sequence. Experiments show that our model has clear superiority over cutting-edge novelty detectors and achieves state-of-the-art results on the datasets.

preprint · 2020 · arXiv

NTIRE 2020 Challenge on NonHomogeneous Dehazing

This paper reviews the NTIRE 2020 Challenge on NonHomogeneous Dehazing of images (restoration of rich details in hazy images). We focus on the proposed solutions and their results evaluated on NH-Haze, a novel dataset consisting of 55 pairs of real haze-free and nonhomogeneous hazy images recorded outdoors. NH-Haze is the first realistic nonhomogeneous haze dataset that provides ground truth images. The nonhomogeneous haze has been produced using a professional haze generator that imitates the real conditions of haze scenes. 168 participants registered in the challenge and 27 teams competed in the final testing phase. The proposed solutions gauge the state-of-the-art in image dehazing.

preprint · 2020 · arXiv

SceneEncoder: Scene-Aware Semantic Segmentation of Point Clouds with A Learnable Scene Descriptor

Besides local features, global information plays an essential role in semantic segmentation, while recent works usually fail to explicitly extract the meaningful global information and make full use of it. In this paper, we propose a SceneEncoder module to impose scene-aware guidance to enhance the effect of global information. The module predicts a scene descriptor, which learns to represent the categories of objects existing in the scene and directly guides the point-level semantic segmentation by filtering out categories not belonging to this scene. Additionally, to alleviate segmentation noise in local regions, we design a region similarity loss to propagate distinguishing features to their own neighboring points with the same label, leading to the enhancement of the distinguishing ability of point-wise features. We integrate our methods into several prevailing networks and conduct extensive experiments on benchmark datasets ScanNet and ShapeNet. Results show that our methods greatly improve the performance of baselines and achieve state-of-the-art performance.
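One simple way to picture a scene descriptor that filters out absent categories is an element-wise product with each point's class probabilities, followed by renormalization. The sketch below assumes exactly that; the multiplication rule, the fallback for an all-zero product and the function name are illustrative assumptions rather than details taken from the paper.

```python
def apply_scene_descriptor(point_probs, scene_descriptor):
    """Rescale per-point class probabilities by scene-level presence scores.

    point_probs: one probability list per point, one entry per class.
    scene_descriptor: one score in [0, 1] per class, where 0 means the
    class is predicted to be absent from the scene.
    """
    guided = []
    for probs in point_probs:
        weighted = [p * d for p, d in zip(probs, scene_descriptor)]
        total = sum(weighted)
        if total == 0.0:
            # Every class was filtered out: keep the original prediction.
            guided.append(list(probs))
        else:
            guided.append([w / total for w in weighted])
    return guided


if __name__ == "__main__":
    # Class 0 wins locally but the descriptor says it is not in the scene.
    points = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
    descriptor = [0.0, 1.0, 1.0]
    for probs in apply_scene_descriptor(points, descriptor):
        print(probs)
```

For the first point the locally preferred class 0 is suppressed and the prediction switches to class 1, which is the kind of correction a scene-level prior is meant to supply.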

preprint · 2020 · arXiv

SiTGRU: Single-Tunnelled Gated Recurrent Unit for Abnormality Detection

Abnormality detection is a challenging task due to the dependence on a specific context and the unconstrained variability of practical scenarios. In recent years, it has benefited from the powerful features learnt by deep neural networks, and handcrafted features specialized for abnormality detectors. However, these approaches with large complexity still have limitations in handling long-term sequential data (e.g., videos), and their learnt features do not thoroughly capture useful information. Recurrent Neural Networks (RNNs) have been shown to be capable of robustly dealing with temporal data in long-term sequences. In this paper, we propose a novel version of the Gated Recurrent Unit (GRU), called Single Tunnelled GRU, for abnormality detection. In particular, the Single Tunnelled GRU discards the heavy-weighted reset gate from GRU cells, which overlooks the importance of past content by only favouring current input, to obtain an optimized single-gated cell model. Moreover, we substitute the hyperbolic tangent activation in standard GRUs with sigmoid activation, as the former suffers from performance loss in deeper networks. Empirical results show that our proposed optimized GRU model outperforms standard GRU and Long Short Term Memory (LSTM) networks on most metrics for detection and generalization tasks on CUHK Avenue and UCSD datasets. The model is also computationally efficient with reduced training and testing time over standard RNNs.
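The two changes described in this abstract (no reset gate, sigmoid instead of hyperbolic tangent for the candidate state) can be pictured with a generic single-gate recurrent cell. The sketch below is an assumption-laden illustration: the exact gate equations of the paper are not given in the abstract, so the update rule, the parameter layout and all names here are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def single_gate_cell(x, h_prev, params):
    """One step of a GRU-like cell with only an update gate.

    Assumed equations (a standard GRU with the reset gate removed and the
    tanh candidate replaced by a sigmoid):
        z  = sigmoid(Wz x + Uz h_prev + bz)
        c  = sigmoid(Wh x + Uh h_prev + bh)
        h  = (1 - z) * h_prev + z * c
    """
    Wz, Uz, bz, Wh, Uh, bh = params
    z = [sigmoid(a + b + c)
         for a, b, c in zip(matvec(Wz, x), matvec(Uz, h_prev), bz)]
    candidate = [sigmoid(a + b + c)
                 for a, b, c in zip(matvec(Wh, x), matvec(Uh, h_prev), bh)]
    return [(1.0 - zi) * hp + zi * ci
            for zi, hp, ci in zip(z, h_prev, candidate)]


if __name__ == "__main__":
    # One input feature, one hidden unit, all weights zero for readability.
    zero_params = ([[0.0]], [[0.0]], [0.0], [[0.0]], [[0.0]], [0.0])
    h = [0.0]
    for x in ([1.0], [0.5], [-1.0]):
        h = single_gate_cell(x, h, zero_params)
        print(h)
```

Dropping the reset gate removes one set of input and recurrent weights per cell, which is consistent with the reduced training and testing time the abstract reports, although the actual savings depend on the real architecture.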

preprint · 2020 · arXiv

Spoof Face Detection Via Semi-Supervised Adversarial Training

Face spoofing causes severe security threats in face recognition systems. Previous anti-spoofing works focused on supervised techniques, typically with either binary or auxiliary supervision. Most of them suffer from limited robustness and generalization, especially in the cross-dataset setting. In this paper, we propose a semi-supervised adversarial learning framework for spoof face detection, which largely relaxes the supervision condition. To capture the underlying structure of live face data in latent representation space, we propose to train on live face data only, with a convolutional Encoder-Decoder network acting as a Generator. Meanwhile, we add a second convolutional network serving as a Discriminator. The generator and discriminator are trained by competing with each other while collaborating to understand the underlying concept in the normal class (live faces). Since spoof face detection is video based (i.e., uses temporal information), we intuitively take the optical flow maps converted from consecutive video frames as input. Our approach does not require spoof faces for training, and is thus robust and general to different types of spoof, even unknown ones. Extensive experiments on intra- and cross-dataset tests show that our semi-supervised method achieves better or comparable results to state-of-the-art supervised techniques.