Researcher profile

Cunhang Fan

Cunhang Fan contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
7works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

7 published item(s)

preprint2022arXiv

Fully Automated End-to-End Fake Audio Detection

The existing fake audio detection systems often rely on expert experience to design the acoustic features or manually design the hyperparameters of the network structure. However, artificial adjustment of the parameters can have a relatively obvious influence on the results. It is almost impossible to manually set the best set of parameters. Therefore this paper proposes a fully automated end-toend fake audio detection method. We first use wav2vec pre-trained model to obtain a high-level representation of the speech. Furthermore, for the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS. It learns deep speech representations while automatically learning and optimizing complex neural structures consisting of convolutional operations and residual blocks. The experimental results on the ASVspoof 2019 LA dataset show that our proposed system achieves an equal error rate (EER) of 1.08%, which outperforms the state-of-the-art single system.

preprint2020arXiv

A Recursive Network with Dynamic Attention for Monaural Speech Enhancement

A person tends to generate dynamic attention towards speech under complicated environments. Based on this phenomenon, we propose a framework combining dynamic attention and recursive learning together for monaural speech enhancement. Apart from a major noise reduction network, we design a separated sub-network, which adaptively generates the attention distribution to control the information flow throughout the major network. To effectively decrease the number of trainable parameters, recursive learning is introduced, which means that the network is reused for multiple stages, where the intermediate output in each stage is correlated with a memory mechanism. As a result, a more flexible and better estimation can be obtained. We conduct experiments on TIMIT corpus. Experimental results show that the proposed architecture obtains consistently better performance than recent state-of-the-art models in terms of both PESQ and STOI scores.

preprint2020arXiv

Adversarial Transfer Learning for Punctuation Restoration

Previous studies demonstrate that word embeddings and part-of-speech (POS) tags are helpful for punctuation restoration tasks. However, two drawbacks still exist. One is that word embeddings are pre-trained by unidirectional language modeling objectives. Thus the word embeddings only contain left-to-right context information. The other is that POS tags are provided by an external POS tagger. So computation cost will be increased and incorrect predicted tags may affect the performance of restoring punctuation marks during decoding. This paper proposes adversarial transfer learning to address these problems. A pre-trained bidirectional encoder representations from transformers (BERT) model is used to initialize a punctuation model. Thus the transferred model parameters carry both left-to-right and right-to-left representations. Furthermore, adversarial multi-task learning is introduced to learn task invariant knowledge for punctuation prediction. We use an extra POS tagging task to help the training of the punctuation predicting task. Adversarial training is utilized to prevent the shared parameters from containing task specific information. We only use the punctuation predicting task to restore marks during decoding stage. Therefore, it will not need extra computation and not introduce incorrect tags from the POS tagger. Experiments are conducted on IWSLT2011 datasets. The results demonstrate that the punctuation predicting models obtain further performance improvement with task invariant knowledge from the POS tagging task. Our best model outperforms the previous state-of-the-art model trained only with lexical features by up to 9.2% absolute overall F_1-score on test set.

preprint2020arXiv

Deep Attention Fusion Feature for Speech Separation with End-to-End Post-filter Method

In this paper, we propose an end-to-end post-filter method with deep attention fusion features for monaural speaker-independent speech separation. At first, a time-frequency domain speech separation method is applied as the pre-separation stage. The aim of pre-separation stage is to separate the mixture preliminarily. Although this stage can separate the mixture, it still contains the residual interference. In order to enhance the pre-separated speech and improve the separation performance further, the end-to-end post-filter (E2EPF) with deep attention fusion features is proposed. The E2EPF can make full use of the prior knowledge of the pre-separated speech, which contributes to speech separation. It is a fully convolutional speech separation network and uses the waveform as the input features. Firstly, the 1-D convolutional layer is utilized to extract the deep representation features for the mixture and pre-separated signals in the time domain. Secondly, to pay more attention to the outputs of the pre-separation stage, an attention module is applied to acquire deep attention fusion features, which are extracted by computing the similarity between the mixture and the pre-separated speech. These deep attention fusion features are conducive to reduce the interference and enhance the pre-separated speech. Finally, these features are sent to the post-filter to estimate each target signals. Experimental results on the WSJ0-2mix dataset show that the proposed method outperforms the state-of-the-art speech separation method. Compared with the pre-separation method, our proposed method can acquire 64.1%, 60.2%, 25.6% and 7.5% relative improvements in scale-invariant source-to-noise ratio (SI-SNR), the signal-to-distortion ratio (SDR), the perceptual evaluation of speech quality (PESQ) and the short-time objective intelligibility (STOI) measures, respectively.

preprint2020arXiv

Dynamic Attention Based Generative Adversarial Network with Phase Post-Processing for Speech Enhancement

The generative adversarial networks (GANs) have facilitated the development of speech enhancement recently. Nevertheless, the performance advantage is still limited when compared with state-of-the-art models. In this paper, we propose a powerful Dynamic Attention Recursive GAN called DARGAN for noise reduction in the time-frequency domain. Different from previous works, we have several innovations. First, recursive learning, an iterative training protocol, is used in the generator, which consists of multiple steps. By reusing the network in each step, the noise components are progressively reduced in a step-wise manner. Second, the dynamic attention mechanism is deployed, which helps to re-adjust the feature distribution in the noise reduction module. Third, we exploit the deep Griffin-Lim algorithm as the module for phase postprocessing, which facilitates further improvement in speech quality. Experimental results on Voice Bank corpus show that the proposed GAN achieves state-of-the-art performance than previous GAN- and non-GAN-based models

preprint2020arXiv

Simultaneous Denoising and Dereverberation Using Deep Embedding Features

Monaural speech dereverberation is a very challenging task because no spatial cues can be used. When the additive noises exist, this task becomes more challenging. In this paper, we propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features, which is based on the deep clustering (DC). DC is a state-of-the-art method for speech separation that includes embedding learning and K-means clustering. As for our proposed method, it contains two stages: denoising and dereverberation. At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features. These embedding features are generated from the anechoic speech and residual reverberation signals. They can represent the inferred spectral masking patterns of the desired signals, which are discriminative features. At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another supervised neural network is utilized to estimate the anechoic speech from these deep embedding features. Finally, the denoising stage and dereverberation stage are optimized by the joint training method. Experimental results show that the proposed method outperforms the WPE and BLSTM baselines, especially in the low SNR condition.

preprint2020arXiv

Spatial and spectral deep attention fusion for multi-channel speech separation using deep embedding features

Multi-channel deep clustering (MDC) has acquired a good performance for speech separation. However, MDC only applies the spatial features as the additional information. So it is difficult to learn mutual relationship between spatial and spectral features. Besides, the training objective of MDC is defined at embedding vectors, rather than real separated sources, which may damage the separation performance. In this work, we propose a deep attention fusion method to dynamically control the weights of the spectral and spatial features and combine them deeply. In addition, to solve the training objective problem of MDC, the real separated sources are used as the training objectives. Specifically, we apply the deep clustering network to extract deep embedding features. Instead of using the unsupervised K-means clustering to estimate binary masks, another supervised network is utilized to learn soft masks from these deep embedding features. Our experiments are conducted on a spatialized reverberant version of WSJ0-2mix dataset. Experimental results show that the proposed method outperforms MDC baseline and even better than the oracle ideal binary mask (IBM).