Researcher profile

Chunlei Zhang

Chunlei Zhang contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
10 works
0 followers
9 topics
4 close collaborators

Published work

10 published items

preprint | 2022 | arXiv

LAE: Language-Aware Encoder for Monolingual and Multilingual ASR

Despite the rapid progress in automatic speech recognition (ASR) research, recognizing multilingual speech using a unified ASR system remains highly challenging. Previous works on multilingual speech recognition mainly focus on two directions: recognizing monolingual speech in multiple languages or recognizing code-switched speech that uses different languages interchangeably within a single utterance. However, a pragmatic multilingual recognizer is expected to be compatible with both directions. In this work, a novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information and generating frame-level language-aware representations during encoding. In the LAE, the primary encoding is implemented by the shared block while the language-specific blocks are used to extract specific representations for each language. To learn language-specific information discriminatively, a language-aware training method is proposed to optimize the language-specific blocks in LAE. Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating different languages at the frame level and shows superior performance on both monolingual and multilingual ASR tasks. With either a real-recorded or simulated code-switched dataset, the proposed LAE achieves statistically significant improvements on both CTC and neural transducer systems. Code is released.

preprint | 2022 | arXiv

NeuralEcho: A Self-Attentive Recurrent Neural Network For Unified Acoustic Echo Suppression And Speech Enhancement

Acoustic echo cancellation (AEC) plays an important role in full-duplex speech communication as well as in front-end speech enhancement for recognition when the loudspeaker is playing back audio. In this paper, we present an all-deep-learning framework that implicitly estimates the second-order statistics of echo/noise and target speech, and jointly solves echo and noise suppression through an attention-based recurrent neural network. The proposed model outperforms the state-of-the-art joint echo cancellation and speech enhancement method F-T-LSTM in terms of objective speech quality metrics, speech recognition accuracy and model complexity. We show that this model can work with speaker embedding for better target speech enhancement, and we furthermore develop a branch for the automatic gain control (AGC) task to form an all-in-one front-end speech enhancement system.

preprint | 2022 | arXiv

Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion

Traditional studies on voice conversion (VC) have made progress with parallel training data and known speakers. Good voice conversion quality is obtained by exploring better alignment modules or expressive mapping functions. In this study, we investigate zero-shot VC from a novel perspective of self-supervised disentangled speech representation learning. Specifically, we achieve the disentanglement by balancing the information flow between global speaker representation and time-varying content representation in a sequential variational autoencoder (VAE). Zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to the VAE decoder. In addition, an on-the-fly data augmentation training strategy is applied to make the learned representation noise invariant. On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on speaker embedding and content embedding, and subjective evaluation, i.e., voice naturalness and similarity, and the system remains robust even with noisy source/target utterances.

preprint | 2022 | arXiv

Towards Improved Zero-shot Voice Conversion with Conditional DSVAE

Disentangling content and speaking style information is essential for zero-shot non-parallel voice conversion (VC). Our previous study investigated a novel framework with a disentangled sequential variational autoencoder (DSVAE) as the backbone for information decomposition. We have demonstrated that simultaneously disentangling content embedding and speaker embedding from one utterance is feasible for zero-shot VC. In this study, we continue in this direction by raising one concern about the prior distribution of the content branch in the DSVAE baseline. We find that the randomly initialized prior distribution forces the content embedding to lose phonetic-structure information during the learning process, which is not a desired property. Here, we seek to achieve a better content embedding with more phonetic information preserved. We propose conditional DSVAE, a new model that enables content bias as a condition to the prior modeling and reshapes the content embedding sampled from the posterior distribution. In our experiment on the VCTK dataset, we demonstrate that content embeddings derived from the conditional DSVAE overcome the randomness and achieve much better phoneme classification accuracy, stabilized vocalization and better zero-shot VC performance compared with the competitive DSVAE baseline.

preprint | 2021 | arXiv

Improving RNN Transducer With Target Speaker Extraction and Neural Uncertainty Estimation

Target-speaker speech recognition aims to recognize target-speaker speech from noisy environments with background noise and interfering speakers. This work presents a joint framework that combines time-domain target-speaker speech extraction and Recurrent Neural Network Transducer (RNN-T). To stabilize the joint training, we propose a multi-stage training strategy that pre-trains and fine-tunes each module in the system before joint training. Meanwhile, speaker identity and speech enhancement uncertainty measures are proposed to compensate for residual noise and artifacts from the target speech extraction module. Compared to a recognizer fine-tuned with a target speech extraction model, our experiments show that adding the neural uncertainty module yields a significant 17% relative reduction in Character Error Rate (CER) on multi-speaker signals with background noise. The multi-condition experiments indicate that our method can achieve a 9% relative performance gain in the noisy condition while maintaining the performance in the clean condition.
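
As a reading aid for the "17% relative" figure in the abstract above, the following minimal sketch shows how a relative error-rate reduction is usually computed. The function name and the baseline and improved CER values are hypothetical, chosen only so the arithmetic lands on 17%; they are not results from the paper.

```python
def relative_reduction(baseline_cer: float, new_cer: float) -> float:
    """Return the error-rate reduction as a fraction of the baseline rate."""
    if baseline_cer <= 0:
        raise ValueError("baseline_cer must be positive")
    return (baseline_cer - new_cer) / baseline_cer


# Hypothetical CER values in percent (assumed for illustration only).
baseline = 20.0
improved = 16.6

# An absolute drop of 3.4 points from a 20.0 baseline is a 17% relative drop.
reduction = relative_reduction(baseline, improved)
print(f"relative CER reduction: {reduction:.1%}")
```

A relative figure depends on the baseline: the same 3.4-point absolute drop from a baseline of 40.0 would be only an 8.5% relative reduction.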

preprint | 2021 | arXiv

Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning

In this study, we investigate self-supervised representation learning for speaker verification (SV). First, we examine a simple contrastive learning approach (SimCLR) with a momentum contrastive (MoCo) learning framework, where the MoCo speaker embedding system utilizes a queue to maintain a large set of negative examples. We show that better speaker embeddings can be learned by momentum contrastive learning. Next, alternative augmentation strategies are explored to normalize extrinsic speaker variabilities of two random segments from the same speech utterance. Specifically, augmentation in the waveform largely improves the speaker representations for SV tasks. The proposed MoCo speaker embedding is further improved when a prototypical memory bank is introduced, which encourages the speaker embeddings to be closer to their assigned prototypes with an intermediate clustering step. In addition, we generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled. Comprehensive experiments on the VoxCeleb dataset demonstrate that our proposed self-supervised approach achieves competitive performance compared with existing techniques, and can approach fully supervised results with partially labeled data.
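
The queue of negatives and the momentum encoder mentioned in the abstract above can be illustrated with a small pure-Python sketch of the general MoCo recipe: a slowly updated key encoder, a fixed-size FIFO queue of past keys, and an InfoNCE loss. This is not the authors' implementation. The function names, the two-dimensional toy embeddings, the queue size, the momentum value and the temperature are all assumptions made for illustration.

```python
import math
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Move key-encoder parameters slowly toward the query encoder's."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, negatives, temperature=0.07):
    """InfoNCE loss: one positive key scored against the queued negatives."""
    logits = [dot(query, positive_key) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    peak = max(logits)  # subtract the max before exponentiating for stability
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[0]


# FIFO queue of past key embeddings; the oldest entry is dropped when full.
queue = deque(maxlen=4)
for key in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0], [0.6, 0.8]):
    queue.append(key)

query = [1.0, 0.0]
loss_matched = info_nce(query, [1.0, 0.0], list(queue))
loss_mismatched = info_nce(query, [-1.0, 0.0], list(queue))
print(loss_matched < loss_mismatched)
```

The loss is small when the query agrees with its positive key and large when it does not, which is the signal that pulls two segments of the same utterance together while pushing queued embeddings away.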

preprint | 2020 | arXiv

DurIAN-SC: Duration Informed Attention Network based Singing Voice Conversion System

Singing voice conversion converts the timbre of the source singing to the target speaker's voice while keeping the singing content the same. However, singing data for a target speaker is much more difficult to collect than normal speech data. In this paper, we introduce a singing voice conversion algorithm that is capable of generating high-quality singing in the target speaker's voice using only the speaker's normal speech data. First, we integrate the training and conversion process of speech and singing into one framework by unifying the features used in standard speech synthesis and singing synthesis systems. In this way, normal speech data can also contribute to singing voice conversion training, making the singing voice conversion system more robust, especially when the singing database is small. Moreover, in order to achieve one-shot singing voice conversion, a speaker embedding module is developed using both speech and singing data, which provides target speaker identity information during conversion. Experiments indicate that the proposed singing voice conversion system can convert source singing to high-quality singing in the target speaker's voice with only 20 seconds of the target speaker's enrollment speech data.

preprint | 2019 | arXiv

Overview to the Hard X-ray Modulation Telescope (Insight-HXMT) Satellite

As China's first X-ray astronomical satellite, the Hard X-ray Modulation Telescope (HXMT), which was dubbed Insight-HXMT after its launch on June 15, 2017, is a wide-band (1-250 keV) slat-collimator-based X-ray astronomy satellite with the capability of all-sky monitoring in 0.2-3 MeV. It was designed to perform pointing, scanning and gamma-ray burst (GRB) observations and, based on the Direct Demodulation Method (DDM), the image of the scanned sky region can be reconstructed. Here we give an overview of the mission and its progress, including payload, core science, ground calibration/facility, ground segment, data archive, software, in-orbit performance, calibration, background model, observations and some preliminary results.

preprint | 2019 | arXiv

Spin current pumped by resonant skyrmion

Spin pumping is a widely recognized method to generate spin current in spintronics, and it is acknowledged as a fundamentally dynamic process equivalent to the spin-transfer torque. In this work, we theoretically verify that an oscillating spin current can be pumped from a microwave-driven breathing skyrmion. The skyrmion spin pumping can be excited at a relatively low frequency compared with ferromagnetic resonance (FMR), and the current density is larger than in ordinary FMR spin pumping. Based on the skyrmion spin pumping, we build a high reading-speed racetrack memory model whose reading speed is an order of magnitude higher than that of the SOT (spin-orbit torque) / STT (spin-transfer torque) skyrmion racetrack. Our work explores the spin pumping phenomenon in the skyrmion, and it may contribute to the applications of skyrmion-based devices.

preprint | 2019 | arXiv

The Medium Energy (ME) X-ray telescope onboard the Insight-HXMT astronomy satellite

The Medium Energy X-ray telescope (ME) is one of the three main telescopes on board the Insight Hard X-ray Modulation Telescope (Insight-HXMT) astronomy satellite. ME contains 1728 pixels of Si-PIN detectors sensitive in 5-30 keV with a total geometrical area of 952 cm². Application Specific Integrated Circuit (ASIC) chips, VA32TA6, are used to achieve low power consumption and low readout noise. The collimators define three kinds of fields of view (FOVs) for the telescope: 1°×4°, 4°×4°, and blocked ones. A combination of such FOVs can be used to estimate the in-orbit X-ray and particle background components. The energy resolution of ME is ~3 keV at 17.8 keV (FWHM) and the time resolution is 255 μs. In this paper, we introduce the design and performance of ME.