Researcher profile

Meng Yu

Meng Yu contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
13works
0followers
6topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

13 published item(s)

preprint2025arXiv

AzeroS: Extending LLM to Speech with Self-Generated Instruction-Free Tuning

Extending large language models (LLMs) to the speech domain has recently gained significant attention. A typical approach connects a pretrained LLM with an audio encoder through a projection module and trains the resulting model on large-scale, task-specific instruction-tuning datasets. However, curating such instruction-tuning data for specific requirements is time-consuming, and models trained in this manner often generalize poorly to unseen tasks. In this work, we first formulate that the strongest generalization of a speech-LLM is achieved when it is trained with Self-Generated Instruction-Free Tuning (SIFT), in which supervision signals are generated by a frozen LLM using textual representations of speech as input. Our proposed SIFT paradigm eliminates the need for collecting task-specific question-answer pairs and yields the theoretically best generalization to unseen tasks. Building upon this paradigm, we introduce AZeroS (Auden Zero-instruction-tuned Speech-LLM), which is trained on speech-text pairs derived from publicly available corpora, including approximately 25,000 hours of speech with ASR transcripts and 3,000 hours of speech with paralinguistic labels. Built upon Qwen2.5-7B-Instruct, the model updates only two lightweight projection modules (23.8 million parameters each), while keeping both the LLM and audio encoders frozen. Despite the minimal training cost and modest data scale, AZeroS achieves state-of-the-art performance on both semantic and paralinguistic benchmarks, including VoiceBench, AIR-Bench Foundation (Speech), and AIR-Bench Chat (Speech).

preprint2022arXiv

Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE

Variational auto-encoder (VAE) is an effective neural network architecture to disentangle a speech utterance into speaker identity and linguistic content latent embeddings, then generate an utterance for a target speaker from that of a source speaker. This is possible by concatenating the identity embedding of the target speaker and the content embedding of the source speaker uttering a desired sentence. In this work, we propose to improve VAE models with self-attention and structural regularization (RGSM). Specifically, we found a suitable location of VAE's decoder to add a self-attention layer for incorporating non-local information in generating a converted utterance and hiding the source speaker's identity. We applied relaxed group-wise splitting method (RGSM) to regularize network weights and remarkably enhance generalization performance. In experiments of zero-shot many-to-many voice conversion task on VCTK data set, with the self-attention layer and relaxed group-wise splitting method, our model achieves a gain of speaker classification accuracy on unseen speakers by 28.3\% while slightly improved conversion voice quality in terms of MOSNet scores. Our encouraging findings point to future research on integrating more variety of attention structures in VAE framework while controlling model size and overfitting for advancing zero-shot many-to-many voice conversions.

preprint2022arXiv

FAST-RIR: Fast neural diffuse room impulse response generator

We present a neural-network-based fast diffuse room impulse response generator (FAST-RIR) for generating room impulse responses (RIRs) for a given acoustic environment. Our FAST-RIR takes rectangular room dimensions, listener and speaker positions, and reverberation time as inputs and generates specular and diffuse reflections for a given acoustic environment. Our FAST-RIR is capable of generating RIRs for a given input reverberation time with an average error of 0.02s. We evaluate our generated RIRs in automatic speech recognition (ASR) applications using Google Speech API, Microsoft Speech API, and Kaldi tools. We show that our proposed FAST-RIR with batch size 1 is 400 times faster than a state-of-the-art diffuse acoustic simulator (DAS) on a CPU and gives similar performance to DAS in ASR experiments. Our FAST-RIR is 12 times faster than an existing GPU-based RIR generator (gpuRIR). We show that our FAST-RIR outperforms gpuRIR by 2.5% in an AMI far-field ASR benchmark.

preprint2022arXiv

Joint Neural AEC and Beamforming with Double-Talk Detection

Acoustic echo cancellation (AEC) in full-duplex communication systems eliminates acoustic feedback. However, nonlinear distortions induced by audio devices, background noise, reverberation, and double-talk reduce the efficiency of conventional AEC systems. Several hybrid AEC models were proposed to address this, which use deep learning models to suppress residual echo from standard adaptive filtering. This paper proposes deep learning-based joint AEC and beamforming model (JAECBF) building on our previous self-attentive recurrent neural network (RNN) beamformer. The proposed network consists of two modules: (i) multi-channel neural-AEC, and (ii) joint AEC-RNN beamformer with a double-talk detection (DTD) that computes time-frequency (T-F) beamforming weights. We train the proposed model in an end-to-end approach to eliminate background noise and echoes from far-end audio devices, which include nonlinear distortions. From experimental evaluations, we find the proposed network outperforms other multi-channel AEC and denoising systems in terms of speech recognition rate and overall speech quality.

preprint2022arXiv

NeuralEcho: A Self-Attentive Recurrent Neural Network For Unified Acoustic Echo Suppression And Speech Enhancement

Acoustic echo cancellation (AEC) plays an important role in the full-duplex speech communication as well as the front-end speech enhancement for recognition in the conditions when the loudspeaker plays back. In this paper, we present an all-deep-learning framework that implicitly estimates the second order statistics of echo/noise and target speech, and jointly solves echo and noise suppression through an attention based recurrent neural network. The proposed model outperforms the state-of-the-art joint echo cancellation and speech enhancement method F-T-LSTM in terms of objective speech quality metrics, speech recognition accuracy and model complexity. We show that this model can work with speaker embedding for better target speech enhancement and furthermore develop a branch for automatic gain control (AGC) task to form an all-in-one front-end speech enhancement system.

preprint2021arXiv

ADL-MVDR: All deep learning MVDR beamformer for target speech separation

Speech separation algorithms are often used to separate the target speech from other interfering sources. However, purely neural network based speech separation systems often cause nonlinear distortion that is harmful for automatic speech recognition (ASR) systems. The conventional mask-based minimum variance distortionless response (MVDR) beamformer can be used to minimize the distortion, but comes with high level of residual noise. Furthermore, the matrix operations (e.g., matrix inversion) involved in the conventional MVDR solution are sometimes numerically unstable when jointly trained with neural networks. In this paper, we propose a novel all deep learning MVDR framework, where the matrix inversion and eigenvalue decomposition are replaced by two recurrent neural networks (RNNs), to resolve both issues at the same time. The proposed method can greatly reduce the residual noise while keeping the target speech undistorted by leveraging on the RNN-predicted frame-wise beamforming weights. The system is evaluated on a Mandarin audio-visual corpus and compared against several state-of-the-art (SOTA) speech separation systems. Experimental results demonstrate the superiority of the proposed method across several objective metrics and ASR accuracy.

preprint2021arXiv

Improving RNN Transducer With Target Speaker Extraction and Neural Uncertainty Estimation

Target-speaker speech recognition aims to recognize target-speaker speech from noisy environments with background noise and interfering speakers. This work presents a joint framework that combines time-domain target-speaker speech extraction and Recurrent Neural Network Transducer (RNN-T). To stabilize the joint-training, we propose a multi-stage training strategy that pre-trains and fine-tunes each module in the system before joint-training. Meanwhile, speaker identity and speech enhancement uncertainty measures are proposed to compensate for residual noise and artifacts from the target speech extraction module. Compared to a recognizer fine-tuned with a target speech extraction model, our experiments show that adding the neural uncertainty module significantly reduces 17% relative Character Error Rate (CER) on multi-speaker signals with background noise. The multi-condition experiments indicate that our method can achieve 9% relative performance gain in the noisy condition while maintaining the performance in the clean condition.

preprint2021arXiv

Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning

In this study, we investigate self-supervised representation learning for speaker verification (SV). First, we examine a simple contrastive learning approach (SimCLR) with a momentum contrastive (MoCo) learning framework, where the MoCo speaker embedding system utilizes a queue to maintain a large set of negative examples. We show that better speaker embeddings can be learned by momentum contrastive learning. Next, alternative augmentation strategies are explored to normalize extrinsic speaker variabilities of two random segments from the same speech utterance. Specifically, augmentation in the waveform largely improves the speaker representations for SV tasks. The proposed MoCo speaker embedding is further improved when a prototypical memory bank is introduced, which encourages the speaker embeddings to be closer to their assigned prototypes with an intermediate clustering step. In addition, we generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled. Comprehensive experiments on the Voxceleb dataset demonstrate that our proposed self-supervised approach achieves competitive performance compared with existing techniques, and can approach fully supervised results with partially labeled data.

preprint2020arXiv

Audio-Visual Speech Separation and Dereverberation with a Two-Stage Multimodal Network

Background noise, interfering speech and room reverberation frequently distort target speech in real listening environments. In this study, we address joint speech separation and dereverberation, which aims to separate target speech from background noise, interfering speech and room reverberation. In order to tackle this fundamentally difficult problem, we propose a novel multimodal network that exploits both audio and visual signals. The proposed network architecture adopts a two-stage strategy, where a separation module is employed to attenuate background noise and interfering speech in the first stage and a dereverberation module to suppress room reverberation in the second stage. The two modules are first trained separately, and then integrated for joint training, which is based on a new multi-objective loss function. Our experimental results show that the proposed multimodal network yields consistently better objective intelligibility and perceptual quality than several one-stage and two-stage baselines. We find that our network achieves a 21.10% improvement in ESTOI and a 0.79 improvement in PESQ over the unprocessed mixtures. Moreover, our network architecture does not require the knowledge of the number of speakers.

preprint2020arXiv

Distortionless Multi-Channel Target Speech Enhancement for Overlapped Speech Recognition

Speech enhancement techniques based on deep learning have brought significant improvement on speech quality and intelligibility. Nevertheless, a large gain in speech quality measured by objective metrics, such as perceptual evaluation of speech quality (PESQ), does not necessarily lead to improved speech recognition performance due to speech distortion in the enhancement stage. In this paper, a multi-channel dilated convolutional network based frequency domain modeling is presented to enhance target speaker in the far-field, noisy and multi-talker conditions. We study three approaches towards distortionless waveforms for overlapped speech recognition: estimating complex ideal ratio mask with an infinite range, incorporating the fbank loss in a multi-objective learning and finetuning the enhancement model by an acoustic model. Experimental results proved the effectiveness of all three approaches on reducing speech distortions and improving recognition accuracy. Particularly, the jointly tuned enhancement model works very well with other standalone acoustic model on real test data.

preprint2020arXiv

End-to-End Multi-Look Keyword Spotting

The performance of keyword spotting (KWS), measured in false alarms and false rejects, degrades significantly under the far field and noisy conditions. In this paper, we propose a multi-look neural network modeling for speech enhancement which simultaneously steers to listen to multiple sampled look directions. The multi-look enhancement is then jointly trained with KWS to form an end-to-end KWS model which integrates the enhanced signals from multiple look directions and leverages an attention mechanism to dynamically tune the model's attention to the reliable sources. We demonstrate, on our large noisy and far-field evaluation sets, that the proposed approach significantly improves the KWS performance against the baseline KWS system and a recent beamformer based multi-beam KWS system.

preprint2020arXiv

Enhancing End-to-End Multi-channel Speech Separation via Spatial Feature Learning

Hand-crafted spatial features (e.g., inter-channel phase difference, IPD) play a fundamental role in recent deep learning based multi-channel speech separation (MCSS) methods. However, these manually designed spatial features are hard to incorporate into the end-to-end optimized MCSS framework. In this work, we propose an integrated architecture for learning spatial features directly from the multi-channel speech waveforms within an end-to-end speech separation framework. In this architecture, time-domain filters spanning signal channels are trained to perform adaptive spatial filtering. These filters are implemented by a 2d convolution (conv2d) layer and their parameters are optimized using a speech separation objective function in a purely data-driven fashion. Furthermore, inspired by the IPD formulation, we design a conv2d kernel to compute the inter-channel convolution differences (ICDs), which are expected to provide the spatial cues that help to distinguish the directional sources. Evaluation results on simulated multi-channel reverberant WSJ0 2-mix dataset demonstrate that our proposed ICD based MCSS model improves the overall signal-to-distortion ratio by 10.4% over the IPD based MCSS model.

preprint2020arXiv

Neural Spatio-Temporal Beamformer for Target Speech Separation

Purely neural network (NN) based speech separation and enhancement methods, although can achieve good objective scores, inevitably cause nonlinear speech distortions that are harmful for the automatic speech recognition (ASR). On the other hand, the minimum variance distortionless response (MVDR) beamformer with NN-predicted masks, although can significantly reduce speech distortions, has limited noise reduction capability. In this paper, we propose a multi-tap MVDR beamformer with complex-valued masks for speech separation and enhancement. Compared to the state-of-the-art NN-mask based MVDR beamformer, the multi-tap MVDR beamformer exploits the inter-frame correlation in addition to the inter-microphone correlation that is already utilized in prior arts. Further improvements include the replacement of the real-valued masks with the complex-valued masks and the joint training of the complex-mask NN. The evaluation on our multi-modal multi-channel target speech separation and enhancement platform demonstrates that our proposed multi-tap MVDR beamformer improves both the ASR accuracy and the perceptual speech quality against prior arts.