Researcher profile

Yefei Chen

Yefei Chen contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 13 - UnverifiedVerification L1Unclaimed author
2works
0followers
2topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

2 published item(s)

preprint2020arXiv

End-to-End Speaker-Dependent Voice Activity Detection

Voice activity detection (VAD) is an essential pre-processing step for tasks such as automatic speech recognition (ASR) and speaker recognition. A basic goal is to remove silent segments within an audio, while a more general VAD system could remove all the irrelevant segments such as noise and even unwanted speech from non-target speakers. We define the task, which only detects the speech from the target speaker, as speaker-dependent voice activity detection (SDVAD). This task is quite common in real applications and usually implemented by performing speaker verification (SV) on audio segments extracted from VAD. In this paper, we propose an end-to-end neural network based approach to address this problem, which explicitly takes the speaker identity into the modeling process. Moreover, inference can be performed in an online fashion, which leads to low system latency. Experiments are carried out on a conversational telephone dataset generated from the Switchboard corpus. Results show that our proposed online approach achieves significantly better performance than the usual VAD/SV system in terms of both frame accuracy and F-score. We also used our previously proposed segment-level metric for a more comprehensive analysis.

preprint2020arXiv

Voice activity detection in the wild via weakly supervised sound event detection

Traditional supervised voice activity detection (VAD) methods work well in clean and controlled scenarios, with performance severely degrading in real-world applications. One possible bottleneck is that speech in the wild contains unpredictable noise types, hence frame-level label prediction is difficult, which is required for traditional supervised VAD training. In contrast, we propose a general-purpose VAD (GPVAD) framework, which can be easily trained from noisy data in a weakly supervised fashion, requiring only clip-level labels. We proposed two GPVAD models, one full (GPV-F), trained on 527 Audioset sound events, and one binary (GPV-B), only distinguishing speech and noise. We evaluate the two GPV models against a CRNN based standard VAD model (VAD-C) on three different evaluation protocols (clean, synthetic noise, real data). Results show that our proposed GPV-F demonstrates competitive performance in clean and synthetic scenarios compared to traditional VAD-C. Further, in real-world evaluation, GPV-F largely outperforms VAD-C in terms of frame-level evaluation metrics as well as segment-level ones. With a much lower requirement for frame-labeled data, the naive binary clip-level GPV-B model can still achieve comparable performance to VAD-C in real-world scenarios.