Source author record

Yonghong Yan

Yonghong Yan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

eess.AS Sound Computation and Language cond-mat.mes-hall Artificial Intelligence cond-mat.stat-mech Distributed, Parallel, and Cluster Computing Human-Computer Interaction Machine Learning Multimedia Neural and Evolutionary Computing

Catalog footprint

What is connected

12works

11topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2022arXiv

A Novel Blind Source Separation Framework Towards Maximum Signal-To-Interference Ratio

This letter proposes a new blind source separation (BSS) framework termed minimum variance independent component analysis (MVICA), which can potentially achieve the maximum output signal-to-interference ratio (SIR) while also allowing more flexibility in real implementations. The statistical independence assumption has been the foundation of the most dominant BSS techniques in recent decades. However, this assumption does not always hold true and the accurate probabilistic modeling of source is inherently difficult. To overcome these limitations and improve the separation performance, the MVICA framework is rigorously derived by optimizing the design of these independence-based BSS algorithms with the maximum SIR criterion. A deep neural network-supported implementation of MVICA is subsequently described. Experimental results under various conditions show the superiority of MVICA over the state-of-the-art BSS algorithms, in terms of not only SIR but also signal-to-distortion ratio and automatic speech recognition rate.

preprint2022arXiv

Decoupled Federated Learning for ASR with Non-IID Data

Automatic speech recognition (ASR) with federated learning (FL) makes it possible to leverage data from multiple clients without compromising privacy. The quality of FL-based ASR could be measured by recognition performance, communication and computation costs. When data among different clients are not independently and identically distributed (non-IID), the performance could degrade significantly. In this work, we tackle the non-IID issue in FL-based ASR with personalized FL, which learns personalized models for each client. Concretely, we propose two types of personalized FL approaches for ASR. Firstly, we adapt the personalization layer based FL for ASR, which keeps some layers locally to learn personalization models. Secondly, to reduce the communication and computation costs, we propose decoupled federated learning (DecoupleFL). On one hand, DecoupleFL moves the computation burden to the server, thus decreasing the computation on clients. On the other hand, DecoupleFL communicates secure high-level features instead of model parameters, thus reducing communication cost when models are large. Experiments demonstrate two proposed personalized FL-based ASR approaches could reduce WER by 2.3% - 3.4% compared with FedAvg. Among them, DecoupleFL has only 11.4% communication and 75% computation cost compared with FedAvg, which is also significantly less than the personalization layer based FL.

preprint2022arXiv

Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies

There is often a trade-off between performance and latency in streaming automatic speech recognition (ASR). Traditional methods such as look-ahead and chunk-based methods, usually require information from future frames to advance recognition accuracy, which incurs inevitable latency even if the computation is fast enough. A causal model that computes without any future frames can avoid this latency, but its performance is significantly worse than traditional methods. In this paper, we propose corresponding revision strategies to improve the causal model. Firstly, we introduce a real-time encoder states revision strategy to modify previous states. Encoder forward computation starts once the data is received and revises the previous encoder states after several frames, which is no need to wait for any right context. Furthermore, a CTC spike position alignment decoding algorithm is designed to reduce time costs brought by the revision strategy. Experiments are all conducted on Librispeech datasets. Fine-tuning on the CTC-based wav2vec2.0 model, our best method can achieve 3.7/9.2 WERs on test-clean/other sets, which is also competitive with the chunk-based methods and the knowledge distillation methods.

preprint2022arXiv

Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker Diarization

For online speaker diarization, samples arrive incrementally, and the overall distribution of the samples is invisible. Moreover, in most existing clustering-based methods, the training objective of the embedding extractor is not designed specially for clustering. To improve online speaker diarization performance, we propose a unified online clustering framework, which provides an interactive manner between embedding extractors and clustering algorithms. Specifically, the framework consists of two highly coupled parts: clustering-guided recurrent training (CGRT) and truncated beam searching clustering (TBSC). The CGRT introduces the clustering algorithm into the training process of embedding extractors, which could provide not only cluster-aware information for the embedding extractor, but also crucial parameters for the clustering process afterward. And with these parameters, which contain preliminary information of the metric space, the TBSC penalizes the probability score of each cluster, in order to output more accurate clustering results in online fashion with low latency. With the above innovations, our proposed online clustering system achieves 14.48\% DER with collar 0.25 at 2.5s latency on the AISHELL-4, while the DER of the offline agglomerative hierarchical clustering is 14.57\%.

preprint2022arXiv

Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset

This paper introduces a high-quality rich annotated Mandarin conversational (RAMC) speech dataset called MagicData-RAMC. The MagicData-RAMC corpus contains 180 hours of conversational speech data recorded from native speakers of Mandarin Chinese over mobile phones with a sampling rate of 16 kHz. The dialogs in MagicData-RAMC are classified into 15 diversified domains and tagged with topic labels, ranging from science and technology to ordinary life. Accurate transcription and precise speaker voice activity timestamps are manually labeled for each sample. Speakers' detailed information is also provided. As a Mandarin speech dataset designed for dialog scenarios with high quality and rich annotations, MagicData-RAMC enriches the data diversity in the Mandarin speech community and allows extensive research on a series of speech-related tasks, including automatic speech recognition, speaker diarization, topic detection, keyword search, text-to-speech, etc. We also conduct several relevant tasks and provide experimental results to help evaluate the dataset.

preprint2022arXiv

The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines

The conversation scenario is one of the most important and most challenging scenarios for speech processing technologies because people in conversation respond to each other in a casual style. Detecting the speech activities of each person in a conversation is vital to downstream tasks, like natural language processing, machine translation, etc. People refer to the detection technology of "who speak when" as speaker diarization (SD). Traditionally, diarization error rate (DER) has been used as the standard evaluation metric of SD systems for a long time. However, DER fails to give enough importance to short conversational phrases, which are short but important on the semantic level. Also, a carefully and accurately manually-annotated testing dataset suitable for evaluating the conversational SD technologies is still unavailable in the speech community. In this paper, we design and describe the Conversational Short-phrases Speaker Diarization (CSSD) task, which consists of training and testing datasets, evaluation metric and baselines. In the dataset aspect, despite the previously open-sourced 180-hour conversational MagicData-RAMC dataset, we prepare an individual 20-hour conversational speech test dataset with carefully and artificially verified speakers timestamps annotations for the CSSD task. In the metric aspect, we design the new conversational DER (CDER) evaluation metric, which calculates the SD accuracy at the utterance level. In the baseline aspect, we adopt a commonly used method: Variational Bayes HMM x-vector system, as the baseline of the CSSD task. Our evaluation metric is publicly available at https://github.com/SpeechClub/CDER_Metric.

preprint2022arXiv

Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR

Self-supervised pre-training could effectively improve the performance of low-resource automatic speech recognition (ASR). However, existing self-supervised pre-training are task-agnostic, i.e., could be applied to various downstream tasks. Although it enlarges the scope of its application, the capacity of the pre-trained model is not fully utilized for the ASR task, and the learned representations may not be optimal for ASR. In this work, in order to build a better pre-trained model for low-resource ASR, we propose a pre-training approach called wav2vec-S, where we use task-specific semi-supervised pre-training to refine the self-supervised pre-trained model for the ASR task thus more effectively utilize the capacity of the pre-trained model to generate task-specific representations for ASR. Experiments show that compared to wav2vec 2.0, wav2vec-S only requires a marginal increment of pre-training time but could significantly improve ASR performance on in-domain, cross-domain and cross-lingual datasets. Average relative WER reductions are 24.5% and 6.6% for 1h and 10h fine-tuning, respectively. Furthermore, we show that semi-supervised pre-training could close the representation gap between the self-supervised pre-trained model and the corresponding fine-tuned model through canonical correlation analysis.

preprint2020arXiv

Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture

Recently, Transformer has gained success in automatic speech recognition (ASR) field. However, it is challenging to deploy a Transformer-based end-to-end (E2E) model for online speech recognition. In this paper, we propose the Transformer-based online CTC/attention E2E ASR architecture, which contains the chunk self-attention encoder (chunk-SAE) and the monotonic truncated attention (MTA) based self-attention decoder (SAD). Firstly, the chunk-SAE splits the speech into isolated chunks. To reduce the computational cost and improve the performance, we propose the state reuse chunk-SAE. Sencondly, the MTA based SAD truncates the speech features monotonically and performs attention on the truncated features. To support the online recognition, we integrate the state reuse chunk-SAE and the MTA based SAD into online CTC/attention architecture. We evaluate the proposed online models on the HKUST Mandarin ASR benchmark and achieve a 23.66% character error rate (CER) with a 320 ms latency. Our online model yields as little as 0.19% absolute CER degradation compared with the offline baseline, and achieves significant improvement over our prior work on Long Short-Term Memory (LSTM) based online E2E models.

preprint2016arXiv

Optimizing human-interpretable dialog management policy using Genetic Algorithm

Automatic optimization of spoken dialog management policies that are robust to environmental noise has long been the goal for both academia and industry. Approaches based on reinforcement learning have been proved to be effective. However, the numerical representation of dialog policy is human-incomprehensible and difficult for dialog system designers to verify or modify, which limits its practical application. In this paper we propose a novel framework for optimizing dialog policies specified in domain language using genetic algorithm. The human-interpretable representation of policy makes the method suitable for practical employment. We present learning algorithms using user simulation and real human-machine dialogs respectively.Empirical experimental results are given to show the effectiveness of the proposed approach.

preprint2015arXiv

Noise Robust IOA/CAS Speech Separation and Recognition System For The Third 'CHIME' Challenge

This paper presents the contribution to the third 'CHiME' speech separation and recognition challenge including both front-end signal processing and back-end speech recognition. In the front-end, Multi-channel Wiener filter (MWF) is designed to achieve background noise reduction. Different from traditional MWF, optimized parameter for the tradeoff between noise reduction and target signal distortion is built according to the desired noise reduction level. In the back-end, several techniques are taken advantage to improve the noisy Automatic Speech Recognition (ASR) performance including Deep Neural Network (DNN), Convolutional Neural Network (CNN) and Long short-term memory (LSTM) using medium vocabulary, Lattice rescoring with a big vocabulary language model finite state transducer, and ROVER scheme. Experimental results show the proposed system combining front-end and back-end is effective to improve the ASR performance.

preprint2014arXiv

Energy spread and current-current correlation in quantum systems

We consider energy (heat) transport in quantum systems, and establish a relationship between energy spread and energy current-current correlation function. The energy current-current correlation is related to thermal conductivity by the Green-Kubo formula, and thus this relationship allows us to study conductivity directly from the energy spread process. As an example, we investigate a spinless fermion model; the numerical results confirm the relationship.

preprint2013arXiv

Inelastic transport dynamics through attractive impurity in charge Kondo regime

Within the frame of quantum dissipation theory, we develop a new hierarchical equations of motion theory, combined with the small polaron transformation. We fully investigate the electron transport of a single attractive impurity system with the strong electron-phonon coupling in charge- Kondo regime. Numerical results demonstrate the following facts: (i) The density of states curve shows that the attraction mechanism results in not only the charge-Kondo resonance (i.e. elastic pair-transition resonance), but also the inelastic pair-transition resonances and inelastic cotunneling resonances. These signals are separated discernibly in asymmetric levels about Fermi level; (ii) The differential conductance spectrum shows the distinct peaks or wiggles and the abnormal split of pair-transition peaks under the asymmetric bias; (iii) The improvement of bath-temperature can enhance the phonon-emission (absorption)-assisted sequential tunneling and also strengthen the signals of inelastic pair-transition under the asymmetric bias; (iv) The inelastic dynamics driven by the ramp-up time-dependent voltage presents clear steps which can be tailored by the duration-time and bath-temperature; (v) The linear-response spectrum obtained from the linear-response theory in the Liouville space reveals the excitation signals of electrons' dynamical transition. Briefly, the physics caused by attraction mechanism stems from the double-occupancy or vacant-occupation of impurity system

Yonghong Yan

What is connected

Connect this record

See the researcher in context

Building this map preview

12 published item(s)

A Novel Blind Source Separation Framework Towards Maximum Signal-To-Interference Ratio

Decoupled Federated Learning for ASR with Non-IID Data

Improving Streaming End-to-End ASR on Transformer-based Causal Models with Encoder States Revision Strategies

Interrelate Training and Searching: A Unified Online Clustering Framework for Speaker Diarization

Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational(RAMC) Speech Dataset

The Conversational Short-phrase Speaker Diarization (CSSD) Task: Dataset, Evaluation Metric and Baselines

Wav2vec-S: Semi-Supervised Pre-Training for Low-Resource ASR

Transformer-based Online CTC/attention End-to-End Speech Recognition Architecture

Optimizing human-interpretable dialog management policy using Genetic Algorithm

Noise Robust IOA/CAS Speech Separation and Recognition System For The Third 'CHIME' Challenge

Energy spread and current-current correlation in quantum systems

Inelastic transport dynamics through attractive impurity in charge Kondo regime