Source author record

Huaming Wang

Huaming Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher

Unclaimed source record

eess.AS, Sound, math.PR, Computation and Language, eess.SP

Catalog footprint

What is connected

5 works

5 topics

4 close collaborators

preprint, 2023, arXiv

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. Vall-E exhibits emergent in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that Vall-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

preprint, 2022, arXiv

CLAP: Learning Audio Concepts From Natural Language Supervision

Mainstream Audio Analytics models are trained to learn under the paradigm of one class label to many recordings focusing on one task. Learning under such restricted supervision limits the flexibility of models because they require labeled audio for training and can only predict the predefined categories. Instead, we propose to learn audio concepts from natural language supervision. We call our approach Contrastive Language-Audio Pretraining (CLAP), which learns to connect language and audio by using two encoders and contrastive learning to bring audio and text descriptions into a joint multimodal space. We trained CLAP with 128k audio and text pairs and evaluated it on 16 downstream tasks across 8 domains, such as Sound Event Classification, Music tasks, and Speech-related tasks. Although CLAP was trained with significantly fewer pairs than similar computer vision models, it establishes SoTA for Zero-Shot performance. Additionally, we evaluated CLAP in a supervised learning setup and achieved SoTA in 5 tasks. Hence, CLAP's Zero-Shot capability removes the need for training with class labels, enables flexible class prediction at inference time, and generalizes to multiple downstream tasks.
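The contrastive objective described in this abstract can be illustrated with a small sketch. The abstract does not state the exact loss, so the symmetric cross-entropy form below, the temperature value, and all function names are assumptions made for illustration; the embeddings are plain lists standing in for the outputs of the audio and text encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Pair i of audio_emb matches pair i of text_emb; all other pairs are negatives.

    temperature=0.07 is an assumed value, not one taken from the abstract.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # logits[i][j] = scaled cosine similarity between audio i and text j
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Audio-to-text direction: each row should peak on its diagonal entry.
    loss_a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: same idea over the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_a2t + loss_t2a)

def zero_shot_classify(audio_vec, class_text_embs):
    """Return the index of the class description most similar to the audio."""
    a = l2_normalize(audio_vec)
    scores = [sum(x * y for x, y in zip(a, l2_normalize(t)))
              for t in class_text_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

Because classes are represented only by text embeddings, the candidate set passed to `zero_shot_classify` can change at inference time without retraining, which is the flexibility the abstract refers to.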

preprint, 2022, arXiv

Fast Real-time Personalized Speech Enhancement: End-to-End Enhancement Network (E3Net) and Knowledge Distillation

This paper investigates how to improve the runtime speed of personalized speech enhancement (PSE) networks while maintaining the model quality. Our approach includes two aspects: architecture and knowledge distillation (KD). We propose an end-to-end enhancement (E3Net) model architecture, which is $3\times$ faster than a baseline STFT-based model. Besides, we use KD techniques to develop compressed student models without significantly degrading quality. In addition, we investigate using noisy data without reference clean signals for training the student models, where we combine KD with multi-task learning (MTL) using automatic speech recognition (ASR) loss. Our results show that E3Net provides better speech and transcription quality with a lower target speaker over-suppression (TSOS) rate than the baseline model. Furthermore, we show that the KD methods can yield student models that are $2$ to $4\times$ faster than the teacher and provide reasonable quality. Combining KD and MTL improves the ASR and TSOS metrics without degrading the speech quality.
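One way to picture the combination of KD and MTL on noisy data without clean references is a loss in which the teacher's output replaces the missing clean target. The sketch below assumes a mean-squared-error distillation term and a fixed weight on the ASR loss; neither choice is stated in the abstract, and the function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences of samples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kd_mtl_loss(student_out, teacher_out, asr_loss, asr_weight=0.1):
    """Distillation term plus a weighted ASR term.

    student_out, teacher_out: enhanced signals produced from the same noisy input.
    asr_loss: a scalar ASR loss computed on the student output (assumed given).
    asr_weight: assumed trade-off weight, not a value from the paper.
    """
    distillation = mse(student_out, teacher_out)  # no clean reference needed
    return distillation + asr_weight * asr_loss
```

The point of the sketch is only that neither term requires a clean reference signal: the first compares the student against the teacher, and the second depends on a transcript-based loss.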

preprint, 2010, arXiv

Branching structure for an (L-1) random walk in random environment and its applications

By decomposing the random walk path, we construct a multitype branching process with immigration in random environment for the corresponding random walk with bounded jumps in random environment. Then we give two applications of the branching structure. Firstly, we specify the explicit invariant density by a method different from the one used in Brémont [3] and reprove the law of large numbers of the random walk by a method known as "the environment viewed from particles". Secondly, the branching structure enables us to prove a stable limit law, generalizing the result of Kesten-Kozlov-Spitzer [11] for the nearest-neighbor random walk in random environment. As a byproduct, we also prove that the total population of a multitype branching process in random environment with immigration before the first regeneration belongs to the domain of attraction of some κ-stable law.

preprint, 2010, arXiv

Intrinsic branching structure within random walk on $\mathbb{Z}$

In this paper, we reveal the branching structure for a non-homogeneous random walk with bounded jumps. The ladder time $T_1,$ the first hitting time of $[1,\infty)$ by the walk starting from $0,$ can be expressed in terms of a non-homogeneous multitype branching process. As an application of the branching structure, we prove a law of large numbers for the random walk in random environment with bounded jumps and specify the explicit invariant density for the Markov chain of "the environment viewed from the particle". The invariant density and the limit velocity can be expressed explicitly in terms of the environment.
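The ladder time $T_1$ defined in this abstract can be made concrete with a short simulation. The sketch below only illustrates the definition (first time the walk started at $0$ reaches $[1,\infty)$); it does not implement the branching-structure decomposition. The jump set, the site-dependent weights, and all names are hypothetical choices for illustration.

```python
import random

def ladder_time(step_sampler, rng, max_steps=100_000):
    """Return T_1, the first n with X_n >= 1 for a walk started at X_0 = 0.

    step_sampler(position, rng) returns the next jump; it may depend on the
    current position, which is what makes the walk non-homogeneous.
    Returns None if the walk has not reached [1, inf) within max_steps.
    """
    position = 0
    for n in range(1, max_steps + 1):
        position += step_sampler(position, rng)
        if position >= 1:
            return n
    return None

def example_sampler(position, rng):
    """Hypothetical bounded-jump law: jumps in {-2, -1, +1}, weights vary by site."""
    jumps = (-2, -1, 1)
    # Assumed site-dependent weights: even and odd sites use different laws.
    weights = (0.1, 0.2, 0.7) if position % 2 == 0 else (0.05, 0.15, 0.8)
    return rng.choices(jumps, weights=weights, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(2010)
    samples = [ladder_time(example_sampler, rng) for _ in range(1000)]
    finite = [t for t in samples if t is not None]
    print("empirical mean of T_1:", sum(finite) / len(finite))
```

Passing the position into the sampler is the design choice that lets the same loop cover both homogeneous walks and walks in a fixed (quenched) environment.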
