Researcher profile

Ziqiang Shi

Ziqiang Shi contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 19 - UnverifiedVerification L1Unclaimed author
5works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

5 published item(s)

preprint2022arXiv

ItôTTS and ItôWave: Linear Stochastic Differential Equation Is All You Need For Audio Generation

In this paper, we propose to unify the two aspects of voice synthesis, namely text-to-speech (TTS) and vocoder, into one framework based on a pair of forward and reverse-time linear stochastic differential equations (SDE). The solutions of this SDE pair are two stochastic processes, one of which turns the distribution of mel spectrogram (or wave), that we want to generate, into a simple and tractable distribution. The other is the generation procedure that turns this tractable simple signal into the target mel spectrogram (or wave). The model that generates mel spectrogram is called ItôTTS, and the model that generates wave is called ItôWave. ItôTTS and ItôWave use the Wiener process as a driver to gradually subtract the excess signal from the noise signal to generate realistic corresponding meaningful mel spectrogram and audio respectively, under the conditional inputs of original text or mel spectrogram. The results of the experiment show that the mean opinion scores (MOS) of ItôTTS and ItôWave can exceed the current state-of-the-art methods, and reached 3.925$\pm$0.160 and 4.35$\pm$0.115 respectively. The generated audio samples are available at https://wushoule.github.io/ItoAudio/. All authors contribute equally to this work.

preprint2022arXiv

ItôWave: Itô Stochastic Differential Equation Is All You Need For Wave Generation

In this paper, we propose a vocoder based on a pair of forward and reverse-time linear stochastic differential equations (SDE). The solutions of this SDE pair are two stochastic processes, one of which turns the distribution of wave, that we want to generate, into a simple and tractable distribution. The other is the generation procedure that turns this tractable simple signal into the target wave. The model is called ItôWave. ItôWave use the Wiener process as a driver to gradually subtract the excess signal from the noise signal to generate realistic corresponding meaningful audio respectively, under the conditional inputs of original mel spectrogram. The results of the experiment show that the mean opinion scores (MOS) of ItôWave can exceed the current state-of-the-art (SOTA) methods, and reached 4.35$\pm$0.115. The generated audio samples are available online.

preprint2020arXiv

Hodge and Podge: Hybrid Supervised Sound Event Detection with Multi-Hot MixMatch and Composition Consistence Training

In this paper, we propose a method called Hodge and Podge for sound event detection. We demonstrate Hodge and Podge on the dataset of Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Challenge Task 4. This task aims to predict the presence or absence and the onset and offset times of sound events in home environments. Sound event detection is challenging due to the lack of large scale real strongly labeled data. Recently deep semi-supervised learning (SSL) has proven to be effective in modeling with weakly labeled and unlabeled data. This work explores how to extend deep SSL to result in a new, state-of-the-art sound event detection method called Hodge and Podge. With convolutional recurrent neural networks (CRNN) as the backbone network, first, a multi-scale squeeze-excitation mechanism is introduced and added to generate a pyramid squeeze-excitation CRNN. The pyramid squeeze-excitation layer can pay attention to the issue that different sound events have different durations, and to adaptively recalibrate channel-wise spectrogram responses. Further, in order to remedy the lack of real strongly labeled data problem, we propose multi-hot MixMatch and composition consistency training with temporal-frequency augmentation. Our experiments with the public DCASE2019 challenge task 4 validation data resulted in an event-based F-score of 43.4\%, and is about absolutely 1.6\% better than state-of-the-art methods in the challenge. While the F-score of the official baseline is 25.8\%.

preprint2020arXiv

SingCubic: Cyclic Incremental Newton-type Gradient Descent with Cubic Regularization for Non-Convex Optimization

In this work, we generalized and unified two recent completely different works of~\cite{shi2015large} and~\cite{cartis2012adaptive} respectively into one by proposing the cyclic incremental Newton-type gradient descent with cubic regularization (SingCubic) method for optimizing non-convex functions. Through the iterations of SingCubic, a cubic regularized global quadratic approximation using Hessian information is kept and solved. Preliminary numerical experiments show the encouraging performance of the SingCubic algorithm when compared to basic incremental or stochastic Newton-type implementations. The results and technique can be served as an initiate for the research on the incremental Newton-type gradient descent methods that employ cubic regularization. The methods and principles proposed in this paper can be used to do logistic regression, autoencoder training, independent components analysis, Ising model/Hopfield network training, multilayer perceptron, deep convolutional network training and so on. We will open-source parts of our implementations soon.

preprint2020arXiv

Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss

Deep neural network with dual-path bi-directional long short-term memory (BiLSTM) block has been proved to be very effective in sequence modeling, especially in speech separation. This work investigates how to extend dual-path BiLSTM to result in a new state-of-the-art approach, called TasTas, for multi-talker monaural speech separation (a.k.a cocktail party problem). TasTas introduces two simple but effective improvements, one is an iterative multi-stage refinement scheme, and the other is to correct the speech with imperfect separation through a loss of speaker identity consistency between the separated speech and original speech, to boost the performance of dual-path BiLSTM based networks. TasTas takes the mixed utterance of two speakers and maps it to two separated utterances, where each utterance contains only one speaker's voice. Our experiments on the notable benchmark WSJ0-2mix data corpus result in 20.55dB SDR improvement, 20.35dB SI-SDR improvement, 3.69 of PESQ, and 94.86\% of ESTOI, which shows that our proposed networks can lead to big performance improvement on the speaker separation task. We have open sourced our re-implementation of the DPRNN-TasNet here (https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation), and our TasTas is realized based on this implementation of DPRNN-TasNet, it is believed that the results in this paper can be reproduced with ease.