Source author record

Ziqiang Shi

Ziqiang Shi appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


eess.AS, math.OC, Sound, Machine Learning, Information Theory, math.IT, math.NA, Numerical Analysis, Artificial Intelligence, Computer Vision, math.PR, Multimedia

Catalog footprint: 14 works, 12 topics, 4 close collaborators

preprint, 2022, arXiv

ItôTTS and ItôWave: Linear Stochastic Differential Equation Is All You Need For Audio Generation

In this paper, we propose to unify the two aspects of voice synthesis, namely text-to-speech (TTS) and the vocoder, into one framework based on a pair of forward and reverse-time linear stochastic differential equations (SDEs). The solutions of this SDE pair are two stochastic processes. One turns the distribution of the mel spectrogram (or wave) that we want to generate into a simple and tractable distribution. The other is the generation procedure, which turns this simple, tractable signal into the target mel spectrogram (or wave). The model that generates mel spectrograms is called ItôTTS, and the model that generates waves is called ItôWave. Both use the Wiener process as a driver to gradually subtract the excess signal from the noise signal, generating realistic mel spectrograms and audio under the conditional inputs of the original text and mel spectrogram, respectively. Experiments show that the mean opinion scores (MOS) of ItôTTS and ItôWave can exceed those of the current state-of-the-art methods, reaching 3.925$\pm$0.160 and 4.35$\pm$0.115 respectively. The generated audio samples are available at https://wushoule.github.io/ItoAudio/. All authors contributed equally to this work.
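The forward and reverse-time SDE pair described above can be illustrated with a toy one-dimensional sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the forward SDE is an Ornstein-Uhlenbeck process, the "data" is a scalar Gaussian, and the exact score of the Gaussian marginal stands in for the trained network. The names `BETA`, `T_END`, `marginal`, `score`, and `sample_reverse` are hypothetical.

```python
import math
import random

# Illustrative assumptions (not from the paper): an Ornstein-Uhlenbeck
# forward SDE  dx = -0.5*BETA*x dt + sqrt(BETA) dW  and 1-D Gaussian "data".
BETA = 2.0        # constant noise rate (hypothetical)
T_END = 5.0       # horizon long enough that x_T is close to N(0, 1)
DATA_MEAN = 2.0   # toy data distribution N(DATA_MEAN, DATA_STD**2)
DATA_STD = 0.5


def marginal(t):
    """Mean and variance of x_t under the forward SDE for Gaussian data."""
    decay = math.exp(-BETA * t)
    mean = DATA_MEAN * math.sqrt(decay)
    var = DATA_STD ** 2 * decay + (1.0 - decay)
    return mean, var


def score(x, t):
    """Exact score d/dx log p_t(x); a trained network would replace this."""
    mean, var = marginal(t)
    return -(x - mean) / var


def sample_reverse(n_samples=1000, n_steps=200, seed=0):
    """Euler-Maruyama integration of the reverse-time SDE from t=T to t=0."""
    rng = random.Random(seed)
    dt = T_END / n_steps
    samples = []
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)  # start from the simple, tractable prior
        for k in range(n_steps, 0, -1):
            t = k * dt
            # Reverse drift: f(x, t) - g(t)^2 * score(x, t)
            drift = -0.5 * BETA * x - BETA * score(x, t)
            x = x - drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples
```

Running `sample_reverse()` should produce samples whose mean and standard deviation are close to `DATA_MEAN` and `DATA_STD`, up to Monte Carlo and discretization error. In a real system the score would be learned and conditioned on text or a mel spectrogram.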

preprint, 2022, arXiv

ItôWave: Itô Stochastic Differential Equation Is All You Need For Wave Generation

In this paper, we propose a vocoder based on a pair of forward and reverse-time linear stochastic differential equations (SDEs). The solutions of this SDE pair are two stochastic processes. One turns the distribution of the wave that we want to generate into a simple and tractable distribution. The other is the generation procedure, which turns this simple, tractable signal into the target wave. The model is called ItôWave. ItôWave uses the Wiener process as a driver to gradually subtract the excess signal from the noise signal, generating realistic audio under the conditional input of the original mel spectrogram. Experiments show that the mean opinion score (MOS) of ItôWave can exceed those of the current state-of-the-art (SOTA) methods, reaching 4.35$\pm$0.115. The generated audio samples are available online.

preprint, 2020, arXiv

Hodge and Podge: Hybrid Supervised Sound Event Detection with Multi-Hot MixMatch and Composition Consistence Training

In this paper, we propose a method called Hodge and Podge for sound event detection. We demonstrate Hodge and Podge on the dataset of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Challenge Task 4. This task aims to predict the presence or absence and the onset and offset times of sound events in home environments. Sound event detection is challenging due to the lack of large-scale real strongly labeled data. Recently, deep semi-supervised learning (SSL) has proven effective in modeling with weakly labeled and unlabeled data. This work explores how to extend deep SSL to obtain a new, state-of-the-art sound event detection method called Hodge and Podge. With convolutional recurrent neural networks (CRNN) as the backbone network, a multi-scale squeeze-excitation mechanism is first introduced to produce a pyramid squeeze-excitation CRNN. The pyramid squeeze-excitation layer accounts for the fact that different sound events have different durations and adaptively recalibrates channel-wise spectrogram responses. Further, to remedy the lack of real strongly labeled data, we propose multi-hot MixMatch and composition consistency training with temporal-frequency augmentation. Our experiments on the public DCASE 2019 Challenge Task 4 validation data yield an event-based F-score of 43.4%, about 1.6% (absolute) better than state-of-the-art methods in the challenge, while the F-score of the official baseline is 25.8%.

preprint, 2020, arXiv

SingCubic: Cyclic Incremental Newton-type Gradient Descent with Cubic Regularization for Non-Convex Optimization

In this work, we generalize and unify two recent and completely different works, \cite{shi2015large} and \cite{cartis2012adaptive}, by proposing the cyclic incremental Newton-type gradient descent with cubic regularization (SingCubic) method for optimizing non-convex functions. Throughout the iterations of SingCubic, a cubic-regularized global quadratic approximation using Hessian information is maintained and solved. Preliminary numerical experiments show encouraging performance of the SingCubic algorithm compared to basic incremental or stochastic Newton-type implementations. The results and technique can serve as a starting point for research on incremental Newton-type gradient descent methods that employ cubic regularization. The methods and principles proposed in this paper can be applied to logistic regression, autoencoder training, independent component analysis, Ising model/Hopfield network training, multilayer perceptrons, deep convolutional network training, and so on. We will open-source parts of our implementations soon.
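To make the cubic regularization idea concrete, here is a minimal one-dimensional sketch of a cubic-regularized Newton step. It illustrates only the regularized model, not SingCubic's cyclic incremental scheme, and the function names `cubic_step` and `cubic_newton`, the test function cos(x), and the constant `M = 1` are assumptions chosen for illustration. In one dimension the model g*s + 0.5*h*s^2 + (M/6)*|s|^3 has a closed-form global minimizer, which remains well defined when the curvature h is negative.

```python
import math


def cubic_step(g, h, M):
    """Global minimizer of the 1-D model g*s + 0.5*h*s**2 + (M/6)*abs(s)**3."""
    # Step length r = |s| is the positive root of (M/2)*r**2 + h*r - |g| = 0.
    r = (-h + math.sqrt(h * h + 2.0 * M * abs(g))) / M
    # Move against the gradient (direction is arbitrary when g == 0).
    return r if g <= 0 else -r


def cubic_newton(grad, hess, x0, M, iters=50):
    """Repeatedly minimize the cubic-regularized second-order model."""
    x = x0
    for _ in range(iters):
        x += cubic_step(grad(x), hess(x), M)
    return x


# Toy non-convex objective f(x) = cos(x); its third derivative is bounded
# by 1, so M = 1 is a valid Lipschitz constant for the Hessian.
x_min = cubic_newton(lambda x: -math.sin(x), lambda x: -math.cos(x), x0=0.5, M=1.0)
```

Starting at x0 = 0.5, where the curvature is negative and a plain Newton step would head toward a maximizer, the iterates move to the local minimizer at pi.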

preprint, 2020, arXiv

Speech Separation Based on Multi-Stage Elaborated Dual-Path Deep BiLSTM with Auxiliary Identity Loss

Deep neural networks with dual-path bi-directional long short-term memory (BiLSTM) blocks have proven very effective in sequence modeling, especially in speech separation. This work investigates how to extend dual-path BiLSTM to obtain a new state-of-the-art approach, called TasTas, for multi-talker monaural speech separation (a.k.a. the cocktail party problem). TasTas introduces two simple but effective improvements to boost the performance of dual-path BiLSTM based networks: an iterative multi-stage refinement scheme, and a speaker identity consistency loss between the separated speech and the original speech that corrects imperfectly separated speech. TasTas takes the mixed utterance of two speakers and maps it to two separated utterances, each containing only one speaker's voice. Our experiments on the WSJ0-2mix benchmark corpus yield a 20.55 dB SDR improvement, a 20.35 dB SI-SDR improvement, a PESQ of 3.69, and an ESTOI of 94.86%, which shows that the proposed networks can lead to large performance improvements on the speaker separation task. We have open-sourced our re-implementation of DPRNN-TasNet (https://github.com/ShiZiqiang/dual-path-RNNs-DPRNNs-based-speech-separation), and TasTas is built on this implementation, so we believe the results in this paper can be reproduced with ease.
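For readers unfamiliar with the SI-SDR improvement figure quoted above, the following sketch computes scale-invariant SDR using its standard definition on plain Python lists. It is not the evaluation code behind the reported numbers; the function names `si_sdr` and `si_sdr_improvement` and the toy signals are hypothetical.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant signal-to-distortion ratio in dB."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    # Project the estimate onto the reference to remove any gain mismatch.
    scale = dot / ref_energy
    target = [scale * r for r in reference]
    noise = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(n * n for n in noise)
    return 10.0 * math.log10(target_energy / noise_energy)


def si_sdr_improvement(estimate, mixture, reference):
    """Gain of the separated estimate over the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Toy signals: the mixture contains equal-energy interference, the estimate
# keeps only a small residual of it.
reference = [1.0, 0.0, 1.0, 0.0]
mixture = [1.0, 1.0, 1.0, 1.0]
estimate = [1.0, 0.1, 1.0, 0.1]
improvement = si_sdr_improvement(estimate, mixture, reference)
```

Because the estimate is projected onto the reference first, rescaling the estimate by any nonzero constant leaves the score unchanged, which is the point of the scale-invariant variant.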

preprint, 2016, arXiv

A better convergence analysis of the block coordinate descent method for large scale machine learning

This paper considers the unconstrained minimization of large-scale smooth convex functions with block-coordinate-wise Lipschitz continuous gradients. The block coordinate descent (BCD) method is among the first optimization schemes suggested for solving such problems \cite{nesterov2012efficiency}. We obtain a new bound (to the best of our knowledge, the lowest currently known) on the information-based complexity of the BCD method that is $16p^3$ times smaller than the best known bound. The analysis is based on an effective technique called the Performance Estimation Problem (PEP), recently proposed by Drori and Teboulle \cite{drori2012performance} for analyzing the performance of first-order black-box optimization methods. Numerical tests confirm our analysis.
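As a reminder of what a BCD iteration looks like, here is a minimal sketch of cyclic coordinate descent (blocks of size one) on a strongly convex quadratic f(x) = 0.5*x^T A x - b^T x. For this objective the coordinate-wise Lipschitz constant of the gradient is L_i = A[i][i], so the step 1/L_i minimizes exactly along each coordinate. The function name `coordinate_descent` and the toy matrix are assumptions for illustration and do not reproduce the paper's PEP analysis.

```python
def coordinate_descent(A, b, sweeps=100):
    """Cyclic coordinate descent on f(x) = 0.5*x^T A x - b^T x (A symmetric PD)."""
    n = len(b)
    x = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            # Partial derivative of f with respect to x[i].
            grad_i = sum(A[i][j] * x[j] for j in range(n)) - b[i]
            # Step size 1/L_i with coordinate-wise Lipschitz constant L_i = A[i][i].
            x[i] -= grad_i / A[i][i]
    return x


A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x_star = coordinate_descent(A, b)  # approaches the solution of A x = b
```

Each update touches only one row of A, which is what makes the method attractive when the full gradient is expensive.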

preprint, 2016, arXiv

Empirical study of PROXTONE and PROXTONE$^+$ for Fast Learning of Large Scale Sparse Models

PROXTONE is a novel and fast method for optimizing large-scale non-smooth convex problems \cite{shi2015large}. In this work, we apply the PROXTONE method to large-scale \emph{non-smooth non-convex} problems, for example training sparse deep neural networks (sparse DNN) or sparse convolutional neural networks (sparse CNN) for embedded or mobile devices. PROXTONE converges much faster than first-order methods, while first-order methods make it easy to derive and control the sparseness of the solutions. Thus, in applications where sparse models must be trained quickly, we propose to combine the merits of both: we use PROXTONE in the first several epochs to reach the neighborhood of an optimal solution, and then use a first-order method to explore sparsity in the remaining training. We call this method PROXTONE plus (PROXTONE$^+$). Both PROXTONE and PROXTONE$^+$ are tested in our experiments, which demonstrate that both methods converge at least twice as fast on diverse sparse model learning problems while reducing the size of DNN models to 0.5%. The source code of all the algorithms is available upon request.

preprint, 2016, arXiv

Online and stochastic Douglas-Rachford splitting method for large scale machine learning

Online and stochastic learning has emerged as a powerful tool in large-scale optimization. In this work, we generalize the Douglas-Rachford splitting (DRs) method for minimizing composite functions to online and stochastic settings (to the best of our knowledge, this is the first time DRs has been generalized to a sequential version). We first establish an $O(1/\sqrt{T})$ regret bound for the batch DRs method. We then prove that the online DRs splitting method enjoys an $O(1)$ regret bound and that stochastic DRs splitting has a convergence rate of $O(1/\sqrt{T})$. The proof is simple and intuitive, and the results and technique can serve as a starting point for research on large-scale machine learning that employs the DRs method. Numerical experiments demonstrate the effectiveness of the online and stochastic update rules and further confirm our regret and convergence analysis.
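The batch Douglas-Rachford iteration that the paper generalizes can be sketched on a small composite problem. The example below is an assumption made for illustration, not the paper's online or stochastic variant: it minimizes f(x) + g(x) with f(x) = 0.5*||x - a||^2 and g(x) = lam*||x||_1, both of which have closed-form proximal operators. The names `soft_threshold` and `douglas_rachford` and the default `step` are hypothetical.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau*|.| applied to a scalar."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def douglas_rachford(a, lam, step=0.5, iters=200):
    """Batch Douglas-Rachford splitting for 0.5*||x - a||^2 + lam*||x||_1."""
    z = [0.0] * len(a)
    y = list(z)
    for _ in range(iters):
        # x = prox_{step*f}(z), closed form for the quadratic term.
        x = [(zi + step * ai) / (1.0 + step) for zi, ai in zip(z, a)]
        # y = prox_{step*g}(2x - z), applied to the reflected point.
        y = [soft_threshold(2.0 * xi - zi, step * lam) for xi, zi in zip(x, z)]
        # Update the governing sequence.
        z = [zi + yi - xi for zi, yi, xi in zip(z, y, x)]
    return y


solution = douglas_rachford([3.0, 0.5, -2.0], lam=1.0)
```

For this separable problem the minimizer is the soft-thresholded vector, so the result can be checked against `[2.0, 0.0, -1.0]`. Returning the output of the l1 proximal step keeps the zero entries exactly zero.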

preprint, 2014, arXiv

Identifiability of multivariate logistic mixture models

Mixture models have been widely used in modeling continuous observations. Identifiability is a necessary condition for consistently estimating the parameters of a mixture model from observations of the mixture. In this study, we give some results on the identifiability of multivariate logistic mixture models.

preprint, 2014, arXiv

Online and Stochastic Universal Gradient Methods for Minimizing Regularized Hölder Continuous Finite Sums

Online and stochastic gradient methods have emerged as potent tools in large-scale optimization for both smooth convex and nonsmooth convex problems, from the classes $C^{1,1}(\mathbb{R}^p)$ and $C^{1,0}(\mathbb{R}^p)$ respectively. However, to the best of our knowledge, few papers use incremental gradient methods to optimize the intermediate classes of convex problems with Hölder continuous functions $C^{1,v}(\mathbb{R}^p)$. To fill the gap between methods for smooth and nonsmooth problems, we propose several online and stochastic universal gradient methods, which do not require knowing the actual degree of smoothness of the objective function in advance. We expand the scope of the problems involved in machine learning to Hölder continuous functions and propose a general family of first-order methods. Regret and convergence analyses show that our methods enjoy strong theoretical guarantees. For the first time, we establish an algorithm that enjoys a linear convergence rate for convex functions that have Hölder continuous gradients.

preprint, 2014, arXiv

Proximal Stochastic Newton-type Gradient Descent Methods for Minimizing Regularized Finite Sums

In this work, we generalize and unify two recent and completely different works, by Jascha \cite{sohl2014fast} and Lee \cite{lee2012proximal}, by proposing the proximal stochastic Newton-type gradient (PROXTONE) method for optimizing the sum of two convex functions: one is the average of a huge number of smooth convex functions, and the other is a non-smooth convex function. While a set of recently proposed proximal stochastic gradient methods, including MISO, Prox-SDCA, Prox-SVRG, and SAG, converge at linear rates, PROXTONE incorporates second-order information to obtain stronger convergence results: it achieves a linear convergence rate not only in the value of the objective function, but also in the \emph{solution}. The proof is simple and intuitive, and the results and technique can serve as a starting point for research on proximal stochastic methods that employ second-order information.
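The composite structure described above (a smooth term plus a non-smooth regularizer) is easiest to see in the plain first-order proximal gradient method, which is the baseline that PROXTONE augments with second-order information. The sketch below is that first-order baseline only, not PROXTONE itself; it minimizes 0.5*||Xw - y||^2 + lam*||w||_1, and the names `proximal_gradient` and `soft_threshold` and the toy data are assumptions for illustration.

```python
def soft_threshold(v, tau):
    """Proximal operator of tau*|.| applied to a scalar."""
    if v > tau:
        return v - tau
    if v < -tau:
        return v + tau
    return 0.0


def proximal_gradient(X, y, lam, step, iters=200):
    """First-order proximal gradient for 0.5*||Xw - y||^2 + lam*||w||_1."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(iters):
        # Gradient of the smooth part: X^T (Xw - y).
        residual = [sum(X[i][j] * w[j] for j in range(p)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * residual[i] for i in range(n)) for j in range(p)]
        # Gradient step followed by the prox of the non-smooth l1 term.
        w = [soft_threshold(w[j] - step * grad[j], step * lam) for j in range(p)]
    return w


# Toy diagonal design; step = 1/L with L = 4, the largest eigenvalue of X^T X.
X = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.2, -2.0]
w_hat = proximal_gradient(X, y, lam=1.0, step=0.25)
```

The proximal step sets the weakly supported second coefficient exactly to zero. A proximal Newton-type method would replace the scalar `step` with a Hessian-based scaling of the smooth term.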

preprint, 2012, arXiv

Guarantees of Augmented Trace Norm Models in Tensor Recovery

This paper studies the recovery guarantees of models that minimize $\|\mathcal{X}\|_*+\frac{1}{2\alpha}\|\mathcal{X}\|_F^2$, where $\mathcal{X}$ is a tensor and $\|\mathcal{X}\|_*$ and $\|\mathcal{X}\|_F$ are the trace and Frobenius norms of $\mathcal{X}$, respectively. We show that they can efficiently recover low-rank tensors. In particular, they enjoy exact guarantees similar to those known for minimizing $\|\mathcal{X}\|_*$ under conditions on the sensing operator such as its null-space property, restricted isometry property, or spherical section property. To recover a low-rank tensor $\mathcal{X}^0$, minimizing $\|\mathcal{X}\|_*+\frac{1}{2\alpha}\|\mathcal{X}\|_F^2$ returns the same solution as minimizing $\|\mathcal{X}\|_*$ almost whenever $\alpha\geq 10\max_{i}\|X^0_{(i)}\|_2$.

preprint, 2011, arXiv

Online Learning for Classification of Low-rank Representation Features and Its Applications in Audio Segment Classification

In this paper, a novel framework based on trace norm minimization for audio segment classification is proposed. In this framework, both feature extraction and classification are obtained by solving corresponding convex optimization problems with trace norm regularization. For feature extraction, robust principal component analysis (robust PCA), via minimization of a combination of the nuclear norm and the $\ell_1$-norm, is used to extract low-rank features that are robust to white noise and gross corruption in audio segments. These low-rank features are fed to a linear classifier whose weight and bias are learned by solving similar trace norm constrained problems. Most methods for this classifier find the weight and bias in batch-mode learning, which makes them inefficient for large-scale problems. In this paper, we propose an online framework using the accelerated proximal gradient method. The main advantage of this framework is its memory cost. In addition, as a result of the regularization formulation of matrix classification, the Lipschitz constant is given explicitly, so the step size estimation of the general proximal gradient method is omitted in our approach. Experiments on real data sets for laugh/non-laugh and applause/non-applause classification indicate that this novel framework is effective and noise robust.

preprint, 2011, arXiv

Trace Norm Regularized Tensor Classification and Its Online Learning Approaches

In this paper, we propose an algorithm to classify tensor data. Our methodology builds on recent studies of matrix classification with a trace norm constrained weight matrix and of the tensor trace norm. Similar to matrix classification, tensor classification is formulated as a convex optimization problem that can be solved using the off-the-shelf accelerated proximal gradient (APG) method. However, unlike in the matrix case, there is no analytic solution for updating the weight tensors via the proximal gradient. To tackle this problem, the Douglas-Rachford splitting technique and the alternating direction method of multipliers (ADM) used in tensor completion are adapted to update the weight tensors. Furthermore, motivated by the demands of real applications, we also propose online learning approaches. Experiments demonstrate the efficiency of the methods.
