Source author record

Simon King

Simon King appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computation and Language · Neural and Evolutionary Computing · Sound · eess.AS · Machine Learning · math.AC · math.CO · math.GR · math.GT · math.RA

Catalog footprint

What is connected

9 works

10 topics

4 close collaborators

preprint · 2020 · arXiv

Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis

We aim to characterize how different speakers contribute to the perceived output quality of multi-speaker Text-to-Speech (TTS) synthesis. We automatically rate the quality of TTS using a neural network (NN) trained on human mean opinion score (MOS) ratings. First, we train and evaluate our NN model on 13 different TTS and voice conversion (VC) systems from the ASVSpoof 2019 Logical Access (LA) Dataset. Since it is not known how best to represent speech for this task, we compare 8 different representations alongside MOSNet frame-based features. Our representations include image-based spectrogram features and x-vector embeddings that explicitly model different types of noise such as T60 reverberation time. Our NN predicts MOS with a high correlation to human judgments. We report prediction correlation and error. A key finding is the quality achieved for certain speakers seems consistent, regardless of the TTS or VC system. It is widely accepted that some speakers give higher quality than others for building a TTS system: our method provides an automatic way to identify such speakers. Finally, to see if our quality prediction models generalize, we predict quality scores for synthetic speech using a separate multi-speaker TTS system that was trained on LibriTTS data, and conduct our own MOS listening test to compare human ratings with our NN predictions.
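The abstract above reports prediction correlation and error without naming the exact metrics. As a minimal sketch, assuming Pearson correlation and mean squared error (common choices for MOS prediction, but not confirmed by the source), the comparison between predicted and human ratings could look like this. The function names and the sample scores are hypothetical and purely illustrative.

```python
import math
from typing import Sequence


def pearson_correlation(predicted: Sequence[float], human: Sequence[float]) -> float:
    """Pearson correlation between predicted and human MOS values."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_h = sum(human) / n
    cov = sum((p - mean_p) * (h - mean_h) for p, h in zip(predicted, human))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_h = sum((h - mean_h) ** 2 for h in human)
    return cov / math.sqrt(var_p * var_h)


def mean_squared_error(predicted: Sequence[float], human: Sequence[float]) -> float:
    """Average squared difference between predicted and human MOS values."""
    return sum((p - h) ** 2 for p, h in zip(predicted, human)) / len(predicted)


# Hypothetical per-utterance scores on the 1-5 MOS scale (not from the paper).
human_mos = [3.8, 2.1, 4.4, 3.0]
predicted_mos = [3.6, 2.5, 4.1, 3.2]

print(pearson_correlation(predicted_mos, human_mos))
print(mean_squared_error(predicted_mos, human_mos))
```

Reporting both numbers is useful because a predictor can rank systems or speakers correctly (high correlation) while still being offset from the human scale (high error).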

preprint · 2016 · arXiv

DNN-based Speech Synthesis for Indian Languages from ASCII text

Text-to-Speech synthesis in Indian languages has seen a lot of progress over the decade, partly due to the annual Blizzard challenges. These systems assume the text to be written in Devanagari or Dravidian scripts, which are nearly phonemic orthography scripts. However, the most common form of computer interaction among Indians is ASCII written transliterated text. Such text is generally noisy with many variations in spelling for the same word. In this paper we evaluate three approaches to synthesize speech from such noisy ASCII text: a naive Uni-Grapheme approach, a Multi-Grapheme approach, and a supervised Grapheme-to-Phoneme (G2P) approach. These methods first convert the ASCII text to a phonetic script, and then learn a Deep Neural Network to synthesize speech from that. We train and test our models on Blizzard Challenge datasets that were transliterated to ASCII using crowdsourcing. Our experiments on Hindi, Tamil and Telugu demonstrate that our models generate speech of competitive quality from ASCII text compared to the speech synthesized from the native scripts. All the accompanying transliterated datasets are released for public access.

preprint · 2016 · arXiv

Improving Trajectory Modelling for DNN-based Speech Synthesis by using Stacked Bottleneck Features and Minimum Generation Error Training

We propose two novel techniques, stacking bottleneck features and a minimum generation error training criterion, to improve the performance of deep neural network (DNN)-based speech synthesis. The techniques address the related issues of frame-by-frame independence and ignorance of the relationship between static and dynamic features, within current typical DNN-based synthesis frameworks. Stacking bottleneck features, which are an acoustically informed linguistic representation, provides an efficient way to include more detailed linguistic context at the input. The minimum generation error training criterion minimises overall output trajectory error across an utterance, rather than minimising the error per frame independently, and thus takes into account the interaction between static and dynamic features. The two techniques can be easily combined to further improve performance. We present both objective and subjective results that demonstrate the effectiveness of the proposed techniques. The subjective results show that combining the two techniques leads to significantly more natural synthetic speech than from conventional DNN or long short-term memory (LSTM) recurrent neural network (RNN) systems.

preprint · 2016 · arXiv

Investigating gated recurrent neural networks for speech synthesis

Recently, recurrent neural networks (RNNs) as powerful sequence models have re-emerged as a potential acoustic model for statistical parametric speech synthesis (SPSS). The long short-term memory (LSTM) architecture is particularly attractive because it addresses the vanishing gradient problem in standard RNNs, making them easier to train. Although recent studies have demonstrated that LSTMs can achieve significantly better performance on SPSS than deep feed-forward neural networks, little is known about why. Here we attempt to answer two questions: a) why do LSTMs work well as a sequence model for SPSS; b) which component (e.g., input gate, output gate, forget gate) is most important. We present a visual analysis alongside a series of experiments, resulting in a proposal for a simplified architecture. The simplified architecture has significantly fewer parameters than an LSTM, thus reducing generation complexity considerably without degrading quality.

preprint · 2016 · arXiv

Median-Based Generation of Synthetic Speech Durations using a Non-Parametric Approach

This paper proposes a new approach to duration modelling for statistical parametric speech synthesis in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame). Unlike conventional approaches to duration modelling, which assume that duration distributions have a particular form (e.g., a Gaussian) and use the mean of that distribution for synthesis, our approach can in principle model any distribution supported on the non-negative integers. Generation from this model can be performed in many ways; here we consider output generation based on the median predicted duration. The median is more typical (more probable) than the conventional mean duration, is robust to training-data irregularities, and enables incremental generation. Furthermore, a frame-level approach to duration prediction is consistent with a longer-term goal of modelling durations and acoustic features together. Results indicate that the proposed method is competitive with baseline approaches in approximating the median duration of held-out natural speech.
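To illustrate how a median duration can be read off per-frame transition probabilities, here is a minimal sketch. It assumes that frame d carries a probability p_d of leaving the phone at that frame, so that P(duration = d) = p_d × ∏_{t<d} (1 − p_t), and that durations are counted from one frame; the abstract does not spell out these conventions, and the function names are hypothetical rather than taken from the paper.

```python
from typing import Optional, Sequence


def duration_pmf(transition_probs: Sequence[float]) -> list:
    """Probability that the phone lasts exactly d frames, for d = 1..T."""
    pmf = []
    stay = 1.0  # probability of not having left the phone before this frame
    for p in transition_probs:
        pmf.append(stay * p)
        stay *= 1.0 - p
    return pmf


def median_duration(transition_probs: Sequence[float]) -> Optional[int]:
    """Smallest d whose cumulative duration probability reaches 0.5.

    Frames are consumed one at a time, so the answer is available as soon
    as the threshold is crossed, without seeing later frames.
    """
    stay = 1.0
    for d, p in enumerate(transition_probs, start=1):
        stay *= 1.0 - p
        if stay <= 0.5:  # equivalently, P(duration <= d) >= 0.5
            return d
    return None  # median lies beyond the frames provided


# Hypothetical transition probabilities for four consecutive frames.
probs = [0.1, 0.2, 0.5, 0.9]
print(duration_pmf(probs))
print(median_duration(probs))
```

Because the loop stops as soon as the cumulative probability crosses one half, the median can be emitted without knowing the rest of the sequence, which is one way to read the abstract's remark that the median enables incremental generation.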

preprint · 2015 · arXiv

The isomorphism problem for graded algebras and its application to mod-p cohomology rings of small p-groups

The mod-p cohomology ring of a non-trivial finite p-group is an infinite-dimensional, finitely presented graded unital algebra over the field with p elements, with generators in positive degrees. We describe an effective algorithm to test if two such algebras are graded isomorphic. As an application, we determine all graded isomorphisms between the mod-p cohomology rings of all p-groups of order at most 100.

preprint · 2012 · arXiv

Completeness criteria for modular cohomology rings of non prime power groups

We introduce a criterion for the completeness of ring approximations of modular cohomology rings of finite non prime power groups, and discuss how this criterion performs in practical computations, compared with other criteria.

preprint · 2010 · arXiv

A State Sum Link Invariant of Regular Isotopy

This paper has been withdrawn because there is a fundamental error in the computations; with the right computational scheme, the invariant seems to be just a version of the Jones polynomial.

preprint · 2007 · arXiv

Minimal generating sets of non-modular invariant rings of finite groups

It is a classical problem to compute a minimal set of invariant polynomials generating the invariant ring of a finite group as an algebra. We present here an algorithm for the computation of minimal generating sets in the non-modular case. Apart from very few explicit computations of Groebner bases, the algorithm only involves very basic operations, and is thus rather fast. As a test bed for comparative benchmarks, we use transitive permutation groups on 7 and 8 variables. In most examples, our algorithm implemented in Singular works much faster than the one used in Magma, namely by factors between 50 and 1000. We also compute some further examples on more than 8 variables, including a minimal generating set for the natural action of the cyclic group of order 11 in characteristic 0 and of order 15 in characteristic 2. We also apply our algorithm to the computation of irreducible secondary invariants.
