Source author record

David Haws

David Haws appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Topics: math.CO; math.ST; Statistics Theory; eess.AS; Sound; Computation and Language; Computational Engineering, Finance, and Science; Data Structures and Algorithms; Genomics; Machine Learning; Populations and Evolution

Catalog footprint

What is connected

11 works

11 topics

4 close collaborators

preprint, 2022, arXiv

Transplantation of Conversational Speaking Style with Interjections in Sequence-to-Sequence Speech Synthesis

Sequence-to-Sequence Text-to-Speech architectures that directly generate low-level acoustic features from phonetic sequences are known to produce natural and expressive speech when provided with adequate amounts of training data. Such systems can learn and transfer desired speaking styles from one seen speaker to another (in multi-style multi-speaker settings), which is highly desirable for creating scalable and customizable Human-Computer Interaction systems. In this work we explore one-to-many style transfer from a dedicated single-speaker conversational corpus with style nuances and interjections. We elaborate on the corpus design and explore the feasibility of such style transfer when assisted with Voice-Conversion-based data augmentation. In a set of subjective listening experiments, this approach resulted in high-fidelity style transfer with no quality degradation. However, a certain voice persona shift was observed, requiring further improvements in voice conversion.

preprint, 2022, arXiv

VQ-T: RNN Transducers using Vector-Quantized Prediction Network States

Beam search, which is the dominant ASR decoding algorithm for end-to-end models, generates tree-structured hypotheses. However, recent studies have shown that decoding with hypothesis merging can achieve a more efficient search with comparable or better performance. The full context in recurrent networks, though, is not compatible with hypothesis merging. We propose to use vector-quantized long short-term memory units (VQ-LSTM) in the prediction network of RNN transducers. By training the discrete representation jointly with the ASR network, hypotheses can be actively merged for lattice generation. Our experiments on the Switchboard corpus show that the proposed VQ RNN transducers improve ASR performance over transducers with regular prediction networks while also producing denser lattices with a very low oracle word error rate (WER) for the same beam size. Additional language model rescoring experiments also demonstrate the effectiveness of the proposed lattice generation scheme.

preprint, 2015, arXiv

Polyhedral aspects of score equivalence in Bayesian network structure learning

This paper deals with faces and facets of the family-variable polytope and the characteristic-imset polytope, which are special polytopes used in integer linear programming approaches to statistically learn Bayesian network structure. A common form of linear objectives to be maximized in this area leads to the concept of score equivalence (SE), both for linear objectives and for faces of the family-variable polytope. We characterize the linear space of SE objectives and establish a one-to-one correspondence between SE faces of the family-variable polytope, the faces of the characteristic-imset polytope, and standardized supermodular functions. The characterization of SE facets in terms of extremality of the corresponding supermodular function gives an elegant method to verify whether an inequality is SE-facet-defining for the family-variable polytope. We also show that when maximizing an SE objective one can eliminate linear constraints of the family-variable polytope that correspond to non-SE facets. However, a counter-example shows that considering SE facets alone is not enough; one has to consider the linear inequality constraints that correspond to facets of the characteristic-imset polytope despite the fact that they may not define facets in the family-variable mode.

preprint, 2013, arXiv

Markov degree of the three-state toric homogeneous Markov chain model

We consider the three-state toric homogeneous Markov chain model (THMC) without loops and initial parameters. At time $T$, the size of the design matrix is $6 \times 3\cdot 2^{T-1}$ and the convex hull of its columns is the model polytope. We study the behavior of this polytope for $T\geq 3$ and we show that it is defined by 24 facets for all $T\ge 5$. Moreover, we give a complete description of these facets. From this, we deduce that the toric ideal associated with the design matrix is generated by binomials of degree at most 6. Our proof is based on a result due to Sturmfels, who gave a bound on the degree of the generators of a toric ideal, provided the normality of the corresponding toric variety. In our setting, we established the normality of the toric variety associated to the THMC model by studying the geometric properties of the model polytope.
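The stated matrix size can be checked with a small enumeration. The sketch below assumes (this is not spelled out in the abstract) that the columns of the design matrix are indexed by state paths of length $T$ with no self-loops, and that each entry counts how often one of the 6 allowed transitions occurs in the path. The function name `thmc_design_matrix` is hypothetical.

```python
from itertools import product


def thmc_design_matrix(T, states=(0, 1, 2)):
    """Build a transition-count matrix for loop-free paths of length T.

    Assumption: rows are the ordered transitions (i, j) with i != j,
    columns are the paths of T states that never repeat a state in
    consecutive positions.
    """
    # The 6 allowed transitions for three states without self-loops.
    transitions = [(i, j) for i in states for j in states if i != j]
    # All length-T paths in which consecutive states differ.
    paths = [
        w for w in product(states, repeat=T)
        if all(a != b for a, b in zip(w, w[1:]))
    ]
    # Entry (t, w) counts the occurrences of transition t along path w.
    matrix = [
        [sum(1 for step in zip(w, w[1:]) if step == t) for w in paths]
        for t in transitions
    ]
    return transitions, paths, matrix


if __name__ == "__main__":
    for T in (3, 4, 5):
        _, paths, matrix = thmc_design_matrix(T)
        print(T, len(matrix), len(paths), 3 * 2 ** (T - 1))
```

Under these assumptions every column sums to $T-1$, since a path of $T$ states makes exactly $T-1$ transitions.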

preprint, 2013, arXiv

MINT: Mutual Information based Transductive Feature Selection for Genetic Trait Prediction

Whole genome prediction of complex phenotypic traits using high-density genotyping arrays has attracted a great deal of attention, as it is relevant to the fields of plant and animal breeding and genetic epidemiology. As the number of genotypes is generally much bigger than the number of samples, predictive models suffer from the curse-of-dimensionality. The curse-of-dimensionality problem not only affects the computational efficiency of a particular genomic selection method, but can also lead to poor performance, mainly due to correlation among markers. In this work we proposed the first transductive feature selection method based on the MRMR (Max-Relevance and Min-Redundancy) criterion which we call MINT. We applied MINT on genetic trait prediction problems and showed that in general MINT is a better feature selection method than the state-of-the-art inductive method mRMR.

preprint, 2013, arXiv

QuickLexSort: An efficient algorithm for lexicographically sorting nested restrictions of a database

Lexicographical sorting is a fundamental problem with applications to contingency tables, databases, Bayesian networks, and more. A standard method to lexicographically sort general data is to iteratively use a stable sort, that is, a sort which preserves existing orders. Here we present a new method of lexicographical sorting called QuickLexSort. Whereas a stable-sort-based lexicographical sorting algorithm operates from the least important to the most important features, QuickLexSort sorts from the most important to the least important features, refining the sort as it goes. QuickLexSort first requires a one-time modest pre-processing step where each feature of the data set is sorted independently. When lexicographically sorting a database, QuickLexSort (including pre-processing) has comparable running time to using a stable-sort-based approach. For a database with $m$ rows and $n$ columns, and a sorting algorithm running in time $O(m \log m)$, a stable-sort-based lexicographical sort and QuickLexSort will both take time $O(nm \log m)$. However, in many applications one needs to lexicographically sort nested data, e.g. all possible sub-matrices up to a certain cardinality of columns. In such cases we show QuickLexSort gives a performance improvement of a log factor of the database length (rows in matrix) over using a standard stable-sort-based approach. For example, to sort all sub-matrices up to cardinality $k$, QuickLexSort has running time $O(mn^k)$ whereas a stable-sort-based lexicographical sort will take time $O(m \log(m)\, n^k)$. After the pre-processing step that is run only once for the entire matrix, QuickLexSort has a running time linear in the number of nested sub-matrices to sort. We conclude with an application to Bayesian network scoring to detect epistasis using SNP marker data.
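For orientation, the stable-sort baseline that the abstract compares against can be sketched as follows. This is not QuickLexSort itself, only the standard least-to-most-important approach; the function name `lex_sort_stable` and the row/column representation are illustrative assumptions.

```python
def lex_sort_stable(rows, columns):
    """Return row indices in lexicographic order over the given columns.

    Baseline approach: apply a stable sort once per column, from the
    least important column to the most important one. Python's
    list.sort is stable, so ties keep the order set by earlier passes.
    """
    order = list(range(len(rows)))
    for c in reversed(columns):
        order.sort(key=lambda i: rows[i][c])
    return order


if __name__ == "__main__":
    table = [
        (1, 0, 2),
        (0, 1, 1),
        (1, 0, 0),
        (0, 0, 3),
    ]
    # Sort by column 0 first, then column 1, then column 2.
    print(lex_sort_stable(table, [0, 1, 2]))
```

With $n$ columns and an $O(m \log m)$ sort per pass, this costs $O(nm \log m)$, matching the figure quoted above for a single full sort.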

preprint, 2011, arXiv

Degree Bounds for a Minimal Markov Basis for the Three-State Toric Homogeneous Markov Chain Model

We study the three-state toric homogeneous Markov chain model and three special cases of it, namely: (i) when the initial state parameters are constant, (ii) without self-loops, and (iii) when both cases are satisfied at the same time. Using as a key tool a directed multigraph associated to the model, the state-graph, we give a bound on the number of vertices of the polytope associated to the model which does not depend on the time. Based on our computations, we also conjecture the stabilization of the f-vector of the polytope, analyze the normality of the semigroup, and give conjectural bounds on the degree of the Markov bases.

preprint, 2011, arXiv

Estimating the number of zero-one multi-way tables via sequential importance sampling

In 2005, Chen et al. introduced a sequential importance sampling (SIS) procedure to analyze zero-one two-way tables with given fixed marginal sums (row and column sums) via the conditional Poisson (CP) distribution. They showed that compared with Markov chain Monte Carlo (MCMC)-based approaches, their importance sampling method is more efficient in terms of running time and also provides an easy and accurate estimate of the total number of contingency tables with fixed marginal sums. In this paper we extend their result to zero-one multi-way ($d$-way, $d \geq 2$) contingency tables under the no $d$-way interaction model, i.e., with fixed $d - 1$ marginal sums. We also show by simulations that the SIS procedure with CP distribution to estimate the number of zero-one three-way tables under the no three-way interaction model given marginal sums works very well even with some rejections. We also applied our method to Samson's monks' data set. We end with further questions on the SIS procedure on zero-one multi-way tables.
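The idea of estimating a table count by importance sampling can be illustrated for the two-way case. The sketch below is a simplification: it swaps in a uniform proposal over the rows that still have capacity, not the conditional Poisson proposal of Chen et al., and it treats dead ends as rejections with weight zero. The function name and parameters are hypothetical.

```python
import math
import random


def sis_count_binary_tables(row_sums, col_sums, n_samples=2000, seed=0):
    """Estimate the number of 0-1 tables with the given margins.

    Columns are filled one at a time. For a column with sum c, the c
    rows receiving a one are drawn uniformly from the rows that still
    have remaining capacity (an assumed proposal, not the CP one).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        remaining = list(row_sums)
        weight = 1.0  # running value of 1 / q(table)
        for c in col_sums:
            open_rows = [i for i, r in enumerate(remaining) if r > 0]
            if len(open_rows) < c:
                weight = 0.0  # rejection: the column cannot be filled
                break
            for i in rng.sample(open_rows, c):
                remaining[i] -= 1
            # Uniform choice of c rows has probability 1 / C(k, c).
            weight *= math.comb(len(open_rows), c)
        if any(remaining):
            weight = 0.0  # rejection: row sums were not met
        total += weight
    return total / n_samples


if __name__ == "__main__":
    print(sis_count_binary_tables([1, 1, 1], [1, 1, 1]))
    print(sis_count_binary_tables([2, 1, 1], [2, 1, 1]))
```

Averaging the inverse proposal probabilities gives an unbiased estimate because every valid table can be reached by this column-wise filling; rejected samples simply contribute zero, which is where a poor proposal loses efficiency.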

preprint, 2011, arXiv

On polyhedral approximations of polytopes for learning Bayes nets

We review three vector encodings of Bayesian network structures. The first one has recently been applied by Jaakkola 2010, the other two use special integral vectors formerly introduced, called imsets [Studeny 2005, Studeny 2010]. The central topic is the comparison of outer polyhedral approximations of the corresponding polytopes. We show how to transform the inequalities suggested by Jaakkola et al. to the framework of imsets. The result of our comparison is the observation that the implicit polyhedral approximation of the standard imset polytope suggested in [Studeny 2011] gives a closer approximation than the (transformed) explicit polyhedral approximation from [Jaakkola 2010]. Finally, we confirm a conjecture from [Studeny 2011] that the above-mentioned implicit polyhedral approximation of the standard imset polytope is an LP relaxation of the polytope.

preprint, 2011, arXiv

Semigroups and sequential importance sampling for multiway tables and beyond

When an interval of integers between the lower bound $l_i$ and the upper bound $u_i$ is the support of the marginal distribution $n_i \mid (n_{i-1}, \ldots, n_1)$, Chen et al. 2005 noticed that sampling from the interval at each step, for $n_i$ during the sequential importance sampling (SIS) procedure, always produces a table which satisfies the marginal constraints. However, in general, the interval may not be equal to the support of the marginal distribution. In this case, the SIS procedure may produce tables which do not satisfy the marginal constraints, leading to rejection [Chen et al. 2006]. Rejecting tables is computationally expensive, and incorrect proposal distributions result in biased estimators for the number of tables given its marginal sums. This paper has three focuses: (1) we propose a correction coefficient which corrects an interval of integers between the lower bound $l_i$ and the upper bound $u_i$ to the support of the marginal distribution asymptotically even with rejections and with the same time complexity as the original SIS procedure; (2) using univariate and bivariate logistic regression models, we present extensive experiments on simulated data sets for estimating the number of tables; and (3) we applied the volume test proposed by Diaconis and Efron 1985 on $2 \times 2 \times 6$ randomly generated tables to compare the performance of SIS versus MCMC. When estimating the number of tables in our simulation study, we used univariate and bivariate logistic regression models since under these models the SIS procedure seems to have a higher rate of rejections even with small tables. We also apply our correction coefficients to data sets on coronary heart disease and occurrence of esophageal cancer.

preprint, 2010, arXiv

Statistical Phylogenetic Tree Analysis Using Differences of Means

We propose a statistical method to test whether two phylogenetic trees with given alignments are significantly incongruent. Our method compares the two distributions of phylogenetic trees given by the input alignments, instead of comparing point estimates of trees. This statistical approach can be applied to gene tree analysis, for example, detecting unusual events in genome evolution such as horizontal gene transfer and reshuffling. Our method uses difference of means to compare two distributions of trees, after embedding trees in a vector space. Bootstrapping alignment columns can then be applied to obtain p-values. To compute distances between means, we employ a "kernel trick" which speeds up distance calculations when trees are embedded in a high-dimensional feature space, e.g. splits or quartets feature space. In this pilot study, we first test our statistical method's ability to distinguish between sets of gene trees generated under coalescence models with species trees of varying dissimilarity. We follow our simulation results with applications to various data sets of gophers and lice, grasses and their endophytes, and different fungal genes from the same genome. A companion toolkit, Phylotree, is provided to facilitate computational experiments.
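The kernel trick mentioned above rests on the identity $\|\bar{x} - \bar{y}\|^2 = \overline{k(x,x')} + \overline{k(y,y')} - 2\,\overline{k(x,y)}$, where the bars denote averages over the two samples. A minimal sketch follows, assuming (for illustration only) that each tree is represented as a set of split labels and that the kernel counts shared splits, which equals the dot product of the 0/1 split-indicator vectors. It is not the Phylotree implementation, and the names are hypothetical.

```python
def split_kernel(a, b):
    """Dot product of 0/1 split-indicator vectors, via set intersection."""
    return len(a & b)


def squared_mean_distance(xs, ys, kernel=split_kernel):
    """Squared distance between the feature-space means of two samples.

    Uses only pairwise kernel values, so the high-dimensional feature
    vectors are never built explicitly.
    """
    def avg(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))

    return avg(xs, xs) + avg(ys, ys) - 2 * avg(xs, ys)


if __name__ == "__main__":
    # Toy four-taxon trees, each described by its single nontrivial split.
    sample_1 = [frozenset({"AB|CD"}), frozenset({"AC|BD"})]
    sample_2 = [frozenset({"AB|CD"})]
    print(squared_mean_distance(sample_1, sample_2))
```

In the toy example the explicit means are $(0.5, 0.5)$ and $(1, 0)$ over the two splits, so the squared distance is $0.5$, which the kernel computation reproduces without forming those vectors.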
