Researcher profile

Zhenhua Ling

Zhenhua Ling contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
7 works
0 followers
7 topics
4 close collaborators

Published work

7 published items

preprint | 2022 | arXiv

Cognitive Diagnosis with Explicit Student Vector Estimation and Unsupervised Question Matrix Learning

Cognitive diagnosis is an essential task in many educational applications. Many solutions have been designed in the literature. The deterministic input, noisy "and" gate (DINA) model is a classical cognitive diagnosis model and can provide interpretable cognitive parameters, e.g., student vectors. However, the assumption of the probabilistic part of DINA is too strong, because it assumes that the slip and guess rates of questions are student-independent. Besides, the question matrix (i.e., Q-matrix) recording the skill distribution of the questions in the cognitive diagnosis domain often requires precise labels given by domain experts. Thus, we propose an explicit student vector estimation (ESVE) method to estimate the student vectors of DINA with a local self-consistent test, which does not rely on any assumptions about the probabilistic part of DINA. Then, based on the estimated student vectors, the probabilistic part of DINA can be modified into a student-dependent model in which the slip and guess rates depend on the student vectors. Furthermore, we propose an unsupervised method called heuristic bidirectional calibration algorithm (HBCA) to label the Q-matrix automatically, which connects the question difficulty relation and the answer results for initialization and uses the fault tolerance of ESVE-DINA for calibration. The experimental results on two real-world datasets show that ESVE-DINA outperforms the DINA model on accuracy and that the Q-matrix labeled automatically by HBCA can achieve performance comparable to that obtained with the manually labeled Q-matrix when using the same model structure.
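For readers unfamiliar with DINA, the sketch below shows the standard, student-independent response rule that the abstract criticizes. It is not the ESVE or HBCA method from the paper. The function name and the slip and guess values are assumptions chosen for illustration.

```python
import numpy as np

def dina_correct_probability(alpha, q, slip, guess):
    """Probability of a correct answer under the standard DINA model.

    alpha: binary skill-mastery vector of one student, shape (K,)
    q:     binary Q-matrix row of one question, shape (K,)
    slip, guess: per-question rates, shared by all students in standard DINA
    """
    alpha = np.asarray(alpha)
    q = np.asarray(q)
    # eta is 1 only if the student has mastered every skill the question needs.
    eta = int(np.all(alpha >= q))
    # A capable student answers correctly unless they slip;
    # an incapable student answers correctly only by guessing.
    return (1.0 - slip) if eta == 1 else guess

# Hypothetical example values, not taken from the paper.
p_mastered = dina_correct_probability([1, 1, 0], [1, 1, 0], slip=0.1, guess=0.2)
p_missing = dina_correct_probability([1, 0, 0], [1, 1, 0], slip=0.1, guess=0.2)
```

Because `slip` and `guess` are passed per question and never per student, two students with the same mastery vector always get the same probability, which is the assumption the abstract describes as too strong.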

preprint | 2022 | arXiv

Decoupled Pronunciation and Prosody Modeling in Meta-Learning-Based Multilingual Speech Synthesis

This paper presents a method of decoupled pronunciation and prosody modeling to improve the performance of meta-learning-based multilingual speech synthesis. The baseline meta-learning synthesis method adopts a single text encoder with a parameter generator conditioned on language embeddings and a single decoder to predict mel-spectrograms for all languages. In contrast, our proposed method designs a two-stream model structure that contains two encoders and two decoders for pronunciation and prosody modeling, respectively, considering that the pronunciation knowledge and the prosody knowledge should be shared in different ways among languages. In our experiments, our proposed method effectively improved the intelligibility and naturalness of multilingual speech synthesis compared with the baseline meta-learning synthesis method.

preprint | 2022 | arXiv

Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis

In this paper, we propose a method of speaker adaption with intuitive prosodic features for statistical parametric speech synthesis. The intuitive prosodic features employed in this method include pitch, pitch range, speech rate and energy, considering that they are directly related to the overall prosodic characteristics of different speakers. The intuitive prosodic features are extracted at utterance-level or speaker-level, and are further integrated into the existing speaker-encoding-based and speaker-embedding-based adaptation frameworks respectively. The acoustic models are sequence-to-sequence ones based on Tacotron2. Intuitive prosodic features are concatenated with text encoder outputs and speaker vectors for decoding acoustic features. Experimental results have demonstrated that our proposed methods can achieve better objective and subjective performance than the baseline methods without intuitive prosodic features. Besides, the proposed speaker adaption method with utterance-level prosodic features has achieved the best similarity of synthetic speech among all compared methods.
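The concatenation step mentioned in the abstract can be pictured with a small NumPy sketch. This is only an illustration of broadcasting fixed-size vectors along the time axis before concatenation; the function name, the dimensions, and the feature values are all assumptions, and the paper's actual Tacotron2-based model is not reproduced here.

```python
import numpy as np

def build_decoder_inputs(encoder_outputs, speaker_vector, prosody_features):
    """Concatenate per-step encoder outputs with utterance-level vectors.

    encoder_outputs:  (T, D_enc) text encoder outputs
    speaker_vector:   (D_spk,)   one vector per speaker or utterance
    prosody_features: (D_pro,)   e.g. pitch, pitch range, speech rate, energy
    """
    num_steps = encoder_outputs.shape[0]
    # Repeat the fixed-size vectors at every encoder time step.
    speaker_tiled = np.tile(speaker_vector, (num_steps, 1))
    prosody_tiled = np.tile(prosody_features, (num_steps, 1))
    return np.concatenate([encoder_outputs, speaker_tiled, prosody_tiled], axis=-1)

# Hypothetical sizes: 5 encoder steps, 8-dim encoder, 4-dim speaker vector.
encoder_outputs = np.zeros((5, 8))
speaker_vector = np.ones(4)
prosody_features = np.array([0.5, 0.1, 1.2, 0.8])  # placeholder values
decoder_inputs = build_decoder_inputs(encoder_outputs, speaker_vector, prosody_features)
```

Here every time step carries the same four prosodic values, so the decoder sees them as a global condition rather than a frame-level signal.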

preprint | 2020 | arXiv

Online Speaker Adaptation for WaveNet-based Neural Vocoders

In this paper, we propose an online speaker adaptation method for WaveNet-based neural vocoders in order to improve their performance on speaker-independent waveform generation. In this method, a speaker encoder that can extract a speaker embedding vector from an utterance pronounced by an arbitrary speaker is first constructed using a large speaker-verification dataset. At the training stage, a speaker-aware WaveNet vocoder that adopts both acoustic feature sequences and speaker embedding vectors as conditions is then built using a multi-speaker dataset. At the generation stage, we first feed the acoustic feature sequence from a test speaker into the speaker encoder to obtain the speaker embedding vector of the utterance. Then, both the speaker embedding vector and acoustic features pass through the speaker-aware WaveNet vocoder to reconstruct speech waveforms. Experimental results demonstrate that our method can achieve better objective and subjective performance on reconstructing waveforms of unseen speakers than the conventional speaker-independent WaveNet vocoder.

preprint | 2020 | arXiv

Predictions of Subjective Ratings and Spoofing Assessments of Voice Conversion Challenge 2020 Submissions

The Voice Conversion Challenge 2020 is the third edition under its flagship that promotes intra-lingual semiparallel and cross-lingual voice conversion (VC). While the primary evaluation of the challenge submissions was done through crowd-sourced listening tests, we also performed an objective assessment of the submitted systems. The aim of the objective assessment is to provide complementary performance analysis that may be more beneficial than the time-consuming listening tests. In this study, we examined five types of objective assessments using automatic speaker verification (ASV), neural speaker embeddings, spoofing countermeasures, predicted mean opinion scores (MOS), and automatic speech recognition (ASR). Each of these objective measures assesses the VC output along different aspects. We observed that the correlations of these objective assessments with the subjective results were high for ASV, neural speaker embedding, and ASR, which makes them more influential for predicting subjective test results. In addition, we performed spoofing assessments on the submitted systems and identified some of the VC methods showing a potentially high security risk.

preprint | 2020 | arXiv

Voice Conversion Challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion

The voice conversion challenge is a bi-annual scientific event held to compare and understand different voice conversion (VC) systems built on a common dataset. In 2020, we organized the third edition of the challenge and constructed and distributed a new database for two tasks, intra-lingual semi-parallel and cross-lingual VC. After a two-month challenge period, we received 33 submissions, including 3 baselines built on the database. From the results of crowd-sourced listening tests, we observed that VC methods have progressed rapidly thanks to advanced deep learning methods. In particular, speaker similarity scores of several systems turned out to be as high as target speakers in the intra-lingual semi-parallel VC task. However, we confirmed that none of them have achieved human-level naturalness yet for the same task. The cross-lingual conversion task is, as expected, a more difficult task, and the overall naturalness and similarity scores were lower than those for the intra-lingual conversion task. However, we observed encouraging results, and the MOS scores of the best systems were higher than 4.0. We also show a few additional analysis results to aid in understanding cross-lingual VC better.

preprint | 2017 | arXiv

Enhanced LSTM for Natural Language Inference

Reasoning and inference are central to human and artificial intelligence. Modeling inference in human language is very challenging. With the availability of large annotated data (Bowman et al., 2015), it has recently become feasible to train neural network based inference models, which have been shown to be very effective. In this paper, we present a new state-of-the-art result, achieving the accuracy of 88.6% on the Stanford Natural Language Inference Dataset. Unlike the previous top models that use very complicated network architectures, we first demonstrate that carefully designing sequential inference models based on chain LSTMs can outperform all previous models. Based on this, we further show that by explicitly considering recursive architectures in both local inference modeling and inference composition, we achieve additional improvement. Particularly, incorporating syntactic parsing information contributes to our best result: it further improves the performance even when added to the already very strong model.
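The abstract does not spell out what "local inference modeling" computes. The sketch below shows one commonly described form of it, soft alignment between premise and hypothesis followed by difference and element-wise product features. Treat it as an assumption about the general technique and not as the authors' exact implementation; the function names and the random inputs are hypothetical, and the LSTM encoders and the composition layer are omitted.

```python
import numpy as np

def softmax(x, axis):
    # Subtract the max for numerical stability before exponentiating.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def local_inference(a, b):
    """Soft-align two encoded sentences and build enhanced features.

    a: (len_a, d) encoded premise tokens
    b: (len_b, d) encoded hypothesis tokens
    """
    scores = a @ b.T                          # (len_a, len_b) similarity
    a_aligned = softmax(scores, axis=1) @ b   # hypothesis content per premise token
    b_aligned = softmax(scores, axis=0).T @ a # premise content per hypothesis token
    # Concatenate original, aligned, difference and element-wise product.
    m_a = np.concatenate([a, a_aligned, a - a_aligned, a * a_aligned], axis=-1)
    m_b = np.concatenate([b, b_aligned, b - b_aligned, b * b_aligned], axis=-1)
    return m_a, m_b

rng = np.random.default_rng(0)
a = rng.standard_normal((3, 4))  # stand-in for encoder outputs
b = rng.standard_normal((5, 4))
m_a, m_b = local_inference(a, b)
```

Each token ends up with a feature vector four times the encoder width, which a later composition layer would then summarize.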