Source author record

Liyang Chen

Liyang Chen appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Sound, eess.AS, Multimedia, Artificial Intelligence, Computation and Language, Computer Vision, cond-mat.str-el, cond-mat.supr-con, Graphics, Human-Computer Interaction, Machine Learning

Catalog footprint

What is connected

6 works

11 topics

4 close collaborators

preprint · 2022 · arXiv

StableFace: Analyzing and Improving Motion Stability for Talking Face Generation

While previous speech-driven talking face generation methods have made significant progress in improving the visual quality and lip-sync quality of synthesized videos, they pay less attention to lip motion jitters, which greatly undermine the realism of talking face videos. What causes motion jitters, and how can the problem be mitigated? In this paper, we conduct a systematic analysis of the motion jittering problem based on a state-of-the-art pipeline that uses 3D face representations to bridge the input audio and output video, and improve motion stability with a series of effective designs. We find that several issues can lead to jitters in synthesized talking face video: 1) jitters from the input 3D face representations; 2) training-inference mismatch; 3) lack of dependency modeling among video frames. Accordingly, we propose three solutions: 1) a Gaussian-based adaptive smoothing module that smooths the 3D face representations to eliminate jitters in the input; 2) augmented erosions added to the input data of the neural renderer during training, which simulate the distortion seen in inference and reduce the mismatch; 3) an audio-fused transformer generator that models dependency among video frames. In addition, since there is no off-the-shelf metric for measuring motion jitters in talking face video, we devise an objective metric, the Motion Stability Index (MSI), which quantifies motion jitters as the reciprocal of the variance of acceleration. Extensive experimental results show the superiority of our method on motion-stable face video generation, with better quality than previous systems.
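
To make the metric in the abstract above concrete, here is a minimal Python sketch of a "reciprocal of the variance of acceleration" score on a one-dimensional trajectory, together with a plain fixed-width Gaussian smoother. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say which points or coordinates the metric is computed on, the smoother here is non-adaptive (the paper's module is adaptive), and the names `gaussian_smooth`, `motion_stability_index`, `sigma`, `radius`, and `eps` are all hypothetical.

```python
import math
from statistics import pvariance


def gaussian_smooth(values, sigma=1.0, radius=3):
    """Smooth a 1-D sequence with a fixed Gaussian kernel.

    Assumption: a fixed kernel stands in for the adaptive module
    mentioned in the abstract. Near the edges the kernel is truncated
    and its weights are renormalized.
    """
    smoothed = []
    for i in range(len(values)):
        total = 0.0
        weight_sum = 0.0
        for k in range(-radius, radius + 1):
            j = i + k
            if 0 <= j < len(values):
                w = math.exp(-(k * k) / (2.0 * sigma * sigma))
                total += w * values[j]
                weight_sum += w
        smoothed.append(total / weight_sum)
    return smoothed


def motion_stability_index(positions, eps=1e-12):
    """Reciprocal of the variance of discrete acceleration.

    Acceleration is approximated by the second finite difference.
    `eps` is an assumption added here to avoid division by zero for
    perfectly uniform motion.
    """
    if len(positions) < 3:
        raise ValueError("need at least 3 samples to estimate acceleration")
    accel = [
        positions[t + 1] - 2.0 * positions[t] + positions[t - 1]
        for t in range(1, len(positions) - 1)
    ]
    return 1.0 / (pvariance(accel) + eps)


# Synthetic example: a slow sine motion plus frame-to-frame jitter.
raw = [math.sin(0.2 * t) + (0.05 if t % 2 == 0 else -0.05) for t in range(50)]
smoothed = gaussian_smooth(raw, sigma=1.0, radius=3)

msi_raw = motion_stability_index(raw)
msi_smoothed = motion_stability_index(smoothed)
print("MSI raw:", msi_raw)
print("MSI smoothed:", msi_smoothed)
```

In this toy setting the smoothed trajectory scores higher than the jittery one, because removing frame-to-frame jitter lowers the variance of acceleration; a higher value therefore indicates more stable motion.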

preprint · 2022 · arXiv

The ReprGesture entry to the GENEA Challenge 2022

This paper describes the ReprGesture entry to the Generation and Evaluation of Non-verbal Behaviour for Embodied Agents (GENEA) challenge 2022. The GENEA challenge provides the processed datasets and performs crowdsourced evaluations to compare the performance of different gesture generation systems. In this paper, we explore an automatic gesture generation system based on multimodal representation learning. We use WavLM features for audio, FastText features for text, and position and rotation matrix features for gesture. Each modality is projected to two distinct subspaces: modality-invariant and modality-specific. To learn inter-modality-invariant commonalities and capture the characteristics of modality-specific representations, a gradient-reversal-layer-based adversarial classifier and modality reconstruction decoders are used during training. The gesture decoder generates proper gestures using all representations and features related to the rhythm in the audio. Our code, pre-trained models and demo are available at https://github.com/YoungSeng/ReprGesture.

preprint · 2022 · arXiv

Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Previous work on expressive speech synthesis mainly focuses on the current sentence. The context in adjacent sentences is neglected, resulting in an inflexible speaking style for the same text and a lack of speech variation. In this paper, we propose a hierarchical framework to model speaking style from context. A hierarchical context encoder is proposed to explore a wider range of contextual information by considering the structural relationships in context, including inter-phrase and inter-sentence relations. Moreover, to encourage this encoder to learn a better style representation, we introduce a novel training strategy with knowledge distillation, which provides the target for encoder training. Both objective and subjective evaluations on a Mandarin lecture dataset demonstrate that the proposed method can significantly improve the naturalness and expressiveness of the synthesized speech.

preprint · 2022 · arXiv

Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Previous work on expressive speech synthesis focuses on modelling a mono-scale style embedding from the current sentence or context, and neglects the multi-scale nature of speaking style in human speech. In this paper, we propose a multi-scale speaking style modelling method to capture and predict multi-scale speaking style, improving the naturalness and expressiveness of synthetic speech. A multi-scale extractor is proposed to extract speaking style embeddings at three different levels from the ground-truth speech and to explicitly guide the training of a multi-scale style predictor based on hierarchical context information. Both objective and subjective evaluations on a Mandarin audiobook dataset demonstrate that our proposed method can significantly improve the naturalness and expressiveness of the synthesized speech.

preprint · 2022 · arXiv

Transformer-S2A: Robust and Efficient Speech-to-Animation

We propose a novel, robust and efficient Speech-to-Animation (S2A) approach for synchronized facial animation generation in human-computer interaction. Compared with conventional approaches, the proposed approach uses phonetic posteriorgrams (PPGs) of spoken phonemes as input to ensure cross-language and cross-speaker ability, and introduces the corresponding prosody features (i.e. pitch and energy) to further enhance the expressiveness of the generated animation. A Mixture-of-Experts (MoE)-based Transformer is employed to better model contextual information while significantly improving computational efficiency. Experiments demonstrate the effectiveness of the proposed approach in both objective and subjective evaluations, with a 17x inference speedup compared with the state-of-the-art approach.

preprint · 2020 · arXiv

Tunneling spectroscopy of c-axis epitaxial cuprate junctions

Atomically precise epitaxial structures are unique systems for tunneling spectroscopy that minimize extrinsic effects of disorder. We present a systematic tunneling spectroscopy study, over a broad doping, temperature, and bias range, in epitaxial c-axis La$_{2-x}$Sr$_{x}$CuO$_{4}$/La$_{2}$CuO$_{4}$/La$_{2-x}$Sr$_{x}$CuO$_{4}$ heterostructures. The behavior of these superconductor/insulator/superconductor (SIS) devices is unusual. Down to 20 mK there is complete suppression of c-axis Josephson critical current with a barrier of only 2 nm of La$_{2}$CuO$_{4}$, and the zero-bias conductance remains at 20-30% of the normal-state conductance, implying a substantial population of in-gap states. Tunneling spectra show greatly suppressed coherence peaks. As the temperature is raised, the superconducting gap fills in rather than closing at $T_{c}$. For all doping levels, the spectra show an inelastic tunneling feature at $\sim$ 80 meV, suppressed as $T$ exceeds $T_{c}$. These nominally simple epitaxial cuprate junctions deviate markedly from expectations based on the standard Bardeen-Cooper-Schrieffer (BCS) theory.
