Source author record

Zhiyong Wu

Zhiyong Wu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

eess.AS, Sound, Computation and Language, Machine Learning, Artificial Intelligence, Multimedia, Human-Computer Interaction, astro-ph.EP, Graphics, physics.space-ph, Software Engineering

Catalog footprint

What is connected

32 works

11 topics

4 close collaborators

preprint, 2026, arXiv

A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding

Non-verbal vocalizations (NVs), such as laughter and sighs, are vital for conveying emotion and intention in human speech, yet most existing speech systems neglect them, which severely compromises communicative richness and emotional intelligence. Existing methods for acquiring NVs are either costly and unscalable (relying on manual annotation or recording) or unnatural (relying on rule-based synthesis). To address these limitations, we propose a highly scalable automatic annotation framework to label non-verbal phenomena from natural speech, which is low-cost, easily extendable, and inherently diverse and natural. This framework leverages a unified detection model to accurately identify NVs in natural speech and integrates them with transcripts via a temporal-semantic alignment method. Using this framework, we created and released NonVerbalSpeech-38K, a diverse, real-world dataset featuring 38,718 samples across 10 NV categories collected from in-the-wild media. Experimental results demonstrate that our dataset provides superior controllability for NV generation and achieves comparable performance for NV understanding.

preprint, 2026, arXiv

How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

Full-duplex spoken dialogue requires a model to keep listening while generating its own spoken response. This is challenging for large language models (LLMs), which are designed to extend a single coherent sequence and do not naturally support user input arriving during generation. We argue that how the user stream is routed into the LLM is therefore a key architectural question for full-duplex modeling. To study this question, we extend a text-only LLM into a unified full-duplex spoken dialogue system and compare two routing strategies under a shared training pipeline: (i) channel fusion, which injects the user stream directly into the LLM input, and (ii) cross-attention routing, which keeps the user stream as external memory accessed through cross-attention adapters. Experiments on spoken question answering and full-duplex interaction benchmarks reveal a clear tradeoff. Channel fusion yields stronger semantic grounding and consistently better question-answering performance. However, under semantically overlapping conditions such as user interruptions, it is more vulnerable to context corruption: if the model fails to stop in time, the overlapping user stream can interfere with ongoing generation and lead to semantically incoherent continuations. Cross-attention routing underperforms on question answering, but better preserves the LLM generation context and is more robust to this failure mode. These results establish user-stream routing as a central design axis in full-duplex spoken dialogue and offer practical guidance on the tradeoff between semantic integration and context robustness. We provide a demo page for qualitative inspection.

preprint, 2026, arXiv

OpenCompass: A Universal Evaluation Platform for Large Language Models

In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset-based evaluation methods face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address the aforementioned issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, and high-concurrency-supported general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, code, etc., the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.

preprint, 2026, arXiv

UniSRCodec: Unified and Low-Bitrate Single Codebook Codec with Sub-Band Reconstruction

Neural audio codecs (NACs) reduce transmission overhead through compact compression and reconstruction, and also aim to bridge the gap between continuous and discrete signals. Existing NACs can be divided into two categories: multi-codebook and single-codebook codecs. Multi-codebook codecs face challenges such as structural complexity and difficulty in adapting to downstream tasks, while single-codebook codecs, though structurally simpler, suffer from low fidelity, ineffective modeling of unified audio, and an inability to model high-frequency audio. We propose UniSRCodec, a single-codebook codec that supports a high sampling rate, low bandwidth, high fidelity, and unified audio modeling. We analyze the inefficiency of waveform-based compression, introduce a time and frequency compression method based on the Mel-spectrogram, and pair it with a vocoder to recover the phase information of the original audio. Moreover, we propose a sub-band reconstruction technique to achieve high-quality compression across both low and high frequency bands. Subjective and objective experimental results demonstrate that UniSRCodec achieves state-of-the-art (SOTA) performance among cross-domain single-codebook codecs with a token rate of only 40, and its reconstruction quality is comparable to that of certain multi-codebook methods. Our demo page is available at https://wxzyd123.github.io/unisrcodec.

preprint, 2024, arXiv

Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness

Current talking avatars mostly generate co-speech gestures based on the audio and text of the utterance, without considering the non-speaking motion of the speaker. Furthermore, previous works on co-speech gesture generation have designed network structures based on individual gesture datasets, which results in limited data volume, compromised generalizability, and restricted speaker movements. To tackle these issues, we introduce FreeTalker, which, to the best of our knowledge, is the first framework for the generation of both spontaneous (e.g., co-speech gesture) and non-spontaneous (e.g., moving around the podium) speaker motions. Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions, utilizing heterogeneous data sourced from various motion datasets. During inference, we utilize classifier-free guidance to strongly control the style of the clips. Additionally, to create smooth transitions between clips, we utilize DoubleTake, a method that leverages a generative prior and ensures seamless motion blending. Extensive experiments show that our method generates natural and controllable speaker movements. Our code, model, and demo are available at https://youngseng.github.io/FreeTalker/.
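
The abstract above mentions classifier-free guidance without stating the exact formulation used. As a rough illustration only, the sketch below shows the commonly used form of the technique, in which a conditional and an unconditional prediction from the same diffusion model are blended with a guidance scale. The function name, the list-of-floats representation, and the example values are assumptions made for this sketch and are not taken from the FreeTalker code.

```python
def classifier_free_guidance(uncond_pred, cond_pred, guidance_scale):
    """Blend unconditional and conditional predictions (standard CFG form).

    Both predictions are plain lists of floats here for simplicity; a real
    diffusion model would produce tensors. This is an illustrative sketch,
    not the FreeTalker implementation.
    """
    # guided = uncond + scale * (cond - uncond)
    # scale = 0 gives the unconditional prediction, scale = 1 the conditional
    # one, and scale > 1 pushes further toward the condition (e.g. a style).
    return [u + guidance_scale * (c - u) for u, c in zip(uncond_pred, cond_pred)]


# Hypothetical two-dimensional predictions, chosen only for illustration.
uncond = [0.0, 1.0]
cond = [1.0, 3.0]
guided = classifier_free_guidance(uncond, cond, guidance_scale=2.0)
```

With a guidance scale above 1 the output moves past the conditional prediction, which is the usual way a stronger style influence is obtained at the cost of some diversity.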

preprint, 2024, arXiv

Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion

Any-to-any singing voice conversion (SVC) faces the challenge of "timbre leakage", caused by inadequate disentanglement between the content and the speaker timbre. To address this issue, this study introduces NeuCoSVC, a novel neural concatenative SVC framework. It consists of a self-supervised learning (SSL) representation extractor, a neural harmonic signal generator, and a waveform synthesizer. The SSL extractor condenses audio into fixed-dimensional SSL features, while the harmonic signal generator leverages linear time-varying filters to produce both raw and filtered harmonic signals for pitch information. The synthesizer reconstructs waveforms using SSL features, harmonic signals, and loudness information. During inference, voice conversion is performed by substituting source SSL features with their nearest counterparts from a matching pool, which comprises SSL features extracted from the reference audio, while preserving raw harmonic signals and loudness from the source audio. By directly utilizing SSL features from the reference audio, the proposed framework effectively resolves the "timbre leakage" issue caused by previous disentanglement-based approaches. Experimental results demonstrate that the proposed NeuCoSVC system outperforms the disentanglement-based speaker embedding approach in one-shot SVC across intra-language, cross-language, and cross-domain evaluations.
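
The substitution step described in the abstract above can be sketched as a nearest-neighbour lookup: each source frame is replaced by the closest frame or frames from the reference matching pool. The abstract does not specify the distance measure or how many neighbours are used, so cosine similarity and averaging over the top `k` matches are assumptions here, as are all function names and the toy feature vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def substitute_with_nearest(source_frames, matching_pool, k=1):
    """Replace each source frame with the mean of its k most similar pool frames.

    Assumption: similarity is cosine and the k matches are averaged; the
    abstract only says "nearest counterparts".
    """
    converted = []
    for frame in source_frames:
        ranked = sorted(
            matching_pool,
            key=lambda ref: cosine_similarity(frame, ref),
            reverse=True,
        )
        nearest = ranked[:k]
        converted.append(
            [sum(ref[d] for ref in nearest) / len(nearest) for d in range(len(frame))]
        )
    return converted


# Toy 2-D "SSL features": two source frames, three reference frames.
source = [[1.0, 0.1], [0.0, 1.0]]
pool = [[0.9, 0.0], [0.1, 0.8], [-1.0, 0.0]]
converted = substitute_with_nearest(source, pool)
```

Because every output frame is built only from reference features, no source-speaker feature reaches the synthesizer, which is the intuition behind the abstract's claim about avoiding timbre leakage.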

preprint, 2023, arXiv

The Mars Orbiter Magnetometer of Tianwen-1: In-flight Performance and First Science Results

The Mars Orbiter MAGnetometer (MOMAG) is a scientific instrument onboard the orbiter of China's first mission to Mars, Tianwen-1. It has routinely measured the magnetic field from the solar wind to the magnetic pile-up region surrounding Mars since November 13, 2021. Here we present its in-flight performance and first science results based on the first one and a half months of data. Compared with the magnetic field data in the solar wind from the Mars Atmosphere and Volatile EvolutioN (MAVEN), the magnetic field measured by MOMAG is at the same level in magnitude, and the same magnetic structures with similar variations in three components can be found in the MOMAG data. In the first one and a half months, we recognize 158 clear bow shock (BS) crossings from MOMAG data, whose locations statistically match well with the modeled average BS. We also identify 5 pairs of simultaneous BS crossings of the Tianwen-1 orbiter and MAVEN. These BS crossings confirm the global shape of the modeled BS as well as the south-north asymmetry of the Martian BS. Two cases presented in this paper suggest that the BS is probably more dynamic at the flank than near the nose. So far, MOMAG performs well and provides accurate magnetic field vectors. MOMAG is continuously scanning the magnetic field surrounding Mars. These measurements, complemented by observations from MAVEN, will undoubtedly advance our understanding of the plasma environment of Mars.

32 published items

A Scalable Pipeline for Enabling Non-Verbal Speech Generation and Understanding

How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

OpenCompass: A Universal Evaluation Platform for Large Language Models

UniSRCodec: Unified and Low-Bitrate Single Codebook Codec with Sub-Band Reconstruction

Freetalker: Controllable Speech and Text-Driven Gesture Generation Based on Diffusion Models for Enhanced Speaker Naturalness

Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion

The Mars Orbiter Magnetometer of Tianwen-1: In-flight Performance and First Science Results

A Character-level Span-based Model for Mandarin Prosodic Structure Prediction

Adversarial Sample Detection for Speaker Verification by Neural Vocoders

An Approach to Mispronunciation Detection and Diagnosis with Acoustic, Phonetic and Linguistic (APL) Embeddings

An End-to-end Chinese Text Normalization Model based on Rule-guided Flat-Lattice Transformer

Disentangleing Content and Fine-grained Prosody Information via Hybrid ASR Bottleneck Features for Voice Conversion

Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling

FullSubNet+: Channel Attention FullSubNet with Complex Spectrograms for Speech Enhancement

Lexical Knowledge Internalization for Neural Dialog Generation

NeuFA: Neural Network Based End-to-End Forced Alignment with Bidirectional Attention Mechanism

Neural Architecture Search for Speech Emotion Recognition

Ordinal Regression via Binary Preference vs Simple Regression: Statistical and Experimental Perspectives

Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion

Tackling Spoofing-Aware Speaker Verification with Multi-Model Fusion

The ReprGesture entry to the GENEA Challenge 2022

Towards Cross-speaker Reading Style Transfer on Audiobook Dataset

Towards Expressive Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Towards Multi-Scale Speaking Style Modelling with Hierarchical Context Information for Mandarin Speech Synthesis

Transformer-S2A: Robust and Efficient Speech-to-Animation

Adversarial defense for automatic speaker verification by cascaded self-supervised learning models

Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition

Industry Practice of Coverage-Guided Enterprise-Level DBMS Fuzzing

Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams

Speech-XLNet: Unsupervised Acoustic Model Pretraining For Self-Attention Networks

Study on Feature Subspace of Archetypal Emotions for Speech Emotion Recognition

Feature Learning with Gaussian Restricted Boltzmann Machine for Robust Speech Recognition