Source author record

Zhiyuan Zhao

Zhiyuan Zhao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

eess.AS, Sound, cond-mat.soft, Computer Vision, cond-mat.mes-hall, cond-mat.stat-mech, Machine Learning, physics.flu-dyn, quant-ph, Robotics

Catalog footprint

What is connected

9 works

10 topics

4 close collaborators

preprint, 2026, arXiv

LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models

We present the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via an efficient data processing pipeline that ensures high-quality data and annotations. To validate the effectiveness of LEMAS-Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS-TTS, built upon a non-autoregressive flow-matching framework, leverages the dataset's massive scale and linguistic diversity to achieve robust zero-shot multilingual synthesis. Our proposed accent-adversarial training and CTC loss mitigate cross-lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS-Edit employs an autoregressive decoder-only architecture that formulates speech editing as a masked token infilling task. By exploiting precise word-level alignments to construct training masks and adopting adaptive decoding strategies, it achieves seamless, smooth-boundary speech editing with natural transitions. Experimental results demonstrate that models trained on LEMAS-Dataset deliver high-quality synthesis and editing performance, confirming the dataset's quality. We envision that this richly timestamp-annotated, fine-grained multilingual corpus will drive future advances in prompt-based speech generation systems.
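
As a rough illustration of how word-level timestamps can be turned into a training mask for infilling-style speech editing, the sketch below maps a span of words to a mask over audio tokens. It is not taken from LEMAS-Edit: the function name, the millisecond timestamps, and the 20 ms token duration are all assumptions made for this example.

```python
def word_span_to_token_mask(words, first, last, token_ms, num_tokens):
    """Return a boolean mask over audio tokens covering words[first..last].

    words      -- list of (text, start_ms, end_ms) tuples (hypothetical layout)
    token_ms   -- duration of one audio token in milliseconds (assumed value)
    num_tokens -- total number of audio tokens in the utterance
    """
    start_ms = words[first][1]
    end_ms = words[last][2]
    # Round the start down and the end up so the mask fully covers the words.
    lo = max(0, start_ms // token_ms)
    hi = min(num_tokens, -(-end_ms // token_ms))  # ceiling division
    return [lo <= i < hi for i in range(num_tokens)]


# Hypothetical alignment for a 1-second clip tokenized at 20 ms per token.
words = [("the", 0, 200), ("quick", 200, 520), ("fox", 520, 900)]
mask = word_span_to_token_mask(words, first=1, last=1, token_ms=20, num_tokens=50)
masked_positions = [i for i, m in enumerate(mask) if m]
```

Here the masked positions are the tokens a model would be asked to infill, while the unmasked tokens on both sides serve as context.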

preprint, 2026, arXiv

Refinement via Regeneration: Enlarging Modification Space Boosts Image Refinement in Unified Multimodal Models

Unified multimodal models (UMMs) integrate visual understanding and generation within a single framework. For text-to-image (T2I) tasks, this unified capability allows UMMs to refine outputs after their initial generation, potentially extending the performance upper bound. Current UMM-based refinement methods primarily follow a refinement-via-editing (RvE) paradigm, where UMMs produce editing instructions to modify misaligned regions while preserving aligned content. However, editing instructions often describe prompt-image misalignment only coarsely, leading to incomplete refinement. Moreover, pixel-level preservation, though necessary for editing, unnecessarily restricts the effective modification space for refinement. To address these limitations, we propose Refinement via Regeneration (RvR), a novel framework that reformulates refinement as conditional image regeneration rather than editing. Instead of relying on editing instructions and enforcing strict content preservation, RvR regenerates images conditioned on the target prompt and the semantic tokens of the initial image, enabling more complete semantic alignment with a larger modification space. Extensive experiments demonstrate the effectiveness of RvR, improving Geneval from 0.78 to 0.91, DPGBench from 84.02 to 87.21, and UniGenBench++ from 61.53 to 77.41.

preprint, 2022, arXiv

An Anchor-Free Detector for Continuous Speech Keyword Spotting

Continuous Speech Keyword Spotting (CSKWS) is the task of detecting predefined keywords in continuous speech. In this paper, we regard CSKWS as a one-dimensional object detection task and propose a novel anchor-free detector, named AF-KWS, to solve the problem. AF-KWS directly regresses the center locations and lengths of the keywords through a single-stage deep neural network. In particular, AF-KWS is tailored for this speech task as we introduce an auxiliary unknown class to exclude other words from non-speech or silent background. We have built two benchmark datasets named LibriTop-20 and continuous meeting analysis keywords (CMAK) dataset for CSKWS. Evaluations on these two datasets show that our proposed AF-KWS outperforms reference schemes by a large margin, and therefore provides a decent baseline for future research.
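
To make the idea of one-dimensional anchor-free detection concrete, the sketch below decodes per-frame center scores and per-frame length predictions into keyword intervals. It is a minimal illustration under stated assumptions, not the AF-KWS decoder: the function name, the local-maximum peak picking, the 0.5 threshold, and the toy inputs are all hypothetical.

```python
def decode_keywords(center_scores, lengths, threshold=0.5):
    """Turn per-frame center scores and lengths into (start, end, score) tuples.

    center_scores -- per-frame score that a keyword is centered at that frame
    lengths       -- per-frame predicted keyword length, in frames
    threshold     -- minimum score for a peak to count (assumed value)
    """
    detections = []
    n = len(center_scores)
    for t, score in enumerate(center_scores):
        left = center_scores[t - 1] if t > 0 else float("-inf")
        right = center_scores[t + 1] if t < n - 1 else float("-inf")
        # Keep only local maxima above the threshold as keyword centers.
        if score >= threshold and score > left and score >= right:
            half = lengths[t] / 2.0
            start = max(0.0, t - half)
            end = min(float(n - 1), t + half)
            detections.append((start, end, score))
    return detections


# Toy predictions for a single keyword class over 8 frames.
scores = [0.1, 0.2, 0.9, 0.3, 0.1, 0.1, 0.7, 0.2]
lengths = [0, 0, 4, 0, 0, 0, 2, 0]
dets = decode_keywords(scores, lengths)
```

With these toy inputs the two peaks at frames 2 and 6 become the intervals (0.0, 4.0) and (5.0, 7.0), each clipped to the utterance boundaries.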

preprint, 2022, arXiv

Emergent Stripes of Active Rotors in Shear Flows

The shear-induced self-organization of active rotors into stripy aggregates is studied by carrying out computational simulations. The rotors, modeled by monolayers of frictional spheres, develop into stripy microstructures only when they counterrotate with respect to the vorticity of the imposed shear flow. The average width of the stripes is demonstrated to be linearly dependent on the relative intensity of active torque to the shear rate. By giving insight into three collective particle behaviors, i.e., shear-induced diffusion, rotation-induced rearrangement, and edge flows, we explain the mechanisms of formation of the particle stripes. Additionally, the rheological result shows the dependence of shear and rotational viscosities on the active torque direction and the oddness of the normal stress response. By exhibiting a collective phenomenon of active rotors, our study paves the way to understanding chiral active matter.

preprint, 2022, arXiv

Odd Viscosity in Chiral Passive Suspensions

Prior studies have revealed that nonzero odd viscosity is an essential property of chiral active fluids. Here we report that such an odd viscosity also exists in suspensions of non-active or non-externally-driven but chirally-shaped particles. Computational simulations are carried out for monolayers of dense ratchets in simple shear and planar extensional flows. The contact between two ratchets can be either frictionless or infinitely-frictional, depending on their teeth and sliding directions at the contact point. Our results show that the ratchet suspension has an intermediate shear/extensional viscosity compared with the suspensions of smooth and gear-like particles. Meanwhile, the ratchet suspensions show nonzero even and odd components of the first normal stress coefficient, which indicates the mixed feature of conventional complex fluids and chiral viscous fluids.

preprint, 2022, arXiv

PetLock: A Genderless and Standard Interface for the Future On-orbit Construction

Modular design is the foundation of on-orbit construction technology for large space facilities in the future. A standard interface is the key technology for the modular design of future space robotic systems and space facilities. This paper presents the design and testing of PetLock, a standard and genderless interface that can transfer mechanical loads, power and data between the future modular space robotic manipulator and spacecraft. PetLock adopts a completely genderless design, including the connection face, the locking mechanism, and the data and power interface. The connection surface provides a large translation and rotation misalignment tolerance, due to its 120-degree symmetrical and 3D shape design. The locking mechanism features a three-locking-pin retraction structure, which is simple and reliable. POGO pin connectors in the center of the interface provide the power and data transfer capabilities. Due to its high locking force, large tolerance, high reliability and low cost, PetLock has great application potential in future on-orbit construction missions.

preprint, 2022, arXiv

RetrieverTTS: Modeling Decomposed Factors for Text-Based Speech Insertion

This paper proposes a new "decompose-and-edit" paradigm for the text-based speech insertion task that facilitates arbitrary-length speech insertion and even full sentence generation. In the proposed paradigm, global and local factors in speech are explicitly decomposed and separately manipulated to achieve high speaker similarity and continuous prosody. Specifically, we propose to represent the global factors by multiple tokens, which are extracted by a cross-attention operation and then injected back by a link-attention operation. Due to the rich representation of global factors, we manage to achieve high speaker similarity in a zero-shot manner. In addition, we introduce a prosody smoothing task to make the local prosody factor context-aware and therefore achieve satisfactory prosody continuity. We further achieve high voice quality with an adversarial training stage. In the subjective test, our method achieves state-of-the-art performance in both naturalness and similarity. Audio samples can be found at https://ydcustc.github.io/retrieverTTS-demo/.

preprint, 2021, arXiv

General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework

This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning. In the design of MGF, speech hierarchy is taken into consideration. Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales. For phoneme-scale learning, we borrow the idea from the masked language model but tailor it for the continuous speech signal by replacing the classification loss with a contrastive loss. We corroborate our design by evaluating the MGF representation on various downstream tasks, including phoneme classification, speaker classification, speech recognition, and emotion classification. Experiments verify that training at different time scales needs different training targets and loss functions, which in general complement each other and lead to better performance.

preprint, 2020, arXiv

Experimental sensing quantum atmosphere of a single spin

Understanding symmetry-breaking states of materials is a major challenge in the modern physical sciences. Quantum atmosphere proposed recently sheds light on the hidden world of these symmetry broken patterns. But the requirements for exquisite sensitivity to the small shift and tremendous spatial resolution to local information pose huge obstacles to its experimental manifestation. In our experiment, we prepare time-reversal-symmetry conserved and broken quantum atmosphere of a single nuclear spin and successfully observe their symmetry properties. Our work proves in principle that finding symmetry patterns from quantum atmosphere is conceptually viable. It also opens up entirely new possibilities in the potential application of quantum sensing in material diagnosis.
