Researcher profile

Mu Yang

Mu Yang contributes to research discovery and scholarly infrastructure.

Researcher
Affiliation not imported
Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
8 works
0 followers
9 topics
4 close collaborators

Published work

8 published items

preprint, 2026, arXiv

CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

Vision-Language-Action (VLA) models have emerged as a promising paradigm for end-to-end autonomous driving. However, existing reasoning mechanisms still struggle to provide planning-oriented intermediate representations: textual Chain-of-Thought (CoT) fails to preserve continuous spatiotemporal structure, while latent world reasoning remains difficult to use as a direct condition for action generation. In this paper, we propose CoWorld-VLA, a multi-expert world reasoning framework for autonomous driving, where world representations serve as explicit conditions to guide action planning. CoWorld-VLA extracts complementary world information through multi-source supervision and encodes it into expert tokens within the VLA, thereby providing planner-accessible conditioning signals. Specifically, we construct four types of tokens: semantic interaction, geometric structure, dynamic evolution, and ego trajectory tokens, which respectively model interaction intent, spatial structure, future temporal dynamics, and behavioral goals. During action generation, CoWorld-VLA employs a diffusion-based hierarchical multi-expert fusion planner, which is coupled with scene context throughout the joint denoising process to generate continuous ego trajectories. Experiments show that CoWorld-VLA achieves competitive results in both future scene generation and planning on the NAVSIM v1 benchmark, demonstrating strong performance in collision avoidance and trajectory accuracy. Ablation studies further validate the complementarity of expert tokens and their effectiveness as planning conditions for action generation. Code will be available at https://github.com/AFARI-Research/CoWorld-VLA.

preprint, 2024, arXiv

Filtering one-way Einstein-Podolsky-Rosen steering

Einstein-Podolsky-Rosen (EPR) steering, a fundamental concept of quantum nonlocality, describes one observer's capability to remotely affect another distant observer's state by local measurements. Unlike quantum entanglement and Bell nonlocality, both associated with symmetric quantum correlations, EPR steering depicts the unique asymmetric property of quantum nonlocality. With the local filter operation, in which some system components are discarded, quantum nonlocality can be distilled to enhance the nonlocal correlation, and even hidden nonlocality can be activated. However, asymmetric quantum nonlocality in the filter operation still lacks a well-rounded investigation, especially considering the discarded parts, where quantum nonlocal correlations may still exist with some probability. Here, in both theory and experiment, we investigate the effect of reusing the discarded particles from the local filter. We observe all configurations of EPR steering simultaneously, as well as other intriguing evolutions of asymmetric quantum nonlocality, such as reversing the direction of one-way EPR steering. This work provides a perspective to answer "What is the essential role of utilizing quantum steering as a resource?", and demonstrates a practical toolbox for manipulating asymmetric quantum systems with significant potential applications in quantum information tasks.

preprint, 2022, arXiv

Demonstrating shareability of multipartite Einstein-Podolsky-Rosen steering

Einstein-Podolsky-Rosen (EPR) steering, a category of quantum nonlocal correlations describing the ability of one observer to influence another party's state via local measurements, differs from both entanglement and Bell nonlocality in possessing an asymmetric property. For multipartite EPR steering, the monogamous situation, where two observers cannot simultaneously steer the state of the third party, has been investigated rigorously both in theory and experiment. In contrast to the monogamous situation, the shareability of EPR steering in reduced subsystems allows the state of one party to be steered by two or more observers and thus reveals more configurations of multipartite EPR steering. However, an experimental implementation of this kind of shareability has been absent until now. Here, in an optical experiment, we provide a proof-of-principle demonstration of the shareability of EPR steering without the constraint of monogamy in a three-qubit system. Moreover, based on the reduced bipartite EPR steering detection results, we verify the genuine three-qubit entanglement results. This work provides a complementary viewpoint for understanding multipartite EPR steering and has potential applications in many quantum information protocols, such as multipartite entanglement detection, quantum cryptography, and the construction of quantum networks.

preprint, 2022, arXiv

Improving Mispronunciation Detection with Wav2vec2-based Momentum Pseudo-Labeling for Accentedness and Intelligibility Assessment

Current leading mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition. One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech. In this work, we leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models. Specifically, we use Wav2vec 2.0 as our SSL model, and fine-tune it using original labeled L2 speech samples plus the created pseudo-labeled L2 speech samples. Our pseudo labels are dynamic and are produced on the fly by an ensemble of the online model, which ensures that our model is robust to pseudo-label noise. We show that fine-tuning with pseudo labels achieves a 5.35% phoneme error rate reduction and a 2.48% MDD F1 score improvement over a labeled-samples-only fine-tuning baseline. The proposed PL method is also shown to outperform conventional offline PL methods. Compared with state-of-the-art MDD systems, our MDD solution produces a more accurate and consistent phonetic error diagnosis. In addition, we conduct an open test on a separate UTD-4Accents dataset, where our system's recognition outputs show a strong correlation with human perception, based on accentedness and intelligibility.

preprint, 2022, arXiv

Topological band structure via twisted photons in a degenerate cavity

Synthetic dimensions based on particles' internal degrees of freedom, such as frequency, spatial modes and arrival time, have attracted significant attention. They offer ideal large-scale lattices to simulate nontrivial topological phenomena. Exploring more synthetic dimensions is one of the paths toward higher dimensional physics. In this work, we design and experimentally control the coupling among synthetic dimensions consisting of the intrinsic photonic orbital angular momentum and spin angular momentum degrees of freedom in a degenerate optical resonant cavity, which generates a periodically driven spin-orbital coupling system. We directly characterize the system's properties, including the density of states, energy band structures and topological windings, through the transmission intensity measurements. Our work demonstrates a novel mechanism for exploring the spatial modes of twisted photons as the synthetic dimension, which paves the way to design rich topological physics in a highly compact platform.

preprint, 2022, arXiv

Toward practical weak measurement wavefront sensing: spatial resolution and achromatism

The weak measurement wavefront sensor detects the phase gradient of light, as the Shack-Hartmann sensor does. However, the use of one thin birefringent crystal to displace light beams results in a wavelength-dependent phase difference between the two polarization components, which limits practical application. A Savart plate, which consists of two such crystals, can compensate for the phase difference and realize achromatic wavefront sensing when combined with an achromatic retarder. We discuss the spatial resolution of the sensor and experimentally reconstruct a wavefront modulated by a pattern. Then we obtain the Zernike coefficients with three different wavelengths before and after modulation. Our work makes this new wavefront sensor more applicable to actual tasks like biomedical imaging.

preprint, 2022, arXiv

Towards Lifelong Learning of Multilingual Text-To-Speech Synthesis

This work presents a lifelong learning approach to train a multilingual Text-To-Speech (TTS) system, where each language is treated as an individual task and is learned sequentially and continually. It does not require pooled data from all languages, and thus alleviates the storage and computation burden. One of the challenges of lifelong learning methods is "catastrophic forgetting": in the TTS scenario, this means that model performance quickly degrades on previous languages when the model is adapted to a new language. We approach this problem via a data-replay-based lifelong learning method. We formulate the replay process as a supervised learning problem, and propose a simple yet effective dual-sampler framework to tackle the heavily language-imbalanced training samples. Through objective and subjective evaluations, we show that this supervised learning formulation outperforms other gradient-based and regularization-based lifelong learning methods, achieving a 43% Mel-Cepstral Distortion reduction compared to a fine-tuning baseline.

preprint, 2021, arXiv

InverseMV: Composing Piano Scores with a Convolutional Video-Music Transformer

Many social media users prefer consuming content in the form of videos rather than text. However, in order for content creators to produce videos with a high click-through rate, much editing is needed to match the footage to the music. This poses additional challenges for more amateur video makers. Therefore, we propose a novel attention-based model VMT (Video-Music Transformer) that automatically generates piano scores from video frames. Using music generated from models also prevents potential copyright infringements that often come with using existing music. To the best of our knowledge, there is no work besides the proposed VMT that aims to compose music for video. Additionally, there is no dataset with aligned video and symbolic music. We release a new dataset composed of over 7 hours of piano scores with fine alignment between pop music videos and MIDI files. We conduct experiments with human evaluation on VMT, SeqSeq model (our baseline), and the original piano version soundtrack. VMT achieves consistent improvements over the baseline on music smoothness and video relevance. In particular, with the relevance scores and our case study, our model has shown the capability of multimodality on frame-level actors' movement for music generation. Our VMT model, along with the new dataset, presents a promising research direction toward composing the matching soundtrack for videos. We have released our code at https://github.com/linchintung/VMT