Researcher profile

Yiming Wang

Yiming Wang contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
24 works
0 followers
13 topics
4 close collaborators

Published work

24 published items

preprint · 2026 · arXiv

ANO: A Principled Approach to Robust Policy Optimization

Proximal Policy Optimization (PPO) dominates reinforcement learning and LLM alignment but relies on a "hard clipping" mechanism that discards valuable gradients. Conversely, unconstrained methods like SPO expose the optimization to unbounded updates, causing severe instability and policy collapse during extreme outlier encounters. To resolve this dilemma, we introduce a principled design space for policy optimization, demonstrating that a robust estimator must inherently suppress outliers while maintaining a smooth restoration force. Guided by these geometric principles, we derive Anchored Neighborhood Optimization (ANO), a novel method that seamlessly replaces hard clipping with a redescending gradient mechanism. Extensive evaluations demonstrate ANO's empirical superiority across diverse domains. In continuous (MuJoCo) and discrete (Atari) control, ANO establishes a robust state-of-the-art, uniquely preventing policy collapse even under highly aggressive learning rates ($1 \times 10^{-3}$). Furthermore, in LLM alignment (RLHF), ANO explicitly eliminates the catastrophic KL divergence explosion inherent to unconstrained methods, dominating PPO, SPO, and GRPO in head-to-head win rates.
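
To make the "hard clipping" point concrete, the following is a minimal sketch of the standard PPO clipped surrogate for a single sample, in plain Python. It is not ANO (the abstract does not give ANO's formula); the function names and the default `eps=0.2` are illustrative assumptions. It shows that once the probability ratio leaves the clip range in the direction favoured by the advantage, the derivative with respect to the ratio is exactly zero, which is the discarded gradient the abstract refers to.

```python
def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate for one sample (to be maximised)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)


def ppo_clip_grad(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Derivative of the clipped surrogate with respect to the ratio.

    The gradient vanishes when the ratio has already moved past the clip
    boundary in the direction the advantage rewards.
    """
    if advantage >= 0 and ratio > 1.0 + eps:
        return 0.0  # clipped from above: no learning signal
    if advantage < 0 and ratio < 1.0 - eps:
        return 0.0  # clipped from below: no learning signal
    return advantage  # unclipped branch: objective is ratio * advantage
```

For example, `ppo_clip_grad(1.5, 2.0)` is zero, while `ppo_clip_grad(0.5, 2.0)` still returns the full advantage. A redescending alternative would instead shrink the gradient smoothly as the ratio moves away from 1.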

preprint · 2026 · arXiv

Anon: Extrapolating Adaptivity Beyond SGD and Adam

Adaptive optimizers such as Adam have achieved great success in training large-scale models like large language models and diffusion models. However, they often generalize worse than non-adaptive methods, such as SGD, on classical architectures like CNNs. We identify a key cause of this performance gap: adaptivity in pre-conditioners, which limits the optimizer's ability to adapt to diverse optimization landscapes. To address this, we propose Anon (Adaptivity Non-restricted Optimizer with Novel convergence technique), a novel optimizer with continuously tunable adaptivity in R, allowing it to interpolate between SGD-like and Adam-like behaviors and even extrapolate beyond both. To ensure convergence across the entire adaptivity spectrum, we introduce incremental delay update (IDU), a novel mechanism that is more flexible than AMSGrad's hard max-tracking strategy and enhances robustness to gradient noise. We theoretically establish convergence guarantees under both convex and non-convex settings. Empirically, Anon consistently outperforms state-of-the-art optimizers on representative image classification, diffusion, and language modeling tasks. These results demonstrate that adaptivity can serve as a valuable tunable design principle, and Anon provides the first unified and reliable framework capable of bridging the gap between classical and modern optimizers and surpassing their advantageous properties.
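
One way to picture "tunable adaptivity" is a scalar update whose preconditioner is raised to an adjustable power. The sketch below is an assumption for illustration only: it follows the general idea of partially adaptive methods and is not Anon's update rule or its IDU mechanism, neither of which is specified in the abstract. The function name and all defaults are hypothetical, and bias correction is omitted for brevity.

```python
def partially_adaptive_step(param, grad, m, v, lr=1e-3,
                            beta1=0.9, beta2=0.999, p=0.5, eps=1e-8):
    """One scalar update with a power-p preconditioner (illustrative only).

    p = 0   -> denominator is 1, i.e. SGD with an averaged gradient
    p = 0.5 -> Adam-like scaling by the root of the second moment
    other p -> values outside the SGD/Adam range
    """
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment average
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment average
    denom = (v + eps) ** p                        # tunable preconditioner
    param = param - lr * m / denom
    return param, m, v
```

Setting `p=0` makes the denominator exactly 1, so the step reduces to `lr * m`; setting `p=0.5` recovers the familiar square-root scaling.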

preprint · 2026 · arXiv

EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing

Weakly supervised Audio-Visual Video Parsing (AVVP) aims to recognize and temporally localize audio, visual, and audio-visual events in videos using only coarse-grained labels. Faced with the challenging task settings, existing research advances along two main paths: pre-training pseudo-label generators for fine-grained cross-modal semantic guidance, or refining AVVP model architectures to enhance audio-visual fusion. However, since audio and visual signals are typically unaligned, achieving accurate video parsing fundamentally relies on precise perception of uni-modal events. Yet these multi-modal focused strategies excessively emphasize multi-modal fusion while inadequately guiding and preserving uni-modal semantics, resulting in noisy pseudo-labels and sub-optimal video parsing performance. This paper proposes a novel framework that enhances uni-modal representations for both the pseudo-label generator and the AVVP model. Specifically, we introduce a similarity-based label migration approach to annotate pre-training data, thereby enabling the pseudo-label generator to better understand uni-modal events. We also employ a soft-constrained manner to refine modeling of uni-modal features in parallel with multi-modal fusion. These designs enable coordinated attention to both uni-modal and cross-modal representations, thus boosting the localization performance for events. Extensive experiments show that our method outperforms state-of-the-art methods in both pseudo-label and AVVP performance.

preprint · 2026 · arXiv

Enhancing Retrieval Augmentation via Adversarial Collaboration

Retrieval-augmented Generation (RAG) is a prevalent approach for domain-specific LLMs, yet it is often plagued by "Retrieval Hallucinations"--a phenomenon where fine-tuned models fail to recognize and act upon poor-quality retrieved documents, thus undermining performance. To address this, we propose the Adversarial Collaboration RAG (AC-RAG) framework. AC-RAG employs two heterogeneous agents: a generalist Detector that identifies knowledge gaps, and a domain-specialized Resolver that provides precise solutions. Guided by a moderator, these agents engage in an adversarial collaboration, where the Detector's persistent questioning challenges the Resolver's expertise. This dynamic process allows for iterative problem dissection and refined knowledge retrieval. Extensive experiments show that AC-RAG significantly improves retrieval accuracy and outperforms state-of-the-art RAG methods across various vertical domains.

preprint · 2026 · arXiv

PALM-Bench: A Comprehensive Benchmark for Personalized Audio-Language Models

Large Audio-Language Models (LALMs) have demonstrated strong performance in audio understanding and generation. Yet, our extensive benchmarking reveals that their behavior is largely generic (e.g., summarizing spoken content) and fails to adequately support personalized question answering (e.g., summarizing what my best friend says). In contrast, humans condition their interpretation and decision-making on each individual's personal context. To bridge this gap, we formalize the task of Personalized LALMs (PALM) for recognizing personal concepts and reasoning within personal context. Moreover, we create the first benchmark (PALM-Bench) to foster methodological advances in PALM and enable structured evaluation on several tasks across multi-speaker scenarios. Our extensive experiments on representative open-source LALMs show that existing training-free prompting and supervised fine-tuning strategies, while yielding improvements, remain limited in modeling personalized knowledge and transferring it across tasks robustly. Data and code will be released.

preprint · 2026 · arXiv

PhysSFI-Net: Physics-informed Geometric Learning of Skeletal and Facial Interactions for Orthognathic Surgical Outcome Prediction

Orthognathic surgery repositions jaw bones to restore occlusion and enhance facial aesthetics. Accurate simulation of postoperative facial morphology is essential for preoperative planning. However, traditional biomechanical models are computationally expensive, while geometric deep learning approaches often lack interpretability. In this study, we develop and validate a physics-informed geometric deep learning framework named PhysSFI-Net for precise prediction of soft tissue deformation following orthognathic surgery. PhysSFI-Net consists of three components: a hierarchical graph module with craniofacial and surgical plan encoders combined with attention mechanisms to extract skeletal-facial interaction features; a Long Short-Term Memory (LSTM)-based sequential predictor for incremental soft tissue deformation; and a biomechanics-inspired module for high-resolution facial surface reconstruction. Model performance was assessed using point cloud shape error (Hausdorff distance), surface deviation error, and landmark localization error (Euclidean distances of craniomaxillofacial landmarks) between predicted facial shapes and corresponding ground truths. A total of 135 patients who underwent combined orthodontic and orthognathic treatment were included for model training and validation. Quantitative analysis demonstrated that PhysSFI-Net achieved a point cloud shape error of 1.070 +/- 0.088 mm, a surface deviation error of 1.296 +/- 0.349 mm, and a landmark localization error of 2.445 +/- 1.326 mm. Comparative experiments indicated that PhysSFI-Net outperformed the state-of-the-art method ACMT-Net in prediction accuracy. In conclusion, PhysSFI-Net enables interpretable, high-resolution prediction of postoperative facial morphology with superior accuracy, showing strong potential for clinical application in orthognathic surgical planning and simulation.
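
The point cloud shape error above is reported as a Hausdorff distance. As a reference for that metric, here is a minimal brute-force sketch in plain Python. It computes the symmetric (maximum) Hausdorff distance; whether the paper uses this variant or an averaged one is not stated in the abstract, and the function names are illustrative.

```python
import math


def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)


def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))


predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
error = hausdorff(predicted, ground_truth)  # driven by the unmatched point
```

This version is quadratic in the number of points, so it suits small examples only; dense facial meshes would normally need a spatial index.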

preprint · 2026 · arXiv

Unifying Speech Recognition, Synthesis and Conversion with Autoregressive Transformers

Traditional speech systems typically rely on separate, task-specific models for text-to-speech (TTS), automatic speech recognition (ASR), and voice conversion (VC), resulting in fragmented pipelines that limit scalability, efficiency, and cross-task generalization. In this paper, we present General-Purpose Audio (GPA), a unified audio foundation model that integrates multiple core speech tasks within a single large language model (LLM) architecture. GPA operates on a shared discrete audio token space and supports instruction-driven task induction, enabling a single autoregressive model to flexibly perform TTS, ASR, and VC without architectural modifications. This unified design combines a fully autoregressive formulation over discrete speech tokens, joint multi-task training across speech domains, and a scalable inference pipeline that achieves high concurrency and throughput. The resulting model family supports efficient multi-scale deployment, including a lightweight 0.3B-parameter variant optimized for edge and resource-constrained environments. Together, these design choices demonstrate that a unified autoregressive architecture can achieve competitive performance across diverse speech tasks while remaining viable for low-latency, practical deployment.

preprint · 2025 · arXiv

Perturbative Kondo destruction and global phase diagram of heavy fermion metals

Strange metals represent a foundational problem in quantum condensed matter physics, and heavy fermion systems provide a canonical setting to advance a general understanding. The concept of a Kondo destruction quantum critical point is widely invoked to describe the competition of the Kondo effect and the local-moment magnetism. Here, we develop a unified field-theoretic approach, analyzing this competition from a rare approach that is anchored by the magnetically ordered side. Our analysis reveals, for the first time within a renormalization group framework, a quantum critical point across which the Kondo effect goes from being destroyed to dominating. Our findings elucidate not only the Kondo destruction quantum criticality but also an accompanying global phase diagram of heavy fermion metals.

preprint · 2024 · arXiv

ResidualTransformer: Residual Low-Rank Learning with Weight-Sharing for Transformer Layers

The memory constraint of always-on devices is one of the major concerns when deploying speech processing models on these devices. While larger models trained with a sufficiently large amount of data generally perform better, making them fit in the device memory is a demanding challenge. In this paper, we aim to reduce model size by reparameterizing model weights across Transformer encoder layers and assuming a special weight composition and structure. More specifically, inspired by ResNet and the more recent LoRA work, we propose an approach named ResidualTransformer, where each weight matrix in a Transformer layer comprises 1) a shared full-rank component with its adjacent layers, and 2) a unique low-rank component to itself. The low-rank matrices only account for a small amount of model size increase. In addition, we add diagonal weight matrices to improve modeling capacity of the low-rank matrices. Experiments on our 10k-hour speech recognition and speech translation tasks show that the Transformer encoder size can be reduced by ~3X with very slight performance degradation.
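
The size saving from this weight composition can be checked with simple parameter counting. The sketch below is a back-of-the-envelope model under stated assumptions: one square `d_model x d_model` matrix per layer, a full-rank matrix shared within groups of `group_size` adjacent layers, and a per-layer low-rank pair plus one diagonal. The grouping scheme and every number used are hypothetical and are not the paper's configuration.

```python
import math


def baseline_params(d_model: int, num_layers: int) -> int:
    """Every layer owns its own full-rank d_model x d_model matrix."""
    return num_layers * d_model * d_model


def residual_params(d_model: int, num_layers: int,
                    group_size: int, rank: int) -> int:
    """Shared full-rank matrix per group, plus per-layer low-rank and diagonal."""
    shared = math.ceil(num_layers / group_size) * d_model * d_model
    low_rank = 2 * d_model * rank   # two factors: d x r and r x d
    diagonal = d_model              # one diagonal matrix per layer
    return shared + num_layers * (low_rank + diagonal)


base = baseline_params(512, 18)
compact = residual_params(512, 18, group_size=3, rank=16)
ratio = base / compact
```

With these made-up values the ratio comes out at roughly 2.5, and it approaches `group_size` as the rank shrinks, which is consistent with the low-rank terms adding only a small amount of model size.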

preprint · 2022 · arXiv

ACTIVE: Augmentation-Free Graph Contrastive Learning for Partial Multi-View Clustering

In this paper, we propose an augmentation-free graph contrastive learning framework, namely ACTIVE, to solve the problem of partial multi-view clustering. Notably, we suppose that the representations of similar samples (i.e., belonging to the same cluster) and their multi-view features should be similar. This is distinct from the general unsupervised contrastive learning that assumes an image and its augmentations share a similar representation. Specifically, relation graphs are constructed using the nearest neighbours to identify existing similar samples, then the constructed inter-instance relation graphs are transferred to the missing views to build graphs on the corresponding missing data. Subsequently, two main components, within-view graph contrastive learning (WGC) and cross-view graph consistency learning (CGC), are devised to maximize the mutual information of different views within a cluster. The proposed approach elevates instance-level contrastive learning and missing data inference to the cluster-level, effectively mitigating the impact of individual missing data on clustering. Experiments on several challenging datasets demonstrate the superiority of our proposed methods.

preprint · 2022 · arXiv

Behavior Recognition Based on the Integration of Multigranular Motion Features

The recognition of behaviors in videos usually requires a combinatorial analysis of the spatial information about objects and their dynamic action information in the temporal dimension. Specifically, behavior recognition may even rely more on the modeling of temporal information containing short-range and long-range motions; this contrasts with computer vision tasks involving images that focus on the understanding of spatial information. However, current solutions fail to jointly and comprehensively analyze short-range motion between adjacent frames and long-range temporal aggregations at large scales in videos. In this paper, we propose a novel behavior recognition method based on the integration of multigranular (IMG) motion features. In particular, we achieve reliable motion information modeling through the synergy of a channel attention-based short-term motion feature enhancement module (CMEM) and a cascaded long-term motion feature integration module (CLIM). We evaluate our model on several action recognition benchmarks such as HMDB51, Something-Something and UCF101. The experimental results demonstrate that our approach outperforms the previous state-of-the-art methods, which confirms its effectiveness and efficiency.

preprint · 2022 · arXiv

Graph-based Generative Face Anonymisation with Pose Preservation

We propose AnonyGAN, a GAN-based solution for face anonymisation which replaces the visual information corresponding to a source identity with a condition identity provided as any single image. With the goal to maintain the geometric attributes of the source face, i.e., the facial pose and expression, and to promote more natural face generation, we propose to exploit a Bipartite Graph to explicitly model the relations between the facial landmarks of the source identity and the ones of the condition identity through a deep model. We further propose a landmark attention model to relax the manual selection of facial landmarks, allowing the network to weight the landmarks for the best visual naturalness and pose preservation. Finally, to facilitate the appearance learning, we propose a hybrid training strategy to address the challenge caused by the lack of direct pixel-level supervision. We evaluate our method and its variants on two public datasets, CelebA and LFW, in terms of visual naturalness, facial pose preservation and of its impacts on face detection and re-identification. We prove that AnonyGAN significantly outperforms the state-of-the-art methods in terms of visual naturalness, face detection and pose preservation.

preprint · 2022 · arXiv

Long-tailed Recognition by Learning from Latent Categories

In this work, we address the challenging task of long-tailed image recognition. Previous long-tailed recognition methods commonly focus on the data augmentation or re-balancing strategy of the tail classes to give more attention to tail classes during the model training. However, due to the limited training images for tail classes, the diversity of tail class images is still restricted, which results in poor feature representations. In this work, we hypothesize that common latent features among the head and tail classes can be used to give better feature representation. Motivated by this, we introduce a Latent Categories based long-tail Recognition (LCReg) method. Specifically, we propose to learn a set of class-agnostic latent features shared among the head and tail classes. Then, we implicitly enrich the training sample diversity via applying semantic data augmentation to the latent features. Extensive experiments on five long-tailed image recognition datasets demonstrate that our proposed LCReg is able to significantly outperform previous methods and achieve state-of-the-art results.

preprint · 2022 · arXiv

Loop closure detection using local 3D deep descriptors

We present a simple yet effective method to address loop closure detection in simultaneous localisation and mapping using local 3D deep descriptors (L3Ds). L3Ds are emerging compact representations of patches extracted from point clouds that are learnt from data using a deep learning algorithm. We propose a novel overlap measure for loop detection by computing the metric error between points that correspond to mutually-nearest-neighbour descriptors after registering the loop candidate point cloud by its estimated relative pose. This novel approach enables us to accurately detect loops and estimate six degrees-of-freedom poses in the case of small overlaps. We compare our L3D-based loop closure approach with recent approaches on LiDAR data and achieve state-of-the-art loop closure detection accuracy. Additionally, we embed our loop closure approach in RESLAM, a recent edge-based SLAM system, and perform the evaluation on real-world RGBD-TUM and synthetic ICL datasets. Our approach enables RESLAM to achieve a better localisation accuracy compared to its original loop closure strategy. Our project page is available at github.com/yiming107/l3d_loop_closure.
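
The overlap measure described above rests on two steps: finding mutually-nearest-neighbour descriptors, then measuring the metric error between the matched points after registration. The following brute-force sketch illustrates both under simplifying assumptions: descriptors and points are plain tuples, the candidate cloud is taken to be already registered, and the mean point distance stands in for the paper's exact error definition, which the abstract does not give. All function names are hypothetical.

```python
import math


def nearest(query, candidates):
    """Index of the candidate descriptor closest to the query."""
    return min(range(len(candidates)),
               key=lambda j: math.dist(query, candidates[j]))


def mutual_nearest_pairs(desc_a, desc_b):
    """Pairs (i, j) where a[i] and b[j] are each other's nearest neighbour."""
    a_to_b = [nearest(d, desc_b) for d in desc_a]
    b_to_a = [nearest(d, desc_a) for d in desc_b]
    return [(i, j) for i, j in enumerate(a_to_b) if b_to_a[j] == i]


def mean_metric_error(points_a, points_b, pairs):
    """Mean distance between matched points; infinite if nothing matched."""
    if not pairs:
        return math.inf
    return sum(math.dist(points_a[i], points_b[j]) for i, j in pairs) / len(pairs)
```

A small error over the matched pairs suggests the two clouds overlap, so thresholding it gives a simple loop decision; the threshold itself would have to be tuned per sensor.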

preprint · 2022 · arXiv

Spatial Commonsense Graph for Object Localisation in Partial Scenes

We solve object localisation in partial scenes, a new problem of estimating the unknown position of an object (e.g. where is the bag?) given a partial 3D scan of a scene. The proposed solution is based on a novel scene graph model, the Spatial Commonsense Graph (SCG), where objects are the nodes and edges define pairwise distances between them, enriched by concept nodes and relationships from a commonsense knowledge base. This allows SCG to better generalise its spatial inference over unknown 3D scenes. The SCG is used to estimate the unknown position of the target object in two steps: first, we feed the SCG into a novel Proximity Prediction Network, a graph neural network that uses attention to perform distance prediction between the node representing the target object and the nodes representing the observed objects in the SCG; second, we propose a Localisation Module based on circular intersection to estimate the object position using all the predicted pairwise distances in order to be independent of any reference system. We create a new dataset of partially reconstructed scenes to benchmark our method and baselines for object localisation in partial scenes, where our proposed method achieves the best localisation performance.

preprint · 2022 · arXiv

Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training

Recently, masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition. It usually requires a codebook obtained in an unsupervised way, making it less accurate and difficult to interpret. We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance and also the pre-training efficiency, either through decoding with a hybrid ASR system to generate phoneme-level alignments (named PBERT), or performing clustering on the supervised speech features extracted from an end-to-end CTC model (named CTC clustering). Both the hybrid and CTC models are trained on the same small amount of labeled speech as used in fine-tuning. Experiments demonstrate significant superiority of our methods to various SSL and self-training baselines, with up to 17.0% relative WER reduction. Our pre-trained models also show good transferability in a non-ASR speech task.

preprint · 2022 · arXiv

Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition

The goal of self-supervised learning (SSL) for automatic speech recognition (ASR) is to learn good speech representations from a large amount of unlabeled speech for the downstream ASR task. However, most SSL frameworks do not consider noise robustness, which is crucial for real-world applications. In this paper we propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech via contrastive learning. Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network. In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets of each other. This forces the network to make consistent predictions for the original and noisy speech, which allows it to learn contextualized representations with noise robustness. Our experiments on synthesized and real noisy data show the effectiveness of our method: it achieves 2.9--4.9% relative word error rate (WER) reduction on the synthesized noisy LibriSpeech data without deterioration on the original data, and 5.7% on CHiME-4 real 1-channel noisy data compared to a data augmentation baseline even with a strong language model for decoding. Our results on CHiME-4 can match or even surpass those with well-designed speech enhancement components.
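
The results above are stated as relative WER reductions. For readers unfamiliar with the metric, this is a minimal sketch of word error rate as word-level edit distance divided by reference length, together with the relative-reduction formula. The function names and the example sentences are illustrative and are not taken from the paper's evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which the improved error rate undercuts the baseline."""
    return (baseline - improved) / baseline
```

A relative reduction of 5% means the new WER is 95% of the baseline WER, so a baseline of 10.0% would drop to 9.5%, not to 5.0%.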

preprint · 2021 · arXiv

Explicit forms of zero modes in symmetric interacting Kitaev chain without and with dimerization

The fermionic and bosonic zero modes of the 1D interacting Kitaev chain at the symmetric point are unveiled. The many-body structures of the Majorana zero modes in the topological region are given explicitly by carrying out perturbation expansion up to infinite order. We also give the analytic expressions of the bosonic zero modes in the topologically trivial phase. Our results are generalized to the hybrid fermion system comprised of the interacting Kitaev model and the Su-Schrieffer-Heeger model, in which we show that these two types of zero modes can coexist in a certain region of its phase diagram.

preprint · 2021 · arXiv

Record high $T_{\rm c}$ and robust superconductivity in transition metal $\delta$-Ti phase at megabar pressure

We report a record high superconducting transition temperature ($T_{\rm c}$) up to 23.6 K under high pressure in the elemental metal Ti, one of the top ten most abundant elements in Earth's crust. The $T_{\rm c}$ increases monotonically from 2.3 K at 40.3 GPa to 23.6 K at 144.9 GPa, which surpasses all known records from elemental metals reported so far. With further compression, a robust $T_{\rm c}$ of ~23 K is observed between 144.9 and 183 GPa in the $\delta$-Ti phase. The pressure-dependent $T_{\rm c}$ can be well described by the conventional electron-phonon coupling (EPC) mechanism. Density Functional Theory calculations show the Fermi nesting and the phonon softening of optical branches at the $\gamma$-Ti to $\delta$-Ti phase transition pressure enhance EPC, which results in the record high $T_{\rm c}$. We attribute the robust superconductivity in $\delta$-Ti to the apparent robustness of its strong EPC against lattice compression. These results provide new insight into exploring new high-$T_{\rm c}$ elemental metals and Ti-based superconducting alloys.

preprint · 2021 · arXiv

Wake Word Detection with Streaming Transformers

Modern wake word detection systems usually rely on neural networks for acoustic modeling. Transformers have recently shown superior performance over LSTM and convolutional networks in various sequence modeling tasks with their better temporal modeling power. However, it is not clear whether this advantage still holds for short-range temporal modeling like wake word detection. Besides, the vanilla Transformer is not directly applicable to the task due to its non-streaming nature and the quadratic time and space complexity. In this paper we explore the performance of several variants of chunk-wise streaming Transformers tailored for wake word detection in a recently proposed LF-MMI system, including looking-ahead to the next chunk, gradient stopping, different positional embedding methods and adding same-layer dependency between chunks. Our experiments on the Mobvoi wake word dataset demonstrate that our proposed Transformer model outperforms the baseline convolution network by 25% on average in false rejection rate at the same false alarm rate with a comparable model size, while still maintaining linear complexity w.r.t. the sequence length.
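
Chunk-wise streaming attention can be pictured as a mask that restricts each frame to a fixed window of chunks. The sketch below builds such a mask in plain Python. It is a generic illustration, not the paper's implementation: the parameter names `left_chunks` and `right_chunks` are assumptions, and setting `right_chunks=1` corresponds to the "looking-ahead to the next chunk" variant mentioned in the abstract.

```python
def chunk_attention_mask(num_frames: int, chunk_size: int,
                         left_chunks: int = 1, right_chunks: int = 0):
    """Boolean mask where mask[q][k] is True if frame q may attend to frame k.

    A frame sees its own chunk, up to left_chunks previous chunks, and
    up to right_chunks following chunks.
    """
    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size
        row = []
        for k in range(num_frames):
            k_chunk = k // chunk_size
            row.append(q_chunk - left_chunks <= k_chunk <= q_chunk + right_chunks)
        mask.append(row)
    return mask
```

Each row has at most `(left_chunks + 1 + right_chunks) * chunk_size` allowed positions regardless of sequence length, which is why the total attention cost grows linearly with the number of frames.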

preprint · 2020 · arXiv

POMP: Pomcp-based Online Motion Planning for active visual search in indoor environments

In this paper we focus on the problem of learning an optimal policy for Active Visual Search (AVS) of objects in known indoor environments with an online setup. Our POMP method uses as input the current pose of an agent (e.g. a robot) and an RGB-D frame. The task is to plan the next move that brings the agent closer to the target object. We model this problem as a Partially Observable Markov Decision Process solved by a Monte-Carlo planning approach. This allows us to make decisions on the next moves by iterating over the known scenario at hand, exploring the environment and searching for the object at the same time. Unlike the current state of the art in Reinforcement Learning, POMP does not require extensive and expensive (in time and computation) labelled data, which makes it very agile in solving AVS in small and medium real scenarios. We only require the floormap of the environment, information that is usually available or that can be easily extracted from an a priori single exploration run. We validate our method on the publicly available AVD benchmark, achieving an average success rate of 0.76 with an average path length of 17.1, performing close to the state of the art but without any training needed. Additionally, we show experimentally the robustness of our method when the quality of the object detection goes from ideal to faulty.

preprint · 2020 · arXiv

PyChain: A Fully Parallelized PyTorch Implementation of LF-MMI for End-to-End ASR

We present PyChain, a fully parallelized PyTorch implementation of end-to-end lattice-free maximum mutual information (LF-MMI) training for the so-called chain models in the Kaldi automatic speech recognition (ASR) toolkit. Unlike other PyTorch and Kaldi based ASR toolkits, PyChain is designed to be as flexible and light-weight as possible so that it can be easily plugged into new ASR projects, or other existing PyTorch-based ASR tools, as exemplified respectively by a new project PyChain-example, and Espresso, an existing end-to-end ASR toolkit. PyChain's efficiency and flexibility are demonstrated through such novel features as full GPU training on numerator/denominator graphs, and support for unequal length sequences. Experiments on the WSJ dataset show that with simple neural networks and commonly used machine learning techniques, PyChain can achieve competitive results that are comparable to Kaldi and better than other end-to-end ASR systems.

preprint · 2020 · arXiv

Wake Word Detection with Alignment-Free Lattice-Free MMI

Always-on spoken language interfaces, e.g. personal digital assistants, rely on a wake word to start processing spoken input. We present novel methods to train a hybrid DNN/HMM wake word detection system from partially labeled training data, and to use it in on-line applications: (i) we remove the prerequisite of frame-level alignments in the LF-MMI training algorithm, permitting the use of un-transcribed training examples that are annotated only for the presence/absence of the wake word; (ii) we show that the classical keyword/filler model must be supplemented with an explicit non-speech (silence) model for good performance; (iii) we present an FST-based decoder to perform online detection. We evaluate our methods on two real data sets, showing 50%--90% reduction in false rejection rates at pre-specified false alarm rates over the best previously published figures, and re-validate them on a third (large) data set.

preprint · 2019 · arXiv

Super-resolution Imaging of the Fluorescent Dipole Assembly with Polarized Structured Illumination Microscopy

Fluorescence polarization microscopy images both the intensity and orientation of fluorescent dipoles, which plays a vital role in studying the molecular structure and dynamics of bio-complexes. However, it is difficult to resolve the dipole assemblies on the subcellular structure and their dynamics in living cells with super-resolution. Here we report polarized structured illumination microscopy (pSIM), which decouples the entangled spatial and angular structured illumination through interpreting the dipoles in spatio-angular hyperspace. We demonstrate its application on a series of biological filamentous systems such as cytoskeleton networks and lambda-DNA, and report the dynamics of short actin sliding through a myosin-coated surface. Further, pSIM reveals "side-by-side" organization of the actin ring structure in the membrane-associated periodic skeleton in hippocampal neurons. It also images the dipole dynamics of green fluorescent proteins labeled to the microtubules in live U2OS cells. pSIM can be applied directly to a large variety of commercial or home-built SIM systems.