Researcher profile

Yiming Wu

Yiming Wu contributes to research discovery and scholarly infrastructure.

Researcher - Affiliation not imported - Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
11 works
0 followers
11 topics
4 close collaborators

Published work

11 published items

preprint - 2026 - arXiv

Are Tools Always Beneficial? Learning to Invoke Tools Adaptively for Dual-Mode Multimodal LLM Reasoning

Tool-augmented reasoning has emerged as a promising direction for enhancing the reasoning capabilities of multimodal large language models (MLLMs). However, existing studies mainly focus on enabling models to perform tool invocation, while neglecting the necessity of invoking tools. We argue that tool usage is not always beneficial, as redundant or inappropriate invocations largely increase reasoning overhead and even mislead model predictions. To address this issue, we introduce AutoTool, a model that adaptively decides whether to invoke tools according to the characteristics of each query. Within a reinforcement learning framework, we design an explicit dual-mode reasoning strategy with mode-specific reward functions to guide the model toward producing accurate responses. Moreover, to prevent premature bias toward a single reasoning mode, AutoTool jointly explores and balances tool-assisted and text-centric reasoning throughout training, and promotes free exploration in later stages. Extensive experiments demonstrate that AutoTool exhibits outstanding performance and high efficiency, yielding a 21.8% accuracy gain on the V* benchmark compared to the base model, and a 44.9% improvement in efficiency over existing tool-augmented methods on the POPE benchmark. Code is available at https://github.com/MQinghe/AutoTool.

preprint - 2026 - arXiv

CoLVR: Enhancing Exploratory Latent Visual Reasoning via Contrastive Optimization

Due to the potential for exploratory reasoning of Latent Visual Reasoning, recent works tend to enable MLLMs (Multimodal Large Language Models) to perform visual reasoning by propagating continuous hidden states instead of decoding intermediate steps into discrete tokens. However, existing works typically rely on hard alignment objectives to force latent representations to match predefined visual features, thereby severely limiting the exploratory nature of the latent reasoning process. To address this problem, we propose CoLVR (Contrastive Optimization for Latent Visual Reasoning). To obtain more exploratory visual reasoning, CoLVR introduces a latent contrastive training framework. First, CoLVR learns diverse and exploratory representations with a latent contrastive objective guided by angle-based perturbation, which expands the semantic latent space and avoids over-constrained embedding. Then, CoLVR employs a latent trajectory contrastive reward for RL (Reinforcement Learning) post-training to enable fine-grained optimization of the latent visual reasoning process, thus fostering diverse reasoning behaviors. Experiments demonstrate that CoLVR significantly enhances the exploratory capability of latent representations, achieving average improvements of 5.83% on VSP and 8.00% on Jigsaw, while also outperforming existing latent models on out-of-domain benchmarks, with a 3.40% gain on MMStar. The data, code, and models are released at https://github.com/Oscar-dzy/CoLVR.

preprint - 2026 - arXiv

Efficient Long-Context Modeling in Diffusion Language Models via Block Approximate Sparse Attention

Diffusion Language Models (DLMs) enable globally coherent, bidirectional, and controllable text generation, offering advantages over traditional autoregressive LLMs, while scaling to ultra-long sequences remains costly. Many existing block-sparse attention methods select blocks by fixed sampling patterns over the high-resolution attention space, such as tail regions or anti-diagonal stripes. Such prior-driven sampling can miss salient tokens and introduce instability under distribution shifts. In this paper, we propose the Block Approximate Sparse Attention framework (BA-Att) with block-wise pre-downsampled operation, which identifies informative regions within a compact downsampled space, avoiding reliance on brittle positional priors. To analyze its theoretical behavior, we define an oracle post-downsample attention map and formalize the approximation error between pre- and post-downsample schemes. Based on this insight, we introduce a lightweight norm-sorting module and a covariance-compensated correction that approximates full covariance using diagonal QK variances, reducing computational complexity. Extensive experiments show that our operator achieves up to 6.95x acceleration over FlashAttention in attention computation, and maintains near full-attention performance at 50% sparsity across language models, multimodal language models, and video generation models, demonstrating strong efficiency and generalization.

preprint - 2026 - arXiv

Mass Concept Erasure in Diffusion Models with Concept Hierarchy

The success of diffusion models has raised concerns about the generation of unsafe or harmful content, prompting concept erasure approaches that fine-tune modules to suppress specific concepts while preserving general generative capabilities. However, as the number of erased concepts grows, these methods often become inefficient and ineffective, since each concept requires a separate set of fine-tuned parameters and may degrade the overall generation quality. In this work, we propose a supertype-subtype concept hierarchy that organizes erased concepts into a parent-child structure. Each erased concept is treated as a child node, and semantically related concepts (e.g., macaw and bald eagle) are grouped under a shared parent node, referred to as a supertype concept (e.g., bird). Rather than erasing concepts individually, we introduce an effective and efficient group-wise suppression method, where semantically similar concepts are grouped and erased jointly by sharing a single set of learnable parameters. During the erasure phase, standard diffusion regularization is applied to preserve the denoising process in unmasked regions. To mitigate the degradation of supertype generation caused by excessive erasure of semantically related subtypes, we propose a novel method called Supertype-Preserving Low-Rank Adaptation (SuPLoRA), which encodes the supertype concept information in the frozen down-projection matrix and updates only the up-projection matrix during erasure. Theoretical analysis demonstrates the effectiveness of SuPLoRA in mitigating generation performance degradation. We construct a more challenging benchmark that requires simultaneous erasure of concepts across diverse domains, including celebrities, objects, and pornographic content.

preprint - 2022 - arXiv

D3T-GAN: Data-Dependent Domain Transfer GANs for Few-shot Image Generation

As an important and challenging problem, few-shot image generation aims at generating realistic images through training a GAN model given few samples. A typical solution for few-shot generation is to transfer a well-trained GAN model from a data-rich source domain to the data-deficient target domain. In this paper, we propose a novel self-supervised transfer scheme termed D3T-GAN, addressing the cross-domain GAN transfer in few-shot image generation. Specifically, we design two individual strategies to transfer knowledge between generators and discriminators, respectively. To transfer knowledge between generators, we conduct a data-dependent transformation, which projects and reconstructs the target samples into the source generator space. Then, we perform knowledge transfer from transformed samples to generated samples. To transfer knowledge between discriminators, we design a multi-level discriminant knowledge distillation from the source discriminator to the target discriminator on both the real and fake samples. Extensive experiments show that our method improves the quality of generated images and achieves state-of-the-art FID scores on commonly used datasets.

preprint - 2022 - arXiv

F3A-GAN: Facial Flow for Face Animation with Generative Adversarial Networks

Formulated as a conditional generation problem, face animation aims at synthesizing continuous face images from a single source image driven by a set of conditional face motions. Previous works mainly model the face motion as conditions with 1D or 2D representation (e.g., action units, emotion codes, landmarks), which often leads to low-quality results in some complicated scenarios such as continuous generation and large-pose transformation. To tackle this problem, the conditions are supposed to meet two requirements, i.e., motion information preservation and geometric continuity. To this end, we propose a novel representation based on a 3D geometric flow, termed facial flow, to represent the natural motion of the human face at any pose. Compared with previous conditions, the proposed facial flow controls the continuous changes to the face well. After that, in order to utilize the facial flow for face editing, we build a synthesis framework generating continuous images with conditional facial flows. To fully take advantage of the motion information of facial flows, a hierarchical conditional framework is designed to combine the extracted multi-scale appearance features from images and motion features from flows in a hierarchical manner. The framework then decodes multiple fused features back to images progressively. Experimental results demonstrate the effectiveness of our method compared to other state-of-the-art methods.

preprint - 2022 - arXiv

Improving Long Tailed Document-Level Relation Extraction via Easy Relation Augmentation and Contrastive Learning

Toward real-world information extraction scenarios, research on relation extraction is advancing to document-level relation extraction (DocRE). Existing approaches for DocRE aim to extract relations by encoding various information sources in the long context with novel model architectures. However, the inherent long-tailed distribution problem of DocRE is overlooked by prior work. We argue that mitigating the long-tailed distribution problem is crucial for DocRE in real-world scenarios. Motivated by the long-tailed distribution problem, we propose an Easy Relation Augmentation (ERA) method for improving DocRE by enhancing the performance of tailed relations. In addition, we propose a novel contrastive learning framework based on our ERA, i.e., ERACL, which can further improve the model performance on tailed relations and achieve competitive overall DocRE performance compared to the state of the art.

preprint - 2021 - arXiv

The Color Gradients of the Globular Cluster Systems in M87 and M49

Combining data from the ACS Virgo Cluster Survey (ACSVCS) and the Next Generation Virgo cluster Survey (NGVS), we extend previous studies of color gradients of the globular cluster (GC) systems of the two most massive galaxies in the Virgo cluster, M87 and M49, to radii of $\sim 15~R_e$ ($\sim 200$ kpc for M87 and $\sim 250$ kpc for M49). We find significant negative color gradients, i.e., becoming bluer with increasing distance, out to these large radii. The gradients are driven mainly by the outwards decrease of the ratio of red to blue GC numbers. The color gradients are also detected out to $\sim 15~R_e$ in the red and blue sub-populations of GCs taken separately. In addition, we find a negative color gradient when we consider the satellite low-mass elliptical galaxies as a system, i.e., the satellite galaxies closer to the center of the host galaxy usually have redder color indices, both for their stars and GCs. According to the "two phase" formation scenario of massive early-type galaxies, the host galaxy accretes stars and GCs from low-mass satellite galaxies in the second phase. So the accreted GC system naturally inherits the negative color gradient present in the satellite population. This can explain why the color gradient of the GC system can still be observed at large radii after multiple minor mergers.

preprint - 2020 - arXiv

BANet: Bidirectional Aggregation Network with Occlusion Handling for Panoptic Segmentation

Panoptic segmentation aims to perform instance segmentation for foreground instances and semantic segmentation for background stuff simultaneously. The typical top-down pipeline concentrates on two key issues: 1) how to effectively model the intrinsic interaction between semantic segmentation and instance segmentation, and 2) how to properly handle occlusion for panoptic segmentation. Intuitively, the complementarity between semantic segmentation and instance segmentation can be leveraged to improve the performance. Besides, we notice that using detection/mask scores is insufficient for resolving the occlusion problem. Motivated by these observations, we propose a novel deep panoptic segmentation scheme based on a bidirectional learning pipeline. Moreover, we introduce a plug-and-play occlusion handling algorithm to deal with the occlusion between different object instances. The experimental results on the COCO panoptic benchmark validate the effectiveness of our proposed method. Code will be released soon at https://github.com/Mooonside/BANet.

preprint - 2020 - arXiv

Interplay between superconductivity and non-Fermi liquid at a quantum critical point in a metal. II. The $γ$-model at a finite $T$ for $0<γ<1$

In this paper we continue the analysis of the interplay between non-Fermi liquid and superconductivity for quantum-critical systems, the low-energy physics of which is described by an effective model with dynamical electron-electron interaction $V(Ω_m) \propto 1/|Ω_m|^γ$ (the $γ$ model). In paper I [A. Abanov and A. V. Chubukov, Phys. Rev. B 102, 024524 (2020)] two of us analyzed the $γ$ model at $T=0$ for $0<γ<1$ and argued that there exists a discrete, infinite set of topologically distinct solutions for the superconducting gap, all with the same spatial symmetry. The gap function $Δ_n (ω_m)$ for the $n$th solution changes sign $n$ times as a function of Matsubara frequency. In this paper we analyze the linearized gap equation at a finite $T$. We show that there exists an infinite set of pairing instability temperatures, $T_{p,n}$, and the eigenfunction $Δ_n (ω_{m})$ changes sign $n$ times as a function of the Matsubara number $m$. We argue that $Δ_n (ω_{m})$ retains its functional form below $T_{p,n}$ and at $T=0$ coincides with the $n$th solution of the nonlinear gap equation. As in paper I, we extend the model to the case when the interaction in the pairing channel has an additional factor $1/N$ compared to that in the particle-hole channel. We show that $T_{p,0}$ remains finite at large $N$ due to special properties of fermions with Matsubara frequencies $\pm πT$, but all other $T_{p,n}$ terminate at $N_{cr} = O(1)$. The gap function vanishes at $T \to 0$ for $N > N_{cr}$ and remains finite for $N < N_{cr}$. This is consistent with the $T=0$ analysis.

preprint - 2020 - arXiv

Semi-supervised Neural Chord Estimation Based on a Variational Autoencoder with Latent Chord Labels and Features

This paper describes a statistically-principled semi-supervised method of automatic chord estimation (ACE) that can make effective use of music signals regardless of the availability of chord annotations. The typical approach to ACE is to train a deep classification model (neural chord estimator) in a supervised manner by using only annotated music signals. In this discriminative approach, prior knowledge about chord label sequences (model output) has scarcely been taken into account. In contrast, we propose a unified generative and discriminative approach in the framework of amortized variational inference. More specifically, we formulate a deep generative model that represents the generative process of chroma vectors (observed variables) from discrete labels and continuous features (latent variables), which are assumed to follow a Markov model favoring self-transitions and a standard Gaussian distribution, respectively. Given chroma vectors as observed data, the posterior distributions of the latent labels and features are computed approximately by using deep classification and recognition models, respectively. These three models form a variational autoencoder and can be trained jointly in a semi-supervised manner. The experimental results show that the regularization of the classification model based on the Markov prior of chord labels and the generative model of chroma vectors improved the performance of ACE even under the supervised condition. The semi-supervised learning using additional non-annotated data can further improve the performance.