Researcher profile

Yuwei Fan

Yuwei Fan contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
9 works
0 followers
9 topics
4 close collaborators

Published work

9 published items

preprint · 2026 · arXiv

AIS: Adaptive Importance Sampling for Quantized RL

Reinforcement learning (RL) for large language models (LLMs) is dominated by the cost of rollout generation, which has motivated the use of low-precision rollouts (e.g., FP8) paired with a BF16 trainer to improve throughput and reduce memory pressure. This introduces a rollout-training mismatch that biases the policy gradient and can cause training to collapse outright on reasoning benchmarks. We show that the mismatch is non-stationary and acts as a double-edged sword: early in training it provides a stochastic exploration bonus, exposing the gradient to trajectories the trainer would otherwise under-sample, but the same perturbation transitions into a destabilizing source of bias as the policy concentrates. To solve this, we propose Adaptive Importance Sampling (AIS), a correction framework that adjusts the strength of its intervention on a per-batch basis. AIS combines three real-time diagnostics, namely weight reliability, divergence severity, and variance amplification, into a single mixing coefficient that interpolates between the uncorrected and fully importance-weighted gradients, suppressing the destabilizing component of the mismatch while preserving its exploratory benefit. We integrate AIS into GRPO and evaluate it on the diffusion-based LLaDA-8B-Instruct and the autoregressive Qwen3-8B and Qwen3.5-9B across mathematical reasoning and planning benchmarks. AIS matches the BF16 baseline on most tasks while retaining the 1.5 to 2.76x rollout speedup of FP8.
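
The interpolation described in this abstract can be illustrated with a minimal sketch. The abstract does not give formulas for the three diagnostics or for how they are combined, so everything below is an assumption made for illustration: the function names, the use of normalized effective sample size for weight reliability, the KL estimate for divergence severity, the variance ratio for variance amplification, the product rule, and the `kl_scale` and `var_cap` parameters are all hypothetical and are not taken from the paper.

```python
import numpy as np

def ais_mixing_coefficient(logp_train, logp_rollout, advantages,
                           kl_scale=0.05, var_cap=4.0):
    """Hypothetical per-batch mixing coefficient alpha in [0, 1].

    logp_train / logp_rollout are log-probabilities of the sampled
    trajectories under the trainer and the low-precision rollout policy.
    The diagnostics and their combination are illustrative assumptions.
    """
    log_w = logp_train - logp_rollout      # log importance ratio per sample
    w = np.exp(log_w)

    # Weight reliability: normalized effective sample size, in (0, 1].
    reliability = w.sum() ** 2 / (w.size * np.sum(w ** 2))

    # Divergence severity: Monte Carlo estimate of KL(rollout || trainer),
    # squashed into [0, 1) with an assumed scale.
    kl = max(float(np.mean(-log_w)), 0.0)
    severity = 1.0 - np.exp(-kl / kl_scale)

    # Variance amplification: how much weighting inflates the variance of
    # the advantage signal; damp the correction once it exceeds var_cap.
    amplification = np.var(w * advantages) / (np.var(advantages) + 1e-12)
    damping = min(1.0, var_cap / max(amplification, 1e-12))

    return float(np.clip(reliability * severity * damping, 0.0, 1.0))

def mixed_weights(logp_train, logp_rollout, alpha):
    """Per-sample gradient weights: alpha=0 is uncorrected, alpha=1 is full IS."""
    w = np.exp(logp_train - logp_rollout)
    return (1.0 - alpha) + alpha * w
```

With identical trainer and rollout log-probabilities the coefficient is zero and every sample keeps weight 1, which corresponds to the uncorrected gradient; as the two policies diverge, the weights move toward the plain importance ratios.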

preprint · 2026 · arXiv

Multi-Scale Dequant: Eliminating Dequantization Bottleneck via Activation Decomposition for Efficient LLM Inference

Quantization is essential for efficient large language model (LLM) inference, yet the dequantization step (converting low-bit weights back to high precision for matrix multiplication) has become a critical bottleneck on modern AI accelerators. On architectures with decoupled compute units (e.g., Ascend NPUs), dequantization operations can consume more cycles than the matrix multiplication itself, leaving the high-throughput tensor cores underutilized. This paper presents Multi-Scale Dequant (MSD), a quantization framework that removes weight/KV dequantization from the GEMM critical path. Instead of lifting low-bit weights to BF16 precision, MSD decomposes high-precision BF16 activations into multiple low-precision components, each of which can be multiplied directly with quantized weights via native hardware-accelerated GEMM. This approach shifts the computational paradigm from precision conversion to multi-scale approximation, avoiding INT8-to-BF16 weight conversion before GEMM. We instantiate MSD for two weight formats and derive tight error bounds for each. For INT8 weights (W4A16), two-pass INT8 decomposition achieves near 16 effective bits. For MXFP4 weights (W4A16), two-pass MXFP4 decomposition yields near 6.6 effective bits with error bound 1/64 per block, surpassing single-pass MXFP8 (5.24 bits) while maintaining the same effective GEMM compute time. We further derive closed-form latency and HBM traffic models showing that MSD avoids the Vector-Cube pipeline stall caused by dequantization and reduces KV cache HBM traffic by up to 2.5 times in attention. Numerical simulations on matrix multiplication and Flash Attention kernels confirm that MSD does not degrade accuracy compared to dequantization baselines, and in many settings achieves lower L2 error.
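
The two-pass INT8 activation decomposition mentioned in this abstract can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the paper's kernel: `float32` stands in for BF16 (NumPy has no BF16 type), scales are symmetric and per-tensor rather than per-block, and the helper names `quantize_int8`, `two_pass_decompose` and `msd_matmul` are hypothetical.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor INT8 quantization (assumed scheme): x ~ scale * q."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def two_pass_decompose(x):
    """Split activations into a coarse INT8 part and an INT8 residual part."""
    q1, s1 = quantize_int8(x)
    residual = x - s1 * q1.astype(x.dtype)   # what the first pass missed
    q2, s2 = quantize_int8(residual)
    return (q1, s1), (q2, s2)

def msd_matmul(x, w_q, w_scale):
    """Approximate x @ (w_scale * w_q) using only integer matrix products.

    The INT8 weights are never converted to floating point; each
    activation component is multiplied with them directly and the
    partial results are rescaled and summed afterwards.
    """
    (q1, s1), (q2, s2) = two_pass_decompose(x)
    acc1 = q1.astype(np.int32) @ w_q.astype(np.int32)   # INT8 x INT8 -> INT32
    acc2 = q2.astype(np.int32) @ w_q.astype(np.int32)
    return w_scale * (s1 * acc1 + s2 * acc2)
```

Because the residual of the first pass is at most half a quantization step, the second pass refines the approximation by roughly another 8 bits, which is the intuition behind the "near 16 effective bits" figure quoted above.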

preprint · 2021 · arXiv

A Simple Multiscale Method for Mean Field Games

This paper proposes a multiscale method for numerically solving mean field games that accelerates convergence and addresses the problem of determining the initial guess. Starting from an approximate solution at the coarsest level, the method constructs approximations on successively finer grids via alternating sweeping, which not only allows for the use of classical time marching numerical schemes but also enables applications to both local and nonlocal problems. At each level, numerical relaxation is used to stabilize the iterative process. A second-order discretization scheme is derived for higher-order convergence. Numerical examples are provided to demonstrate the efficiency of the proposed method in both local and nonlocal, 1-dimensional and 2-dimensional cases.

preprint · 2021 · arXiv

Multi-Level Fine-Tuning: Closing Generalization Gaps in Approximation of Solution Maps under a Limited Budget for Training Data

In scientific machine learning, regression networks have recently been applied to approximate solution maps (e.g., the potential-ground state map of the Schrödinger equation). However, to reduce the generalization error, the regression network needs to be fit on a large number of training samples (e.g., a collection of potential-ground state pairs). The training samples can be produced by running numerical solvers, which takes much time in many applications. In this paper, we aim to reduce the generalization error without spending more time in generating training samples. Inspired by few-shot learning techniques, we develop the Multi-Level Fine-Tuning algorithm by introducing levels of training: we first train the regression network on samples generated at the coarsest grid and then successively fine-tune the network on samples generated at finer grids. Within the same amount of time, numerical solvers generate more samples on coarse grids than on fine grids. We demonstrate a significant reduction of generalization error in numerical experiments on challenging problems with oscillations, discontinuities, or rough coefficients. Further analysis can be conducted in the Neural Tangent Kernel regime and we provide practical estimators of the generalization error. The number of training samples at different levels can be optimized for the smallest estimated generalization error under a budget constraint on training data. The optimized distribution of budget over levels provides practical guidance with theoretical insight as in the celebrated Multi-Level Monte Carlo algorithm.
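
The coarse-to-fine training schedule described in this abstract can be sketched as below. To keep the example short and runnable, a linear least-squares model trained by gradient descent stands in for the regression network; that substitution, the function names, and the `lr` and `steps` defaults are assumptions made for illustration and are not the paper's setup.

```python
import numpy as np

def gradient_descent(w, X, y, lr, steps):
    """Plain gradient descent on the mean squared error of y ~ X @ w."""
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

def multi_level_fine_tune(levels, dim, lr=0.05, steps=500):
    """Train on the coarsest level, then fine-tune on each finer level.

    levels: list of (X, y) pairs ordered from coarsest to finest grid.
    Coarse levels are assumed to be cheap (many samples, less accurate
    labels); fine levels expensive (few samples, accurate labels).
    Each stage starts from the weights left by the previous stage.
    """
    w = np.zeros(dim)
    for X, y in levels:
        w = gradient_descent(w, X, y, lr, steps)
    return w
```

A typical call passes a large coarse-grid dataset followed by a small fine-grid dataset, for example `multi_level_fine_tune([(X_coarse, y_coarse), (X_fine, y_fine)], dim)`; the fine-level stage then only has to correct the discrepancy between the coarse and fine solution maps.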

preprint · 2020 · arXiv

A Nonlinear Hyperbolic Model for Radiative Transfer Equation in Slab Geometry

Linear models for the radiative transfer equation have been well developed, while nonlinear models are seldom investigated, even for slab geometry, due to some essential difficulties. We have proposed a moment model in MPN for slab geometry which combines the ideas of the classical PN and MN models. Though the model is far from perfect, it was demonstrated to be quite efficient in numerically approximating the solution of the radiative transfer equation, so we are motivated to further improve it. Consequently, in this paper we propose a new model following the roadmap in MPN, with significant theoretical progress. The new model is derived with global hyperbolicity, while some necessary physical properties are preserved. We give a complete analysis of the characteristic structure and propose a numerical scheme for the new model. Numerical examples are presented to demonstrate the numerical performance of the new model.

preprint · 2020 · arXiv

Hyperbolic Model Reduction for Kinetic Equations

We give a brief historical review of moment model reduction for kinetic equations, particularly Grad's moment method for the Boltzmann equation. The focus is on the hyperbolicity of the reduced model, which is essential to the existence of its classical solution as a Cauchy problem. The theory of the framework we developed in recent years is then introduced, which may preserve the hyperbolic nature of the kinetic equations with high universality. Some of the latest progress on the comparison between models with and without hyperbolicity is presented to validate the hyperbolic moment models for rarefied gases.

preprint · 2020 · arXiv

Meta-learning Pseudo-differential Operators with Deep Neural Networks

This paper introduces a meta-learning approach for parameterized pseudo-differential operators with deep neural networks. With the help of the nonstandard wavelet form, the pseudo-differential operators can be approximated in a compressed form with a collection of vectors. The nonlinear map from the parameter to this collection of vectors and the wavelet transform are learned together from a small number of matrix-vector multiplications of the pseudo-differential operator. Numerical results for Green's functions of elliptic partial differential equations and the radiative transfer equations demonstrate the efficiency and accuracy of the proposed approach.

preprint · 2019 · arXiv

Solving Electrical Impedance Tomography with Deep Learning

This paper introduces a new approach for solving electrical impedance tomography (EIT) problems using deep neural networks. The mathematical problem of EIT is to invert the electrical conductivity from the Dirichlet-to-Neumann (DtN) map. Both the forward map from the electrical conductivity to the DtN map and the inverse map are high-dimensional and nonlinear. Motivated by the linear perturbative analysis of the forward map and based on a numerically low-rank property, we propose compact neural network architectures for the forward and inverse maps for both 2D and 3D problems. Numerical results demonstrate the efficiency of the proposed neural networks.

preprint · 2012 · arXiv

Globally Hyperbolic Regularization of Grad's Moment System

In this paper, we propose a globally hyperbolic regularization to the general Grad's moment system in multi-dimensional spaces. Systems with moments up to an arbitrary order are studied. The characteristic speeds of the regularized moment system can be analytically given and only depend on the macroscopic velocity and the temperature. The structure of the eigenvalues and eigenvectors of the coefficient matrix is fully clarified. The regularization together with the properties of the resulting moment systems is consistent with the simple one-dimensional case discussed in [1]. Besides, all characteristic waves are proven to be genuinely nonlinear or linearly degenerate, and the studies on the properties of rarefaction waves, contact discontinuities and shock waves are included.