Researcher profile

Hanyu Li

Hanyu Li contributes to research discovery and scholarly infrastructure.

Researcher; affiliation not imported; open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging; Verification L1; Unclaimed author
21 works
0 followers
14 topics
4 close collaborators

Published work

21 published items

preprint, 2026, arXiv

An Information-Theoretic Criterion for Efficient Data Synthesis

Synthetic data becomes crucial for large language model training, but its effectiveness is highly inconsistent. We provide an information-theoretic account of this inconsistency: synthetic data improves a model only when the generation-training loop is information-open, i.e., shaped by external signals (verifiers, environments, or rubrics) that inject task-relevant information beyond the model's current distribution. When the loop is information-closed (relying on the model's own outputs without such signals), the data processing inequality ensures that task-relevant information can only decrease, making collapse a predicted outcome. Among information-open pipelines, both efficiency and generalization hinge on the meta-level of supervision: a coarser signal such as binary correctness treats all acceptable outputs as equivalent, so the behavior it teaches is not tied to any particular domain or surface form and generalizes naturally across tasks and domains. These observations lead to a guiding thesis: learning preferentially converges to the most information-efficient signal component available, which accelerates learning when that component is the intended one, but causes reward hacking when a spurious pattern happens to be simpler.
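
The closed-loop argument above rests on the data processing inequality. As a reminder of the standard statement (the variable names are illustrative assumptions, not the paper's notation): let $T$ be the task-relevant variable, $X$ the model's current outputs, and $Y$ any data synthesized from $X$ alone, so that $T \to X \to Y$ forms a Markov chain. Then:

```latex
T \to X \to Y \quad\Longrightarrow\quad I(T;Y) \le I(T;X)
```

One way to read "information-open" is that an external verifier or environment lets $Y$ depend on $T$ through a path other than $X$, so the chain, and with it the bound, no longer applies. This reading is a sketch and may differ from the paper's formal definition.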

preprint, 2026, arXiv

JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers. In the second stage, we fine-tune the same model with vanilla generating RLVR initialized from the judge. Compared to vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality-efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy gain with -42% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy improvement, demonstrating enhanced generalization.

preprint, 2026, arXiv

MiMo-V2-Flash Technical Report

We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense and token-level reward, enabling the student model to perfectly master teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.

preprint, 2025, arXiv

MiMo-Audio: Audio Language Models are Few-Shot Learners

Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.

preprint, 2023, arXiv

Least product relative error estimation for functional multiplicative model and optimal subsampling

In this paper, we study the functional linear multiplicative model based on the least product relative error criterion. Under some regularization conditions, we establish the consistency and asymptotic normality of the estimator. Further, we investigate the optimal subsampling for this model with massive data. Both the consistency and the asymptotic distribution of the subsampling estimator are first derived. Then, we obtain the optimal subsampling probabilities based on the A-optimality criterion. Moreover, useful alternative subsampling probabilities that do not require computing the inverse of the Hessian matrix are also proposed, which are easier to implement in practice. Finally, numerical studies and real data analysis are done to evaluate the performance of the proposed approaches.

preprint, 2022, arXiv

Componentwise perturbation analysis for the generalized Schur decomposition

By defining two important terms called basic perturbation vectors and obtaining their linear bounds, we obtain the linear componentwise perturbation bounds for unitary factors and upper triangular factors of the generalized Schur decomposition. The perturbation bounds for the diagonal elements of the upper triangular factors and the generalized invariant subspace are also derived. From the former, we present an upper bound and a condition number of the generalized eigenvalue. Furthermore, with a numerical iterative method, the nonlinear componentwise perturbation bounds of the generalized Schur decomposition are also provided. Numerical examples are given to test the obtained bounds. Among them, we compare our upper bound and condition number of the generalized eigenvalue with their counterparts given in the literature. Numerical results show that they are very close to each other but our results don't contain the information on the left and right generalized eigenvectors.

preprint, 2022, arXiv

DAIR-V2X: A Large-Scale Dataset for Vehicle-Infrastructure Cooperative 3D Object Detection

Autonomous driving faces great safety challenges for a lack of global perspective and the limitation of long-range perception capabilities. It has been widely agreed that vehicle-infrastructure cooperation is required to achieve Level 5 autonomy. However, there is still NO dataset from real scenarios available for computer vision researchers to work on vehicle-infrastructure cooperation-related problems. To accelerate computer vision research and innovation for Vehicle-Infrastructure Cooperative Autonomous Driving (VICAD), we release DAIR-V2X Dataset, which is the first large-scale, multi-modality, multi-view dataset from real scenarios for VICAD. DAIR-V2X comprises 71254 LiDAR frames and 71254 Camera frames, and all frames are captured from real scenes with 3D annotations. The Vehicle-Infrastructure Cooperative 3D Object Detection problem (VIC3D) is introduced, formulating the problem of collaboratively locating and identifying 3D objects using sensory inputs from both vehicle and infrastructure. In addition to solving traditional 3D object detection problems, the solution of VIC3D needs to consider the temporal asynchrony problem between vehicle and infrastructure sensors and the data transmission cost between them. Furthermore, we propose Time Compensation Late Fusion (TCLF), a late fusion framework for the VIC3D task as a benchmark based on DAIR-V2X. Find data, code, and more up-to-date information at https://thudair.baai.ac.cn/index and https://github.com/AIR-THU/DAIR-V2X.

preprint, 2022, arXiv

Greedy randomized sampling nonlinear Kaczmarz methods

The nonlinear Kaczmarz method was recently proposed to solve the system of nonlinear equations. In this paper, we first discuss two greedy selection rules, i.e., the maximum residual and maximum distance rules, for the nonlinear Kaczmarz iteration. Then, based on them, two kinds of greedy randomized sampling methods are presented. Further, we also devise four corresponding greedy randomized block methods, i.e., the multiple samples-based methods. The linear convergence in expectation of all the proposed methods is proved. Numerical results show that, in some applications including the Brown almost linear function and the generalized linear model, the greedy selection rules give faster convergence rates than the random ones, and the block methods outperform the single sample-based ones.

preprint, 2022, arXiv

On Convergence Lemma and Convergence Stability for Piecewise Analytic Functions

In this work, a convergence lemma for function $f$ being finite compositions of analytic mappings and the maximum operator is proved. The lemma shows that the set of $δ$-stationary points near an isolated local minimum point $x^*$ is shrinking to $x^*$ as $δ\to 0$. It is a natural extension of the version for strongly convex $C^1$ functions. However, the correctness of the lemma is subtle. Analytic mappings are necessary for the lemma in the sense that replacing it with differentiable or $C^\infty$ mappings makes the lemma false. The proof is based on stratification theorems of semi-analytic sets by Łojasiewicz. An extension of this proof presents a geometric characterization of the set of stationary points of $f$. Finally, a notion of stability on stationary points, called convergence stability, is proposed. It asks, under small numerical errors, whether a reasonable convergent optimization method started near a stationary point should eventually converge to the same stationary point. The concept of convergence stability becomes nontrivial qualitatively only when the objective function is both nonsmooth and nonconvex. Via the convergence lemma, an intuitive equivalent condition for convergence stability of $f$ is proved. These results together provide a new geometric perspective to study the problem of "where-to-converge" in nonsmooth nonconvex optimization.

preprint, 2022, arXiv

Optimal subsampling for functional quantile regression

Subsampling is an efficient method to deal with massive data. In this paper, we investigate the optimal subsampling for linear quantile regression when the covariates are functions. The asymptotic distribution of the subsampling estimator is first derived. Then, we obtain the optimal subsampling probabilities based on the A-optimality criterion. Furthermore, modified subsampling probabilities that do not require estimating the densities of the response variables given the covariates are also proposed, which are easier to implement in practice. Numerical experiments on synthetic and real data show that the proposed methods always outperform the one with uniform sampling and can approximate the results based on full data well with less computational effort.

preprint, 2022, arXiv

Optimal Subsampling for High-dimensional Ridge Regression

We investigate the feature compression of high-dimensional ridge regression using the optimal subsampling technique. Specifically, based on the basic framework of random sampling algorithm on feature for ridge regression and the A-optimal design criterion, we first obtain a set of optimal subsampling probabilities. Considering that the obtained probabilities are uneconomical, we then propose the nearly optimal ones. With these probabilities, a two-step iterative algorithm is established which has lower computational cost and higher accuracy. We provide theoretical analysis and numerical experiments to support the proposed methods. Numerical results demonstrate the decent performance of our methods.

preprint, 2022, arXiv

Practical Sketching-Based Randomized Tensor Ring Decomposition

Based on sketching techniques, we propose two randomized algorithms for tensor ring (TR) decomposition. Specifically, by defining new tensor products and investigating their properties, we apply the Kronecker sub-sampled randomized Fourier transform and TensorSketch to the alternating least squares problems derived from the minimization problem of TR decomposition to devise the randomized algorithms. From the former, we find an algorithmic framework based on random projection for randomized TR decomposition. Theoretical results on sketch size and complexity analyses for the two algorithms are provided. We compare our proposals with the state-of-the-art method using both synthetic and real data. Numerical results show that they have quite decent performance in accuracy and computing time.

preprint, 2022, arXiv

Rope3D: The Roadside Perception Dataset for Autonomous Driving and Monocular 3D Object Detection Task

Concurrent perception datasets for autonomous driving are mainly limited to frontal view with sensors mounted on the vehicle. None of them is designed for the overlooked roadside perception tasks. On the other hand, the data captured from roadside cameras have strengths over frontal-view data, which is believed to facilitate a safer and more intelligent autonomous driving system. To accelerate the progress of roadside perception, we present the first high-diversity challenging Roadside Perception 3D dataset, Rope3D, from a novel view. The dataset consists of 50k images and over 1.5M 3D objects in various scenes, which are captured under different settings including various cameras with ambiguous mounting positions, camera specifications, viewpoints, and different environmental conditions. We conduct strict 2D-3D joint annotation and comprehensive data analysis, as well as set up a new 3D roadside perception benchmark with metrics and evaluation devkit. Furthermore, we tailor the existing frontal-view monocular 3D object detection approaches and propose to leverage the geometry constraint to solve the inherent ambiguities caused by various sensors and viewpoints. Our dataset is available on https://thudair.baai.ac.cn/rope.

preprint, 2022, arXiv

Sketch-and-project methods for tensor linear systems

For tensor linear systems with respect to the popular t-product, we first present the sketch-and-project method and its adaptive variants. Their Fourier domain versions are also investigated. Then, considering that the existing sketching tensor or way for sampling has some limitations, we propose two improved strategies. Convergence analyses for the methods mentioned above are provided. We compare our methods with the existing ones using synthetic and real data. Numerical results show that they have quite decent performance in terms of the number of iterations and running time.

preprint, 2022, arXiv

Splitting-based randomized iterative methods for solving indefinite least squares problem

The indefinite least squares (ILS) problem is a generalization of the famous linear least squares problem. It minimizes an indefinite quadratic form with respect to a signature matrix. For this problem, we first propose an impressively simple and effective splitting (SP) method according to its own structure and prove that it converges 'unconditionally' for any initial value. Further, to avoid implementing some matrix multiplications and calculating the inverse of a large matrix, and considering the acceleration and efficiency of the randomized strategy, we develop two randomized iterative methods on the basis of the SP method as well as the randomized Kaczmarz, Gauss-Seidel and coordinate descent methods, and describe their convergence properties. Numerical results show that our three methods all have quite decent performance in both computing time and iteration numbers compared with the latest iterative method of the ILS problem, and also demonstrate that the two randomized methods indeed yield significant acceleration in terms of computing time.

preprint, 2020, arXiv

A Count Sketch Kaczmarz Method For Solving Large Overdetermined Linear Systems

In this paper, combining count sketch and maximal weighted residual Kaczmarz method, we propose a fast randomized algorithm for large overdetermined linear systems. Convergence analysis of the new algorithm is provided. Numerical experiments show that, for the same accuracy, our method behaves better in computing time compared with the state-of-the-art algorithm.
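
To make the row-selection idea concrete, here is a minimal pure-Python sketch of a Kaczmarz iteration that always projects onto the row with the largest residual relative to its row norm. It illustrates only the maximal weighted residual rule named in the abstract: the count sketch step is omitted, and the function name, the fixed iteration count, and the small test system are assumptions made for illustration, not the paper's algorithm.

```python
def max_weighted_residual_kaczmarz(A, b, iters=200):
    """Kaczmarz iteration with a greedy (maximal weighted residual) row choice.

    Illustrative only: assumes every row of A is nonzero and the system
    A x = b is consistent. No sketching is performed.
    """
    n = len(A[0])
    x = [0.0] * n
    row_norms = [sum(v * v for v in row) ** 0.5 for row in A]
    for _ in range(iters):
        # Residual b_i - <a_i, x> for every row.
        residuals = [bi - sum(a * xj for a, xj in zip(row, x))
                     for row, bi in zip(A, b)]
        # Pick the row whose residual is largest relative to its norm.
        i = max(range(len(A)), key=lambda k: abs(residuals[k]) / row_norms[k])
        if abs(residuals[i]) < 1e-14:
            break  # every equation is (numerically) satisfied
        # Orthogonal projection of x onto the hyperplane <a_i, x> = b_i.
        step = residuals[i] / row_norms[i] ** 2
        x = [xj + step * a for xj, a in zip(x, A[i])]
    return x


# Hypothetical consistent overdetermined system with known solution.
A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_true = [1.0, 2.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
x = max_weighted_residual_kaczmarz(A, b)
```

Each step recomputes all residuals, which costs a full pass over the matrix; reducing that cost for large systems is the kind of problem sketching is meant to address.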

preprint, 2020, arXiv

A novel greedy Gauss-Seidel method for solving large linear least squares problem

We present a novel greedy Gauss-Seidel method for solving large linear least squares problems. This method improves the greedy randomized coordinate descent (GRCD) method proposed recently by Bai and Wu [Bai ZZ, and Wu WT. On greedy randomized coordinate descent methods for solving large linear least-squares problems. Numer Linear Algebra Appl. 2019;26(4):1-15], which in turn improves the popular randomized Gauss-Seidel method. Convergence analysis of the new method is provided. Numerical experiments show that, for the same accuracy, our method outperforms the GRCD method in terms of computing time.
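
For readers unfamiliar with this family of methods, the following pure-Python sketch shows a plain greedy coordinate descent (Gauss-Seidel type) iteration for min ||b - Ax||^2. At each step it updates the single coordinate whose column is most correlated with the current residual. This is a textbook-style rule chosen for illustration; it is neither the GRCD method nor the method proposed in the paper, and the function name and example data are assumptions.

```python
def greedy_coordinate_descent(A, b, iters=500):
    """Greedy coordinate descent for the least squares problem min ||b - A x||^2.

    Illustrative only: assumes every column of A is nonzero.
    """
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    col_sq = [sum(v * v for v in c) for c in cols]
    x = [0.0] * n
    r = list(b)  # residual b - A x for the starting point x = 0
    for _ in range(iters):
        # Inner product of each column with the residual.
        grads = [sum(ci * ri for ci, ri in zip(c, r)) for c in cols]
        # Greedy choice: largest normalized correlation with the residual.
        j = max(range(n), key=lambda k: abs(grads[k]) / col_sq[k] ** 0.5)
        if abs(grads[j]) < 1e-14:
            break  # normal equations are (numerically) satisfied
        # Exact minimization along coordinate j, then residual update.
        delta = grads[j] / col_sq[j]
        x[j] += delta
        r = [ri - delta * ci for ri, ci in zip(r, cols[j])]
    return x


# Hypothetical inconsistent system; its least squares solution is (4/3, 7/3).
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [1.0, 2.0, 4.0]
x = greedy_coordinate_descent(A, b)
```

The expected solution follows from the normal equations for this example: 2x + y = 5 and x + 2y = 6.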

preprint, 2020, arXiv

A Novel Greedy Kaczmarz Method For Solving Consistent Linear Systems

With a quite different way to determine the working rows, we propose a novel greedy Kaczmarz method for solving consistent linear systems. Convergence analysis of the new method is provided. Numerical experiments show that, for the same accuracy, our method outperforms the greedy randomized Kaczmarz method and the relaxed greedy randomized Kaczmarz method introduced recently by Bai and Wu [Z.Z. BAI AND W.T. WU, On greedy randomized Kaczmarz method for solving large sparse linear systems, SIAM J. Sci. Comput., 40 (2018), pp. A592-A606; Z.Z. BAI AND W.T. WU, On relaxed greedy randomized Kaczmarz methods for solving large sparse linear systems, Appl. Math. Lett., 83 (2018), pp. 21-26] in terms of computing time.

preprint, 2020, arXiv

Greedy Block Gauss-Seidel Methods for Solving Large Linear Least Squares Problem

With a greedy strategy that first constructs the control index set of coordinates and then chooses the corresponding column submatrix in each iteration, we present a greedy block Gauss-Seidel (GBGS) method for solving large linear least squares problems. Theoretical analysis demonstrates that the convergence factor of the GBGS method can be much smaller than that of the greedy randomized coordinate descent (GRCD) method proposed recently in the literature. On the basis of the GBGS method, we further present a pseudoinverse-free greedy block Gauss-Seidel method, which no longer needs to calculate the Moore-Penrose pseudoinverse of the column submatrix in each iteration and hence can achieve greater acceleration. Moreover, this method can also be used for distributed implementations. Numerical experiments show that, for the same accuracy, our methods can far outperform the GRCD method in terms of the iteration number and computing time.

preprint, 2020, arXiv

On the condition number theory of the equality constrained indefinite least squares problem

In this paper, within a unified framework of the condition number theory we present the explicit expression of the projected condition number of the equality constrained indefinite least squares problem. By setting specific norms and parameters, some widely used condition numbers, like the normwise, mixed and componentwise condition numbers follow as its special cases. Considering practical applications and computation, some new compact forms or upper bounds of the projected condition numbers are given to improve the computational efficiency. The new compact forms are of particular interest in calculating the exact value of the 2-norm projected condition numbers. When the equality constrained indefinite least squares problem degenerates into some specific least squares problems, our results give some new findings on the condition number theory of these specific least squares problems. Numerical experiments are given to illustrate our theoretical results.

preprint, 2020, arXiv

Randomized block Krylov space methods for trace and log-determinant estimators

We present randomized algorithms based on block Krylov space method for estimating the trace and log-determinant of Hermitian positive semi-definite matrices. Using the properties of Chebyshev polynomial and Gaussian random matrix, we provide the error analysis of the proposed estimators and obtain the expectation and concentration error bounds. These bounds improve the corresponding ones given in the literature. Numerical experiments are presented to illustrate the performance of the algorithms and to test the error bounds.
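
As background for randomized trace estimation, here is a minimal sketch of the classical Hutchinson estimator, which averages z^T A z over random sign vectors z. It is a standard baseline, not the block Krylov estimator described in the abstract, and it does not cover the log-determinant. The function name, sample count, and test matrix are assumptions for illustration.

```python
import random


def hutchinson_trace(matvec, n, num_samples=100, seed=0):
    """Estimate trace(A) using only matrix-vector products with A.

    Uses Rademacher (random +1/-1) probe vectors; E[z^T A z] = trace(A).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        Az = matvec(z)
        total += sum(zi * yi for zi, yi in zip(z, Az))
    return total / num_samples


# Hypothetical symmetric positive definite matrix with trace 9.
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]


def matvec(v):
    """Multiply the example matrix A by the vector v."""
    return [sum(a * vj for a, vj in zip(row, v)) for row in A]


estimate = hutchinson_trace(matvec, 3, num_samples=2000)
exact = sum(A[i][i] for i in range(3))
```

The estimator is unbiased, but its error shrinks only with the square root of the number of samples, which is one reason to look for estimators with sharper error bounds.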