Researcher profile

Yifan Song

Yifan Song contributes to research discovery and scholarly infrastructure.

Researcher; affiliation not imported; open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
8 works
0 followers
9 topics
4 close collaborators

Published work

8 published item(s)

preprint, 2026, arXiv

MiMo-V2-Flash Technical Report

We present MiMo-V2-Flash, a Mixture-of-Experts (MoE) model with 309B total parameters and 15B active parameters, designed for fast, strong reasoning and agentic capabilities. MiMo-V2-Flash adopts a hybrid attention architecture that interleaves Sliding Window Attention (SWA) with global attention, with a 128-token sliding window under a 5:1 hybrid ratio. The model is pre-trained on 27 trillion tokens with Multi-Token Prediction (MTP), employing a native 32k context length and subsequently extended to 256k. To efficiently scale post-training compute, MiMo-V2-Flash introduces a novel Multi-Teacher On-Policy Distillation (MOPD) paradigm. In this framework, domain-specialized teachers (e.g., trained via large-scale reinforcement learning) provide dense and token-level reward, enabling the student model to perfectly master teacher expertise. MiMo-V2-Flash rivals top-tier open-weight models such as DeepSeek-V3.2 and Kimi-K2, despite using only 1/2 and 1/3 of their total parameters, respectively. During inference, by repurposing MTP as a draft model for speculative decoding, MiMo-V2-Flash achieves up to 3.6 acceptance length and 2.6x decoding speedup with three MTP layers. We open-source both the model weights and the three-layer MTP weights to foster open research and community collaboration.
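As a rough illustration of the hybrid layout described in the abstract above, the sketch below builds a layer schedule with five sliding-window layers per global layer and a causal sliding-window mask. The function names, the placement of the global layer at the end of each group of six, and the convention that the window counts the current token are assumptions made for illustration; the abstract does not specify them.

```python
from typing import List


def layer_schedule(num_layers: int, swa_per_global: int = 5) -> List[str]:
    """Label each layer "swa" or "global" under a swa_per_global:1 ratio.

    Assumption: the global layer closes each group (positions 6, 12, ...).
    """
    period = swa_per_global + 1
    return [
        "global" if (i + 1) % period == 0 else "swa"
        for i in range(num_layers)
    ]


def causal_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """mask[q][k] is True when query position q may attend to key position k.

    Assumption: the window includes the current token, so each query sees
    itself plus the previous (window - 1) tokens and nothing in the future.
    """
    return [
        [0 <= q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]
```

With `window=128`, a sliding-window layer under these assumptions attends to at most 128 positions per query regardless of sequence length, while the global layers keep full causal attention.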

preprint, 2026, arXiv

Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these "drifted" trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce Prune-OPD, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-k overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6%-68.0% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.
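The drift-monitoring idea in this abstract can be sketched in a few lines. The code below is only an illustration of top-k overlap checking with monotone down-weighting and truncation, not the authors' implementation: the function names, the overlap threshold, the decay factor, and the truncation cutoff are all hypothetical values chosen for the example.

```python
from typing import List, Sequence, Set, Tuple


def top_k_ids(logits: Sequence[float], k: int) -> Set[int]:
    """Indices of the k largest logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return set(order[:k])


def drift_weights(
    student_logits: List[Sequence[float]],
    teacher_logits: List[Sequence[float]],
    k: int = 2,
    threshold: float = 0.5,   # hypothetical: minimum acceptable top-k overlap
    decay: float = 0.5,       # hypothetical: factor applied on each drift event
    min_weight: float = 0.1,  # hypothetical: truncate once the weight drops below this
) -> Tuple[List[float], int]:
    """Return per-token reward weights and the position where the rollout is cut.

    The weight starts at 1.0 and is only ever multiplied by `decay`, so it is
    non-increasing along the trajectory.
    """
    weights: List[float] = []
    weight = 1.0
    for step, (s, t) in enumerate(zip(student_logits, teacher_logits)):
        overlap = len(top_k_ids(s, k) & top_k_ids(t, k)) / k
        if overlap < threshold:
            weight *= decay
        if weight < min_weight:
            return weights, step  # stop generating: supervision is unreliable
        weights.append(weight)
    return weights, len(weights)
```

When student and teacher agree at every step, all weights stay at 1.0 and the rollout is not truncated; repeated disagreement shrinks the weights and eventually cuts the rollout short.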

preprint, 2025, arXiv

MiMo-Audio: Audio Language Models are Few-Shot Learners

Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities and can generate highly realistic talk shows, recitations, livestreams, and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and the full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.

preprint, 2022, arXiv

Less is More: Rethinking State-of-the-art Continual Relation Extraction Models with a Frustratingly Easy but Effective Approach

Continual relation extraction (CRE) requires the model to continually learn new relations from class-incremental data streams. In this paper, we propose a Frustratingly easy but Effective Approach (FEA) with two learning stages for CRE: 1) Fast Adaption (FA) warms up the model with only new data. 2) Balanced Tuning (BT) finetunes the model on the balanced memory data. Despite its simplicity, FEA achieves comparable (on TACRED) or superior (on FewRel) performance compared with the state-of-the-art baselines. With careful examinations, we find that the data imbalance between new and old relations leads to a skewed decision boundary in the head classifiers over the pretrained encoders, thus hurting the overall performance. In FEA, the FA stage unleashes the potential of memory data for the subsequent finetuning, while the BT stage helps establish a more balanced decision boundary. With a unified view, we find that two strong CRE baselines can be subsumed into the proposed training pipeline. The success of FEA also provides actionable insights and suggestions for future model design in CRE.

preprint, 2022, arXiv

Marine Bubble Flow Quantification Using Wide-Baseline Stereo Photogrammetry

Reliable quantification of natural and anthropogenic gas release (e.g. CO2, methane) from the seafloor into the water column, and potentially to the atmosphere, is a challenging task. While ship-based echo sounders such as single beam and multibeam systems allow detection of free gas (bubbles) in the water even from a great distance, exact quantification utilizing the hydroacoustic data requires additional parameters such as rise speed and bubble size distribution. Optical methods are complementary in the sense that they can provide high temporal and spatial resolution of single bubbles or bubble streams from close distance. In this contribution we introduce a complete instrument and evaluation method for optical bubble stream characterization targeted at flows of up to 100 ml/min and bubbles with a radius of a few millimeters. The dedicated instrument employs a high-speed deep sea capable stereo camera system that can record terabytes of bubble imagery when deployed at a seep site for later automated analysis. Bubble characteristics can be obtained for short sequences, after which the instrument is relocated to other locations, or in autonomous mode at definable intervals over up to several days, in order to capture bubble flow variations due to e.g. tide dependent pressure changes or reservoir depletion. Besides reporting the steps to make bubble characterization robust and autonomous, we carefully evaluate the reachable accuracy to be in the range of 1-2% of the bubble radius and propose a novel auto-calibration procedure that, due to the lack of point correspondences, uses only the silhouettes of bubbles. The system has been operated successfully in 1000 m water depth at the Cascadia margin offshore Oregon to assess methane fluxes from various seep locations. Besides sample results we also report failure cases and lessons learnt during deployment and method development.

preprint, 2022, arXiv

Skill requirements in job advertisements: A comparison of skill-categorization methods based on explanatory power in wage regressions

In this paper, we compare different methods to extract skill requirements from job advertisements. We consider three top-down methods that are based on expert-created dictionaries of keywords, and a bottom-up method of unsupervised topic modeling, the Latent Dirichlet Allocation (LDA) model. We measure the skill requirements based on these methods using a U.K. dataset of job advertisements that contains over 1 million entries. We estimate the returns of the identified skills using wage regressions. Finally, we compare the different methods by the wage variation they can explain, assuming that better-identified skills will explain a higher fraction of the wage variation in the labor market. We find that the top-down methods perform worse than the LDA model, as they can explain only about 20% of the wage variation, while the LDA model explains about 45% of it.
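The comparison criterion in this abstract is the share of wage variation a skill measure explains, i.e. the R-squared of a wage regression. The sketch below computes R-squared for the simplest case of one regressor; it is a simplification for illustration, since wage regressions of the kind described typically include many skill variables and controls, and the function name is hypothetical.

```python
from typing import Sequence


def r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R-squared of an ordinary least squares fit of y on a single regressor x.

    For one regressor plus an intercept, this equals the squared sample
    correlation between x and y. Both x and y must have nonzero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)
```

Under this criterion, a skill categorization whose measure yields a higher R-squared against (log) wages is judged to identify skills better than one with a lower value.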

preprint, 2021, arXiv

Deep Sea Robotic Imaging Simulator

Nowadays underwater vision systems are being widely applied in ocean research. However, the largest portion of the ocean - the deep sea - still remains mostly unexplored. Only relatively few image sets have been taken from the deep sea due to the physical limitations caused by technical challenges and enormous costs. Deep sea images are very different from the images taken in shallow waters, and this area has not received much attention from the community. The shortage of deep sea images and the corresponding ground truth data for evaluation and training is becoming a bottleneck for the development of underwater computer vision methods. Thus, this paper presents a physical model-based image simulation solution, which uses in-air texture and depth information as inputs, to generate underwater image sequences taken by robots in deep ocean scenarios. Different from shallow water conditions, artificial illumination plays a vital role in deep sea image formation as it strongly affects the scene appearance. Our radiometric image formation model considers both attenuation and scattering effects with co-moving spotlights in the dark. By detailed analysis and evaluation of the underwater image formation model, we propose a 3D lookup table structure in combination with a novel rendering strategy to improve simulation performance. This enables us to integrate an interactive deep sea robotic vision simulation in the Unmanned Underwater Vehicles simulator. To inspire further deep sea vision research by the community, we will release the source code of our deep sea image converter to the public.

preprint, 2020, arXiv

Light Pose Calibration for Camera-light Vision Systems

Illuminating a scene with artificial light is a prerequisite for seeing in dark environments. However, nonuniform and dynamic illumination can deteriorate or even break computer vision approaches, for instance when operating a robot with headlights in the darkness. This paper presents a novel light calibration approach that takes multi-view and multi-distance images of a reference plane in order to provide pose information of the employed light sources to the computer vision system. By following a physical light propagation approach, under consideration of energy preservation, the light poses are estimated by minimizing the differences between real and rendered pixel intensities. In the evaluation we show the robustness and consistency of this method by statistically analyzing the light pose estimation results with different setups. Although the results are demonstrated using a rotationally-symmetric non-isotropic light, the method is also suited for non-symmetric lights.
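The core idea of estimating a light pose by minimizing the difference between observed and rendered intensities can be shown on a heavily reduced toy problem. The sketch below is not the paper's method: it assumes an isotropic point light directly above a flat reference plane, with its height as the only unknown, and uses a plain grid search. The function names and the light model are assumptions made for illustration.

```python
from typing import Sequence


def irradiance(height: float, radius: float, power: float = 1.0) -> float:
    """Irradiance on a plane at lateral distance `radius` from the point below the light.

    Inverse-square falloff times the cosine of the incidence angle:
    E = power * cos(theta) / d^2, with d^2 = height^2 + radius^2 and
    cos(theta) = height / d.
    """
    d2 = height * height + radius * radius
    return power * height / (d2 ** 1.5)


def estimate_height(
    radii: Sequence[float],
    observed: Sequence[float],
    candidates: Sequence[float],
) -> float:
    """Pick the candidate height whose rendered intensities best match the observations."""

    def squared_error(h: float) -> float:
        return sum((irradiance(h, r) - o) ** 2 for r, o in zip(radii, observed))

    return min(candidates, key=squared_error)
```

A realistic calibration would optimize the full pose of a non-isotropic light over many views with a continuous optimizer; the grid search here only makes the render-and-compare objective concrete.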