Researcher profile

Di Qi

Di Qi contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging) | Verification L1 | Unclaimed author
6 works
0 followers
4 topics
4 close collaborators

Published work

6 published items

preprint | 2026 | arXiv

STEP3-VL-10B Technical Report

We present STEP3-VL-10B, a lightweight open-source foundation model designed to redefine the trade-off between compact efficiency and frontier-level multimodal intelligence. STEP3-VL-10B is realized through two strategic shifts: first, a unified, fully unfrozen pre-training strategy on 1.2T multimodal tokens that integrates a language-aligned Perception Encoder with a Qwen3-8B decoder to establish intrinsic vision-language synergy; and second, a scaled post-training pipeline featuring over 1k iterations of reinforcement learning. Crucially, we implement Parallel Coordinated Reasoning (PaCoRe) to scale test-time compute, allocating resources to scalable perceptual reasoning that explores and synthesizes diverse visual hypotheses. Consequently, despite its compact 10B footprint, STEP3-VL-10B rivals or surpasses models 10$\times$-20$\times$ larger (e.g., GLM-4.6V-106B, Qwen3-VL-235B) and top-tier proprietary flagships like Gemini 2.5 Pro and Seed-1.5-VL. Delivering best-in-class performance, it records 92.2% on MMBench and 80.11% on MMMU, while excelling in complex reasoning with 94.43% on AIME2025 and 75.95% on MathVision. We release the full model suite to provide the community with a powerful, efficient, and reproducible baseline.

preprint | 2026 | arXiv

The finite expression method for turbulent dynamics with high-order moment recovery

Turbulent dynamical systems are characterized by nonlinear interactions and stochastic effects that generate coupled statistical quantities, such as non-zero higher-order moments, which are difficult to capture from data with accuracy. We propose a two-stage data-driven modeling framework that combines symbolic regression with generative models to jointly identify the governing dynamics and predict their key statistical quantities. In Stage I of the framework, the Finite Expression Method (FEX) is adopted to discover closed-form expressions of the deterministic dynamics, recovering nonlinear interaction terms and external forcing without predefined libraries. In Stage II, generative models are introduced to learn the residual stochastic components as a refined correction to the model error from the Stage I approximation, enabling accurate characterization of higher-order statistics. Theoretical analysis establishes the consistency of the symbolic estimator and quantifies the estimation error in terms of data size and numerical discretization. The model performance is verified through detailed numerical experiments on the stochastic triad models across multiple regimes, demonstrating that the framework successfully recovers interaction terms and forcing expressions, and accurately predicts statistical moments up to order five. These results highlight the potential of integrating interpretable symbolic discovery with data-driven stochastic modeling for complex turbulent systems.
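The "statistical moments up to order five" mentioned in the abstract above can be made concrete with a small sketch. This is only an illustration of how standardized moments might be estimated from samples; the function name `standardized_moments` and the Gaussian test data are assumptions made here and are not taken from the paper.

```python
import numpy as np


def standardized_moments(samples, max_order=5):
    """Estimate standardized central moments of orders 1..max_order.

    Hypothetical helper for illustration; not the paper's code.
    Order 3 is the skewness and order 4 the (non-excess) kurtosis.
    """
    x = np.asarray(samples, dtype=float)
    centered = x - x.mean()
    std = x.std()  # population standard deviation (ddof=0)
    return {
        k: float(np.mean(centered**k) / std**k)
        for k in range(1, max_order + 1)
    }


# Assumed test data: a standard Gaussian, whose odd moments vanish
# and whose fourth standardized moment is 3.
rng = np.random.default_rng(0)
gaussian_moments = standardized_moments(rng.standard_normal(100_000))
```

Non-zero values at orders 3 and 5 would indicate the kind of skewed, non-Gaussian statistics that the abstract says are hard to capture from data.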

preprint | 2024 | arXiv

Slot-guided Volumetric Object Radiance Fields

We present a novel framework for 3D object-centric representation learning. Our approach effectively decomposes complex scenes into individual objects from a single image in an unsupervised fashion. This method, called slot-guided Volumetric Object Radiance Fields (sVORF), composes volumetric object radiance fields with object slots as guidance to implement unsupervised 3D scene decomposition. Specifically, sVORF obtains object slots from a single image via a transformer module, maps these slots to volumetric object radiance fields with a hypernetwork, and composes object radiance fields with the guidance of object slots at a 3D location. Moreover, sVORF significantly reduces the memory requirement due to small-sized pixel rendering during training. We demonstrate the effectiveness of our approach by showing top results in scene decomposition and generation tasks on complex synthetic datasets (e.g., Room-Diverse). Furthermore, we confirm the potential of sVORF to segment objects in real-world scenes (e.g., the LLFF dataset). We hope our approach can provide a preliminary understanding of the physical world and help ease future research in 3D object-centric representation learning.

preprint | 2022 | arXiv

A Physics-Informed Data-Driven Algorithm for Ensemble Forecast of Complex Turbulent Systems

A new ensemble forecast algorithm, named the physics-informed data-driven algorithm with conditional Gaussian statistics (PIDD-CG), is developed to predict the time evolution of the probability density functions (PDFs) of complex turbulent systems with partial observations. The PIDD-CG algorithm integrates a unique multiscale statistical closure model with an extremely efficient nonlinear data assimilation scheme to represent the PDF as a mixture of conditional statistics, which overcomes the curse of dimensionality for high-dimensional systems. The multiscale features in the time evolution of each conditional statistics ensemble member are effectively captured by an appropriate combination of physics-informed analytic formulae and recurrent neural networks. An information metric is adopted as the loss function for the latter to more accurately calibrate the key turbulent signals with strong fluctuations. The proposed algorithm succeeds in forecasting both the transient and statistical equilibrium non-Gaussian PDFs of strongly turbulent systems with intermittency, regime switching and extreme events.
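One way to read "a mixture of conditional statistics" in the abstract above is as an average of Gaussian densities, one per ensemble member. The formula below is a hedged sketch of that idea only: the symbols $\mathbf{u}$ (the forecast variables), $L$ (the ensemble size), and $\boldsymbol{\mu}_i(t)$, $\mathbf{R}_i(t)$ (the conditional mean and covariance of member $i$) are notation assumed here and do not come from the abstract.

```latex
p(\mathbf{u}, t) \;\approx\; \frac{1}{L} \sum_{i=1}^{L}
  \mathcal{N}\!\big(\mathbf{u};\, \boldsymbol{\mu}_i(t),\, \mathbf{R}_i(t)\big)
```

Under this reading, each member contributes a full Gaussian rather than a single point, which is one plausible reason a small ensemble could still describe a non-Gaussian PDF.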

preprint | 2020 | arXiv

Anomalous waves triggered by abrupt depth changes: laboratory experiments and truncated KdV statistical mechanics

Recent laboratory experiments of Bolles et al. (2019) demonstrate that an abrupt change in bottom topography can trigger anomalous statistics in randomized surface waves. Motivated by these observations, Majda et al. (2019) developed a theoretical framework, based on deterministic and statistical analysis of the truncated Korteweg-de Vries (TKdV) system, that successfully captures key qualitative features of the experiments, including the robust emergence of anomalous statistics and heightened skewness in the outgoing wavefield. Here, we extend these parallel experimental and modeling efforts with several new findings that have resulted from a synergetic interaction between the two. By precisely relating model parameters to physical ones, we calibrate the model inverse temperature to the specific conditions present in the experiments, thereby permitting a quantitative comparison. We find theoretically predicted distributions of surface displacement to match the experimental measurements with surprising detail. Prompted by the presence of surface slope in the TKdV Hamiltonian, we present new experimental measurements on surface slope statistics and compare them to model predictions. Analysis of some deterministic trajectories of TKdV elucidates the experimental length and time scales required for the statistical transition to a skewed state. Finally, the theory predicts a peculiar relationship between the outgoing displacement skewness and the change in slope variance, specifically how their ratio depends on the wave amplitude and depth ratio. New experimental measurements confirm this prediction in spectacular fashion.

preprint | 2020 | arXiv

ImageBERT: Cross-modal Pre-training with Large-scale Weak-supervised Image-Text Data

In this paper, we introduce a new vision-language pre-trained model -- ImageBERT -- for image-text joint embedding. Our model is a Transformer-based model, which takes different modalities as input and models the relationship between them. The model is pre-trained on four tasks simultaneously: Masked Language Modeling (MLM), Masked Object Classification (MOC), Masked Region Feature Regression (MRFR), and Image Text Matching (ITM). To further enhance the pre-training quality, we have collected a Large-scale weAk-supervised Image-Text (LAIT) dataset from the Web. We first pre-train the model on this dataset, then conduct a second stage of pre-training on Conceptual Captions and SBU Captions. Our experiments show that a multi-stage pre-training strategy outperforms single-stage pre-training. We also fine-tune and evaluate our pre-trained ImageBERT model on image retrieval and text retrieval tasks, and achieve new state-of-the-art results on both the MSCOCO and Flickr30k datasets.