Source author record

Mingxiao Li

Mingxiao Li appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Computer Vision, Artificial Intelligence, Computation and Language, physics.optics, eess.SP, physics.app-ph

Catalog footprint

What is connected

5 works

6 topics

4 close collaborators

Preprint, 2026, arXiv

EponaV2: Driving World Model with Comprehensive Future Reasoning

Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception-free driving world models achieve impressive driving performance, their real-world reasoning ability for planning is built solely on next-frame image forecasting. Lacking sufficient supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models, which achieves high-quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can be additionally decoded to future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real-world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy. The state-of-the-art (SOTA) performance of EponaV2 among perception-free models on three NAVSIM benchmarks (+1.3 PDMS, +5.5 EPDMS) demonstrates the effectiveness of our methods.
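The abstract names a group relative policy optimization mechanism but does not spell out its formulation. As a rough illustration only, the sketch below shows the group-relative advantage step commonly associated with that family of methods: each sampled candidate is scored against the mean and spread of its own sampling group. The function name, the `eps` constant, and the example scores are hypothetical and are not taken from the paper; the flow matching component is not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    Assumption: this is the generic group-relative normalization,
    not the exact objective used by EponaV2.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical planner scores for four trajectories sampled for one scene.
scores = [0.82, 0.91, 0.78, 0.89]
advantages = group_relative_advantages(scores)

# Candidates above the group mean get a positive advantage, the rest negative.
for score, adv in zip(scores, advantages):
    print(f"score={score:.2f} advantage={adv:+.3f}")
```

Because the baseline is the group mean, no separate value network is needed to decide which samples to reinforce; that is the usual motivation for this style of normalization.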

Preprint, 2022, arXiv

Dynamic Key-value Memory Enhanced Multi-step Graph Reasoning for Knowledge-based Visual Question Answering

Knowledge-based visual question answering (VQA) is a vision-language task that requires an agent to correctly answer image-related questions using knowledge that is not presented in the given image. It is not only a more challenging task than regular VQA but also a vital step towards building a general VQA system. Most existing knowledge-based VQA systems process knowledge and image information similarly and ignore the fact that the knowledge base (KB) contains complete information about a triplet, while the extracted image information might be incomplete as the relations between two objects are missing or wrongly detected. In this paper, we propose a novel model named dynamic knowledge memory enhanced multi-step graph reasoning (DMMGR), which performs explicit and implicit reasoning over a key-value knowledge memory module and a spatial-aware image graph, respectively. Specifically, the memory module learns a dynamic knowledge representation and generates a knowledge-aware question representation at each reasoning step. Then, this representation is used to guide a graph attention operator over the spatial-aware image graph. Our model achieves new state-of-the-art accuracy on the KRVQR and FVQA datasets. We also conduct ablation experiments to prove the effectiveness of each component of the proposed model.
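The multi-step reading of a key-value memory described in this abstract can be pictured with a small sketch. It assumes plain dot-product attention over the keys and an additive query update between steps; both choices, and all names below, are illustrative assumptions rather than the DMMGR design, and the graph attention over the image graph is left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def memory_read(query, keys, values):
    """Attend over memory keys and return the weighted sum of the values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    read = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return read, weights


def reasoning_steps(question, keys, values, steps=2):
    """Refine the question vector by reading the memory at each step.

    Assumption: the update is a simple residual addition.
    """
    query = list(question)
    for _ in range(steps):
        read, _ = memory_read(query, keys, values)
        query = [q + r for q, r in zip(query, read)]
    return query


# Toy memory with two entries; the question aligns with the first key.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
question = [2.0, 0.0]

read, weights = memory_read(question, keys, values)
refined = reasoning_steps(question, keys, values, steps=2)
```

In a full model the refined vector would then guide attention over the image graph; here it only shows how repeated reads pull the query toward the best-matching memory entry.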

Preprint, 2022, arXiv

Modeling Coreference Relations in Visual Dialog

Visual dialog is a vision-language task where an agent needs to answer a series of questions grounded in an image based on its understanding of the dialog history and the image. The occurrence of coreference relations in the dialog makes it a more challenging task than visual question answering. Most previous works have focused on learning better multi-modal representations or on exploring different ways of fusing visual and language features, while the coreferences in the dialog are mainly ignored. In this paper, based on linguistic knowledge and discourse features of human dialog, we propose two soft constraints that can improve the model's ability to resolve coreferences in dialog in an unsupervised way. Experimental results on the VisDial v1.0 dataset show that our model, which integrates two novel and linguistically inspired soft constraints in a deep transformer neural architecture, obtains new state-of-the-art performance in terms of recall at 1 and other evaluation metrics compared to existing models, and does so without pretraining on other vision-language datasets. Our qualitative results also demonstrate the effectiveness of the method that we propose.

Preprint, 2022, arXiv

Self-injection-locked second-harmonic integrated source

High-coherence visible and near-visible laser sources are centrally important to the operation of advanced position/navigation/timing systems as well as classical/quantum sensing systems. However, the complexity and size of these bench-top lasers are an impediment to their transitioning beyond the laboratory. Here, a system-on-a-chip that emits high-coherence visible and near-visible lightwaves is demonstrated. The devices rely upon a new approach wherein wavelength conversion and coherence increase by self-injection-locking are combined within a single nonlinear resonator. This simplified approach is demonstrated in a hybridly integrated device and provides a short-term linewidth around 10-30 kHz. On-chip, converted optical power over 2 mW is also obtained. Moreover, measurements show that heterogeneous integration can result in conversion efficiency higher than 25% with output power over 11 mW. Because the approach uses mature III-V pump lasers in combination with thin-film lithium niobate, it can be scaled for low-cost manufacturing of high-coherence visible emitters. Also, the coherence generation process can be transferred to other frequency conversion processes including optical parametric oscillation, sum/difference frequency generation, and third-harmonic generation.

Preprint, 2020, arXiv

Lithium niobate photonic-crystal electro-optic modulator

Modern advanced photonic integrated circuits require dense integration of high-speed electro-optic functional elements on a compact chip that consumes only moderate power. Energy efficiency, operation speed, and device dimension are thus crucial metrics underlying almost all current developments of photonic signal processing units. Recently, thin-film lithium niobate (LN) has emerged as a promising platform for photonic integrated circuits. Here we make an important step towards miniaturizing functional components on this platform, reporting probably the smallest high-speed LN electro-optic modulators, based upon photonic crystal nanobeam resonators. The devices exhibit a tuning efficiency of up to 1.98 GHz/V and a broad modulation bandwidth of 17.5 GHz, with a tiny electro-optic modal volume of only 0.58 μm³. The modulators enable efficient electro-optic driving of high-Q photonic cavity modes in both adiabatic and non-adiabatic regimes, and allow us to achieve electro-optic switching at 11 Gb/s with a bit-switching energy as low as 22 fJ. The demonstration of energy-efficient and high-speed electro-optic modulation at the wavelength scale lays a crucial foundation for realizing large-scale LN photonic integrated circuits that are of immense importance for broad applications in data communication, microwave photonics, and quantum photonics.
