Source author record

Jianbo Ma

Jianbo Ma appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Artificial Intelligence, Computer Vision, cond-mat.mtrl-sci, eess.AS, Machine Learning, physics.atm-clus, Sound

Catalog footprint

What is connected

3 works

7 topics

4 close collaborators

Preprint · 2026 · arXiv

Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, a principled understanding of what makes a vision encoder suitable for VLM alignment is still lacking. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This finding raises a fundamental question: which properties of vision encoders matter in VLMs? Through comprehensive analysis, we identify that the structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, which we measure using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of the cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.
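As a rough illustration of the quantity this abstract refers to, the sketch below evaluates the squared-loss Gromov-Wasserstein objective on two tiny point sets. It is not the paper's metric or pipeline: the toy embeddings, the function names (`pairwise_distances`, `gw_cost`, `gw_upper_bound`), and the brute-force search over permutation couplings are all assumptions made for illustration. Restricting the search to permutations with uniform weights gives only an upper bound on the true distance, and it is feasible only for very small, equal-sized sets.

```python
from itertools import permutations
import math


def pairwise_distances(points):
    """Euclidean distance matrix within a single embedding space."""
    return [[math.dist(p, q) for q in points] for p in points]


def gw_cost(C1, C2, T):
    """Squared-loss Gromov-Wasserstein objective for a given coupling T.

    Sums (C1[i][k] - C2[j][l])^2 * T[i][j] * T[k][l] over all index pairs.
    """
    n, m = len(C1), len(C2)
    total = 0.0
    for i in range(n):
        for j in range(m):
            if T[i][j] == 0.0:
                continue  # zero-mass pairs contribute nothing
            for k in range(n):
                for l in range(m):
                    total += (C1[i][k] - C2[j][l]) ** 2 * T[i][j] * T[k][l]
    return total


def gw_upper_bound(C1, C2):
    """Minimize the objective over permutation couplings only.

    Assumes equal-sized spaces with uniform weights; the result is an
    upper bound on the Gromov-Wasserstein objective, not its exact value.
    """
    n = len(C1)
    best = math.inf
    for perm in permutations(range(n)):
        T = [[1.0 / n if perm[i] == j else 0.0 for j in range(n)] for i in range(n)]
        best = min(best, gw_cost(C1, C2, T))
    return best


# Hypothetical toy embeddings (made-up numbers, not from the paper).
vision = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
text_rotated = [(0.0, 0.0), (0.0, 1.0), (-2.0, 0.0)]  # vision rotated by 90 degrees
text_other = [(0.0, 0.0), (3.0, 0.0), (0.0, 0.5)]     # different internal geometry

same = gw_upper_bound(pairwise_distances(vision), pairwise_distances(text_rotated))
diff = gw_upper_bound(pairwise_distances(vision), pairwise_distances(text_other))
print(same, diff)
```

Because the objective compares distances within each space rather than coordinates across spaces, the rotated copy scores (numerically) zero while the differently shaped set scores higher. That invariance is what makes the distance usable as a proxy for structural similarity between modalities whose embedding spaces are not directly comparable.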

Preprint · 2024 · arXiv

A unified multichannel far-field speech recognition system: combining neural beamforming with attention based end-to-end model

Far-field speech recognition is a challenging task that conventionally relies on signal-processing beamforming to address noise and interference. However, the performance of this approach is usually limited by its heavy reliance on environmental assumptions. In this paper, we propose a unified multichannel far-field speech recognition system that combines neural beamforming with a transformer-based Listen, Attend and Spell (LAS) speech recognition system, extending the end-to-end speech recognition system to include speech enhancement. The framework is then jointly trained to optimize the final objective of interest. Specifically, factored complex linear projection (fCLP) is adopted to form the neural beamformer. Several pooling strategies for combining look directions are then compared in order to find the optimal approach. Moreover, information about the source direction is also integrated into the beamforming to explore the usefulness of source direction as a prior, which is usually available, especially in multi-modality scenarios. Experiments on different microphone array geometries are conducted to evaluate robustness against variation in microphone spacing. Large in-house databases are used to evaluate the effectiveness of the proposed framework, and the proposed method achieves a 19.26% improvement compared with a strong baseline.

Preprint · 2020 · arXiv

An ab initio molecular dynamics exploration of associates in Ba-Bi liquid with strong ordering trends

Fictive associates are widely used to describe and model liquid phases with strong ordering trends. However, in most cases there is little evidence for the assumed associates. In the present work, an ab initio molecular dynamics (AIMD) study is employed to investigate the characteristics of the Ba-Bi liquid, in which associates have been assumed in existing thermodynamic modeling. It is found that in the Ba-rich melt, the Bi atoms are almost completely surrounded by Ba atoms. The Bi-centered coordination polyhedra are strongly associated with the crystalline structures of Ba5Bi3 and Ba4Bi3 and have a longer lifetime than other polyhedra during the AIMD simulations. In addition, these Bi-centered polyhedra in the Ba-rich melt connect with each other through vertex, edge, face, and/or bipyramid sharing to form medium-range order (MRO). In the Bi-rich melt, the Ba-centered polyhedra also form MRO, but they are both structurally and compositionally diverse and have a shorter lifetime. These findings from the AIMD study provide evidence that a strongly ordering Ba4Bi3 associate and a weakly ordering BaBi3 associate exist in the Ba-Bi liquid. The predicted enthalpy of mixing in the liquid agrees well with the results of CALPHAD modeling in the literature.