Source author record

Xiangyu Kong

Xiangyu Kong appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Computer Vision, Artificial Intelligence, quant-ph, Computation and Language, eess.AS, Multiagent Systems, Multimedia, physics.optics, Robotics, Sound

Catalog footprint

What is connected

7 works

10 topics

4 close collaborators

Preprint, 2026, arXiv

Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion" - a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than just scene perception; they require second-order Theory of Mind (ToM). Specifically, an Agent A must be able to infer Agent B's belief about the environment, governed strictly by Agent B's physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task: requiring Agent A to predict Agent B's estimation of A's relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based coordinate transformations. Instead, we introduce an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought (CoT). This guides the MLLM through a "geometric-to-semantic" projection, forcing it to first establish B's local coordinate system and then dynamically weight visual and auditory modalities based on whether A falls within B's visual frustum. Extensive evaluations reveal that while current MLLMs fundamentally struggle with spatial symmetry and out-of-view ambiguities (establishing a rigorous zero-shot baseline of 42% accuracy), our sensory-bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines. By systematically benchmarking these perceptual bottlenecks, our work exposes the current limits of MLLM spatial reasoning and establishes a foundational paradigm for epistemic, modality-aware inference in Embodied AI.
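The condition "whether A falls within B's visual frustum" can be illustrated with an explicit 2-D check. This is only a geometric sketch of that condition: the abstract describes steering an MLLM through this reasoning with a prompt chain, not computing it with rule-based transforms. The function names, the flat 2-D world, and the 90-degree field of view are all assumptions made for illustration.

```python
import math


def bearing_in_b_frame(a_xy, b_xy, b_heading_deg):
    """Signed angle in degrees from B's heading to A, in [-180, 180).

    Positive values mean A is counterclockwise (to B's left).
    Hypothetical helper; the 2-D setup is an assumption.
    """
    dx = a_xy[0] - b_xy[0]
    dy = a_xy[1] - b_xy[1]
    world_angle = math.degrees(math.atan2(dy, dx))
    # Wrap the difference into [-180, 180).
    return (world_angle - b_heading_deg + 180.0) % 360.0 - 180.0


def a_in_b_view(a_xy, b_xy, b_heading_deg, fov_deg=90.0):
    """True if A lies inside B's horizontal field of view (assumed 90 degrees)."""
    return abs(bearing_in_b_frame(a_xy, b_xy, b_heading_deg)) <= fov_deg / 2.0


def dominant_modality(a_xy, b_xy, b_heading_deg, fov_deg=90.0):
    """Pick which cue B would mostly rely on to locate A (illustrative rule)."""
    return "visual" if a_in_b_view(a_xy, b_xy, b_heading_deg, fov_deg) else "audio"
```

For example, with B at the origin facing along +x, an agent A at (1, 0) is in view and one at (-1, 0) is behind B, so B would have to rely on sound in the second case.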

Preprint, 2026, arXiv

CAR: Query-Guided Confidence-Aware Reranking for Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) depends on document ranking to provide useful evidence for generation, but conventional reranking methods mainly optimize query-document relevance rather than generation usefulness. A relevant document may still introduce noise, while a lower-ranked document may better reduce the generator's uncertainty. We propose CAR (Confidence-Aware Reranking), a query-guided, training-free, and plug-and-play reranking framework that uses generator confidence change as a document usefulness signal. CAR estimates confidence through the semantic consistency of multiple sampled answers under query-only and query-document conditions. Documents that significantly increase confidence are promoted, those that decrease confidence are demoted, and uncertain cases preserve the baseline order, while a query-level gate avoids unnecessary intervention on already confident queries. Experiments on four BEIR datasets show that CAR consistently improves NDCG@5 across sparse and dense retrievers, LLM-based and supervised rerankers, and four LLM backbones. Notably, CAR improves the YesNo reranker by 25.4 percent on average under Contriever retrieval, and its ranking gains strongly correlate with downstream generation F1 improvements, achieving Spearman rho = 0.964.
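The promote, demote, and keep logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: exact-match agreement after lowercasing stands in for semantic consistency, the thresholds `tau` and `gate` are hypothetical values, and the sampled answers are passed in directly rather than drawn from a generator.

```python
from collections import Counter


def consistency(answers):
    """Fraction of sampled answers that agree with the most common answer.

    Exact match after normalization is an assumed stand-in for the
    semantic-consistency measure mentioned in the abstract.
    """
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        return 0.0
    return Counter(normalized).most_common(1)[0][1] / len(normalized)


def car_rerank(baseline_docs, query_only_answers, answers_per_doc,
               tau=0.2, gate=0.9):
    """Rerank documents by the confidence change they induce.

    baseline_docs: document ids in the baseline order.
    answers_per_doc: maps each id to answers sampled with that document.
    tau and gate are hypothetical thresholds.
    """
    base_conf = consistency(query_only_answers)
    # Query-level gate: leave already confident queries untouched.
    if base_conf >= gate:
        return list(baseline_docs)

    def bucket(doc):
        delta = consistency(answers_per_doc[doc]) - base_conf
        if delta > tau:
            return 0  # promoted
        if delta < -tau:
            return 2  # demoted
        return 1      # uncertain: keep baseline position

    # sorted() is stable, so the baseline order is kept within each bucket.
    return sorted(baseline_docs, key=bucket)
```

Bucketing with a stable sort is one simple way to preserve the baseline order for uncertain documents; a score-based interpolation would be another reasonable choice.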

Preprint, 2022, arXiv

Multi-Modal Multi-Correlation Learning for Audio-Visual Speech Separation

In this paper we propose a multi-modal multi-correlation learning framework for audio-visual speech separation. Although considerable effort has gone into combining audio and visual modalities, most previous methods simply concatenate audio and visual features. To exploit the genuinely useful information in these two modalities, we define two key correlations: (1) identity correlation (between timbre and facial attributes); (2) phonetic correlation (between phoneme and lip motion). Together, these two correlations comprise the complete information, which helps separate the target speaker's voice, especially in hard cases such as speakers of the same gender or speech with similar content. For implementation, either contrastive learning or adversarial training is applied to maximize these two correlations. Both work well, while adversarial training shows its advantage by avoiding some limitations of contrastive learning. Compared with previous research, our solution demonstrates clear improvement on experimental metrics without additional complexity. Further analysis reveals the validity of the proposed architecture and its good potential for future extension.

Preprint, 2020, arXiv

Pose-Assisted Multi-Camera Collaboration for Active Object Tracking

Active Object Tracking (AOT) is crucial to many vision-based applications, e.g., mobile robots and intelligent surveillance. However, there are a number of challenges when deploying active tracking in complex scenarios, e.g., the target is frequently occluded by obstacles. In this paper, we extend single-camera AOT to a multi-camera setting, where cameras track a target in a collaborative fashion. To achieve effective collaboration among cameras, we propose a novel Pose-Assisted Multi-Camera Collaboration System, which enables a camera to cooperate with the others by sharing camera poses for active object tracking. In the system, each camera is equipped with two controllers and a switcher: the vision-based controller tracks targets based on observed images, and the pose-based controller moves the camera in accordance with the poses of the other cameras. At each step, the switcher decides which action to take from the two controllers according to the visibility of the target. The experimental results demonstrate that our system outperforms all the baselines and is capable of generalizing to unseen environments. The code and demo videos are available on our website https://sites.google.com/view/pose-assistedcollaboration.
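The per-camera switching step can be sketched as below. This is an illustrative simplification: a hard visibility rule stands in for the switcher, which in the actual system may be learned, and the `CameraStep` type, its fields, and the string actions are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class CameraStep:
    """Inputs for one camera at one time step (hypothetical structure)."""
    target_visible: bool
    vision_action: str  # action proposed by the vision-based controller
    pose_action: str    # action proposed by the pose-based controller


def switch(step):
    """Use the vision-based action when the target is visible, else the pose-based one."""
    return step.vision_action if step.target_visible else step.pose_action


def joint_actions(steps):
    """Apply the switcher independently to every camera."""
    return [switch(s) for s in steps]
```

A camera that has lost sight of the target falls back on the poses shared by the other cameras, which is the collaborative behavior the abstract describes.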

Preprint, 2020, arXiv

Quantum Pure State Tomography via Variational Hybrid Quantum-Classical Method

To obtain a complete description of a quantum system, one usually employs standard quantum state tomography, which however requires an exponential number of measurements and hence is impractical when the system's size grows large. In this work, we introduce a self-learning tomographic scheme based on the variational hybrid quantum-classical method. The key part of the scheme is a learning procedure, in which we learn a control sequence capable of driving the unknown target state coherently to a simple fiducial state, so that the target state can be directly reconstructed by applying the control sequence in reverse. In this manner, the state tomography problem is converted to a state-to-state transfer problem. To solve the latter problem, we use the closed-loop learning control approach. Our scheme is further experimentally tested using 4-qubit nuclear magnetic resonance techniques. Experimental results indicate that the proposed tomographic scheme can handle a broad class of states including entangled states in quantum information, as well as dynamical states of quantum many-body systems common to condensed matter physics.

Preprint, 2018, arXiv

A Quantum Algorithm for Solving Linear Differential Equations: Theory and Experiment

We present and experimentally realize a quantum algorithm for efficiently solving the following problem: given an $N\times N$ matrix $\mathcal{M}$, an $N$-dimensional vector $\textbf{\emph{b}}$, and an initial vector $\textbf{\emph{x}}(0)$, obtain a target vector $\textbf{\emph{x}}(t)$ as a function of time $t$ according to the constraint $d\textbf{\emph{x}}(t)/dt=\mathcal{M}\textbf{\emph{x}}(t)+\textbf{\emph{b}}$. We show that our algorithm exhibits an exponential speedup over its classical counterpart in certain circumstances. In addition, we demonstrate our quantum algorithm for a $4\times4$ linear differential equation using a 4-qubit nuclear magnetic resonance quantum information processor. Our algorithm provides a key technique for solving many important problems which rely on the solutions to linear differential equations.
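For reference, the constraint above has a well-known classical closed-form solution when $\mathcal{M}$ and $\textbf{\emph{b}}$ are constant in time. This is standard linear-ODE theory, not a description of the quantum algorithm itself. The second line additionally assumes that $\mathcal{M}$ is invertible, which the abstract does not state; $I$ denotes the $N\times N$ identity matrix.

```latex
\begin{align}
\textbf{\emph{x}}(t)
  &= e^{\mathcal{M}t}\,\textbf{\emph{x}}(0)
     + \int_0^t e^{\mathcal{M}(t-s)}\,\textbf{\emph{b}}\,ds \\
  &= e^{\mathcal{M}t}\,\textbf{\emph{x}}(0)
     + \left(e^{\mathcal{M}t} - I\right)\mathcal{M}^{-1}\,\textbf{\emph{b}}
\end{align}
```

Differentiating the second line gives $\mathcal{M}e^{\mathcal{M}t}\textbf{\emph{x}}(0) + e^{\mathcal{M}t}\textbf{\emph{b}}$, which equals $\mathcal{M}\textbf{\emph{x}}(t)+\textbf{\emph{b}}$, so the constraint is satisfied.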

Preprint, 2013, arXiv

A lasing mechanism based on absorption boundary of gain materials

A new lasing mechanism is investigated experimentally. It is quite different from the traditional laser with a cavity and the random laser with random scattering. In this mechanism, the intensity-dependent refractive index effect and the thermal lensing effect of the pump beam induce a large refractive-index gradient in the gain material, which forms a passive equivalent boundary that provides the feedback in the lasing system. A real lasing system, a liquid disk laser, is demonstrated; it achieves 2-D omnidirectional radiation with a high efficiency of 28%, and its radiation spectral properties can be explained by resonant Raman scattering.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint