Researcher profile

Xingyu Fu

Xingyu Fu contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 17 - UnverifiedVerification L1Unclaimed author
4works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

4 published item(s)

preprint2026arXiv

When Vision Speaks for Sound

Despite rapid progress in video-capable MLLMs, we find that their apparent audio understanding in videos is often vision-driven: models rely on visual cues to infer or hallucinate acoustic information, rather than verifying the audio stream. This issue appears across both state-of-the-art open-source omni models and leading closed-source models from providers such as Google and OpenAI. We characterize this failure mode as an audio-visual Clever Hans effect, in which models appear (falsely) audio-grounded, but actually exploit visual-acoustic correlations without verifying whether the audio and visual streams are truly aligned. To systematically study this behavior, we introduce Thud, an intervention-driven probing framework based on three counterfactual audio edits: Shift, which tests temporal synchronization; Mute, which tests sound existence; and Swap, which tests audio-visual consistency. Beyond diagnosis, we further study a two-stage alignment recipe: intervention-derived preference pairs teach audio verification, while event-level general video preferences regularize the model against over-specialization. Our best 10K-sample recipe improves average performance across the three intervention dimensions by 28 percentage points, while slightly improving performance on general video and audio-visual QA benchmarks.

preprint2022arXiv

There is a Time and Place for Reasoning Beyond the Image

Images are often more significant than only the pixels to human eyes, as we can infer, associate, and reason with contextual information from other sources to establish a more complete picture. For example, in Figure 1, we can find a way to identify the news articles related to the picture through segment-wise understandings of the signs, the buildings, the crowds, and more. This reasoning could provide the time and place the image was taken, which will help us in subsequent tasks, such as automatic storyline construction, correction of image source in intended effect photographs, and upper-stream processing such as image clustering for certain location or time. In this work, we formulate this problem and introduce TARA: a dataset with 16k images with their associated news, time, and location, automatically extracted from New York Times, and an additional 61k examples as distant supervision from WIT. On top of the extractions, we present a crowdsourced subset in which we believe it is possible to find the images' spatio-temporal information for evaluation purpose. We show that there exists a $70\%$ gap between a state-of-the-art joint model and human performance, which is slightly filled by our proposed model that uses segment-wise reasoning, motivating higher-level vision-language joint models that can conduct open-ended reasoning with world knowledge. The data and code are publicly available at https://github.com/zeyofu/TARA.

preprint2020arXiv

ASTRAL: Adversarial Trained LSTM-CNN for Named Entity Recognition

Named Entity Recognition (NER) is a challenging task that extracts named entities from unstructured text data, including news, articles, social comments, etc. The NER system has been studied for decades. Recently, the development of Deep Neural Networks and the progress of pre-trained word embedding have become a driving force for NER. Under such circumstances, how to make full use of the information extracted by word embedding requires more in-depth research. In this paper, we propose an Adversarial Trained LSTM-CNN (ASTRAL) system to improve the current NER method from both the model structure and the training process. In order to make use of the spatial information between adjacent words, Gated-CNN is introduced to fuse the information of adjacent words. Besides, a specific Adversarial training method is proposed to deal with the overfitting problem in NER. We add perturbation to variables in the network during the training process, making the variables more diverse, improving the generalization and robustness of the model. Our model is evaluated on three benchmarks, CoNLL-03, OntoNotes 5.0, and WNUT-17, achieving state-of-the-art results. Ablation study and case study also show that our system can converge faster and is less prone to overfitting.

preprint2020arXiv

SRQA: Synthetic Reader for Factoid Question Answering

The question answering system can answer questions from various fields and forms with deep neural networks, but it still lacks effective ways when facing multiple evidences. We introduce a new model called SRQA, which means Synthetic Reader for Factoid Question Answering. This model enhances the question answering system in the multi-document scenario from three aspects: model structure, optimization goal, and training method, corresponding to Multilayer Attention (MA), Cross Evidence (CE), and Adversarial Training (AT) respectively. First, we propose a multilayer attention network to obtain a better representation of the evidences. The multilayer attention mechanism conducts interaction between the question and the passage within each layer, making the token representation of evidences in each layer takes the requirement of the question into account. Second, we design a cross evidence strategy to choose the answer span within more evidences. We improve the optimization goal, considering all the answers' locations in multiple evidences as training targets, which leads the model to reason among multiple evidences. Third, adversarial training is employed to high-level variables besides the word embedding in our model. A new normalization method is also proposed for adversarial perturbations so that we can jointly add perturbations to several target variables. As an effective regularization method, adversarial training enhances the model's ability to process noisy data. Combining these three strategies, we enhance the contextual representation and locating ability of our model, which could synthetically extract the answer span from several evidences. We perform SRQA on the WebQA dataset, and experiments show that our model outperforms the state-of-the-art models (the best fuzzy score of our model is up to 78.56%, with an improvement of about 2%).