Source author record

Yijia Fan

Yijia Fan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computer Vision Artificial Intelligence Information Theory math.IT

Catalog footprint

What is connected

5 works

4 topics

4 close collaborators

preprint 2026 arXiv

3D-Agent: Tri-Modal Multi-Agent Collaboration for Scalable 3D Object Annotation

Driven by applications in autonomous driving, robotics, and augmented reality, 3D object annotation presents challenges beyond 2D annotation, including spatial complexity, occlusion, and viewpoint inconsistency. Existing approaches based on single models often struggle to address these issues effectively. We propose Tri-MARF, a novel framework that integrates tri-modal inputs, including 2D multi-view images, textual descriptions, and 3D point clouds, within a multi-agent collaborative architecture to enhance large-scale 3D annotation. Tri-MARF consists of three specialized agents: a vision-language model agent for generating multi-view descriptions, an information aggregation agent for selecting optimal descriptions, and a gating agent that aligns textual semantics with 3D geometry for refined captioning. Extensive experiments on Objaverse-LVIS, Objaverse-XL, and ABO demonstrate that Tri-MARF substantially outperforms existing methods, achieving a CLIPScore of 88.7 compared to prior state-of-the-art methods, retrieval accuracy of 45.2 and 43.8 on ViLT R@5, and a throughput of up to 12,000 objects per hour on a single NVIDIA A100 GPU.

preprint 2026 arXiv

Covering Human Action Space for Computer Use: Data Synthesis and Benchmark

Computer-use agents (CUAs) automate on-screen work, as illustrated by GPT-5.4 and Claude. Yet their reliability on complex, low-frequency interactions is still poor, limiting user trust. Our analysis of failure cases from advanced models suggests a long-tail pattern in GUI operations, where a relatively small fraction of complex and diverse interactions accounts for a disproportionate share of task failures. We hypothesize that this issue largely stems from the scarcity of data for complex interactions. To address this problem, we propose a new benchmark CUActSpot for evaluating models' capabilities on complex interactions across five modalities: GUI, text, table, canvas, and natural image, as well as a variety of actions (click, drag, draw, etc.), covering a broader range of interaction types than prior click-centric benchmarks that focus mainly on GUI widgets. We also design a renderer-based data-synthesis pipeline: scenes are automatically generated for each modality, screenshots and element coordinates are recorded, and an LLM produces matching instructions and action traces. After training on this corpus, our Phi-Ground-Any-4B outperforms open-source models with fewer than 32B parameters. We will release our benchmark, data, code, and models at https://github.com/microsoft/Phi-Ground.git

preprint 2026 arXiv

Decoupling Endpoint and Semantic Transition Learning for Zero-Shot Composed Image Retrieval

Zero-shot composed image retrieval (ZS-CIR) retrieves a target image from a reference image and a text modification without human-annotated CIR triplets. Projection-based ZS-CIR methods are attractive because they do not rely on LLMs at inference and remain lightweight, but they often underperform LLM-based approaches on complex semantic modifications. This gap reflects a semantic transition bottleneck in projection-based ZS-CIR: endpoint-level matching can let the edit text act as a target-side attribute cue rather than grounding it as a source-conditioned semantic transition. We further show that adding semantic transition supervision to the same text adapter creates an endpoint-transition conflict between endpoint alignment and semantic transition alignment. To address this conflict, DeCIR decouples endpoint and transition learning. It constructs paired forward/reverse edit tuples from image-caption pairs, trains separate low-rank text adapter branches for endpoint alignment and semantic transition alignment, and merges them with Low-Rank Directional Merge (LRDM) into one deployable adapter. Extensive experiments on CIRR, CIRCO, FashionIQ, and GeneCIS demonstrate that DeCIR consistently improves projection-based ZS-CIR without increasing inference complexity.

preprint 2008 arXiv

On the Performance of Selection Relaying

Interest in selection relaying is growing. The recent developments in this area have largely focused on information-theoretic analyses such as outage performance. Some of these analyses are accurate only in high-SNR regimes. In this paper, error rate analyses that are sufficiently accurate over a wide range of SNR regimes are provided. The motivation for this work is that practical systems operate at far lower SNR values than those supported by the high-SNR analysis. To enable designers to make informed decisions regarding network design and deployment, it is imperative that system performance is evaluated with a reasonable degree of accuracy over practical SNR regimes. Simulations corroborate the analytical results, with close agreement observed between the two.

preprint 2007 arXiv

Recovering Multiplexing Loss Through Successive Relaying Using Repetition Coding

In this paper, a transmission protocol is studied for a two-relay wireless network in which simple repetition coding is applied at the relays. Information-theoretic achievable rates for this transmission scheme are given, and a space-time V-BLAST signalling and detection method that can approach them is developed. It is shown through diversity-multiplexing tradeoff analysis that this transmission scheme can recover the multiplexing loss of the half-duplex relay network while retaining some diversity gain. This scheme is also compared with conventional transmission protocols that exploit only the diversity of the network at the cost of a multiplexing loss. It is shown that the new transmission protocol offers significant performance advantages over conventional protocols, especially when the interference between the two relays is sufficiently strong.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint