Source author record

Jiarui Cai

Jiarui Cai appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Machine Learning quant-ph

Catalog footprint

What is connected

5works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

We introduce Talk2Move, a reinforcement learning (RL) based diffusion framework for text-instructed spatial transformation of objects within scenes. Spatially manipulating objects in a scene through natural language poses a challenge for multimodal generation systems. While existing text-based manipulation methods can adjust appearance or style, they struggle to perform object-level geometric transformations-such as translating, rotating, or resizing objects-due to scarce paired supervision and pixel-level optimization limits. Talk2Move employs Group Relative Policy Optimization (GRPO) to explore geometric actions through diverse rollouts generated from input images and lightweight textual variations, removing the need for costly paired data. A spatial reward guided model aligns geometric transformations with linguistic description, while off-policy step evaluation and active step sampling improve learning efficiency by focusing on informative transformation stages. Furthermore, we design object-centric spatial rewards that evaluate displacement, rotation, and scaling behaviors directly, enabling interpretable and coherent transformations. Experiments on curated benchmarks demonstrate that Talk2Move achieves precise, consistent, and semantically faithful object transformations, outperforming existing text-guided editing approaches in both spatial accuracy and scene coherence.

preprint2022arXiv

MeMOT: Multi-Object Tracking with Memory

We propose an online tracking algorithm that performs the object detection and data association under a common framework, capable of linking objects after a long time span. This is realized by preserving a large spatio-temporal memory to store the identity embeddings of the tracked objects, and by adaptively referencing and aggregating useful information from the memory as needed. Our model, called MeMOT, consists of three main modules that are all Transformer-based: 1) Hypothesis Generation that produce object proposals in the current video frame; 2) Memory Encoding that extracts the core information from the memory for each tracked object; and 3) Memory Decoding that solves the object detection and data association tasks simultaneously for multi-object tracking. When evaluated on widely adopted MOT benchmark datasets, MeMOT observes very competitive performance.

preprint2022arXiv

NTIRE 2021 Multi-modal Aerial View Object Classification Challenge

In this paper, we introduce the first Challenge on Multi-modal Aerial View Object Classification (MAVOC) in conjunction with the NTIRE 2021 workshop at CVPR. This challenge is composed of two different tracks using EO andSAR imagery. Both EO and SAR sensors possess different advantages and drawbacks. The purpose of this competition is to analyze how to use both sets of sensory information in complementary ways. We discuss the top methods submitted for this competition and evaluate their results on our blind test set. Our challenge results show significant improvement of more than 15% accuracy from our current baselines for each track of the competition

preprint2020arXiv

IA-MOT: Instance-Aware Multi-Object Tracking with Motion Consistency

Multiple object tracking (MOT) is a crucial task in computer vision society. However, most tracking-by-detection MOT methods, with available detected bounding boxes, cannot effectively handle static, slow-moving and fast-moving camera scenarios simultaneously due to ego-motion and frequent occlusion. In this work, we propose a novel tracking framework, called "instance-aware MOT" (IA-MOT), that can track multiple objects in either static or moving cameras by jointly considering the instance-level features and object motions. First, robust appearance features are extracted from a variant of Mask R-CNN detector with an additional embedding head, by sending the given detections as the region proposals. Meanwhile, the spatial attention, which focuses on the foreground within the bounding boxes, is generated from the given instance masks and applied to the extracted embedding features. In the tracking stage, object instance masks are aligned by feature similarity and motion consistency using the Hungarian association algorithm. Moreover, object re-identification (ReID) is incorporated to recover ID switches caused by long-term occlusion or missing detection. Overall, when evaluated on the MOTS20 and KITTI-MOTS dataset, our proposed method won the first place in Track 3 of the BMTT Challenge in CVPR2020 workshops.

preprint2020arXiv

Quantum Key Distribution with High Order Fibonacci-like Orbital Angular Momentum States

The coding space in quantum communication could be expanded to high-dimensional space by using orbital angular momentum (OAM) states of photons, as both the capacity of the channel and security are enhanced. Here we present a novel approach to realize high-capacity quantum key distribution (QKD) by exploiting OAM states. The innovation of the proposed approach relies on a unique type of entangledphoton source which produces entangled photons with OAM randomly distributed among high order Fiboncci-like numbers and a new physical mechanism for efficiently sharing keys. This combination of entanglement with mathematical properties of high order Fibonacci sequences provides the QKD protocol which is immune to photon-number-splitting attacks and allows secure generation of long keys from few photons. Unlike other protocols, reference frame alignment and active modulation of production and detection bases are unnecessary.

Jiarui Cai

What is connected

Connect this record

See the researcher in context

Building this map preview

5 published item(s)

Talk2Move: Reinforcement Learning for Text-Instructed Object-Level Geometric Transformation in Scenes

MeMOT: Multi-Object Tracking with Memory

NTIRE 2021 Multi-modal Aerial View Object Classification Challenge

IA-MOT: Instance-Aware Multi-Object Tracking with Motion Consistency

Quantum Key Distribution with High Order Fibonacci-like Orbital Angular Momentum States