Source author record

Jiajun Liu

Jiajun Liu appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Catalog footprint

What is connected

18 works

22 topics

4 close collaborators

preprint · 2026 · arXiv

Chain-of-Action: Trajectory Autoregressive Modeling for Robotic Manipulation

We present Chain-of-Action (CoA), a novel visuo-motor policy paradigm built upon Trajectory Autoregressive Modeling. Unlike conventional approaches that predict next step action(s) forward, CoA generates an entire trajectory by explicit backward reasoning with task-specific goals through an action-level Chain-of-Thought (CoT) process. This process is unified within a single autoregressive structure: (1) the first token corresponds to a stable keyframe action that encodes the task-specific goals; and (2) subsequent action tokens are generated autoregressively, conditioned on the initial keyframe and previously predicted actions. This backward action reasoning enforces a global-to-local structure, allowing each local action to be tightly constrained by the final goal. To further realize the action reasoning structure, CoA incorporates four complementary designs: continuous action token representation; dynamic stopping for variable-length trajectory generation; reverse temporal ensemble; and multi-token prediction to balance action chunk modeling with global structure. As a result, CoA gives strong spatial generalization capabilities while preserving the flexibility and simplicity of a visuo-motor policy. Empirically, we observe CoA achieves the state-of-the-art performance across 60 RLBench tasks and 8 real-world manipulation tasks.

preprint · 2026 · arXiv

Knowledge Poisoning Attacks on Medical Multi-Modal Retrieval-Augmented Generation

Retrieval-augmented generation (RAG) is a widely adopted paradigm for enhancing LLMs in medical applications by incorporating expert multimodal knowledge during generation. However, the underlying retrieval databases may naturally contain, or be intentionally injected with, adversarial knowledge, which can perturb model outputs and undermine system reliability. To investigate this risk, prior studies have explored knowledge poisoning attacks in medical RAG systems. Nevertheless, most of them rely on the strong assumption that adversaries possess prior knowledge of user queries, which is unrealistic in deployments and substantially limits their practical applicability. In this paper, we propose M³Att, a knowledge-poisoning framework designed for medical multimodal RAG systems, assuming only limited distribution knowledge of the underlying database. Our core idea is to inject covert misinformation into textual data while using paired visual data as a query-agnostic trigger to promote retrieval. We first propose a unified framework that introduces imperceptible perturbations to visual inputs to manipulate retrieval probabilities. Besides, due to the prior medical knowledge in LLMs, naively poisoned medical content with explicit factual errors can be corrected during generation. Thus, we leverage the inherent ambiguity of medical diagnosis and design a covert misinformation injection strategy that degrades diagnostic accuracy while evading model self-correction. Experiments on five LLMs and datasets demonstrate that M³Att consistently produces clinically plausible yet incorrect generations. Codes: https://github.com/ypr17/M3Att.

preprint · 2026 · arXiv

ReCoVR: Closing the Loop in Interactive Composed Video Retrieval

Composed video retrieval (CoVR) searches for target videos using a reference video and a modification text, but existing methods are restricted to a single interaction round and cannot support the progressive nature of real-world visual search. To bridge this gap, we first formalize interactive composed video retrieval, a multi-turn extension of CoVR, where users progressively refine their search intent through natural-language feedback across turns. Adapting existing interactive retrieval methods to this setting reveals two structural weaknesses: reliance on a single retrieval channel and an open-loop retrieval design that consumes user feedback but does not diagnose whether its own retrieval trajectory is drifting or stagnating. To address these limitations, we propose ReCoVR (Reflexive Composed Video Retrieval), a dual-pathway architecture built on reflexive perception, where the system treats its retrieval history as diagnostic evidence alongside user feedback. Specifically, an Intent Pathway routes heterogeneous feedback to complementary retrieval channels, while a Reflection Pathway performs trajectory-level reflection to monitor result evolution and correct retrieval errors across turns. Experiments on multiple benchmarks show that ReCoVR consistently outperforms interactive baselines, notably achieving 74.30% R@1 after just one interactive round on the WebVid-CoVR-Test dataset.

preprint · 2026 · arXiv

Stable Preference Optimization: A Bilevel Approach to Catastrophic Preference Shift

Direct Preference Learning has emerged as a dominant offline paradigm for preference optimization. Most of these methods are based on the Bradley-Terry (BT) model for pairwise preference ranking, which directly aligns language models with human preferences. Prior work has observed a counter-intuitive phenomenon termed likelihood displacement, where the absolute probability of preferred responses decreases simultaneously during training. We demonstrate that such displacement can lead to a more devastating failure mode, which we define as Catastrophic Preference Shift, where the lost preference probability mass inadvertently shifts toward out-of-distribution (OOD) responses. Such a failure mode is a key limitation shared across BT-style direct preference learning methods, due to the fundamental conflict between the unconstrained discriminative alignment and generative foundational capabilities, ultimately leading to severe performance degradation (e.g., SimPO suffers a significant drop in reasoning accuracy from 73.5% to 37.5%). We analyze existing BT-style methods from the probability evolution perspective and theoretically prove that these methods exhibit over-reliance on model initialization and can lead to preference shift. To resolve these counter-intuitive behaviors, we propose a theoretically grounded Stable Preference Optimization (SPO) framework that constrains preference learning within a safe alignment region. Empirical evaluations demonstrate that SPO effectively stabilizes and enhances the performance of existing BT-style preference learning methods. SPO provides new insights into the design of preference learning objectives and opens up new avenues towards more reliable and interpretable language model alignment.

preprint · 2024 · arXiv

GTP-ViT: Efficient Vision Transformers via Graph-based Token Propagation

Vision Transformers (ViTs) have revolutionized the field of computer vision, yet their deployments on resource-constrained devices remain challenging due to high computational demands. To expedite pre-trained ViTs, token pruning and token merging approaches have been developed, which aim at reducing the number of tokens involved in the computation. However, these methods still have some limitations, such as image information loss from pruned tokens and inefficiency in the token-matching process. In this paper, we introduce a novel Graph-based Token Propagation (GTP) method to resolve the challenge of balancing model efficiency and information preservation for efficient ViTs. Inspired by graph summarization algorithms, GTP meticulously propagates less significant tokens' information to spatially and semantically connected tokens that are of greater importance. Consequently, the remaining few tokens serve as a summarization of the entire token graph, allowing the method to reduce computational complexity while preserving essential information of eliminated tokens. Combined with an innovative token selection strategy, GTP can efficiently identify image tokens to be propagated. Extensive experiments have validated GTP's effectiveness, demonstrating both efficiency and performance improvements. Specifically, GTP decreases the computational complexity of both DeiT-S and DeiT-B by up to 26% with only a minimal 0.3% accuracy drop on ImageNet-1K without finetuning, and remarkably surpasses the state-of-the-art token merging method on various backbones at an even faster inference speed. The source code is available at https://github.com/Ackesnal/GTP-ViT.

preprint · 2022 · arXiv

A Real-time Edge-AI System for Reef Surveys

Crown-of-Thorn Starfish (COTS) outbreaks are a major cause of coral loss on the Great Barrier Reef (GBR) and substantial surveillance and control programs are ongoing to manage COTS populations to ecologically sustainable levels. In this paper, we present a comprehensive real-time machine learning-based underwater data collection and curation system on edge devices for COTS monitoring. In particular, we leverage the power of deep learning-based object detection techniques, and propose a resource-efficient COTS detector that performs detection inferences on the edge device to assist marine experts with COTS identification during the data collection phase. The preliminary results show that several strategies for improving computational efficiency (e.g., batch-wise processing, frame skipping, model input size) can be combined to run the proposed detection model on edge hardware with low resource consumption and low information loss.

preprint · 2022 · arXiv

Instant Graph Neural Networks for Dynamic Graphs

Graph Neural Networks (GNNs) have been widely used for modeling graph-structured data. With the development of numerous GNN variants, recent years have witnessed groundbreaking results in improving the scalability of GNNs to work on static graphs with millions of nodes. However, how to instantly represent continuous changes of large-scale dynamic graphs with GNNs is still an open problem. Existing dynamic GNNs focus on modeling the periodic evolution of graphs, often on a snapshot basis. Such methods suffer from two drawbacks: first, there is a substantial delay for the changes in the graph to be reflected in the graph representations, resulting in losses on the model's accuracy; second, repeatedly calculating the representation matrix on the entire graph in each snapshot is predominantly time-consuming and severely limits the scalability. In this paper, we propose Instant Graph Neural Network (InstantGNN), an incremental computation approach for the graph representation matrix of dynamic graphs. Set to work with dynamic graphs with the edge-arrival model, our method avoids time-consuming, repetitive computations and allows instant updates on the representation and instant predictions. Graphs with dynamic structures and dynamic attributes are both supported. The upper bounds of time complexity of those updates are also provided. Furthermore, our method provides an adaptive training strategy, which guides the model to retrain at moments when it can make the greatest performance gains. We conduct extensive experiments on several real-world and synthetic datasets. Empirical results demonstrate that our model achieves state-of-the-art accuracy while having orders-of-magnitude higher efficiency than existing methods.

preprint · 2022 · arXiv

InvisibiliTee: Angle-agnostic Cloaking from Person-Tracking Systems with a Tee

After a survey for person-tracking system-induced privacy concerns, we propose a black-box adversarial attack method on state-of-the-art human detection models called InvisibiliTee. The method learns printable adversarial patterns for T-shirts that cloak wearers in the physical world in front of person-tracking systems. We design an angle-agnostic learning scheme which utilizes segmentation of the fashion dataset and a geometric warping process so the adversarial patterns generated are effective in fooling person detectors from all camera angles and for unseen black-box detection models. Empirical results in both digital and physical environments show that with the InvisibiliTee on, person-tracking systems' ability to detect the wearer drops significantly.

preprint · 2022 · arXiv

Spatiotemporal continuous estimates of daily 1-km PM2.5 from 2000 to present under the Tracking Air Pollution in China (TAP) framework

High spatial resolution PM2.5 data covering a long time period are urgently needed to support population exposure assessment and refined air quality management. In this study, we provided complete-coverage PM2.5 predictions with a 1-km spatial resolution from 2000 to the present under the Tracking Air Pollution in China (TAP, http://tapdata.org.cn/) framework. To support high spatial resolution modelling, we collected PM2.5 measurements from both national and local monitoring stations. To correctly reflect the temporal variations in land cover characteristics that affected the local variations in PM2.5, we constructed continuous annual geoinformation datasets, including the road maps and ensemble gridded population maps, in China from 2000 to 2021. We also examined various model structures and predictor combinations to balance the computational cost and model performance. The final model fused 10-km TAP PM2.5 predictions from our previous work, 1-km satellite aerosol optical depth retrievals and land use parameters with a random forest model. Our annual model had an out-of-bag R2 ranging between 0.80 and 0.84, and our hindcast model had a by-year cross-validation R2 of 0.76. This open-access 1-km resolution PM2.5 data product with complete coverage successfully revealed the local-scale spatial variations in PM2.5 and could benefit environmental studies and policy-making.

preprint · 2022 · arXiv

STAR-GNN: Spatial-Temporal Video Representation for Content-based Retrieval

We propose a video feature representation learning framework called STAR-GNN, which applies a pluggable graph neural network component on a multi-scale lattice feature graph. The essence of STAR-GNN is to exploit both the temporal dynamics and spatial contents as well as visual connections between regions at different scales in the frames. It models a video with a lattice feature graph in which the nodes represent regions of different granularity, with weighted edges that represent the spatial and temporal links. The contextual nodes are aggregated simultaneously by graph neural networks with parameters trained with retrieval triplet loss. In the experiments, we show that STAR-GNN effectively implements a dynamic attention mechanism on video frame sequences, resulting in the emphasis for dynamic and semantically rich content in the video, and is robust to noise and redundancies. Empirical results show that STAR-GNN achieves state-of-the-art performance for Content-Based Video Retrieval.

preprint · 2021 · arXiv

A novel policy for pre-trained Deep Reinforcement Learning for Speech Emotion Recognition

Reinforcement Learning (RL) is a semi-supervised learning paradigm in which an agent learns by interacting with an environment. Deep learning in combination with RL provides an efficient method for learning how to interact with the environment, called Deep Reinforcement Learning (deep RL). Deep RL has gained tremendous success in gaming, such as AlphaGo, but its potential has rarely been explored for challenging tasks like Speech Emotion Recognition (SER). Deep RL used for SER can potentially improve the performance of an automated call centre agent by dynamically learning emotion-aware responses to customer queries. While the policy employed by the RL agent plays a major role in action selection, there is currently no RL policy tailored for SER. In addition, an extended learning period is a general challenge for deep RL, which can impact the speed of learning for SER. Therefore, in this paper, we introduce a novel policy, the "Zeta policy", which is tailored for SER, and apply pre-training in deep RL to achieve a faster learning rate. Pre-training with a cross dataset was also studied to discover the feasibility of pre-training the RL agent with a similar dataset in a scenario where no real environmental data is available. The IEMOCAP and SAVEE datasets were used for the evaluation, with the problem being to recognize four emotions (happy, sad, angry and neutral) in the utterances provided. Experimental results show that the proposed "Zeta policy" performs better than existing policies. The results also support that pre-training can reduce the training time by reducing the warm-up period and is robust to the cross-corpus scenario.

preprint · 2020 · arXiv

A Three-limb Teleoperated Robotic System with Foot Control for Flexible Endoscopic Surgery

Flexible endoscopy requires a high level of skill to manipulate both the endoscope and associated instruments. In most robotic flexible endoscopic systems, the endoscope and instruments are controlled separately by two operators, which may result in communication errors and inefficient operation. We present a novel teleoperated robotic endoscopic system that can be commanded by a surgeon alone. This 13 degrees-of-freedom (DoF) system integrates a foot-controlled robotic flexible endoscope and two hand-controlled robotic endoscopic instruments (a robotic grasper and a robotic cauterizing hook). A foot-controlled human-machine interface maps the natural foot gestures to the 4-DoF movements of the endoscope, and two hand-controlled interfaces map the movements of the two hands to the two instruments individually. The proposed robotic system was validated in an ex-vivo experiment carried out by six subjects, where foot control was also compared with a sequential clutch-based hand control scheme. The participants could successfully teleoperate the endoscope and the two instruments to cut the tissues at scattered target areas in a porcine stomach. Foot control yielded 43.7% faster task completion and required less mental effort as compared to the clutch-based hand control scheme. The system introduced in this paper is intuitive for three-limb manipulation even for operators without experience of handling the endoscope and robotic instruments. This three-limb teleoperated robotic system enables one surgeon to intuitively control three endoscopic tools which normally require two operators, leading to reduced manpower, fewer communication errors, and improved efficiency.

preprint · 2016 · arXiv

A Novel Framework for Online Amnesic Trajectory Compression in Resource-constrained Environments

State-of-the-art trajectory compression methods usually involve high space-time complexity or yield unsatisfactory compression rates, leading to rapid exhaustion of memory, computation, storage and energy resources. Their ability is commonly limited when operating in a resource-constrained environment especially when the data volume (even when compressed) far exceeds the storage limit. Hence we propose a novel online framework for error-bounded trajectory compression and ageing called the Amnesic Bounded Quadrant System (ABQS), whose core is the Bounded Quadrant System (BQS) algorithm family that includes a normal version (BQS), Fast version (FBQS), and a Progressive version (PBQS). ABQS intelligently manages a given storage and compresses the trajectories with different error tolerances subject to their ages. In the experiments, we conduct comprehensive evaluations for the BQS algorithm family and the ABQS framework. Using empirical GPS traces from flying foxes and cars, and synthetic data from simulation, we demonstrate the effectiveness of the standalone BQS algorithms in significantly reducing the time and space complexity of trajectory compression, while greatly improving the compression rates of the state-of-the-art algorithms (up to 45%). We also show that the operational time of the target resource-constrained hardware platform can be prolonged by up to 41%. We then verify that with ABQS, given data volumes that are far greater than storage space, ABQS is able to achieve 15 to 400 times smaller errors than the baselines. We also show that the algorithm is robust to extreme trajectory shapes.

preprint · 2015 · arXiv

Temporal Embedding in Convolutional Neural Networks for Robust Learning of Abstract Snippets

The prediction of periodical time-series remains challenging due to various types of data distortions and misalignments. Here, we propose a novel model called Temporal embedding-enhanced convolutional neural Network (TeNet) to learn repeatedly-occurring-yet-hidden structural elements in periodical time-series, called abstract snippets, for predicting future changes. Our model uses convolutional neural networks and embeds a time-series with its potential neighbors in the temporal domain for aligning it to the dominant patterns in the dataset. The model is robust to distortions and misalignments in the temporal domain and demonstrates strong prediction power for periodical time-series. We conduct extensive experiments and discover that the proposed model shows significant and consistent advantages over existing methods on a variety of data modalities ranging from human mobility to household power consumption records. Empirical results indicate that the model is robust to various factors such as number of samples, variance of data, numerical ranges of data etc. The experiments also verify that the intuition behind the model can be generalized to multiple data types and applications and promises significant improvement in prediction performances across the datasets studied.

preprint · 2015 · arXiv

Understanding Human Mobility from Twitter

Understanding human mobility is crucial for a broad range of applications from disease prediction to communication networks. Most efforts on studying human mobility have so far used private and low-resolution data, such as call data records. Here, we propose Twitter as a proxy for human mobility, as it relies on publicly available data and provides high-resolution positioning when users opt to geotag their tweets with their current location. We analyse a Twitter dataset with more than six million geotagged tweets posted in Australia, and we demonstrate that Twitter can be a reliable source for studying human mobility patterns. Our analysis shows that geotagged tweets can capture rich features of human mobility, such as the diversity of movement orbits among individuals and of movements within and between cities. We also find that short and long-distance movers both spend most of their time in large metropolitan areas, in contrast with intermediate-distance movers' movements, reflecting the impact of different modes of travel. Our study provides solid evidence that Twitter can indeed be a useful proxy for tracking and predicting human movement.

preprint · 2014 · arXiv

Bounded Quadrant System: Error-bounded Trajectory Compression on the Go

Long-term location tracking, where trajectory compression is commonly used, has gained high interest for many applications in transport, ecology, and wearable computing. However, state-of-the-art compression methods involve high space-time complexity or achieve unsatisfactory compression rates, leading to rapid exhaustion of memory, computation, storage and energy resources. We propose a novel online algorithm for error-bounded trajectory compression called the Bounded Quadrant System (BQS), which compresses trajectories with extremely small costs in space and time using convex-hulls. In this algorithm, we build a virtual coordinate system centered at a start point, and establish a rectangular bounding box as well as two bounding lines in each of its quadrants. In each quadrant, the points to be assessed are bounded by the convex-hull formed by the box and lines. Various compression error-bounds are therefore derived to quickly draw compression decisions without expensive error computations. In addition, we propose a light version of BQS that achieves $\mathcal{O}(1)$ complexity in both time and space for processing each point to suit the most constrained computation environments. Furthermore, we briefly demonstrate how this algorithm can be naturally extended to the 3-D case. Using empirical GPS traces from flying foxes, cars and simulation, we demonstrate the effectiveness of our algorithm in significantly reducing the time and space complexity of trajectory compression, while greatly improving the compression rates of the state-of-the-art algorithms (up to 47%). We then show that with this algorithm, the operational time of the target resource-constrained hardware platform can be prolonged by up to 41%.
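
To make "error-bounded online compression" concrete, the sketch below shows a naive opening-window compressor. It is not the BQS algorithm: it performs exactly the per-point deviation checks that the abstract says BQS's bounding box and bounding lines are meant to avoid. The function names, the 2-D tuple representation and the `tolerance` parameter are assumptions made for illustration.

```python
import math

def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (2-D tuples)."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its ends.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def opening_window_compress(points, tolerance):
    """Keep a subset of points so that every dropped point lies within
    `tolerance` of the kept segment that replaces it (naive baseline)."""
    if len(points) <= 2:
        return list(points)
    kept = [points[0]]
    start = points[0]
    buffer = []  # points seen since `start`, not yet committed
    for p in points[1:]:
        # Expensive step: re-check every buffered point against start -> p.
        if any(point_segment_distance(q, start, p) > tolerance for q in buffer):
            start = buffer[-1]  # last point that still satisfied the bound
            kept.append(start)
            buffer = []
        buffer.append(p)
    kept.append(points[-1])
    return kept
```

For example, `opening_window_compress([(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)], 0.1)` keeps the two endpoints and the corner at `(2, 0)`. Each new point costs time proportional to the buffer length, which is the overhead a constant-time bound check would remove.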

preprint · 2014 · arXiv

Multi-scale Population and Mobility Estimation with Geo-tagged Tweets

Recent outbreaks of Ebola and Dengue viruses have again elevated the significance of the capability to quickly predict disease spread in an emergent situation. However, existing approaches usually rely heavily on the time-consuming census processes, or the privacy-sensitive call logs, leading to their unresponsive nature when facing the abruptly changing dynamics in the event of an outbreak. In this paper we study the feasibility of using large-scale Twitter data as a proxy of human mobility to model and predict disease spread. We report that for Australia, Twitter users' distribution correlates well with the census-based population distribution, and that the Twitter users' travel patterns appear to loosely follow the gravity law at multiple scales of geographic distances, i.e. national level, state level and metropolitan level. The radiation model is also evaluated on this dataset though it has shown inferior fitness as a result of Australia's sparse population and large landmass. The outcomes of the study form the cornerstones for future work towards a model-based, responsive prediction method from Twitter data for disease spread.
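
The abstract does not spell out the gravity law, so the sketch below uses one commonly written form, in which the flow between two places grows with the product of their populations and decays as a power of the distance between them. The function name, the constant `k` and the exponent `gamma` are illustrative assumptions, not values fitted in the paper.

```python
def gravity_flow(pop_origin, pop_dest, distance_km, gamma=2.0, k=1.0):
    """Gravity-model estimate of the flow between two places.

    Assumed form: k * pop_origin * pop_dest / distance_km ** gamma.
    `gamma` and `k` are placeholders that would normally be fitted to data.
    """
    if distance_km <= 0:
        raise ValueError("distance_km must be positive")
    return k * pop_origin * pop_dest / distance_km ** gamma
```

With `gamma=2.0`, doubling the distance between two places cuts the predicted flow to a quarter, which is the kind of distance decay that a fit across several geographic scales would test.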

preprint · 2014 · arXiv

Optimal Lévy-flight foraging in a finite landscape

We present a simple model to study Lévy-flight foraging in a finite landscape with countable targets. In our approach, foraging is a step-based exploratory random search process with a power-law step-size distribution $P(l) \propto l^{-μ}$. We find that, when the termination is regulated by a finite number of steps $N$, the optimum value of $μ$ that maximises the foraging efficiency can vary substantially in the interval $μ\in (1,3)$, depending on the landscape features (landscape size and number of targets). We further demonstrate that subjective returning can be another significant factor that affects the foraging efficiency in such context. Our results suggest that Lévy-flight foraging may arise through an interaction between the environmental context and the termination of exploitation, and particularly that the number of steps can play an important role in this scenario which is overlooked by most previous work. Our study not only provides a new perspective on Lévy-flight foraging, but also opens new avenues for investigating the interaction between foraging dynamics and environment as well as offers a realistic framework for analysing animal movement patterns from empirical data.
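
The power-law step-size distribution $P(l) \propto l^{-μ}$ can be sampled by inverse-transform sampling once a minimum step length is fixed. The sketch below assumes a lower cutoff `l_min` (not mentioned in the abstract) so that the distribution is normalisable for $μ > 1$; the function name and parameters are illustrative, not taken from the paper.

```python
import random

def sample_levy_step(mu, l_min=1.0, rng=random):
    """Draw one step length l >= l_min with density proportional to l ** -mu.

    Inverse-transform sampling: for mu > 1 the CDF is
    F(l) = 1 - (l / l_min) ** (1 - mu), so l = l_min * (1 - u) ** (-1 / (mu - 1)).
    """
    if mu <= 1.0:
        raise ValueError("mu must be greater than 1 for a normalisable density")
    u = rng.random()  # uniform in [0, 1), so 1 - u is in (0, 1]
    return l_min * (1.0 - u) ** (-1.0 / (mu - 1.0))
```

Under these assumptions the median step length is `l_min * 2 ** (1 / (mu - 1))`, so smaller values of $μ$ produce heavier tails and more frequent long relocations.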
