Researcher profile

Xue Jiang

Xue Jiang contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
17works
0followers
13topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

17 published item(s)

preprint2026arXiv

AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognization and Retrieval

Despite notable advancements in remote sensing vision-language models (VLMs), existing models often struggle with spatial understanding, limiting their effectiveness in real-world applications. To push the boundaries of VLMs in remote sensing, we specifically address vehicle imagery captured by drones and introduce a spatially-aware dataset AirSpatial, which comprises over 206K instructions and introduces two novel tasks: Spatial Grounding and Spatial Question Answering. It is also the first remote sensing grounding dataset to provide 3DBB. To effectively leverage existing image understanding of VLMs to spatial domains, we adopt a two-stage training strategy comprising Image Understanding Pre-training and Spatial Understanding Fine-tuning. Utilizing this trained spatially-aware VLM, we develop an aerial agent, AirSpatialBot, which is capable of fine-grained vehicle attribute recognition and retrieval. By dynamically integrating task planning, image understanding, spatial understanding, and task execution capabilities, AirSpatialBot adapts to diverse query requirements. Experimental results validate the effectiveness of our approach, revealing the spatial limitations of existing VLMs while providing valuable insights. The model, code, and datasets will be released at https://github.com/VisionXLab/AirSpatialBot

preprint2026arXiv

DVGBench: Implicit-to-Explicit Visual Grounding Benchmark in UAV Imagery with Large Vision-Language Models

Remote sensing (RS) large vision-language models (LVLMs) have shown strong promise across visual grounding (VG) tasks. However, existing RS VG datasets predominantly rely on explicit referring expressions-such as relative position, relative size, and color cues-thereby constraining performance on implicit VG tasks that require scenario-specific domain knowledge. This article introduces DVGBench, a high-quality implicit VG benchmark for drones, covering six major application scenarios: traffic, disaster, security, sport, social activity, and productive activity. Each object provides both explicit and implicit queries. Based on the dataset, we design DroneVG-R1, an LVLM that integrates the novel Implicit-to-Explicit Chain-of-Thought (I2E-CoT) within a reinforcement learning paradigm. This enables the model to take advantage of scene-specific expertise, converting implicit references into explicit ones and thus reducing grounding difficulty. Finally, an evaluation of mainstream models on both explicit and implicit VG tasks reveals substantial limitations in their reasoning capabilities. These findings provide actionable insights for advancing the reasoning capacity of LVLMs for drone-based agents. The code and datasets will be released at https://github.com/zytx121/DVGBench

preprint2026arXiv

From I/O to Code with Discovery Agent

The automatic synthesis of a program from any form of specification is regarded as a holy grail of computer science. Fueled by LLMs, NL2Code has achieved tremendous success, yet the fundamentally more challenging task of synthesizing programs from input-output behavior, which we refer to as IO2Code, remains largely unsolved. Whereas NL2Code can exploit the semantic alignment between natural language and code acquired during pretraining, IO2Code requires recovering underlying principles from concrete computational behavior, navigating a vast and underspecified hypothesis space. To address this, we propose DIO-Agent, a discovery agent for IO2Code. Our method frames IO2Code as an evolutionary search over discrete program space, in which an LLM serves as the mutation operator and concrete error signals from execution guide each mutation. To prevent the search from wandering into structurally complex yet incorrect dead ends, we introduce the Transformation Priority Premise as a mutation prior that biases the LLM toward the simplest hypothesis consistent with current evidence, progressively escalating from constants to conditionals to iteration only when simpler constructs are insufficient. To facilitate systematic study, we further construct an IO2CodeBench spanning multiple difficulty levels. Extensive experiments show that DIO-Agent consistently outperforms both traditional program-by-example method and SOTA evolution-agent baselines across all difficulty levels and various LLMs, while substantially surpassing test-time scaling strategies with equivalent sampling budgets.

preprint2026arXiv

TACT: Mitigating Overthinking and Overacting in Coding Agents via Activation Steering

When language model agents tackle complex software engineering tasks, they often degrade over long trajectories, which we define as *agent drift*. We focus on two recurring failure modes *overthinking* and *overacting*, i.e., where the agent repeatedly reasons over information it already has, and where it issues tool calls without integrating recent observations or acquiring new evidence. In this paper, we introduce TACT (Think-Act Calibration via activation Steering), to detect and mitigate agent drift in the residual stream before it surfaces as a behavioral failure. In specific, we label trajectory steps as overthinking, overacting, or calibrated, and find that their hidden states can separate linearly along two *drift axes*, pointing from calibrated behavior toward each failure mode (AUC $\approx$ 0.9). To mitigate agent drift, we project each step's activation onto these axes at test time and pull drifted ones back toward the calibrated region. Experiments show that TACT outperforms unsteered baselines across SWE-bench Verified, Terminal-Bench 2.0, and CLAW-Eval, lifting average resolve rate by $+5.8$ pp on Qwen3.5-27B and $+4.8$ pp on Gemma-4-26B-A4B-it while cutting steps-to-resolve by up to $26\%$. These gains frame agent drift as a steerable direction in the residual stream, and position TACT as a viable handle for reliable long-horizon agents.

preprint2026arXiv

Transferability of Adversarial Attacks in Video-based MLLMs: A Cross-modal Image-to-Video Approach

Video-based multimodal large language models (V-MLLMs) have shown vulnerability to adversarial examples in video-text multimodal tasks. However, the transferability of adversarial videos to unseen models - a common and practical real-world scenario - remains unexplored. In this paper, we pioneer an investigation into the transferability of adversarial video samples across V-MLLMs. We find that existing adversarial attack methods face significant limitations when applied in black-box settings for V-MLLMs, which we attribute to the following shortcomings: (1) lacking generalization in perturbing video features, (2) focusing only on sparse key-frames, and (3) failing to integrate multimodal information. To address these limitations and deepen the understanding of V-MLLM vulnerabilities in black-box scenarios, we introduce the Image-to-Video MLLM (I2V-MLLM) attack. In I2V-MLLM, we utilize an image-based multimodal large language model (I-MLLM) as a surrogate model to craft adversarial video samples. Multimodal interactions and spatiotemporal information are integrated to disrupt video representations within the latent space, improving adversarial transferability. Additionally, a perturbation propagation technique is introduced to handle different unknown frame sampling strategies. Experimental results demonstrate that our method can generate adversarial examples that exhibit strong transferability across different V-MLLMs on multiple video-text multimodal tasks. Compared to white-box attacks on these models, our black-box attacks (using BLIP-2 as a surrogate model) achieve competitive performance, with average attack success rate (AASR) of 57.98% on MSVD-QA and 58.26% on MSRVTT-QA for Zero-Shot VideoQA tasks, respectively.

preprint2023arXiv

CodePAD: Sequence-based Code Generation with Pushdown Automaton

In the process of code generation, it is essential to guarantee the generated code satisfies grammar constraints of programming language (PL). However, neglecting grammar constraints is a fatal drawback of commonly used sequence-based code generation. In this paper, we devise a pushdown automaton (PDA)-based methodology to address this problem, exploiting the principle that PL is a subset of PDA recognizable language and code accepted by PDA is grammatical. Specifically, we construct a PDA module and design an algorithm to constrain the generation of sequence-based models to ensure grammatical correctness. Guided by this methodology, we further propose CodePAD, a sequence-based code generation framework equipped with a PDA module, to integrate the deduction of PDA into deep learning. Additionally, this framework can leverage states of PDA deduction (including state representation, state prediction task, and joint prediction with state) to assist models in learning PDA deduction. To comprehensively evaluate CodePAD, we construct a PDA for Python and conduct extensive experiments on four public benchmark datasets. CodePAD can leverage existing sequence-based models, and we show that it can achieve 100\% grammatical correctness percentage on these benchmark datasets. Thus, it relatively improve 17\% CodeBLEU on CONALA, 8\% EM on DJANGO, and 15\% CodeBLEU on JUICE-10K compared to base models. In addition, our method significantly enhances pre-trained models, e.g., CodeBLEU of CodeGen-350M improvement from 3.21 to 21.54 on MBPP in zero-shot setting.

preprint2022arXiv

An Adaptive Finite Element DtN Method for Maxwell's Equations

This paper is concerned with a numerical solution to the scattering of a time-harmonic electromagnetic wave by a bounded and impenetrable obstacle in three dimensions. The electromagnetic wave propagation is modeled by a boundary value problem of Maxwell's equations in the exterior domain of the obstacle. Based on the Dirichlet-to-Neumann (DtN) operator, which is defined by an infinite series, an exact transparent boundary condition is introduced and the scattering problem is reduced equivalently into a bounded domain. An a posteriori error estimate based adaptive finite element DtN method is developed to solve the discrete variational problem, where the DtN operator is truncated into a sum of finitely many terms. The a posteriori error estimate takes into account both the finite element approximation error and the truncation error of the DtN operator. The latter is shown to decay exponentially with respect to the truncation parameter. Numerical experiments are presented to illustrate the effectiveness of the proposed method.

preprint2022arXiv

Cross-Image Relational Knowledge Distillation for Semantic Segmentation

Current Knowledge Distillation (KD) methods for semantic segmentation often guide the student to mimic the teacher's structured information generated from individual data samples. However, they ignore the global semantic relations among pixels across various images that are valuable for KD. This paper proposes a novel Cross-Image Relational KD (CIRKD), which focuses on transferring structured pixel-to-pixel and pixel-to-region relations among the whole images. The motivation is that a good teacher network could construct a well-structured feature space in terms of global pixel dependencies. CIRKD makes the student mimic better structured semantic relations from the teacher, thus improving the segmentation performance. Experimental results over Cityscapes, CamVid and Pascal VOC datasets demonstrate the effectiveness of our proposed approach against state-of-the-art distillation methods. The code is available at https://github.com/winycg/CIRKD.

preprint2022arXiv

Cross-Scale Vector Quantization for Scalable Neural Speech Coding

Bitrate scalability is a desirable feature for audio coding in real-time communications. Existing neural audio codecs usually enforce a specific bitrate during training, so different models need to be trained for each target bitrate, which increases the memory footprint at the sender and the receiver side and transcoding is often needed to support multiple receivers. In this paper, we introduce a cross-scale scalable vector quantization scheme (CSVQ), in which multi-scale features are encoded progressively with stepwise feature fusion and refinement. In this way, a coarse-level signal is reconstructed if only a portion of the bitstream is received, and progressively improves the quality as more bits are available. The proposed CSVQ scheme can be flexibly applied to any neural audio coding network with a mirrored auto-encoder structure to achieve bitrate scalability. Subjective results show that the proposed scheme outperforms the classical residual VQ (RVQ) with scalability. Moreover, the proposed CSVQ at 3 kbps outperforms Opus at 9 kbps and Lyra at 3kbps and it could provide a graceful quality boost with bitrate increase.

preprint2022arXiv

End-to-End Neural Speech Coding for Real-Time Communications

Deep-learning based methods have shown their advantages in audio coding over traditional ones but limited attention has been paid on real-time communications (RTC). This paper proposes the TFNet, an end-to-end neural speech codec with low latency for RTC. It takes an encoder-temporal filtering-decoder paradigm that has seldom been investigated in audio coding. An interleaved structure is proposed for temporal filtering to capture both short-term and long-term temporal dependencies. Furthermore, with end-to-end optimization, the TFNet is jointly optimized with speech enhancement and packet loss concealment, yielding a one-for-all network for three tasks. Both subjective and objective results demonstrate the efficiency of the proposed TFNet.

preprint2022arXiv

HELoC: Hierarchical Contrastive Learning of Source Code Representation

Abstract syntax trees (ASTs) play a crucial role in source code representation. However, due to the large number of nodes in an AST and the typically deep AST hierarchy, it is challenging to learn the hierarchical structure of an AST effectively. In this paper, we propose HELoC, a hierarchical contrastive learning model for source code representation. To effectively learn the AST hierarchy, we use contrastive learning to allow the network to predict the AST node level and learn the hierarchical relationships between nodes in a self-supervised manner, which makes the representation vectors of nodes with greater differences in AST levels farther apart in the embedding space. By using such vectors, the structural similarities between code snippets can be measured more precisely. In the learning process, a novel GNN (called Residual Self-attention Graph Neural Network, RSGNN) is designed, which enables HELoC to focus on embedding the local structure of an AST while capturing its overall structure. HELoC is self-supervised and can be applied to many source code related downstream tasks such as code classification, code clone detection, and code clustering after pre-training. Our extensive experiments demonstrate that HELoC outperforms the state-of-the-art source code representation models.

preprint2022arXiv

MMRotate: A Rotated Object Detection Benchmark using PyTorch

We present an open-source toolbox, named MMRotate, which provides a coherent algorithm framework of training, inferring, and evaluation for the popular rotated object detection algorithm based on deep learning. MMRotate implements 18 state-of-the-art algorithms and supports the three most frequently used angle definition methods. To facilitate future research and industrial applications of rotated object detection-related problems, we also provide a large number of trained models and detailed benchmarks to give insights into the performance of rotated object detection. MMRotate is publicly released at https://github.com/open-mmlab/mmrotate.

preprint2022arXiv

PyTSK: A Python Toolbox for TSK Fuzzy Systems

This paper presents PyTSK, a Python toolbox for developing Takagi-Sugeno-Kang (TSK) fuzzy systems. Based on scikit-learn and PyTorch, PyTSK allows users to optimize TSK fuzzy systems using fuzzy clustering or mini-batch gradient descent (MBGD) based algorithms. Several state-of-the-art MBGD-based optimization algorithms are implemented in the toolbox, which can improve the generalization performance of TSK fuzzy systems, especially for big data applications. PyTSK can also be easily extended and customized for more complicated algorithms, such as modifying the structure of TSK fuzzy systems, developing more sophisticated training algorithms, and combining TSK fuzzy systems with neural networks. The code of PyTSK can be found at https://github.com/YuqiCui/pytsk.

preprint2022arXiv

Towards Error-Resilient Neural Speech Coding

Neural audio coding has shown very promising results recently in the literature to largely outperform traditional codecs but limited attention has been paid on its error resilience. Neural codecs trained considering only source coding tend to be extremely sensitive to channel noises, especially in wireless channels with high error rate. In this paper, we investigate how to elevate the error resilience of neural audio codecs for packet losses that often occur during real-time communications. We propose a feature-domain packet loss concealment algorithm (FD-PLC) for real-time neural speech coding. Specifically, we introduce a self-attention-based module on the received latent features to recover lost frames in the feature domain before the decoder. A hybrid segment-level and frame-level frequency-domain discriminator is employed to guide the network to focus on both the generative quality of lost frames and the continuity with neighbouring frames. Experimental results on several error patterns show that the proposed scheme can achieve better robustness compared with the corresponding error-free and error-resilient baselines. We also show that feature-domain concealment is superior to waveform-domain counterpart as post-processing.

preprint2021arXiv

MRNet: a Multi-scale Residual Network for EEG-based Sleep Staging

Sleep staging based on electroencephalogram (EEG) plays an important role in the clinical diagnosis and treatment of sleep disorders. In order to emancipate human experts from heavy labeling work, deep neural networks have been employed to formulate automated sleep staging systems recently. However, EEG signals lose considerable detailed information in network propagation, which affects the representation of deep features. To address this problem, we propose a new framework, called MRNet, for data-driven sleep staging by integrating a multi-scale feature fusion model and a Markov-based sequential correction algorithm. The backbone of MRNet is a residual block-based network, which performs as a feature extractor.Then the fusion model constructs a feature pyramid by concatenating the outputs from the different depths of the backbone, which can help the network better comprehend the signals in different scales. The Markov-based sequential correction algorithm is designed to reduce the output jitters generated by the classifier. The algorithm depends on a prior stage distribution associated with the sleep stage transition rule and the Markov chain. Experiment results demonstrate the competitive performance of our proposed approach on both accuracy and F1 score (e.g., 85.14% Acc and 78.91% F1 score on Sleep-EDFx, and 87.59% Acc and 79.62% F1 score on Sleep-EDF).

preprint2020arXiv

An adaptive finite element DtN method for the elastic wave scattering by biperiodic structures

Consider the scattering of a time-harmonic elastic plane wave by a bi-periodic rigid surface. The displacement of elastic wave motion is modeled by the three-dimensional Navier equation in an open domain above the surface. Based on the Dirichlet-to-Neumann (DtN) operator, which is given as an infinite series, an exact transparent boundary condition is introduced and the scattering problem is formulated equivalently into a boundary value problem in a bounded domain. An a posteriori error estimate based adaptive finite element DtN method is proposed to solve the discrete variational problem where the DtN operator is truncated into a finite number of terms. The a posteriori error estimate takes account of the finite element approximation error and the truncation error of the DtN operator which is shown to decay exponentially with respect to the truncation parameter. Numerical experiments are presented to illustrate the effectiveness of the proposed method.

preprint2020arXiv

Enhanced Ferromagnetism of CrI3 Bilayer by Self-Intercalation

Two-dimensional (2D) ferromagnets with high Curie temperature have long been the pursuit for electronic and spintronic applications. CrI3 is a rising star of intrinsic 2D ferromagnets, however, it suffers from weak exchange coupling. Here we propose a general strategy of self-intercalation to achieve enhanced ferromagnetism in bilayer CrI3. We showed that filling either Cr or I atoms into the van der Waals gap of stacked and twisted CrI3 bilayers can induce the double exchange effect and significantly strengthen the interlayer ferromagnetic coupling. According to our first-principles calculations, the intercalated native atoms act as covalent bridge between two CrI3 layers and lead to discrepant oxidation states for the Cr atoms. These theoretical results offer a facile route to achieve high-Curie-temperature 2D magnets for device implementation.