Researcher profile

Xuesong Li

Xuesong Li contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 (Emerging) · Verification L1 · Unclaimed author
7 works
0 followers
5 topics
4 close collaborators


Published work

7 published items

preprint · 2026 · arXiv

Break the Brake, Not the Wheel: Untargeted Jailbreak via Entropy Maximization

Recent studies show that gradient-based universal image jailbreaks on vision-language models (VLMs) exhibit little or no cross-model transferability, casting doubt on the feasibility of transferable multimodal jailbreaks. We revisit this conclusion under a strictly untargeted threat model that does not enforce a fixed prefix or response pattern. Our preliminary experiment reveals that refusal behavior concentrates at high-entropy tokens during autoregressive decoding, and that non-refusal tokens already carry substantial probability mass among the top-ranked candidates before the attack. Motivated by this finding, we propose Untargeted Jailbreak via Entropy Maximization (UJEM)-KL, a lightweight attack that maximizes entropy at these decision tokens to flip refusal outcomes, while stabilizing the remaining low-entropy positions to preserve output quality. Across three VLMs and two safety benchmarks, UJEM-KL achieves competitive white-box attack success rates and consistently improves transferability, while remaining effective under representative defenses. Our experimental results indicate that the limited transferability primarily stems from overly constrained optimization objectives.
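The abstract's notion of "high-entropy tokens" can be made concrete with a small sketch. The code below only computes the Shannon entropy of a next-token distribution at each decoding step and flags the steps above a threshold; it is not the UJEM-KL attack and contains no optimization. The function names, the toy logits, and the threshold value are all assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution at one step."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def high_entropy_positions(step_logits, threshold):
    """Indices of decoding steps whose entropy meets the (hypothetical) threshold."""
    return [i for i, logits in enumerate(step_logits)
            if token_entropy(logits) >= threshold]

# Toy logits over a 4-token vocabulary for three decoding steps (illustrative only).
step_logits = [
    [10.0, 0.0, 0.0, 0.0],  # one dominant token: entropy near zero
    [1.0, 1.0, 1.0, 1.0],   # uniform: entropy equals ln(4)
    [5.0, 5.0, 0.0, 0.0],   # two competing tokens: moderate entropy
]
positions = high_entropy_positions(step_logits, threshold=0.5)
```

In this toy input, the second and third steps are the ones where several candidates compete, which is the kind of position the abstract calls a decision token.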

preprint · 2026 · arXiv

From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation

Text-based image segmentation aims to delineate object boundaries within an image from text prompts, offering higher flexibility and broader application scope than traditional fixed-category segmentation tasks. Recent studies have shown that diffusion models (e.g., Stable Diffusion) can provide rich multimodal semantic features, which has led to work that uses diffusion models as feature extractors for segmentation. Such methods, however, inherit the generative nature of diffusion models, which is harmful to discriminative segmentation tasks. In response, we propose RLFSeg, a novel framework that leverages Rectified Flow to learn a direct mapping from the image to the segmentation mask within the latent space. The model is thus freed from the noise-denoise process and from the need to optimize the time step of diffusion models, resulting in substantially better performance than previous diffusion-based methods, especially in zero-shot scenarios. By introducing label refinement and an Adaptive One-Step Sampling strategy, the model achieves higher accuracy even with a single inference step. The framework redirects a pretrained generative model to the discriminative segmentation task with zero modification to the model structure, thus revealing promising application potential and significant research value.

preprint · 2026 · arXiv

Structural Energy Guidance for View-Consistent Text-to-3D Generation

Text-to-3D generation based on diffusion models often suffers from the Janus problem, leading to inconsistent geometry across viewpoints. This work identifies viewpoint bias in 2D diffusion priors as the main cause and proposes Structural Energy-Guided Sampling (SEGS), a training-free and plug-and-play framework to improve multi-view consistency. SEGS constructs a structural energy in the PCA subspace of U-Net features and injects its gradient into the denoising process. It can be easily integrated into SDS/VSD pipelines without retraining. Experiments show that SEGS reduces the Janus Rate by about 10% on average and improves View-CS scores across multiple baselines, including DreamFusion, Magic3D, and LucidDreamer. This method effectively alleviates viewpoint artifacts while preserving appearance fidelity, providing a flexible solution for high-quality text-to-3D content generation.

preprint · 2020 · arXiv

Context-Aware Dynamic Feature Extraction for 3D Object Detection in Point Clouds

The varying density of point clouds increases the difficulty of 3D detection. In this paper, we present a context-aware dynamic network (CADNet) to capture the variance of density by considering both point context and semantic context. Point-level contexts are generated from the original point clouds to enlarge the effective receptive field. They are extracted around the voxelized pillars using our extended voxelization method and processed by the context encoder in parallel with the pillar features. With a large perception range, we are able to capture the variance of features for potential objects and generate attentive spatial guidance to help adjust the strengths for different regions. In the region proposal network, considering the limited representation ability of traditional convolution, where the same kernels are shared among different samples and positions, we propose a decomposable dynamic convolutional layer that adapts to the variance of input features by learning from local semantic context. It adaptively generates position-dependent coefficients for multiple fixed kernels and combines them to convolve with local feature windows. Based on our dynamic convolution, we design a dual-path convolution block to further improve the representation ability. We conduct experiments with our network on the KITTI dataset and achieve good performance on the 3D detection task in both precision and speed. Our one-stage detector outperforms SECOND and PointPillars by a large margin and achieves a speed of 30 FPS.
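The idea of combining multiple fixed kernels with position-dependent coefficients can be illustrated with a minimal 1D sketch. This is not the CADNet layer: the paper learns its coefficients from local semantic context, whereas `spread_coeffs` below is a hand-written, hypothetical stand-in, and the kernels, signal, and function names are assumptions chosen for illustration.

```python
import math

def dynamic_conv1d(signal, kernels, coeff_fn):
    """Convolve each local window with a per-position mix of fixed kernels.

    kernels  : list of equal-length (odd-sized) fixed kernels
    coeff_fn : maps a local window to one coefficient per kernel
    """
    k = len(kernels[0])
    pad = k // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad  # zero padding keeps the length
    out = []
    for i in range(len(signal)):
        window = padded[i:i + k]
        coeffs = coeff_fn(window)  # position-dependent coefficients
        # Mix the fixed kernels into one effective kernel for this position.
        mixed = [sum(c * kern[j] for c, kern in zip(coeffs, kernels))
                 for j in range(k)]
        out.append(sum(w * x for w, x in zip(mixed, window)))
    return out

def spread_coeffs(window):
    """Hypothetical coefficient generator: lean on smoothing where values vary more."""
    spread = max(window) - min(window)
    g = 1.0 / (1.0 + math.exp(-spread))  # sigmoid gate in (0, 1)
    return [1.0 - g, g]

identity = [0.0, 1.0, 0.0]
average = [1.0 / 3, 1.0 / 3, 1.0 / 3]
signal = [1.0, 2.0, 3.0, 4.0]

static_identity = dynamic_conv1d(signal, [identity, average], lambda w: [1.0, 0.0])
static_average = dynamic_conv1d(signal, [identity, average], lambda w: [0.0, 1.0])
dynamic = dynamic_conv1d(signal, [identity, average], spread_coeffs)
```

With constant coefficients the layer reduces to an ordinary convolution with one of the fixed kernels; the dynamic case differs only in that the mixture changes from window to window.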

preprint · 2020 · arXiv

Efficient and accurate object detection with simultaneous classification and tracking

Interacting with the environment, for example through object detection and tracking, is a crucial ability of mobile robots. Besides high accuracy, efficiency in terms of processing effort and energy consumption is also desirable. To satisfy both requirements, we propose a detection framework based on simultaneous classification and tracking in the point stream. In this framework, a tracker performs data association in sequences of the point cloud, guiding the detector to avoid redundant processing (i.e., classifying already-known objects). For objects whose classification is not sufficiently certain, a fusion model is designed to fuse selected key observations that provide different perspectives across the tracking span. As a result, both the accuracy and the efficiency of detection can be enhanced. This method is particularly suitable for detecting and tracking moving objects, a process that would require expensive computations if solved using conventional procedures. Experiments were conducted on the benchmark dataset, and the results showed that the proposed method outperforms original tracking-by-detection approaches in both efficiency and accuracy.
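The control flow described in this abstract, where the tracker lets the detector skip objects it already knows, might be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the averaging rule in `fuse`, the `certainty` threshold, the stub classifier, and every name here are hypothetical, and the data association step is assumed to have already assigned a track ID to each detection.

```python
def fuse(observations):
    """Average per-class scores over stored observations (an assumed fusion rule)."""
    n = len(observations)
    return {c: sum(o[c] for o in observations) / n for c in observations[0]}

def process_frame(tracks, detections, classify, certainty=0.8):
    """Classify only tracked objects whose fused label is not yet certain.

    tracks     : dict mapping track_id -> list of per-observation score dicts
    detections : list of (track_id, points) pairs, already associated by a tracker
    classify   : callable returning a dict of class scores for one observation
    Returns (labels per track_id, number of classifier calls this frame).
    """
    labels, calls = {}, 0
    for track_id, points in detections:
        history = tracks.setdefault(track_id, [])
        fused = fuse(history) if history else None
        if fused is None or max(fused.values()) < certainty:
            history.append(classify(points))  # new key observation for this track
            calls += 1
            fused = fuse(history)
        labels[track_id] = max(fused, key=fused.get)
    return labels, calls

def confident_classifier(points):
    """Stub classifier returning fixed scores (illustrative only)."""
    return {"car": 0.9, "pedestrian": 0.1}

tracks = {}
labels_1, calls_1 = process_frame(tracks, [(1, None)], confident_classifier)
labels_2, calls_2 = process_frame(tracks, [(1, None)], confident_classifier)
```

In the second frame the same track is already labeled with sufficient certainty, so the classifier is not invoked again, which is where the efficiency gain in such a scheme would come from.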

preprint · 2020 · arXiv

Real-time 3D object proposal generation and classification under limited processing resources

The task of detecting 3D objects is important to various robotic applications. Existing deep learning-based detection techniques have achieved impressive performance. However, these techniques need a graphics processing unit (GPU) to run in real time. To achieve real-time 3D object detection with the limited computational resources available to robots, we propose an efficient detection method consisting of 3D proposal generation and classification. The proposal generation is mainly based on point segmentation, while the proposal classification is performed by a lightweight convolutional neural network (CNN) model. We validate our method on the KITTI datasets. The experimental results demonstrate that the proposed method performs real-time 3D object detection from point clouds with competitive object recall and classification performance.

preprint · 2020 · arXiv

Tunable Graphene Split-Ring Resonators

A split-ring resonator is a prototypical meta-atom in metamaterials. Though noble metal-based split-ring resonators have been extensively studied, to date there has been no experimental demonstration of split-ring resonators made from graphene, an emerging and intriguing plasmonic material. Here, we experimentally demonstrate graphene split-ring resonators with a deep-subwavelength (about one hundredth of the excitation wavelength) magnetic dipole response in the terahertz regime. In addition, quadrupole and electric dipole modes are observed, depending on the incident light polarization. All modes can be tuned via chemical doping or by stacking multiple graphene layers. The strong interaction with surface polar phonons of the SiO2 substrate also significantly modifies the response. Finite-element frequency-domain simulations closely reproduce the experimental results. Our study is a step toward multi-functional graphene metamaterials, beyond simple graphene ribbon or disk arrays with electric dipole resonances only.