Researcher profile

Shaojie Shen

Shaojie Shen contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
19works
0followers
5topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

19 published item(s)

preprint2026arXiv

FlyCo: Foundation Model-Empowered Drones for Autonomous 3D Structure Scanning in Open-World Environments

Autonomous 3D scanning of open-world target structures via drones remains challenging despite broad applications. Existing paradigms rely on restrictive assumptions or effortful human priors, limiting practicality, efficiency, and adaptability. Recent foundation models (FMs) offer great potential to bridge this gap. This paper investigates a critical research problem: What system architecture can effectively integrate FM knowledge for this task? We answer it with FlyCo, a principled FM-empowered perception-prediction-planning loop enabling fully autonomous, prompt-driven 3D target scanning in diverse unknown open-world environments. FlyCo directly translates low-effort human prompts (text, visual annotations) into precise adaptive scanning flights via three coordinated stages: (1) perception fuses streaming sensor data with vision-language FMs for robust target grounding and tracking; (2) prediction distills FM knowledge and combines multi-modal cues to infer the partially observed target's complete geometry; (3) planning leverages predictive foresight to generate efficient and safe paths with comprehensive target coverage. Building on this, we further design key components to boost open-world target grounding efficiency and robustness, enhance prediction quality in terms of shape accuracy, zero-shot generalization, and temporal stability, and balance long-horizon flight efficiency with real-time computability and online collision avoidance. Extensive challenging real-world and simulation experiments show FlyCo delivers precise scene understanding, high efficiency, and real-time safety, outperforming existing paradigms with lower human effort and verifying the proposed architecture's practicality. Comprehensive ablations validate each component's contribution. FlyCo also serves as a flexible, extensible blueprint, readily leveraging future FM and robotics advances. Code will be released.

preprint2026arXiv

H-OmniStereo: Zero-Shot Omnidirectional Stereo Matching with Heading-Aligned Normal Priors

Stereo matching on top-bottom equirectangular images provides an effective framework for full-surround perception, as vertically aligned epipolar lines enable the use of advanced perspective stereo architectures that are largely driven by large-scale datasets and monocular priors. However, the performance of such adaptations is severely limited by the scarcity of omnidirectional stereo datasets and the degradation of perspective monocular priors under spherical distortions. To address these challenges, we propose H-OmniStereo, a zero-shot omnidirectional stereo matching framework. First, we construct high-quality synthetic dataset comprising over 2.8 million top-bottom equirectangular stereo pairs to scale up training. Second, we introduce an equirectangular monocular normal estimator, specifically operating in a heading-aligned coordinate system. Beyond providing distortion-robust and cross-view-consistent geometric priors for establishing reliable correspondences in stereo matching, this design boosts training efficiency and accommodates train-test FoV mismatches. Extensive experiments show that our approach achieves higher accuracy than existing methods on out-of-domain datasets and successfully generalizes to real-world consumer camera setups using a single model. The model and dataset will be released at https://github.com/JIANG-CX/H-OmniStereo.

preprint2026arXiv

Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future

Autonomous driving has long relied on modular "Perception-Decision-Action" pipelines, where hand-crafted interfaces and rule-based components often break down in complex or long-tailed scenarios. Their cascaded design further propagates perception errors, degrading downstream planning and control. Vision-Action (VA) models address some limitations by learning direct mappings from visual inputs to actions, but they remain opaque, sensitive to distribution shifts, and lack structured reasoning or instruction-following capabilities. Recent progress in Large Language Models (LLMs) and multimodal learning has motivated the emergence of Vision-Language-Action (VLA) frameworks, which integrate perception with language-grounded decision making. By unifying visual understanding, linguistic reasoning, and actionable outputs, VLAs offer a pathway toward more interpretable, generalizable, and human-aligned driving policies. This work provides a structured characterization of the emerging VLA landscape for autonomous driving. We trace the evolution from early VA approaches to modern VLA frameworks and organize existing methods into two principal paradigms: End-to-End VLA, which integrates perception, reasoning, and planning within a single model, and Dual-System VLA, which separates slow deliberation (via VLMs) from fast, safety-critical execution (via planners). Within these paradigms, we further distinguish subclasses such as textual vs. numerical action generators and explicit vs. implicit guidance mechanisms. We also summarize representative datasets and benchmarks for evaluating VLA-based driving systems and highlight key challenges and open directions, including robustness, interpretability, and instruction fidelity. Overall, this work aims to establish a coherent foundation for advancing human-compatible autonomous driving systems.

preprint2023arXiv

Decentralized iLQR for Cooperative Trajectory Planning of Connected Autonomous Vehicles via Dual Consensus ADMM

Developments in cooperative trajectory planning of connected autonomous vehicles (CAVs) have gathered considerable momentum and research attention. Generally, such problems present strong non-linearity and non-convexity, rendering great difficulties in finding the optimal solution. Existing methods typically suffer from low computational efficiency, and this hinders the appropriate applications in large-scale scenarios involving an increasing number of vehicles. To tackle this problem, we propose a novel decentralized iterative linear quadratic regulator (iLQR) algorithm by leveraging the dual consensus alternating direction method of multipliers (ADMM). First, the original non-convex optimization problem is reformulated into a series of convex optimization problems through iterative neighbourhood approximation. Then, the dual of each convex optimization problem is shown to have a consensus structure, which facilitates the use of consensus ADMM to solve for the dual solution in a fully decentralized and parallel architecture. Finally, the primal solution corresponding to the trajectory of each vehicle is recovered by solving a linear quadratic regulator (LQR) problem iteratively, and a novel trajectory update strategy is proposed to ensure the dynamic feasibility of vehicles. With the proposed development, the computation burden is significantly alleviated such that real-time performance is attainable. Two traffic scenarios are presented to validate the proposed algorithm, and thorough comparisons between our proposed method and baseline methods (including centralized iLQR, IPOPT, and SQP) are conducted to demonstrate the scalability of the proposed approach.

preprint2022arXiv

BDPGO: Balanced Distributed Pose Graph Optimization Framework for Swarm Robotics

Distributed pose graph optimization (DPGO) is one of the fundamental techniques of swarm robotics. Currently, the sub-problems of DPGO are built on the native poses. Our validation proves that this approach may introduce an imbalance in the sizes of the sub-problems in real-world scenarios, which affects the speed of DPGO optimization, and potentially increases communication requirements. In addition, the coherence of the estimated poses is not guaranteed when the robots in the swarm fail, or partial robots are disconnected. In this paper, we propose BDPGO, a balanced distributed pose graph optimization framework using the idea of decoupling the robot poses and DPGO. BDPGO re-distributes the poses in the pose graph to the robot swarm in a balanced way by introducing a two-stage graph partitioning method to build balanced subproblems. Our validation demonstrates that BDPGO significantly improves the optimization speed without changing the specific algorithm of DPGO in realistic datasets. What's more, we also validate that BDPGO is robust to robot failure, changes in the wireless network. BDPGO has capable of keeps the coherence of the estimated poses in these situations. The framework also has the potential to be applied to other collaborative simultaneous localization and mapping (CSLAM) problems involved in distributedly solving the factor graph.

preprint2022arXiv

Exploration with Global Consistency Using Real-Time Re-integration and Active Loop Closure

Despite recent progress of robotic exploration, most methods assume that drift-free localization is available, which is problematic in reality and causes severe distortion of the reconstructed map. In this work, we present a systematic exploration mapping and planning framework that deals with drifted localization, allowing efficient and globally consistent reconstruction. A real-time re-integration-based mapping approach along with a frame pruning mechanism is proposed, which rectifies map distortion effectively when drifted localization is corrected upon detecting loop-closure. Besides, an exploration planning method considering historical viewpoints is presented to enable active loop closing, which promotes a higher opportunity to correct localization errors and further improves the mapping quality. We evaluate both the mapping and planning methods as well as the entire system comprehensively in simulation and real-world experiments, showing their effectiveness in practice. The implementation of the proposed method will be made open-source for the benefit of the robotics community.

preprint2022arXiv

Fast 3D Sparse Topological Skeleton Graph Generation for Mobile Robot Global Planning

In recent years, mobile robots are becoming ambitious and deployed in large-scale scenarios. Serving as a high-level understanding of environments, a sparse skeleton graph is beneficial for more efficient global planning. Currently, existing solutions for skeleton graph generation suffer from several major limitations, including poor adaptiveness to different map representations, dependency on robot inspection trajectories and high computational overhead. In this paper, we propose an efficient and flexible algorithm generating a trajectory-independent 3D sparse topological skeleton graph capturing the spatial structure of the free space. In our method, an efficient ray sampling and validating mechanism are adopted to find distinctive free space regions, which contributes to skeleton graph vertices, with traversability between adjacent vertices as edges. A cycle formation scheme is also utilized to maintain skeleton graph compactness. Benchmark comparison with state-of-the-art works demonstrates that our approach generates sparse graphs in a substantially shorter time, giving high-quality global planning paths. Experiments conducted in real-world maps further validate the capability of our method in real-world scenarios. Our method will be made open source to benefit the community.

preprint2022arXiv

Neither Fast Nor Slow: How to Fly Through Narrow Tunnels

Nowadays, multirotors are playing important roles in abundant types of missions. During these missions, entering confined and narrow tunnels that are barely accessible to humans is desirable yet extremely challenging for multirotors. The restricted space and significant ego airflow disturbances induce control issues at both fast and slow flight speeds, meanwhile bringing about problems in state estimation and perception. Thus, a smooth trajectory at a proper speed is necessary for safe tunnel flights. To address these challenges, in this letter, a complete autonomous aerial system that can fly smoothly through tunnels with dimensions narrow to 0.6 m is presented. The system contains a motion planner that generates smooth mini-jerk trajectories along the tunnel center lines, which are extracted according to the map and Euclidean Distance Field (EDF), and its practical speed range is obtained through computational fluid dynamics (CFD) and flight data analyses. Extensive flight experiments on the quadrotor are conducted inside multiple narrow tunnels to validate the planning framework as well as the robustness of the whole system.

preprint2022arXiv

Omni-swarm: A Decentralized Omnidirectional Visual-Inertial-UWB State Estimation System for Aerial Swarms

Decentralized state estimation is one of the most fundamental components of autonomous aerial swarm systems in GPS-denied areas yet it still remains a highly challenging research topic. Omni-swarm, a decentralized omnidirectional visual-inertial-UWB state estimation system for aerial swarms, is proposed in this paper to address this research niche. To solve the issues of observability, complicated initialization, insufficient accuracy, and lack of global consistency, we introduce an omnidirectional perception front-end in Omni-swarm. It consists of stereo wide-FoV cameras and ultra-wideband sensors, visual-inertial odometry, multi-drone map-based localization, and visual drone tracking algorithms. The measurements from the front-end are fused with graph-based optimization in the back-end. The proposed method achieves centimeter-level relative state estimation accuracy while guaranteeing global consistency in the aerial swarm, as evidenced by the experimental results. Moreover, supported by Omni-swarm, inter-drone collision avoidance can be accomplished without any external devices, demonstrating the potential of Omni-swarm as the foundation of autonomous aerial swarms.

preprint2022arXiv

Temporal Point Cloud Completion with Pose Disturbance

Point clouds collected by real-world sensors are always unaligned and sparse, which makes it hard to reconstruct the complete shape of object from a single frame of data. In this work, we manage to provide complete point clouds from sparse input with pose disturbance by limited translation and rotation. We also use temporal information to enhance the completion model, refining the output with a sequence of inputs. With the help of gated recovery units(GRU) and attention mechanisms as temporal units, we propose a point cloud completion framework that accepts a sequence of unaligned and sparse inputs, and outputs consistent and aligned point clouds. Our network performs in an online manner and presents a refined point cloud for each frame, which enables it to be integrated into any SLAM or reconstruction pipeline. As far as we know, our framework is the first to utilize temporal information and ensure temporal consistency with limited transformation. Through experiments in ShapeNet and KITTI, we prove that our framework is effective in both synthetic and real-world datasets.

preprint2022arXiv

Trajectory Prediction with Graph-based Dual-scale Context Fusion

Motion prediction for traffic participants is essential for a safe and robust automated driving system, especially in cluttered urban environments. However, it is highly challenging due to the complex road topology as well as the uncertain intentions of the other agents. In this paper, we present a graph-based trajectory prediction network named the Dual Scale Predictor (DSP), which encodes both the static and dynamical driving context in a hierarchical manner. Different from methods based on a rasterized map or sparse lane graph, we consider the driving context as a graph with two layers, focusing on both geometrical and topological features. Graph neural networks (GNNs) are applied to extract features with different levels of granularity, and features are subsequently aggregated with attention-based inter-layer networks, realizing better local-global feature fusion. Following the recent goal-driven trajectory prediction pipeline, goal candidates with high likelihood for the target agent are extracted, and predicted trajectories are generated conditioned on these goals. Thanks to the proposed dual-scale context fusion network, our DSP is able to generate accurate and human-like multi-modal trajectories. We evaluate the proposed method on the large-scale Argoverse motion forecasting benchmark, and it achieves promising results, outperforming the recent state-of-the-art methods.

preprint2021arXiv

Event-based Motion Segmentation with Spatio-Temporal Graph Cuts

Identifying independently moving objects is an essential task for dynamic scene understanding. However, traditional cameras used in dynamic scenes may suffer from motion blur or exposure artifacts due to their sampling principle. By contrast, event-based cameras are novel bio-inspired sensors that offer advantages to overcome such limitations. They report pixelwise intensity changes asynchronously, which enables them to acquire visual information at exactly the same rate as the scene dynamics. We develop a method to identify independently moving objects acquired with an event-based camera, i.e., to solve the event-based motion segmentation problem. We cast the problem as an energy minimization one involving the fitting of multiple motion models. We jointly solve two subproblems, namely event cluster assignment (labeling) and motion model fitting, in an iterative manner by exploiting the structure of the input event data in the form of a spatio-temporal graph. Experiments on available datasets demonstrate the versatility of the method in scenes with different motion patterns and number of moving objects. The evaluation shows state-of-the-art results without having to predetermine the number of expected moving objects. We release the software and dataset under an open source licence to foster research in the emerging topic of event-based motion segmentation.

preprint2021arXiv

PiP: Planning-informed Trajectory Prediction for Autonomous Driving

It is critical to predict the motion of surrounding vehicles for self-driving planning, especially in a socially compliant and flexible way. However, future prediction is challenging due to the interaction and uncertainty in driving behaviors. We propose planning-informed trajectory prediction (PiP) to tackle the prediction problem in the multi-agent setting. Our approach is differentiated from the traditional manner of prediction, which is only based on historical information and decoupled with planning. By informing the prediction process with the planning of ego vehicle, our method achieves the state-of-the-art performance of multi-agent forecasting on highway datasets. Moreover, our approach enables a novel pipeline which couples the prediction and planning, by conditioning PiP on multiple candidate trajectories of the ego vehicle, which is highly beneficial for autonomous driving in interactive scenarios.

preprint2020arXiv

An Augmented Reality Interaction Interface for Autonomous Drone

Human drone interaction in autonomous navigation incorporates spatial interaction tasks, including reconstructed 3D map from the drone and human desired target position. Augmented Reality (AR) devices can be powerful interactive tools for handling these spatial interactions. In this work, we build an AR interface that displays the reconstructed 3D map from the drone on physical surfaces in front of the operator. Spatial target positions can be further set on the 3D map by intuitive head gaze and hand gesture. The AR interface is deployed to interact with an autonomous drone to explore an unknown environment. A user study is further conducted to evaluate the overall interaction performance.

preprint2020arXiv

Efficient Uncertainty-aware Decision-making for Automated Driving Using Guided Branching

Decision-making in dense traffic scenarios is challenging for automated vehicles (AVs) due to potentially stochastic behaviors of other traffic participants and perception uncertainties (e.g., tracking noise and prediction errors, etc.). Although the partially observable Markov decision process (POMDP) provides a systematic way to incorporate these uncertainties, it quickly becomes computationally intractable when scaled to the real-world large-size problem. In this paper, we present an efficient uncertainty-aware decision-making (EUDM) framework, which generates long-term lateral and longitudinal behaviors in complex driving environments in real-time. The computation complexity is controlled to an appropriate level by two novel techniques, namely, the domain-specific closed-loop policy tree (DCP-Tree) structure and conditional focused branching (CFB) mechanism. The key idea is utilizing domain-specific expert knowledge to guide the branching in both action and intention space. The proposed framework is validated using both onboard sensing data captured by a real vehicle and an interactive multi-agent simulation platform. We also release the code of our framework to accommodate benchmarking.

preprint2020arXiv

FP-Stereo: Hardware-Efficient Stereo Vision for Embedded Applications

Fast and accurate depth estimation, or stereo matching, is essential in embedded stereo vision systems, requiring substantial design effort to achieve an appropriate balance among accuracy, speed and hardware cost. To reduce the design effort and achieve the right balance, we propose FP-Stereo for building high-performance stereo matching pipelines on FPGAs automatically. FP-Stereo consists of an open-source hardware-efficient library, allowing designers to obtain the desired implementation instantly. Diverse methods are supported in our library for each stage of the stereo matching pipeline and a series of techniques are developed to exploit the parallelism and reduce the resource overhead. To improve the usability, FP-Stereo can generate synthesizable C code of the FPGA accelerator with our optimized HLS templates automatically. To guide users for the right design choice meeting specific application requirements, detailed comparisons are performed on various configurations of our library to investigate the accuracy/speed/cost trade-off. Experimental results also show that FP-Stereo outperforms the state-of-the-art FPGA design from all aspects, including 6.08% lower error, 2x faster speed, 30% less resource usage and 40% less energy consumption. Compared to GPU designs, FP-Stereo achieves the same accuracy at a competitive speed while consuming much less energy.

preprint2020arXiv

Joint Spatial-Temporal Optimization for Stereo 3D Object Tracking

Directly learning multiple 3D objects motion from sequential images is difficult, while the geometric bundle adjustment lacks the ability to localize the invisible object centroid. To benefit from both the powerful object understanding skill from deep neural network meanwhile tackle precise geometry modeling for consistent trajectory estimation, we propose a joint spatial-temporal optimization-based stereo 3D object tracking method. From the network, we detect corresponding 2D bounding boxes on adjacent images and regress an initial 3D bounding box. Dense object cues (local depth and local coordinates) that associating to the object centroid are then predicted using a region-based network. Considering both the instant localization accuracy and motion consistency, our optimization models the relations between the object centroid and observed cues into a joint spatial-temporal error function. All historic cues will be summarized to contribute to the current estimation by a per-frame marginalization strategy without repeated computation. Quantitative evaluation on the KITTI tracking dataset shows our approach outperforms previous image-based 3D tracking methods by significant margins. We also report extensive results on multiple categories and larger datasets (KITTI raw and Argoverse Tracking) for future benchmarking.

preprint2020arXiv

RAPTOR: Robust and Perception-aware Trajectory Replanning for Quadrotor Fast Flight

Recent advances in trajectory replanning have enabled quadrotor to navigate autonomously in unknown environments. However, high-speed navigation still remains a significant challenge. Given very limited time, existing methods have no strong guarantee on the feasibility or quality of the solutions. Moreover, most methods do not consider environment perception, which is the key bottleneck to fast flight. In this paper, we present RAPTOR, a robust and perception-aware replanning framework to support fast and safe flight. A path-guided optimization (PGO) approach that incorporates multiple topological paths is devised, to ensure finding feasible and high-quality trajectories in very limited time. We also introduce a perception-aware planning strategy to actively observe and avoid unknown obstacles. A risk-aware trajectory refinement ensures that unknown obstacles which may endanger the quadrotor can be observed earlier and avoid in time. The motion of yaw angle is planned to actively explore the surrounding space that is relevant for safe navigation. The proposed methods are tested extensively. We will release our implementation as an open-source package for the community.

preprint2020arXiv

Robust Real-time UAV Replanning Using Guided Gradient-based Optimization and Topological Paths

Gradient-based trajectory optimization (GTO) has gained wide popularity for quadrotor trajectory replanning. However, it suffers from local minima, which is not only fatal to safety but also unfavorable for smooth navigation. In this paper, we propose a replanning method based on GTO addressing this issue systematically. A path-guided optimization (PGO) approach is devised to tackle infeasible local minima, which improves the replanning success rate significantly. A topological path searching algorithm is developed to capture a collection of distinct useful paths in 3-D environments, each of which then guides an independent trajectory optimization. It activates a more comprehensive exploration of the solution space and output superior replanned trajectories. Benchmark evaluation shows that our method outplays state-of-the-art methods regarding replanning success rate and optimality. Challenging experiments of aggressive autonomous flight are presented to demonstrate the robustness of our method. We will release our implementation as an open-source package.