Researcher profile

Long Zhao

Long Zhao contributes to research discovery and scholarly infrastructure.

Researcher · Affiliation not imported · Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
18 works
0 followers
13 topics
4 close collaborators

Published work

18 published items

preprint · 2026 · arXiv

VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement

Despite the remarkable progress of Multimodal Large Language Models (MLLMs) in 2D vision-language tasks, their application to complex 3D scene manipulation remains underexplored. In this paper, we bridge this critical gap by tackling three key challenges in the 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, which struggle to link programmatic edits with precise 3D outcomes, we introduce an MCP-based API. This shifts the interaction from brittle raw code manipulation to more robust, function-level updates. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools to analyze scene state, gather spatial information, and validate action outcomes. This perceptual feedback loop is critical for closing the gap between language-based updates and precise 3D-aware manipulation. Third, to manage the iterative, error-prone updates, we propose a collaborative multi-agent framework with designated roles for planning, execution, and verification. This decomposition allows the system to robustly handle multi-step instructions and recover from intermediate errors. We demonstrate the effectiveness of our approach on a diverse set of 25 complex object arrangement tasks, where it significantly outperforms existing baselines. Website: vulcan-3d.github.io

preprint · 2022 · arXiv

Are Multimodal Transformers Robust to Missing Modality?

Multimodal data collected from the real world are often imperfect due to missing modalities. Therefore, multimodal models that are robust against modal-incomplete data are highly preferred. Recently, Transformer models have shown great success in processing multimodal data. However, existing work has been limited to either architecture designs or pre-training strategies; whether Transformer models are naturally robust against missing-modal data has rarely been investigated. In this paper, we present the first-of-its-kind work to comprehensively investigate the behavior of Transformers in the presence of modal-incomplete data. Unsurprisingly, we find that Transformer models are sensitive to missing modalities, while different modal fusion strategies significantly affect the robustness. What surprised us is that the optimal fusion strategy is dataset dependent even for the same Transformer model; there does not exist a universal strategy that works in general cases. Based on these findings, we propose a principled method to improve the robustness of Transformer models by automatically searching for an optimal fusion strategy regarding input data. Experimental validations on three benchmarks support the superior performance of the proposed method.

preprint · 2022 · arXiv

Exploiting Unlabeled Data with Vision and Language Models for Object Detection

Building robust and generic object detection frameworks requires scaling to larger label spaces and bigger training datasets. However, it is prohibitively costly to acquire annotations for thousands of categories at a large scale. We propose a novel method that leverages the rich semantics available in recent vision and language models to localize and classify objects in unlabeled images, effectively generating pseudo labels for object detection. Starting with a generic and class-agnostic region proposal mechanism, we use vision and language models to categorize each region of an image into any object category that is required for downstream tasks. We demonstrate the value of the generated pseudo labels in two specific tasks, open-vocabulary detection, where a model needs to generalize to unseen object categories, and semi-supervised object detection, where additional unlabeled images can be used to improve the model. Our empirical evaluation shows the effectiveness of the pseudo labels in both tasks, where we outperform competitive baselines and achieve a novel state-of-the-art for open-vocabulary object detection. Our code is available at https://github.com/xiaofeng94/VL-PLM.
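
The pseudo-labeling idea in this abstract can be sketched in a few lines. This is a toy illustration, not the VL-PLM code: it assumes each class-agnostic proposal already comes with a region embedding, that each category has a text embedding in the same space, and that a softmax over scaled cosine similarities with a confidence threshold decides which proposals become pseudo labels. The names `pseudo_label`, `temperature` and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pseudo_label(proposals, text_embeddings, temperature=0.1, threshold=0.8):
    """Assign a category to each proposal and keep only confident ones.

    proposals: list of (box, region_embedding) pairs (assumed inputs).
    text_embeddings: dict mapping category name -> text embedding.
    Returns a list of (box, category, probability) pseudo labels.
    """
    names = list(text_embeddings)
    labels = []
    for box, region_emb in proposals:
        # Scaled similarities act as logits over the category names.
        logits = [cosine(region_emb, text_embeddings[n]) / temperature for n in names]
        peak = max(logits)
        exps = [math.exp(l - peak) for l in logits]  # numerically stable softmax
        total = sum(exps)
        probs = [e / total for e in exps]
        best = max(range(len(names)), key=probs.__getitem__)
        if probs[best] >= threshold:
            labels.append((box, names[best], probs[best]))
    return labels
```

A proposal whose embedding sits between two categories receives a flat distribution and is dropped, which is the usual reason for thresholding pseudo labels before training a detector on them.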

preprint · 2022 · arXiv

Global Matching with Overlapping Attention for Optical Flow Estimation

Optical flow estimation is a fundamental task in computer vision. Recent direct-regression methods using deep neural networks achieve remarkable performance improvement. However, they do not explicitly capture long-term motion correspondences and thus cannot handle large motions effectively. In this paper, inspired by the traditional matching-optimization methods where matching is introduced to handle large displacements before energy-based optimizations, we introduce a simple but effective global matching step before the direct regression and develop a learning-based matching-optimization framework, namely GMFlowNet. In GMFlowNet, global matching is efficiently calculated by applying argmax on 4D cost volumes. Additionally, to improve the matching quality, we propose patch-based overlapping attention to extract large context features. Extensive experiments demonstrate that GMFlowNet outperforms RAFT, the most popular optimization-only method, by a large margin and achieves state-of-the-art performance on standard benchmarks. Thanks to the matching and overlapping attention, GMFlowNet obtains major improvements on the predictions for textureless regions and large motions. Our code is made publicly available at https://github.com/xiaofeng94/GMFlowNet
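
The global matching step mentioned in this abstract can be illustrated with a toy sketch. It is not the GMFlowNet implementation: it assumes tiny per-pixel feature grids stored as nested lists, uses a plain dot product as the matching score, and the name `global_match` is hypothetical. For each source pixel it takes the argmax over all target pixels (one slice of the 4D cost volume) and reports the displacement to the best match.

```python
def global_match(feat1, feat2):
    """Coarse flow from an argmax over an all-pairs matching score.

    feat1, feat2: H x W grids whose entries are feature vectors (lists of floats).
    Returns an H x W grid of (dx, dy) displacements from frame 1 to frame 2.
    """
    h, w = len(feat1), len(feat1[0])
    flow = [[(0, 0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best_score, best_pos = float("-inf"), (y, x)
            # Scan every target pixel (v, u): this is the (y, x) slice of the
            # 4D volume score[y][x][v][u] = <feat1[y][x], feat2[v][u]>.
            for v in range(h):
                for u in range(w):
                    score = sum(a * b for a, b in zip(feat1[y][x], feat2[v][u]))
                    if score > best_score:
                        best_score, best_pos = score, (v, u)
            flow[y][x] = (best_pos[1] - x, best_pos[0] - y)
    return flow
```

Because every target pixel is a candidate, the match is not limited to a local search window, which is why this kind of step can cope with large displacements. The exhaustive loops here cost O((HW)^2) score evaluations and are only meant for illustration.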

preprint · 2022 · arXiv

Limits of Semistatic Trading Strategies

We show that pointwise limits of semistatic trading strategies in discrete time are again semistatic strategies. The analysis is carried out in full generality for a two-period model, and under a probabilistic condition for multi-period, multi-stock models. Our result contrasts with a counterexample of Acciaio, Larsson and Schachermayer, and shows that their observation is due to a failure of integrability rather than instability of the semistatic form. Mathematically, our results relate to the decomposability of functions as studied in the context of Schrödinger bridges.

preprint · 2022 · arXiv

Martingale Schrödinger Bridges and Optimal Semistatic Portfolios

In a two-period financial market where a stock is traded dynamically and European options at maturity are traded statically, we study the so-called martingale Schrödinger bridge Q*; that is, the minimal-entropy martingale measure among all models calibrated to option prices. This minimization is shown to be in duality with an exponential utility maximization over semistatic portfolios. Under a technical condition on the physical measure P, we show that an optimal portfolio exists and provides an explicit solution for Q*. This result overcomes the remarkable issue of non-closedness of semistatic strategies discovered by Acciaio, Larsson and Schachermayer. Specifically, we exhibit a dense subset of calibrated martingale measures with particular properties to show that the portfolio in question has a well-defined and integrable option position.

preprint · 2022 · arXiv

Min-Max Latency Optimization Based on Sensed Position State Information in Internet of Vehicles

Dual-function radar communication (DFRC) is an essential technology in the Internet of Vehicles (IoV). We consider a road-side unit (RSU) that employs DFRC signals to sense the vehicles' position state information (PSI) and communicates with the vehicles based on the PSI. The objective of this paper is to minimize the maximum communication delay among all vehicles under an estimation accuracy constraint on the vehicles' PSI and a transmit power constraint at the RSU. By leveraging convex optimization theory, two iterative power allocation algorithms are proposed with different complexities and applicable scenarios. Simulation results indicate that the proposed power allocation algorithm converges and can significantly reduce the maximum transmit delay among vehicles compared with other schemes.
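
A min-max latency objective under a total power budget can be sketched with a generic bisection search. This is not either of the paper's two algorithms and it ignores the PSI estimation accuracy constraint. It assumes a Shannon-rate latency model, latency = bits / (bandwidth * log2(1 + gain * power / noise)), and all function and parameter names are hypothetical.

```python
import math


def required_power(bits, bandwidth, gain, noise, latency):
    """Power needed to deliver `bits` within `latency` under the assumed rate model."""
    try:
        return (2.0 ** (bits / (latency * bandwidth)) - 1.0) * noise / gain
    except OverflowError:
        return math.inf  # the target latency is far too small to be reachable


def min_max_latency(bits, gains, total_power, bandwidth=1.0, noise=1.0, tol=1e-9):
    """Bisect on a common latency target t.

    A target t is feasible if the summed per-vehicle power needed to meet it
    fits in the budget. Feasibility is monotone in t, so bisection applies.
    Returns (smallest feasible t found, per-vehicle powers achieving it).
    """
    def feasible(t):
        needed = sum(required_power(b, bandwidth, g, noise, t)
                     for b, g in zip(bits, gains))
        return needed <= total_power

    lo, hi = 0.0, 1.0
    while not feasible(hi):  # grow the upper bound until it is feasible
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(mid):
            hi = mid
        else:
            lo = mid
    powers = [required_power(b, bandwidth, g, noise, hi)
              for b, g in zip(bits, gains)]
    return hi, powers
```

At the returned target every vehicle has the same latency, so vehicles with weaker channels receive more power. That equalizing behavior is typical of min-max allocations under this kind of model.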

preprint · 2022 · arXiv

On the energy of gravitational waves

The energy of gravitational waves is a fundamental problem in gravity theory. The existing descriptions for the energy of gravitational waves, such as the well-known Isaacson energy-momentum tensor, suffer from several defects. Due to the equivalence principle, the gravitational energy-momentum can only be defined quasilocally, being associated with a closed spacelike 2-surface bounding a region. We propose a new approach to derive the energy of gravitational waves directly from the quasilocal gravitational energy. Such an approach is natural and consistent with the quasilocality of gravitational energy-momentum.

preprint · 2022 · arXiv

Out-of-Domain Generalization from a Single Source: An Uncertainty Quantification Approach

We are concerned with a worst-case scenario in model generalization, in the sense that a model aims to perform well on many unseen domains while there is only one single domain available for training. We propose Meta-Learning based Adversarial Domain Augmentation to solve this Out-of-Domain generalization problem. The key idea is to leverage adversarial training to create "fictitious" yet "challenging" populations, from which a model can learn to generalize with theoretical guarantees. To facilitate fast and desirable domain augmentation, we cast the model training in a meta-learning scheme and use a Wasserstein Auto-Encoder to relax the widely used worst-case constraint. We further improve our method by integrating uncertainty quantification for efficient domain generalization. Extensive experiments on multiple benchmark datasets indicate its superior performance in tackling single domain generalization.

preprint · 2022 · arXiv

The universality of islands outside the horizon

We systematically calculate the quantum extremal surface (QES) associated with Hawking radiation for general $D$-dimensional ($D\geq2$) asymptotically flat (or AdS) eternal black holes using the island formula. We collect the Hawking radiation particles by a non-gravitational bath and find that a QES exists in the near-horizon region outside the black hole when $c\cdot G_{(D)}$ is small enough, where $c$ is the central charge of the conformal matter and $G_{(D)}$ the $D$-dimensional Newton constant. The locations of the QES in these backgrounds are obtained, and the late-time radiation entropy saturates at twice the black hole entropy. Finally, we numerically check that no island configuration exists once $c\cdot G_{(D)}$ exceeds a certain upper bound in two-dimensional generalized dilaton theories (GDT). When $c\cdot G_{(D)}$ is close to the upper bound, the backreaction of the matter field on the background cannot be neglected. We also consider the conditions for the existence of the island configuration with the backreaction and prove that the upper bound also exists for the Witten black hole and Weyl-related Witten black hole.

preprint · 2021 · arXiv

Box Re-Ranking: Unsupervised False Positive Suppression for Domain Adaptive Pedestrian Detection

False positives are among the most serious problems caused by agnostic domain shift in domain adaptive pedestrian detection. However, it is impossible to label each box in countless target domains. We therefore focus on suppressing false positives in each target domain in an unsupervised way. In this paper, we recast object detection as a ranking task among positive and negative boxes, which turns false positive suppression into a box re-ranking problem that can be solved without manual annotation. A side problem arises during box re-ranking: no labeled validation data is available for cherry-picking. Since we aim to keep the detection of true positives unchanged, we propose box number alignment, a self-supervised evaluation metric, to prevent the optimized model from capacity degeneration. Extensive experiments conducted on cross-domain pedestrian detection datasets demonstrate the effectiveness of our proposed framework. Furthermore, an extension to two general unsupervised domain adaptive object detection benchmarks shows that our framework also outperforms other state-of-the-art methods.

preprint · 2021 · arXiv

Nested Hierarchical Transformer: Towards Accurate, Data-Efficient and Interpretable Visual Understanding

Hierarchical structures are popular in recent vision transformers; however, they require sophisticated designs and massive datasets to work well. In this paper, we explore the idea of nesting basic local transformers on non-overlapping image blocks and aggregating them in a hierarchical way. We find that the block aggregation function plays a critical role in enabling cross-block non-local information communication. This observation leads us to design a simplified architecture that requires minor code changes upon the original vision transformer. The benefits of the proposed judiciously-selected design are threefold: (1) NesT converges faster and requires much less training data to achieve good generalization on both ImageNet and small datasets like CIFAR; (2) when extending our key ideas to image generation, NesT leads to a strong decoder that is 8$\times$ faster than previous transformer-based generators; and (3) we show that decoupling the feature learning and abstraction processes via this nested hierarchy in our design enables constructing a novel method (named GradCAT) for visually interpreting the learned model. Source code is available at https://github.com/google-research/nested-transformer.

preprint · 2020 · arXiv

Beyond Lexical: A Semantic Retrieval Framework for Textual Search Engine

Search engines have become a fundamental component of various web and mobile applications. Retrieving relevant documents from massive datasets is challenging for a search engine system, especially when faced with verbose or tail queries. In this paper, we explore a vector space search framework for document retrieval. Specifically, we trained a deep semantic matching model so that each query and document can be encoded as a low-dimensional embedding. Our model is based on the BERT architecture. We deployed a fast k-nearest-neighbor index service for online serving. Both offline and online metrics demonstrate that our method improved retrieval performance and search quality considerably, particularly for tail
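
The vector space retrieval step in this abstract can be sketched as a nearest-neighbor lookup over embeddings. This is a minimal illustration, not the paper's system: it assumes query and document embeddings are already computed, replaces the fast k-nearest-neighbor index service with an exhaustive cosine-similarity scan, and the names `normalize` and `top_k` are hypothetical.

```python
import heapq
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def top_k(query_emb, doc_embs, k=2):
    """Return the ids of the k documents most similar to the query.

    query_emb: embedding of the query (assumed to come from an encoder).
    doc_embs: dict mapping document id -> embedding in the same space.
    """
    q = normalize(query_emb)
    scored = (
        (sum(a * b for a, b in zip(q, normalize(d))), doc_id)
        for doc_id, d in doc_embs.items()
    )
    # heapq.nlargest keeps only the k best (score, id) pairs, best first.
    return [doc_id for _, doc_id in heapq.nlargest(k, scored)]
```

An exhaustive scan is linear in the number of documents, which is why production systems typically use an approximate nearest-neighbor index instead. The ranking logic stays the same.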

preprint · 2020 · arXiv

Knowledge as Priors: Cross-Modal Knowledge Generalization for Datasets without Superior Knowledge

Cross-modal knowledge distillation deals with transferring knowledge from a model trained with superior modalities (Teacher) to another model trained with weak modalities (Student). Existing approaches require that paired training examples exist in both modalities. However, accessing the data from superior modalities may not always be feasible. For example, in the case of 3D hand pose estimation, depth maps, point clouds, or stereo images usually capture better hand structures than RGB images, but most of them are expensive to collect. In this paper, we propose a novel scheme to train the Student in a Target dataset where the Teacher is unavailable. Our key idea is to generalize the distilled cross-modal knowledge learned from a Source dataset, which contains paired examples from both modalities, to the Target dataset by modeling knowledge as priors on parameters of the Student. We name our method "Cross-Modal Knowledge Generalization" and demonstrate that our scheme results in competitive performance for 3D hand pose estimation on standard benchmark datasets.

preprint · 2020 · arXiv

Learning to Learn Single Domain Generalization

We are concerned with a worst-case scenario in model generalization, in the sense that a model aims to perform well on many unseen domains while there is only one single domain available for training. We propose a new method named adversarial domain augmentation to solve this Out-of-Distribution (OOD) generalization problem. The key idea is to leverage adversarial training to create "fictitious" yet "challenging" populations, from which a model can learn to generalize with theoretical guarantees. To facilitate fast and desirable domain augmentation, we cast the model training in a meta-learning scheme and use a Wasserstein Auto-Encoder (WAE) to relax the widely used worst-case constraint. Detailed theoretical analysis is provided to justify our formulation, while extensive experiments on multiple benchmark datasets indicate its superior performance in tackling single domain generalization.

preprint · 2020 · arXiv

Semantic Graph Convolutional Networks for 3D Human Pose Regression

In this paper, we study the problem of learning Graph Convolutional Networks (GCNs) for regression. Current architectures of GCNs are limited to the small receptive field of convolution filters and a shared transformation matrix for each node. To address these limitations, we propose Semantic Graph Convolutional Networks (SemGCN), a novel neural network architecture that operates on regression tasks with graph-structured data. SemGCN learns to capture semantic information such as local and global node relationships, which is not explicitly represented in the graph. These semantic relationships can be learned through end-to-end training from the ground truth without additional supervision or hand-crafted rules. We further investigate applying SemGCN to 3D human pose regression. Our formulation is intuitive and sufficient since both 2D and 3D human poses can be represented as a structured graph encoding the relationships between joints in the skeleton of a human body. We carry out comprehensive studies to validate our method. The results show that SemGCN outperforms the state of the art while using 90% fewer parameters.

preprint · 2020 · arXiv

Twin-Timescale Radio Resource Management for Ultra-Reliable and Low-Latency Vehicular Networks

To efficiently support safety-related vehicular applications, the ultra-reliable and low-latency communication (URLLC) concept has become an indispensable component of vehicular networks (VNETs). Due to the high mobility of VNETs, exchanging near-instantaneous channel state information (CSI) and making reliable resource allocation decisions based on such short-term CSI evaluations are not practical. In this paper, we consider the downlink of a vehicle-to-infrastructure (V2I) system conceived for URLLC based on idealized perfect and realistic imperfect CSI. By exploiting the benefits of the massive MIMO concept, a two-stage radio resource allocation problem is formulated based on a novel twin-timescale perspective for avoiding the frequent exchange of near-instantaneous CSI. Specifically, based on the prevalent road-traffic density, Stage 1 is constructed for minimizing the worst-case transmission latency on a long-term timescale. In Stage 2, the base station allocates the total power at a short-term timescale according to the large-scale fading CSI encountered for minimizing the maximum transmission latency across all vehicular users. Then, a primary algorithm and a secondary algorithm are conceived for our V2I URLLC system to find the optimal solution of the twin-timescale resource allocation problem, with special emphasis on the complexity imposed. Finally, our simulation results show that the proposed resource allocation scheme significantly reduces the maximum transmission latency, and it is not sensitive to the fluctuation of road-traffic density.

preprint · 2019 · arXiv

Optimal Power Flow in Hybrid AC and Multi-terminal HVDC Networks with Offshore Wind Farm Integration Based on Semidefinite Programming

Multi-terminal high voltage direct current (MT-HVDC) technology is a promising technology for offshore wind farm integration, which requires new control and operation schemes. Therefore, the optimal power flow problem for this system is important to achieve the optimal economic operation. In this paper, a convex relaxation model based on semidefinite programming for the MT-HVDC system considering DC/DC converters is proposed to solve the optimal power flow problem. A hybrid AC and MT-HVDC system for offshore wind farm integration is used for the test. The simulation results validate the effectiveness of the proposed model and guarantee that the global optimum solution is achieved.