Researcher profile

Hao Luo

Hao Luo contributes to research discovery and scholarly infrastructure.

Researcher | Affiliation not imported | Open to collaborate

Trust snapshot

Quick read

Trust 21 - Emerging | Verification L1 | Unclaimed author
21 works
0 followers
12 topics
4 close collaborators

Published work

21 published item(s)

preprint | 2026 | arXiv

Being-H0.7: A Latent World-Action Model from Egocentric Videos

Visual-Language-Action models (VLAs) have advanced generalist robot control by mapping multimodal observations and language instructions directly to actions, but sparse action supervision often encourages shortcut mappings rather than representations of dynamics, contact, and task progress. Recent world-action models introduce future prediction through video rollouts, yet pixel-space prediction is a costly and indirect substrate for control, as it may model visual details irrelevant to action generation and introduces substantial training or inference overhead. We present Being-H0.7, a latent world-action model that brings future-aware reasoning into VLA-style policies without generating future frames. Being-H0.7 inserts learnable latent queries between perception and action as a compact reasoning interface, and trains them with a future-informed dual-branch design: a deployable prior branch infers latent states from the current context, while a training-only posterior branch replaces the queries with embeddings from future observations. Jointly aligning the two branches at the latent reasoning space leads the prior branch to reason future-aware, action-useful structure from current observations alone. At inference, Being-H0.7 discards the posterior branch and performs no visual rollout. Experiments across six simulation benchmarks and diverse real-world tasks show that Being-H0.7 achieves state-of-the-art or comparable performance, combining the predictive benefits of world models with the efficiency and deployability of direct VLA policies.

preprint | 2026 | arXiv

DyDiT++: Diffusion Transformers with Timestep and Spatial Dynamics for Efficient Visual Generation

Diffusion Transformer (DiT), an emerging diffusion model for visual generation, has demonstrated superior performance but suffers from substantial computational costs. Our investigations reveal that these costs primarily stem from the static inference paradigm, which inevitably introduces redundant computation in certain diffusion timesteps and spatial regions. To overcome this inefficiency, we propose Dynamic Diffusion Transformer (DyDiT), an architecture that dynamically adjusts its computation along both timestep and spatial dimensions. Building on these designs, we present an extended version, DyDiT++, with improvements in three key aspects. First, it extends the generation mechanism of DyDiT beyond diffusion to flow matching, demonstrating that our method can also accelerate flow-matching-based generation, enhancing its versatility. Furthermore, we enhance DyDiT to tackle more complex visual generation tasks, including video generation and text-to-image generation, thereby broadening its real-world applications. Finally, to address the high cost of full fine-tuning and democratize technology access, we investigate the feasibility of training DyDiT in a parameter-efficient manner and introduce timestep-based dynamic LoRA (TD-LoRA). Extensive experiments on diverse visual generation models, including DiT, SiT, Latte, and FLUX, demonstrate the effectiveness of DyDiT++. Remarkably, with <3% additional fine-tuning iterations, our approach reduces the FLOPs of DiT-XL by 51%, yielding 1.73x realistic speedup on hardware, and achieves a competitive FID score of 2.07 on ImageNet. The code is available at https://github.com/alibaba-damo-academy/DyDiT.
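The two figures quoted at the end of this abstract (51% fewer FLOPs, 1.73x measured speedup) can be related with a short calculation. The sketch below assumes, purely for illustration, that runtime would scale linearly with FLOPs in the ideal case; the variable names are ours, not the paper's.

```python
# Figures taken from the abstract.
flops_reduction = 0.51      # fraction of DiT-XL FLOPs removed
measured_speedup = 1.73     # reported realistic speedup on hardware

# Assumption: in the ideal case runtime is proportional to FLOPs.
ideal_speedup = 1.0 / (1.0 - flops_reduction)

# Share of the ideal speedup that is realized on hardware.
realized_fraction = measured_speedup / ideal_speedup

print(f"ideal: {ideal_speedup:.2f}x, realized: {realized_fraction:.0%}")
```

Under that assumption, the measured speedup recovers most, but not all, of what the FLOPs reduction alone would suggest.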

preprint | 2026 | arXiv

MedGround: Bridging the Evidence Gap in Medical Vision-Language Models with Verified Grounding Data

Vision-Language Models (VLMs) can generate convincing clinical narratives, yet frequently struggle to visually ground their statements. We posit this limitation arises from the scarcity of high-quality, large-scale clinical referring-localization pairs. To address this, we introduce MedGround, an automated pipeline that transforms segmentation resources into high-quality medical referring grounding data. Leveraging expert masks as spatial anchors, MedGround precisely derives localization targets, extracts shape and spatial cues, and guides VLMs to synthesize natural, clinically grounded queries that reflect morphology and location. To ensure data rigor, a multi-stage verification system integrates strict formatting checks, geometry- and medical-prior rules, and image-based visual judging to filter out ambiguous or visually unsupported samples. Finally, we present MedGround-35K, a novel multimodal medical dataset. Extensive experiments demonstrate that VLMs trained with MedGround-35K consistently achieve improved referring grounding performance, enhance multi-object semantic disambiguation, and exhibit strong generalization to unseen grounding settings. This work highlights MedGround as a scalable, data-driven approach to anchor medical language to verifiable visual evidence. Dataset and code will be released publicly upon acceptance.

preprint | 2026 | arXiv

Mistake Notebook Learning: Batch-Clustered Failures for Training-Free Agent Adaptation

With the growing adoption of Large Language Model (LLM) agents in persistent, real-world roles, they naturally encounter continuous streams of tasks and inevitable failures. A key limitation, however, is their inability to systematically learn from these mistakes, forcing them to repeat identical errors in similar contexts. Unlike prior training-free methods that primarily store raw instance-level experience or focus on retrieving successful trajectories, we propose Mistake Notebook Learning (MNL), a novel memory framework that enables agents to self-curate generalizable guidance from batch-clustered failures. This mechanism allows agents to distill shared error patterns into structured "mistake notes," updating an external memory only when batch performance improves to ensure stability. To further amplify adaptability, we integrate MNL with test-time scaling, leveraging aggregated failure patterns to actively steer the search process away from known pitfalls. Experiments on mathematical reasoning, Text-to-SQL, and interactive agent benchmarks show that MNL achieves competitive performance compared to existing memory mechanisms and in-context methods in both effectiveness and efficiency. These findings position structured mistake abstraction as a critical lever for robust agent evolution, enabling continuous improvement without the cost of parameter updates. The code is available at https://github.com/Bairong-Xdynamics/MistakeNotebookLearning/tree/main.
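The rule of updating the external memory only when batch performance improves can be sketched as a simple accept/reject gate. This is a minimal illustration, not the authors' implementation: `gated_update`, `toy_score`, and the example notes are hypothetical names and data invented here.

```python
from typing import Callable, List


def gated_update(
    memory: List[str],
    candidate_note: str,
    batch_score: Callable[[List[str]], float],
) -> List[str]:
    """Return the memory with the note added only if the batch score improves."""
    baseline = batch_score(memory)
    proposal = memory + [candidate_note]
    if batch_score(proposal) > baseline:
        return proposal   # accept: the note helped on this batch
    return memory         # reject: keep the memory unchanged for stability


def toy_score(notes: List[str]) -> float:
    """Hypothetical scorer: fraction of batch tasks covered by a stored note."""
    needed = ["check units", "quote identifiers"]
    return sum(n in notes for n in needed) / len(needed)


memory = gated_update([], "check units", toy_score)            # accepted
memory = gated_update(memory, "unrelated advice", toy_score)   # rejected
print(memory)
```

In a real agent the scorer would re-run the batch of tasks with the proposed memory, which is the expensive part this gate is meant to justify.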

preprint | 2024 | arXiv

Attribute Fusion-based Evidential Classifier on Quantum Circuits

Dempster-Shafer Theory (DST), an effective and robust framework for handling uncertain information, is applied in decision-making and pattern classification. Unfortunately, its real-time application is limited by its exponential computational complexity. Researchers have attempted to address the issue by taking advantage of its mathematical consistency with quantum computing to implement DST operations on quantum circuits and realize a speedup. However, the progress so far is still impractical for supporting large-scale DST applications. In this paper, we find that Boolean algebra, as an essential mathematical tool, bridges the definition of DST and quantum computing. Based on this discovery, we establish a flexible framework mapping any set-theoretically defined DST operation to a corresponding quantum circuit for implementation. More critically, this new framework is not only uniform but also enables exponential acceleration of computation and is capable of handling complex applications. Focusing on classification tasks, we build on a classical attribute fusion algorithm to put forward a quantum evidential classifier, where quantum mass functions for attributes are generated with a simple method and the proposed framework is applied to fuse the attribute evidence. Compared to previous methods, the proposed quantum classifier reduces the computational complexity exponentially, to linear. Tests on real datasets validate its feasibility.

preprint | 2022 | arXiv

A primal-dual flow for affine constrained convex optimization

We introduce a novel primal-dual flow for affine constrained convex optimization problems. As a modification of the standard saddle-point system, our primal-dual flow is proved to possess the exponential decay property, in terms of a tailored Lyapunov function. Two primal-dual methods are then obtained from numerical discretizations of the continuous model, and a global nonergodic linear convergence rate is established via a discrete Lyapunov function. Instead of solving the subproblem of the primal variable, we apply the semi-smooth Newton iteration to the subproblem with respect to the multiplier, provided that there are some additional properties such as semi-smoothness and sparsity. In particular, numerical tests on the linearly constrained $l_1$-$l_2$ minimization and the total-variation based image denoising model are provided.
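For orientation, the standard saddle-point system that this abstract takes as its starting point can be discretized with explicit Euler steps. The sketch below shows only that textbook baseline, not the modified flow or the methods proposed in the paper. It assumes the toy problem of minimizing 0.5*||x||^2 subject to a.x = b; the function name and step size are illustrative choices.

```python
from typing import List, Tuple


def saddle_point_euler(
    a: List[float], b: float, step: float = 0.1, iters: int = 500
) -> Tuple[List[float], float]:
    """Explicit Euler on x' = -(x + a*lam), lam' = a.x - b.

    This is the gradient flow of the Lagrangian
    L(x, lam) = 0.5*||x||^2 + lam*(a.x - b): descent in x, ascent in lam.
    """
    x = [0.0] * len(a)
    lam = 0.0
    for _ in range(iters):
        residual = sum(ai * xi for ai, xi in zip(a, x)) - b   # constraint violation
        x_new = [xi - step * (xi + ai * lam) for ai, xi in zip(a, x)]
        lam = lam + step * residual                           # uses the old x
        x = x_new
    return x, lam


x, lam = saddle_point_euler([1.0, 1.0], 2.0)
print(x, lam)   # approaches x = [1, 1], lam = -1 for this toy problem
```

For this instance the exact solution is x = (1, 1) with multiplier -1, which follows from the optimality condition x + a*lam = 0 together with the constraint.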

preprint | 2022 | arXiv

Accelerated differential inclusion for convex optimization

This paper introduces a second-order differential inclusion for unconstrained convex optimization. At the continuous level, existence of a solution in a proper sense is obtained, and exponential decay of a novel Lyapunov function along the solution trajectory is derived as well. Then, at the discrete level, based on numerical discretizations of the continuous model, two inexact proximal point algorithms are proposed, and some new convergence rates are established via a discrete Lyapunov function.

preprint | 2022 | arXiv

Accelerated primal-dual methods for linearly constrained convex optimization problems

This work proposes an accelerated primal-dual dynamical system for affine constrained convex optimization and presents a class of primal-dual methods with nonergodic convergence rates. At the continuous level, exponential decay of a novel Lyapunov function is established. At the discrete level, implicit, semi-implicit and explicit numerical discretizations of the continuous model are considered in turn and lead to new accelerated primal-dual methods for solving linearly constrained optimization problems. Special structures of the subproblems in those schemes are utilized to develop efficient inner solvers. In addition, nonergodic convergence rates in terms of primal-dual gap, primal objective residual and feasibility violation are proved via a tailored discrete Lyapunov function. Moreover, our method has also been applied to decentralized distributed optimization for fast and efficient solution.

preprint | 2022 | arXiv

An efficient semismooth Newton-AMG-based inexact primal-dual algorithm for generalized transport problems

This work is concerned with an efficient optimization method for solving a large class of optimal mass transport problems. An inexact primal-dual algorithm is derived from the time discretization of a proper dynamical system, and, using a Lyapunov function, the global (super-)linear convergence rate is established for function residual and feasibility violation. The proposed algorithm contains an inner problem that possesses a strong semismoothness property, which motivates the use of the semismooth Newton iteration. By exploring the hidden structure of the problem itself, the linear system arising from the Newton iteration is transformed equivalently into a graph Laplacian system, for which a robust algebraic multigrid method is proposed and also analyzed via the famous Xu-Zikatanov identity. Finally, numerical experiments are provided to validate the efficiency of our method.

preprint | 2022 | arXiv

Dynamic Gradient Reactivation for Backward Compatible Person Re-identification

We study the backward compatible problem for person re-identification (Re-ID), which aims to constrain the features of an updated new model to be comparable with the existing features from the old model in galleries. Most existing works adopt distillation-based methods, which focus on pushing new features to imitate the distribution of the old ones. However, distillation-based methods are intrinsically sub-optimal since they force the new feature space to imitate the inferior old feature space. To address this issue, we propose Ranking-based Backward Compatible Learning (RBCL), which directly optimizes the ranking metric between new features and old features. Different from previous methods, RBCL only pushes the new features to find best-ranking positions in the old feature space instead of enforcing strict alignment, and is in line with the ultimate goal of backward retrieval. However, the sharp sigmoid function used to make the ranking metric differentiable also incurs the vanishing gradient issue and therefore stalls ranking refinement during the later period of training. To address this issue, we propose Dynamic Gradient Reactivation (DGR), which can reactivate the suppressed gradients by adding a dynamically computed constant during the forward step. To further help target the best-ranking positions, we include Neighbor Context Agents (NCAs) to approximate the entire old feature space during training. Unlike previous works, which only test on in-domain settings, we make the first attempt to introduce cross-domain settings (both supervised and unsupervised), which are more meaningful and difficult. The experimental results on all five settings show that the proposed RBCL outperforms previous state-of-the-art methods by large margins.
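The vanishing gradient issue mentioned in this abstract can be seen directly from the derivative of a sharp sigmoid. The sketch below only illustrates that problem; it does not implement DGR. The temperature values and the name `soft_step_grad` are assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_step_grad(margin: float, temperature: float) -> float:
    """Derivative of sigmoid(margin / temperature) with respect to margin."""
    s = sigmoid(margin / temperature)
    return s * (1.0 - s) / temperature


# A sharp sigmoid (small temperature) approximates the step function used in
# ranking metrics, but its gradient collapses away from margin = 0.
sharp_at_zero = soft_step_grad(0.0, temperature=0.01)
sharp_far = soft_step_grad(0.5, temperature=0.01)
smooth_far = soft_step_grad(0.5, temperature=1.0)
print(sharp_at_zero, sharp_far, smooth_far)
```

The sharp variant gives a large gradient only for pairs whose scores are nearly tied, so pairs that are clearly mis-ranked or clearly well-ranked contribute almost nothing to the update.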

preprint | 2022 | arXiv

Error estimation of a discontinuous Galerkin method for time fractional subdiffusion problems with nonsmooth data

This paper is devoted to the numerical analysis of a piecewise constant discontinuous Galerkin method for time fractional subdiffusion problems. The regularity of the weak solution is first established using a variational approach and the Mittag-Leffler function. Then several optimal error estimates are derived for low-regularity data. Finally, numerical experiments are conducted to verify the theoretical results.

preprint | 2022 | arXiv

From differential equation solvers to accelerated first-order methods for convex optimization

Convergence analysis of accelerated first-order methods for convex optimization problems is presented from the point of view of ordinary differential equation solvers. A new dynamical system, called Nesterov accelerated gradient flow, is derived from the connection between the acceleration mechanism and $A$-stability of ODE solvers, and the exponential decay of a tailored Lyapunov function along the solution trajectory is proved. Numerical discretizations are then considered and convergence rates are established via a unified discrete Lyapunov function. The proposed differential equation solver approach can not only cover existing accelerated methods, such as FISTA, Güler's proximal algorithm and Nesterov's accelerated gradient method, but also produce new algorithms for composite convex optimization that possess accelerated convergence rates.
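As a point of reference for the existing methods named in this abstract, the sketch below runs the textbook form of Nesterov's accelerated gradient method on a small quadratic. It is not the new flow or any algorithm derived in the paper. The momentum weight (k-1)/(k+2), the step size 1/L, and the test problem are standard choices assumed here for illustration.

```python
from typing import List


def nesterov_quadratic(diag: List[float], x0: List[float], steps: int) -> List[float]:
    """Minimize f(x) = 0.5 * sum(diag[i] * x[i]**2) with Nesterov momentum."""
    lipschitz = max(diag)        # gradient Lipschitz constant of f
    s = 1.0 / lipschitz          # step size
    x_prev = list(x0)
    y = list(x0)
    for k in range(1, steps + 1):
        grad = [d * yi for d, yi in zip(diag, y)]
        x = [yi - s * g for yi, g in zip(y, grad)]            # gradient step at y
        momentum = (k - 1) / (k + 2)
        y = [xi + momentum * (xi - xp) for xi, xp in zip(x, x_prev)]
        x_prev = x
    return x_prev


def objective(diag: List[float], x: List[float]) -> float:
    return 0.5 * sum(d * xi * xi for d, xi in zip(diag, x))


diag, x0, steps = [1.0, 100.0], [1.0, 1.0], 200
x_final = nesterov_quadratic(diag, x0, steps)
# Classical O(1/k^2) guarantee for step size 1/L: 2*L*||x0 - x*||^2 / (k+1)^2.
bound = 2.0 * max(diag) * sum(v * v for v in x0) / (steps + 1) ** 2
print(objective(diag, x_final), bound)
```

The printed objective value can be compared against the classical bound, which is the kind of rate the Lyapunov analysis in the abstract is designed to recover.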

preprint | 2022 | arXiv

Nitrogen decoration of basal plane dislocations in 4H-SiC

Basal-plane dislocations (BPDs) pose a great challenge to the reliability of bipolar power devices based on 4H silicon carbide (4H-SiC). It is well established that heavy nitrogen (N) doping promotes the conversion of BPDs to threading edge dislocations (TEDs) and improves the reliability of 4H-SiC-based bipolar power devices. However, the interaction between N and BPDs, and the effect of N on the electronic properties of BPDs, are still ambiguous, which significantly hinders the understanding of the electron-transport mechanism of 4H-SiC-based bipolar power devices. Combining molten-alkali etching and Kelvin probe force microscopy (KPFM) analysis, we demonstrate that BPDs create acceptor-like states in undoped 4H-SiC, while acting as donors in N-doped 4H-SiC. First-principles calculations verify that BPDs create occupied defect states above the valence band maximum (VBM) and unoccupied defect states under the conduction-band minimum (CBM) of undoped 4H-SiC. The electron transfer from the defect states of intrinsic defects and native impurities to the unoccupied defect states of BPDs gives rise to the acceptor-like behavior of BPDs in undoped 4H-SiC. Defect formation energies indicate that N atoms can spontaneously decorate BPDs during the N doping of 4H-SiC. The binding between N and BPD is strong against decomposition. The accumulation of N dopants at the core of BPDs results in the accumulation of donor-like states at the core of BPDs in N-doped 4H-SiC. This work not only enriches the understanding of the electronic behavior of BPDs in N-doped 4H-SiC, but also helps understand the electron transport mechanism of 4H-SiC-based bipolar power devices.

preprint | 2022 | arXiv

Scaled ReLU Matters for Training Vision Transformers

Vision transformers (ViTs) have been an alternative design paradigm to convolutional neural networks (CNNs). However, training ViTs is much harder than training CNNs, as it is sensitive to training parameters such as learning rate, optimizer and warmup epoch. The reasons for the training difficulty are empirically analysed by Xiao et al. (2021), and the authors conjecture that the issue lies with the patchify-stem of ViT models and propose that early convolutions help transformers see better. In this paper, we further investigate this problem and extend the above conclusion: early convolutions alone do not help stabilize training, but the scaled ReLU operation in the convolutional stem (conv-stem) matters. We verify, both theoretically and empirically, that scaled ReLU in the conv-stem not only improves training stabilization, but also increases the diversity of patch tokens, thus boosting peak performance by a large margin while adding few parameters and FLOPs. In addition, extensive experiments are conducted to demonstrate that previous ViTs are far from being well trained, further showing that ViTs have great potential to be a better substitute for CNNs.

preprint | 2020 | arXiv

A Strong Baseline and Batch Normalization Neck for Deep Person Re-identification

This study explores a simple but strong baseline for person re-identification (ReID). Person ReID with deep neural networks has progressed and achieved high performance in recent years. However, many state-of-the-art methods design complex network structures and concatenate multi-branch features. In the literature, some effective training tricks briefly appear in several papers or source codes. The present study collects and evaluates these effective training tricks in person ReID. By combining these tricks, the model achieves 94.5% rank-1 and 85.9% mean average precision on Market1501 using only the global features of ResNet50. The performance surpasses all existing global- and part-based baselines in person ReID. We propose a novel neck structure named batch normalization neck (BNNeck). BNNeck adds a batch normalization layer after the global pooling layer to separate metric and classification losses into two different feature spaces, because we observe they are inconsistent in one embedding space. Extended experiments show that BNNeck can boost the baseline, and our baseline can improve the performance of existing state-of-the-art methods. Our codes and models are available at: https://github.com/michuanhaohao/reid-strong-baseline.
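The BNNeck idea of producing two feature spaces from one pooled feature can be sketched without a deep learning framework. This is a simplified illustration under stated assumptions: batch normalization is applied in training mode without learnable scale or shift, and the assignment of the pre-normalization feature to the metric loss and the post-normalization feature to the classifier reflects our reading of the design rather than anything stated in the abstract. The function names are hypothetical; the released repository is the reference for the real model.

```python
import math
from typing import List, Tuple

Batch = List[List[float]]


def batch_norm(features: Batch, eps: float = 1e-5) -> Batch:
    """Normalize each feature dimension to zero mean and unit variance over the batch."""
    n, d = len(features), len(features[0])
    means = [sum(f[j] for f in features) / n for j in range(d)]
    variances = [sum((f[j] - means[j]) ** 2 for f in features) / n for j in range(d)]
    return [
        [(f[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(d)]
        for f in features
    ]


def bnneck(pooled: Batch) -> Tuple[Batch, Batch]:
    """Return (features for the metric loss, features for the classification loss)."""
    metric_features = pooled                    # assumed input to the metric loss
    classifier_features = batch_norm(pooled)    # assumed input to the classifier
    return metric_features, classifier_features


pooled = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]   # toy globally pooled features
metric_features, classifier_features = bnneck(pooled)
print(classifier_features)
```

Keeping the two losses on different sides of the normalization layer is what lets each one shape its own feature space.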

preprint | 2020 | arXiv

Cross-Spectrum Dual-Subspace Pairing for RGB-infrared Cross-Modality Person Re-Identification

Due to its potential wide applications in video surveillance and other computer vision tasks like tracking, person re-identification (ReID) has become popular and been widely investigated. However, conventional person re-identification can only handle RGB color images, which will fail in dark conditions. Thus RGB-infrared ReID (also known as Infrared-Visible ReID or Visible-Thermal ReID) is proposed. Apart from the appearance discrepancy in traditional ReID caused by illumination, pose variations and viewpoint changes, a modality discrepancy produced by cameras of different spectra also exists, which makes RGB-infrared ReID more difficult. To address this problem, we focus on extracting the shared cross-spectrum features of different modalities. In this paper, a novel multi-spectrum image generation method is proposed and the generated samples are utilized to help the network find discriminative information for re-identifying the same person across modalities. Another challenge of RGB-infrared ReID is that the intra-person (images from the same person) discrepancy is often larger than the inter-person (images from different persons) discrepancy, so a dual-subspace pairing strategy is proposed to alleviate this problem. Combining those two parts, we also design a one-stream neural network to extract compact representations of person images, called the Cross-spectrum Dual-subspace Pairing (CDP) model. Furthermore, during the training process, we also propose a Dynamic Hard Spectrum Mining method to automatically mine more hard samples from hard spectra based on the current model state to further boost the performance. Extensive experimental results on two public datasets, SYSU-MM01 with RGB + near-infrared images and RegDB with RGB + far-infrared images, have demonstrated the efficiency and generality of our proposed method.

preprint | 2020 | arXiv

Multi-Domain Learning and Identity Mining for Vehicle Re-Identification

This paper introduces our solution for Track2 of the AI City Challenge 2020 (AICITY20). Track2 is a vehicle re-identification (ReID) task with both real-world data and synthetic data. Our solution is based on a strong baseline with bag of tricks (BoT-BS) proposed in person ReID. First, we propose a multi-domain learning method that combines the real-world and synthetic data to train the model. Then, we propose the Identity Mining method to automatically generate pseudo labels for a part of the testing data, which is better than k-means clustering. The tracklet-level re-ranking strategy with weighted features is also used to post-process the results. Finally, with a multiple-model ensemble, our method achieves an mAP score of 0.7322, which yields third place in the competition. The codes are available at https://github.com/heshuting555/AICITY2020_DMT_VehicleReID.

preprint | 2020 | arXiv

STNReID: Deep Convolutional Networks with Pairwise Spatial Transformer Networks for Partial Person Re-identification

Partial person re-identification (ReID) is a challenging task because only partial information of person images is available for matching target persons. Few studies, especially on deep learning, have focused on matching partial person images with holistic person images. This study presents a novel deep partial ReID framework based on pairwise spatial transformer networks (STNReID), which can be trained on existing holistic person datasets. STNReID includes a spatial transformer network (STN) module and a ReID module. The STN module samples an affined image (a semantically corresponding patch) from the holistic image to match the partial image. The ReID module extracts the features of the holistic, partial, and affined images. Competition (or confrontation) is observed between the STN module and the ReID module, and two-stage training is applied to acquire a strong STNReID for partial ReID. Experimental results show that our STNReID obtains 66.7% and 54.6% rank-1 accuracies on partial ReID and partial iLIDS datasets, respectively. These values are at par with those obtained with state-of-the-art methods.

preprint | 2020 | arXiv

Structure-Aware Network for Lane Marker Extraction with Dynamic Vision Sensor

Lane marker extraction is a basic yet necessary task for autonomous driving. Although past years have witnessed major advances in lane marker extraction with deep learning models, they all target ordinary RGB images generated by frame-based cameras, which limits their performance in extreme cases, such as large illumination changes. To tackle this problem, we introduce the Dynamic Vision Sensor (DVS), a type of event-based sensor, to the lane marker extraction task and build a high-resolution DVS dataset for lane marker extraction. We collect the raw event data and generate 5,424 DVS images with a resolution of 1280$\times$800 pixels, the highest among all DVS datasets available now. All images are annotated in a multi-class semantic segmentation format. We then propose a structure-aware network for lane marker extraction in DVS images. It can capture directional information comprehensively with multidirectional slice convolution. We evaluate our proposed network against other state-of-the-art lane marker extraction models on this dataset. Experimental results demonstrate that our method outperforms other competitors. The dataset is made publicly available, including the raw event data, accumulated images and labels.

preprint | 2020 | arXiv

Unsupervised Representation Disentanglement using Cross Domain Features and Adversarial Learning in Variational Autoencoder based Voice Conversion

An effective approach for voice conversion (VC) is to disentangle linguistic content from other components in the speech signal. The effectiveness of variational autoencoder (VAE) based VC (VAE-VC), for instance, strongly relies on this principle. In our prior work, we proposed a cross-domain VAE-VC (CDVAE-VC) framework, which utilized acoustic features of different properties, to improve the performance of VAE-VC. We believed that the success came from more disentangled latent representations. In this paper, we extend the CDVAE-VC framework by incorporating the concept of adversarial learning, in order to further increase the degree of disentanglement, thereby improving the quality and similarity of converted speech. More specifically, we first investigate the effectiveness of incorporating the generative adversarial networks (GANs) with CDVAE-VC. Then, we consider the concept of domain adversarial training and add an explicit constraint to the latent representation, realized by a speaker classifier, to explicitly eliminate the speaker information that resides in the latent code. Experimental results confirm that the degree of disentanglement of the learned latent representation can be enhanced by both GANs and the speaker classifier. Meanwhile, subjective evaluation results in terms of quality and similarity scores demonstrate the effectiveness of our proposed methods.

preprint | 2019 | arXiv

Stripe-based and Attribute-aware Network: A Two-Branch Deep Model for Vehicle Re-identification

Vehicle re-identification (Re-ID) has been attracting increasing interest in the field of computer vision due to the growing utilization of surveillance cameras in public security. However, vehicle Re-ID still suffers from a similarity challenge despite the efforts made to solve this problem. This challenge involves distinguishing different instances with nearly identical appearances. In this paper, we propose a novel two-branch stripe-based and attribute-aware deep convolutional neural network (SAN) to learn an efficient feature embedding for the vehicle Re-ID task. The two-branch neural network, consisting of a stripe-based branch and an attribute-aware branch, can adaptively extract the discriminative features from the visual appearance of vehicles. Horizontal average pooling and dimension-reduced convolutional layers are inserted into the stripe-based branch to obtain part-level features. Meanwhile, the attribute-aware branch extracts the global feature under the supervision of vehicle attribute labels to separate similar vehicle identities with different attribute annotations. Finally, the part-level and global features are concatenated to form the final descriptor of the input image for vehicle Re-ID. The final descriptor can not only separate vehicles with different attributes but also distinguish vehicle identities with the same attributes. Extensive experiments on both the VehicleID and VeRi databases show that the proposed SAN method outperforms other state-of-the-art vehicle Re-ID approaches.