Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
39 works
0 followers
20 topics
4 close collaborators

Published work

39 published items

preprint · 2026 · arXiv

Coordinated Pandemic Control with Large Language Model Agents as Policymaking Assistants

Effective pandemic control requires timely and coordinated policymaking across administrative regions that are intrinsically interdependent. However, human-driven responses are often fragmented and reactive, with policies formulated in isolation and adjusted only after outbreaks escalate, undermining proactive intervention and global pandemic mitigation. To address this challenge, here we propose a large language model (LLM) multi-agent policymaking framework that supports coordinated and proactive pandemic control across regions. Within our framework, each administrative region is assigned an LLM agent as an AI policymaking assistant. The agent reasons over region-specific epidemiological dynamics while communicating with other agents to account for cross-regional interdependencies. By integrating real-world data, a pandemic evolution simulator, and structured inter-agent communication, our framework enables agents to jointly explore counterfactual intervention scenarios and synthesize coordinated policy decisions through a closed-loop simulation process. We validate the proposed framework using state-level COVID-19 data from the United States between April and December 2020, together with real-world mobility records and observed policy interventions. Compared with real-world pandemic outcomes, our approach reduces cumulative infections and deaths by up to 63.7% and 40.1%, respectively, at the individual state level, and by 39.0% and 27.0%, respectively, when aggregated across states. These results demonstrate that LLM multi-agent systems can enable more effective pandemic control with coordinated policymaking...

preprint · 2026 · arXiv

RoboTransfer: Controllable Geometry-Consistent Video Diffusion for Manipulation Policy Transfer

The goal of general-purpose robotics is to create agents that can seamlessly adapt to and operate in diverse, unstructured human environments. Imitation learning has become a key paradigm for robotic manipulation, yet collecting large-scale and diverse demonstrations is prohibitively expensive. Simulators provide a cost-effective alternative, but the sim-to-real gap remains a major obstacle to scalability. We present RoboTransfer, a diffusion-based video generation framework for synthesizing robotic data. By leveraging cross-view feature interactions and globally consistent 3D geometry, RoboTransfer ensures multi-view geometric consistency while enabling fine-grained control over scene elements, such as background editing and object replacement. Extensive experiments demonstrate that RoboTransfer produces videos with superior geometric consistency and visual fidelity. Furthermore, policies trained on this synthetic data exhibit enhanced generalization to novel, unseen scenarios. Project page: https://horizonrobotics.github.io/robot_lab/robotransfer.

preprint · 2026 · arXiv

Spatial Multi-Task Learning for Breast Cancer Molecular Subtype Prediction from Single-Phase DCE-MRI

Accurate molecular subtype classification is essential for personalized breast cancer treatment, yet conventional immunohistochemical analysis relies on invasive biopsies and is prone to sampling bias. Although dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) enables non-invasive tumor characterization, clinical workflows typically acquire only single-phase post-contrast images to reduce scan time and contrast agent dose. In this study, we propose a spatial multi-task learning framework for breast cancer molecular subtype prediction from clinically practical single-phase DCE-MRI. The framework simultaneously predicts estrogen receptor (ER), progesterone receptor (PR), human epidermal growth factor receptor 2 (HER2) status, and the Ki-67 proliferation index, biomarkers that collectively define molecular subtypes. The architecture integrates a deep feature extraction network with multi-scale spatial attention to capture intratumoral and peritumoral characteristics, together with a region-of-interest weighting module that emphasizes the tumor core, rim, and surrounding tissue. Multi-task learning exploits biological correlations among biomarkers through shared representations with task-specific prediction branches. Experiments on a dataset of 960 cases (886 internal cases split 7:1:2 for training/validation/testing, and 74 external cases evaluated via five-fold cross-validation) demonstrate that the proposed method achieves an AUC of 0.893, 0.824, and 0.857 for ER, PR, and HER2 classification, respectively, and a mean absolute error of 8.2% for Ki-67 regression, significantly outperforming radiomics and single-task deep learning baselines. These results indicate the feasibility of accurate, non-invasive molecular subtype prediction using standard imaging protocols.

preprint · 2026 · arXiv

TokenSeg: Efficient 3D Medical Image Segmentation via Hierarchical Visual Token Compression

Three-dimensional medical image segmentation is a fundamental yet computationally demanding task due to the cubic growth of voxel processing and the redundant computation on homogeneous regions. To address these limitations, we propose TokenSeg, a boundary-aware sparse token representation framework for efficient 3D medical volume segmentation. Specifically, (1) we design a multi-scale hierarchical encoder that extracts 400 candidate tokens across four resolution levels to capture both global anatomical context and fine boundary details; (2) we introduce a boundary-aware tokenizer that combines VQ-VAE quantization with importance scoring to select 100 salient tokens, over 60% of which lie near tumor boundaries; and (3) we develop a sparse-to-dense decoder that reconstructs full-resolution masks through token reprojection, progressive upsampling, and skip connections. Extensive experiments on a 3D breast DCE-MRI dataset comprising 960 cases demonstrate that TokenSeg achieves state-of-the-art performance with 94.49% Dice and 89.61% IoU, while reducing GPU memory and inference latency by 64% and 68%, respectively. To verify generalization, we also evaluate on the MSD cardiac and brain MRI benchmark datasets, where TokenSeg consistently delivers the best performance across heterogeneous anatomical structures. These results highlight the effectiveness of anatomically informed sparse representation for accurate and efficient 3D medical image segmentation.
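The selection step in (2), keeping 100 of 400 candidate tokens, can be pictured as a top-k filter over per-token importance scores. The following is a minimal sketch under that assumption; `select_salient_tokens` and the toy scores are hypothetical, and the VQ-VAE quantization and learned scoring described in the abstract are not reproduced.

```python
import heapq

def select_salient_tokens(scores, k=100):
    """Return the indices of the k highest-scoring tokens, in ascending index order."""
    if k > len(scores):
        raise ValueError("k cannot exceed the number of candidate tokens")
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)

# Toy scores for 400 candidate tokens (made up for illustration only).
scores = [((i * 37) % 400) / 400.0 for i in range(400)]
kept = select_salient_tokens(scores, k=100)
print(len(kept))
```

Returning indices rather than token values keeps the spatial positions available for a later reprojection step.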

preprint · 2023 · arXiv

Detachable Novel Views Synthesis of Dynamic Scenes Using Distribution-Driven Neural Radiance Fields

Representing and synthesizing novel views in real-world dynamic scenes from casual monocular videos is a long-standing problem. Existing solutions typically approach dynamic scenes by applying geometry techniques or utilizing temporal information between several adjacent frames without considering the underlying background distribution in the entire scene or the transmittance over the ray dimension, limiting their performance on static and occlusion areas. Our approach, Distribution-Driven neural radiance fields, offers high-quality view synthesis and a 3D solution to Detach the background from the entire Dynamic scene, and is called D4NeRF. Specifically, it employs a neural representation to capture the scene distribution in the static background and a 6D-input NeRF to represent dynamic objects, respectively. Each ray sample is given an additional occlusion weight to indicate the transmittance lying in the static and dynamic components. We evaluate D4NeRF on public dynamic scenes and our urban driving scenes acquired from an autonomous-driving dataset. Extensive experiments demonstrate that our approach outperforms previous methods in rendering texture details and motion areas while also producing a clean static background. Our code will be released at https://github.com/Luciferbobo/D4NeRF.

preprint · 2022 · arXiv

A Simple Baseline for Multi-Camera 3D Object Detection

3D object detection with surrounding cameras has been a promising direction for autonomous driving. In this paper, we present SimMOD, a Simple baseline for Multi-camera Object Detection, to solve the problem. To incorporate multi-view information as well as build upon previous efforts on monocular 3D object detection, the framework is built on sample-wise object proposals and designed to work in a two-stage manner. First, we extract multi-scale features and generate perspective object proposals on each monocular image. Second, the multi-view proposals are aggregated and then iteratively refined with multi-view and multi-scale visual features in the DETR3D style. The refined proposals are decoded end-to-end into the detection results. To further boost performance, we add auxiliary branches alongside the proposal generation to enhance feature learning. We also design target filtering and teacher forcing methods to promote the consistency of two-stage training. We conduct extensive experiments on the nuScenes 3D object detection benchmark to demonstrate the effectiveness of SimMOD and achieve new state-of-the-art performance. Code will be available at https://github.com/zhangyp15/SimMOD.

preprint · 2022 · arXiv

An Efficient Training Approach for Very Large Scale Face Recognition

Face recognition has achieved significant progress in the deep learning era due to ultra-large-scale and well-labeled datasets. However, training on such outsize datasets is time-consuming and takes up a lot of hardware resources. Therefore, designing an efficient training approach is indispensable. The heavy computational and memory costs mainly result from the million-level dimensionality of the fully connected (FC) layer. To this end, we propose a novel training approach, termed Faster Face Classification (F2C), to reduce time and cost without sacrificing performance. This method adopts a Dynamic Class Pool (DCP) for storing and updating the identity features dynamically, which can be regarded as a substitute for the FC layer. DCP saves time and cost because it is smaller and independent of the full set of face identities. We further validate the proposed F2C method across several face benchmarks and private datasets, and show comparable recognition accuracy while training faster and at lower hardware cost than state-of-the-art FC-based methods. Moreover, our method is further improved by a well-designed dual data loader, comprising identity-based and instance-based loaders, which makes updating the DCP parameters more efficient.

preprint · 2022 · arXiv

BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View

Autonomous driving perceives its surroundings for decision making, which is one of the most complex scenarios in visual perception. The success of paradigm innovation in solving the 2D object detection task inspires us to seek an elegant, feasible, and scalable paradigm for fundamentally pushing the performance boundary in this area. To this end, we contribute the BEVDet paradigm in this paper. BEVDet performs 3D object detection in Bird-Eye-View (BEV), where most target values are defined and route planning can be handily performed. We merely reuse existing modules to build its framework but substantially develop its performance by constructing an exclusive data augmentation strategy and upgrading the Non-Maximum Suppression strategy. In the experiment, BEVDet offers an excellent trade-off between accuracy and time-efficiency. As a fast version, BEVDet-Tiny scores 31.2% mAP and 39.2% NDS on the nuScenes val set. It is comparable with FCOS3D but requires just 11% of the computational budget (215.3 GFLOPs) and runs 9.2 times faster, at 15.6 FPS. Another high-precision version dubbed BEVDet-Base scores 39.3% mAP and 47.2% NDS, significantly exceeding all published results. With a comparable inference speed, it surpasses FCOS3D by a large margin of +9.8% mAP and +10.0% NDS. The source code is publicly available for further research at https://github.com/HuangJunJie2017/BEVDet.

preprint · 2022 · arXiv

BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving

In this paper, we present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems. Unlike existing studies focusing on the improvement of single-task approaches, BEVerse produces spatio-temporal Birds-Eye-View (BEV) representations from multi-camera videos and jointly reasons about multiple tasks for vision-centric autonomous driving. Specifically, BEVerse first performs shared feature extraction and lifting to generate 4D BEV representations from multi-timestamp and multi-view images. After the ego-motion alignment, the spatio-temporal encoder is utilized for further feature extraction in BEV. Finally, multiple task decoders are attached for joint reasoning and prediction. Within the decoders, we propose the grid sampler to generate BEV features with different ranges and granularities for different tasks. Also, we design the method of iterative flow for memory-efficient future prediction. We show that the temporal information improves 3D object detection and semantic map construction, while the multi-task learning can implicitly benefit motion prediction. With extensive experiments on the nuScenes dataset, we show that the multi-task BEVerse outperforms existing single-task methods on 3D object detection, semantic map construction, and motion prediction. Compared with the sequential paradigm, BEVerse is also significantly more efficient. The code and trained models will be released at https://github.com/zhangyp15/BEVerse.

preprint · 2022 · arXiv

CAFE: Learning to Condense Dataset by Aligning Features

Dataset condensation aims at reducing the network training effort through condensing a cumbersome training set into a compact synthetic one. State-of-the-art approaches largely rely on learning the synthetic data by matching the gradients between the real and synthetic data batches. Despite the intuitive motivation and promising results, such gradient-based methods, by nature, easily overfit to a biased set of samples that produce dominant gradients, and thus lack global supervision of data distribution. In this paper, we propose a novel scheme to Condense dataset by Aligning FEatures (CAFE), which explicitly attempts to preserve the real-feature distribution as well as the discriminant power of the resulting synthetic set, lending itself to strong generalization capability to various architectures. At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales, while accounting for the classification of real samples. Our scheme is further backed up by a novel dynamic bi-level optimization, which adaptively adjusts parameter updates to prevent over-/under-fitting. We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art: on the SVHN dataset, for example, the performance gain is up to 11%. Extensive experiments and analyses verify the effectiveness and necessity of proposed designs.
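The idea of aligning features across scales can be illustrated with a simple loss that compares the mean feature of a real batch with the mean feature of a synthetic batch at each layer. This is a minimal sketch under that assumption; `mean_vector`, `alignment_loss`, and the toy features are hypothetical, and the paper's discrimination term and bi-level optimization are not shown.

```python
def mean_vector(features):
    """Average a batch of equal-length feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def alignment_loss(real_layers, synth_layers):
    """Sum over layers of the squared distance between real and synthetic mean features."""
    loss = 0.0
    for real, synth in zip(real_layers, synth_layers):
        mu_r, mu_s = mean_vector(real), mean_vector(synth)
        loss += sum((a - b) ** 2 for a, b in zip(mu_r, mu_s))
    return loss

# One layer, two real samples and one synthetic sample (toy values).
real = [[[1.0, 2.0], [3.0, 4.0]]]
synth = [[[2.0, 2.0]]]
print(alignment_loss(real, synth))
```

Matching means rather than gradients is what gives the synthetic set a view of the whole feature distribution instead of only the samples with dominant gradients.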

preprint · 2022 · arXiv

Crafting Better Contrastive Views for Siamese Representation Learning

Recent self-supervised contrastive learning methods greatly benefit from the Siamese structure that aims at minimizing distances between positive pairs. For high performance Siamese representation learning, one of the keys is to design good contrastive pairs. Most previous works simply apply random sampling to make different crops of the same image, which overlooks semantic information and may degrade the quality of views. In this work, we propose ContrastiveCrop, which could effectively generate better crops for Siamese representation learning. Firstly, a semantic-aware object localization strategy is proposed within the training process in a fully unsupervised manner. This guides us to generate contrastive views which could avoid most false positives (i.e., object vs. background). Moreover, we empirically find that views with similar appearances are trivial for the Siamese model training. Thus, a center-suppressed sampling is further designed to enlarge the variance of crops. Remarkably, our method carefully considers positive pairs for contrastive learning with negligible extra training overhead. As a plug-and-play and framework-agnostic module, ContrastiveCrop consistently improves SimCLR, MoCo, BYOL, and SimSiam by 0.4% to 2.0% classification accuracy on CIFAR-10, CIFAR-100, Tiny ImageNet and STL-10. Superior results are also achieved on downstream detection and segmentation tasks when pre-trained on ImageNet-1K.
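One way to picture center-suppressed sampling is to draw crop centers inside a localization box from a U-shaped distribution, so that centers near the middle of the box are less likely. The sketch below assumes a Beta(alpha, alpha) density with alpha < 1 for this purpose; the abstract does not state the distribution, and `sample_crop_center` and its parameters are hypothetical.

```python
import random

def sample_crop_center(box, alpha=0.6, rng=random):
    """Sample a crop center inside box = (x0, y0, x1, y1).

    Beta(alpha, alpha) with alpha < 1 is U-shaped, so sampled centers fall
    near the box edges more often than near its middle.
    """
    x0, y0, x1, y1 = box
    u = rng.betavariate(alpha, alpha)
    v = rng.betavariate(alpha, alpha)
    return x0 + u * (x1 - x0), y0 + v * (y1 - y0)

# Draw a few centers from a hypothetical 40 x 40 object box.
rng = random.Random(0)
centers = [sample_crop_center((10, 10, 50, 50), rng=rng) for _ in range(5)]
print(centers)
```

Two crops drawn this way are less likely to be centered on the same spot, which increases the variance between the views of a positive pair.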

preprint · 2022 · arXiv

Crafting Monocular Cues and Velocity Guidance for Self-Supervised Multi-Frame Depth Learning

Self-supervised monocular methods can efficiently learn depth information of weakly textured surfaces or reflective objects. However, the depth accuracy is limited due to the inherent ambiguity in monocular geometric modeling. In contrast, multi-frame depth estimation methods improve the depth accuracy thanks to the success of Multi-View Stereo (MVS), which directly makes use of geometric constraints. Unfortunately, MVS often suffers from texture-less regions, non-Lambertian surfaces, and moving objects, especially in real-world video sequences without known camera motion and depth supervision. Therefore, we propose MOVEDepth, which exploits the MOnocular cues and VElocity guidance to improve multi-frame Depth learning. Unlike existing methods that enforce consistency between MVS depth and monocular depth, MOVEDepth boosts multi-frame depth learning by directly addressing the inherent problems of MVS. The key to our approach is to utilize monocular depth as a geometric priority to construct the MVS cost volume, and to adjust the depth candidates of the cost volume under the guidance of predicted camera velocity. We further fuse monocular depth and MVS depth by learning uncertainty in the cost volume, which results in a robust depth estimation against ambiguity in multi-view geometry. Extensive experiments show MOVEDepth achieves state-of-the-art performance: compared with Monodepth2 and PackNet, our method relatively improves the depth accuracy by 20% and 19.8% on the KITTI benchmark. MOVEDepth also generalizes to the more challenging DDAD benchmark, relatively outperforming ManyDepth by 7.2%. The code is available at https://github.com/JeffWang987/MOVEDepth.

preprint · 2022 · arXiv

Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing

This paper probes intrinsic factors behind typical failure cases (e.g., spatial inconsistency and boundary confusion) produced by the existing state-of-the-art method in face parsing. To tackle these problems, we propose a novel Decoupled Multi-task Learning with Cyclical Self-Regulation (DML-CSR) for face parsing. Specifically, DML-CSR designs a multi-task model which comprises face parsing, binary edge, and category edge detection. These tasks only share low-level encoder weights without high-level interactions between each other, which allows the auxiliary modules to be decoupled from the whole network at the inference stage. To address spatial inconsistency, we develop a dynamic dual graph convolutional network to capture global contextual information without using any extra pooling operation. To handle boundary confusion in both single and multiple face scenarios, we exploit binary and category edge detection to jointly obtain generic geometric structure and fine-grained semantic clues of human faces. Besides, to prevent noisy labels from degrading model generalization during training, cyclical self-regulation is proposed to self-ensemble several model instances into a new model, which is then used to self-distill subsequent models, through alternating iterations. Experiments show that our method achieves new state-of-the-art performance on the Helen, CelebAMask-HQ, and LaPa datasets. The source code is available at https://github.com/deepinsight/insightface/tree/master/parsing/dml_csr.
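The self-ensemble step can be pictured as averaging the parameters of several model instances into one. The sketch below assumes uniform weight averaging, which the abstract does not confirm; `self_ensemble` and the toy checkpoints are hypothetical, and the self-distillation half of the cycle is not shown.

```python
def self_ensemble(checkpoints):
    """Average a list of {parameter_name: [floats]} dictionaries element-wise."""
    n = len(checkpoints)
    names = checkpoints[0].keys()
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in names
    }

# Two toy model instances with a single parameter tensor each.
ckpts = [{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}]
print(self_ensemble(ckpts))
```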

preprint · 2022 · arXiv

DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting

Recent progress has shown that large-scale pre-training using contrastive image-text pairs can be a promising alternative for high-quality visual representation learning from natural language supervision. Benefiting from a broader source of supervision, this new paradigm exhibits impressive transferability to downstream classification tasks and datasets. However, the problem of transferring the knowledge learned from image-text pairs to more complex dense prediction tasks has barely been visited. In this work, we present a new framework for dense prediction by implicitly and explicitly leveraging the pre-trained knowledge from CLIP. Specifically, we convert the original image-text matching problem in CLIP to a pixel-text matching problem and use the pixel-text score maps to guide the learning of dense prediction models. By further using the contextual information from the image to prompt the language model, we are able to facilitate our model to better exploit the pre-trained knowledge. Our method is model-agnostic, which can be applied to arbitrary dense prediction systems and various pre-trained visual backbones including both CLIP models and ImageNet pre-trained models. Extensive experiments demonstrate the superior performance of our methods on semantic segmentation, object detection, and instance segmentation tasks. Code is available at https://github.com/raoyongming/DenseCLIP
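The move from image-text matching to pixel-text matching can be sketched as a cosine similarity between every pixel embedding and every class text embedding, which yields one score map per class. This is a minimal pure-Python illustration with toy vectors; `l2_normalize` and `pixel_text_scores` are hypothetical names, and no CLIP model or prompt learning is involved.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def pixel_text_scores(pixel_feats, text_feats):
    """pixel_feats: H x W x C nested lists; text_feats: K x C.

    Returns an H x W x K map of cosine similarities.
    """
    texts = [l2_normalize(t) for t in text_feats]
    score_map = []
    for row in pixel_feats:
        out_row = []
        for p in row:
            p = l2_normalize(p)
            out_row.append([sum(a * b for a, b in zip(p, t)) for t in texts])
        score_map.append(out_row)
    return score_map

# A 1 x 2 "image" with 2-dimensional features and two class embeddings.
pixels = [[[1.0, 0.0], [0.0, 2.0]]]
texts = [[3.0, 0.0], [0.0, 1.0]]
print(pixel_text_scores(pixels, texts))
```

Each pixel receives a score per class, so the resulting map has the spatial layout that dense prediction tasks need.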

preprint · 2022 · arXiv

Divide to Adapt: Mitigating Confirmation Bias for Domain Adaptation of Black-Box Predictors

Domain Adaptation of Black-box Predictors (DABP) aims to learn a model on an unlabeled target domain supervised by a black-box predictor trained on a source domain. It requires access to neither the source-domain data nor the predictor parameters, thus addressing the data privacy and portability issues of standard domain adaptation. Existing DABP approaches mostly rely on model distillation from the black-box predictor, i.e., training the model with its noisy target-domain predictions, which inevitably introduces confirmation bias accumulated from the prediction noise. To mitigate such bias, we propose a new method, named BETA, to incorporate knowledge distillation and noisy label learning into one coherent framework. This is enabled by a new divide-to-adapt strategy. BETA divides the target domain into an easy-to-adapt subdomain with less noise and a hard-to-adapt subdomain. Then it deploys mutually-teaching twin networks to filter the predictor errors for each other and improve them progressively, from the easy to hard subdomains. As such, BETA effectively purifies the noisy labels and reduces error accumulation. We theoretically show that the target error of BETA is minimized by decreasing the noise ratio of the subdomains. Extensive experiments demonstrate BETA outperforms existing methods on all DABP benchmarks, and is even comparable with the standard domain adaptation methods that use the source-domain data.
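The divide step can be illustrated by splitting target samples on the confidence of the black-box prediction. The fixed threshold below is an assumed stand-in, since the abstract does not give the criterion BETA uses; `divide_target_domain` and the toy confidences are hypothetical.

```python
def divide_target_domain(confidences, threshold=0.8):
    """Split sample indices into (easy, hard) by black-box prediction confidence."""
    easy = [i for i, c in enumerate(confidences) if c >= threshold]
    hard = [i for i, c in enumerate(confidences) if c < threshold]
    return easy, hard

# Toy confidences for five target-domain samples.
easy, hard = divide_target_domain([0.95, 0.40, 0.85, 0.60, 0.99])
print(easy, hard)
```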

preprint · 2022 · arXiv

Doped Mott Insulators in the Triangular Lattice Hubbard Model

We investigate the evolution of the Mott insulators in the triangular lattice Hubbard model as a function of hole doping $δ$ in both the strong and intermediate coupling limits. Using the advanced density matrix renormalization group (DMRG) method, at light hole doping $δ\lesssim 10\%$, we find a significant difference between strong and intermediate couplings. Notably, at intermediate coupling an unusual metallic state emerges, with short-ranged spin correlations but long-ranged spin-chirality order. Moreover, no clear Fermi surface or wave-vector is observed, and this chiral metal also exhibits a staggered loop current, which breaks translational symmetry. These features disappear on increasing interaction strength or on further doping. At strong coupling, the 120 degree magnetic order of the insulating magnet persists for light doping, and produces hole pockets with a well-defined Fermi surface. On further doping, $δ\approx 10\%\sim 20\%$, spin density wave (SDW) order and coherent hole Fermi pockets are found at both strong and intermediate couplings. At even higher doping, $δ\gtrsim 20\%$, the SDW order is suppressed and the spin-singlet Cooper pair correlations are simultaneously enhanced. We also briefly comment on the strong particle-hole asymmetry of the model.
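For orientation, the single-band Hubbard Hamiltonian usually meant by this name is written below. This is the standard textbook form rather than an equation taken from the paper; the symbols $t$ (nearest-neighbor hopping), $U$ (on-site repulsion), and $n$ (average number of electrons per site) are notation assumed here. The coupling strength refers to the ratio $U/t$, and hole doping is measured from half filling.

```latex
H = -t \sum_{\langle i,j \rangle, \sigma}
      \left( c^{\dagger}_{i\sigma} c_{j\sigma} + \mathrm{h.c.} \right)
    + U \sum_{i} n_{i\uparrow} n_{i\downarrow},
\qquad
\delta = 1 - n
```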

preprint · 2022 · arXiv

FaceMAE: Privacy-Preserving Face Recognition via Masked Autoencoders

Face recognition, as one of the most successful applications in artificial intelligence, has been widely used in security, administration, advertising, and healthcare. However, the privacy issues of public face datasets have attracted increasing attention in recent years. Previous works simply mask most areas of faces or synthesize samples using generative models to construct privacy-preserving face datasets, which overlooks the trade-off between privacy protection and data utility. In this paper, we propose a novel framework, FaceMAE, in which face privacy and recognition performance are considered simultaneously. Firstly, randomly masked face images are used to train the reconstruction module in FaceMAE. We tailor the instance relation matching (IRM) module to minimize the distribution gap between real faces and FaceMAE reconstructed ones. During the deployment phase, we use trained FaceMAE to reconstruct images from masked faces of unseen identities without extra training. The risk of privacy leakage is measured based on face retrieval between reconstructed and original datasets. Experiments show that the identities of reconstructed images are difficult to retrieve. We also perform sufficient privacy-preserving face recognition on several public face datasets (i.e., CASIA-WebFace and WebFace260M). Compared to previous state-of-the-art methods, FaceMAE consistently reduces the error rate by at least 50% on LFW, CFP-FP and AgeDB.
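The random masking used to train the reconstruction module can be sketched as choosing a fixed fraction of image patches to hide. The mask ratio of 0.75 and the 14 x 14 patch grid below are assumptions for illustration, not values taken from the abstract, and `random_patch_mask` is a hypothetical helper.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.75, seed=None):
    """Return a list of booleans, where True marks a masked patch."""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

# A 14 x 14 patch grid (196 patches), flattened.
mask = random_patch_mask(196, mask_ratio=0.75, seed=0)
print(sum(mask), "of", len(mask), "patches masked")
```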

preprint · 2022 · arXiv

HFT: Lifting Perspective Representations via Hybrid Feature Transformation

Autonomous driving requires accurate and detailed Bird's Eye View (BEV) semantic segmentation for decision making, which is one of the most challenging tasks for high-level scene perception. Feature transformation from frontal view to BEV is the pivotal technology for BEV semantic segmentation. Existing works can be roughly classified into two categories, i.e., Camera model-Based Feature Transformation (CBFT) and Camera model-Free Feature Transformation (CFFT). In this paper, we empirically analyze the vital differences between CBFT and CFFT. The former transforms features based on the flat-world assumption, which may cause distortion of regions lying above the ground plane. The latter is limited in segmentation performance due to the absence of geometric priors and its time-consuming computation. In order to reap the benefits and avoid the drawbacks of CBFT and CFFT, we propose a novel framework with a Hybrid Feature Transformation module (HFT). Specifically, we decouple the feature maps produced by HFT for estimating the layout of outdoor scenes in BEV. Furthermore, we design a mutual learning scheme to augment hybrid transformation by applying feature mimicking. Notably, extensive experiments demonstrate that with negligible extra overhead, HFT achieves a relative improvement of 13.3% on the Argoverse dataset and 16.8% on the KITTI 3D Object datasets compared to the best-performing existing method. The code is available at https://github.com/JiayuZou2020/HFT.

preprint · 2022 · arXiv

Modeling Ride-Sourcing Matching and Pickup Processes based on Additive Gaussian Process Models

Matching and pickup processes are core features of ride-sourcing services. Previous studies have adopted many analytical models to depict the two processes and obtain operational insights, but the goodness of fit between models and data has been overlooked. To account for both the fit between models and data and analytically tractable formulations, we propose a data-driven approach based on the additive Gaussian Process Model (AGPM) for ride-sourcing market modeling. The framework is tested on real-world data collected in Hangzhou, China. We fit analytical models, machine learning models, and AGPMs, in which the number of matches or pickups is used as the output and spatial, temporal, demand, and supply covariates are used as inputs. The results demonstrate the advantages of AGPMs in recovering the two processes in terms of estimation accuracy. Furthermore, we illustrate the modeling power of AGPM by utilizing the trained model to design and estimate idle vehicle relocation strategies.
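An additive Gaussian process models the output as a sum of one-dimensional functions, one per covariate, which corresponds to a kernel that is a sum of per-dimension kernels. The sketch below assumes a squared-exponential (RBF) kernel for each covariate; that choice, the function names, and the toy inputs are illustrative and not taken from the paper.

```python
import math

def rbf_1d(x, y, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel on a single scalar covariate."""
    return variance * math.exp(-0.5 * ((x - y) / lengthscale) ** 2)

def additive_kernel(x, y, lengthscales):
    """k(x, y) = sum_d k_d(x_d, y_d): one RBF kernel per input covariate."""
    return sum(rbf_1d(a, b, ls) for a, b, ls in zip(x, y, lengthscales))

# Two hypothetical covariates, e.g. demand and supply, on arbitrary scales.
x1 = [1.0, 2.0]
x2 = [1.5, 0.5]
print(additive_kernel(x1, x2, lengthscales=[1.0, 2.0]))
```

Because each term depends on a single covariate, the fitted contribution of each input can be inspected separately, which is what keeps the model analytically tractable.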

preprint · 2022 · arXiv

MVSTER: Epipolar Transformer for Efficient Multi-View Stereo

Learning-based Multi-View Stereo (MVS) methods warp source images into the reference camera frustum to form 3D volumes, which are fused as a cost volume to be regularized by subsequent networks. The fusing step plays a vital role in bridging 2D semantics and 3D spatial associations. However, previous methods utilize extra networks to learn 2D information as fusing cues, underusing 3D spatial correlations and bringing additional computation costs. Therefore, we present MVSTER, which leverages the proposed epipolar Transformer to learn both 2D semantics and 3D spatial associations efficiently. Specifically, the epipolar Transformer utilizes a detachable monocular depth estimator to enhance 2D semantics and uses cross-attention to construct data-dependent 3D associations along the epipolar line. Additionally, MVSTER is built in a cascade structure, where entropy-regularized optimal transport is leveraged to propagate finer depth estimations in each stage. Extensive experiments show MVSTER achieves state-of-the-art reconstruction performance with significantly higher efficiency: compared with MVSNet and CasMVSNet, our MVSTER achieves 34% and 14% relative improvements on the DTU benchmark, with 80% and 51% relative reductions in running time. MVSTER also ranks first on Tanks&Temples-Advanced among all published works. Code is released at https://github.com/JeffWang987.
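Entropy-regularized optimal transport is commonly solved with Sinkhorn iterations, which alternately rescale the rows and columns of a Gibbs kernel until the transport plan matches both marginals. The sketch below shows only that generic iteration on a toy problem; how MVSTER sets up the cost and the distributions is not stated in the abstract, and `sinkhorn`, `eps`, and `iters` are names assumed here.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, iters=200):
    """Entropy-regularized OT plan between histograms a and b for a cost matrix."""
    n, m = len(a), len(b)
    # Gibbs kernel: small cost -> large affinity.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Two depth bins on each side; moving mass between different bins costs 1.
plan = sinkhorn([[0.0, 1.0], [1.0, 0.0]], a=[0.5, 0.5], b=[0.5, 0.5])
print(plan)
```

A smaller `eps` gives a plan closer to unregularized transport, at the price of slower convergence and possible numerical underflow in the kernel.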

preprint · 2022 · arXiv

Predict the Rover Mobility over Soft Terrain using Articulated Wheeled Bevameter

Robot mobility is critical for mission success, especially in soft or deformable terrains, where the complex wheel-soil interaction mechanics often lead to excessive wheel slip and sinkage, eventually causing mission failure. To improve the success rate, online mobility prediction using vision, infrared imaging, or model-based stochastic methods has been used in the literature. This paper proposes an on-board mobility prediction approach using an articulated wheeled bevameter that consists of a force-controlled arm and an instrumented bevameter (with force and vision sensors) as its end-effector. The proposed bevameter, which emulates traditional terramechanics tests such as pressure-sinkage and shear experiments, can measure contact parameters ahead of the rover's body in real time, and predict the slip and sinkage of supporting wheels over the probed region. Based on the predicted mobility, the rover can select a safer path in order to avoid dangerous regions such as those covered with quicksand. Compared to the literature, our proposed method can avoid complicated terramechanics modeling and time-consuming stochastic prediction; it can also mitigate the inaccuracy issues arising in non-contact vision-based methods. We also conduct multiple experiments to validate the proposed approach.
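For context, the pressure-sinkage test mentioned above is classically summarized by Bekker's relation, p = (k_c / b + k_phi) * z^n, where z is sinkage, b is the plate width, and k_c, k_phi, n are soil parameters. The sketch below evaluates and inverts that relation; it is the textbook model rather than the method of this paper, and the parameter values are made up for illustration.

```python
def bekker_pressure(z, b, k_c, k_phi, n):
    """Bekker pressure-sinkage relation: p = (k_c / b + k_phi) * z**n."""
    return (k_c / b + k_phi) * z ** n

def bekker_sinkage(p, b, k_c, k_phi, n):
    """Invert the relation to get the sinkage z for a contact pressure p."""
    return (p / (k_c / b + k_phi)) ** (1.0 / n)

# Hypothetical soil parameters and plate width, in SI units.
soil = dict(b=0.1, k_c=1000.0, k_phi=500000.0, n=1.1)
p = bekker_pressure(0.02, **soil)
print(p, bekker_sinkage(p, **soil))
```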

preprint2022arXiv

Proposal for asymmetric photoemission and tunneling spectroscopies in quantum simulators of the triangular-lattice Fermi-Hubbard model

Recent realizations of well-controlled quantum simulators of the triangular-lattice Fermi-Hubbard model, including triangular optical lattices loaded with ultracold fermions and heterostructures of transition-metal dichalcogenides, as well as more advanced techniques to probe them, pave the way for studying frustrated Fermi-Hubbard physics. Here, we theoretically predict asymmetric photoemission and tunneling spectroscopies for a lightly hole-doped and electron-doped triangular Mott antiferromagnet, and reveal two distinct types of magnetic polarons: a lightly renormalized quasiparticle with the same momentum as the spin background and a heavily renormalized quasiparticle with a shifted momentum and a nearly flat band, using both analytical and unbiased numerical methods. We propose that these theoretical findings be verified in frustrated optical lattices and Moiré superlattices by probing various observables including the spectral function, the density of states, the energy dispersion and the quasiparticle weight. Moreover, we reveal the asymmetric response of the spin background against charge doping, demonstrating that the interplay between the local spin and charge degrees of freedom plays a vital role in doped triangular Mott antiferromagnets.

preprint2022arXiv

Reliable Label Correction is a Good Booster When Learning with Extremely Noisy Labels

Learning with noisy labels has aroused much research interest since data annotations, especially for large-scale datasets, may be inevitably imperfect. Recent approaches resort to a semi-supervised learning problem by dividing training samples into clean and noisy sets. This paradigm, however, is prone to significant degeneration under heavy label noise, as the number of clean samples is too small for conventional methods to behave well. In this paper, we introduce a novel framework, termed LC-Booster, to explicitly tackle learning under extreme noise. The core idea of LC-Booster is to incorporate label correction into the sample selection, so that more purified samples, through the reliable label correction, can be utilized for training, thereby alleviating the confirmation bias. Experiments show that LC-Booster advances state-of-the-art results on several noisy-label benchmarks, including CIFAR-10, CIFAR-100, Clothing1M and WebVision. Remarkably, under the extreme 90% noise ratio, LC-Booster achieves 92.9% and 48.4% accuracy on CIFAR-10 and CIFAR-100, surpassing state-of-the-art methods by a large margin.

preprint2022arXiv

Shapley-NAS: Discovering Operation Contribution for Neural Architecture Search

In this paper, we propose a Shapley value based method to evaluate operation contribution (Shapley-NAS) for neural architecture search. Differentiable architecture search (DARTS) acquires the optimal architectures by optimizing the architecture parameters with gradient descent, which significantly reduces the search cost. However, the magnitude of architecture parameters updated by gradient descent fails to reveal the actual operation importance to the task performance and therefore harms the effectiveness of obtained architectures. By contrast, we propose to evaluate the direct influence of operations on validation accuracy. To deal with the complex relationships between supernet components, we leverage the Shapley value to quantify their marginal contributions by considering all possible combinations. Specifically, we iteratively optimize the supernet weights and update the architecture parameters by evaluating operation contributions via the Shapley value, so that the optimal architectures are derived by selecting the operations that contribute significantly to the tasks. Since the exact computation of the Shapley value is NP-hard, a Monte-Carlo sampling based algorithm with early truncation is employed for efficient approximation, and a momentum update mechanism is adopted to alleviate fluctuation of the sampling process. Extensive experiments on various datasets and various search spaces show that our Shapley-NAS outperforms the state-of-the-art methods by a considerable margin with light search cost. The code is available at https://github.com/Euphoria16/Shapley-NAS.git
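
As a rough illustration of the Monte-Carlo approximation mentioned in the abstract above, the sketch below estimates Shapley values by averaging marginal contributions over random permutations. It is not the authors' implementation: the function name `monte_carlo_shapley`, the `value_fn` callback, the toy operation names, and the specific truncation rule (stop evaluating a permutation once the coalition value is within a tolerance of the full-set value) are all assumptions made for illustration, and the momentum update is not shown.

```python
import random
from typing import Callable, Dict, FrozenSet, Sequence


def monte_carlo_shapley(
    players: Sequence[str],
    value_fn: Callable[[FrozenSet[str]], float],
    num_permutations: int = 200,
    truncation_tol: float = 1e-3,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate Shapley values by sampling random player orderings.

    Hypothetical helper: `value_fn` maps a coalition to a score
    (for example, validation accuracy with only those operations kept).
    """
    rng = random.Random(seed)
    full_value = value_fn(frozenset(players))
    estimates = {p: 0.0 for p in players}

    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition: FrozenSet[str] = frozenset()
        prev_value = value_fn(coalition)
        for p in order:
            if abs(full_value - prev_value) < truncation_tol:
                # Assumed early-truncation rule: the coalition already
                # scores like the full set, so treat the remaining
                # marginal contributions as zero and skip evaluation.
                marginal = 0.0
            else:
                coalition = coalition | {p}
                new_value = value_fn(coalition)
                marginal = new_value - prev_value
                prev_value = new_value
            estimates[p] += marginal / num_permutations
    return estimates


# Toy additive game: each operation contributes a fixed amount, so the
# exact Shapley value of an operation equals its own weight.
toy_weights = {"conv3x3": 0.5, "skip": 0.2, "pool": 0.3}
toy_estimates = monte_carlo_shapley(
    list(toy_weights),
    lambda coalition: sum(toy_weights[p] for p in coalition),
)
```

In a non-additive setting the estimate carries sampling noise that shrinks as `num_permutations` grows, which is the trade-off the truncation and momentum mechanisms in the abstract are meant to soften.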

preprint2022arXiv

Symmetric Mass Generation in the 1+1 Dimensional Chiral Fermion 3-4-5-0 Model

Lattice regularization of chiral fermions has been a long-standing problem in physics. In this work, we present the density matrix renormalization group (DMRG) simulation of the 3-4-5-0 model of (1+1)D chiral fermions with an anomaly-free chiral U(1) symmetry, which contains two left-moving and two right-moving fermions carrying U(1) charges 3,4 and 5,0, respectively. Following the Wang-Wen chiral fermion model, we realize the chiral fermions and their mirror partners on the opposite boundaries of a thin strip of (2+1)D lattice model of multi-layer Chern insulator, whose finite width implies the quantum system is effectively (1+1)D. By introducing two carefully designed sets of six-fermion local interactions to the mirror sector only, we demonstrate that the mirror fermions can be gapped out by the interaction beyond a critical strength without breaking the chiral U(1) symmetry, via the symmetric mass generation (SMG) mechanism. We show that the interaction-driven gapping transition is in the Berezinskii-Kosterlitz-Thouless (BKT) universality class. We determine the evolution of Luttinger parameters before the transition, which confirms that the transition happens exactly at the point when the interaction term becomes marginal. As the mirror sector is gapped after the transition, we check that the fermions in the light chiral fermion sector remain gapless, which provides the desired lattice regularization of chiral fermions.

preprint2022arXiv

WebFace260M: A Benchmark for Million-Scale Deep Face Recognition

Face benchmarks empower the research community to train and evaluate high-performance face recognition systems. In this paper, we contribute a new million-scale recognition benchmark, containing uncurated 4M identities/260M faces (WebFace260M) and cleaned 2M identities/42M faces (WebFace42M) training data, as well as an elaborately designed time-constrained evaluation protocol. Firstly, we collect 4M name lists and download 260M faces from the Internet. Then, a Cleaning Automatically utilizing Self-Training (CAST) pipeline is devised to purify the tremendous WebFace260M, which is efficient and scalable. To the best of our knowledge, the cleaned WebFace42M is the largest public face recognition training set and we expect to close the data gap between academia and industry. Referring to practical deployments, the Face Recognition Under Inference Time conStraint (FRUITS) protocol and a new test set with rich attributes are constructed. Besides, we gather a large-scale masked face sub-set for biometrics assessment under COVID-19. For a comprehensive evaluation of face matchers, three recognition tasks are performed under standard, masked and unbiased settings, respectively. Equipped with this benchmark, we delve into million-scale face recognition problems. A distributed framework is developed to train face recognition models efficiently without compromising performance. Enabled by WebFace42M, we reduce the failure rate by 40% on the challenging IJB-C set and rank 3rd among 430 entries on NIST-FRVT. Even 10% of the data (WebFace4M) shows superior performance compared with the public training sets. Furthermore, comprehensive baselines are established under the FRUITS-100/500/1000 milliseconds protocols. The proposed benchmark shows enormous potential on standard, masked and unbiased face recognition scenarios. Our WebFace260M website is https://www.face-benchmark.org.

preprint2021arXiv

WebFace260M: A Benchmark Unveiling the Power of Million-Scale Deep Face Recognition

In this paper, we contribute a new million-scale face benchmark containing noisy 4M identities/260M faces (WebFace260M) and cleaned 2M identities/42M faces (WebFace42M) training data, as well as an elaborately designed time-constrained evaluation protocol. Firstly, we collect a 4M name list and download 260M faces from the Internet. Then, a Cleaning Automatically utilizing Self-Training (CAST) pipeline is devised to purify the tremendous WebFace260M, which is efficient and scalable. To the best of our knowledge, the cleaned WebFace42M is the largest public face recognition training set and we expect to close the data gap between academia and industry. Referring to practical scenarios, the Face Recognition Under Inference Time conStraint (FRUITS) protocol and a test set are constructed to comprehensively evaluate face matchers. Equipped with this benchmark, we delve into million-scale face recognition problems. A distributed framework is developed to train face recognition models efficiently without compromising performance. Empowered by WebFace42M, we reduce the failure rate by a relative 40% on the challenging IJB-C set, and rank 3rd among 430 entries on NIST-FRVT. Even 10% of the data (WebFace4M) shows superior performance compared with public training sets. Furthermore, comprehensive baselines are established on our rich-attribute test set under the FRUITS-100ms/500ms/1000ms protocol, including the MobileNet, EfficientNet, AttentionNet, ResNet, SENet, ResNeXt and RegNet families. The benchmark website is https://www.face-benchmark.org.

preprint2020arXiv

$2k_F$ Density Wave Instability of Composite Fermi Liquid

We investigate the $2k_F$ density-wave instability of non-Fermi liquid states by combining exact diagonalization with renormalization group analysis. At the half-filled zeroth Landau level, we study the fate of the composite Fermi liquid in the presence of the mass anisotropy and mixed Landau level form factors. These two experimentally accessible knobs trigger a phase transition towards a unidirectional charge-density-wave state with a wavevector equal to $2k_F$ of the composite Fermi liquid. Based on exact diagonalization, we identify such a transition by examining both the energy spectra and the static structure factor of charge density-density correlations. Moreover, the renormalization group analysis reveals that gauge fluctuations render the non-Fermi liquid state unstable against density-wave orders, consistent with numerical observations. Possible experimental probes of the density-wave instability are also discussed.

preprint2020arXiv

Complex Phase Diagram of Doped XXZ Ladder: Localization and Pairing

How the nature of the ground state can be dramatically changed by the distinct underlying spin correlation is a central issue of doped Mott insulators. The two-leg XXZ ladder provides a prototypical spin background, which can be tuned from a long-range Néel order to a short-range "spin liquid" via the superexchange anisotropy, giving rise to a complex phase diagram at finite doping. Using the density matrix renormalization group method, we show that although the charge is always self-localized in the Néel ordered phase, a second insulating phase emerges, in which the doped holes become paired but remain localized while the transverse spin-spin correlation reduces to a short-ranged one, making the Néel order classical. Only when the Néel order totally disappears by further reducing the anisotropy does the pairing become truly coherent, as characterized by a Luther-Emery state. In sharp contrast, the pairing is totally absent in the in-plane ferromagnetic XXZ regime, where a direct transition from the charge self-localization in the Néel ordered phase to a Fermi-gas-like state in the spin liquid phase is found. A consistent physical picture is briefly discussed.

preprint2020arXiv

Joint predictions of multi-modal ride-hailing demands: a deep multi-task multigraph learning-based approach

Ride-hailing platforms generally provide various service options to customers, such as solo ride services, shared ride services, etc. It is generally expected that demands for different service modes are correlated, and the prediction of demand for one service mode can benefit from historical observations of demands for other service modes. Moreover, an accurate joint prediction of demands for multiple service modes can help the platforms better allocate and dispatch vehicle resources. Although there is a large stream of literature on ride-hailing demand predictions for one specific service mode, little effort has been devoted to joint predictions of ride-hailing demands for multiple service modes. To address this issue, we propose a deep multi-task multi-graph learning approach, which combines two components: (1) multiple multi-graph convolutional (MGC) networks for predicting demands for different service modes, and (2) multi-task learning modules that enable knowledge sharing across multiple MGC networks. More specifically, two multi-task learning structures are established. The first one is the regularized cross-task learning, which builds cross-task connections among the inputs and outputs of multiple MGC networks. The second one is the multi-linear relationship learning, which imposes a prior tensor normal distribution on the weights of various MGC networks. Although there are no concrete bridges between different MGC networks, the weights of these networks are constrained by each other and subject to a common prior distribution. Evaluated with the for-hire-vehicle datasets in Manhattan, we show that our proposed approach outperforms the benchmark algorithms in prediction accuracy for different ride-hailing modes.

preprint2020arXiv

Magnetic Field Induced Spin Liquids in S=1 Kitaev Honeycomb Model

We investigate the ground state properties of the spin S=1 Kitaev honeycomb model under a magnetic field based on the density matrix renormalization group (DMRG) calculation. With the time reversal symmetry breaking due to the magnetic field, a gapped Kitaev spin liquid is identified for both ferromagnetic (FM) and antiferromagnetic (AFM) Kitaev couplings. The topological nature of such Kitaev spin liquid is manifested by the nearly quantized Wilson loop, degeneracy in the entanglement spectra and existence of edge modes. While the FM Kitaev spin liquid is destroyed by a weaker magnetic field $H_*^\text{FM}$, the AFM one demonstrates a robustness up to an order of magnitude larger critical field $H_*^\text{AFM}$. Moreover, an intermediate nonmagnetic phase appears only for the AFM case at larger fields, $H_*^\text{AFM} < H < H_{**}^\text{AFM}$, before the transition to a high-field polarized paramagnet. The stability of the Kitaev spin liquid against the Heisenberg interactions is also examined. Our findings may further inspire the investigation of recently proposed S=1 Kitaev materials.

preprint2020arXiv

Modeling indoor-level non-pharmaceutical interventions during the COVID-19 pandemic: a pedestrian dynamics-based microscopic simulation approach

Mathematical modeling of epidemic spreading has been widely adopted to estimate the threats of epidemic diseases (e.g., the COVID-19 pandemic) as well as to evaluate epidemic control interventions. Indoor places are considered to be a significant source of epidemic spreading risk, but existing widely used epidemic spreading models are usually of limited use for indoor places since the dynamic changes in physical distance between people are ignored, and the empirical features of essential and non-essential travel are not differentiated. In this paper, we introduce a pedestrian-based epidemic spreading model that is capable of modeling indoor transmission risks of diseases during people's social activities. Taking advantage of the before-and-after mobility data from the University of Maryland COVID-19 Impact Analysis Platform, we find that people tend to spend more time in grocery stores once their travel frequencies are restricted to a low level. In other words, an increase in dwell time could balance the decrease in travel frequencies and satisfy people's demand. Based on the pedestrian-based model and the empirical evidence, combined non-pharmaceutical interventions from different operational levels are evaluated. Numerical simulations show that restrictions on people's travel frequency and on the opening hours of indoor places may not be universally effective in reducing the average infection risk for each pedestrian who visits the place. Entry limitations can be a widely effective alternative, although the decision-maker needs to balance the decrease in risky contacts against the increase in queue length outside the place, which may impede people from fulfilling their travel needs.

preprint2020arXiv

The Devil is in the Details: Delving into Unbiased Data Processing for Human Pose Estimation

Although it is a fundamental component of training and inference, data processing has not, to the best of our knowledge, been systematically considered in the human pose estimation community. In this paper, we focus on this problem and find that the devil of human pose estimation evolution is in the biased data processing. Specifically, by investigating the standard data processing in state-of-the-art approaches, mainly including coordinate system transformation and keypoint format transformation (i.e., encoding and decoding), we find that the results obtained by the common flipping strategy are unaligned with the original ones in inference. Moreover, there is a statistical error in some keypoint format transformation methods. The two problems couple together, significantly degrading the pose estimation performance and thus laying a trap for the research community. This trap has given birth to many suboptimal remedies, which are always unreported, confusing but influential. By causing failure in reproduction and unfairness in comparison, the unreported remedies seriously impede technological development. To tackle this dilemma at the source, we propose Unbiased Data Processing (UDP), consisting of two technical aspects for the two aforementioned problems respectively (i.e., unbiased coordinate system transformation and unbiased keypoint format transformation). As a model-agnostic approach and a superior solution, UDP successfully pushes the performance boundary of human pose estimation and offers a higher and more reliable baseline for the research community. Code is publicly available at https://github.com/HuangJunJie2017/UDP-Pose

preprint2020arXiv

Widely Tunable Quantum Phase Transition from Moore-Read to Composite Fermi Liquid in Bilayer Graphene

We develop a proposal to realise a widely tunable and clean quantum phase transition in bilayer graphene between two paradigmatic fractionalized phases of matter: the Moore-Read fractional quantum Hall state and the composite Fermi liquid metal. This transition can be realized at total fillings $ν=\pm 3+1/2$ and the critical point can be controllably accessed by tuning either the interlayer electric bias or the perpendicular magnetic field values over a wide range of parameters. We study the transition numerically within a model that contains all leading single particle corrections to the band-structure of bilayer graphene and includes the fluctuations between the $n=0$ and $n=1$ cyclotron orbitals of its zeroth Landau level to delineate the most favorable region of parameters to experimentally access this unconventional critical point. We also find evidence for a new anisotropic gapless phase stabilized near the level crossing of $n=0/1$ orbits.

preprint2019arXiv

Deformations of Bi-conformal Energy and a new Characterization of Quasiconformality

The concept of hyperelastic deformations of bi-conformal energy is developed as an extension of quasiconformality. These are homeomorphisms $h:X \to Y$ between domains $ X, Y \subset \mathbb R^n$ of the Sobolev class $W^{1,n}_{loc} (X, Y)$ whose inverse $f =h^{-1}:Y \to X$ also belongs to $W^{1,n}_{loc}(Y, X)$. Thus the paper opens new topics in Geometric Function Theory with connections to mathematical models of Nonlinear Elasticity. In seeking differences and similarities with quasiconformal mappings we examine closely the modulus of continuity of deformations of bi-conformal energy. This leads us to a new characterization of quasiconformality. Specifically, it is observed that quasiconformal mappings behave locally at every point like radial stretchings. Without going into detail, if a quasiconformal map $h$ admits a function $ϕ$ as its optimal modulus of continuity at a point $x_0$, then $f = h^{-1}$ admits the inverse function $ψ= ϕ^{-1}$ as its modulus of continuity at $y_0 = h(x_0)$. That is to say, a poor continuity of $h$ at a given point $x_0$ is always compensated by a better continuity of $f$ at $y_0$, and vice versa. Such a gain/loss property, seemingly overlooked by many authors, is actually characteristic of quasiconformal mappings. It turns out that the elastic deformations of bi-conformal energy are very different in this respect. Unexpectedly, such a map may have the same optimal modulus of continuity as its inverse deformation. In line with Hooke's Law, when trying to restore the original shape of the body (by the inverse transformation) the modulus of continuity may neither be improved nor become worse. However, examples to confirm this phenomenon are far from being obvious. We eventually hope that our examples will gain interest in materials science, particularly in mathematical models of hyperelasticity.

preprint2019arXiv

Predicting origin-destination ride-sourcing demand with a spatio-temporal encoder-decoder residual multi-graph convolutional network

With the rapid development of mobile-internet technologies, on-demand ride-sourcing services have become increasingly popular and largely reshaped the way people travel. Demand prediction is one of the most fundamental components in supply-demand management systems of ride-sourcing platforms. With accurate short-term prediction for origin-destination (OD) demand, the platforms can make precise and timely decisions on real-time matching, idle vehicle reallocations and ride-sharing vehicle routing, etc. Compared to zone-based demand prediction that has been examined by many previous studies, OD-based demand prediction is more challenging. This is mainly due to the complicated spatial and temporal dependencies among demand of different OD pairs. To overcome this challenge, we propose the Spatio-Temporal Encoder-Decoder Residual Multi-Graph Convolutional network (ST-ED-RMGC), a novel deep learning model for predicting ride-sourcing demand of various OD pairs. Firstly, the model constructs OD graphs, which utilize adjacency matrices to characterize the non-Euclidean pair-wise geographical and semantic correlations among different OD pairs. Secondly, based on the constructed graphs, a residual multi-graph convolutional (RMGC) network is designed to encode the context-aware spatial dependencies, and a long short-term memory (LSTM) network is used to encode the temporal dependencies, into a dense vector space. Finally, we reuse the RMGC networks to decode the compressed vector back to OD graphs and predict the future OD demand. Through extensive experiments on the for-hire-vehicles datasets in Manhattan, New York City, we show that our proposed deep learning framework outperforms the state of the art by a significant margin.

preprint2019arXiv

Tackling Challenges in Seebeck Coefficient Measurement of Ultra-High Resistance Samples with an AC Technique

The Seebeck coefficient is a widely studied semiconductor property. Conventional Seebeck coefficient measurements are based on DC voltage measurement. Normally this is performed on samples with low resistances, below the level of a few Mohm. Meanwhile, certain semiconductors are highly intrinsic and resistive; many examples can be found in optical and photovoltaic materials. The hybrid halide perovskites that have gained extensive attention recently are a good example. Few credible studies exist on the Seebeck coefficient of CH3NH3PbI3, for example. We report here an AC-technique-based Seebeck coefficient measurement, which makes high-quality voltage measurements on samples with resistances up to 100 Gohm. This is achieved through a specifically designed setup to enhance sample isolation and reduce meter loading. As a demonstration, we performed a Seebeck coefficient measurement of a CH3NH3PbI3 thin film in the dark and found S = +550 microV/K. This property of the material has not been successfully studied before.