Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
61 works
0 followers
29 topics
4 close collaborators

Published work

61 published item(s)

preprint, 2026, arXiv

Di-BiLPS: Denoising induced Bidirectional Latent-PDE-Solver under Sparse Observations

Partial differential equations (PDEs) are fundamental for modeling complex natural and physical phenomena. In many real-world applications, however, observational data are extremely sparse, which severely limits the applicability of both classical numerical solvers and existing neural approaches. While neural methods have shown promising results under moderately sparse observations, their inference efficiency at high resolutions is limited, and their accuracy degrades substantially in the extremely sparse regime. In this work, we propose Di-BiLPS, a unified neural framework that effectively handles both forward and inverse PDE problems under extremely sparse observations. Di-BiLPS combines a variational autoencoder to compress high-dimensional inputs into a compact latent space, a latent diffusion module to model uncertainty, and contrastive learning to align representations. Operating entirely in this latent space, the framework achieves efficient inference while retaining flexible input-output mapping. In addition, we introduce a PDE-informed denoising algorithm based on a variance-preserving diffusion process, which further improves inference efficiency. Extensive experiments on multiple PDE benchmarks demonstrate that Di-BiLPS consistently achieves SOTA performance under extremely sparse inputs (as low as 3%), while substantially reducing computational cost. Moreover, Di-BiLPS enables zero-shot super-resolution, as it allows predictions over continuous spatial-temporal domains.
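As a rough illustration of the variance-preserving diffusion process the abstract refers to, the sketch below uses the commonly cited DDPM-style forward step, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the running product of (1 - beta_s). This is an assumption about the general technique, not the paper's PDE-informed denoising algorithm; the names `alpha_bar` and `vp_forward` and the beta schedule are hypothetical.

```python
import math
import random


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) over the first t steps."""
    prod = 1.0
    for b in betas[:t]:
        prod *= 1.0 - b
    return prod


def vp_forward(x0, betas, t, rng):
    """Sample x_t from x_0 with a variance-preserving forward step.

    The squared signal and noise weights sum to 1, so unit-variance
    inputs keep unit variance at every step (hence the name).
    """
    ab = alpha_bar(betas, t)
    signal, noise = math.sqrt(ab), math.sqrt(1.0 - ab)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


if __name__ == "__main__":
    betas = [0.01 * (i + 1) for i in range(10)]  # hypothetical schedule
    print(vp_forward([1.0, -0.5, 0.25], betas, 10, random.Random(0)))
```

At t = 0 the signal weight is 1 and the noise weight is 0, so the sample equals the input; as t grows the sample approaches pure Gaussian noise.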

preprint, 2026, arXiv

EponaV2: Driving World Model with Comprehensive Future Reasoning

Data scaling plays a pivotal role in the pursuit of general intelligence. However, the prevailing perception-planning paradigm in autonomous driving relies heavily on expensive manual annotations to supervise trajectory planning, which severely limits its scalability. Conversely, although existing perception-free driving world models achieve impressive driving performance, their real-world reasoning ability for planning is built solely on next-frame image forecasting. Lacking sufficient supervision, these models often struggle with comprehensive scene understanding, resulting in unsatisfactory trajectory planning. In this paper, we propose EponaV2, a novel paradigm of driving world models, which achieves high-quality planning with comprehensive future reasoning. Inspired by how human drivers anticipate 3D geometry and semantics, we train our model to forecast more comprehensive future representations, which can be additionally decoded to future geometry and semantic maps. Extracting the 3D and semantic modalities enables our model to deeply understand the surrounding environment, and the future prediction task significantly enhances the real-world reasoning capabilities of EponaV2, ultimately leading to improved trajectory planning. Moreover, inspired by the training recipe of Large Language Models (LLMs), we introduce a flow matching group relative policy optimization mechanism to further improve planning accuracy. The state-of-the-art (SOTA) performance of EponaV2 among perception-free models on three NAVSIM benchmarks (+1.3 PDMS, +5.5 EPDMS) demonstrates the effectiveness of our methods.

preprint, 2026, arXiv

HorizonDrive: Self-Corrective Autoregressive World Model for Long-horizon Driving Simulation

Closed-loop driving simulation requires real-time interaction beyond short offline clips, pushing current driving world models toward autoregressive (AR) rollout. Existing AR distillation approaches typically rely on frame sinks or student-side degradation training. The former transfers poorly to driving due to fast ego-motion and rapid scene changes, while the latter remains bounded by the teacher's single-pass output length and thus provides only a limited supervision horizon. A natural question is: can the teacher itself be extended via AR rollout to provide unbounded-horizon supervision at bounded memory cost? The key difficulty is that a standard teacher drifts under its own predictions, contaminating the supervision it provides. Our key insight is to make the teacher rollout-capable, ensuring reliable supervision from its own AR rollouts. This is instantiated as HorizonDrive, an anti-drifting training-and-distillation framework for AR driving simulation. First, scheduled rollout recovery (SRR) trains the base model to reconstruct ground-truth future clips from prediction-corrupted histories, yielding a teacher that remains stable across long AR rollouts. Second, the rollout-capable teacher is extended via AR rollout, providing long-horizon distribution-matching supervision under bounded memory, while a short-window student aligns to it with teacher rollout DMD (TRD) for efficient real-time deployment. HorizonDrive natively supports minute-scale AR rollout under bounded memory; on nuScenes, HorizonDrive reduces FID by 52% and FVD by 37%, and lowers ARE and DTW by 21% and 9% relative to the strongest long-horizon streaming baselines, while remaining competitive with single-pass driving video generators.

preprint, 2026, arXiv

REAP: Reinforcement-Learning End-to-End Autonomous Parking with Gaussian Splatting Simulator for Real2Sim2Real Transfer

In recent years, autonomous parking has made significant advances, yet parking tasks still face challenges in extreme scenarios such as mechanical and dead-end parking slots, often resulting in failures. This is mainly due to traditional parking methods adopting a multistage approach, lacking the ability to optimize the parking problem as a whole. End-to-end methods enable joint optimization across perception and planning modules to eliminate the accumulation of errors, enhancing algorithm performance in extreme scenarios. Although several end-to-end parking methods use imitation or reinforcement learning, the former is limited by data cost and distribution coverage, while the latter suffers from inefficient exploration. To address these challenges, we propose a Reinforcement learning End-to-end Autonomous Parking method (REAP). REAP employs Soft Actor-Critic (SAC) within an asymmetric reinforcement learning framework to improve training efficiency and inference performance. To accelerate model convergence, we distill the capabilities of a rule-based planner into the end-to-end network through behavior cloning. We further introduce a soft predictive collision penalty mechanism to reduce collision rates by penalizing obstacle-approaching actions. To ensure that the trained reinforcement learning network can directly transfer to real-world scenarios, we have established a Real2Sim2Real simulator. In the Real2Sim step, we use 3D Gaussian Splatting (3DGS) to transform real-world scenes into digital scenes. In the Sim2Real step, we deploy the end-to-end model onto the vehicle to bridge the Sim2Real gap. Trained in the 3DGS simulator and deployed on physical vehicles, REAP successfully parks in various types of parking spaces, especially demonstrating the feasibility of end-to-end RL parking in extremely narrow mechanical slots.

preprint, 2026, arXiv

RS2-SAM2: Customized SAM2 for Referring Remote Sensing Image Segmentation

Referring Remote Sensing Image Segmentation (RRSIS) aims to segment target objects in remote sensing (RS) images based on textual descriptions. Although Segment Anything Model 2 (SAM2) has shown remarkable performance in various segmentation tasks, its application to RRSIS presents several challenges, including understanding the text-described RS scenes and generating effective prompts from text. To address these issues, we propose RS2-SAM2, a novel framework that adapts SAM2 to RRSIS by aligning the adapted RS features and textual features while providing pseudo-mask-based dense prompts. Specifically, we employ a union encoder to jointly encode the visual and textual inputs, generating aligned visual and text embeddings as well as multimodal class tokens. A bidirectional hierarchical fusion module is introduced to adapt SAM2 to RS scenes and align adapted visual features with the visually enhanced text embeddings, improving the model's interpretation of text-described RS scenes. To provide precise target cues for SAM2, we design a mask prompt generator, which takes the visual embeddings and class tokens as input and produces a pseudo-mask as the dense prompt of SAM2. Experimental results on several RRSIS benchmarks demonstrate that RS2-SAM2 achieves state-of-the-art performance.

preprint, 2024, arXiv

Efficient Scenario Generation for Chance-constrained Economic Dispatch Considering Ambient Wind Conditions

Scenario generation is an effective data-driven method for solving chance-constrained optimization while ensuring desired risk guarantees with a finite number of samples. Crucial challenges in deploying this technique in the real world arise due to the absence of appropriate risk-tuning models tailored for the desired application. In this paper, we focus on designing efficient scenario generation schemes for economic dispatch in power systems. We propose a novel scenario generation method based on filtering scenarios using ambient wind conditions. These filtered scenarios are deployed incrementally in order to meet desired risk levels while using minimum resources. In order to study the performance of the proposed scheme, we illustrate the procedure on case studies performed for both 24-bus and 118-bus systems with real-world wind power forecasting data. Numerical results suggest that the proposed filter-and-increment scenario generation model leads to a precise and efficient solution for the chance-constrained economic dispatch problem.

preprint, 2024, arXiv

Enhancing RAW-to-sRGB with Decoupled Style Structure in Fourier Domain

RAW to sRGB mapping, which aims to convert RAW images from smartphones into RGB form equivalent to that of Digital Single-Lens Reflex (DSLR) cameras, has become an important area of research. However, current methods often ignore the difference between cell phone RAW images and DSLR camera RGB images, a difference that goes beyond the color matrix and extends to spatial structure due to resolution variations. Recent methods directly rebuild color mapping and spatial structure via a shared deep representation, limiting optimal performance. Inspired by the Image Signal Processing (ISP) pipeline, which distinguishes image restoration and enhancement, we present a novel Neural ISP framework, named FourierISP. This approach breaks the image down into style and structure within the frequency domain, allowing for independent optimization. FourierISP is composed of three subnetworks: Phase Enhance Subnet for structural refinement, Amplitude Refine Subnet for color learning, and Color Adaptation Subnet for blending them in a smooth manner. This approach sharpens both color and structure, and extensive evaluations across varied datasets confirm that our approach realizes state-of-the-art results. Code will be available at https://github.com/alexhe101/FourierISP.
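The split into style (amplitude) and structure (phase) in the frequency domain can be illustrated on a tiny 1D signal. The sketch below is only a toy demonstration of separating and recombining Fourier amplitude and phase; it is not the FourierISP network, and the function names `dft`, `idft`, and `recombine` are hypothetical helpers written with the standard library.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * math.pi * f * k / n) for f in range(n)) / n
            for k in range(n)]


def recombine(amplitude_src, phase_src):
    """Build a signal from the amplitude of one input and the phase of another."""
    amplitude = [abs(c) for c in dft(amplitude_src)]
    phase = [cmath.phase(c) for c in dft(phase_src)]
    hybrid = [cmath.rect(a, p) for a, p in zip(amplitude, phase)]
    return [v.real for v in idft(hybrid)]


if __name__ == "__main__":
    a = [1.0, 2.0, 0.5, -1.0]
    b = [0.0, 1.0, 0.0, -1.0]
    print(recombine(a, b))  # amplitude taken from a, phase taken from b
```

Passing the same signal for both arguments reconstructs it up to rounding error, which is a quick check that the two components together carry all of the information.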

preprint, 2023, arXiv

AI Mobile Application for Archaeological Dating of Bronze Dings

We develop an AI application for archaeological dating of bronze Dings. A classification model is employed to predict the period of the input Ding, and a detection model is used to show the feature parts that inform the dating decision. To train the two deep learning models, we collected a large number of Ding images from published materials, and archaeological experts annotated the period and the feature parts on each image. Furthermore, we design a user system and deploy our pre-trained models on the WeChat Mini Program platform for ease of use. With only a smartphone that has the WeChat app installed, users can take a photo of a bronze Ding and easily obtain the dating result, the feature parts, and other reference artifacts. To use our application, please scan this QR code with WeChat.

preprint, 2023, arXiv

Fully H(gradcurl)-nonconforming Finite Element Method for The Singularly Perturbed Quad-curl Problem on Cubical Meshes

In this paper, we develop two fully nonconforming (both H(grad curl)-nonconforming and H(curl)-nonconforming) finite elements on cubical meshes which can fit into the Stokes complex. The newly proposed elements have 24 and 36 degrees of freedom, respectively. Different from the fully H(grad curl)-nonconforming tetrahedral finite elements in [9], the elements in this paper lead to a robust finite element method to solve the singularly perturbed quad-curl problem. To confirm this, we prove the optimal convergence of order $O(h)$ for a fixed parameter $ε$ and the uniform convergence of order $O(h^{1/2})$ for any value of $ε$. Some numerical examples are used to verify the correctness of the theoretical analysis.

preprint, 2022, arXiv

Adversarial Relighting Against Face Recognition

Deep face recognition (FR) has achieved significantly high accuracy on several challenging datasets and fosters successful real-world applications, even showing high robustness to the illumination variation that is usually regarded as a main threat to the FR system. However, in the real world, illumination variation caused by diverse lighting conditions cannot be fully covered by the limited face dataset. In this paper, we study the threat of lighting against FR from a new angle, i.e., adversarial attack, and identify a new task, i.e., adversarial relighting. Given a face image, adversarial relighting aims to produce a naturally relighted counterpart while fooling the state-of-the-art deep FR methods. To this end, we first propose the physical model-based adversarial relighting attack (ARA) denoted as albedo-quotient-based adversarial relighting attack (AQ-ARA). It generates natural adversarial light under the physical lighting model and guidance of FR systems and synthesizes adversarially relighted face images. Moreover, we propose the auto-predictive adversarial relighting attack (AP-ARA) by training an adversarial relighting network (ARNet) to automatically predict the adversarial light in a one-step manner according to different input faces, allowing efficiency-sensitive applications. More importantly, we propose to transfer the above digital attacks to physical ARA (PhyARA) through a precise relighting device, making the estimated adversarial lighting condition reproducible in the real world. We validate our methods on three state-of-the-art deep FR methods, i.e., FaceNet, ArcFace, and CosFace, on two public datasets. The extensive and insightful results demonstrate our work can generate realistic adversarial relighted face images fooling face recognition tasks easily, revealing the threat of specific light directions and strengths.

preprint, 2022, arXiv

An Active Contour Model with Local Variance Force Term and Its Efficient Minimization Solver for Multi-phase Image Segmentation

In this paper, we propose an active contour model with a local variance force (LVF) term that can be applied to multi-phase image segmentation problems. With the LVF, the proposed model is very effective in the segmentation of images with noise. To solve this model efficiently, we represent the regularization term by characteristic functions and then design a minimization algorithm based on a modification of the iterative convolution-thresholding method (ICTM), namely ICTM-LVF. This minimization algorithm enjoys the energy-decaying property under some conditions and has highly efficient performance in the segmentation. To overcome the initialization issue of active contour models, we generalize the inhomogeneous graph Laplacian initialization method (IGLIM) to the multi-phase case and then apply it to give the initial contour of the ICTM-LVF solver. Numerical experiments are conducted on synthetic images and real images to demonstrate the capability of our initialization method, and the effectiveness of the local variance force for noise robustness in the multi-phase image segmentation.

preprint, 2022, arXiv

Aspect-driven User Preference and News Representation Learning for News Recommendation

News recommender systems are essential for helping users efficiently and effectively find interesting news among a large volume of news. Most existing news recommender systems learn topic-level representations of users and news for recommendation, and neglect to learn more informative aspect-level features of users and news for more accurate recommendation. As a result, they achieve limited recommendation performance. To address this deficiency, we propose a novel Aspect-driven News Recommender System (ANRS) built on aspect-level user preference and news representation learning. Here, a news aspect is fine-grained semantic information expressed by a set of related words, which indicates specific aspects described by the news. In ANRS, a news aspect-level encoder and a user aspect-level encoder are devised to learn the fine-grained aspect-level representations of the user's preferences and news characteristics respectively, which are fed into a click predictor to judge the probability of the user clicking the candidate news. Extensive experiments are conducted on the commonly used real-world dataset MIND, which demonstrate the superiority of our method compared with representative and state-of-the-art methods.

preprint, 2022, arXiv

Atomic Filter: a Weak Form of Shift Operator for Graph Signals

The shift operation plays a crucial role in classical signal processing. It is the generator of all the filters and the basic operation for time-frequency analysis, such as the windowed Fourier transform and the wavelet transform. With the rapid development of internet technology and big data science, a large amount of data is expressed as signals defined on graphs. In order to establish the theory of filtering, windowed Fourier transform and wavelet transform in the setting of graph signals, we need to extend the shift operation of classical signals to graph signals. It is a fundamental problem since the vertex set of a graph is usually not a vector space and the addition operation cannot be defined on the vertex set of the graph. In this paper, based on our understanding of the core role of the shift operation in classical signal processing, we propose the concept of atomic filters, which can be viewed as a weak form of the shift operator for graph signals. Then, we study the conditions under which an atomic filter is norm-preserving, periodic, or real-preserving. The real-preserving property holds naturally in classical signal processing, but no research has been reported on this topic in the graph signal setting. With these conditions we propose the concept of normal atomic filters for graph signals, which degenerates into the classical shift operator under mild conditions if the graph is circulant. Typical examples of graphs that do or do not have normal atomic filters are given. Finally, as an application, atomic filters are utilized to construct time-frequency atoms which constitute a frame of the graph signal space.

preprint, 2022, arXiv

AziNorm: Exploiting the Radial Symmetry of Point Cloud for Azimuth-Normalized 3D Perception

Studying the inherent symmetry of data is of great importance in machine learning. Point cloud, the most important data format for 3D environmental perception, is naturally endowed with strong radial symmetry. In this work, we exploit this radial symmetry via a divide-and-conquer strategy to boost 3D perception performance and ease optimization. We propose Azimuth Normalization (AziNorm), which normalizes the point clouds along the radial direction and eliminates the variability brought by the difference of azimuth. AziNorm can be flexibly incorporated into most LiDAR-based perception methods. To validate its effectiveness and generalization ability, we apply AziNorm in both object detection and semantic segmentation. For detection, we integrate AziNorm into two representative detection methods, the one-stage SECOND detector and the state-of-the-art two-stage PV-RCNN detector. Experiments on Waymo Open Dataset demonstrate that AziNorm improves SECOND and PV-RCNN by 7.03 mAPH and 3.01 mAPH respectively. For segmentation, we integrate AziNorm into KPConv. On SemanticKitti dataset, AziNorm improves KPConv by 1.6/1.1 mIoU on val/test set. Besides, AziNorm remarkably improves data efficiency and accelerates convergence, reducing the requirement of data amounts or training epochs by an order of magnitude. SECOND w/ AziNorm can significantly outperform fully trained vanilla SECOND, even trained with only 10% data or 10% epochs. Code and models are available at https://github.com/hustvl/AziNorm.
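One way to picture removing azimuth variability is to rotate a local group of LiDAR points about the vertical axis so that its reference point always lies on the +x axis. The sketch below shows only that geometric idea, under the assumption that the sensor sits at the origin with z pointing up; it is not the AziNorm implementation from the linked repository, and `azimuth_normalize` is a hypothetical name.

```python
import math


def azimuth_normalize(points, center):
    """Rotate points about the z-axis so that `center` lands on the +x axis.

    points: iterable of (x, y, z) tuples in the sensor frame.
    center: (x, y, z) reference point of the local patch.
    """
    theta = math.atan2(center[1], center[0])  # azimuth of the patch center
    c, s = math.cos(-theta), math.sin(-theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


if __name__ == "__main__":
    patch = [(0.0, 2.0, 1.0), (0.5, 2.5, 0.0)]
    print(azimuth_normalize(patch, center=(0.0, 2.0, 0.0)))
```

Because the transform is a pure rotation about z, each point keeps its range from the sensor and its height, so two patches that differ only in azimuth map to the same coordinates.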

preprint, 2022, arXiv

Contrastive Siamese Network for Semi-supervised Speech Recognition

This paper introduces contrastive siamese (c-siam) network, an architecture for leveraging unlabeled acoustic data in speech recognition. c-siam is the first network that extracts high-level linguistic information from speech by matching outputs of two identical transformer encoders. It contains augmented and target branches which are trained by: (1) masking inputs and matching outputs with a contrastive loss, (2) incorporating a stop gradient operation on the target branch, (3) using an extra learnable transformation on the augmented branch, (4) introducing new temporal augment functions to prevent the shortcut learning problem. We use the Libri-light 60k unsupervised data and the LibriSpeech 100hrs/960hrs supervised data to compare c-siam and other best-performing systems. Our experiments show that c-siam provides 20% relative word error rate improvement over wav2vec baselines. A c-siam network with 450M parameters achieves competitive results compared to the state-of-the-art networks with 600M parameters.
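For readers unfamiliar with contrastive losses, the sketch below shows a generic InfoNCE-style loss over cosine similarities: the loss is small when the anchor is closer to its positive than to the negatives. This is an assumed, commonly used formulation for illustration, not necessarily the exact loss used by c-siam; `cosine`, `info_nce`, and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Negative log-probability of picking the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


if __name__ == "__main__":
    print(info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]]))
```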

preprint, 2022, arXiv

Cross-Image Relational Knowledge Distillation for Semantic Segmentation

Current Knowledge Distillation (KD) methods for semantic segmentation often guide the student to mimic the teacher's structured information generated from individual data samples. However, they ignore the global semantic relations among pixels across various images that are valuable for KD. This paper proposes a novel Cross-Image Relational KD (CIRKD), which focuses on transferring structured pixel-to-pixel and pixel-to-region relations among the whole images. The motivation is that a good teacher network could construct a well-structured feature space in terms of global pixel dependencies. CIRKD makes the student mimic better structured semantic relations from the teacher, thus improving the segmentation performance. Experimental results over Cityscapes, CamVid and Pascal VOC datasets demonstrate the effectiveness of our proposed approach against state-of-the-art distillation methods. The code is available at https://github.com/winycg/CIRKD.

preprint, 2022, arXiv

DNN-Driven Compressive Offloading for Edge-Assisted Semantic Video Segmentation

Deep learning has shown impressive performance in semantic segmentation, but it is still unaffordable for resource-constrained mobile devices. While offloading computation tasks is promising, the high traffic demands overwhelm the limited bandwidth. Existing compression algorithms are not fit for semantic segmentation, as the lack of obvious and concentrated regions of interest (RoIs) forces the adoption of uniform compression strategies, leading to low compression ratios or accuracy. This paper introduces STAC, a DNN-driven compression scheme tailored for edge-assisted semantic video segmentation. STAC is the first to exploit DNN's gradients as spatial sensitivity metrics for spatial adaptive compression and achieves superior compression ratio and accuracy. Yet, it is challenging to adapt this content-customized compression to videos. Practical issues include varying spatial sensitivity and huge bandwidth consumption for compression strategy feedback and offloading. We tackle these issues through a spatiotemporal adaptive scheme, which (1) takes partial strategy generation operations offline to reduce communication load, and (2) propagates compression strategies and segmentation results across frames through dense optical flow, and adaptively offloads keyframes to accommodate video content. We implement STAC on a commodity mobile device. Experiments show that STAC can save up to 20.95% of bandwidth without losing accuracy, compared to the state-of-the-art algorithm.

preprint, 2022, arXiv

Efficient and Robust 2D-to-BEV Representation Learning via Geometry-guided Kernel Transformer

Learning Bird's Eye View (BEV) representation from surrounding-view cameras is of great importance for autonomous driving. In this work, we propose a Geometry-guided Kernel Transformer (GKT), a novel 2D-to-BEV representation learning mechanism. GKT leverages the geometric priors to guide the transformer to focus on discriminative regions and unfolds kernel features to generate BEV representation. For fast inference, we further introduce a look-up table (LUT) indexing method to get rid of the camera's calibrated parameters at runtime. GKT can run at 72.3 FPS on a 3090 GPU / 45.6 FPS on a 2080ti GPU and is robust to the camera deviation and the predefined BEV height. GKT also achieves state-of-the-art real-time segmentation results, i.e., 38.0 mIoU (100m x 100m perception range at a 0.5m resolution) on the nuScenes val set. Given its efficiency, effectiveness, and robustness, GKT has great practical value in autopilot scenarios, especially for real-time running systems. Code and models will be available at https://github.com/hustvl/GKT.

preprint, 2022, arXiv

ELMformer: Efficient Raw Image Restoration with a Locally Multiplicative Transformer

To obtain high-quality raw images for the downstream Image Signal Process (ISP), we present an Efficient Locally Multiplicative Transformer, called ELMformer, for raw image restoration. ELMformer contains two core designs tailored to raw images, whose primitive attribute is being single-channel. The first design is a Bi-directional Fusion Projection (BFP) module, where we consider both the color characteristics of raw images and the spatial structure of the single channel. The second is a Locally Multiplicative Self-Attention (L-MSA) scheme that effectively delivers information from the local space to relevant parts. ELMformer can efficiently reduce computational consumption and perform well on raw image restoration tasks. Enhanced by these two core designs, ELMformer achieves the highest performance and keeps the lowest FLOPs on raw denoising and raw deblurring benchmarks compared with state-of-the-art methods. Extensive experiments demonstrate the superiority and generalization ability of ELMformer. On the SIDD benchmark, our method has even better denoising performance than ISP-based methods, which need a huge amount of additional sRGB training images. The code is released at https://github.com/leonmakise/ELMformer.

preprint, 2022, arXiv

Featurized Query R-CNN

The query mechanism introduced in the DETR method is changing the paradigm of object detection, and recently many query-based methods have obtained strong object detection performance. However, the current query-based detection pipelines suffer from the following two issues. Firstly, multi-stage decoders are required to optimize the randomly initialized object queries, incurring a large computation burden. Secondly, the queries are fixed after training, leading to unsatisfying generalization capability. To remedy the above issues, we present featurized object queries predicted by a query generation network in the well-established Faster R-CNN framework and develop a Featurized Query R-CNN. Extensive experiments on the COCO dataset show that our Featurized Query R-CNN obtains the best speed-accuracy trade-off among all R-CNN detectors, including the recent state-of-the-art Sparse R-CNN detector. The code is available at https://github.com/hustvl/Featurized-QueryRCNN.

preprint, 2022, arXiv

Forgery Attack Detection in Surveillance Video Streams Using Wi-Fi Channel State Information

Cybersecurity breaches expose surveillance video streams to forgery attacks, under which authentic streams are falsified to hide unauthorized activities. Traditional video forensics approaches can localize forgery traces using spatial-temporal analysis on relatively long video clips, but fall short in real-time forgery detection. Recent work correlates time-series camera and wireless signals to detect looped videos but cannot realize fine-grained forgery localization. To overcome these limitations, we propose Secure-Pose, which exploits the pervasive coexistence of surveillance and Wi-Fi infrastructures to defend against video forgery attacks in a real-time and fine-grained manner. We observe that coexisting camera and Wi-Fi signals convey common human semantic information and that forgery attacks on video streams will decouple such information correspondence. In particular, retrievable human pose features are first extracted from concurrent video and Wi-Fi channel state information (CSI) streams. Then, a lightweight detection network is developed to accurately discover forgery attacks and an efficient localization algorithm is devised to seamlessly track forgery traces in video streams. We implement Secure-Pose using one Logitech camera and two Intel 5300 NICs and evaluate it in different environments. Secure-Pose achieves a high detection accuracy of 98.7% and localizes abnormal objects under playback and tampering attacks.

preprint, 2022, arXiv

ICAF: Iterative Contrastive Alignment Framework for Multimodal Abstractive Summarization

Integrating multimodal knowledge for the abstractive summarization task is a work-in-progress research area, with present techniques inheriting the fusion-then-generation paradigm. Due to semantic gaps between computer vision and natural language processing, current methods often treat multiple data points as separate objects and rely on attention mechanisms to search for connections in order to fuse them. In addition, the lack of awareness of cross-modal matching in many frameworks leads to reduced performance. To address these two drawbacks, we propose an Iterative Contrastive Alignment Framework (ICAF) that uses recurrent alignment and contrast to capture the coherence between images and texts. Specifically, we design a recurrent alignment (RA) layer to gradually investigate fine-grained semantic relationships between image patches and text tokens. At each step during the encoding process, cross-modal contrastive losses are applied to directly optimize the embedding space. According to ROUGE, relevance scores, and human evaluation, our model outperforms the state-of-the-art baselines on the MSMO dataset. Experiments on the applicability of our proposed framework and on hyperparameter settings have also been conducted.

preprint, 2022, arXiv

Illumination-Invariant Active Camera Relocalization for Fine-Grained Change Detection in the Wild

Active camera relocalization (ACR) is a new problem in computer vision that significantly reduces the false alarms caused by image distortions due to camera pose misalignment in fine-grained change detection (FGCD). Despite the fruitful achievements that ACR can support, it remains a challenging problem because relative pose estimation is unstable, especially for outdoor scenes, where the lighting condition is uncontrolled, i.e., the two observations may have highly varied illuminations. This paper studies an illumination-invariant active camera relocalization method that improves both relative pose estimation and scale estimation. We use plane segments as an intermediate representation to facilitate feature matching, thus further boosting pose estimation robustness and reliability under lighting variations. Moreover, we construct a linear system to obtain the absolute scale in each ACR iteration by minimizing the image warping error, which significantly reduces the time consumption of the ACR process: it is nearly $1.6$ times faster than the state-of-the-art ACR strategy. Our work greatly expands the feasibility of real-world fine-grained change monitoring tasks for cultural heritage. Extensive experiments and real-world applications verify the effectiveness and robustness of the proposed pose estimation method for ACR tasks.

preprint2022arXiv

Learning Dynamic View Synthesis With Few RGBD Cameras

There have been significant advancements in dynamic novel view synthesis in recent years. However, current deep learning models often require (1) prior models (e.g., SMPL human models), (2) heavy pre-processing, or (3) per-scene optimization. We propose to utilize RGBD cameras to remove these limitations and synthesize free-viewpoint videos of dynamic indoor scenes. We generate feature point clouds from RGBD frames and then render them into free-viewpoint videos via a neural renderer. However, the inaccurate, unstable, and incomplete depth measurements induce severe distortions, flickering, and ghosting artifacts. We enforce spatial-temporal consistency via the proposed Cycle Reconstruction Consistency and Temporal Stabilization module to reduce these artifacts. We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views. Additionally, we present a Human-Things Interactions (HTI) dataset to validate our approach and facilitate future research. The dataset consists of 43 multi-view RGBD video sequences of everyday activities, capturing complex interactions between human subjects and their surroundings. Experiments on the HTI dataset show that our method outperforms the baseline in per-frame image fidelity and spatial-temporal consistency. We will release our code and the dataset on the website soon.

preprint2022arXiv

Learning Quality-aware Representation for Multi-person Pose Regression

Off-the-shelf single-stage multi-person pose regression methods generally leverage the instance score (i.e., confidence of the instance localization) to indicate the pose quality for selecting the pose candidates. We consider that there are two gaps in the existing paradigm: 1) The instance score is not well correlated with the pose regression quality. 2) The instance feature representation, which is used for predicting the instance score, does not explicitly encode the structural pose information needed to predict a reasonable score that represents pose regression quality. To address the aforementioned issues, we propose to learn a pose regression quality-aware representation. Concretely, for the first gap, instead of using the previous instance confidence label (e.g., discrete {1,0} or Gaussian representation) to denote the position and confidence of a person instance, we first introduce the Consistent Instance Representation (CIR), which unifies the pose regression quality score of the instance and the confidence of the background into a pixel-wise score map to calibrate the inconsistency between the instance score and pose regression quality. To fill the second gap, we further present the Query Encoding Module (QEM), including the Keypoint Query Encoding (KQE) to encode the positional and semantic information for each keypoint and the Pose Query Encoding (PQE), which explicitly encodes the predicted structural pose information to better fit the Consistent Instance Representation (CIR). By using the proposed components, we significantly alleviate the above gaps. Our method outperforms previous single-stage regression-based and even bottom-up methods and achieves the state-of-the-art result of 71.7 AP on the MS COCO test-dev set.

preprint2022arXiv

MixSKD: Self-Knowledge Distillation from Mixup for Image Recognition

Unlike the conventional Knowledge Distillation (KD), Self-KD allows a network to learn knowledge from itself without any guidance from extra networks. This paper proposes to perform Self-KD from image Mixture (MixSKD), which integrates these two techniques into a unified framework. MixSKD mutually distills feature maps and probability distributions between the random pair of original images and their mixup images in a meaningful way. Therefore, it guides the network to learn cross-image knowledge by modelling supervisory signals from mixup images. Moreover, we construct a self-teacher network by aggregating multi-stage feature maps for providing soft labels to supervise the backbone classifier, further improving the efficacy of self-boosting. Experiments on image classification and transfer learning to object detection and semantic segmentation demonstrate that MixSKD outperforms other state-of-the-art Self-KD and data augmentation methods. The code is available at https://github.com/winycg/Self-KD-Lib.
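The mixup operation that MixSKD builds on can be illustrated with a minimal sketch. This is not the authors' implementation (their code is at the repository linked above); it only shows the standard convex combination of two inputs and their label vectors. The function names `mixup_pair` and `sample_lambda`, and the default `alpha`, are assumptions made for illustration.

```python
import random


def sample_lambda(alpha=0.2):
    """Draw a mixing coefficient from Beta(alpha, alpha), a common mixup choice.

    The value of alpha is a hypothetical default, not taken from the abstract.
    """
    return random.betavariate(alpha, alpha)


def mixup_pair(x_a, x_b, y_a, y_b, lam):
    """Blend two flattened inputs and their label vectors with weight lam."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")

    def blend(u, v):
        # Element-wise convex combination: lam * u + (1 - lam) * v.
        return [lam * p + (1.0 - lam) * q for p, q in zip(u, v)]

    return blend(x_a, x_b), blend(y_a, y_b)


# Two toy "images" with one-hot labels, mixed with a fixed coefficient.
mixed_x, mixed_y = mixup_pair([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], 0.25)
```

In a self-distillation setting, the mixed input and the two original inputs would each be passed through the same network so that their feature maps and predictions can be compared; that part is omitted here.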

preprint2022arXiv

Modeling Complex Dependencies for Session-based Recommendations via Graph Neural Networks

Session-based recommendations (SBRs) capture items' dependencies from the sessions to recommend the next item. In recent years, graph neural network (GNN) based SBRs have become the mainstream of SBRs, benefiting from the superiority of GNNs in modeling complex dependencies. Most GNN-based SBRs rest on a strong assumption of adjacent dependency: any two adjacent items in a session are necessarily dependent. However, we argue that due to the uncertainty and complexity of user behaviors, adjacency does not necessarily indicate dependency. Because this assumption does not always hold in actual recommendation scenarios, it can easily lead to two drawbacks: (1) false dependencies occur in the session because there are adjacent but not really dependent items, and (2) true dependencies are missed because there are non-adjacent but actually dependent items. These drawbacks significantly affect item representation learning, degrading the downstream recommendation performance. To address these deficiencies, we propose a novel review-refined inter-item graph neural network (RI-GNN), which utilizes topic information extracted from the reviews of items to improve dependencies between items. Experiments on two public real-world datasets demonstrate that RI-GNN outperforms SOTA methods.

preprint2022arXiv

Multi-phase image segmentation by the Allen--Cahn Chan--Vese model

This paper proposes an Allen-Cahn Chan-Vese model for multi-phase image segmentation. We first integrate the Allen-Cahn term and the Chan-Vese fitting energy term to establish an energy functional, whose minimum locates the segmentation contour. The subsequent minimization process amounts to a variational calculation of the fitting intensities and the approximate solution of several Allen-Cahn equations, wherein $n$ Allen-Cahn equations are enough to partition $m = 2^n$ segments. The derived Allen-Cahn equations are solved by efficient numerical solvers with exponential time integration and finite difference space discretization. The discrete maximum bound principle and energy stability of the proposed numerical schemes are proved. Finally, the capability of our segmentation method is verified in various experiments on different types of images.
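The counting argument behind $m = 2^n$ can be illustrated with a small sketch: if each of the $n$ phase fields takes one of two states at a pixel, the combination of states identifies one of $2^n$ segments. The snippet below assumes, purely for illustration, that the two states are distinguished by the sign of the phase value (threshold 0); the paper's actual convention may differ, and `phase_label` is a hypothetical helper name.

```python
from itertools import product


def phase_label(phis, threshold=0.0):
    """Map n phase-field values at one pixel to a segment index in [0, 2**n).

    Bit k of the label is set when the k-th phase value exceeds the threshold.
    """
    label = 0
    for k, phi in enumerate(phis):
        if phi > threshold:
            label |= 1 << k
    return label


# With n = 2 phase fields, the four sign patterns give four distinct segments.
labels = sorted(phase_label(signs) for signs in product([-1.0, 1.0], repeat=2))
```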

preprint2022arXiv

Partially discontinuous nodal finite elements for $H(\mathrm{curl})$ and $H(\mathrm{div})$

We investigate discretization of $H(\mathrm{curl})$ and $H(\mathrm{div})$ in two and three space dimensions by partially discontinuous nodal finite elements, i.e., vector-valued Lagrange finite elements with discontinuity in certain directions. These spaces can be implemented as a combination of continuous and discontinuous Lagrange elements and fit in de Rham complexes. We construct well-conditioned nodal bases.

preprint2022arXiv

RACS2: A Framework of Remote Autonomous Control System for Telescope Observation and its application

As the demand for astronomical observation rises, telescope systems are becoming more and more complex. Thus, observatory control software needs to be more intelligent: it has to control each instrument inside the observatory, finish the observation tasks autonomously, and report the information to users if needed. We developed a distributed autonomous observatory control framework named Remote Autonomous Control System 2nd (RACS2) to meet these requirements. The RACS2 framework uses a decentralized distributed architecture; instrument control software and system services such as the observation control service are implemented as different components. The communication between components is implemented based on a high-performance serialization library and a lightweight messaging library. Interfaces to Python and the Experimental Physics and Industrial Control System (EPICS) are implemented, so the RACS2 framework can communicate with EPICS-based device control software and Python-based software. Several system components including log, executor, scheduler and other modules are developed to help observation. Observation tasks can be programmed in Python, and the plans are scheduled by the scheduler component to achieve autonomous observation. A set of web services is implemented based on the FastAPI framework, with which users can control and manage the framework remotely. Based on the RACS2 framework, we have implemented the DATs telescope's observation system and the space object observation system. We performed remote autonomous observation and received a large amount of data with these systems.

preprint2022arXiv

Sparse Instance Activation for Real-Time Instance Segmentation

In this paper, we propose a conceptually novel, efficient, and fully convolutional framework for real-time instance segmentation. Previously, most instance segmentation methods heavily rely on object detection and perform mask prediction based on bounding boxes or dense centers. In contrast, we propose a sparse set of instance activation maps, as a new object representation, to highlight informative regions for each foreground object. Then instance-level features are obtained by aggregating features according to the highlighted regions for recognition and segmentation. Moreover, based on bipartite matching, the instance activation maps can predict objects in a one-to-one style, thus avoiding non-maximum suppression (NMS) in post-processing. Owing to the simple yet effective designs with instance activation maps, SparseInst has extremely fast inference speed and achieves 40 FPS and 37.9 AP on the COCO benchmark, which significantly outperforms the counterparts in terms of speed and accuracy. Code and models are available at https://github.com/hustvl/SparseInst.

preprint2022arXiv

Vision-based Uneven BEV Representation Learning with Polar Rasterization and Surface Estimation

In this work, we propose PolarBEV for vision-based uneven BEV representation learning. To adapt to the foreshortening effect of camera imaging, we rasterize the BEV space both angularly and radially, and introduce polar embedding decomposition to model the associations among polar grids. Polar grids are rearranged to an array-like regular representation for efficient processing. Besides, to determine the 2D-to-3D correspondence, we iteratively update the BEV surface based on a hypothetical plane, and adopt height-based feature transformation. PolarBEV keeps real-time inference speed on a single 2080Ti GPU, and outperforms other methods for both BEV semantic segmentation and BEV instance segmentation. Thorough ablations are presented to validate the design. The code will be released at https://github.com/SuperZ-Liu/PolarBEV.
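Angular and radial rasterization of the BEV plane can be sketched as a mapping from a ground-plane point to an (angle bin, radius bin) pair, which is also what makes an array-like layout possible. The snippet below is a generic illustration with uniform bins, not the PolarBEV code; `polar_cell`, the bin counts, and `r_max` are hypothetical names and parameters.

```python
import math


def polar_cell(x, y, n_angle, n_radius, r_max):
    """Return (angle_bin, radius_bin) for a ground-plane point, or None if out of range.

    Angles are measured counter-clockwise from the +x axis and split into
    n_angle equal sectors; radii in [0, r_max) are split into n_radius rings.
    """
    r = math.hypot(x, y)
    if r >= r_max:
        return None
    theta = math.atan2(y, x) % (2.0 * math.pi)
    # Clamp to guard against theta rounding up to exactly 2*pi.
    angle_bin = min(int(theta * n_angle / (2.0 * math.pi)), n_angle - 1)
    radius_bin = int(r * n_radius / r_max)
    return angle_bin, radius_bin


# A point 5 m along the +y axis with 8 sectors and 10 rings up to 10 m.
cell = polar_cell(0.0, 5.0, n_angle=8, n_radius=10, r_max=10.0)
```

Because every sector holds the same number of rings, the cells can be stored in an `n_angle` by `n_radius` array even though their physical area grows with distance from the camera.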

preprint2022arXiv

Weakly-supervised 3D Human Pose Estimation with Cross-view U-shaped Graph Convolutional Network

Although monocular 3D human pose estimation methods have made significant progress, the problem is far from being solved due to the inherent depth ambiguity. Instead, exploiting multi-view information is a practical way to achieve absolute 3D human pose estimation. In this paper, we propose a simple yet effective pipeline for weakly-supervised cross-view 3D human pose estimation. By only using two camera views, our method can achieve state-of-the-art performance in a weakly-supervised manner, requiring no 3D ground truth but only 2D annotations. Specifically, our method contains two steps: triangulation and refinement. First, given the 2D keypoints that can be obtained through any classic 2D detection method, triangulation is performed across two views to lift the 2D keypoints into coarse 3D poses. Then, a novel cross-view U-shaped graph convolutional network (CV-UGCN), which can explore the spatial configurations and cross-view correlations, is designed to refine the coarse 3D poses. In particular, the refinement process is achieved through weakly-supervised learning, in which geometric and structure-aware consistency checks are performed. We evaluate our method on the standard benchmark dataset, Human3.6M. The Mean Per Joint Position Error on the benchmark dataset is 27.4 mm, which outperforms existing state-of-the-art methods remarkably (27.4 mm vs 30.2 mm).
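The two-view lifting step can be illustrated with a midpoint triangulation sketch. The abstract does not say which triangulation variant is used, so this is an assumed stand-in: each 2D keypoint is taken to have already been back-projected into a viewing ray (camera center plus direction), and the 3D estimate is the midpoint of the shortest segment between the two rays. `triangulate_midpoint` is a hypothetical helper name.

```python
def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Estimate a 3D point from two viewing rays c1 + s*d1 and c2 + t*d2."""

    def dot(u, v):
        return sum(p * q for p, q in zip(u, v))

    w = [p - q for p, q in zip(c1, c2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        raise ValueError("rays are (nearly) parallel; depth is not recoverable")
    # Closed-form minimizer of |c1 + s*d1 - c2 - t*d2|^2 over s and t.
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1 = [ci + s * di for ci, di in zip(c1, d1)]
    p2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [(u + v) / 2.0 for u, v in zip(p1, p2)]


# Two cameras 2 m apart, both looking at a joint located at (1, 1, 5).
joint = triangulate_midpoint([0.0, 0.0, 0.0], [1.0, 1.0, 5.0],
                             [2.0, 0.0, 0.0], [-1.0, 1.0, 5.0])
```

With noisy 2D detections the two rays do not intersect, so the midpoint is only a coarse estimate, which is the kind of input a refinement network would then improve.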

preprint2021arXiv

Causality-Aware Neighborhood Methods for Recommender Systems

The business objectives of recommenders, such as increasing sales, are aligned with the causal effect of recommendations. Previous recommenders targeting the causal effect employ inverse propensity scoring (IPS) from causal inference. However, IPS is prone to high variance. The matching estimator is another representative method in the causal inference field. It does not use propensity and is hence free from the above variance problem. In this work, we unify traditional neighborhood recommendation methods with the matching estimator, and develop robust ranking methods for the causal effect of recommendations. Our experiments demonstrate that the proposed methods outperform various baselines in ranking metrics for the causal effect. The results suggest that the proposed methods can achieve more sales and user engagement than previous recommenders.
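The contrast between the two estimators can be made concrete with a textbook-style sketch. It is not the paper's ranking method: `ips_effect` is the standard IPS estimate of an average treatment effect, and `matching_effect` is a plain one-nearest-neighbor matching estimate on a single covariate. Both function names and the toy data are assumptions for illustration.

```python
def ips_effect(outcomes, treated, propensities):
    """Average effect via inverse propensity scoring.

    Dividing by a propensity close to 0 or 1 inflates individual terms,
    which is the source of the high variance mentioned above.
    """
    n = len(outcomes)
    treated_term = sum(y * z / p for y, z, p in zip(outcomes, treated, propensities)) / n
    control_term = sum(y * (1 - z) / (1 - p)
                       for y, z, p in zip(outcomes, treated, propensities)) / n
    return treated_term - control_term


def matching_effect(treated_units, control_units):
    """Average effect on the treated via 1-nearest-neighbor matching.

    Each unit is a (covariate, outcome) pair; no propensity is needed.
    """
    diffs = []
    for x_t, y_t in treated_units:
        _, y_c = min(control_units, key=lambda unit: abs(unit[0] - x_t))
        diffs.append(y_t - y_c)
    return sum(diffs) / len(diffs)


effect_ips = ips_effect([1, 1, 1, 0], [1, 1, 0, 0], [0.5, 0.5, 0.5, 0.5])
effect_match = matching_effect([(1.0, 1.0), (2.0, 1.0)], [(1.1, 1.0), (2.2, 0.0)])
```

In a recommender, "treated" would correspond to an item having been recommended and the nearest neighbors to similar users or items, which is where the link to neighborhood methods comes from.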

preprint2021arXiv

Efficient Fuzz Testing for Apache Spark Using Framework Abstraction

The emerging data-intensive applications are increasingly dependent on data-intensive scalable computing (DISC) systems, such as Apache Spark, to process large data. Despite their popularity, DISC applications are hard to test. In recent years, fuzz testing has been remarkably successful; however, it is nontrivial to apply such traditional fuzzing to big data analytics directly because: (1) the long latency of DISC systems prohibits the applicability of fuzzing, and (2) conventional branch coverage is unlikely to identify application logic from the DISC framework implementation. We devise a novel fuzz testing tool called BigFuzz that automatically generates concrete data for an input Apache Spark program. The key essence of our approach is that we abstract the dataflow behavior of the DISC framework with executable specifications and we design schema-aware mutations based on common error types in DISC applications. Our experiments show that compared to random fuzzing, BigFuzz is able to speed up the fuzzing time by 1477X, improves application code coverage by 271%, and achieves 157% improvement in detecting application errors. The demonstration video of BigFuzz is available at https://www.youtube.com/watch?v=YvYQISILQHs&feature=youtu.be.

preprint2021arXiv

Forest R-CNN: Large-Vocabulary Long-Tailed Object Detection and Instance Segmentation

Despite the previous success of object analysis, detecting and segmenting a large number of object categories with a long-tailed data distribution remains a challenging problem and is less investigated. For a large-vocabulary classifier, the chance of obtaining noisy logits is much higher, which can easily lead to a wrong recognition. In this paper, we exploit prior knowledge of the relations among object categories to cluster fine-grained classes into coarser parent classes, and construct a classification tree that is responsible for parsing an object instance into a fine-grained category via its parent class. In the classification tree, as the number of parent class nodes is significantly smaller, their logits are less noisy and can be utilized to suppress the wrong/noisy logits in the fine-grained class nodes. As the way to construct the parent class is not unique, we further build multiple trees to form a classification forest where each tree contributes its vote to the fine-grained classification. To alleviate the imbalanced learning caused by the long-tail phenomenon, we propose a simple yet effective resampling method, NMS Resampling, to re-balance the data distribution. Our method, termed Forest R-CNN, can serve as a plug-and-play module applied to most object recognition models for recognizing more than 1000 categories. Extensive experiments are performed on the large vocabulary dataset LVIS. Compared with the Mask R-CNN baseline, Forest R-CNN significantly boosts the performance with 11.5% and 3.9% AP improvements on the rare categories and overall categories, respectively. Moreover, we achieve state-of-the-art results on the LVIS dataset. Code is available at https://github.com/JialianW/Forest_RCNN.

preprint2021arXiv

High Fidelity Face Manipulation with Extreme Poses and Expressions

Face manipulation has shown remarkable advances with the flourish of Generative Adversarial Networks. However, due to the difficulties of controlling structures and textures, it is challenging to model poses and expressions simultaneously, especially for the extreme manipulation at high-resolution. In this paper, we propose a novel framework that simplifies face manipulation into two correlated stages: a boundary prediction stage and a disentangled face synthesis stage. The first stage models poses and expressions jointly via boundary images. Specifically, a conditional encoder-decoder network is employed to predict the boundary image of the target face in a semi-supervised way. Pose and expression estimators are introduced to improve the prediction performance. In the second stage, the predicted boundary image and the input face image are encoded into the structure and the texture latent space by two encoder networks, respectively. A proxy network and a feature threshold loss are further imposed to disentangle the latent space. Furthermore, due to the lack of high-resolution face manipulation databases to verify the effectiveness of our method, we collect a new high-quality Multi-View Face (MVF-HQ) database. It contains 120,283 images at 6000x4000 resolution from 479 identities with diverse poses, expressions, and illuminations. MVF-HQ is much larger in scale and much higher in resolution than publicly available high-resolution face manipulation databases. We will release MVF-HQ soon to push forward the advance of face manipulation. Qualitative and quantitative experiments on four databases show that our method dramatically improves the synthesis quality.

preprint2021arXiv

Pressure-Driven Magneto-Topological Phase Transition in a magnetic Weyl semimetal

The co-occurrence of phase transitions with local and global order parameters, such as the entangled magnetization and topological invariant, is attractive but has been seldom realized experimentally. Here, by using high-pressure in-situ X-ray diffraction, high-pressure electric transport measurements and high-pressure first-principles calculations, we report a magneto-topological phase transition, i.e., the phenomenon of magnetic materials undergoing different magnetic and topological phases during the process of pressure loading, in a recently discovered magnetic Weyl semimetal Co3Sn2S2. By considering both out-of-plane ferromagnetic and in-plane anti-ferromagnetic components, the calculated results fit the experimental data well. The calculation results further reveal a pristine Weyl phase with four more pairs of Weyl nodes under low pressures, and a generally-defined Z2 topological insulator phase after the restoration of time-reversal symmetry. Remarkably, the present magneto-topological phase transition involves a pair of crossing bands of two spin channels becoming degenerate. Thus, all the chiral Weyl nodes annihilate with their counterparts from another spin channel, in contrast to the typical annihilation of Weyl pairs from the same bands in inversion-asymmetric systems. Our experiments and theoretical calculations uncover a way to modulate the diverse topological states by controlling the internal exchange splitting via external physical knobs in topological magnets.

preprint2021arXiv

Record high $T_{\rm c}$ and robust superconductivity in transition metal $δ$-Ti phase at megabar pressure

We report a record high superconducting transition temperature ($T_{\rm c}$) up to 23.6 K under high pressure in the elemental metal Ti, one of the top ten most abundant elements in Earth's crust. The $T_{\rm c}$ increases monotonically from 2.3 K at 40.3 GPa to 23.6 K at 144.9 GPa, which surpasses all known records from elemental metals reported so far. With further compression, a robust $T_{\rm c}$ of ~23 K is observed between 144.9 and 183 GPa in the $δ$-Ti phase. The pressure-dependent $T_{\rm c}$ can be well described by the conventional electron-phonon coupling (EPC) mechanism. Density Functional Theory calculations show the Fermi nesting and the phonon softening of optical branches at the $γ$-Ti to $δ$-Ti phase transition pressure enhance EPC, which results in the record high $T_{\rm c}$. We attribute the robust superconductivity in $δ$-Ti to the apparent robustness of its strong EPC against lattice compression. These results provide new insight into exploring new high-$T_{\rm c}$ elemental metals and Ti-based superconducting alloys.

preprint2021arXiv

Rethinking Soft Labels for Knowledge Distillation: A Bias-Variance Tradeoff Perspective

Knowledge distillation is an effective approach to leverage a well-trained network or an ensemble of them, named the teacher, to guide the training of a student network. The outputs from the teacher network are used as soft labels for supervising the training of a new network. Recent studies (Müller et al., 2019; Yuan et al., 2020) revealed an intriguing property of the soft labels: making labels soft serves as a good regularization for the student network. From the perspective of statistical learning, regularization aims to reduce the variance; however, how bias and variance change is not clear for training with soft labels. In this paper, we investigate the bias-variance tradeoff brought by distillation with soft labels. Specifically, we observe that during training the bias-variance tradeoff varies sample-wise. Further, under the same distillation temperature setting, we observe that the distillation performance is negatively associated with the number of some specific samples, which are named regularization samples since these samples lead to increasing bias and decreasing variance. Nevertheless, we empirically find that completely filtering out regularization samples also deteriorates distillation performance. Our discoveries inspired us to propose novel weighted soft labels to help the network adaptively handle the sample-wise bias-variance tradeoff. Experiments on standard evaluation benchmarks validate the effectiveness of our method. Our code is available at https://github.com/bellymonster/Weighted-Soft-Label-Distillation.
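How a distillation temperature produces soft labels, and where a per-sample weight would enter the loss, can be shown with a minimal sketch. This is the generic temperature-scaled distillation loss, not the authors' weighting scheme: the abstract does not state how the weights are computed, so `weight` is left as a plain argument. The function names and the default temperature are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer labels."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_label_loss(student_logits, teacher_logits, temperature=4.0, weight=1.0):
    """Cross-entropy between teacher soft labels and student predictions.

    `weight` stands in for a per-sample weight; its computation is not
    specified here and is left to the caller.
    """
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    cross_entropy = -sum(t * math.log(s) for t, s in zip(teacher, student))
    return weight * cross_entropy


hard = softmax([2.0, 1.0, 0.1], temperature=1.0)
soft = softmax([2.0, 1.0, 0.1], temperature=4.0)
loss = soft_label_loss([1.5, 0.5, 0.2], [2.0, 1.0, 0.1])
```

Setting `weight` below 1 for a sample reduces the influence of its soft label without removing it, which matches the observation above that filtering such samples out entirely also hurts.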

preprint2021arXiv

Towards Cross-Modal Forgery Detection and Localization on Live Surveillance Videos

The cybersecurity breaches render surveillance systems vulnerable to video forgery attacks, under which authentic live video streams are tampered to conceal illegal human activities under surveillance cameras. Traditional video forensics approaches can detect and localize forgery traces in each video frame using computationally-expensive spatial-temporal analysis, while falling short in real-time verification of live video feeds. The recent work correlates time-series camera and wireless signals to recognize replayed surveillance videos using event-level timing information but it cannot realize fine-grained forgery detection and localization on each frame. To fill this gap, this paper proposes Secure-Pose, a novel cross-modal forgery detection and localization system for live surveillance videos using WiFi signals near the camera spot. We observe that coexisting camera and WiFi signals convey common human semantic information and the presence of forgery attacks on video frames will decouple such information correspondence. Secure-Pose extracts effective human pose features from synchronized multi-modal signals and detects and localizes forgery traces under both inter-frame and intra-frame attacks in each frame. We implement Secure-Pose using a commercial camera and two Intel 5300 NICs and evaluate it in real-world environments. Secure-Pose achieves a high detection accuracy of 95.1% and can effectively localize tampered objects under different forgery attacks.

preprint2020arXiv

A priori and a posteriori error estimates for the quad-curl eigenvalue problem

In this paper, we propose a new family of $H(\mathrm{curl}^2)$-conforming elements for the quad-curl eigenvalue problem in 2D. The accuracy of this family is one order higher than that in [32]. We prove a priori and a posteriori error estimates. The a priori estimate of the eigenvalue with a convergence order $2(s-1)$ is obtained if the eigenvector $u \in H^{s+1}(\Omega)$. For the a posteriori estimate, by analyzing the associated source problem, we obtain lower and upper bounds for the eigenvector in an energy norm and an upper bound for the eigenvalues. Numerical examples are presented for validation.

preprint2020arXiv

Active Lighting Recurrence by Parallel Lighting Analogy for Fine-Grained Change Detection

This paper studies a new problem, namely active lighting recurrence (ALR), which physically relocalizes a light source to reproduce the lighting condition from a single reference image of the same scene, which may undergo fine-grained changes between the two observations. ALR is of great importance for fine-grained visual inspection and change detection, because some phenomena or minute changes can only be clearly observed under particular lighting conditions. Therefore, effective ALR should be able to navigate a light source online toward the target pose, which is challenging due to the complexity and diversity of real-world lighting and imaging processes. To this end, we propose to use simple parallel lighting as an analogy model and, based on the Lambertian law, to compose an instant navigation ball for this purpose. We theoretically prove the feasibility, i.e., equivalence and convergence, of this ALR approach for realistic near point light sources and small near surface light sources. Besides, we also theoretically prove the invariance of our ALR approach to the ambiguity of normal and lighting decomposition. The effectiveness and superiority of the proposed approach have been verified by both extensive quantitative experiments and challenging real-world tasks on fine-grained change detection of cultural heritage. We also validate the generality of our approach to non-Lambertian scenes.

preprint2020arXiv

Authenticating On-Body IoT Devices: An Adversarial Learning Approach

By adding users as a new dimension to connectivity, on-body Internet-of-Things (IoT) devices have gained considerable momentum in recent years, while raising serious privacy and safety issues. Existing approaches to authenticate these devices limit themselves to dedicated sensors or specified user motions, undermining their widespread acceptance. This paper overcomes these limitations with a general authentication solution by integrating wireless physical layer (PHY) signatures with upper-layer protocols. The key enabling techniques are constructing representative radio propagation profiles from received signals, and developing an adversarial multi-player neural network to accurately recognize underlying radio propagation patterns and facilitate on-body device authentication. Once hearing a suspicious transmission, our system triggers a PHY-based challenge-response protocol to defend in depth against active attacks. We prove that at equilibrium, our adversarial model can extract all information about propagation patterns and eliminate any irrelevant information caused by motion variances and environment changes. We build a prototype of our system using Universal Software Radio Peripheral (USRP) devices and conduct extensive experiments with various static and dynamic body motions in typical indoor and outdoor environments. The experimental results show that our system achieves an average authentication accuracy of 91.6%, with a high area under the receiver operating characteristic curve (AUROC) of 0.96 and a better generalization performance compared with the conventional non-adversarial approach.

preprint2020arXiv

AutoPose: Searching Multi-Scale Branch Aggregation for Pose Estimation

We present AutoPose, a novel neural architecture search (NAS) framework that is capable of automatically discovering multiple parallel branches of cross-scale connections towards accurate and high-resolution 2D human pose estimation. Recently, high-performance hand-crafted convolutional networks for pose estimation show growing demands on multi-scale fusion and high-resolution representations. However, current NAS works exhibit limited flexibility in scale searching: they predominantly adopt simplified search spaces of single-branch architectures. Such simplification limits the fusion of information at different scales and fails to maintain high-resolution representations. The presented AutoPose framework is able to search for multi-branch scales and network depth, in addition to the cell-level microstructure. Motivated by the search space, a novel bi-level optimization method is presented, where the network-level architecture is searched via reinforcement learning, and the cell-level search is conducted by a gradient-based method. Within 2.5 GPU days, AutoPose is able to find very competitive architectures on the MS COCO dataset that are also transferable to the MPII dataset. Our code is available at https://github.com/VITA-Group/AutoPose.

preprint2020arXiv

Boundary Guidance Hierarchical Network for Real-Time Tongue Segmentation

Automated tongue segmentation in tongue images is a challenging task for two reasons: 1) there are many pathological details on the tongue surface, which affect the extraction of the boundary; 2) the shapes of the tongues captured from various persons (with different diseases) are quite different. To deal with this challenge, a novel end-to-end Boundary Guidance Hierarchical Network (BGHNet) with a new hybrid loss is proposed in this paper. In the new approach, a Context Feature Encoder Module (CFEM) is first built upon the bottom-up pathway to counter the shrinkage of the receptive field. Second, a novel hierarchical recurrent feature fusion module (HRFFM) is adopted to progressively and hierarchically refine object maps to recover image details by integrating local context information. Finally, the proposed hybrid loss, defined over a four-level hierarchy (pixel, patch, map, and boundary), guides the network to effectively segment the tongue regions and produce accurate tongue boundaries. BGHNet is applied to a set of tongue images. The experimental results suggest that the proposed approach can achieve state-of-the-art tongue segmentation performance. Meanwhile, the lightweight network contains only 15.45M parameters and requires only 11.22 GFLOPS.

preprint2020arXiv

Cardiovascular risk and work stress in biomedical researchers in China: An observational, big data study protocol

Introduction: Internet technologies could strengthen data collection and integration and have been used extensively in public health research. It is necessary to apply this technology to further investigate the behaviour and health of biomedical researchers. A browser-based extension was developed by researchers and clinicians to promote the collection and analysis of researchers' behavioural and psychological data. This protocol illustrates an observational study aimed at (1) characterising the health status of biomedical researchers in China and assessing work stress, job satisfaction, role conflict, role ambiguity, and family support; (2) identifying the association between work, behaviour, and health; and (3) investigating the association between behaviour and mental status. Our findings will contribute to the understanding of the influences of job, work environment, and family support on the mental and physical health of biomedical researchers. Methods and analysis: This is a prospective observational study; all candidates will be recruited from China. Participants will install an extension on their Internet browsers, which will collect data when they are accessing PubMed. A web-based survey will be sent to the user interfaces every 6 months that will involve sociodemographic variables, perceived stress scale, job satisfaction scale, role conflict and ambiguity scale, and family support scale. Machine-learning algorithms will analyse the data generated during daily access. Ethics and dissemination: This study received ethical approval from the ethics committee of the Shanghai Children's Medical Centre (reference number SCMCIRB-K2018082). Study results will be disseminated through peer-reviewed publications and conference presentations.

preprint2020arXiv

Densely Connected Search Space for More Flexible Neural Architecture Search

Neural architecture search (NAS) has dramatically advanced the development of neural network design. We revisit the search space design in most previous NAS methods and find the number and widths of blocks are set manually. However, block counts and block widths determine the network scale (depth and width) and greatly influence both the accuracy and the model cost (FLOPs/latency). In this paper, we propose to search block counts and block widths by designing a densely connected search space, i.e., DenseNAS. The new search space is represented as a dense super network, which is built upon our designed routing blocks. In the super network, routing blocks are densely connected and we search for the best path between them to derive the final architecture. We further propose a chained cost estimation algorithm to approximate the model cost during the search. Both the accuracy and model cost are optimized in DenseNAS. For experiments on the MobileNetV2-based search space, DenseNAS achieves 75.3% top-1 accuracy on ImageNet with only 361M FLOPs and 17.9ms latency on a single TITAN-XP. The larger model searched by DenseNAS achieves 76.1% accuracy with only 479M FLOPs. DenseNAS further promotes the ImageNet classification accuracies of ResNet-18, -34 and -50-B by 1.5%, 0.5% and 0.3% with 200M, 600M and 680M FLOPs reduction respectively. The related code is available at https://github.com/JaminFong/DenseNAS.

preprint2020arXiv

Enabling Low-Power OFDM for IoT by Exploiting Asymmetric Clock Rates

Conventional high-speed Wi-Fi has recently become a contender for low-power Internet-of-Things (IoT) communications. OFDM continues its adoption in the new IoT Wi-Fi standard due to its spectrum efficiency, which can support the demand of massive IoT connectivity. While the IoT Wi-Fi standard offers many new features to improve power and spectrum efficiency, the basic physical layer (PHY) structure of transceiver design still conforms to its conventional design rationale where access points (AP) and clients employ the same OFDM PHY. In this paper, we argue that current Wi-Fi PHY design does not take full advantage of the inherent asymmetry between AP and IoT devices. To fill the gap, we propose an asymmetric design where IoT devices transmit uplink packets using the lowest power while pushing all the decoding burdens to the AP side. Such a design utilizes the sufficient power and computational resources at the AP to trade for the transmission (TX) power of IoT devices. The core technique enabling this asymmetric design is that the AP takes full advantage of its high clock rate to boost the decoding ability. We provide an implementation of our design and show that it can reduce the IoT device's TX power by up to 88% when the AP uses an $8\times$ clock rate.

preprint2020arXiv

Fast Neural Network Adaptation via Parameter Remapping and Architecture Search

Deep neural networks achieve remarkable performance in many computer vision tasks. Most state-of-the-art (SOTA) semantic segmentation and object detection approaches reuse neural network architectures designed for image classification as the backbone, commonly pre-trained on ImageNet. However, performance gains can be achieved by designing network architectures specifically for detection and segmentation, as shown by recent neural architecture search (NAS) research for detection and segmentation. One major challenge though, is that ImageNet pre-training of the search space representation (a.k.a. super network) or the searched networks incurs huge computational cost. In this paper, we propose a Fast Neural Network Adaptation (FNA) method, which can adapt both the architecture and parameters of a seed network (e.g. a high performing manually designed backbone) to become a network with different depth, width, or kernels via a Parameter Remapping technique, making it possible to utilize NAS for detection/segmentation tasks a lot more efficiently. In our experiments, we conduct FNA on MobileNetV2 to obtain new networks for both segmentation and detection that clearly out-perform existing networks designed both manually and by NAS. The total computation cost of FNA is significantly less than SOTA segmentation/detection NAS approaches: 1737$\times$ less than DPC, 6.8$\times$ less than Auto-DeepLab and 7.4$\times$ less than DetNAS. The code is available at https://github.com/JaminFong/FNA.

preprint2020arXiv

FasterSeg: Searching for Faster Real-time Semantic Segmentation

We present FasterSeg, an automatically designed semantic segmentation network with not only state-of-the-art performance but also faster speed than current methods. Utilizing neural architecture search (NAS), FasterSeg is discovered from a novel and broader search space integrating multi-resolution branches, which have recently been found to be vital in manually designed segmentation models. To better calibrate the balance between the goals of high accuracy and low latency, we propose a decoupled and fine-grained latency regularization, which effectively overcomes the phenomenon we observed, in which the searched networks are prone to "collapsing" to low-latency yet poor-accuracy models. Moreover, we seamlessly extend FasterSeg to a new collaborative search (co-searching) framework, simultaneously searching for a teacher and a student network in the same single run. The teacher-student distillation further boosts the student model's accuracy. Experiments on popular segmentation benchmarks demonstrate the competency of FasterSeg. For example, FasterSeg can run over 30% faster than the closest manually designed competitor on Cityscapes, while maintaining comparable accuracy.

preprint2020arXiv

Half-Heusler thermoelectric materials: NMR studies

We report $^{59}$Co, $^{93}$Nb, and $^{121}$Sb nuclear magnetic resonance (NMR) measurements combined with density functional theory (DFT) calculations on a series of half-Heusler semiconductors, including NbCoSn, ZrCoSb, TaFeSb and NbFeSb, to better understand their electronic properties and general composition-dependent trends. These materials are of interest as potentially high efficiency thermoelectric materials. Compared to the other materials, we find that ZrCoSb tends to have a relatively large amount of local disorder, apparently antisite defects. This contributes to a small excitation gap corresponding to an impurity band near the band edge. In NbCoSn and TaFeSb, Curie-Weiss-type behavior is revealed, which indicates a small density of interacting paramagnetic defects. Very large paramagnetic chemical shifts are observed associated with a Van Vleck mechanism due to closely spaced $d$ bands splitting between the conduction and valence bands. Meanwhile, DFT methods were generally successful in reproducing the chemical shift trend for these half-Heusler materials, and we identify an enhancement of the larger-magnitude shifts, which we connect to electron interaction effects. The general trend is connected to changes in $d$-electron hybridization across the series.

preprint2020arXiv

Learning Where to Focus for Efficient Video Object Detection

Transferring existing image-based detectors to video is non-trivial since the quality of frames is often degraded by partial occlusion, rare poses, and motion blur. Previous approaches propagate and aggregate features across video frames by using optical flow-warping. However, directly applying image-level optical flow onto the high-level features might not establish accurate spatial correspondences. Therefore, a novel module called Learnable Spatio-Temporal Sampling (LSTS) is proposed to learn semantic-level correspondences among adjacent frame features accurately. The sampled locations are first randomly initialized, then updated iteratively to find progressively better spatial correspondences guided by detection supervision. In addition, a Sparsely Recursive Feature Updating (SRFU) module and a Dense Feature Aggregation (DFA) module are introduced to model temporal relations and enhance per-frame features, respectively. Without bells and whistles, the proposed method achieves state-of-the-art performance on the ImageNet VID dataset with less computational complexity and real-time speed. Code will be made available at https://github.com/jiangzhengkai/LSTS.

preprint2020arXiv

Role of rotational coherence in femtosecond-pulse-driven nitrogen ion lasing

We experimentally investigated the rotationally resolved polarization characteristics of N$_2^+$ lasing at 391 and 428 nm using a pump-seed scheme. By varying the relative angle between the linear polarizations of the pump and seed, it is found that the polarizations of the P and R branches of 391-nm lasing are counter-rotated. By contrast, both branches of 428-nm lasing remain polarized along the pump. The origin of the puzzling abnormal polarization characteristics is found based on a complete physical model that simultaneously includes the transient photoionization and the subsequent coupling among the electronic, vibrational and rotational quantum states of ions. It suggests that the cascaded resonant Raman processes following ionization create negative coherence between the rotational states of $J$ and $J$+2 in the ionic ground state X$^2Σ_g^+(ν=0)$, which leads to mirror-symmetrical polarization for the P and R branches of 391-nm lasing. Both the experiment and theory indicate that the demonstrated rotational coherence plays a pivotal role in clarifying the gain mechanism of N$_2^+$ lasing and opens up the route toward quantum optics under ultrafast strong fields.

preprint2020arXiv

SiamParseNet: Joint Body Parsing and Label Propagation in Infant Movement Videos

General movement assessment (GMA) of infant movement videos (IMVs) is an effective method for the early detection of cerebral palsy (CP) in infants. Automated body parsing is a crucial step towards computer-aided GMA, in which infant body parts are segmented and tracked over time for movement analysis. However, acquiring fully annotated data for video-based body parsing is particularly expensive due to the large number of frames in IMVs. In this paper, we propose a semi-supervised body parsing model, termed SiamParseNet (SPN), to jointly learn single-frame body parsing and label propagation between frames. The Siamese-structured SPN consists of a shared feature encoder, followed by two separate branches: one for intra-frame body part segmentation, and one for inter-frame label propagation. The two branches are trained jointly, taking pairs of frames from the same videos as their input. An adaptive training process is proposed that alternates training modes between using input pairs of only labeled frames and using inputs of both labeled and unlabeled frames. During testing, we employ a multi-source inference mechanism, where the final result for a test frame is obtained either via the segmentation branch or via propagation from a nearby key frame. We conduct extensive experiments on a partially-labeled IMV dataset where SPN outperforms all prior art, demonstrating the effectiveness of our proposed method.

preprint2020arXiv

Survey of the Detection and Classification of Pulmonary Lesions via CT and X-Ray

In recent years, the prevalence of several pulmonary diseases, especially the coronavirus disease 2019 (COVID-19) pandemic, has attracted worldwide attention. These diseases can be effectively diagnosed and treated with the help of lung imaging. With the development of deep learning technology and the emergence of many public medical image datasets, the diagnosis of lung diseases via medical imaging has been further improved. This article reviews pulmonary CT and X-ray image detection and classification in the last decade. It also provides an overview of the detection of lung nodules, pneumonia, and other common lung lesions based on the imaging characteristics of various lesions. Furthermore, this review introduces 26 commonly used public medical image datasets, summarizes the latest technology, and discusses current challenges and future research directions.

preprint2020arXiv

The oxygen partial pressure in solid oxide electrolysis cells with two layer electrolytes

A number of degradation mechanisms have been observed during the long-term operation of solid oxide electrolysis cells (SOEC). Using an electrolyte charge carrier transport model, we quantify the oxygen potentials across the electrolyte and thereby provide insights into these degradation mechanisms. Our model describes the transport of charge carriers in the electrolyte when the oxygen partial pressure is extremely low by accounting for the spatial variation of the concentration of oxygen vacancies in the electrolyte. Moreover, we identify four quantities that characterize the distribution of oxygen partial pressure in the electrolyte, which are directly related to the degradation mechanisms in the electrolyte as well: the two oxygen partial pressures at the interfaces of the electrodes and the electrolyte, the oxygen partial pressure at the interface of YSZ/GDC, and the position of the abrupt change in oxygen potential near the p-n junction that develops in YSZ when one side of the cell is exposed to fuel (low oxygen potential, n-type conduction) and the other side is exposed to oxidant (high oxygen potential, p-type conduction). We give analytical estimates for all of these quantities. These analytical expressions provide guidance on the parameters that need to be controlled to suppress the degradation observed in the electrolyte. In addition, the effects of operating conditions, particularly current density and operating temperature, on degradation are discussed.

preprint2020arXiv

Three families of grad-div-conforming finite elements

Several smooth finite element de Rham complexes are constructed in three-dimensional space, which yield three families of grad-div conforming finite elements. The simplest element has only 8 degrees of freedom (DOFs) for a tetrahedron and 14 DOFs for a cuboid. These elements naturally lead to conforming approximations to quad-div problems. Numerical experiments for each family validate the correctness and efficiency of the elements for solving the quad-div problem.

preprint2020arXiv

Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

In this paper we present an end-to-end speech recognition model with Transformer encoders that can be used in a streaming speech recognition system. Transformer computation blocks based on self-attention are used to encode both audio and label sequences independently. The activations from both audio and label encoders are combined with a feed-forward layer to compute a probability distribution over the label space for every combination of acoustic frame position and label history. This is similar to the Recurrent Neural Network Transducer (RNN-T) model, which uses RNNs for information encoding instead of Transformer encoders. The model is trained with the RNN-T loss, which is well-suited to streaming decoding. We present results on the LibriSpeech dataset showing that limiting the left context for self-attention in the Transformer layers makes decoding computationally tractable for streaming, with only a slight degradation in accuracy. We also show that the full attention version of our model beats the state-of-the-art accuracy on the LibriSpeech benchmarks. Our results also show that we can bridge the gap between full attention and limited attention versions of our model by attending to a limited number of future frames.
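
The limited-context self-attention mentioned in the abstract above can be illustrated with a small masking sketch. This is only an illustration of the general idea, not the authors' implementation: the function name `limited_context_mask` and the parameters `left_context` and `right_context` are hypothetical, and the frame counts are arbitrary.

```python
def limited_context_mask(num_frames, left_context, right_context=0):
    """Build a boolean attention mask (hypothetical helper).

    mask[i][j] is True when frame i is allowed to attend to frame j,
    i.e. when j lies within `left_context` frames before i and
    `right_context` frames after i.
    """
    mask = []
    for i in range(num_frames):
        lo = max(0, i - left_context)                # earliest visible frame
        hi = min(num_frames - 1, i + right_context)  # latest visible frame
        mask.append([lo <= j <= hi for j in range(num_frames)])
    return mask


# Example: 5 frames, each seeing 2 past frames and 1 future frame.
mask = limited_context_mask(num_frames=5, left_context=2, right_context=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Setting `right_context=0` gives a strictly causal mask, while a small positive value corresponds to attending to a limited number of future frames, at the cost of added latency.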

preprint2020arXiv

VarGNet: Variable Group Convolutional Neural Network for Efficient Embedded Computing

In this paper, we propose a novel network design mechanism for efficient embedded computing. Inspired by the limited computing patterns, we propose to fix the number of channels in a group convolution, instead of the existing practice of fixing the total number of groups. The resulting network, named Variable Group Convolutional Network (VarGNet), can be optimized more easily on the hardware side, due to the more unified computing schemes among the layers. Extensive experiments on various vision tasks, including classification, detection, pixel-wise parsing and face recognition, have demonstrated the practical value of our VarGNet.
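
The difference between fixing the number of groups and fixing the number of channels per group can be shown with a little arithmetic. The sketch below is illustrative only: the layer widths, the group count of 8 and the group size of 8 are assumed values, and `groups_for_fixed_group_size` is a hypothetical helper, not part of any VarGNet code.

```python
def groups_for_fixed_group_size(in_channels, channels_per_group):
    """Number of groups when every group holds the same channel count."""
    if in_channels % channels_per_group != 0:
        raise ValueError("in_channels must be divisible by channels_per_group")
    return in_channels // channels_per_group


layer_widths = [32, 64, 128, 256]  # assumed channel counts per layer
fixed_group_count = 8              # conventional: same number of groups everywhere
fixed_group_size = 8               # variable-group idea: same channels per group

for width in layer_widths:
    # Conventional practice: the per-group channel count changes with width.
    conventional_size = width // fixed_group_count
    # Variable group convolution: the group count changes, the size stays fixed.
    variable_groups = groups_for_fixed_group_size(width, fixed_group_size)
    print(f"width={width}: fixed groups -> {conventional_size} ch/group, "
          f"fixed size -> {variable_groups} groups of {fixed_group_size} ch")
```

With a fixed group size, every group performs the same amount of work regardless of layer width, which is one way to read the abstract's "more unified computing schemes among the layers".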

preprint2019arXiv

Improved Hybrid Layered Image Compression using Deep Learning and Traditional Codecs

Recently deep learning-based methods have been applied in image compression and achieved many promising results. In this paper, we propose an improved hybrid layered image compression framework by combining deep learning and the traditional image codecs. At the encoder, we first use a convolutional neural network (CNN) to obtain a compact representation of the input image, which is losslessly encoded by the FLIF codec as the base layer of the bit stream. A coarse reconstruction of the input is obtained by another CNN from the reconstructed compact representation. The residual between the input and the coarse reconstruction is then obtained and encoded by the H.265/HEVC-based BPG codec as the enhancement layer of the bit stream. Experimental results using the Kodak and Tecnick datasets show that the proposed scheme outperforms the state-of-the-art deep learning-based layered coding scheme and traditional codecs including BPG in both PSNR and MS-SSIM metrics across a wide range of bit rates, when the images are coded in the RGB444 domain.