Source author record

Peng Gao

Peng Gao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.


Catalog footprint

What is connected

103 works

32 topics

4 close collaborators

preprint, 2026, arXiv

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

The landscape of high-performance image generation models is currently shifting from inefficient multi-step models to efficient few-step counterparts (e.g., Z-Image-Turbo and FLUX.2-klein). However, these models present significant challenges for direct continuous supervised fine-tuning. For example, applying commonly used fine-tuning techniques would compromise their inherent few-step inference capability. To address this, we propose D-OPSD, a novel training paradigm for step-distilled diffusion models that enables on-policy learning during supervised fine-tuning. We first find that modern diffusion models, in which an LLM/VLM serves as the encoder, can inherit the encoder's in-context capabilities. This enables us to formulate training as an on-policy self-distillation process. Specifically, during training, the model acts as both teacher and student with different contexts: the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the discrepancy between the two predicted distributions over the student's own roll-outs. By optimizing on the model's own trajectory and under its own supervision, D-OPSD enables the model to learn new concepts, styles, etc., without sacrificing the original few-step capacity.

preprint, 2025, arXiv

CAML: Collaborative Auxiliary Modality Learning for Multi-Agent Systems

Multi-modal learning has emerged as a key technique for improving performance across domains such as autonomous driving, robotics, and reasoning. However, in certain scenarios, particularly in resource-constrained environments, some modalities available during training may be absent during inference. While existing frameworks effectively utilize multiple data sources during training and enable inference with reduced modalities, they are primarily designed for single-agent settings. This poses a critical limitation in dynamic environments such as connected autonomous vehicles (CAV), where incomplete data coverage can lead to decision-making blind spots. Conversely, some works explore multi-agent collaboration but without addressing missing modality at test time. To overcome these limitations, we propose Collaborative Auxiliary Modality Learning (CAML), a novel multi-modal multi-agent framework that enables agents to collaborate and share multi-modal data during training, while allowing inference with reduced modalities during testing. Experimental results in collaborative decision-making for CAV in accident-prone scenarios demonstrate that CAML achieves up to a 58.1% improvement in accident detection. Additionally, we validate CAML on real-world aerial-ground robot data for collaborative semantic segmentation, achieving up to a 10.6% improvement in mIoU.

preprint, 2022, arXiv

Antiphase boundary in CH$_3$NH$_3$PbI$_3$ repels charge carriers while promotes fast ion migrations

Defects in organic-inorganic hybrid perovskites (OIHPs) greatly influence their optoelectronic properties. Identifying and better understanding the defects in OIHPs is an essential step towards fabricating high-performance perovskite solar cells. However, directly visualizing the defects is still a challenge for OIHPs due to their sensitivity during electron microscopy characterization. Here, by using low-dose scanning transmission electron microscopy techniques, we observe the common existence of antiphase boundaries (APBs) in CH$_3$NH$_3$PbI$_3$ (MAPbI$_3$), resolve their atomic structure, and correlate it with the electrical/ionic activities and structural instabilities. Such an APB is caused by a half-unit-cell shift of the [PbI$_6$]$^{4-}$ octahedra along the [100]/[010] direction, leading to a transformation from corner-sharing [PbI$_6$]$^{4-}$ octahedra in bulk MAPbI$_3$ into edge-sharing ones at the APB. Based on the identified atomic-scale configuration, we further carry out density functional theory calculations and reveal that the APB in MAPbI$_3$ repels both electrons and holes while serving as a fast ion-migration channel, causing a rapid decomposition into PbI$_2$ that is detrimental to optoelectronic performance. These findings provide valuable insights into the relationships between structures and optoelectronic properties of OIHPs and suggest that controlling the APB is essential for their stability.

preprint, 2022, arXiv

Average values of quadratic Hecke character sums

We study smoothed character sums involving $\sum_{m,n} ( \frac{m}{n} )_2$, where $( \frac{m}{n} )_2$ denotes the quadratic symbol in the Gaussian field. We extend previously known results to obtain asymptotic formulas for the sums considered to larger ranges of $m$ and $n$.

preprint, 2022, arXiv

CandidateDrug4Cancer: An Open Molecular Graph Learning Benchmark on Drug Discovery for Cancer

Because anti-cancer drug discoveries have largely been serendipitous, we present the Open Molecular Graph Learning Benchmark, named CandidateDrug4Cancer, a challenging and realistic benchmark dataset to facilitate scalable, robust, and reproducible graph machine learning research for anti-cancer drug discovery. The CandidateDrug4Cancer dataset encompasses 29 of the most-mentioned targets for cancer, covering 54869 cancer-related drug molecules that range from pre-clinical and clinical to FDA-approved. Besides building the datasets, we also perform benchmark experiments with effective Drug Target Interaction (DTI) prediction baselines using descriptors and expressive graph neural networks. Experimental results suggest that CandidateDrug4Cancer presents significant challenges for learning molecular graphs and targets in practical applications, indicating opportunities for future research on developing candidate drugs for treating cancers.

preprint, 2022, arXiv

Consecutive Pretraining: A Knowledge Transfer Learning Strategy with Relevant Unlabeled Data for Remote Sensing Domain

Currently, under supervised learning, the dominant paradigm for knowledge transfer is to pretrain a model on a large-scale nature scene dataset and then fine-tune it on a small amount of task-specific labeled data. This has become the consensus solution for task-aware model training in the remote sensing domain (RSD). Unfortunately, due to the different categories of imaging data and the stiff challenges of data annotation, there is no remote sensing dataset that is large and uniform enough to support large-scale pretraining in RSD. Moreover, pretraining models on large-scale nature scene datasets by supervised learning and then directly fine-tuning on diverse downstream tasks seems to be a crude method, which is easily affected by inevitable labeling noise, severe domain gaps and task-aware discrepancies. Thus, in this paper, building on self-supervised pretraining and the powerful vision transformer (ViT) architecture, we propose a concise and effective knowledge transfer learning strategy called ConSecutive PreTraining (CSPT), based on the idea of not stopping pretraining in natural language processing (NLP), which can gradually bridge the domain gap and transfer knowledge from the nature scene domain to the RSD. The proposed CSPT can also release the huge potential of unlabeled data for task-aware model training. Finally, extensive experiments are carried out on twelve datasets in RSD involving three types of downstream tasks (scene classification, object detection and land cover classification) and two types of imaging data (optical and SAR). The results show that, with the proposed CSPT for task-aware model training, almost all downstream tasks in RSD can outperform the previous method of supervised pretraining-then-fine-tuning and even surpass the state-of-the-art (SOTA) performance without any expensive labeling consumption or careful model design.

preprint, 2022, arXiv

ConvMAE: Masked Convolution Meets Masked Autoencoders

Vision Transformers (ViT) have become widely adopted architectures for various vision tasks. Masked auto-encoding for feature pretraining and multi-scale hybrid convolution-transformer architectures can further unleash the potential of ViT, leading to state-of-the-art performance on image classification, detection and semantic segmentation. In this paper, our ConvMAE framework demonstrates that a multi-scale hybrid convolution-transformer can learn more discriminative representations via the masked auto-encoding scheme. However, directly using the original masking strategy leads to heavy computational cost and a pretraining-finetuning discrepancy. To tackle the issue, we adopt masked convolution to prevent information leakage in the convolution blocks. A simple block-wise masking strategy is proposed to ensure computational efficiency. We also propose to more directly supervise the multi-scale features of the encoder to boost multi-scale features. Based on our pretrained ConvMAE models, ConvMAE-Base improves ImageNet-1K finetuning accuracy by 1.4% compared with MAE-Base. On object detection, ConvMAE-Base finetuned for only 25 epochs surpasses MAE-Base finetuned for 100 epochs by 2.9% box AP and 2.2% mask AP respectively. Code and pretrained models are available at https://github.com/Alpha-VL/ConvMAE.

preprint, 2022, arXiv

Direct observation of local antiferroelectricity induced phonon softening at a SrTiO3 defect

Defects in oxides usually exhibit exotic properties that may be associated with the local lattice dynamics. Here, at atomic spatial resolution, we directly measure phonon modes of an antiphase boundary (APB) in a freestanding SrTiO3 membrane and correlate them with the picometer-level structural distortion. We find that the SrTiO3 APB introduces new defect phonon modes that are absent in bulk SrTiO3. These modes are highly sensitive to subtle structural distortion: the SrTiO3 APB generates local electric dipoles forming an antiferroelectric configuration, which significantly softens the transverse optical (TO) and longitudinal optical (LO) modes at the Γ point. Correlating the local phonons with the subtle structural distortion, our findings provide valuable insights into understanding the defect properties in complex oxides and essential information for their applications such as thermoelectric devices.

preprint, 2022, arXiv

Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning

In this paper, we propose a simple and general framework for self-supervised point cloud representation learning. Human beings understand the 3D world by extracting two levels of information and establishing the relationship between them. One is the global shape of an object, and the other is its local structures. However, few existing studies in point cloud representation learning have explored how to learn both global shapes and local-to-global relationships without a specified network architecture. Inspired by how human beings understand the world, we utilize knowledge distillation to learn both global shape information and the relationship between global shape and local structures. At the same time, we combine contrastive learning with knowledge distillation so that the teacher network is updated more effectively. Our method achieves state-of-the-art performance on linear classification and multiple other downstream tasks. In particular, we develop a variant of ViT for 3D point cloud feature extraction, which achieves results comparable with existing backbones when combined with our framework. Visualization of the attention maps shows that our model does understand the point cloud by combining the global shape information and multiple local structural information, which is consistent with the inspiration of our representation learning method. Our code will be released soon.

preprint, 2022, arXiv

Electron microscopy probing electron-photon interactions in SiC nanowires with ultra-wide energy and momentum match

Nanoscale materials can usually trap light and strongly interact with it, leading to many photonic device applications. The light-matter interactions are commonly probed by optical spectroscopy, which, however, has some limitations such as diffraction-limited spatial resolution, tiny momentum transfer and non-continuous excitation/detection. In this work, using scanning transmission electron microscopy-electron energy loss spectroscopy (STEM-EELS) with ultra-wide energy and momentum match and sub-nanometer spatial resolution, we study the optical microcavity resonant spectroscopy in a single SiC nanowire. The longitudinal Fabry-Perot (FP) resonating modes and the transverse whispering-gallery modes (WGMs) are simultaneously excited and detected; they span from the near-infrared (~ 1.2 μm) to the ultraviolet (~ 0.2 μm) spectral regime, and the momentum transfer can range up to $10^8$ cm$^{-1}$. The size effects on the resonant spectra of nanowires are also revealed. Moreover, the nanoscale decay length of resonant EELS is revealed, which is attributed to the strongly localized electron-photon interactions in the SiC nanowire. This work provides an alternative technique to investigate the optical resonating spectroscopy of a single nanowire structure and to explore the light-matter interactions in dielectric nanostructures, which is also promising for modulating free electrons via photonic structures.

preprint, 2022, arXiv

Frozen CLIP Models are Efficient Video Learners

Video recognition has been dominated by the end-to-end learning paradigm -- first initializing a video recognition model with weights of a pretrained image model and then conducting end-to-end training on videos. This enables the video network to benefit from the pretrained image model. However, this requires substantial computation and memory resources for finetuning on videos and the alternative of directly using pretrained image features without finetuning the image backbone leads to subpar results. Fortunately, recent advances in Contrastive Vision-Language Pre-training (CLIP) pave the way for a new route for visual recognition tasks. Pretrained on large open-vocabulary image-text pair data, these models learn powerful visual representations with rich semantics. In this paper, we present Efficient Video Learning (EVL) -- an efficient framework for directly training high-quality video recognition models with frozen CLIP features. Specifically, we employ a lightweight Transformer decoder and learn a query token to dynamically collect frame-level spatial features from the CLIP image encoder. Furthermore, we adopt a local temporal module in each decoder layer to discover temporal clues from adjacent frames and their attention maps. We show that despite being efficient to train with a frozen backbone, our models learn high quality video representations on a variety of video recognition datasets. Code is available at https://github.com/OpenGVLab/efficient-video-recognition.

preprint, 2022, arXiv

Learning Decoupling Features Through Orthogonality Regularization

Keyword spotting (KWS) and speaker verification (SV) are two important tasks in speech applications. Research shows that state-of-the-art KWS and SV models are trained independently using different datasets, since they are expected to learn distinctive acoustic features. However, humans can distinguish language content and speaker identity simultaneously. Motivated by this, we believe it is important to explore a method that can effectively extract common features while decoupling task-specific features. Bearing this in mind, a two-branch deep network (KWS branch and SV branch) with the same network structure is developed, and a novel decoupling feature learning method is proposed to improve the performance of KWS and SV simultaneously, where speaker-invariant keyword representations and keyword-invariant speaker representations are expected respectively. Experiments are conducted on the Google Speech Commands Dataset (GSCD). The results demonstrate that the orthogonality regularization helps the network to achieve SOTA EER of 1.31% and 1.87% on KWS and SV, respectively.
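The abstract does not state the exact form of the orthogonality regularization. As a minimal sketch, assuming the penalty is the squared cosine similarity between the two branch embeddings (one common choice, not confirmed by this record), the combined objective could look as follows. The names `orthogonality_penalty`, `total_loss` and the `weight` parameter are hypothetical.

```python
import math


def orthogonality_penalty(kws_feat, sv_feat, eps=1e-12):
    """Squared cosine similarity between the two branch embeddings.

    The value is 0 when the vectors are orthogonal and approaches 1 when
    they are parallel, so minimizing it pushes the branches apart.
    This specific form is an assumption, not taken from the paper.
    """
    dot = sum(a * b for a, b in zip(kws_feat, sv_feat))
    kws_norm = math.sqrt(sum(a * a for a in kws_feat))
    sv_norm = math.sqrt(sum(b * b for b in sv_feat))
    cosine = dot / (kws_norm * sv_norm + eps)
    return cosine * cosine


def total_loss(kws_loss, sv_loss, kws_feat, sv_feat, weight=0.1):
    """Sum of both task losses plus the weighted penalty (weight is hypothetical)."""
    return kws_loss + sv_loss + weight * orthogonality_penalty(kws_feat, sv_feat)
```

Squaring the cosine keeps the penalty non-negative and treats anti-parallel features the same as parallel ones, which matches the goal of decoupling rather than opposing the two representations.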

preprint, 2022, arXiv

Lower bounds for negative moments of quadratic Dirichlet $L$-functions

We establish lower bounds for the $2k$-th moment of families of quadratic Dirichlet $L$-functions at the central point for all real $k<0$, assuming a conjecture of S. Chowla on the non-vanishing of these $L$-values.

preprint, 2022, arXiv

Non-vanishing of quadratic twists of modular $L$-functions of prime-related moduli

In this paper, we study central values of the family of quadratic twists of modular $L$-functions of moduli $8p$, with $p$ ranging over odd primes. Assuming the truth of the generalized Riemann hypothesis, we establish a positive proportion non-vanishing result for the corresponding $L$-values.

preprint, 2022, arXiv

Optimal convergence order for multi-scale stochastic Burgers equation

In this paper, we study the strong and weak convergence rates for the multi-scale one-dimensional stochastic Burgers equation. Based on the techniques of Galerkin approximation, the Kolmogorov equation and the Poisson equation, we show that the slow component converges strongly and weakly to the solution of the corresponding averaged equation, with optimal orders 1/2 and 1 respectively. The highly nonlinear term in the system brings considerable difficulties, and we develop a new technique to overcome them. To the best of our knowledge, this work seems to be the first result that obtains the optimal convergence orders in the strong and weak sense for multi-scale stochastic partial differential equations with a highly nonlinear term.

preprint, 2022, arXiv

POS-BERT: Point Cloud One-Stage BERT Pre-Training

Recently, the pre-training paradigm combining Transformer and masked language modeling, such as BERT, has achieved tremendous success in NLP, images, and point clouds. However, directly extending BERT from NLP to point clouds requires training a fixed discrete Variational AutoEncoder (dVAE) before pre-training, which results in a complex two-stage method called Point-BERT. Inspired by BERT and MoCo, we propose POS-BERT, a one-stage BERT pre-training method for point clouds. Specifically, we use the mask patch modeling (MPM) task to perform point cloud pre-training, which aims to recover masked patch information under the supervision of the corresponding tokenizer output. Unlike Point-BERT, whose tokenizer is extra-trained and frozen, we propose to use a dynamically updated momentum encoder as the tokenizer, which is updated and outputs a dynamic supervision signal along with the training process. Further, in order to learn high-level semantic representations, we incorporate contrastive learning to maximize the class token consistency between differently transformed point clouds. Extensive experiments have demonstrated that POS-BERT can extract high-quality pre-training features and improve performance on downstream tasks. Using the pre-trained model without any fine-tuning to extract features and train a linear SVM on ModelNet40, POS-BERT achieves state-of-the-art classification accuracy, exceeding Point-BERT by 3.5%. In addition, our approach significantly improves many downstream tasks, such as fine-tuned classification, few-shot classification and part segmentation. The code and trained models will be available at https://github.com/fukexue/POS-BERT.
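The abstract describes the tokenizer as a dynamically updated momentum encoder and cites MoCo as inspiration. A minimal sketch of a MoCo-style exponential-moving-average update is given below, assuming parameters are flattened into plain lists of floats. The function name `momentum_update` and the default `momentum` value are hypothetical and not taken from the paper.

```python
def momentum_update(tokenizer_params, encoder_params, momentum=0.999):
    """Return tokenizer parameters moved slightly toward the trained encoder.

    Each tokenizer weight becomes
        momentum * tokenizer + (1 - momentum) * encoder,
    so the tokenizer changes slowly and provides a stable but evolving
    supervision target. No gradient flows through this update.
    """
    return [
        momentum * t + (1.0 - momentum) * e
        for t, e in zip(tokenizer_params, encoder_params)
    ]


# Usage: after each optimizer step on the encoder, refresh the tokenizer.
tokenizer = [0.0, 0.0]
encoder = [1.0, -1.0]
for _ in range(3):
    tokenizer = momentum_update(tokenizer, encoder, momentum=0.9)
```

A momentum close to 1 keeps the supervision signal nearly constant between steps, while a smaller value lets the tokenizer track the encoder more quickly.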

preprint, 2022, arXiv

Prototypical Contrast Adaptation for Domain Adaptive Semantic Segmentation

Unsupervised Domain Adaptation (UDA) aims to adapt a model trained on a labeled source domain to an unlabeled target domain. In this paper, we present Prototypical Contrast Adaptation (ProCA), a simple and efficient contrastive learning method for unsupervised domain adaptive semantic segmentation. Previous domain adaptation methods merely consider the alignment of the intra-class representational distributions across various domains, while the inter-class structural relationship is insufficiently explored; as a result, the aligned representations on the target domain might no longer be as easily discriminated as those on the source domain. Instead, ProCA incorporates inter-class information into class-wise prototypes, and adopts class-centered distribution alignment for adaptation. By considering the same class prototypes as positives and other class prototypes as negatives to achieve class-centered distribution alignment, ProCA achieves state-of-the-art performance on classical domain adaptation tasks, i.e., GTA5 $\to$ Cityscapes and SYNTHIA $\to$ Cityscapes. Code is available at https://github.com/jiangzhengkai/ProCA.

preprint, 2022, arXiv

RestoreDet: Degradation Equivariant Representation for Object Detection in Low Resolution Images

Image restoration algorithms such as super resolution (SR) are indispensable pre-processing modules for object detection in degraded images. However, most of these algorithms assume the degradation is fixed and known a priori. When the real degradation is unknown or differs from the assumption, both the pre-processing module and the subsequent high-level task such as object detection would fail. Here, we propose a novel framework, RestoreDet, to detect objects in degraded low resolution images. RestoreDet utilizes the downsampling degradation as a kind of transformation for self-supervised signals to explore the equivariant representation against various resolutions and other degradation conditions. Specifically, we learn this intrinsic visual structure by encoding and decoding the degradation transformation from a pair of original and randomly degraded images. The framework can further take advantage of advanced SR architectures with an arbitrary-resolution restoring decoder to reconstruct the original correspondence from the degraded input image. Both the representation learning and object detection are optimized jointly in an end-to-end training fashion. RestoreDet is a generic framework that can be implemented on any mainstream object detection architecture. Extensive experiments show that our framework based on CenterNet achieves superior performance compared with existing methods when facing varying degradation situations. Our code will be released soon.

preprint, 2022, arXiv

SFE-AI at SemEval-2022 Task 11: Low-Resource Named Entity Recognition using Large Pre-trained Language Models

Large-scale pre-trained models have been widely used in named entity recognition (NER) tasks. However, model ensembling through parameter averaging or voting cannot fully exploit the complementary strengths of different models, especially in the open domain. This paper describes our NER system for SemEval-2022 Task 11: MultiCoNER. We propose an effective system that adaptively ensembles pre-trained language models with a Transformer layer. By assigning different weights to each model for different inputs, the Transformer layer integrates the advantages of diverse models effectively. Experimental results show that our method achieves superior performance in Farsi and Dutch.

preprint, 2022, arXiv

TerViT: An Efficient Ternary Vision Transformer

Vision transformers (ViTs) have demonstrated great potential in various visual tasks, but suffer from expensive computational and memory costs when deployed on resource-constrained devices. In this paper, we introduce a ternary vision transformer (TerViT) to ternarize the weights in ViTs, which is challenged by the large loss surface gap between real-valued and ternary parameters. To address the issue, we introduce a progressive training scheme that first trains 8-bit transformers and then TerViT, and achieves better optimization than conventional methods. Furthermore, we introduce channel-wise ternarization, which partitions each matrix into different channels, each with a unique distribution and ternarization interval. We apply our methods to the popular DeiT and Swin backbones, and extensive results show that we can achieve competitive performance. For example, TerViT can quantize Swin-S to a 13.1MB model size while achieving above 79% Top-1 accuracy on the ImageNet dataset.
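To illustrate what channel-wise ternarization means in practice, the sketch below maps each row (treated here as one channel) of a weight matrix to codes in {-1, 0, +1} with its own threshold and scale. The abstract does not say how TerViT picks the ternarization interval; the threshold rule used here (0.7 times the mean absolute weight) is borrowed from Ternary Weight Networks as a stand-in and should be read as an assumption. All function and parameter names are hypothetical.

```python
def ternarize_channel(weights, delta_ratio=0.7):
    """Ternarize one channel; return (codes, scale).

    Weights with magnitude at or below the per-channel threshold become 0,
    the rest become +1 or -1. The scale is the mean magnitude of the kept
    weights, so codes[i] * scale approximates weights[i].
    The 0.7 * mean|w| threshold is an assumed, TWN-style rule.
    """
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = delta_ratio * mean_abs
    kept = [abs(w) for w in weights if abs(w) > delta]
    scale = sum(kept) / len(kept) if kept else 0.0
    codes = [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]
    return codes, scale


def ternarize_matrix(matrix):
    """Apply ternarization row by row so each channel gets its own interval."""
    return [ternarize_channel(row) for row in matrix]


# Usage: two channels with very different weight magnitudes.
result = ternarize_matrix([[1.0, -1.0, 0.1, 0.0], [0.02, -0.01, 0.0, 0.03]])
```

Computing the threshold per channel rather than per matrix lets a channel with small weights keep its large-versus-small structure instead of collapsing to all zeros.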

preprint, 2022, arXiv

Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification

Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs. It shows impressive performance on downstream tasks by zero-shot knowledge transfer. To further enhance CLIP's adaption capability, existing methods proposed to fine-tune additional learnable modules, which significantly improves the few-shot performance but introduces extra training time and computational resources. In this paper, we propose a training-free adaption method for CLIP to conduct few-shot classification, termed as Tip-Adapter, which not only inherits the training-free advantage of zero-shot CLIP but also performs comparably to those training-required approaches. Tip-Adapter constructs the adapter via a key-value cache model from the few-shot training set, and updates the prior knowledge encoded in CLIP by feature retrieval. On top of that, the performance of Tip-Adapter can be further boosted to be state-of-the-art on ImageNet by fine-tuning the cache model for 10$\times$ fewer epochs than existing methods, which is both effective and efficient. We conduct extensive experiments of few-shot classification on 11 datasets to demonstrate the superiority of our proposed methods. Code is released at https://github.com/gaopengcuhk/Tip-Adapter.
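The key-value cache described above can be sketched in a few lines. Keys are the features of the few-shot training images, values are their one-hot labels, and a test feature retrieves label evidence from the cache, which is then added to CLIP's zero-shot prediction. The abstract does not give the formula; the affinity function `exp(-beta * (1 - similarity))` and the hyperparameters `alpha` and `beta` are assumptions here rather than details confirmed by this record, and the feature vectors are plain lists standing in for CLIP encoder outputs.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def tip_adapter_logits(query, keys, values, clip_weights, alpha=1.0, beta=5.5):
    """Combine zero-shot CLIP logits with cache-retrieved label evidence.

    query:        feature of the test image
    keys:         features of the few-shot training images (the cache keys)
    values:       one-hot labels of those images (the cache values)
    clip_weights: one text feature per class (the zero-shot classifier)
    alpha, beta:  assumed hyperparameters for cache weight and sharpness
    """
    q = normalize(query)
    # Prior knowledge: zero-shot similarity to each class text feature.
    logits = [dot(q, normalize(w)) for w in clip_weights]
    # Feature retrieval: each cached sample votes for its own label.
    for key, one_hot in zip(keys, values):
        affinity = math.exp(-beta * (1.0 - dot(q, normalize(key))))
        for c, label in enumerate(one_hot):
            logits[c] += alpha * affinity * label
    return logits
```

No parameter is trained in this sketch, which mirrors the training-free setting; the fine-tuned variant mentioned in the abstract would instead treat the cache keys as learnable.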

preprint, 2022, arXiv

UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

It is a challenging task to learn rich and multi-scale spatiotemporal semantics from high-dimensional videos, due to large local redundancy and complex global dependency between video frames. The recent advances in this research have been mainly driven by 3D convolutional neural networks and vision transformers. Although 3D convolution can efficiently aggregate local context to suppress local redundancy from a small 3D neighborhood, it lacks the capability to capture global dependency because of the limited receptive field. Alternatively, vision transformers can effectively capture long-range dependency by self-attention mechanism, while having the limitation on reducing local redundancy with blind similarity comparison among all the tokens in each layer. Based on these observations, we propose a novel Unified transFormer (UniFormer) which seamlessly integrates merits of 3D convolution and spatiotemporal self-attention in a concise transformer format, and achieves a preferable balance between computation and accuracy. Different from traditional transformers, our relation aggregator can tackle both spatiotemporal redundancy and dependency, by learning local and global token affinity respectively in shallow and deep layers. We conduct extensive experiments on the popular video benchmarks, e.g., Kinetics-400, Kinetics-600, and Something-Something V1&V2. With only ImageNet-1K pretraining, our UniFormer achieves 82.9%/84.8% top-1 accuracy on Kinetics-400/Kinetics-600, while requiring 10x fewer GFLOPs than other state-of-the-art methods. For Something-Something V1 and V2, our UniFormer achieves new state-of-the-art performances of 60.9% and 71.2% top-1 accuracy respectively. Code is available at https://github.com/Sense-X/UniFormer.

preprint, 2021, arXiv

A System for Automated Open-Source Threat Intelligence Gathering and Management

To remain aware of the fast-evolving cyber threat landscape, open-source Cyber Threat Intelligence (OSCTI) has received growing attention from the community. Commonly, knowledge about threats is presented in a vast number of OSCTI reports. Despite the pressing need for high-quality OSCTI, existing OSCTI gathering and management platforms have primarily focused on isolated, low-level Indicators of Compromise. Higher-level concepts (e.g., adversary tactics, techniques, and procedures) and their relationships, which contain essential knowledge about threat behaviors that is critical to uncovering the complete threat scenario, have been overlooked. To bridge the gap, we propose SecurityKG, a system for automated OSCTI gathering and management. SecurityKG collects OSCTI reports from various sources, uses a combination of AI and NLP techniques to extract high-fidelity knowledge about threat behaviors, and constructs a security knowledge graph. SecurityKG also provides a UI that supports various types of interactivity to facilitate knowledge graph exploration.

preprint, 2021, arXiv

A System for Efficiently Hunting for Cyber Threats in Computer Systems Using Threat Intelligence

Log-based cyber threat hunting has emerged as an important solution to counter sophisticated cyber attacks. However, existing approaches require non-trivial efforts of manual query construction and have overlooked the rich external knowledge about threat behaviors provided by open-source Cyber Threat Intelligence (OSCTI). To bridge the gap, we build ThreatRaptor, a system that facilitates cyber threat hunting in computer systems using OSCTI. Built upon mature system auditing frameworks, ThreatRaptor provides (1) an unsupervised, light-weight, and accurate NLP pipeline that extracts structured threat behaviors from unstructured OSCTI text, (2) a concise and expressive domain-specific query language, TBQL, to hunt for malicious system activities, (3) a query synthesis mechanism that automatically synthesizes a TBQL query from the extracted threat behaviors, and (4) an efficient query execution engine to search the big system audit logging data.

preprint, 2021, arXiv

Atomic-Scale Probing of Heterointerface Phonon Bridges in Nitride Semiconductor

Interface phonon modes that are generated by several atomic layers at the heterointerface play a major role in the interface thermal conductance for nanoscale high-power devices such as nitride-based high-electron-mobility transistors and light emitting diodes. Here we measure the local phonon spectra across AlN/Si and AlN/Al interfaces using atomically resolved vibrational electron energy-loss spectroscopy in a scanning transmission electron microscope. At the AlN/Si interface, we observe various localized phonon modes, of which the extended and interfacial modes act as bridges connecting the bulk AlN modes and bulk Si modes, and are expected to boost inelastic phonon transport and thus contribute substantially to interface thermal conductance. In comparison, no such phonon bridge is observed at the AlN/Al interface, for which partially extended modes dominate the interface thermal conductivity. This work provides valuable insights into understanding the interfacial thermal transport in nitride semiconductors and useful guidance for thermal management via interface engineering.

preprint, 2021, arXiv

Bounds for moments of Dirichlet $L$-functions to a fixed modulus

We study the $2k$-th moment of central values of the family of Dirichlet $L$-functions to a fixed prime modulus. We establish sharp lower bounds for all real $k \geq 0$ and sharp upper bounds for $k$ in the range $0 \leq k \leq 1$.
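For orientation, the quantity in question can be written out explicitly. The notation below is an assumption based on the standard convention for this family, not a quotation from the paper: $q$ is the fixed prime modulus, and the starred sum runs over primitive Dirichlet characters modulo $q$ (for prime $q$, all non-principal characters).

```latex
$$
M_k(q) \;=\; \sideset{}{^*}\sum_{\chi \bmod q} \bigl| L\bigl(\tfrac{1}{2}, \chi\bigr) \bigr|^{2k},
\qquad k \ge 0 .
$$
```

In this setting, bounds are conventionally called sharp when they match the conjectured order of magnitude $\varphi(q)(\log q)^{k^2}$, where $\varphi$ is Euler's totient function; the abstract itself does not state the order, so this reading is an assumption.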

preprint, 2021, arXiv

CHAMP: Characterizing Undesired App Behaviors from User Comments based on Market Policies

Millions of mobile apps have been available through various app markets. Although most app markets have enforced a number of automated or even manual mechanisms to vet each app before it is released to the market, thousands of low-quality apps still exist in different markets, some of which violate the explicitly specified market policies. In order to identify these violations accurately and in a timely manner, we resort to user comments, which can form an immediate feedback for app market maintainers, to identify undesired behaviors that violate market policies, including security-related user concerns. Specifically, we present the first large-scale study to detect and characterize the correlations between user comments and market policies. First, we propose CHAMP, an approach that adopts text mining and natural language processing (NLP) techniques to extract semantic rules through a semi-automated process, and classifies comments into 26 pre-defined types of undesired behaviors that violate market policies. Our evaluation on real-world user comments shows that it achieves both high precision and recall ($>0.9$) in classifying comments for undesired behaviors. Then, we curate a large-scale comment dataset (over 3 million user comments) from apps in Google Play and 8 popular alternative Android app markets, and apply CHAMP to understand the characteristics of undesired behavior comments in the wild. The results confirm our speculation that user comments can be used to pinpoint suspicious apps that violate policies declared by app markets. The study also reveals that policy violations are widespread in many app markets despite their extensive vetting efforts. CHAMP can be a whistle blower that assigns policy-violation scores and identifies the most informative comments for apps.

preprint2021arXiv

Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to engage in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advancements into which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant uses a shuffling scheme on its multi-head outputs, demonstrating better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame, and an inter-frame aggregation module capturing temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, both on answer generation and selection tasks. Our results demonstrate state-of-the-art performances on all evaluation metrics.

preprint2021arXiv

Dynamics of Polar Skyrmion Bubbles under Electric Fields

Room-temperature polar skyrmion bubbles, recently found in oxide superlattices, have received enormous interest for their potential applications in nanoelectronics due to their nanometer size, emergent chirality, and negative capacitance. For practical applications, the ability to controllably manipulate them using external stimuli is a prerequisite. Here, we study the dynamics of individual polar skyrmion bubbles at the nanoscale by using in situ biasing in a scanning transmission electron microscope. The reversible electric field-driven phase transition between topological and trivial polar states is demonstrated. We create, erase and monitor the shrinkage and expansion of individual polar skyrmions. We find that their transition behaviors are substantially different from those of their magnetic analogues. The underlying mechanism is discussed by combining with phase-field simulations. The controllable manipulation of nanoscale polar skyrmions allows us to tune the dielectric permittivity at the atomic scale, and detailed knowledge of their phase transition behaviors provides fundamentals for their applications in nanoelectronics.

preprint2021arXiv

Enabling Efficient Cyber Threat Hunting With Cyber Threat Intelligence

Log-based cyber threat hunting has emerged as an important solution to counter sophisticated attacks. However, existing approaches require non-trivial efforts of manual query construction and have overlooked the rich external threat knowledge provided by open-source Cyber Threat Intelligence (OSCTI). To bridge the gap, we propose ThreatRaptor, a system that facilitates threat hunting in computer systems using OSCTI. Built upon system auditing frameworks, ThreatRaptor provides (1) an unsupervised, light-weight, and accurate NLP pipeline that extracts structured threat behaviors from unstructured OSCTI text, (2) a concise and expressive domain-specific query language, TBQL, to hunt for malicious system activities, (3) a query synthesis mechanism that automatically synthesizes a TBQL query for hunting, and (4) an efficient query execution engine to search the big audit logging data. Evaluations on a broad set of attack cases demonstrate the accuracy and efficiency of ThreatRaptor in practical threat hunting.

preprint2021arXiv

Engineering of Atomic-Scale Flexoelectricity at Grain Boundaries

Flexoelectricity is a type of ubiquitous and prominent electromechanical coupling, pertaining to the response of electrical polarization to mechanical strain gradients while not restricted to the symmetry of materials. However, large elastic deformation in most solids is usually difficult to achieve and the strain gradient at minuscule scales is challenging to control. Here we exploit the exotic structural inhomogeneity of grain boundaries to achieve a huge strain gradient (~ 1.2 nm$^{-1}$) within 3 to 4 unit cells, and thus obtain atomic-scale flexoelectric polarization up to ~ 38 μC/cm$^2$ at a 24 LaAlO3 grain boundary. The nanoscale flexoelectricity also modifies the electrical activity of grain boundaries. Moreover, we prove that it is a general and feasible way to form large strain gradients at atomic scale by altering the misorientation angles of grain boundaries in different dielectric materials. Thus, engineering of grain boundaries provides an effective pathway to achieve tunable flexoelectricity and broadens the electromechanical functionalities of non-piezoelectric materials.

preprint2021arXiv

Fast Convergence of DETR with Spatially Modulated Co-Attention

The recently proposed Detection Transformer (DETR) model successfully applies Transformer to object detection and achieves comparable performance to two-stage object detection frameworks, such as Faster-RCNN. However, DETR suffers from slow convergence. Training DETR from scratch needs 500 epochs to achieve a high accuracy. To accelerate its convergence, we propose a simple yet effective scheme for improving the DETR framework, namely the Spatially Modulated Co-Attention (SMCA) mechanism. The core idea of SMCA is to conduct regression-aware co-attention in DETR by constraining co-attention responses to be high near initially estimated bounding box locations. Our proposed SMCA increases DETR's convergence speed by replacing the original co-attention mechanism in the decoder while keeping other operations in DETR unchanged. Furthermore, by integrating multi-head and scale-selection attention designs into SMCA, our fully-fledged SMCA can achieve better performance compared to DETR with a dilated convolution-based backbone (45.6 mAP at 108 epochs vs. 43.3 mAP at 500 epochs). We perform extensive ablation studies on the COCO dataset to validate the effectiveness of the proposed SMCA.

preprint2021arXiv

Measuring phonon dispersion at an interface

The breakdown of translational symmetry at heterointerfaces leads to the emergence of new phonon modes localized near the interface. These interface phonons play an essential role in thermal/electrical transport properties in devices, especially in miniature ones wherein the interface may dominate the entire response of the device. Knowledge of phonon dispersion at interfaces is therefore highly desirable for device design and optimization. Although theoretical work began decades ago, experimental research is totally absent due to challenges in achieving the combined spatial, momentum and spectral resolutions required to probe localized phonon modes. Here we use electron energy loss spectroscopy in an electron microscope to directly measure both the local phonon density of states and the interface phonon dispersion relation for an epitaxial cBN-diamond heterointerface. In addition to bulk phonon modes, we observe acoustic and optical phonon modes localized at the interface, and modes isolated away from the interface. These features only appear within ~ 1 nm around the interface. The experimental results can be nicely reproduced by ab initio calculations. Our findings provide insights into lattice dynamics at heterointerfaces and should be practically useful in thermal/electrical engineering.

preprint2021arXiv

Multi-view Sensor Fusion by Integrating Model-based Estimation and Graph Learning for Collaborative Object Localization

Collaborative object localization aims to collaboratively estimate locations of objects observed from multiple views or perspectives, which is a critical ability for multi-agent systems such as connected vehicles. To enable collaborative localization, several model-based state estimation and learning-based localization methods have been developed. Despite their encouraging performance, model-based state estimation often lacks the ability to model the complex relationships among multiple objects, while learning-based methods are typically not able to fuse the observations from an arbitrary number of views and cannot model uncertainty well. In this paper, we introduce a novel spatiotemporal graph filter approach that integrates graph learning and model-based estimation to perform multi-view sensor fusion for collaborative object localization. Our approach models complex object relationships using a new spatiotemporal graph representation and fuses multi-view observations in a Bayesian fashion to improve location estimation under uncertainty. We evaluate our approach in the applications of connected autonomous driving and multiple pedestrian localization. Experimental results show that our approach outperforms previous techniques and achieves state-of-the-art performance on collaborative localization.

preprint2021arXiv

RomeBERT: Robust Training of Multi-Exit BERT

BERT has achieved superior performances on Natural Language Understanding (NLU) tasks. However, BERT possesses a large number of parameters and demands certain resources to deploy. For acceleration, Dynamic Early Exiting for BERT (DeeBERT) has been proposed recently, which incorporates multiple exits and adopts a dynamic early-exit mechanism to ensure efficient inference. While obtaining an efficiency-performance tradeoff, the performances of early exits in multi-exit BERT are significantly worse than late exits. In this paper, we leverage gradient regularized self-distillation for RObust training of Multi-Exit BERT (RomeBERT), which can effectively solve the performance imbalance problem between early and late exits. Moreover, the proposed RomeBERT adopts a one-stage joint training strategy for multi-exits and the BERT backbone while DeeBERT needs two stages that require more training time. Extensive experiments on GLUE datasets are performed to demonstrate the superiority of our approach. Our code is available at https://github.com/romebert/RomeBERT.

preprint2021arXiv

Self-supervised learning for fast and scalable time series hyper-parameter tuning

Hyper-parameters of time series models play an important role in time series analysis. Slight differences in hyper-parameters might lead to very different forecast results for a given model, and therefore, selecting good hyper-parameter values is indispensable. Most of the existing generic hyper-parameter tuning methods, such as Grid Search, Random Search, and Bayesian Optimal Search, are based on one key component, search, and thus they are computationally expensive and cannot be applied to fast and scalable time-series hyper-parameter tuning (HPT). We propose a self-supervised learning framework for HPT (SSL-HPT), which uses time series features as inputs and produces optimal hyper-parameters. The SSL-HPT algorithm is 6-20x faster at getting hyper-parameters compared to other search-based algorithms while producing comparably accurate forecasting results in various applications.

preprint2021arXiv

Sharp lower bounds for moments of quadratic Dirichlet $L$-functions

We establish sharp lower bounds for the $k$-th moment in the range $0 \leq k \leq 1$ of the family of quadratic Dirichlet $L$-functions at the central point.

preprint2021arXiv

Sharp upper bounds for moments of quadratic Dirichlet $L$-functions

We establish unconditional sharp upper bounds of the $k$-th moments of the family of quadratic Dirichlet $L$-functions at the central point for $0 \leq k \leq 2$.

preprint2021arXiv

Switching magnon chirality in artificial antiferromagnet

Magnons in antiferromagnets can support both right-handed and left-handed chiralities, which sheds light on chirality-based spintronics. Here we demonstrate the switching and reading of magnon chirality in an artificial antiferromagnet. The coexisting antiferromagnetic and ferromagnetic characteristic resonance modes are discovered, which permits a high tunability in the modulation of magnon chirality. The reading of the chirality is accomplished via the chirality-dependent spin pumping as well as the spin rectification effect. Our result illustrates an ideal antiferromagnetic platform for handling magnon chirality and paves the way for chirality-based spintronics.

103 published item(s)

D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

CAML: Collaborative Auxiliary Modality Learning for Multi-Agent Systems

Antiphase boundary in CH$_3$NH$_3$PbI$_3$ repels charge carriers while promotes fast ion migrations

Average values of quadratic Hecke character sums

CandidateDrug4Cancer: An Open Molecular Graph Learning Benchmark on Drug Discovery for Cancer

Consecutive Pretraining: A Knowledge Transfer Learning Strategy with Relevant Unlabeled Data for Remote Sensing Domain

ConvMAE: Masked Convolution Meets Masked Autoencoders

Direct observation of local antiferroelectricity induced phonon softening at a SrTiO3 defect

Distillation with Contrast is All You Need for Self-Supervised Point Cloud Representation Learning

Electron microscopy probing electron-photon interactions in SiC nanowires with ultra-wide energy and momentum match

Frozen CLIP Models are Efficient Video Learners

Learning Decoupling Features Through Orthogonality Regularization

Lower bounds for negative moments of quadratic Dirichlet $L$-functions

Non-vanishing of quadratic twists of modular $L$-functions of prime-related moduli

Optimal convergence order for multi-scale stochastic Burgers equation

POS-BERT: Point Cloud One-Stage BERT Pre-Training

Prototypical Contrast Adaptation for Domain Adaptive Semantic Segmentation

RestoreDet: Degradation Equivariant Representation for Object Detection in Low Resolution Images

SFE-AI at SemEval-2022 Task 11: Low-Resource Named Entity Recognition using Large Pre-trained Language Models

TerViT: An Efficient Ternary Vision Transformer

Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification

UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning

A System for Automated Open-Source Threat Intelligence Gathering and Management

A System for Efficiently Hunting for Cyber Threats in Computer Systems Using Threat Intelligence

Atomic-Scale Probing of Heterointerface Phonon Bridges in Nitride Semiconductor

Bounds for moments of Dirichlet $L$-functions to a fixed modulus

CHAMP: Characterizing Undesired App Behaviors from User Comments based on Market Policies

Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Dynamics of Polar Skyrmion Bubbles under Electric Fields

Enabling Efficient Cyber Threat Hunting With Cyber Threat Intelligence

Engineering of Atomic-Scale Flexoelectricity at Grain Boundaries

Fast Convergence of DETR with Spatially Modulated Co-Attention

Measuring phonon dispersion at an interface

Multi-view Sensor Fusion by Integrating Model-based Estimation and Graph Learning for Collaborative Object Localization

RomeBERT: Robust Training of Multi-Exit BERT

Self-supervised learning for fast and scalable time series hyper-parameter tuning

Sharp lower bounds for moments of quadratic Dirichlet $L$-functions

Sharp upper bounds for moments of quadratic Dirichlet $L$-functions

Switching magnon chirality in artificial antiferromagnet

Character Matters: Video Story Understanding with Character-Aware Relations

Contrastive Visual-Linguistic Pretraining

Controllable generations of several nonlinear waves in optical fibers with third-order dispersion

Creating topological polar structure in a nonpolar matter

Eightfold Fermionic Excitation in a Charge Density Wave Compound

Extreme Low-Light Imaging with Multi-granulation Cooperative Networks

First Moments of Some Hecke $L$-functions of Prime Moduli

Four-dimensional Vibrational Spectroscopy for Nanoscale Mapping of Phonon Dispersion in BN Nanotubes

Gradient Regularized Contrastive Learning for Continual Domain Adaptation

Interlayer Decoupling in 30° Twisted Bilayer Graphene Quasicrystal

Learning Reinforced Attentional Representation for End-to-End Visual Tracking

Learning Where to Focus for Efficient Video Object Detection

Low-lying zeros of a family of quadratic Hecke $L$-functions via ratios conjecture

Moments and Non-vanishing of central values of Quadratic Hecke $L$-functions in the Gaussian Field

Moments of central values of cubic Hecke $L$-functions of $\mathbb{Q}(i)$

Multi-Layer Content Interaction Through Quaternion Product For Visual Question Answering

Non-vanishing of central values of quadratic Hecke $L$-functions of prime moduli in the Gaussian field

One level density of low-lying zeros of quadratic Hecke $L$-functions of Imaginary Quadratic Number Fields

Querying Streaming System Monitoring Data for Enterprise System Anomaly Detection

Reconstruction Regularized Deep Metric Learning for Multi-label Image Classification

The fourth moment of central values of quadratic Hecke $L$-functions in the Gaussian field

Atomic Imaging of Mechanically Induced Topological Transition of Ferroelectric Vortices

Atomic Origin of Spin-Valve Magnetoresistance at the SrRuO3 Grain Boundary

Chiral spin-wave velocities induced by all-garnet interfacial Dzyaloshinskii-Moriya interaction in ultrathin yttrium iron garnet films

Correlating the Electronic Structures of Metallic/Semiconductor MoTe2 Interface to its Atomic Structures

First Moment of Hecke $L$-functions with quartic characters at the central point

Moments of Quadratic Hecke $L$-functions of Imaginary Quadratic Number Fields

One level density of low-lying zeros of quadratic and quartic Hecke $L$-functions

Siamese Attentional Keypoint Network for High Performance Visual Tracking

Thickness-dependent in-plane polarization and structural phase transition in van der Waals Ferroelectric CuInP2S6

Weighted first moments of some special quadratic Dirichlet $L$-functions

Atomic Origin of Ti Deficient Dislocation in SrTiO3 Bicrystal and Their Electronic Structures

A complement to Diananda's inequality

A complete monotonicity result involving the $q$-polygamma functions

Crystal Structure Manipulation of the Exchange Bias in an Antiferromagnetic Film