Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
42works
0followers
23topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

42 published item(s)

preprint2026arXiv

DAOS: A Multimodal In-cabin Behavior Monitoring with Driver Action-Object Synergy Dataset

In driver activity monitoring, movements are mostly limited to the upper body, which makes many actions look similar. To tell these actions apart, human often rely on the objects the driver is using, such as holding a phone compared with gripping the steering wheel. However, most existing driver-monitoring datasets lack accurate object-location annotations or do not link objects to their associated actions, leaving a critical gap for reliable action recognition. To address this, we introduce the Driver Action with Object Synergy (DAOS) dataset, comprising 9,787 video clips annotated with 36 fine-grained driver actions and 15 object classes, totaling more than 2.5 million corresponding object instances. DAOS offers multi-modal, multi-view data (RGB, IR, and depth) from front, face, left, and right perspectives. Although DAOS captures a wide range of cabin objects, only a few are directly relevant to each action for prediction, so focusing on task-specific human-object relations is essential. To tackle this challenge, we propose the Action-Object-Relation Network (AOR-Net). AOR-Net comprehends complex driver actions through multi-level reasoning and a chain-of-action prompting mechanism that models the logical relationships among actions, objects, and their relations. Additionally, the Mixture of Thoughts module is introduced to dynamically select essential knowledge at each stage, enhancing robustness in object-rich and object-scarce conditions. Extensive experiments demonstrate that our model outperforms other state-of-the-art methods on various datasets.

preprint2026arXiv

Lagrangian Flow Matching: A Least-Action Framework for Principled Path Design

Flow matching trains a neural velocity field by regression against a target velocity associated with a prescribed probability path connecting a simple initial distribution to the data distribution. A central design choice is the path itself. Existing constructions, including rectified and optimal-transport-based paths, transport samples along straight lines between coupled endpoints and thus cover only a narrow class of dynamics. We observe that this corresponds to the simplest case of the least-action principle in classical mechanics, in which the kinetic Lagrangian yields free-particle straight-line trajectories. Building on this observation, we propose Lagrangian flow matching, a physics-based framework in which the probability path and velocity field are determined by minimizing the action of a general Lagrangian subject to the continuity equation and the prescribed endpoints. We show that this dynamic problem admits an equivalent static optimal transport (OT) formulation, yielding a family of simulation-free training objectives that recover OT-based flow matching as the kinetic special case and the trigonometric variance-preserving diffusion path as the harmonic-oscillator case. More general Lagrangians give rise to new probability paths and velocity fields, and numerical experiments show that they induce meaningful changes in the learned dynamics while remaining competitive with existing conditional flow matching models.

preprint2026arXiv

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

Diffusion large language models (dLLMs) generate text by iteratively denoising masked token sequences. Although dLLMs can predict all masked positions in parallel within each step, the large number of denoising iterations still makes inference expensive. This cost can be reduced spatially by unmasking multiple tokens per step, or temporally by collapsing multiple denoising steps into one verification call. We propose Parallel Speculative Decoding (PSD), a training-free framework that jointly improves inference along both axes. Using the confidence scores from a single forward pass, PSD selects positions to unmask via a configurable, adaptive unmasking policy and constructs multi-depth speculative drafts without extra model calls. A final batched verification pass then applies hierarchical acceptance, keeping the deepest draft that remains consistent with the updated predictions. Experiments on three dLLMs across reasoning and code generation tasks show that PSD achieves favorable trade-offs between inference efficiency and generation quality, reaching up to $5.5\times$ tokens per forward pass with accuracy comparable to greedy decoding.

preprint2026arXiv

Revisiting Judge Decoding from First Principles via Training-Free Distributional Divergence

Judge Decoding accelerates LLM inference by relaxing the strict verification of Speculative Decoding, yet it typically relies on expensive and noisy supervision. In this work, we revisit this paradigm from first principles, revealing that the ``criticality'' scores learned via costly supervision are intrinsically encoded in the draft-target distributional divergence. We theoretically prove a structural correspondence between learned linear judges and Kullback-Leibler (KL) divergence, demonstrating they rely on the same underlying logit primitives. Guided by this, we propose a simple, training-free verification mechanism based on KL divergence. Extensive experiments across reasoning and coding benchmarks show that our method matches or outperforms complex trained judges (e.g., AutoJudge), offering superior robustness to domain shifts and eliminating the supervision bottleneck entirely.

preprint2026arXiv

SwiftMem: Fast Agentic Memory via Query-aware Indexing

Agentic memory systems have become critical for enabling LLM agents to maintain long-term context and retrieve relevant information efficiently. However, existing memory frameworks suffer from a fundamental limitation: they perform exhaustive retrieval across the entire storage layer regardless of query characteristics. This brute-force approach creates severe latency bottlenecks as memory grows, hindering real-time agent interactions. We propose SwiftMem, a query-aware agentic memory system that achieves sub-linear retrieval through specialized indexing over temporal and semantic dimensions. Our temporal index enables logarithmic-time range queries for time-sensitive retrieval, while the semantic DAG-Tag index maps queries to relevant topics through hierarchical tag structures. To address memory fragmentation during growth, we introduce an embedding-tag co-consolidation mechanism that reorganizes storage based on semantic clusters to improve cache locality. Experiments on LoCoMo and LongMemEval benchmarks demonstrate that SwiftMem achieves 47$\times$ faster search compared to state-of-the-art baselines while maintaining competitive accuracy, enabling practical deployment of memory-augmented LLM agents.

preprint2026arXiv

TIE: Time Interval Encoding for Video Generation over Events

Director-style prompting, robotic action prediction, and interactive video agents demand temporal grounding over concurrent events -- a regime in which 68% of general clips and over 99% of robotics/gameplay clips contain overlapping events, yet existing multi-event generators rest on a single-active-prompt assumption. However, modern video generators, such as Diffusion Transformers (DiT), represent time as discrete points through point-wise positional encodings. This formulation creates a fundamental dimension mismatch: temporally extended intervals and overlapping events are mathematically unrepresentable to the attention mechanism. In this paper, we propose Time Interval Encoding (TIE), a principled, plug-and-play interval-aware generalization of rotary embeddings that elevates time intervals to first-class primitives inside DiT cross-attention. Rather than introducing another heuristic interval embedding, we show that, within RoPE-compatible bilinear attention, TIE is characterized by two basic principles: Temporal Integrability, which requires an event to aggregate positional evidence over its full duration, and Duration Invariance, which removes the trivial bias toward longer intervals. Under a uniform kernel, this characterization yields an efficient closed-form sinc-based solution that preserves the standard attention interface and naturally attenuates boundary noise through interval integration. Empirically, TIE preserves the visual quality of the base DiT model while substantially improving temporal controllability. In our experiments on the OmniEvents dataset, it improves human-verified Temporal Constraint Satisfaction Rate from 77.34% to 96.03% and reduces temporal boundary error from 0.261s to 0.073s, while also improving trajectory-level temporal alignment metrics. The code and dataset are available at https://github.com/MatrixTeam-AI/TIE.

preprint2026arXiv

Towards Generalist Game Players: An Investigation of Foundation Models in the Game Multiverse

The real world unfolds along a single set of physics laws, yet human intelligence demonstrates a remarkable capacity to generalize experiences from this singular physical existence into a multiverse of games, each governed by entirely different rules, aesthetics, physics, and objectives. This omni-reality adaptability is a hallmark of general intelligence. As Artificial Intelligence progresses towards Artificial General Intelligence, the multiverse of games has evolved from mere entertainment into the ultimate ground for training and evaluating AGI. The pursuit of this generality has unfolded across four eras: from environment-specific symbolic and reinforcement learning agents, to current large foundation models acting as generalist players, and toward a future creator stage where agent both creates new game worlds and continually evolves within them. We trace the full lifecycle of a generalist game player along four interdependent pillars: Dataset, Model, Harness, and Benchmark. Every advance across these pillars can be read as an attempt to break one of five fundamental trade-offs that currently bound the whole system. Building on this end-to-end view, we chart a five-level roadmap, progressing from single-game mastery to the ultimate creator stage in which the agent simultaneously creates and evolves within theoretical game multiverse. Taken together, our work offers a unified lens onto a rapidly shifting field,and a principled path toward the omnipotent generalist agent capable of seamlessly mastering any challenge within the multiverse of games, thereby paving the way for AGI.

preprint2025arXiv

Poincaré Duality and Multiplicative Structures on Quantum Codes

Quantum LDPC codes have attracted intense interest due to their advantageous properties for realizing efficient fault-tolerant quantum computing. In particular, sheaf codes represent a novel framework that encompasses all well-known good qLDPC codes with profound underlying mathematics. In this work, we generalize Poincaré duality from manifolds to both classical and quantum codes defined via sheaf theory on $t$-dimensional cell complexes. Viewing important code properties including the encoding rate, code distance, local testability soundness, and efficient decoders as parameters of the underlying (co)chain complexes, we rigorously prove a duality relationship between the $i$-th chain and the $(t-i)$-th cochain of sheaf codes. We further build multiplicative structures such as cup and cap products on sheaved chain complexes, inspired by the standard notions of multiplicative structures and Poincaré duality on manifolds. This immediately leads to an explicit isomorphism between (co)homology groups of sheaf codes via a cap product. As an application, we obtain transversal disjoint logical $\mathrm{C}Z$ gates with $k_{\mathrm{C}Z}=Θ(n)$ on families of good qLDPC and almost-good quantum locally testable codes. Moreover, we provide multiple new methods to construct transversal circuits composed of $\mathrm{C}\mathrm{C}Z$ gates as well as for higher order controlled-$Z$ that are provably logical operations on the code space. We conjecture that they generate nontrivial logical actions, pointing towards fault-tolerant non-Clifford gates on nearly optimal qLDPC sheaf codes. Mathematically, our results are built on establishing the equivalence between sheaf cohomology in the derived-functor sense, Čech cohomology, and the cohomology of sheaf codes, thereby introducing new mathematical tools into quantum coding theory.

preprint2024arXiv

Geo2SigMap: High-Fidelity RF Signal Mapping Using Geographic Databases

Radio frequency (RF) signal mapping, which is the process of analyzing and predicting the RF signal strength and distribution across specific areas, is crucial for cellular network planning and deployment. Traditional approaches to RF signal mapping rely on statistical models constructed based on measurement data, which offer low complexity but often lack accuracy, or ray tracing tools, which provide enhanced precision for the target area but suffer from increased computational complexity. Recently, machine learning (ML) has emerged as a data-driven method for modeling RF signal propagation, which leverages models trained on synthetic datasets to perform RF signal mapping in "unseen" areas. In this paper, we present Geo2SigMap, an ML-based framework for efficient and high-fidelity RF signal mapping using geographic databases. First, we develop an automated framework that seamlessly integrates three open-source tools: OpenStreetMap (geographic databases), Blender (computer graphics), and Sionna (ray tracing), enabling the efficient generation of large-scale 3D building maps and ray tracing models. Second, we propose a cascaded U-Net model, which is pre-trained on synthetic datasets and employed to generate detailed RF signal maps, leveraging environmental information and sparse measurement data. Finally, we evaluate the performance of Geo2SigMap via a real-world measurement campaign, where three types of user equipment (UE) collect over 45,000 data points related to cellular information from six LTE cells operating in the citizens broadband radio service (CBRS) band. Our results show that Geo2SigMap achieves an average root-mean-square-error (RMSE) of 6.04 dB for predicting the reference signal received power (RSRP) at the UE, representing an average RMSE improvement of 3.59 dB compared to existing methods.

preprint2022arXiv

Adaptive Local Implicit Image Function for Arbitrary-scale Super-resolution

Image representation is critical for many visual tasks. Instead of representing images discretely with 2D arrays of pixels, a recent study, namely local implicit image function (LIIF), denotes images as a continuous function where pixel values are expansion by using the corresponding coordinates as inputs. Due to its continuous nature, LIIF can be adopted for arbitrary-scale image super-resolution tasks, resulting in a single effective and efficient model for various up-scaling factors. However, LIIF often suffers from structural distortions and ringing artifacts around edges, mostly because all pixels share the same model, thus ignoring the local properties of the image. In this paper, we propose a novel adaptive local image function (A-LIIF) to alleviate this problem. Specifically, our A-LIIF consists of two main components: an encoder and a expansion network. The former captures cross-scale image features, while the latter models the continuous up-scaling function by a weighted combination of multiple local implicit image functions. Accordingly, our A-LIIF can reconstruct the high-frequency textures and structures more accurately. Experiments on multiple benchmark datasets verify the effectiveness of our method. Our codes are available at \url{https://github.com/LeeHW-THU/A-LIIF}.

preprint2022arXiv

Backdoor Defense via Decoupling the Training Process

Recent studies have revealed that deep neural networks (DNNs) are vulnerable to backdoor attacks, where attackers embed hidden backdoors in the DNN model by poisoning a few training samples. The attacked model behaves normally on benign samples, whereas its prediction will be maliciously changed when the backdoor is activated. We reveal that poisoned samples tend to cluster together in the feature space of the attacked DNN model, which is mostly due to the end-to-end supervised training paradigm. Inspired by this observation, we propose a novel backdoor defense via decoupling the original end-to-end training process into three stages. Specifically, we first learn the backbone of a DNN model via \emph{self-supervised learning} based on training samples without their labels. The learned backbone will map samples with the same ground-truth label to similar locations in the feature space. Then, we freeze the parameters of the learned backbone and train the remaining fully connected layers via standard training with all (labeled) training samples. Lastly, to further alleviate side-effects of poisoned samples in the second stage, we remove labels of some `low-credible' samples determined based on the learned model and conduct a \emph{semi-supervised fine-tuning} of the whole model. Extensive experiments on multiple benchmark datasets and DNN models verify that the proposed defense is effective in reducing backdoor threats while preserving high accuracy in predicting benign samples. Our code is available at \url{https://github.com/SCLBD/DBD}.

preprint2022arXiv

Backdoor Learning: A Survey

Backdoor attack intends to embed hidden backdoor into deep neural networks (DNNs), so that the attacked models perform well on benign samples, whereas their predictions will be maliciously changed if the hidden backdoor is activated by attacker-specified triggers. This threat could happen when the training process is not fully controlled, such as training on third-party datasets or adopting third-party models, which poses a new and realistic threat. Although backdoor learning is an emerging and rapidly growing research area, its systematic review, however, remains blank. In this paper, we present the first comprehensive survey of this realm. We summarize and categorize existing backdoor attacks and defenses based on their characteristics, and provide a unified framework for analyzing poisoning-based backdoor attacks. Besides, we also analyze the relation between backdoor attacks and relevant fields ($i.e.,$ adversarial attacks and data poisoning), and summarize widely adopted benchmark datasets. Finally, we briefly outline certain future research directions relying upon reviewed works. A curated list of backdoor-related resources is also available at \url{https://github.com/THUYimingLi/backdoor-learning-resources}.

preprint2022arXiv

Deep Convolutional Neural Networks for Molecular Subtyping of Gliomas Using Magnetic Resonance Imaging

Knowledge of molecular subtypes of gliomas can provide valuable information for tailored therapies. This study aimed to investigate the use of deep convolutional neural networks (DCNNs) for noninvasive glioma subtyping with radiological imaging data according to the new taxonomy announced by the World Health Organization in 2016. Methods: A DCNN model was developed for the prediction of the five glioma subtypes based on a hierarchical classification paradigm. This model used three parallel, weight-sharing, deep residual learning networks to process 2.5-dimensional input of trimodal MRI data, including T1-weighted, T1-weighted with contrast enhancement, and T2-weighted images. A data set comprising 1,016 real patients was collected for evaluation of the developed DCNN model. The predictive performance was evaluated via the area under the curve (AUC) from the receiver operating characteristic analysis. For comparison, the performance of a radiomics-based approach was also evaluated. Results: The AUCs of the DCNN model for the four classification tasks in the hierarchical classification paradigm were 0.89, 0.89, 0.85, and 0.66, respectively, as compared to 0.85, 0.75, 0.67, and 0.59 of the radiomics approach. Conclusion: The results showed that the developed DCNN model can predict glioma subtypes with promising performance, given sufficient, non-ill-balanced training data.

preprint2022arXiv

Egocentric Prediction of Action Target in 3D

We are interested in anticipating as early as possible the target location of a person's object manipulation action in a 3D workspace from egocentric vision. It is important in fields like human-robot collaboration, but has not yet received enough attention from vision and learning communities. To stimulate more research on this challenging egocentric vision task, we propose a large multimodality dataset of more than 1 million frames of RGB-D and IMU streams, and provide evaluation metrics based on our high-quality 2D and 3D labels from semi-automatic annotation. Meanwhile, we design baseline methods using recurrent neural networks and conduct various ablation studies to validate their effectiveness. Our results demonstrate that this new task is worthy of further study by researchers in robotics, vision, and learning communities.

preprint2022arXiv

Enhanced Atmospheric Turbulence Resiliency with Successive Interference Cancellation DSP in Mode Division Multiplexing Free-Space Optical Links

We experimentally demonstrate the enhanced atmospheric turbulence resiliency in a 137.8 Gbit/s/mode mode-division multiplexing free-space optical communication link through the application of a successive interference cancellation digital signal processing algorithm. The turbulence resiliency is further enhanced through redundant receive channels in the mode-division multiplexing link. The proof of concept demonstration is performed using commercially available mode-selective photonic lanterns, a commercial transponder, and a spatial light modulator based turbulence emulator. In this link, 5 spatial modes with each mode carrying 34.46 GBaud dual-polarization quadrature phase shift keying signals are successfully transmitted with an average bit error rate lower than the hard-decision forward error correction limit. As a result, we achieved a record-high mode- and polarization-division multiplexing channel number of 10, a record-high line rate of 689.23 Gbit/s, and a record-high net spectral efficiency of 13.9 bit/s/Hz in emulated turbulent links in a mode-division multiplexing free-space optical system.

preprint2022arXiv

Few-Shot Backdoor Attacks on Visual Object Tracking

Visual object tracking (VOT) has been widely adopted in mission-critical applications, such as autonomous driving and intelligent surveillance systems. In current practice, third-party resources such as datasets, backbone networks, and training platforms are frequently used to train high-performance VOT models. Whilst these resources bring certain convenience, they also introduce new security threats into VOT models. In this paper, we reveal such a threat where an adversary can easily implant hidden backdoors into VOT models by tempering with the training process. Specifically, we propose a simple yet effective few-shot backdoor attack (FSBA) that optimizes two losses alternately: 1) a \emph{feature loss} defined in the hidden feature space, and 2) the standard \emph{tracking loss}. We show that, once the backdoor is embedded into the target model by our FSBA, it can trick the model to lose track of specific objects even when the \emph{trigger} only appears in one or a few frames. We examine our attack in both digital and physical-world settings and show that it can significantly degrade the performance of state-of-the-art VOT trackers. We also show that our attack is resistant to potential defenses, highlighting the vulnerability of VOT models to potential backdoor attacks.

preprint2022arXiv

Learning Distilled Collaboration Graph for Multi-Agent Perception

To promote better performance-bandwidth trade-off for multi-agent perception, we propose a novel distilled collaboration graph (DiscoGraph) to model trainable, pose-aware, and adaptive collaboration among agents. Our key novelties lie in two aspects. First, we propose a teacher-student framework to train DiscoGraph via knowledge distillation. The teacher model employs an early collaboration with holistic-view inputs; the student model is based on intermediate collaboration with single-view inputs. Our framework trains DiscoGraph by constraining post-collaboration feature maps in the student model to match the correspondences in the teacher model. Second, we propose a matrix-valued edge weight in DiscoGraph. In such a matrix, each element reflects the inter-agent attention at a specific spatial region, allowing an agent to adaptively highlight the informative regions. During inference, we only need to use the student model named as the distilled collaboration network (DiscoNet). Attributed to the teacher-student framework, multiple agents with the shared DiscoNet could collaboratively approach the performance of a hypothetical teacher model with a holistic view. Our approach is validated on V2X-Sim 1.0, a large-scale multi-agent perception dataset that we synthesized using CARLA and SUMO co-simulation. Our quantitative and qualitative experiments in multi-agent 3D object detection show that DiscoNet could not only achieve a better performance-bandwidth trade-off than the state-of-the-art collaborative perception methods, but also bring more straightforward design rationale. Our code is available on https://github.com/ai4ce/DiscoNet.

preprint2022arXiv

Spontaneous Radiative Cooling to Enhance the Operational Stability of Perovskite Solar Cells via a Black-body-like Full Carbon Electrode

Operational stability of perovskite solar cells is remarkably influenced by the device temperature, therefore, decreasing the interior temperature of the device is one of the most effective approaches to prolong the service life. Herein, we introduce the spontaneous radiative cooling effect into the perovskite solar cell and amplified this effect via functional structure design of a full-carbon electrode (F-CE). Firstly, with interface engineering, >19% and >23% power conversion efficiencies of F-CE based inorganic CsPbI3 and hybrid perovskite solar cells have been achieved, respectively, both of which are the highest reported efficiencies based on carbon electrode and are comparative to the results for metal electrodes. Highly efficient thermal radiation of this F-CE can reduce the temperature of the operating cell by about 10 °C. Compared with the conventional metal electrode-based control cells, the operational stability of the above two types of cells have been significantly improved due to this cooling effect. Especially, the CsPbI3 PSCs exhibited no efficiency degradation after 2000 hours of continuous operational tracking.

preprint2022arXiv

V2X-Sim: Multi-Agent Collaborative Perception Dataset and Benchmark for Autonomous Driving

Vehicle-to-everything (V2X) communication techniques enable the collaboration between vehicles and many other entities in the neighboring environment, which could fundamentally improve the perception system for autonomous driving. However, the lack of a public dataset significantly restricts the research progress of collaborative perception. To fill this gap, we present V2X-Sim, a comprehensive simulated multi-agent perception dataset for V2X-aided autonomous driving. V2X-Sim provides: (1) \hl{multi-agent} sensor recordings from the road-side unit (RSU) and multiple vehicles that enable collaborative perception, (2) multi-modality sensor streams that facilitate multi-modality perception, and (3) diverse ground truths that support various perception tasks. Meanwhile, we build an open-source testbed and provide a benchmark for the state-of-the-art collaborative perception algorithms on three tasks, including detection, tracking and segmentation. V2X-Sim seeks to stimulate collaborative perception research for autonomous driving before realistic datasets become widely available. Our dataset and code are available at \url{https://ai4ce.github.io/V2X-Sim/}.

preprint2021arXiv

$t$-$k$-means: A Robust and Stable $k$-means Variant

$k$-means algorithm is one of the most classical clustering methods, which has been widely and successfully used in signal processing. However, due to the thin-tailed property of the Gaussian distribution, $k$-means algorithm suffers from relatively poor performance on the dataset containing heavy-tailed data or outliers. Besides, standard $k$-means algorithm also has relatively weak stability, $i.e.$ its results have a large variance, which reduces its credibility. In this paper, we propose a robust and stable $k$-means variant, dubbed the $t$-$k$-means, as well as its fast version to alleviate those problems. Theoretically, we derive the $t$-$k$-means and analyze its robustness and stability from the aspect of the loss function and the expression of the clustering center, respectively. Extensive experiments are also conducted, which verify the effectiveness and efficiency of the proposed method. The code for reproducing main results is available at \url{https://github.com/THUYimingLi/t-k-means}.

preprint2021arXiv

An Optimized H.266/VVC Software Decoder On Mobile Platform

As the successor of H.265/HEVC, the new versatile video coding standard (H.266/VVC) can provide up to 50% bitrate saving with the same subjective quality, at the cost of increased decoding complexity. To accelerate the application of the new coding standard, a real-time H.266/VVC software decoder that can support various platforms is implemented, where SIMD technologies, parallelism optimization, and the acceleration strategies based on the characteristics of each coding tool are applied. As the mobile devices have become an essential carrier for video services nowadays, the mentioned optimization efforts are not only implemented for the x86 platform, but more importantly utilized to highly optimize the decoding performance on the ARM platform in this work. The experimental results show that when running on the Apple A14 SoC (iPhone 12pro), the average single-thread decoding speed of the present implementation can achieve 53fps (RA and LB) for full HD (1080p) bitstreams generated by VTM-11.0 reference software using 8bit Common Test Conditions (CTC). When multi-threading is enabled, an average of 32 fps (RA) can be achieved when decoding the 4K bitstreams.

preprint2021arXiv

Backdoor Attack against Speaker Verification

Speaker verification has been widely and successfully adopted in many mission-critical areas for user identification. The training of speaker verification requires a large amount of data, therefore users usually need to adopt third-party data ($e.g.$, data from the Internet or third-party data company). This raises the question of whether adopting untrusted third-party data can pose a security threat. In this paper, we demonstrate that it is possible to inject the hidden backdoor for infecting speaker verification models by poisoning the training data. Specifically, we design a clustering-based attack scheme where poisoned samples from different clusters will contain different triggers ($i.e.$, pre-defined utterances), based on our understanding of verification tasks. The infected models behave normally on benign samples, while attacker-specified unenrolled triggers will successfully pass the verification even if the attacker has no information about the enrolled speaker. We also demonstrate that existing backdoor attacks cannot be directly adopted in attacking speaker verification. Our approach not only provides a new perspective for designing novel attacks, but also serves as a strong baseline for improving the robustness of verification methods. The code for reproducing main results is available at \url{https://github.com/zhaitongqing233/Backdoor-attack-against-speaker-verification}.

preprint2021arXiv

Rethinking the Trigger of Backdoor Attack

Backdoor attack intends to inject hidden backdoor into the deep neural networks (DNNs), such that the prediction of the infected model will be maliciously changed if the hidden backdoor is activated by the attacker-defined trigger, while it performs well on benign samples. Currently, most of existing backdoor attacks adopted the setting of \emph{static} trigger, $i.e.,$ triggers across the training and testing images follow the same appearance and are located in the same area. In this paper, we revisit this attack paradigm by analyzing the characteristics of the static trigger. We demonstrate that such an attack paradigm is vulnerable when the trigger in testing images is not consistent with the one used for training. We further explore how to utilize this property for backdoor defense, and discuss how to alleviate such vulnerability of existing attacks.

preprint2021arXiv

Targeted Attack against Deep Neural Networks via Flipping Limited Weight Bits

To explore the vulnerability of deep neural networks (DNNs), many attack paradigms have been well studied, such as the poisoning-based backdoor attack in the training stage and the adversarial attack in the inference stage. In this paper, we study a novel attack paradigm, which modifies model parameters in the deployment stage for malicious purposes. Specifically, our goal is to misclassify a specific sample into a target class without any sample modification, while not significantly reduce the prediction accuracy of other samples to ensure the stealthiness. To this end, we formulate this problem as a binary integer programming (BIP), since the parameters are stored as binary bits ($i.e.$, 0 and 1) in the memory. By utilizing the latest technique in integer programming, we equivalently reformulate this BIP problem as a continuous optimization problem, which can be effectively and efficiently solved using the alternating direction method of multipliers (ADMM) method. Consequently, the flipped critical bits can be easily determined through optimization, rather than using a heuristic strategy. Extensive experiments demonstrate the superiority of our method in attacking DNNs.

preprint2021arXiv

Visual Privacy Protection via Mapping Distortion

Privacy protection is an important research area, which is especially critical in this big data era. To a large extent, the privacy of visual classification data is mainly in the mapping between the image and its corresponding label, since this relation provides a great amount of information and can be used in other scenarios. In this paper, we propose the mapping distortion based protection (MDP) and its augmentation-based extension (AugMDP) to protect the data privacy by modifying the original dataset. In the modified dataset generated by MDP, the image and its label are not consistent ($e.g.$, a cat-like image is labeled as the dog), whereas the DNNs trained on it can still achieve good performance on benign testing set. As such, this method can protect privacy when the dataset is leaked. Extensive experiments are conducted, which verify the effectiveness and feasibility of our method. The code for reproducing main results is available at \url{https://github.com/PerdonLiu/Visual-Privacy-Protection-via-Mapping-Distortion}.

preprint2020arXiv

Automatic Failure Recovery and Re-Initialization for Online UAV Tracking with Joint Scale and Aspect Ratio Optimization

Current unmanned aerial vehicle (UAV) visual tracking algorithms are primarily limited with respect to: (i) the kind of size variation they can deal with, (ii) the implementation speed which hardly meets the real-time requirement. In this work, a real-time UAV tracking algorithm with powerful size estimation ability is proposed. Specifically, the overall tracking task is allocated to two 2D filters: (i) translation filter for location prediction in the space domain, (ii) size filter for scale and aspect ratio optimization in the size domain. Besides, an efficient two-stage re-detection strategy is introduced for long-term UAV tracking tasks. Large-scale experiments on four UAV benchmarks demonstrate the superiority of the presented method which has computation feasibility on a low-cost CPU.

preprint2020arXiv

AutoTrack: Towards High-Performance Visual Tracking for UAV with Automatic Spatio-Temporal Regularization

Most existing trackers based on discriminative correlation filters (DCF) try to introduce predefined regularization term to improve the learning of target objects, e.g., by suppressing background learning or by restricting change rate of correlation filters. However, predefined parameters introduce much effort in tuning them and they still fail to adapt to new situations that the designer did not think of. In this work, a novel approach is proposed to online automatically and adaptively learn spatio-temporal regularization term. Spatially local response map variation is introduced as spatial regularization to make DCF focus on the learning of trust-worthy parts of the object, and global response map variation determines the updating rate of the filter. Extensive experiments on four UAV benchmarks have proven the superiority of our method compared to the state-of-the-art CPU- and GPU-based trackers, with a speed of ~60 frames per second running on a single CPU. Our tracker is additionally proposed to be applied in UAV localization. Considerable tests in the indoor practical scenarios have proven the effectiveness and versatility of our localization method. The code is available at https://github.com/vision4robotics/AutoTrack.

preprint2020arXiv

Context-Aware Refinement Network Incorporating Structural Connectivity Prior for Brain Midline Delineation

Brain midline delineation can facilitate the clinical evaluation of brain midline shift, which plays an important role in the diagnosis and prognosis of various brain pathology. Nevertheless, there are still great challenges with brain midline delineation, such as the largely deformed midline caused by the mass effect and the possible morphological failure that the predicted midline is not a connected curve. To address these challenges, we propose a context-aware refinement network (CAR-Net) to refine and integrate the feature pyramid representation generated by the UNet. Consequently, the proposed CAR-Net explores more discriminative contextual features and a larger receptive field, which is of great importance to predict largely deformed midline. For keeping the structural connectivity of the brain midline, we introduce a novel connectivity regular loss (CRL) to punish the disconnectivity between adjacent coordinates. Moreover, we address the ignored prerequisite of previous regression-based methods that the brain CT image must be in the standard pose. A simple pose rectification network is presented to align the source input image to the standard pose image. Extensive experimental results on the CQ dataset and one inhouse dataset show that the proposed method requires fewer parameters and outperforms three state-of-the-art methods in terms of four evaluation metrics. Code is available at https://github.com/ShawnBIT/Brain-Midline-Detection.

preprint2020arXiv

DR^2Track: Towards Real-Time Visual Tracking for UAV via Distractor Repressed Dynamic Regression

Visual tracking has yielded promising applications with unmanned aerial vehicle (UAV). In literature, the advanced discriminative correlation filter (DCF) type trackers generally distinguish the foreground from the background with a learned regressor which regresses the implicit circulated samples into a fixed target label. However, the predefined and unchanged regression target results in low robustness and adaptivity to uncertain aerial tracking scenarios. In this work, we exploit the local maximum points of the response map generated in the detection phase to automatically locate current distractors. By repressing the response of distractors in the regressor learning, we can dynamically and adaptively alter our regression target to leverage the tracking robustness as well as adaptivity. Substantial experiments conducted on three challenging UAV benchmarks demonstrate both excellent performance and extraordinary speed (~50fps on a cheap CPU) of our tracker.

preprint2020arXiv

Eliminating the Electric Field Response in a Perovskite Heterojunction Solar Cell to Improve Operational Stability

Intrinsic and extrinsic ion migration is a very large threat to the operational stability of perovskite solar cells and is difficult to completely eliminate due to the low activation energy of ion migration and the existence of internal electric field. We propose a heterojunction route to help suppress ion migration, thus improving the operational stability of the cell from the perspective of eliminating the electric field response in the perovskite absorber. A heavily doped p-type (p+) thin layer semiconductor is introduced between the electron transporting layer (ETL) and perovskite absorber. The heterojunction charge depletion and electric field are limited to the ETL and p+ layers, while the perovskite absorber and hole transporting layer remain neutral. The p+ layer has a variety of candidate materials and is tolerant of defect density and carrier mobility, which makes this heterojunction route highly feasible and promising for use in practical applications.

preprint2020arXiv

Keyfilter-Aware Real-Time UAV Object Tracking

Correlation filter-based tracking has been widely applied in unmanned aerial vehicle (UAV) with high efficiency. However, it has two imperfections, i.e., boundary effect and filter corruption. Several methods enlarging the search area can mitigate boundary effect, yet introducing undesired background distraction. Existing frame-by-frame context learning strategies for repressing background distraction nevertheless lower the tracking speed. Inspired by keyframe-based simultaneous localization and mapping, keyfilter is proposed in visual tracking for the first time, in order to handle the above issues efficiently and effectively. Keyfilters generated by periodically selected keyframes learn the context intermittently and are used to restrain the learning of filters, so that 1) context awareness can be transmitted to all the filters via keyfilter restriction, and 2) filter corruption can be repressed. Compared to the state-of-the-art results, our tracker performs better on two challenging benchmarks, with enough speed for UAV real-time applications.

preprint2020arXiv

Multinomial Random Forest: Toward Consistency and Privacy-Preservation

Despite the impressive performance of random forests (RF), its theoretical properties have not been thoroughly understood. In this paper, we propose a novel RF framework, dubbed multinomial random forest (MRF), to analyze the \emph{consistency} and \emph{privacy-preservation}. Instead of deterministic greedy split rule or with simple randomness, the MRF adopts two impurity-based multinomial distributions to randomly select a split feature and a split value respectively. Theoretically, we prove the consistency of the proposed MRF and analyze its privacy-preservation within the framework of differential privacy. We also demonstrate with multiple datasets that its performance is on par with the standard RF. To the best of our knowledge, MRF is the first consistent RF variant that has comparable performance to the standard RF.

preprint2020arXiv

Rectified Decision Trees: Exploring the Landscape of Interpretable and Effective Machine Learning

Interpretability and effectiveness are two essential and indispensable requirements for adopting machine learning methods in reality. In this paper, we propose a knowledge distillation based decision trees extension, dubbed rectified decision trees (ReDT), to explore the possibility of fulfilling those requirements simultaneously. Specifically, we extend the splitting criteria and the ending condition of the standard decision trees, which allows training with soft labels while preserving the deterministic splitting paths. We then train the ReDT based on the soft label distilled from a well-trained teacher model through a novel jackknife-based method. Accordingly, ReDT preserves the excellent interpretable nature of the decision trees while having a relatively good performance. The effectiveness of adopting soft labels instead of hard ones is also analyzed empirically and theoretically. Surprisingly, experiments indicate that the introduction of soft labels also reduces the model size compared with the standard decision trees from the aspect of the total nodes and rules, which is an unexpected gift from the `dark knowledge' distilled from the teacher model.

preprint2020arXiv

Rectified Decision Trees: Towards Interpretability, Compression and Empirical Soundness

How to obtain a model with good interpretability and performance has always been an important research topic. In this paper, we propose rectified decision trees (ReDT), a knowledge distillation based decision trees rectification with high interpretability, small model size, and empirical soundness. Specifically, we extend the impurity calculation and the pure ending condition of the classical decision tree to propose a decision tree extension that allows the use of soft labels generated by a well-trained teacher model in training and prediction process. It is worth noting that for the acquisition of soft labels, we propose a new multiple cross-validation based method to reduce the effects of randomness and overfitting. These approaches ensure that ReDT retains excellent interpretability and even achieves fewer nodes than the decision tree in the aspect of compression while having relatively good performance. Besides, in contrast to traditional knowledge distillation, back propagation of the student model is not necessarily required in ReDT, which is an attempt of a new knowledge distillation approach. Extensive experiments are conducted, which demonstrates the superiority of ReDT in interpretability, compression, and empirical soundness.

preprint2020arXiv

Spin-orbit coupling in photonic graphene

We generate experimentally a honeycomb refractive index pattern in an atomic vapor cell using electromagnetically-induced transparency. We study experimentally and theoretically the propagation of polarized light beams in such "photonic graphene". We demonstrate that an effective spin-orbit coupling appears as a correction to the paraxial beam equations because of the strong spatial gradients of the permittivity. It leads to the coupling of spin and angular momentum at the Dirac points of the graphene lattice. Our results suggest that the polarization degree plays an important role in many configurations where it has been previously neglected.

preprint2020arXiv

Targeted Attack for Deep Hashing based Retrieval

The deep hashing based retrieval method is widely adopted in large-scale image and video retrieval. However, there is little investigation on its security. In this paper, we propose a novel method, dubbed deep hashing targeted attack (DHTA), to study the targeted attack on such retrieval. Specifically, we first formulate the targeted attack as a point-to-set optimization, which minimizes the average distance between the hash code of an adversarial example and those of a set of objects with the target label. Then we design a novel component-voting scheme to obtain an anchor code as the representative of the set of hash codes of objects with the target label, whose optimality guarantee is also theoretically derived. To balance the performance and perceptibility, we propose to minimize the Hamming distance between the hash code of the adversarial example and the anchor code under the $\ell^\infty$ restriction on the perturbation. Extensive experiments verify that DHTA is effective in attacking both deep hashing based image retrieval and video retrieval.

preprint2020arXiv

The ABC130 barrel module prototyping programme for the ATLAS strip tracker

For the Phase-II Upgrade of the ATLAS Detector, its Inner Detector, consisting of silicon pixel, silicon strip and transition radiation sub-detectors, will be replaced with an all new 100 % silicon tracker, composed of a pixel tracker at inner radii and a strip tracker at outer radii. The future ATLAS strip tracker will include 11,000 silicon sensor modules in the central region (barrel) and 7,000 modules in the forward region (end-caps), which are foreseen to be constructed over a period of 3.5 years. The construction of each module consists of a series of assembly and quality control steps, which were engineered to be identical for all production sites. In order to develop the tooling and procedures for assembly and testing of these modules, two series of major prototyping programs were conducted: an early program using readout chips designed using a 250 nm fabrication process (ABCN-25) and a subsequent program using a follow-up chip set made using 130 nm processing (ABC130 and HCC130 chips). This second generation of readout chips was used for an extensive prototyping program that produced around 100 barrel-type modules and contributed significantly to the development of the final module layout. This paper gives an overview of the components used in ABC130 barrel modules, their assembly procedure and findings resulting from their tests.

preprint2020arXiv

Toward Adversarial Robustness via Semi-supervised Robust Training

Adversarial examples have been shown to be the severe threat to deep neural networks (DNNs). One of the most effective adversarial defense methods is adversarial training (AT) through minimizing the adversarial risk $R_{adv}$, which encourages both the benign example $x$ and its adversarially perturbed neighborhoods within the $\ell_{p}$-ball to be predicted as the ground-truth label. In this work, we propose a novel defense method, the robust training (RT), by jointly minimizing two separated risks ($R_{stand}$ and $R_{rob}$), which is with respect to the benign example and its neighborhoods respectively. The motivation is to explicitly and jointly enhance the accuracy and the adversarial robustness. We prove that $R_{adv}$ is upper-bounded by $R_{stand} + R_{rob}$, which implies that RT has similar effect as AT. Intuitively, minimizing the standard risk enforces the benign example to be correctly predicted, and the robust risk minimization encourages the predictions of the neighbor examples to be consistent with the prediction of the benign example. Besides, since $R_{rob}$ is independent of the ground-truth label, RT is naturally extended to the semi-supervised mode ($i.e.$, SRT), to further enhance the adversarial robustness. Moreover, we extend the $\ell_{p}$-bounded neighborhood to a general case, which covers different types of perturbations, such as the pixel-wise ($i.e.$, $x + δ$) or the spatial perturbation ($i.e.$, $ AX + b$). Extensive experiments on benchmark datasets not only verify the superiority of the proposed SRT method to state-of-the-art methods for defensing pixel-wise or spatial perturbations separately, but also demonstrate its robustness to both perturbations simultaneously. The code for reproducing main results is available at \url{https://github.com/THUYimingLi/Semi-supervised_Robust_Training}.

preprint2020arXiv

Towards Robust Visual Tracking for Unmanned Aerial Vehicle with Tri-Attentional Correlation Filters

Object tracking has been broadly applied in unmanned aerial vehicle (UAV) tasks in recent years. However, existing algorithms still face difficulties such as partial occlusion, clutter background, and other challenging visual factors. Inspired by the cutting-edge attention mechanisms, a novel object tracking framework is proposed to leverage multi-level visual attention. Three primary attention, i.e., contextual attention, dimensional attention, and spatiotemporal attention, are integrated into the training and detection stages of correlation filter-based tracking pipeline. Therefore, the proposed tracker is equipped with robust discriminative power against challenging factors while maintaining high operational efficiency in UAV scenarios. Quantitative and qualitative experiments on two well-known benchmarks with 173 challenging UAV video sequences demonstrate the effectiveness of the proposed framework. The proposed tracking algorithm favorably outperforms 12 state-of-the-art methods, yielding 4.8% relative gain in UAVDT and 8.2% relative gain in UAV123@10fps against the baseline tracker while operating at the speed of $\sim$ 28 frames per second.

preprint2020arXiv

Training-Set Distillation for Real-Time UAV Object Tracking

Correlation filter (CF) has recently exhibited promising performance in visual object tracking for unmanned aerial vehicle (UAV). Such online learning method heavily depends on the quality of the training-set, yet complicated aerial scenarios like occlusion or out of view can reduce its reliability. In this work, a novel time slot-based distillation approach is proposed to efficiently and effectively optimize the training-set's quality on the fly. A cooperative energy minimization function is established to score the historical samples adaptively. To accelerate the scoring process, frames with high confident tracking results are employed as the keyframes to divide the tracking process into multiple time slots. After the establishment of a new slot, the weighted fusion of the previous samples generates one key-sample, in order to reduce the number of samples to be scored. Besides, when the current time slot exceeds the maximum frame number, which can be scored, the sample with the lowest score will be discarded. Consequently, the training-set can be efficiently and reliably distilled. Comprehensive tests on two well-known UAV benchmarks prove the effectiveness of our method with real-time speed on a single CPU.

preprint2019arXiv

Clearly Discriminate the Continuum Band and Exciton State of the Hybrid Lead Bromide Perovskite

Electronic states of the hybrid perovskite enable their promising applications as distinctive optoelectronic materials. The understanding of their electronic structures and charge characters remains highly controversial. The electronic mechanism such as reabsorption, Urbach tail and indirect band for interpreting dual-peak emissions is one of the controversial focuses. Herein, we report that through heterojunction enhanced exciton dissociation and global tracing of multiple radiative electronic states across wide temperature regions, we have succeeded in directly observing free carrier emissions from the hybrid lead bromide perovskite and clearly discriminating the direct continuum band and exciton states. The widely-concerned dual-peak emissions are clarified to be excitonic, arising from two types of exciton states of the perovskite. These excitons possess giant binding energies and superior phase stability compared to conventional inorganic semiconductors, providing important implications for exploiting the excitonic mechanism for realizing novel optoelectronic applications.

preprint2019arXiv

Exploiting Electrical Transients to Reveal Charge Loss Mechanism of Junction Solar Cells

Electrical transients enabled by optical excitation and electric detection provide a distinctive opportunity to study the charge transport, recombination and even the hysteresis of a solar cell in a much wider time window ranging from nanoseconds to seconds. However, controversies on how to exploit these investigations to unravel the charge loss mechanism of the cell have been ongoing. Herein, a new methodology of quantifying the charge loss within the bulk absorber or at the interfaces and the defect properties of junction solar cells has been proposed after the conventional tail-state framework is firstly demonstrated to be unreasonable. This methodology has been successfully applied in the study of commercialized silicon and emerging Cu2ZnSn(S, Se)4 and perovskite solar cells herein and should be universal to other photovoltaic device systems with similar structures. Overall, this work provides an alluring route for comprehensive investigation of dynamic physics processes and charge loss mechanism of junction solar cells and possesses potential applications for other optoelectronic devices.