Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
28works
0followers
21topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

28 published item(s)

preprint2026arXiv

Diffusion-APO: Trajectory-Aware Direct Preference Alignment for Video Diffusion Transformers

Efficiently aligning large-scale video diffusion models with human intent requires a scalable and trajectory-aware pathway that bridges the inherent discrepancy between training noise distributions and practical inference trajectories. While existing paradigms such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) attempt to address this, they are often hindered by either reliance on bias-prone, complex reward models or suboptimal timestep sampling. In this paper, we propose Diffusion-APO (Aligned Preference Optimization), a trajectory-aware algorithm that resolves this misalignment by synchronizing training noise with inference-time denoising paths to maximize gradient signal efficacy. To translate this algorithmic innovation into a practical solution, we introduce a unified and modular RLHF framework that integrates online ranking, half-online anchoring, offline refinement, and distillation-aware drift correction. This framework enables flexible, multi-stage preference alignment across diverse data and computational constraints without relying on scalar-reward-based policy gradients. Through extensive experiments, we demonstrate that Diffusion-APO consistently outperforms standard baselines in visual quality and instruction following, while effectively preserving generative fidelity during model acceleration, providing a robust, end-to-end pathway for scalable video diffusion alignment.

preprint2026arXiv

Electric field switching of altermagnetic spin-splitting in multiferroic skyrmions

Magnetic skyrmions are localized magnetic structures that retain their shape and stability over time, thanks to their topological nature. Recent theoretical and experimental progress has laid the groundwork for understanding magnetic skyrmions characterized by negligible net magnetization and ultrafast dynamics. Notably, skyrmions emerging in materials with altermagnetism, a novel magnetic phase featuring lifted Kramers degeneracy-have remained unreported until now. In this study, we demonstrate that BiFeO3, a multiferroic renowned for its strong coupling between ferroelectricity and magnetism, can transit from a spin cycloid to a Neel-type skyrmion under antidamping spin-orbit torque at room temperature. Strikingly, the altermagnetic spin splitting within BiFeO3 skyrmion can be reversed through the application of an electric field, revealed via the Circular photogalvanic effect. This quasiparticle, which possesses a neutral topological charge, holds substantial promise for diverse applications-most notably, enabling the development of unconventional computing systems with low power consumption and magnetoelectric controllability.

preprint2026arXiv

Exposing and Mitigating Temporal Attack in Deepfake Video Detection

While spatiotemporal deepfake detectors achieve high AUC, our experiments reveal their susceptibility to evasion attacks. These models tend to overfit on fragile temporal spectrum cues, rather than learning robust semantic causality. To mitigate this vulnerability, we propose SpInShield, a temporal spectral-invariant defense framework explicitly designed to decouple semantic motion from manipulatable spectral artifacts. We propose a learnable spectral adversary that dynamically synthesizes severe spectral deformations, simulating extreme attack scenarios. By employing a shortcut suppression optimization strategy, SpInShield compels the encoder to extract reliable forensic cues while purging unstable spectral statistics from the latent space. Experiments show that SpInShield obtains competitive performance on widely used datasets and outperforms the strongest baseline by 21.30 percentage points in AUC under simulated amplitude spectral attacks.

preprint2026arXiv

Stereo Audio Rendering for Personal Sound Zones Using a Binaural Spatially Adaptive Neural Network (BSANN)

A binaural rendering framework for personal sound zones (PSZs) is proposed to enable multiple head-tracked listeners to receive fully independent stereo audio programs. Current PSZ systems typically rely on monophonic rendering and therefore cannot control the left and right ears separately, which limits the quality and accuracy of spatial imaging. The proposed method employs a Binaural Spatially Adaptive Neural Network (BSANN) to generate ear-optimized loudspeaker filters that reconstruct the desired acoustic field at each ear of multiple listeners. The framework integrates anechoically measured loudspeaker frequency responses, analytically modeled transducer directivity, and rigid-sphere head-related transfer functions (HRTFs) to enhance acoustic accuracy and spatial rendering fidelity. An explicit active crosstalk cancellation (XTC) stage further improves three-dimensional spatial perception. Experiments show significant gains in measured objective performance metrics, including inter-zone isolation (IZI), inter-program isolation (IPI), and crosstalk cancellation (XTC), with log-frequency-weighted values of 10.23/10.03 dB (IZI), 11.11/9.16 dB (IPI), and 10.55/11.13 dB (XTC), respectively, over 100-20,000 Hz. The combined use of ear-wise control, accurate acoustic modeling, and integrated active XTC produces a unified rendering method that delivers greater isolation performance, increased robustness to room asymmetry, and more faithful spatial reproduction in real acoustic environments.

preprint2026arXiv

TEA: Temporal Adaptive Satellite Image Semantic Segmentation

Crop mapping based on satellite images time-series (SITS) holds substantial economic value in agricultural production settings, in which parcel segmentation is an essential step. Existing approaches have achieved notable advancements in SITS segmentation with predetermined sequence lengths. However, we found that these approaches overlooked the generalization capability of models across scenarios with varying temporal length, leading to markedly poor segmentation results in such cases. To address this issue, we propose TEA, a TEmporal Adaptive SITS semantic segmentation method to enhance the model's resilience under varying sequence lengths. We introduce a teacher model that encapsulates the global sequence knowledge to guide a student model with adaptive temporal input lengths. Specifically, teacher shapes the student's feature space via intermediate embedding, prototypes and soft label perspectives to realize knowledge transfer, while dynamically aggregating student model to mitigate knowledge forgetting. Finally, we introduce full-sequence reconstruction as an auxiliary task to further enhance the quality of representations across inputs of varying temporal lengths. Through extensive experiments, we demonstrate that our method brings remarkable improvements across inputs of different temporal lengths on common benchmarks. Our code will be publicly available.

preprint2026arXiv

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs

On-policy distillation (OPD) is widely used for LLM post-training. When pushed with a reward-extrapolation coefficient lambda > 1, the student can lift past the teacher in domain, but past a threshold lambda* the same step violates the output contract on structured-output tasks. In a single-position Bernoulli reduction, we derive a closed-form base-relative clip-safety threshold lambda*(p,b,c) determined by three measurable quantities: the teacher modal probability, the warm-start mass, and the importance-sampling clip strength. Above lambda*, the extrapolated fixed point exits the clip-safe region, changing training from format-preserving to format-collapsing. We extend the rule to calibrated K-ary listwise JSON tasks where a single binding equivalence class dominates the output contract and SFT retains parse headroom. On Amazon Fashion, three pre-registered tests--a fine-grid cliff interval, a budget-extension test, and a small-clip cross-prediction--fall within their locked prediction windows, with the small-clip value matching the closed-form prediction below grid resolution. Operating just below lambda*, ListOPD brings a 1.7B Qwen3 student to in-domain parity with an 8B-SFT baseline at one-fifth the parameters. The gain is driven primarily by format adherence: NDCG@1 on parsed outputs remains flat across lambda, while parse validity sharply changes at the predicted boundary. The cliff diagnostic is rubric-independent, whereas the parity claim uses a Gemini-graded rubric and inherits that evaluator's exposure.

preprint2026arXiv

Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but remain costly in both model size and inference cost. They typically employ separate reasoner and embedder with substantial parameter overhead, and generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model, degrading performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient conflicts introduced by joint optimization while keeping parameters close to a single model. Building on this, an adaptive think mechanism uses a self-supervised routing gate to decide per input whether to generate CoT, skipping unnecessary reasoning to reduce inference overhead and even improve retrieval quality. We further explore embedding-guided RL to optimize CoT quality beyond supervised training. On the 78 tasks of MMEB-V2, TWN achieves state-of-the-art embedding quality while being substantially more efficient than existing generative methods, requiring only 3-5% additional parameters relative to the backbone and up to 50% fewer reasoning tokens compared to the full generative mode.

preprint2026arXiv

Towards Customized Multimodal Role-Play

Unified multimodal understanding and generation models enable richer human-AI interaction. Yet jointly customizing a character's persona, dialogue style, and visual identity while maintaining output consistency across modalities remains largely unexplored. To mitigate this gap, we introduce a new task, Customized Multimodal Role-Play (CMRP). We construct the RoleScape-20 dataset comprising 20 characters, including training and evaluation data that cover persona, stylistic descriptions, visual/expressive cues, and text-image interactions. Building on a unified model, we devise UniCharacter, a two-stage training framework containing Unified Supervised Finetuning (Unified-SFT) and character-specific group relative policy optimization (Character-GRPO). Given only 10 images plus corresponding interaction examples, the model acquires the target character and exhibits coherent persona, style, and visual identity in both generated text and images. This process takes about 100 GPU hours. Experiments on the RoleScape-20 dataset show that the proposed method substantially outperforms prior approaches. Ablation studies further validate the effectiveness of our cross-modal consistency design and few-shot customization strategy. We argue that CMRP, coupled with unified modeling, provides a basis for next-generation characterful and immersive interactive agents.

preprint2026arXiv

Unified Personalized Understanding, Generating and Editing

Unified large multimodal models (LMMs) have achieved remarkable progress in general-purpose multimodal understanding and generation. However, they still operate under a ``one-size-fits-all&#39;&#39; paradigm and struggle to model user-specific concepts (e.g., generate a photo of \texttt{<maeve>}) in a consistent and controllable manner. Existing personalization methods typically rely on external retrieval, which is inefficient and poorly integrated into unified multimodal pipelines. Recent personalized unified models introduce learnable soft prompts to encode concept information, yet they either couple understanding and generation or depend on complex multi-stage training, leading to cross-task interference and ultimately to fuzzy or misaligned personalized knowledge. We present \textbf{OmniPersona}, an end-to-end personalization framework for unified LMMs that, for the first time, integrates personalized understanding, generation, and image editing within a single architecture. OmniPersona introduces structurally decoupled concept tokens, allocating dedicated subspaces for different tasks to minimize interference, and incorporates an explicit knowledge replay mechanism that propagates personalized attribute knowledge across tasks, enabling consistent personalized behavior. To systematically evaluate unified personalization, we propose \textbf{\texttt{OmniPBench}}, extending the public UnifyBench concept set with personalized editing tasks and cross-task evaluation protocols integrating understanding, generation, and editing. Experimental results demonstrate that OmniPersona delivers competitive and robust performance across diverse personalization tasks. We hope OmniPersona will serve as a strong baseline and spur further research on controllable, unified personalization.

preprint2024arXiv

Hybrid Vector Message Passing for Generalized Bilinear Factorization

In this paper, we propose a new message passing algorithm that utilizes hybrid vector message passing (HVMP) to solve the generalized bilinear factorization (GBF) problem. The proposed GBF-HVMP algorithm integrates expectation propagation (EP) and variational message passing (VMP) via variational free energy minimization, yielding tractable Gaussian messages. Furthermore, GBF-HVMP enables vector/matrix variables rather than scalar ones in message passing, resulting in a loop-free Bayesian network that improves convergence. Numerical results show that GBF-HVMP significantly outperforms state-of-the-art methods in terms of NMSE performance and computational complexity.

preprint2022arXiv

Bayesian Inverse Uncertainty Quantification of the Physical Model Parameters for the Spallation Neutron Source First Target Station

The reliability of the mercury spallation target is mission-critical for the neutron science program of the spallation neutron source at the Oak Ridge National Laboratory. We present an inverse uncertainty quantification (UQ) study using the Bayesian framework for the mercury equation of state model parameters, with the assistance of polynomial chaos expansion surrogate models. By leveraging high-fidelity structural mechanics simulations and real measured strain data, the inverse UQ results reveal a maximum-a-posteriori estimate, mean, and standard deviation of $6.5\times 10^4$ ($6.49\times 10^4 \pm 2.39\times 10^3$) Pa for the tensile cutoff threshold, $12112.1$ ($12111.8 \pm 14.9$) kg/m$^3$ for the mercury density, and $1850.4$ ($1849.7 \pm 5.3$) m/s for the mercury speed of sound. These values do not necessarily represent the nominal mercury physical properties, but the ones that fit the strain data and the solid mechanics model we have used, and can be explained by three reasons: The limitations of the computer model or what is known as the model-form uncertainty, the biases and errors in the experimental data, and the mercury cavitation damage that also contributes to the change in mercury behavior. Consequently, the equation of state model parameters try to compensate for these effects to improve fitness to the data. The mercury target simulations using the updated parametric values result in an excellent agreement with 88% average accuracy compared to experimental data, 6% average increase compared to reference parameters, with some sensors experiencing an increase of more than 25%. With a more accurate simulated strain response, the component fatigue analysis can utilize the comprehensive strain history data to evaluate the target vessel&#39;s lifetime closer to its real limit, saving tremendous target costs.

preprint2022arXiv

ConfPred: A layered intergrowth structure prediction model based on confinement self-assembly in two-dimensional interlayer space

We constructed a simple but effective model to predict the layered intergrowth structures by combining the self-assembly phenomenon in confined space and the sandwich configuration of layered materials. In this model, a two-dimensional confined space is constructed by two known block layers, such as the Fe$_2$As$_2$ block of iron-based superconductors. Then, the crystal structure prediction is carried out only inside the confined space to search for brand-new block layers. We realized this model on the basis of the USPEX9.4 code. In the test, the already existing iron-based superconductors can be always successfully found, such as Ba$_2$Ti$_2$Fe$_2$As$_4$O, Sr$_3$Sc$_2$Fe$_2$As$_2$O$_5$, Sr$_4$V$_2$Fe$_2$As$_2$O$_6$, and so on. The comparison test suggests that our model has remarkable advantages in searching for intergrowth structures. With this space confinement prediction model, a structure prediction of layered intergrowth materials even with up to six elements can be performed with an acceptable machine time consumption. So far, we have done some multi-composition crystal structure predictions of iron-based superconductor, and found several stable and metastable structures, such as Ba$_2$Fe$_2$As$_3$, Eu$_2$Fe$_2$As$_3$, La$_2$O$_2$ClFeAs, LiOMn$_2$As,Li$_4$OFe$_2$As$_2$.

preprint2022arXiv

Ego4D: Around the World in 3,000 Hours of Egocentric Video

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household, outdoor, workplace, leisure, etc.) captured by 931 unique camera wearers from 74 worldwide locations and 9 different countries. The approach to collection is designed to uphold rigorous privacy and ethics standards with consenting participants and robust de-identification procedures where relevant. Ego4D dramatically expands the volume of diverse egocentric video footage publicly available to the research community. Portions of the video are accompanied by audio, 3D meshes of the environment, eye gaze, stereo, and/or synchronized videos from multiple egocentric cameras at the same event. Furthermore, we present a host of new benchmark challenges centered around understanding the first-person visual experience in the past (querying an episodic memory), present (analyzing hand-object manipulation, audio-visual conversation, and social interactions), and future (forecasting activities). By publicly sharing this massive annotated dataset and benchmark suite, we aim to push the frontier of first-person perception. Project page: https://ego4d-data.org/

preprint2022arXiv

Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization

Augmented reality devices have the potential to enhance human perception and enable other assistive functionalities in complex conversational environments. Effectively capturing the audio-visual context necessary for understanding these social interactions first requires detecting and localizing the voice activities of the device wearer and the surrounding people. These tasks are challenging due to their egocentric nature: the wearer&#39;s head motion may cause motion blur, surrounding people may appear in difficult viewing angles, and there may be occlusions, visual clutter, audio noise, and bad lighting. Under these conditions, previous state-of-the-art active speaker detection methods do not give satisfactory results. Instead, we tackle the problem from a new setting using both video and multi-channel microphone array audio. We propose a novel end-to-end deep learning approach that is able to give robust voice activity detection and localization results. In contrast to previous methods, our method localizes active speakers from all possible directions on the sphere, even outside the camera&#39;s field of view, while simultaneously detecting the device wearer&#39;s own voice activity. Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.

preprint2022arXiv

Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering

To alleviate the data scarcity problem in training question answering systems, recent works propose additional intermediate pre-training for dense passage retrieval (DPR). However, there still remains a large discrepancy between the provided upstream signals and the downstream question-passage relevance, which leads to less improvement. To bridge this gap, we propose the HyperLink-induced Pre-training (HLP), a method to pre-train the dense retriever with the text relevance induced by hyperlink-based topology within Web documents. We demonstrate that the hyperlink-based structures of dual-link and co-mention can provide effective relevance signals for large-scale pre-training that better facilitate downstream passage retrieval. We investigate the effectiveness of our approach across a wide range of open-domain QA datasets under zero-shot, few-shot, multi-hop, and out-of-domain scenarios. The experiments show our HLP outperforms the BM25 by up to 7 points as well as other pre-training methods by more than 10 points in terms of top-20 retrieval accuracy under the zero-shot scenario. Furthermore, HLP significantly outperforms other pre-training methods under the other scenarios.

preprint2022arXiv

KMIR: A Benchmark for Evaluating Knowledge Memorization, Identification and Reasoning Abilities of Language Models

Previous works show the great potential of pre-trained language models (PLMs) for storing a large amount of factual knowledge. However, to figure out whether PLMs can be reliable knowledge sources and used as alternative knowledge bases (KBs), we need to further explore some critical features of PLMs. Firstly, knowledge memorization and identification abilities: traditional KBs can store various types of entities and relationships; do PLMs have a high knowledge capacity to store different types of knowledge? Secondly, reasoning ability: a qualified knowledge source should not only provide a collection of facts, but support a symbolic reasoner. Can PLMs derive new knowledge based on the correlations between facts? To evaluate these features of PLMs, we propose a benchmark, named Knowledge Memorization, Identification, and Reasoning test (KMIR). KMIR covers 3 types of knowledge, including general knowledge, domain-specific knowledge, and commonsense, and provides 184,348 well-designed questions. Preliminary experiments with various representative pre-training language models on KMIR reveal many interesting phenomenons: 1) The memorization ability of PLMs depends more on the number of parameters than training schemes. 2) Current PLMs are struggling to robustly remember the facts. 3) Model compression technology retains the amount of knowledge well, but hurts the identification and reasoning abilities. We hope KMIR can facilitate the design of PLMs as better knowledge sources.

preprint2022arXiv

Model Calibration of the Liquid Mercury Spallation Target using Evolutionary Neural Networks and Sparse Polynomial Expansions

The mercury constitutive model predicting the strain and stress in the target vessel plays a central role in improving the lifetime prediction and future target designs of the mercury targets at the Spallation Neutron Source (SNS). We leverage the experiment strain data collected over multiple years to improve the mercury constitutive model through a combination of large-scale simulations of the target behavior and the use of machine learning tools for parameter estimation. We present two interdisciplinary approaches for surrogate-based model calibration of expensive simulations using evolutionary neural networks and sparse polynomial expansions. The experiments and results of the two methods show a very good agreement for the solid mechanics simulation of the mercury spallation target. The proposed methods are used to calibrate the tensile cutoff threshold, mercury density, and mercury speed of sound during intense proton pulse experiments. Using strain experimental data from the mercury target sensors, the newly calibrated simulations achieve 7\% average improvement on the signal prediction accuracy and 8\% reduction in mean absolute error compared to previously reported reference parameters, with some sensors experiencing up to 30\% improvement. The proposed calibrated simulations can significantly aid in fatigue analysis to estimate the mercury target lifetime and integrity, which reduces abrupt target failure and saves a tremendous amount of costs. However, an important conclusion from this work points out to a deficiency in the current constitutive model based on the equation of state in capturing the full physics of the spallation reaction. Given that some of the calibrated parameters that show a good agreement with the experimental data can be nonphysical mercury properties, we need a more advanced two-phase flow model to capture bubble dynamics and mercury cavitation.

preprint2022arXiv

PReGAN: Answer Oriented Passage Ranking with Weakly Supervised GAN

Beyond topical relevance, passage ranking for open-domain factoid question answering also requires a passage to contain an answer (answerability). While a few recent studies have incorporated some reading capability into a ranker to account for answerability, the ranker is still hindered by the noisy nature of the training data typically available in this area, which considers any passage containing an answer entity as a positive sample. However, the answer entity in a passage is not necessarily mentioned in relation with the given question. To address the problem, we propose an approach called \ttt{PReGAN} for Passage Reranking based on Generative Adversarial Neural networks, which incorporates a discriminator on answerability, in addition to a discriminator on topical relevance. The goal is to force the generator to rank higher a passage that is topically relevant and contains an answer. Experiments on five public datasets show that \ttt{PReGAN} can better rank appropriate passages, which in turn, boosts the effectiveness of QA systems, and outperforms the existing approaches without using external data.

preprint2022arXiv

Residual-Aided End-to-End Learning of Communication System without Known Channel

Leveraging powerful deep learning techniques, the end-to-end (E2E) learning of communication system is able to outperform the classical communication system. Unfortunately, this communication system cannot be trained by deep learning without known channel. To deal with this problem, a generative adversarial network (GAN) based training scheme has been recently proposed to imitate the real channel. However, the gradient vanishing and overfitting problems of GAN will result in the serious performance degradation of E2E learning of communication system. To mitigate these two problems, we propose a residual aided GAN (RA-GAN) based training scheme in this paper. Particularly, inspired by the idea of residual learning, we propose a residual generator to mitigate the gradient vanishing problem by realizing a more robust gradient backpropagation. Moreover, to cope with the overfitting problem, we reconstruct the loss function for training by adding a regularizer, which limits the representation ability of RA-GAN. Simulation results show that the trained residual generator has better generation performance than the conventional generator, and the proposed RA-GAN based training scheme can achieve the near-optimal block error rate (BLER) performance with a negligible computational complexity increase in both the theoretical channel model and the ray-tracing based channel dataset.

preprint2022arXiv

Towards Efficient NLP: A Standard Evaluation and A Strong Baseline

Supersized pre-trained language models have pushed the accuracy of various natural language processing (NLP) tasks to a new state-of-the-art (SOTA). Rather than pursuing the reachless SOTA accuracy, more and more researchers start paying attention on model efficiency and usability. Different from accuracy, the metric for efficiency varies across different studies, making them hard to be fairly compared. To that end, this work presents ELUE (Efficient Language Understanding Evaluation), a standard evaluation, and a public leaderboard for efficient NLP models. ELUE is dedicated to depict the Pareto Frontier for various language understanding tasks, such that it can tell whether and how much a method achieves Pareto improvement. Along with the benchmark, we also release a strong baseline, ElasticBERT, which allows BERT to exit at any layer in both static and dynamic ways. We demonstrate the ElasticBERT, despite its simplicity, outperforms or performs on par with SOTA compressed and early exiting models. With ElasticBERT, the proposed ELUE has a strong Pareto Frontier and makes a better evaluation for efficient NLP models.

preprint2021arXiv

Feasible Computationally Efficient Path Planning for UAV Collision Avoidance

This paper presents a robust computationally efficient real-time collision avoidance algorithm for Unmanned Aerial Vehicle (UAV), namely Memory-based Wall Following-Artificial Potential Field (MWF-APF) method. The new algorithm switches between Wall-Following Method (WFM) and Artificial Potential Field method (APF) with improved situation awareness capability. Historical trajectory is taken into account to avoid repetitive wrong decision. Furthermore, it can be effectively applied to platform with low computing capability. As an example, a quad-rotor equipped with limited number of Time-of-Flight (TOF) rangefinders is adopted to validate the effectiveness and efficiency of this algorithm. Both software simulation and physical flight test have been conducted to demonstrate the capability of the MWF-APF method in complex scenarios.

preprint2021arXiv

Overcoming Long-term Catastrophic Forgetting through Adversarial Neural Pruning and Synaptic Consolidation

Artificial neural networks face the well-known problem of catastrophic forgetting. What&#39;s worse, the degradation of previously learned skills becomes more severe as the task sequence increases, known as the long-term catastrophic forgetting. It is due to two facts: first, as the model learns more tasks, the intersection of the low-error parameter subspace satisfying for these tasks becomes smaller or even does not exist; second, when the model learns a new task, the cumulative error keeps increasing as the model tries to protect the parameter configuration of previous tasks from interference. Inspired by the memory consolidation mechanism in mammalian brains with synaptic plasticity, we propose a confrontation mechanism in which Adversarial Neural Pruning and synaptic Consolidation (ANPyC) is used to overcome the long-term catastrophic forgetting issue. The neural pruning acts as long-term depression to prune task-irrelevant parameters, while the novel synaptic consolidation acts as long-term potentiation to strengthen task-relevant parameters. During the training, this confrontation achieves a balance in that only crucial parameters remain, and non-significant parameters are freed to learn subsequent tasks. ANPyC avoids forgetting important information and makes the model efficient to learn a large number of tasks. Specifically, the neural pruning iteratively relaxes the current task&#39;s parameter conditions to expand the common parameter subspace of the task; the synaptic consolidation strategy, which consists of a structure-aware parameter-importance measurement and an element-wise parameter updating strategy, decreases the cumulative error when learning new tasks. The full source code is available at https://github.com/GeoX-Lab/ANPyC.

preprint2021arXiv

Robopheus: A Virtual-Physical Interactive Mobile Robotic Testbed

The mobile robotic testbed is an essential and critical support to verify the effectiveness of mobile robotics research. This paper introduces a novel multi-robot testbed, named Robopheus, which exploits the ideas of virtual-physical modeling in digital-twin. Unlike most existing testbeds, the developed Robopheus constructs a bridge that connects the traditional physical hardware and virtual simulation testbeds, providing scalable, interactive, and high-fidelity simulations-tests on both sides. Another salient feature of the Robopheus is that it enables a new form to learn the actual models from the physical environment dynamically and is compatible with heterogeneous robot chassis and controllers. In turn, the virtual world&#39;s learned models are further leveraged to approximate the robot dynamics online on the physical side. Extensive experiments demonstrate the extraordinary performance of the Robopheus. Significantly, the physical-virtual interaction design increases the trajectory accuracy of a real robot by 300%, compared with that of not using the interaction.

preprint2020arXiv

Design, Control, and Applications of a Soft Robotic Arm

This paper presents the design, control, and applications of a multi-segment soft robotic arm. In order to design a soft arm with large load capacity, several design principles are proposed by analyzing two kinds of buckling issues, under which we present a novel structure named Honeycomb Pneumatic Networks (HPN). Parameter optimization method, based on finite element method (FEM), is proposed to optimize HPN Arm design parameters. Through a quick fabrication process, several prototypes with different performance are made, one of which can achieve the transverse load capacity of 3 kg under 3 bar pressure. Next, considering different internal and external conditions, we develop three controllers according to different model precision. Specifically, based on accurate model, an open-loop controller is realized by combining piece-wise constant curvature (PCC) modeling method and machine learning method. Based on inaccurate model, a feedback controller, using estimated Jacobian, is realized in 3D space. A model-free controller, using reinforcement learning to learn a control policy rather than a model, is realized in 2D plane, with minimal training data. Then, these three control methods are compared on a same experiment platform to explore the applicability of different methods under different conditions. Lastly, we figure out that soft arm can greatly simplify the perception, planning, and control of interaction tasks through its compliance, which is its main advantage over the rigid arm. Through plentiful experiments in three interaction application scenarios, human-robot interaction, free space interaction task, and confined space interaction task, we demonstrate the potential application prospect of the soft arm.

preprint2020arXiv

Learning Differential Diagnosis of Skin Conditions with Co-occurrence Supervision using Graph Convolutional Networks

Skin conditions are reported the 4th leading cause of nonfatal disease burden worldwide. However, given the colossal spectrum of skin disorders defined clinically and shortage in dermatology expertise, diagnosing skin conditions in a timely and accurate manner remains a challenging task. Using computer vision technologies, a deep learning system has proven effective assisting clinicians in image diagnostics of radiology, ophthalmology and more. In this paper, we propose a deep learning system (DLS) that may predict differential diagnosis of skin conditions using clinical images. Our DLS formulates the differential diagnostics as a multi-label classification task over 80 conditions when only incomplete image labels are available. We tackle the label incompleteness problem by combining a classification network with a Graph Convolutional Network (GCN) that characterizes label co-occurrence and effectively regularizes it towards a sparse representation. Our approach is demonstrated on 136,462 clinical images and concludes that the classification accuracy greatly benefit from the Co-occurrence supervision. Our DLS achieves 93.6% top-5 accuracy on 12,378 test images and consistently outperform the baseline classification network.

preprint2020arXiv

Real-time 3D Deep Multi-Camera Tracking

Tracking a crowd in 3D using multiple RGB cameras is a challenging task. Most previous multi-camera tracking algorithms are designed for offline setting and have high computational complexity. Robust real-time multi-camera 3D tracking is still an unsolved problem. In this work, we propose a novel end-to-end tracking pipeline, Deep Multi-Camera Tracking (DMCT), which achieves reliable real-time multi-camera people tracking. Our DMCT consists of 1) a fast and novel perspective-aware Deep GroudPoint Network, 2) a fusion procedure for ground-plane occupancy heatmap estimation, 3) a novel Deep Glimpse Network for person detection and 4) a fast and accurate online tracker. Our design fully unleashes the power of deep neural network to estimate the &#34;ground point&#34; of each person in each color image, which can be optimized to run efficiently and robustly. Our fusion procedure, glimpse network and tracker merge the results from different views, find people candidates using multiple video frames and then track people on the fused heatmap. Our system achieves the state-of-the-art tracking results while maintaining real-time performance. Apart from evaluation on the challenging WILDTRACK dataset, we also collect two more tracking datasets with high-quality labels from two different environments and camera settings. Our experimental results confirm that our proposed real-time pipeline gives superior results to previous approaches.

preprint2020arXiv

Review of data analysis in vision inspection of power lines with an in-depth discussion of deep learning technology

The widespread popularity of unmanned aerial vehicles enables an immense amount of power lines inspection data to be collected. How to employ massive inspection data especially the visible images to maintain the reliability, safety, and sustainability of power transmission is a pressing issue. To date, substantial works have been conducted on the analysis of power lines inspection data. With the aim of providing a comprehensive overview for researchers who are interested in developing a deep-learning-based analysis system for power lines inspection data, this paper conducts a thorough review of the current literature and identifies the challenges for future research. Following the typical procedure of inspection data analysis, we categorize current works in this area into component detection and fault diagnosis. For each aspect, the techniques and methodologies adopted in the literature are summarized. Some valuable information is also included such as data description and method performance. Further, an in-depth discussion of existing deep-learning-related analysis methods in power lines inspection is proposed. Finally, we conclude the paper with several research trends for the future of this area, such as data quality problems, small object detection, embedded application, and evaluation baseline.

preprint2020arXiv

Variance Reduction for Deep Q-Learning using Stochastic Recursive Gradient

Deep Q-learning algorithms often suffer from poor gradient estimations with an excessive variance, resulting in unstable training and poor sampling efficiency. Stochastic variance-reduced gradient methods such as SVRG have been applied to reduce the estimation variance (Zhao et al. 2019). However, due to the online instance generation nature of reinforcement learning, directly applying SVRG to deep Q-learning is facing the problem of the inaccurate estimation of the anchor points, which dramatically limits the potentials of SVRG. To address this issue and inspired by the recursive gradient variance reduction algorithm SARAH (Nguyen et al. 2017), this paper proposes to introduce the recursive framework for updating the stochastic gradient estimates in deep Q-learning, achieving a novel algorithm called SRG-DQN. Unlike the SVRG-based algorithms, SRG-DQN designs a recursive update of the stochastic gradient estimate. The parameter update is along an accumulated direction using the past stochastic gradient information, and therefore can get rid of the estimation of the full gradients as the anchors. Additionally, SRG-DQN involves the Adam process for further accelerating the training process. Theoretical analysis and the experimental results on well-known reinforcement learning tasks demonstrate the efficiency and effectiveness of the proposed SRG-DQN algorithm.