Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
57 works
0 followers
32 topics
4 close collaborators

Published work

57 published item(s)

preprint 2026 arXiv

Robust Alignment: Harmonizing Clean Accuracy and Adversarial Robustness in Adversarial Training

Adversarial Training (AT) is one of the most effective methods for developing robust deep neural networks (DNNs). However, AT faces a trade-off problem between clean accuracy and adversarial robustness. In this work, we reveal a surprising phenomenon for the first time: varying input perturbation intensities for training samples near decision boundaries in AT has minimal impact on model robustness. This finding directly exposes the inconsistency between accuracy and robustness score fluctuations, leading us to identify the misalignment between input and latent spaces as a critical driver of the robustness-accuracy trade-off. To mitigate this misalignment and harmonize accuracy and robustness, we define Robust Alignment as a new AT target, encouraging the model perception to change with input perturbations provided the final label prediction remains unchanged, which can be achieved via two novel ideas. First, we suggest a reduced and fixed perturbation intensity for those boundary samples, which helps the model use the perturbations as learnable patterns rather than as noise that needlessly complicates decision boundaries. Second, we propose a Domain Interpolation Consistency Adversarial Regularization (DICAR), based on rigorous theoretical derivations, which explicitly introduces semantic alignment between input and latent spaces into AT. Based on these two ideas, we end up with a new Robust Alignment Adversarial Training (RAAT) method, effectively harmonizing accuracy and robustness. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet with ResNet-18, PreActResNet-18, and WideResNet-28-10 demonstrate the effectiveness of RAAT in improving the trade-off beyond four common baselines and a total of 14 related state-of-the-art (SOTA) works.

preprint 2024 arXiv

Attacks in Adversarial Machine Learning: A Systematic Survey from the Life-cycle Perspective

Adversarial machine learning (AML) studies the adversarial phenomenon of machine learning, in which models may make predictions that are unexpected or inconsistent with those of humans. Some paradigms have been recently developed to explore this adversarial phenomenon occurring at different stages of a machine learning system, such as backdoor attack occurring at the pre-training, in-training and inference stage; weight attack occurring at the post-training, deployment and inference stage; adversarial attack occurring at the inference stage. However, although these adversarial paradigms share a common goal, their developments are almost independent, and there is still no big picture of AML. In this work, we aim to provide a unified perspective to the AML community to systematically review the overall progress of this field. We first provide a general definition of AML, and then propose a unified mathematical framework to cover existing attack paradigms. According to the proposed unified framework, we build a full taxonomy to systematically categorize and review existing representative methods for each paradigm. Besides, using this unified framework, it is easy to figure out the connections and differences among different attack paradigms, which may inspire future researchers to develop more advanced attack paradigms. Finally, to facilitate the viewing of the built taxonomy and the related literature in adversarial machine learning, we further provide a website, i.e., http://adversarial-ml.com, where the taxonomies and literature will be continuously updated.

preprint 2023 arXiv

BS3D: Building-scale 3D Reconstruction from RGB-D Images

Various datasets have been proposed for simultaneous localization and mapping (SLAM) and related problems. Existing datasets often include small environments, have incomplete ground truth, or lack important sensor data, such as depth and infrared images. We propose an easy-to-use framework for acquiring building-scale 3D reconstruction using a consumer depth camera. Unlike complex and expensive acquisition setups, our system enables crowd-sourcing, which can greatly benefit data-hungry algorithms. Compared to similar systems, we utilize raw depth maps for odometry computation and loop closure refinement which results in better reconstructions. We acquire a building-scale 3D dataset (BS3D) and demonstrate its value by training an improved monocular depth estimation model. As a unique experiment, we benchmark visual-inertial odometry methods using both color and active infrared images.

preprint 2022 arXiv

A weighted first-order formulation for solving anisotropic diffusion equations with deep neural networks

In this paper, a new weighted first-order formulation is proposed for solving anisotropic diffusion equations with deep neural networks. For many numerical schemes, the accurate approximation of the anisotropic heat flux is crucial for the overall accuracy. In this work, the heat flux is first decomposed into two components along the two eigenvectors of the diffusion tensor, so that the anisotropic heat flux approximation is converted into the approximation of two isotropic components. Moreover, to handle a possible jump of the diffusion tensor across an interface, the weighted first-order formulation is obtained by multiplying this first-order formulation by a weighted function. By the decaying property of the weighted function, the weighted first-order formulation is always well-defined in a pointwise sense. Finally, the weighted first-order formulation is solved with a deep neural network approximation. Compared to the neural network approximation with the original second-order elliptic formulation, the proposed method can significantly improve the accuracy, especially for discontinuous anisotropic diffusion problems.
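The flux decomposition mentioned in this abstract can be written out as a short worked equation. This is only a sketch under assumptions the abstract does not state: a two-dimensional model problem, a symmetric positive definite diffusion tensor $A$ with eigenpairs $(\lambda_i, e_i)$, and the symbols $u$, $q$ and $f$ chosen here for illustration. The paper's weighted function is not reproduced.

```latex
% Assumed model problem (not given in the abstract): -\nabla\cdot(A\nabla u) = f,
% with A a symmetric positive definite 2x2 diffusion tensor.
\begin{align*}
A &= \lambda_1 e_1 e_1^{\top} + \lambda_2 e_2 e_2^{\top}, \qquad e_i^{\top} e_j = \delta_{ij},\\
q &= -A\nabla u = -\lambda_1 (e_1\cdot\nabla u)\, e_1 - \lambda_2 (e_2\cdot\nabla u)\, e_2,\\
\nabla\cdot q &= f .
\end{align*}
```

Each term of $q$ is a scalar directional derivative scaled by one eigenvalue, which is the sense in which the anisotropic flux splits into two isotropic-looking components.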

preprint 2022 arXiv

Acoustic-to-articulatory Inversion based on Speech Decomposition and Auxiliary Feature

Acoustic-to-articulatory inversion (AAI) aims to recover the movement of articulators from speech signals. Until now, achieving speaker-independent AAI has remained a challenge given the limited data. Besides, most current works only use audio speech as input, causing an inevitable performance bottleneck. To solve these problems, first, we pre-train a speech decomposition network to decompose audio speech into a speaker embedding and a content embedding as the new personalized speech features to adapt to the speaker-independent case. Second, to further improve AAI, we propose a novel auxiliary feature network to estimate the lip auxiliary features from the above personalized speech features. Experimental results on three public datasets show that, compared with the state-of-the-art using only the audio speech feature, the proposed method reduces the average RMSE by 0.25 and increases the average correlation coefficient by 2.0% in the speaker-dependent case. More importantly, the average RMSE decreases by 0.29 and the average correlation coefficient increases by 5.0% in the speaker-independent case.

preprint 2022 arXiv

Advanced Conditional Variational Autoencoders (A-CVAE): Towards interpreting open-domain conversation generation via disentangling latent feature representation

Currently, end-to-end deep-learning-based open-domain dialogue systems remain black-box models, making it easy for data-driven models to generate irrelevant content. Specifically, latent variables are highly entangled with different semantics in the latent space due to the lack of prior knowledge to guide the training. To address this problem, this paper proposes to harness the generative model with a priori knowledge through a cognitive approach involving mesoscopic scale feature disentanglement. In particular, the model integrates the macro-level guided-category knowledge and micro-level open-domain dialogue data for the training, injecting the prior knowledge into the latent space, which enables the model to disentangle the latent variables within the mesoscopic scale. Besides, we propose a new metric for open-domain dialogues, which can objectively evaluate the interpretability of the latent space distribution. Finally, we validate our model on different datasets and experimentally demonstrate that our model is able to generate higher quality and more interpretable dialogues than other models.

preprint 2022 arXiv

Anti-Forgery: Towards a Stealthy and Robust DeepFake Disruption Attack via Adversarial Perceptual-aware Perturbations

DeepFake is becoming a real risk to society and brings potential threats to both individual privacy and political security because DeepFaked multimedia is realistic and convincing. However, the popular DeepFake passive detection is an ex-post forensics countermeasure and fails to block the spread of disinformation in advance. To address this limitation, researchers study proactive defense techniques that add adversarial noise to the source data to disrupt the DeepFake manipulation. However, the existing studies on proactive DeepFake defense via injecting adversarial noise are not robust, and can be easily bypassed by employing simple image reconstruction, as revealed in a recent study, MagDR. In this paper, we investigate the vulnerability of the existing forgery techniques and propose a novel anti-forgery technique that helps users protect their shared facial images from attackers who are capable of applying the popular forgery techniques. Our proposed method generates perceptual-aware perturbations in an incessant manner, which is vastly different from prior studies that add sparse adversarial noise. Experimental results reveal that our perceptual-aware perturbations are robust to diverse image transformations, especially the competitive evasion technique via image reconstruction, MagDR. Our findings potentially open up a new research direction towards thorough understanding and investigation of perceptual-aware adversarial attacks for protecting facial images against DeepFakes in a proactive and robust manner. We open-source our tool to foster future research. Code is available at https://github.com/AbstractTeen/AntiForgery/.

preprint 2022 arXiv

Attentional Feature Refinement and Alignment Network for Aircraft Detection in SAR Imagery

Aircraft detection in Synthetic Aperture Radar (SAR) imagery is a challenging task in SAR Automatic Target Recognition (SAR ATR) due to aircraft's extremely discrete appearance, obvious intraclass variation, small size and serious background interference. In this paper, a single-shot detector, namely the Attentional Feature Refinement and Alignment Network (AFRAN), is proposed for detecting aircraft in SAR images with competitive accuracy and speed. Specifically, three significant components, the Attention Feature Fusion Module (AFFM), the Deformable Lateral Connection Module (DLCM) and the Anchor-guided Detection Module (ADM), are carefully designed in our method for refining and aligning informative characteristics of aircraft. To represent characteristics of aircraft with less interference, low-level textural and high-level semantic features of aircraft are fused and refined thoroughly in AFFM. The alignment between aircraft's discrete back-scattering points and convolutional sampling spots is promoted in DLCM. Eventually, the locations of aircraft are predicted precisely in ADM based on aligned features revised by refined anchors. To evaluate the performance of our method, a self-built SAR aircraft sliced dataset and a large scene SAR image are collected. Extensive quantitative and qualitative experiments with detailed analysis illustrate the effectiveness of the three proposed components. Furthermore, the topmost detection accuracy and competitive speed are achieved by our method compared with other domain-specific methods, e.g., DAPN and PADN, and general CNN-based methods, e.g., FPN, Cascade R-CNN, SSD, RefineDet and RPDet.

preprint 2022 arXiv

Constant gap between conventional strategies and those based on C*-dynamics for self-embezzlement

We consider a bipartite transformation that we call self-embezzlement and use it to prove a constant gap between the capabilities of two models of quantum information: the conventional model, where bipartite systems are represented by tensor products of Hilbert spaces; and a natural model of quantum information processing for abstract states on C*-algebras, where joint systems are represented by tensor products of C*-algebras. We call this the C*-circuit model and show that it is a special case of the commuting-operator model (in that it can be translated into such a model). For the conventional model, we show that there exists a constant $ε_0 > 0$ such that self-embezzlement cannot be achieved with precision parameter less than $ε_0$ (i.e., the fidelity cannot be greater than $1 - ε_0$); whereas, in the C*-circuit model -- as well as in a commuting-operator model -- the precision can be $0$ (i.e., fidelity $1$).

preprint 2022 arXiv

Decoupling Makes Weakly Supervised Local Feature Better

Weakly supervised learning can help local feature methods to overcome the obstacle of acquiring a large-scale dataset with densely labeled correspondences. However, since weak supervision cannot distinguish the losses caused by the detection and description steps, directly conducting weakly supervised learning within a joint describe-then-detect pipeline suffers from limited performance. In this paper, we propose a decoupled describe-then-detect pipeline tailored for weakly supervised local feature learning. Within our pipeline, the detection step is decoupled from the description step and postponed until discriminative and robust descriptors are learned. In addition, we introduce a line-to-window search strategy to explicitly use the camera pose information for better descriptor learning. Extensive experiments show that our method, namely PoSFeat (Camera Pose Supervised Feature), outperforms previous fully and weakly supervised methods and achieves state-of-the-art performance on a wide range of downstream tasks.

preprint 2022 arXiv

Efficient Video Transformers with Spatial-Temporal Token Selection

Video transformers have achieved impressive results on major video recognition benchmarks, but they suffer from high computational cost. In this paper, we present STTS, a token selection framework that dynamically selects a few informative tokens in both temporal and spatial dimensions conditioned on input video samples. Specifically, we formulate token selection as a ranking problem, which estimates the importance of each token through a lightweight scorer network, and only those with top scores are used for downstream evaluation. In the temporal dimension, we keep the frames that are most relevant to the action categories, while in the spatial dimension, we identify the most discriminative region in feature maps without affecting the spatial context used in a hierarchical way in most video transformers. Since the decision of token selection is non-differentiable, we employ a perturbed-maximum based differentiable Top-K operator for end-to-end training. We mainly conduct extensive experiments on Kinetics-400 with a recently introduced video transformer backbone, MViT. Our framework achieves similar results while requiring 20% less computation. We also demonstrate our approach is generic for different transformer architectures and video datasets. Code is available at https://github.com/wangjk666/STTS.
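The ranking view of token selection described in this abstract can be illustrated with a plain hard Top-K step. This is a minimal sketch, not the STTS implementation: the function name `select_top_k` and the example scores are hypothetical, the scores stand in for the output of the scorer network, and the perturbed-maximum differentiable operator used for training is not reproduced.

```python
def select_top_k(tokens, scores, k):
    """Keep the k highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    k = min(k, len(tokens))
    # Rank indices by descending score; sorted() is stable, so ties keep the earlier index.
    ranked = sorted(range(len(tokens)), key=lambda i: -scores[i])
    # Restore temporal (or spatial) order among the survivors.
    keep = sorted(ranked[:k])
    return [tokens[i] for i in keep]


# Hypothetical per-frame importance scores (assumed, not from the paper).
frames = ["f0", "f1", "f2", "f3", "f4"]
scores = [0.1, 0.9, 0.3, 0.8, 0.2]
selected = select_top_k(frames, scores, 2)
```

The hard selection above is non-differentiable, which is the issue the abstract says the differentiable Top-K operator is meant to address.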

preprint 2022 arXiv

Exploring Inter-Channel Correlation for Diversity-preserved Knowledge Distillation

Knowledge Distillation has shown very promising ability in transferring learned representation from the larger model (teacher) to the smaller one (student). Despite many efforts, prior methods ignore the important role of retaining inter-channel correlation of features, and thus fail to capture the intrinsic distribution of the feature space and the diversity properties of features in the teacher network. To solve the issue, we propose the novel Inter-Channel Correlation for Knowledge Distillation (ICKD), with which the diversity and homology of the feature space of the student network can align with that of the teacher network. The correlation between two channels is interpreted as diversity if they are irrelevant to each other, otherwise homology. Then the student is required to mimic the correlation within its own embedding space. In addition, we introduce the grid-level inter-channel correlation, making it capable of dense prediction tasks. Extensive experiments on two vision tasks, including ImageNet classification and Pascal VOC segmentation, demonstrate the superiority of our ICKD, which consistently outperforms many existing methods, advancing the state-of-the-art in the field of Knowledge Distillation. To our knowledge, ours is the first knowledge-distillation-based method to boost ResNet18 beyond 72% Top-1 accuracy on ImageNet classification. Code is available at: https://github.com/ADLab-AutoDrive/ICKD.
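The idea of matching inter-channel correlation between student and teacher can be sketched as follows. This is an illustrative assumption rather than the ICKD code: the correlation is taken to be the matrix of pairwise channel inner products, the loss is a mean squared difference, both networks are assumed to have the same number of channels, and the names `inter_channel_correlation` and `icc_loss` are hypothetical.

```python
def inter_channel_correlation(features):
    """features: list of C channels, each a flat list of H*W activations.

    Returns the C x C matrix of pairwise channel inner products.
    """
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in features]
            for ci in features]


def icc_loss(student, teacher):
    """Mean squared difference between the two correlation matrices.

    Assumes student and teacher have the same number of channels.
    """
    cs = inter_channel_correlation(student)
    ct = inter_channel_correlation(teacher)
    c = len(cs)
    return sum((cs[i][j] - ct[i][j]) ** 2
               for i in range(c) for j in range(c)) / (c * c)


# Toy features: two channels over two spatial positions.
teacher = [[1, 0], [0, 1]]   # channels are uncorrelated ("diversity")
student = [[1, 0], [1, 0]]   # channels are identical ("homology")
loss = icc_loss(student, teacher)
```

In the toy example the student's two channels are identical while the teacher's are orthogonal, so the off-diagonal entries of the two matrices differ and the loss is nonzero.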

preprint 2022 arXiv

Graph Signal Processing for Heterogeneous Change Detection Part I: Vertex Domain Filtering

This paper provides a new strategy for the Heterogeneous Change Detection (HCD) problem: solving HCD from the perspective of Graph Signal Processing (GSP). We construct a graph for each image to capture the structure information, and treat each image as the graph signal. In this way, we convert the HCD into a GSP problem: a comparison of the responses of the two signals on different systems defined on the two graphs, which attempts to find structural differences (Part I) and signal differences (Part II) due to the changes between heterogeneous images. In this first part, we analyze the HCD with GSP from the vertex domain. We first show that for the unchanged images, their structures are consistent, and then the outputs of the same signal on systems defined on the two graphs are similar. However, once a region has changed, the local structure of the image changes, i.e., the connectivity of the vertex containing this region changes. Then, we can compare the output signals of the same input graph signal passing through filters defined on the two graphs to detect changes. We design different filters from the vertex domain, which can flexibly explore the high-order neighborhood information hidden in original graphs. We also analyze the detrimental effects of changing regions on the change detection results from the viewpoint of signal propagation. Experiments conducted on seven real data sets show the effectiveness of the vertex domain filtering based HCD method.

preprint 2022 arXiv

Graph Signal Processing for Heterogeneous Change Detection Part II: Spectral Domain Analysis

This is the second part of the paper that provides a new strategy for the heterogeneous change detection (HCD) problem, that is, solving HCD from the perspective of graph signal processing (GSP). We construct a graph to represent the structure of each image, and treat each image as a graph signal defined on the graph. In this way, we can convert the HCD problem into a comparison of responses of signals on systems defined on the graphs. In Part I, the changes are measured by comparing the structure difference between the graphs from the vertex domain. In this Part II, we analyze the GSP for HCD from the spectral domain. We first analyze the spectral properties of the different images on the same graph, and show that their spectra exhibit commonalities and dissimilarities. Specifically, it is the change that leads to the dissimilarities of their spectra. Then, we propose a regression model for the HCD, which decomposes the source signal into the regressed signal and changed signal, and requires the regressed signal to have the same spectral property as the target signal on the same graph. With the help of graph spectral analysis, the proposed regression model is flexible and scalable. Experiments conducted on seven real data sets show the effectiveness of the proposed method.

preprint 2022 arXiv

HQANN: Efficient and Robust Similarity Search for Hybrid Queries with Structured and Unstructured Constraints

In-memory approximate nearest neighbor search (ANNS) algorithms have achieved great success for fast high-recall query processing, but are extremely inefficient when handling hybrid queries with unstructured (i.e., feature vectors) and structured (i.e., related attributes) constraints. In this paper, we present HQANN, a simple yet highly efficient hybrid query processing framework which can be easily embedded into existing proximity graph-based ANNS algorithms. We guarantee both low latency and high recall by leveraging navigation sense among attributes and fusing vector similarity search with attribute filtering. Experimental results on both public and in-house datasets demonstrate that HQANN is 10x faster than the state-of-the-art hybrid ANNS solutions at the same recall quality and that its performance is hardly affected by the complexity of attributes. It can reach 99% recall@10 in just around 50 microseconds on GLOVE-1.2M with thousands of attribute constraints.

preprint 2022 arXiv

Investigate the Essence of Long-Tailed Recognition from a Unified Perspective

As the data scale grows, deep recognition models often suffer from long-tailed data distributions due to the heavily imbalanced sample number across categories. Indeed, real-world data usually exhibit some similarity relation among different categories (e.g., pigeons and sparrows), called category similarity in this work. It is doubly difficult when the imbalance occurs between such categories with similar appearances. However, existing solutions mainly focus on the sample number to re-balance data distribution. In this work, we systematically investigate the essence of the long-tailed problem from a unified perspective. Specifically, we demonstrate that long-tailed recognition suffers from both sample number and category similarity. Intuitively, using a toy example, we first show that sample number is not the only factor behind the performance drop in long-tailed recognition. Theoretically, we demonstrate that (1) category similarity, as an inevitable factor, would also influence the model learning under long-tailed distribution via similar samples, (2) using more discriminative representation methods (e.g., self-supervised learning) for similarity reduction, the classifier bias can be further alleviated with greatly improved performance. Extensive experiments on several long-tailed datasets verify the rationality of our theoretical analysis, and show that based on existing state-of-the-arts (SOTAs), the performance could be further improved by similarity reduction. Our investigations highlight the essence behind the long-tailed problem, and suggest several feasible directions for future work.

preprint 2022 arXiv

Median Pixel Difference Convolutional Network for Robust Face Recognition

Face recognition is one of the most active tasks in computer vision and has been widely used in the real world. With great advances made in convolutional neural networks (CNNs), many face recognition algorithms have achieved high accuracy on various face datasets. However, existing face recognition algorithms based on CNNs are vulnerable to noise. Noise-corrupted image patterns could lead to false activations, significantly decreasing face recognition accuracy in noisy situations. To equip CNNs with built-in robustness to noise of different levels, we propose a Median Pixel Difference Convolutional Network (MeDiNet) by replacing some traditional convolutional layers with the proposed novel Median Pixel Difference Convolutional Layer (MeDiConv). The proposed MeDiNet integrates the idea of traditional multiscale median filtering with deep CNNs. MeDiNet is tested on four face datasets (LFW, CA-LFW, CP-LFW, and YTF) with versatile settings on blur kernels, noise intensities, scales, and JPEG quality factors. Extensive experiments show that our MeDiNet can effectively remove noisy pixels in the feature map and suppress the negative impact of noise, resulting in only limited accuracy loss under these practical noises compared with the standard CNN under clean conditions.

preprint 2022 arXiv

Noncollinear Antiferromagnetic Spintronics

Antiferromagnetic spintronics is one of the leading candidates for next-generation electronics. Among abundant antiferromagnets, noncollinear antiferromagnets are promising for achieving practical applications due to coexisting ferromagnetic and antiferromagnetic merits. In this perspective, we briefly review the recent progress in the emerging noncollinear antiferromagnetic spintronics from fundamental physics to device applications. Current challenges and future research directions for this field are also discussed.

preprint 2022 arXiv

Residual-guided Personalized Speech Synthesis based on Face Image

Previous works derive personalized speech features by training the model on a large dataset composed of the target speaker's audio. It has been reported that face information has a strong link with the speech sound. Thus, in this work, we extract personalized speech features from human faces to synthesize personalized speech using a neural vocoder. A Face-based Residual Personalized Speech Synthesis Model (FR-PSS) containing a speech encoder, a speech synthesizer and a face encoder is designed for PSS. In this model, by designing two speech priors, a residual-guided strategy is introduced to guide the face feature to approach the true speech feature during training. Moreover, considering the error of the features' absolute values and their directional bias, we formulate a novel tri-item loss function for the face encoder. Experimental results show that the speech synthesized by our model is comparable to the personalized speech synthesized by training on a large amount of audio data in previous works.

preprint 2022 arXiv

Rethinking Few-Shot Class-Incremental Learning with Open-Set Hypothesis in Hyperbolic Geometry

Few-Shot Class-Incremental Learning (FSCIL) aims at incrementally learning novel classes from a few labeled samples while avoiding overfitting and catastrophic forgetting simultaneously. The current protocol of FSCIL is built by mimicking the general class-incremental learning setting, but it is not totally appropriate due to the different data configuration, i.e., novel classes are all in the limited data regime. In this paper, we rethink the configuration of FSCIL with the open-set hypothesis by reserving the possibility in the first session for incoming categories. To give the model better performance on both closed-set and open-set recognition, a Hyperbolic Reciprocal Point Learning module (Hyper-RPL) is built on Reciprocal Point Learning (RPL) with hyperbolic neural networks. Besides, for learning novel categories from limited labeled data, we incorporate a hyperbolic metric learning (Hyper-Metric) module into the distillation-based framework to alleviate the overfitting issue and better handle the trade-off between the preservation of old knowledge and the acquisition of new knowledge. Comprehensive assessments of the proposed configuration and modules on three benchmark datasets are carried out to validate their effectiveness with respect to three evaluation indicators.

preprint 2022 arXiv

Several classes of optimal ternary cyclic codes

Cyclic codes have efficient encoding and decoding algorithms over finite fields, so that they have practical applications in communication systems, consumer electronics and data storage systems. The objective of this paper is to give eight new classes of optimal ternary cyclic codes with parameters $[3^m-1,3^m-1-2m,4]$, according to a result on the non-existence of solutions to a certain equation over $F_{3^m}$. It is worth noticing that some recent conclusions on such optimal ternary cyclic codes are some special cases of our work. More importantly, three of the nine open problems proposed by Ding and Helleseth in [8] are solved completely. In addition, another one among the nine open problems is also promoted.

preprint 2022 arXiv

SFS: Smart OS Scheduling for Serverless Functions

Serverless computing enables a new way of building and scaling cloud applications by allowing developers to write fine-grained serverless or cloud functions. The execution duration of a cloud function is typically short, ranging from a few milliseconds to hundreds of seconds. However, due to resource contentions caused by public clouds' deep consolidation, the function execution duration may get significantly prolonged and fail to accurately account for the function's true resource usage. We observe that the function duration can be highly unpredictable with huge amplification of more than 50x for an open-source FaaS platform (OpenLambda). Our experiments show that the OS scheduling policy of cloud functions' host server can have a crucial impact on performance. The default Linux scheduler, CFS (Completely Fair Scheduler), being oblivious to workloads, frequently context-switches short functions, causing a turnaround time that is much longer than their service time. We propose SFS (Smart Function Scheduler), which works entirely in the user space and carefully orchestrates existing Linux FIFO and CFS schedulers to approximate Shortest Remaining Time First (SRTF). SFS uses two-level scheduling that seamlessly combines a new FILTER policy with Linux CFS, to trade off increased duration of long functions for significant performance improvement for short functions. We implement SFS in the Linux user space and port it to OpenLambda. Evaluation results show that SFS significantly improves short functions' duration with a small impact on relatively longer functions, compared to CFS.
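The turnaround-time effect described in this abstract can be seen in a toy simulation. The sketch below is not SFS: it assumes all functions arrive at time 0, uses round-robin with a fixed quantum as a crude stand-in for fair time slicing, and does not model the FILTER policy. The function names and the workload are hypothetical.

```python
from collections import deque


def round_robin_turnaround(service_times, quantum=1):
    """Turnaround time per job under round-robin; all jobs arrive at t=0."""
    remaining = list(service_times)
    finish = [0] * len(remaining)
    queue = deque(range(len(remaining)))
    clock = 0
    while queue:
        job = queue.popleft()
        run = min(quantum, remaining[job])
        clock += run
        remaining[job] -= run
        if remaining[job] > 0:
            queue.append(job)      # not done: go to the back of the queue
        else:
            finish[job] = clock    # done: record completion time
    return finish


def srtf_turnaround(service_times):
    """With simultaneous arrivals, SRTF runs jobs shortest-first to completion."""
    finish = [0] * len(service_times)
    clock = 0
    for job in sorted(range(len(service_times)), key=lambda j: service_times[j]):
        clock += service_times[job]
        finish[job] = clock
    return finish


# One long function followed by two short ones (assumed workload).
jobs = [8, 1, 1]
rr = round_robin_turnaround(jobs)
srtf = srtf_turnaround(jobs)
```

In this workload the two short jobs finish sooner under the shortest-first order, while the long job completes at the same time in both schedules.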

preprint 2022 arXiv

WebUAV-3M: A Benchmark for Unveiling the Power of Million-Scale Deep UAV Tracking

Unmanned aerial vehicle (UAV) tracking is of great significance for a wide range of applications, such as delivery and agriculture. Previous benchmarks in this area mainly focused on small-scale tracking problems while ignoring the amounts of data, types of data modalities, diversities of target categories and scenarios, and evaluation protocols involved, greatly hiding the massive power of deep UAV tracking. In this work, we propose WebUAV-3M, the largest public UAV tracking benchmark to date, to facilitate both the development and evaluation of deep UAV trackers. WebUAV-3M contains over 3.3 million frames across 4,500 videos and offers 223 highly diverse target categories. Each video is densely annotated with bounding boxes by an efficient and scalable semiautomatic target annotation (SATA) pipeline. Importantly, to take advantage of the complementary superiority of language and audio, we enrich WebUAV-3M by innovatively providing both natural language specifications and audio descriptions. We believe that such additions will greatly boost future research in terms of exploring language features and audio cues for multimodal UAV tracking. In addition, a fine-grained UAV tracking-under-scenario constraint (UTUSC) evaluation protocol and seven challenging scenario subtest sets are constructed to enable the community to develop, adapt and evaluate various types of advanced trackers. We provide extensive evaluations and detailed analyses of 43 representative trackers and envision future research directions in the field of deep UAV tracking and beyond. The dataset, toolkits and baseline results are available at https://github.com/983632847/WebUAV-3M.

preprint2022arXiv

WiVelo: Fine-grained Walking Velocity Estimation for Wi-Fi Passive Tracking

Passive human tracking via Wi-Fi has been researched broadly in the past decade. Besides straightforward anchor point localization, velocity is another vital sign adopted by existing approaches to infer user trajectory. However, state-of-the-art Wi-Fi velocity estimation relies on Doppler-Frequency-Shift (DFS), which suffers from inevitable signal noise that incurs unbounded velocity errors, further degrading the tracking accuracy. In this paper, we present WiVelo (code and datasets are available at https://github.com/liecn/WiVelo_SECON22), which explores new spatial-temporal signal correlation features observed from different antennas to achieve accurate velocity estimation. First, we use subcarrier shift distribution (SSD) extracted from channel state information (CSI) to define two correlation features for direction and speed estimation, separately. Then, we design a mesh model calculated by the antennas' locations to enable a fine-grained velocity estimation with bounded direction error. Finally, with the continuously estimated velocity, we develop an end-to-end trajectory recovery algorithm to mitigate velocity outliers with the property of walking velocity continuity. We implement WiVelo on commodity Wi-Fi hardware and extensively evaluate its tracking accuracy in various environments. The experimental results show that our median and 90% tracking errors are 0.47 m and 1.06 m, which are half and a quarter of those of state-of-the-art approaches.

preprint2022arXiv

WL-Align: Weisfeiler-Lehman Relabeling for Aligning Users across Networks via Regularized Representation Learning

Aligning users across networks using graph representation learning has been found effective where the alignment is accomplished in a low-dimensional embedding space. Yet, achieving highly precise alignment is still challenging, especially when nodes with long-range connectivity to the labeled anchors are encountered. To alleviate this limitation, we purposefully designed WL-Align, which adopts a regularized representation learning framework to learn distinctive node representations. It extends the Weisfeiler-Lehman Isomorphism Test and learns the alignment in alternating phases of "across-network Weisfeiler-Lehman relabeling" and "proximity-preserving representation learning". The across-network Weisfeiler-Lehman relabeling is achieved through iterating the anchor-based label propagation and a similarity-based hashing to exploit the known anchors' connectivity to different nodes in an efficient and robust manner. The representation learning module preserves the second-order proximity within individual networks and is regularized by the across-network Weisfeiler-Lehman hash labels. Extensive experiments on real-world and synthetic datasets have demonstrated that our proposed WL-Align outperforms the state-of-the-art methods, achieving significant performance improvements in the "exact matching" scenario. Data and code of WL-Align are available at https://github.com/ChenPengGang/WLAlignCode.
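For readers unfamiliar with the relabeling step that WL-Align builds on, the sketch below shows the textbook 1-dimensional Weisfeiler-Lehman color refinement on a single graph. It is not the across-network, anchor-based variant described in the abstract; the function name, the toy graph, and the compression of signatures to small integers are assumptions made for illustration.

```python
def wl_relabel(adjacency, labels, rounds=2):
    """Classic 1-WL refinement: a node's new label is derived from its own
    label plus the sorted multiset of its neighbours' labels."""
    labels = dict(labels)
    for _ in range(rounds):
        signatures = {
            node: (labels[node], tuple(sorted(labels[n] for n in neighbours)))
            for node, neighbours in adjacency.items()
        }
        # Compress each distinct signature to a small integer label.
        palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
        labels = {node: palette[signatures[node]] for node in adjacency}
    return labels

# Hypothetical path graph a - b - c - d with uniform initial labels.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
colors = wl_relabel(graph, {node: 0 for node in graph})
print(colors)
```

On this path graph the two end nodes share one label and the two inner nodes share another, which is the structural distinction the refinement is meant to expose.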

preprint2021arXiv

Deep Learning for Scene Classification: A Survey

Scene classification, aiming at classifying a scene image to one of the predefined scene categories by comprehending the entire image, is a longstanding, fundamental and challenging problem in computer vision. The rise of large-scale datasets, which constitute the corresponding dense sampling of diverse real-world scenes, and the renaissance of deep learning techniques, which learn powerful feature representations directly from big raw data, have been bringing remarkable progress in the field of scene representation and classification. To help researchers master needed advances in this field, the goal of this paper is to provide a comprehensive survey of recent achievements in scene classification using deep learning. More than 200 major publications are included in this survey covering different aspects of scene classification, including challenges, benchmark datasets, taxonomy, and quantitative performance comparisons of the reviewed methods. In retrospect of what has been achieved so far, this paper is also concluded with a list of promising research opportunities.

preprint2021arXiv

How to Train Your Agent to Read and Write

Reading and writing research papers is one of the most privileged abilities that a qualified researcher should master. However, it is difficult for new researchers (e.g., students) to fully grasp this ability. It would be fascinating if we could train an intelligent agent to help people read and summarize papers, and perhaps even discover and exploit the potential knowledge clues to write novel papers. Although there have been existing works focusing on summarizing (i.e., reading) the knowledge in a given text or generating (i.e., writing) a text based on the given knowledge, the ability of simultaneously reading and writing is still under development. Typically, this requires an agent to fully understand the knowledge from the given text materials and generate correct and fluent novel paragraphs, which is very challenging in practice. In this paper, we propose a Deep ReAder-Writer (DRAW) network, which consists of a Reader that can extract knowledge graphs (KGs) from input paragraphs and discover potential knowledge, a graph-to-text Writer that generates a novel paragraph, and a Reviewer that reviews the generated paragraph from three different aspects. Extensive experiments show that our DRAW network outperforms considered baselines and several state-of-the-art methods on AGENDA and M-AGENDA datasets. Our code and supplementary material are released at https://github.com/menggehe/DRAW.

preprint2021arXiv

Semi-Supervised Active Learning for COVID-19 Lung Ultrasound Multi-symptom Classification

Ultrasound (US) is a non-invasive yet effective medical diagnostic imaging technique for the COVID-19 global pandemic. However, due to complex feature behaviors and expensive annotations of US images, it is difficult to apply Artificial Intelligence (AI)-assisted approaches to lung multi-symptom (multi-label) classification. To overcome these difficulties, we propose a novel semi-supervised Two-Stream Active Learning (TSAL) method to model complicated features and reduce labeling costs in an iterative procedure. The core component of TSAL is the multi-label learning mechanism, in which label correlation information is used to design a multi-label margin (MLM) strategy and confidence validation for automatically selecting informative samples and confident labels. On this basis, a multi-symptom multi-label (MSML) classification network is proposed to learn discriminative features of lung symptoms, and human-machine interaction is exploited to confirm the final annotations that are used to fine-tune MSML with progressively labeled data. Moreover, a novel lung US dataset named COVID19-LUSMS is built, currently containing 71 clinical patients with 6,836 images sampled from 678 videos. Experimental evaluations show that TSAL using only 20% of the data can achieve superior performance to the baseline and the state-of-the-art. Qualitatively, visualization of both the attention map and the sample distribution confirms good consistency with clinical knowledge.

preprint2020arXiv

An Investigation into the Stochasticity of Batch Whitening

Batch Normalization (BN) is extensively employed in various network architectures by performing standardization within mini-batches. A full understanding of the process has been a central target in the deep learning communities. Unlike existing works, which usually only analyze the standardization operation, this paper investigates the more general Batch Whitening (BW). Our work originates from the observation that while various whitening transformations equivalently improve the conditioning, they show significantly different behaviors in discriminative scenarios and training Generative Adversarial Networks (GANs). We attribute this phenomenon to the stochasticity that BW introduces. We quantitatively investigate the stochasticity of different whitening transformations and show that it correlates well with the optimization behaviors during training. We also investigate how stochasticity relates to the estimation of population statistics during inference. Based on our analysis, we provide a framework for designing and comparing BW algorithms in different scenarios. Our proposed BW algorithm improves the residual networks by a significant margin on ImageNet classification. Besides, we show that the stochasticity of BW can improve the GAN's performance with, however, the sacrifice of the training stability.

preprint2020arXiv

Attention-based Residual Speech Portrait Model for Speech to Face Generation

Given a speaker's speech, it is interesting to see if it is possible to generate this speaker's face. One main challenge in this task is to alleviate the natural mismatch between face and speech. To this end, in this paper, we propose a novel Attention-based Residual Speech Portrait Model (AR-SPM) by introducing the idea of the residual into a hybrid encoder-decoder architecture, where face prior features are merged with the output of the speech encoder to form the final face feature. In particular, we innovatively establish a tri-item loss function, which is a weighted linear combination of the L2-norm, L1-norm and negative cosine loss, to train our model by comparing the final face feature and the true face feature. Evaluation on the AVSpeech dataset shows that our proposed model accelerates the convergence of training, outperforms the state-of-the-art in terms of quality of the generated face, and achieves superior recognition accuracy of gender and age compared with the ground truth.
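The tri-item loss described above can be sketched in a few lines of plain Python. The abstract does not give the weights, nor whether the L2 term is squared or how the terms are averaged, so the default weights of 1.0, the unsquared L2 norm, and the function name below are assumptions for illustration only.

```python
import math

def tri_item_loss(pred, target, w_l2=1.0, w_l1=1.0, w_cos=1.0):
    """Weighted sum of L2 distance, L1 distance and negative cosine similarity.

    The weights are hypothetical; both vectors must be non-zero.
    """
    diff = [p - t for p, t in zip(pred, target)]
    l2 = math.sqrt(sum(d * d for d in diff))
    l1 = sum(abs(d) for d in diff)
    dot = sum(p * t for p, t in zip(pred, target))
    norm = math.sqrt(sum(p * p for p in pred)) * math.sqrt(sum(t * t for t in target))
    neg_cos = -dot / norm
    return w_l2 * l2 + w_l1 * l1 + w_cos * neg_cos

identical = tri_item_loss([1.0, 0.0], [1.0, 0.0])
orthogonal = tri_item_loss([0.0, 1.0], [1.0, 0.0])
print(identical, orthogonal)
```

With unit weights, identical features reach the minimum of -1 (both distance terms vanish and the cosine similarity is 1), while orthogonal unit vectors are penalized by the two distance terms alone.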

preprint2020arXiv

Auto-Encoding Twin-Bottleneck Hashing

Conventional unsupervised hashing methods usually take advantage of similarity graphs, which are either pre-computed in the high-dimensional space or obtained from random anchor points. On the one hand, existing methods uncouple the procedures of hash function learning and graph construction. On the other hand, graphs empirically built upon original data could introduce biased prior knowledge of data relevance, leading to sub-optimal retrieval performance. In this paper, we tackle the above problems by proposing an efficient and adaptive code-driven graph, which is updated by decoding in the context of an auto-encoder. Specifically, we introduce into our framework twin bottlenecks (i.e., latent variables) that exchange crucial information collaboratively. One bottleneck (i.e., binary codes) conveys the high-level intrinsic data structure captured by the code-driven graph to the other (i.e., continuous variables for low-level detail information), which in turn propagates the updated network feedback for the encoder to learn more discriminative binary codes. The auto-encoding learning objective literally rewards the code-driven graph to learn an optimal encoder. Moreover, the proposed model can be simply optimized by gradient descent without violating the binary constraints. Experiments on benchmarked datasets clearly show the superiority of our framework over the state-of-the-art hashing methods. Our source code can be found at https://github.com/ymcidence/TBH.

preprint2020arXiv

Combinatorial Laser Molecular Beam Epitaxy System Integrated with Specialized Low-temperature Scanning Tunneling Microscopy

We present a newly developed facility, comprised of a combinatorial laser molecular beam epitaxy system and an in-situ scanning tunneling microscopy (STM). This facility aims at accelerating materials research in a highly efficient way, by advanced high-throughput film synthesis techniques and subsequent fast characterization of surface morphology and electronic states. Compared with uniform films deposited by conventional methods, the so-called combinatorial thin films will be beneficial to determining the accurate phase diagrams of different materials due to the improved control of parameters such as chemical substitution and sample thickness resulting from a rotary-mask method. A specially designed STM working under low-temperature and ultra-high vacuum conditions is optimized for the characterization of combinatorial thin films, in an XY coarse motion range of 15 mm × 15 mm and with sub-micrometer location precision. The overall configuration as well as some key aspects like sample holder design, scanner head, and sample/tip/target transfer mechanism are described in detail. The performance of the device is demonstrated by synthesizing high-quality superconducting FeSe thin films with gradient thickness, and by imaging surfaces of highly oriented pyrolytic graphite, Au (111), Bi2Sr2CaCu2O8+δ (BSCCO) and FeSe. In addition, we have also obtained clean noise spectra of tunneling junctions and the superconducting energy gap of BSCCO. The successful manufacturing of such a facility opens a new window for the next generation of equipment designed for experimental materials research.

preprint2020arXiv

Controllable Orthogonalization in Training DNNs

Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1 and reduce redundancy in representation. This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI), to learn a layer-wise orthogonal weight matrix in DNNs. ONI works by iteratively stretching the singular values of a weight matrix towards 1. This property enables it to control the orthogonality of a weight matrix by its number of iterations. We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction. We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization (SN), and further outperforms SN by providing controllable orthogonality.
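The abstract describes ONI only at a high level. As a rough sketch, a Newton-Schulz style iteration of the kind commonly used for this purpose can be written in plain Python; the Frobenius-norm scaling, the exact update rule, the helper names, and the example matrix are assumptions for illustration rather than confirmed details of the authors' method.

```python
import math

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def newton_orthogonalize(weight, iterations):
    """Push the singular values of a full-row-rank matrix towards 1.

    V = W / ||W||_F, S = V V^T, B_0 = I, B_{t+1} = 1.5 B_t - 0.5 B_t^3 S,
    and the result is B_T V. More iterations give a more orthogonal result.
    """
    fro = math.sqrt(sum(x * x for row in weight for x in row))
    v = [[x / fro for x in row] for row in weight]
    s = matmul(v, transpose(v))
    n = len(weight)
    b = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iterations):
        b3s = matmul(matmul(matmul(b, b), b), s)
        b = [[1.5 * b[i][j] - 0.5 * b3s[i][j] for j in range(n)] for i in range(n)]
    return matmul(b, v)

# Hypothetical 2x3 weight matrix (rows <= columns, full row rank).
weight = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]
tight = newton_orthogonalize(weight, iterations=12)
gram = matmul(tight, transpose(tight))  # close to the 2x2 identity
print(gram)
```

Stopping after fewer iterations leaves the singular values short of 1, which is one way to read the abstract's point that the number of iterations controls the degree of orthogonality.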

preprint2020arXiv

Data Dissemination Using Interest Tree in Socially Aware Networking

Socially aware networking (SAN) exploits social characteristics of mobile users to streamline data dissemination protocols in opportunistic environments. Existing protocols in this area utilized various social features such as user interests, social similarity, and community structure to improve the performance of data dissemination. However, the interrelationship between user interests and its impact on the efficiency of data dissemination has not been explored sufficiently. In this paper, we analyze various kinds of relationships between user interests and model them using a layer-based structure in order to form social communities in the SAN paradigm. We propose Int-Tree, an Interest-Tree based scheme which uses the relationship between user interests to improve the performance of data dissemination. The core of Int-Tree is the interest-tree, a tree-based community structure that combines two social features, i.e., density of a community and social tie, to support data dissemination. The simulation results show that Int-Tree achieves a higher delivery ratio and lower overhead in comparison to two benchmark protocols, PROPHET and Epidemic routing. In addition, Int-Tree can perform with 1.36 hop counts on average, and tolerable latency in terms of buffer size, time to live (TTL) and simulation duration. Finally, Int-Tree keeps stable performance with various parameters.

preprint2020arXiv

Deep Learning for 3D Point Clouds: A Survey

Point cloud learning has lately attracted increasing attention due to its wide applications in many areas, such as computer vision, autonomous driving, and robotics. As a dominating technique in AI, deep learning has been successfully used to solve various 2D vision problems. However, deep learning on point clouds is still in its infancy due to the unique challenges faced by the processing of point clouds with deep neural networks. Recently, deep learning on point clouds has been thriving, with numerous methods being proposed to address different problems in this area. To stimulate future research, this paper presents a comprehensive review of recent progress in deep learning methods for point clouds. It covers three major tasks, including 3D shape classification, 3D object detection and tracking, and 3D point cloud segmentation. It also presents comparative results on several publicly available datasets, together with insightful observations and inspiring future research directions.

preprint2020arXiv

Deep Video Super-Resolution using HR Optical Flow Estimation

Video super-resolution (SR) aims at generating a sequence of high-resolution (HR) frames with plausible and temporally consistent details from their low-resolution (LR) counterparts. The key challenge for video SR lies in the effective exploitation of temporal dependency between consecutive frames. Existing deep learning based methods commonly estimate optical flows between LR frames to provide temporal dependency. However, the resolution conflict between LR optical flows and HR outputs hinders the recovery of fine details. In this paper, we propose an end-to-end video SR network to super-resolve both optical flows and images. Optical flow SR from LR frames provides accurate temporal dependency and ultimately improves video SR performance. Specifically, we first propose an optical flow reconstruction network (OFRnet) to infer HR optical flows in a coarse-to-fine manner. Then, motion compensation is performed using HR optical flows to encode temporal dependency. Finally, compensated LR inputs are fed to a super-resolution network (SRnet) to generate SR results. Extensive experiments have been conducted to demonstrate the effectiveness of HR optical flows for SR performance improvement. Comparative results on the Vid4 and DAVIS-10 datasets show that our network achieves the state-of-the-art performance.

preprint2020arXiv

Dynamic Group Convolution for Accelerating Convolutional Neural Networks

Replacing normal convolutions with group convolutions can significantly increase the computational efficiency of modern deep convolutional networks, which has been widely adopted in compact network architecture designs. However, existing group convolutions undermine the original network structures by cutting off some connections permanently resulting in significant accuracy degradation. In this paper, we propose dynamic group convolution (DGC) that adaptively selects which part of input channels to be connected within each group for individual samples on the fly. Specifically, we equip each group with a small feature selector to automatically select the most important input channels conditioned on the input images. Multiple groups can adaptively capture abundant and complementary visual/semantic features for each input image. The DGC preserves the original network structure and has similar computational efficiency as the conventional group convolution simultaneously. Extensive experiments on multiple image classification benchmarks including CIFAR-10, CIFAR-100 and ImageNet demonstrate its superiority over the existing group convolution techniques and dynamic execution methods. The code is available at https://github.com/zhuogege1943/dgc.

preprint2020arXiv

FTBNN: Rethinking Non-linearity for 1-bit CNNs and Going Beyond

Binary neural networks (BNNs), where both weights and activations are binarized into 1 bit, have been widely studied in recent years due to their great benefits of highly accelerated computation and a substantially reduced memory footprint, which appeal to the development of resource-constrained devices. In contrast to previous methods tending to reduce the quantization error for training BNN structures, we argue that the binarized convolution process owns an increasing linearity towards the target of minimizing such error, which in turn hampers BNN's discriminative ability. In this paper, we re-investigate and tune proper non-linear modules to fix that contradiction, leading to a strong baseline which achieves state-of-the-art performance on the large-scale ImageNet dataset in terms of accuracy and training efficiency. To go further, we find that the proposed BNN model still has much potential to be compressed by making better use of the efficient binary operations, without losing accuracy. In addition, the limited capacity of the BNN model can also be increased with the help of group execution. Based on these insights, we are able to improve the baseline with an additional 4-5% top-1 accuracy gain even with less computational cost. Our code will be made public at https://github.com/zhuogege1943/ftbnn.

preprint2020arXiv

Improved Residual Networks for Image and Video Recognition

Residual networks (ResNets) represent a powerful type of convolutional neural network (CNN) architecture, widely adopted and used in various tasks. In this work we propose an improved version of ResNets. Our proposed improvements address all three main components of a ResNet: the flow of information through the network layers, the residual building block, and the projection shortcut. We are able to show consistent improvements in accuracy and learning convergence over the baseline. For instance, on ImageNet dataset, using the ResNet with 50 layers, for top-1 accuracy we can report a 1.19% improvement over the baseline in one setting and around 2% boost in another. Importantly, these improvements are obtained without increasing the model complexity. Our proposed approach allows us to train extremely deep networks, while the baseline shows severe optimization issues. We report results on three tasks over six datasets: image classification (ImageNet, CIFAR-10 and CIFAR-100), object detection (COCO) and video action recognition (Kinetics-400 and Something-Something-v2). In the deep learning era, we establish a new milestone for the depth of a CNN. We successfully train a 404-layer deep CNN on the ImageNet dataset and a 3002-layer network on CIFAR-10 and CIFAR-100, while the baseline is not able to converge at such extreme depths. Code is available at: https://github.com/iduta/iresnet

preprint2020arXiv

JGR-P2O: Joint Graph Reasoning based Pixel-to-Offset Prediction Network for 3D Hand Pose Estimation from a Single Depth Image

State-of-the-art single depth image-based 3D hand pose estimation methods are based on dense predictions, including voxel-to-voxel predictions, point-to-point regression, and pixel-wise estimations. Despite the good performance, these methods have a few inherent issues, such as the poor trade-off between accuracy and efficiency, and plain feature representation learning with local convolutions. In this paper, a novel pixel-wise prediction-based method is proposed to address the above issues. The key ideas are two-fold: a) explicitly modeling the dependencies among joints and the relations between the pixels and the joints for better local feature representation learning; b) unifying the dense pixel-wise offset predictions and direct joint regression for end-to-end training. Specifically, we first propose a graph convolutional network (GCN) based joint graph reasoning module to model the complex dependencies among joints and augment the representation capability of each pixel. Then we densely estimate all pixels' offsets to joints in both image plane and depth space and calculate the joints' positions by a weighted average over all pixels' predictions, totally discarding the complex postprocessing operations. The proposed model is implemented with an efficient 2D fully convolutional network (FCN) backbone and has only about 1.4M parameters. Extensive experiments on multiple 3D hand pose estimation benchmarks demonstrate that the proposed method achieves new state-of-the-art accuracy while running very efficiently at around 110 fps on a single NVIDIA 1080Ti GPU.
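The aggregation step, in which every pixel votes for a joint position and the votes are combined by a weighted average, can be sketched as follows. How the offsets and weights are predicted and normalized is not stated in the abstract, so the function name, the tuple layout (u, v, depth), and the example numbers are assumptions for illustration.

```python
def joint_from_pixel_votes(pixels, offsets, weights):
    """Weighted average of per-pixel votes (pixel position + predicted offset).

    pixels and offsets are (u, v, depth) triples; weights must not sum to zero.
    """
    total = sum(weights)
    joint = [0.0, 0.0, 0.0]
    for position, offset, weight in zip(pixels, offsets, weights):
        for axis in range(3):
            joint[axis] += weight * (position[axis] + offset[axis]) / total
    return joint

# Two hypothetical pixels voting for the same joint with different confidence.
pixels = [(0.0, 0.0, 1.0), (2.0, 0.0, 1.0)]
offsets = [(1.0, 1.0, 0.0), (-1.0, 1.0, 0.5)]
weights = [3.0, 1.0]
joint = joint_from_pixel_votes(pixels, offsets, weights)
print(joint)
```

Because the estimate is a differentiable average rather than a post-hoc clustering or argmax, it fits the abstract's point that dense offset prediction and direct joint regression can be trained end to end.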

preprint2020arXiv

Layer-wise Conditioning Analysis in Exploring the Learning Dynamics of DNNs

Conditioning analysis uncovers the landscape of an optimization objective by exploring the spectrum of its curvature matrix. This has been well explored theoretically for linear models. We extend this analysis to deep neural networks (DNNs) in order to investigate their learning dynamics. To this end, we propose layer-wise conditioning analysis, which explores the optimization landscape with respect to each layer independently. Such an analysis is theoretically supported under mild assumptions that approximately hold in practice. Based on our analysis, we show that batch normalization (BN) can stabilize the training, but sometimes result in the false impression of a local minimum, which has detrimental effects on the learning. Besides, we experimentally observe that BN can improve the layer-wise conditioning of the optimization problem. Finally, we find that the last linear layer of a very deep residual network displays ill-conditioned behavior. We solve this problem by only adding one BN layer before the last linear layer, which achieves improved performance over the original and pre-activation residual networks.

preprint2020arXiv

Learning to Respond with Stickers: A Framework of Unifying Multi-Modality in Multi-Turn Dialog

Stickers with vivid and engaging expressions are becoming increasingly popular in online messaging apps, and some works are dedicated to automatically selecting a sticker response by matching text labels of stickers with previous utterances. However, due to their large quantities, it is impractical to require text labels for all stickers. Hence, in this paper, we propose to recommend an appropriate sticker to the user based on multi-turn dialog context history without any external labels. Two main challenges are confronted in this task. One is to learn the semantic meaning of stickers without corresponding text labels. Another challenge is to jointly model the candidate sticker with the multi-turn dialog context. To tackle these challenges, we propose a sticker response selector (SRS) model. Specifically, SRS first employs a convolution-based sticker image encoder and a self-attention based multi-turn dialog encoder to obtain the representation of stickers and utterances. Next, a deep interaction network is proposed to conduct deep matching between the sticker and each utterance in the dialog history. SRS then learns the short-term and long-term dependency between all interaction results by a fusion network to output the final matching score. To evaluate our proposed method, we collect a large-scale real-world dialog dataset with stickers from one of the most popular online chatting platforms. Extensive experiments conducted on this dataset show that our model achieves the state-of-the-art performance for all commonly-used metrics. Experiments also verify the effectiveness of each component of SRS. To facilitate further research in the sticker selection field, we release this dataset of 340K multi-turn dialog and sticker pairs.

preprint2020arXiv

NLH: A Blind Pixel-level Non-local Method for Real-world Image Denoising

Non-local self similarity (NSS) is a powerful prior of natural images for image denoising. Most existing denoising methods employ similar patches, which is a patch-level NSS prior. In this paper, we take one step forward by introducing a pixel-level NSS prior, i.e., searching similar pixels across a non-local region. This is motivated by the fact that finding closely similar pixels is more feasible than finding similar patches in natural images, which can be used to enhance image denoising performance. With the introduced pixel-level NSS prior, we propose an accurate noise level estimation method, and then develop a blind image denoising method based on the lifting Haar transform and Wiener filtering techniques. Experiments on benchmark datasets demonstrate that the proposed method achieves much better performance than previous non-deep methods, and is still competitive with existing state-of-the-art deep learning based methods on real-world image denoising. The code is publicly available at https://github.com/njusthyk1972/NLH.
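For context, one level of the Haar transform in its lifting form (a predict step followed by an update step) looks like the sketch below. This is the generic, unnormalized textbook construction on a 1D signal; how NLH applies it to groups of similar pixels and combines it with Wiener filtering is not shown, and the function names and sample signal are assumptions.

```python
def haar_lift_forward(signal):
    """One level of the unnormalized lifting Haar transform (even-length input)."""
    evens, odds = signal[0::2], signal[1::2]
    details = [o - e for e, o in zip(evens, odds)]          # predict step
    averages = [e + d / 2 for e, d in zip(evens, details)]  # update step
    return averages, details

def haar_lift_inverse(averages, details):
    """Undo the update step, then the predict step, and interleave the samples."""
    evens = [a - d / 2 for a, d in zip(averages, details)]
    odds = [d + e for e, d in zip(evens, details)]
    signal = []
    for e, o in zip(evens, odds):
        signal.extend([e, o])
    return signal

averages, details = haar_lift_forward([4.0, 6.0, 10.0, 12.0])
restored = haar_lift_inverse(averages, details)
print(averages, details, restored)
```

The lifting form is exactly invertible by running the two steps backwards, so any shrinkage applied to the detail coefficients is the only place where information is changed.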

preprint2020arXiv

Nonconvex Nonsmooth Low-Rank Minimization for Generalized Image Compressed Sensing via Group Sparse Representation

Group sparse representation (GSR) based methods have led to great successes in various image recovery tasks, which can be converted into a low-rank matrix minimization problem. As a widely used surrogate function of low-rank, the nuclear norm based convex surrogate usually leads to an over-shrinking problem, since the standard soft-thresholding operator shrinks all singular values equally. To improve traditional sparse representation based image compressive sensing (CS) performance, we propose a generalized CS framework based on the GSR model, which leads to a nonconvex nonsmooth low-rank minimization problem. The popular L2-norm and M-estimator are employed for the standard image CS and robust CS problems, respectively, to fit the data. For a better approximation of the rank of the group matrix, a family of nuclear norms is employed to address the over-shrinking problem. Moreover, we also propose a flexible and effective iteratively-weighting strategy to control the weighting and contribution of each singular value. Then we develop an iteratively reweighted nuclear norm algorithm for our generalized framework via an alternating direction method of multipliers framework, namely, GSR-AIR. Experimental results demonstrate that our proposed CS framework can achieve favorable reconstruction performance compared with current state-of-the-art methods and that the robust CS framework can suppress the outliers effectively.
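The over-shrinking issue can be seen on a plain list of singular values. The sketch below contrasts uniform soft-thresholding with a reweighted variant whose threshold is inversely proportional to each singular value; that weight rule is a common choice in the reweighting literature and is assumed here for illustration, not taken from GSR-AIR. A real pipeline would obtain the singular values from an SVD of a patch-group matrix.

```python
def soft_threshold(singular_values, tau):
    """Uniform shrinkage: every singular value is reduced by the same tau."""
    return [max(s - tau, 0.0) for s in singular_values]

def weighted_soft_threshold(singular_values, tau, eps=1e-8):
    """Reweighted shrinkage with hypothetical weights 1 / (s + eps):
    large singular values are shrunk less, small ones more."""
    return [max(s - tau / (s + eps), 0.0) for s in singular_values]

# Hypothetical singular values of a patch-group matrix.
sigma = [10.0, 2.0, 0.5]
uniform = soft_threshold(sigma, tau=1.0)
weighted = weighted_soft_threshold(sigma, tau=1.0)
print(uniform)
print(weighted)
```

Here the dominant singular value loses a full unit under uniform shrinkage but only about a tenth of that under the reweighted rule, while the smallest value is suppressed in both cases.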

preprint2020arXiv

PIS: A Multi-dimensional Routing Protocol for Socially-aware Networking

Socially-aware networking is an emerging paradigm for intermittently connected networks consisting of mobile users with social relationships and characteristics. In this setting, humans are the main carriers of mobile devices. Hence, their connections, social features, and behaviors can be exploited to improve the performance of data forwarding protocols. In this paper, we first explore the impact of three social features, namely physical proximity, user interests, and social relationship on users' daily routines. Then, we propose a multi-dimensional routing protocol called Proximity-Interest-Social (PIS) protocol in which the three different social dimensions are integrated into a unified distance function in order to select optimal intermediate data carriers. PIS protocol utilizes a time slot management mechanism to discover users' movement similarities in different time periods during a day. We compare the performance of PIS to Epidemic, PROPHET, and SimBet routing protocols using SIGCOMM09 and INFOCOM06 data sets. The experiment results show that PIS outperforms other benchmark routing protocols with the highest data delivery ratio with a low communication overhead.

preprint2020arXiv

Predicting Long-Term Skeletal Motions by a Spatio-Temporal Hierarchical Recurrent Network

The primary goal of skeletal motion prediction is to generate future motion by observing a sequence of 3D skeletons. A key challenge in motion prediction is the fact that a motion can often be performed in several different ways, with each consisting of its own configuration of poses and their spatio-temporal dependencies, and as a result, the predicted poses often converge to the motionless poses or non-human like motions in long-term prediction. This leads us to define a hierarchical recurrent network model that explicitly characterizes these internal configurations of poses and their local and global spatio-temporal dependencies. The model introduces a latent vector variable from the Lie algebra to represent spatial and temporal relations simultaneously. Furthermore, a structured stack LSTM-based decoder is devised to decode the predicted poses with a new loss function defined to estimate the quantized weight of each body part in a pose. Empirical evaluations on benchmark datasets suggest our approach significantly outperforms the state-of-the-art methods on both short-term and long-term motion prediction.

preprint2020arXiv

Pyramidal Convolution: Rethinking Convolutional Neural Networks for Visual Recognition

This work introduces pyramidal convolution (PyConv), which is capable of processing the input at multiple filter scales. PyConv contains a pyramid of kernels, where each level involves different types of filters with varying size and depth, which are able to capture different levels of detail in the scene. On top of these improved recognition capabilities, PyConv is also efficient and, with our formulation, it does not increase the computational cost or the number of parameters compared to standard convolution. Moreover, it is very flexible and extensible, providing a large space of potential network architectures for different applications. PyConv has the potential to impact nearly every computer vision task and, in this work, we present different architectures based on PyConv for four main visual recognition tasks: image classification, video action classification/recognition, object detection and semantic image segmentation/parsing. Our approach shows significant improvements over the baselines on all these core tasks. For instance, on image recognition, our 50-layer network outperforms its 152-layer ResNet baseline counterpart in recognition performance on the ImageNet dataset, while having 2.39 times fewer parameters, 2.52 times lower computational complexity and more than 3 times fewer layers. On image segmentation, our novel framework sets a new state-of-the-art on the challenging ADE20K benchmark for scene parsing. Code is available at: https://github.com/iduta/pyconv

preprint2020arXiv

Re-synchronization using the Hand Preceding Model for Multi-modal Fusion in Automatic Continuous Cued Speech Recognition

Cued Speech (CS) augments lip reading with hand coding and is very helpful to deaf people. Automatic CS recognition can help communication between deaf people and others. Because lip and hand movements are asynchronous, fusing them in automatic CS recognition is a challenging problem. In this work, we propose a novel re-synchronization procedure for multi-modal fusion, which aligns the hand features with the lip features. It is realized by delaying hand position and hand shape by their optimal hand preceding times, which are derived by investigating the temporal organization of hand position and hand shape movements in CS. This re-synchronization procedure is incorporated into a practical continuous CS recognition system that combines a convolutional neural network (CNN) with a multi-stream hidden Markov model (MSHMM). A significant improvement of about 4.6% is achieved, reaching 76.6% CS phoneme recognition correctness compared with 72.04% for the state-of-the-art architecture, which did not take into account the asynchrony of multi-modal fusion in CS. To our knowledge, this is the first work to tackle asynchronous multi-modal fusion in automatic continuous CS recognition.
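
The re-synchronization idea in this abstract, delaying the hand streams so they line up with the lip stream, can be illustrated with a minimal sketch. The function name `delay_stream`, the frame labels, and the delay of two frames are all hypothetical; the paper derives its own optimal hand preceding times, which are not given here.

```python
def delay_stream(frames, delay):
    """Delay a feature stream by `delay` frames, keeping its length.

    The first frame is repeated to fill the gap at the start
    (an assumption of this sketch, not a detail from the paper).
    """
    if not frames or delay <= 0:
        return list(frames)
    padded = [frames[0]] * delay + list(frames)
    return padded[:len(frames)]


# Hypothetical per-frame features: the hand leads the lips by two frames.
lip_frames = ["l0", "l1", "l2", "l3", "l4"]
hand_frames = ["h0", "h1", "h2", "h3", "h4"]

aligned_hand = delay_stream(hand_frames, delay=2)

# After the delay, frame i of the lips is paired with frame i - 2 of the hand.
fused = list(zip(lip_frames, aligned_hand))
print(fused)
```

In a real system, hand position and hand shape would each get their own delay before the fused features are passed to the recognizer.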

preprint2020arXiv

Ro-SOS: Metric Expression Network (MEnet) for Robust Salient Object Segmentation

Although deep CNNs have brought significant improvement to image saliency detection, most CNN based models are sensitive to distortion such as compression and noise. In this paper, we propose an end-to-end generic salient object segmentation model called Metric Expression Network (MEnet) to deal with saliency detection with the tolerance of distortion. Within MEnet, a new topological metric space is constructed, whose implicit metric is determined by the deep network. As a result, we manage to group all the pixels in the observed image semantically within this latent space into two regions: a salient region and a non-salient region. With this architecture, all feature extractions are carried out at the pixel level, enabling fine granularity of output boundaries of the salient objects. What's more, we try to give a general analysis for the noise robustness of the network in the sense of Lipschitz and Jacobian literature. Experiments demonstrate that robust salient maps facilitating object segmentation can be generated by the proposed metric. Tests on several public benchmarks show that MEnet has achieved desirable performance. Furthermore, by direct computation and measuring the robustness, the proposed method outperforms previous CNN-based methods on distorted inputs.

preprint2020arXiv

Self-Supervised Joint Learning Framework of Depth Estimation via Implicit Cues

In self-supervised monocular depth estimation, depth discontinuities and artifacts from moving objects remain challenging problems. Existing self-supervised methods usually utilize a single view to train the depth estimation network. Compared with static views, the abundant dynamic properties between video frames are beneficial to refined depth estimation, especially for dynamic objects. In this work, we propose a novel self-supervised joint learning framework for depth estimation using consecutive frames from monocular and stereo videos. The main idea is to use an implicit depth cue extractor which leverages dynamic and static cues to generate useful depth proposals. These cues can predict distinguishable motion contours and geometric scene structures. Furthermore, a new high-dimensional attention module is introduced to extract a clear global transformation, which effectively suppresses the uncertainty of local descriptors in high-dimensional space, resulting in a more reliable optimization in the learning framework. Experiments demonstrate that the proposed framework outperforms the state of the art (SOTA) on the KITTI and Make3D datasets.

preprint2020arXiv

STAR: A Structure and Texture Aware Retinex Model

Retinex theory is developed mainly to decompose an image into illumination and reflectance components by analyzing local image derivatives. In this theory, larger derivatives are attributed to changes in reflectance, while smaller derivatives arise in the smooth illumination. In this paper, we utilize exponentiated local derivatives (with an exponent γ) of an observed image to generate its structure map and texture map. The structure map is produced by amplifying the derivatives with γ > 1, while the texture map is generated by shrinking them with γ < 1. To this end, we design exponential filters for the local derivatives and demonstrate their capability to extract accurate structure and texture maps, as influenced by the choice of exponent γ. The extracted structure and texture maps are employed to regularize the illumination and reflectance components in Retinex decomposition. A novel Structure and Texture Aware Retinex (STAR) model is further proposed for illumination and reflectance decomposition of a single image. We solve the STAR model by an alternating optimization algorithm. Each sub-problem is transformed into a vectorized least squares regression with a closed-form solution. Comprehensive experiments on commonly tested datasets demonstrate that the proposed STAR model produces better quantitative and qualitative performance than previous competing methods on illumination and reflectance decomposition, low-light image enhancement, and color correction. The code is publicly available at https://github.com/csjunxu/STAR.
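
As a rough illustration of exponentiated local derivatives, the sketch below raises the absolute forward differences of a 1D signal to a power γ. It is not the authors' filter: the function name `exponentiated_derivatives`, the 1D signal, and the values γ = 2 and γ = 0.5 are assumptions chosen only to show how the exponent changes the contrast between large and small derivatives.

```python
def exponentiated_derivatives(signal, gamma):
    """Return |forward difference| ** gamma for each adjacent pair of samples."""
    diffs = [abs(b - a) for a, b in zip(signal, signal[1:])]
    return [d ** gamma for d in diffs]


# Hypothetical 1D intensity profile with one large and one small step.
signal = [0.0, 0.5, 0.5, 0.75]

structure = exponentiated_derivatives(signal, gamma=2.0)  # exponent above 1
texture = exponentiated_derivatives(signal, gamma=0.5)    # exponent below 1

# Ratio between the large step and the small step under each exponent.
print(structure[0] / structure[2], texture[0] / texture[2])
```

On this toy signal the raw derivatives differ by a factor of 2; the exponent above 1 widens that gap and the exponent below 1 narrows it, which is the contrast the abstract relies on to separate structure from texture.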

preprint2020arXiv

Temporal Self-Ensembling Teacher for Semi-Supervised Object Detection

This paper focuses on Semi-Supervised Object Detection (SSOD). Knowledge Distillation (KD) has been widely used for semi-supervised image classification. However, adapting these methods for SSOD faces the following obstacles. (1) The teacher model serves a dual role as a teacher and a student, such that the teacher predictions on unlabeled images may be very close to those of the student, which limits the upper bound of the student. (2) The class imbalance issue in SSOD hinders an efficient knowledge transfer from teacher to student. To address these problems, we propose a novel method, Temporal Self-Ensembling Teacher (TSE-T), for SSOD. Unlike previous KD-based methods, we devise a temporally evolved teacher model. First, our teacher model ensembles its temporal predictions for unlabeled images under stochastic perturbations. Second, our teacher model ensembles its temporal model weights with the student model weights by an exponential moving average (EMA), which allows the teacher to gradually learn from the student. These self-ensembling strategies increase data and model diversity, thus improving teacher predictions on unlabeled images. Finally, we use focal loss to formulate a consistency regularization term to handle the data imbalance problem, which is a more efficient way to utilize the useful information from unlabeled images than a simple hard-thresholding method that solely preserves confident predictions. Evaluated on the widely used VOC and COCO benchmarks, our method achieves an mAP of 80.73% and 40.52% on the VOC2007 test set and the COCO2014 minval5k set respectively, outperforming a strong fully-supervised detector by 2.37% and 1.49%. Furthermore, our method sets the new state-of-the-art in SSOD on the VOC2007 test set, outperforming the baseline SSOD method by 1.44%. The source code of this work is publicly available at http://github.com/syangdong/tse-t.
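
The EMA weight ensembling mentioned in this abstract can be sketched in a few lines. This is a generic EMA update under stated assumptions, not the TSE-T implementation: the function name `ema_update`, the dictionary-of-floats weight format, and the decay value are hypothetical.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights: decay * teacher + (1 - decay) * student."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Hypothetical scalar "weights" standing in for full parameter tensors.
teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}

# Three updates with a deliberately small decay so the drift is visible.
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)

print(teacher)
```

With a decay close to 1, as is typical for this kind of update, the teacher changes slowly and acts as a smoothed history of the student rather than a copy of it.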

preprint2020arXiv

The First Round Result from the TianQin-1 Satellite

The TianQin-1 satellite (TQ-1), which is the first technology demonstration satellite for the TianQin project, was launched on 20 December 2019. The first round of experiments was carried out from 21 December 2019 until 1 April 2020. The residual acceleration of the satellite is found to be about $1\times10^{-10}~{\rm m}/{\rm s}^{2}/{\rm Hz}^{1/2}$ at $0.1~{\rm Hz}\,$ and about $5\times10^{-11}~{\rm m}/{\rm s}^{2}/{\rm Hz}^{1/2}$ at $0.05~{\rm Hz}\,$, measured by an inertial sensor with a sensitivity of $5\times10^{-12}~{\rm m}/{\rm s}^{2}/{\rm Hz}^{1/2}$ at $0.1~{\rm Hz}\,$. The micro-Newton thrusters have demonstrated a thrust resolution of $0.1~\mu{\rm N}$ and a thrust noise of $0.3~\mu{\rm N}/{\rm Hz}^{1/2}$ at $0.1~{\rm Hz}$. The residual noise of the satellite with drag-free control is $3\times10^{-9}~{\rm m}/{\rm s}^{2}/{\rm Hz}^{1/2}$ at $0.1~{\rm Hz}\,$. The noise level of the optical readout system is about $30~{\rm pm}/{\rm Hz}^{1/2}$ at $0.1~{\rm Hz}\,$. The temperature stability at the temperature monitoring position is controlled to be about $\pm3~{\rm mK}$ per orbit, and the mismatch between the center-of-mass of the satellite and that of the test mass is measured with a precision of better than $0.1~{\rm mm}$.

preprint2020arXiv

The TianQin project: current progress on science and technology

TianQin is a planned space-based gravitational wave (GW) observatory consisting of three Earth-orbiting satellites with an orbital radius of about $10^5~{\rm km}$. The satellites will form an equilateral triangle constellation, the plane of which is nearly perpendicular to the ecliptic plane. TianQin aims to detect GWs between $10^{-4}~{\rm Hz}$ and $1~{\rm Hz}$ that can be generated by a wide variety of important astrophysical and cosmological sources, including the inspiral of Galactic ultra-compact binaries, the inspiral of stellar-mass black hole binaries, extreme mass ratio inspirals, the merger of massive black hole binaries, and possibly the energetic processes in the very early universe or exotic sources such as cosmic strings. In order to start science operations around 2035, a roadmap called the 0123 plan is being used to bring the key technologies of TianQin to maturity, supported by the construction of a series of research facilities on the ground. Two major projects of the 0123 plan are being carried out. In this process, the team has created a new-generation $17~{\rm cm}$ single-body hollow corner-cube retro-reflector, which was launched with the QueQiao satellite on 21 May 2018; a new laser ranging station equipped with a $1.2~{\rm m}$ telescope has been constructed, and the station has successfully ranged to all five retro-reflectors on the Moon; and the TianQin-1 experimental satellite was launched on 20 December 2019, with the first round result showing that the satellite has exceeded all of its mission requirements.

preprint2020arXiv

User Popularity-based Packet Scheduling for Congestion Control in Ad-hoc Social Networks

Traditional ad-hoc network packet scheduling schemes cannot fulfill the requirements of proximity-based ad-hoc social networks (ASNETs) and they do not behave properly in congested environments. To address this issue, we propose a user popularity-based packet scheduling scheme for congestion control in ASNETs called Pop-aware. The proposed algorithm exploits social popularity of sender nodes to prioritize all incoming flows. Pop-aware also provides fairness of service received by each flow. We evaluate the performance of Pop-aware through a series of simulations. In comparison with some existing scheduling algorithms, Pop-aware performs better in terms of control overhead, total overhead, average throughput, packet loss rate, packet delivery rate and average delay.

preprint2019arXiv

Practical quantum key distribution with non-phase-randomized coherent states

Quantum key distribution (QKD) based on coherent states is well known for its implementation simplicity, but it suffers from loss-dependent attacks based on optimal unambiguous state discrimination. Crucially, previous research has suggested that coherent-state QKD is limited to short distances, typically below 100 km assuming standard optical fiber loss and system parameters. In this work, we propose a six-coherent-state phase-encoding QKD protocol that is able to tolerate a total loss of up to 38 dB assuming realistic system parameters, and up to 56 dB assuming zero noise. The security of the protocol is calculated using a recently developed security proof technique based on semi-definite programming, which assumes only the inner-product information of the encoded coherent states, the expected statistics, and that the measurement is basis-independent. Our results thus suggest that coherent-state QKD could be a promising candidate for high-speed provably-secure QKD.

preprint2019arXiv

RANet: Ranking Attention Network for Fast Video Object Segmentation

Although online learning (OL) techniques have boosted the performance of semi-supervised video object segmentation (VOS) methods, the huge time cost of OL greatly restricts their practicality. Matching-based and propagation-based methods run faster by avoiding OL techniques. However, they are limited by sub-optimal accuracy, due to mismatching and drifting problems. In this paper, we develop a real-time yet very accurate Ranking Attention Network (RANet) for VOS. Specifically, to integrate the insights of matching-based and propagation-based methods, we employ an encoder-decoder framework to learn pixel-level similarity and segmentation in an end-to-end manner. To better utilize the similarity maps, we propose a novel ranking attention module, which automatically ranks and selects these maps for fine-grained VOS performance. Experiments on the DAVIS-16 and DAVIS-17 datasets show that our RANet achieves the best speed-accuracy trade-off, e.g., 33 milliseconds per frame and J&F=85.5% on DAVIS-16. With OL, our RANet reaches J&F=87.1% on DAVIS-16, exceeding state-of-the-art VOS methods. The code can be found at https://github.com/Storife/RANet.