Source author record

Jue Wang

Jue Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher: unclaimed source record

Catalog footprint

What is connected

59 works

27 topics

4 close collaborators

preprint, 2026, arXiv

Search Your Block Floating Point Scales!

Quantization has emerged as a standard technique for accelerating inference for generative models by enabling faster low-precision computations and reduced memory transfers. Recently, GPU accelerators have added first-class support for microscaling Block Floating Point (BFP) formats. Standard BFP algorithms use a fixed scale based on the maximum magnitude of the block. We observe that this scale choice can be suboptimal with respect to quantization errors. In this work, we propose ScaleSearch, an alternative strategy for selecting these scale factors: using a fine-grained search leveraging the mantissa bits in microscaling formats to minimize the quantization error for the given distribution. ScaleSearch can be integrated with existing quantization methods such as Post Training Quantization and low-precision attention, and is shown to improve their performance. Additionally, we introduce ScaleSearchAttention, an accelerated NVFP4-based attention algorithm, which uses ScaleSearch and adapted prior techniques to ensure near-zero performance loss for causal language modeling. Experiments show that ScaleSearch reduces quantization error by 27% for NVFP4 and improves language model PTQ by up to 15 points for MATH500 (Qwen3-8B), while ScaleSearchAttention improves Wikitext-2 PPL by up to 0.77 points for Llama 3.1 70B. The proposed methods closely match baseline performance while providing quantization accuracy improvements.
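
The contrast between a max-based scale and a searched scale can be illustrated with a minimal sketch. It assumes the FP4 (E2M1) magnitude grid and a hypothetical list of candidate multipliers around the max-based scale; it keeps the scale in full precision and does not reproduce the paper's search over scale mantissa bits or its NVFP4 kernels.

```python
# FP4 (E2M1) representable magnitudes; the sign is handled separately.
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block, scale):
    """Round each value to the nearest grid point after dividing by scale."""
    out = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        out.append((mag if x >= 0 else -mag) * scale)
    return out


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def max_scale(block):
    """Baseline: map the largest magnitude in the block to the top grid value."""
    peak = max(abs(x) for x in block)
    return peak / FP4_GRID[-1] if peak > 0 else 1.0


# Hypothetical candidate multipliers; 1.0 keeps the baseline in the search set.
MULTIPLIERS = [0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.0, 1.05, 1.10]


def searched_scale(block, multipliers=MULTIPLIERS):
    """Pick the candidate scale with the lowest reconstruction error."""
    base = max_scale(block)
    return min(
        (base * m for m in multipliers),
        key=lambda s: mse(block, quantize_block(block, s)),
    )


block = [0.03, -0.11, 0.27, -0.42, 0.58, 0.9, -1.3, 2.0]
err_max = mse(block, quantize_block(block, max_scale(block)))
err_search = mse(block, quantize_block(block, searched_scale(block)))
```

Because the baseline scale is one of the candidates, the searched error in this sketch can never exceed the max-based error; how much it improves depends on the block's distribution.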

preprint, 2023, arXiv

Evidence for Exciton Crystals in a 2D Semiconductor Heterotrilayer

Two-dimensional (2D) transition metal dichalcogenides (TMDC) and their moiré interfaces have been shown to host correlated electron states, including Mott insulators and electron/hole crystals commensurate with moiré superlattices. Here we present spectroscopic evidence for ordered bosons - interlayer exciton crystals in a WSe2/MoSe2/WSe2 trilayer, where the enhanced Coulomb interactions over those in heterobilayers have been predicted to result in exciton ordering. While the dipolar interlayer excitons in the heterobilayer may be ordered by the periodic moiré traps, their mutual repulsion results in de-trapping at exciton density n_ex larger than 10^11 cm^-2 to form mobile exciton gases and further to electron-hole plasmas, both accompanied by broadening in photoluminescence (PL) peaks and large increases in mobility. In contrast, ordered interlayer excitons in the trilayer are characterized by negligible mobility and by sharper PL peaks persisting to n_ex approximately 10^12 cm^-2. We present evidence for the predicted quadrupolar exciton crystal and its transitions to dipolar excitons either with increasing n_ex or by an applied electric field. These ordered interlayer excitons may serve as models for the exploration of quantum phase transitions and quantum coherent phenomena.

preprint, 2023, arXiv

Fluid Antenna-Assisted MIMO Transmission Exploiting Statistical CSI

In conventional multiple-input multiple-output (MIMO) communication systems, the positions of antennas are fixed. To take full advantage of spatial degrees of freedom, a new technology called fluid antenna (FA) has been proposed to obtain a higher achievable rate and diversity gain. Most existing works on FA exploit instantaneous channel state information (CSI). However, in FA-assisted systems, it is difficult to obtain instantaneous CSI since changes in the antenna position lead to channel variation. In this letter, we investigate an FA-assisted MIMO system using relatively slow-varying statistical CSI. Specifically, under the rate-maximization criterion, we propose an algorithmic framework for transmit precoding and transmit/receive FA position design with statistical CSI. Simulation results show that our proposed algorithm in FA-assisted systems significantly outperforms baselines in terms of rate performance.

preprint, 2022, arXiv

Boosting Fast Adversarial Training with Learnable Adversarial Initialization

Adversarial training (AT) has been demonstrated to be effective in improving model robustness by leveraging adversarial examples for training. However, most AT methods face high time and computational costs for calculating gradients at multiple steps when generating adversarial examples. To boost training efficiency, the fast gradient sign method (FGSM) is adopted in fast AT methods by calculating the gradient only once. Unfortunately, the robustness is far from satisfactory. One reason may be the initialization scheme. Existing fast AT generally uses a random sample-agnostic initialization, which helps efficiency yet hinders further robustness improvement. Up to now, the initialization in fast AT has not been extensively explored. In this paper, we boost fast AT with a sample-dependent adversarial initialization, i.e., an output from a generative network conditioned on a benign image and its gradient information from the target network. As the generative network and the target network are optimized jointly in the training phase, the former can adaptively generate an effective initialization with respect to the latter, which leads to gradually improved robustness. Experimental evaluations on four benchmark databases demonstrate the superiority of our proposed method over state-of-the-art fast AT methods, as well as comparable robustness to advanced multi-step AT methods. The code is released at https://github.com/jiaxiaojunQAQ/FGSM-SDI.
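
To show where the initialization enters a single-step attack, here is a minimal sketch of one FGSM step started from a given perturbation. The gradient and the initialization are passed in as plain lists; the generative network that would produce a sample-dependent initialization is not modeled, and the function and parameter names are hypothetical.

```python
def sign(v):
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def clip(v, lo, hi):
    """Clamp v to the interval [lo, hi]."""
    return max(lo, min(hi, v))


def fgsm_step(x, grad, init, eps, alpha):
    """One signed-gradient step from `init`, projected back into the eps-ball."""
    delta = [clip(d + alpha * sign(g), -eps, eps) for d, g in zip(init, grad)]
    # Keep the perturbed input inside the valid pixel range [0, 1].
    return [clip(xi + di, 0.0, 1.0) for xi, di in zip(x, delta)]


eps = 8 / 255
x = [0.2, 0.5, 0.99]
grad = [0.3, -1.2, 0.7]  # toy loss gradient, assumed to be evaluated at x + init

# Zero initialization: plain FGSM with step size eps.
x_adv_zero = fgsm_step(x, grad, init=[0.0, 0.0, 0.0], eps=eps, alpha=eps)

# A non-zero initialization (in the paper this would come from a generator).
x_adv_init = fgsm_step(x, grad, init=[eps, -eps, 0.0], eps=eps, alpha=1.25 * eps)
```

In both calls the final perturbation stays within the eps budget; only the starting point of the step differs.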

preprint, 2022, arXiv

Control-Oriented Power Allocation for Integrated Satellite-UAV Networks

This letter presents a sensing-communication-computing-control (SC3) integrated satellite unmanned aerial vehicle (UAV) network, where the UAV is equipped with on-board sensors, mobile edge computing (MEC) servers, base stations and a satellite communication module. Like the nervous system, this integrated network is capable of organizing multiple field robots in remote areas, so as to perform mission-critical tasks which are dangerous for humans. Aiming at activating this nervous system with multiple SC3 loops, we present a control-oriented optimization problem. Different from traditional studies which mainly focused on communication metrics, we address the power allocation issue to minimize the sum linear quadratic regulator (LQR) control cost of all SC3 loops. Specifically, we show the convexity of the formulated problem and reveal the relationship between optimal transmit power and intrinsic entropy rate of different SC3 loops. For the assure-to-be-stable case, we derive a closed-form solution for ease of practical applications. After demonstrating the superiority of the control-oriented power allocation, we further highlight its difference from the classic capacity-oriented water-filling method.
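
For reference, the classic capacity-oriented water-filling baseline mentioned above can be sketched as follows. This illustrates only the textbook rule p_i = max(0, mu - 1/g_i) with the water level mu found by bisection; it is not the letter's control-oriented solution, and the gains (assumed normalized by noise power) and function name are hypothetical.

```python
def water_filling(gains, total_power, iters=100):
    """Allocate p_i = max(0, mu - 1/g_i) so that sum(p_i) equals total_power."""
    inv = [1.0 / g for g in gains]
    # The water level mu lies between the lowest floor and the highest floor plus P.
    lo, hi = min(inv), max(inv) + total_power
    for _ in range(iters):
        mu = (lo + hi) / 2
        used = sum(max(0.0, mu - v) for v in inv)
        if used > total_power:
            hi = mu
        else:
            lo = mu
    mu = (lo + hi) / 2
    return [max(0.0, mu - v) for v in inv]


# Three channels with decreasing gain; the weakest one receives no power.
alloc = water_filling([2.0, 1.0, 0.1], total_power=1.0)
```

Stronger channels get more power and channels whose inverse gain exceeds the water level are switched off, which is the behaviour the control-oriented allocation is contrasted with.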

preprint, 2022, arXiv

Deblur-NeRF: Neural Radiance Fields from Blurry Images

Neural Radiance Field (NeRF) has gained considerable attention recently for 3D scene reconstruction and novel view synthesis due to its remarkable synthesis quality. However, image blurriness caused by defocus or motion, which often occurs when capturing scenes in the wild, significantly degrades its reconstruction quality. To address this problem, we propose Deblur-NeRF, the first method that can recover a sharp NeRF from blurry input. We adopt an analysis-by-synthesis approach that reconstructs blurry views by simulating the blurring process, thus making NeRF robust to blurry inputs. The core of this simulation is a novel Deformable Sparse Kernel (DSK) module that models spatially-varying blur kernels by deforming a canonical sparse kernel at each spatial location. The ray origin of each kernel point is jointly optimized, inspired by the physical blurring process. This module is parameterized as an MLP that can generalize to various blur types. Jointly optimizing the NeRF and the DSK module allows us to restore a sharp NeRF. We demonstrate that our method can be used on both camera motion blur and defocus blur: the two most common types of blur in real scenes. Evaluation results on both synthetic and real-world data show that our method outperforms several baselines. The synthetic and real datasets along with the source code are publicly available at https://limacv.github.io/deblurnerf/

preprint, 2022, arXiv

Deformable Video Transformer

Video transformers have recently emerged as an effective alternative to convolutional networks for action classification. However, most prior video transformers adopt either global space-time attention or hand-defined strategies to compare patches within and across frames. These fixed attention schemes not only have high computational cost but, by comparing patches at predetermined locations, they neglect the motion dynamics in the video. In this paper, we introduce the Deformable Video Transformer (DVT), which dynamically predicts a small subset of video patches to attend for each query location based on motion information, thus allowing the model to decide where to look in the video based on correspondences across frames. Crucially, these motion-based correspondences are obtained at zero-cost from information stored in the compressed format of the video. Our deformable attention mechanism is optimised directly with respect to classification performance, thus eliminating the need for suboptimal hand-design of attention strategies. Experiments on four large-scale video benchmarks (Kinetics-400, Something-Something-V2, EPIC-KITCHENS and Diving-48) demonstrate that, compared to existing video transformers, our model achieves higher accuracy at the same or lower computational cost, and it attains state-of-the-art results on these four datasets.

preprint, 2022, arXiv

Energy Efficiency Maximization of Massive MIMO Communications With Dynamic Metasurface Antennas

Future wireless communications are largely inclined to deploy massive numbers of antennas at the base stations (BSs) by leveraging cost- and energy-efficient as well as environmentally friendly antenna arrays. The emerging technology of dynamic metasurface antennas (DMAs) is promising to realize such massive antenna arrays with reduced physical size, hardware cost, and power consumption. The goal of this paper is the optimization of the energy efficiency (EE) performance of DMA-assisted massive multiple-input multiple-output (MIMO) wireless communications. Focusing on the uplink, we propose an algorithmic framework for designing the transmit precoding of each multi-antenna user and the DMA tuning strategy at the BS to maximize the EE performance, considering the availability of either instantaneous or statistical channel state information (CSI). Specifically, the proposed framework is shaped around Dinkelbach's transform, alternating optimization, and deterministic equivalent methods. In addition, we obtain a closed-form solution to the optimal transmit signal directions for the statistical CSI case, which simplifies the corresponding transmission design for the multiple-antenna case. Our numerical results verify the good convergence behavior of the proposed algorithms, and showcase the considerable EE performance gains of the DMA-assisted massive MIMO transmissions over the baseline schemes.
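
Dinkelbach's transform, one of the building blocks named above, turns a ratio maximization into a sequence of subtractive problems. The following is a minimal sketch on a scalar toy problem with hypothetical numbers (a single link, a finite grid of transmit powers); it does not reproduce the paper's precoding or DMA tuning design.

```python
import math


def dinkelbach(candidates, num, den, tol=1e-12, max_iter=100):
    """Maximize num(x) / den(x) over a finite candidate set (den must be positive)."""
    lam = 0.0
    x = candidates[0]
    for _ in range(max_iter):
        # Parametric subproblem: maximize num(x) - lam * den(x).
        x = max(candidates, key=lambda c: num(c) - lam * den(c))
        gap = num(x) - lam * den(x)
        lam = num(x) / den(x)
        if gap < tol:
            break
    return x, lam


# Toy energy-efficiency model: rate divided by (transmit power + circuit power).
GAIN, CIRCUIT_POWER = 4.0, 0.5
powers = [i / 100 for i in range(1, 301)]


def rate(p):
    return math.log2(1.0 + GAIN * p)


def cost(p):
    return p + CIRCUIT_POWER


best_p, best_ee = dinkelbach(powers, rate, cost)
```

Each iteration updates the ratio estimate `lam` from the current maximizer, and the loop stops once the subproblem's optimal value is numerically zero, at which point `lam` equals the best achievable ratio on the grid.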

preprint, 2022, arXiv

Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards

Generating accurate descriptions for online fashion items is important not only for enhancing customers' shopping experiences, but also for increasing online sales. Besides the need to correctly present the attributes of items, expressions in an enchanting style can better attract customer interest. The goal of this work is to develop a novel learning framework for accurate and expressive fashion captioning. Unlike in general image captioning, it is hard to identify and describe the rich attributes of fashion items. We seed the description of an item by first identifying its attributes, and introduce attribute-level semantic (ALS) reward and sentence-level semantic (SLS) reward as metrics to improve the quality of text descriptions. We further integrate the training of our model with maximum likelihood estimation (MLE), attribute embedding, and Reinforcement Learning (RL). To facilitate the learning, we build a new FAshion CAptioning Dataset (FACAD), which contains 993K images and 130K corresponding enchanting and diverse descriptions. Experiments on FACAD demonstrate the effectiveness of our model.

preprint, 2022, arXiv

Fast Adversarial Training with Adaptive Step Size

While adversarial training and its variants have been shown to be the most effective algorithms to defend against adversarial attacks, their extremely slow training process makes them hard to scale to large datasets like ImageNet. The key idea of recent works to accelerate adversarial training is to substitute multi-step attacks (e.g., PGD) with single-step attacks (e.g., FGSM). However, these single-step methods suffer from catastrophic overfitting, where the accuracy against PGD attack suddenly drops to nearly 0% during training, destroying the robustness of the networks. In this work, we study the phenomenon from the perspective of training instances. We show that catastrophic overfitting is instance-dependent and that fitting instances with larger gradient norm is more likely to cause catastrophic overfitting. Based on our findings, we propose a simple but effective method, Adversarial Training with Adaptive Step size (ATAS). ATAS learns an instancewise adaptive step size that is inversely proportional to its gradient norm. The theoretical analysis shows that ATAS converges faster than the commonly adopted non-adaptive counterparts. Empirically, ATAS consistently mitigates catastrophic overfitting and achieves higher robust accuracy on CIFAR10, CIFAR100 and ImageNet when evaluated on various adversarial budgets.
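
The step-size rule described above can be illustrated with a small sketch in which each instance gets a step size that shrinks as its gradient norm grows. The constants `gamma` and `c` and the exact form gamma / (c + norm) are assumptions made for illustration; the paper's own update (including how the norm is tracked across epochs) is not reproduced here.

```python
import math


def grad_norm(grad):
    """Euclidean norm of one instance's input gradient."""
    return math.sqrt(sum(g * g for g in grad))


def adaptive_step_sizes(grads, gamma=0.1, c=0.01):
    """Per-instance step size gamma / (c + ||grad||); gamma and c are hypothetical."""
    return [gamma / (c + grad_norm(g)) for g in grads]


# Two toy instances with gradient norms 0.5 and 5.0.
grads = [[0.3, -0.4], [3.0, -4.0]]
steps = adaptive_step_sizes(grads)
```

The instance with the larger gradient norm receives the smaller step, which matches the abstract's observation that large-gradient instances are the ones most likely to trigger catastrophic overfitting.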

preprint, 2022, arXiv

FENeRF: Face Editing in Neural Radiance Fields

Previous portrait image generation methods roughly fall into two categories: 2D GANs and 3D-aware GANs. 2D GANs can generate high fidelity portraits but with low view consistency. 3D-aware GAN methods can maintain view consistency but their generated images are not locally editable. To overcome these limitations, we propose FENeRF, a 3D-aware generator that can produce view-consistent and locally-editable portrait images. Our method uses two decoupled latent codes to generate corresponding facial semantics and texture in a spatially aligned 3D volume with shared geometry. Benefiting from such underlying 3D representation, FENeRF can jointly render the boundary-aligned image and semantic mask and use the semantic mask to edit the 3D volume via GAN inversion. We further show such 3D representation can be learned from widely available monocular image and semantic mask pairs. Moreover, we reveal that jointly learning semantics and texture helps to generate finer geometry. Our experiments demonstrate that FENeRF outperforms state-of-the-art methods in various face editing tasks.

preprint, 2022, arXiv

Hallucinated Neural Radiance Fields in the Wild

Neural Radiance Fields (NeRF) has recently gained popularity for its impressive novel view synthesis ability. This paper studies the problem of hallucinated NeRF: i.e., recovering a realistic NeRF at a different time of day from a group of tourism images. Existing solutions adopt NeRF with a controllable appearance embedding to render novel views under various conditions, but they cannot render view-consistent images with an unseen appearance. To solve this problem, we present an end-to-end framework for constructing a hallucinated NeRF, dubbed as Ha-NeRF. Specifically, we propose an appearance hallucination module to handle time-varying appearances and transfer them to novel views. Considering the complex occlusions of tourism images, we introduce an anti-occlusion module to decompose the static subjects for visibility accurately. Experimental results on synthetic data and real tourism photo collections demonstrate that our method can hallucinate the desired appearances and render occlusion-free images from different views. The project and supplementary materials are available at https://rover-xingyu.github.io/Ha-NeRF/.

preprint, 2022, arXiv

Hybrid RIS and DMA Assisted Multiuser MIMO Uplink Transmission With Electromagnetic Exposure Constraints

In the fifth-generation and beyond era, reconfigurable intelligent surface (RIS) and dynamic metasurface antennas (DMAs) are emerging metamaterials keeping up with the demand for high-quality wireless communication services, which promote the diversification of portable wireless terminals. However, along with the rapid expansion of wireless devices, the electromagnetic (EM) radiation increases unceasingly and inevitably affects public health, which requires a limited exposure level in the transmission design. To reduce the EM radiation and preserve the quality of communication service, we investigate the spectral efficiency (SE) maximization with EM constraints for uplink transmission in hybrid RIS and DMA assisted multiuser multiple-input multiple-output systems. Specifically, alternating optimization is adopted to optimize the transmit covariance, RIS phase shift, and DMA weight matrices. We first derive the water-filling solutions of transmit covariance matrices with given RIS and DMA parameters. Then, the RIS phase shift matrix is optimized via the weighted minimum mean square error, block coordinate descent and minorization-maximization methods. Furthermore, we solve the unconstrained DMA weight matrix optimization problem in closed form and then design the DMA weight matrix to approach this performance under DMA constraints. Numerical results confirm the effectiveness of the EM aware SE maximization transmission scheme over the conventional baselines.

preprint, 2022, arXiv

HyP$^2$ Loss: Beyond Hypersphere Metric Space for Multi-label Image Retrieval

Image retrieval has become an increasingly appealing technique with broad multimedia application prospects, where deep hashing serves as the dominant branch towards low storage and efficient retrieval. In this paper, we carry out in-depth investigations on metric learning in deep hashing for establishing a powerful metric space in multi-label scenarios, where the pair loss suffers from high computational overhead and convergence difficulty, while the proxy loss is theoretically incapable of expressing the profound label dependencies and exhibits conflicts in the constructed hypersphere space. To address the problems, we propose a novel metric learning framework with Hybrid Proxy-Pair Loss (HyP$^2$ Loss) that constructs an expressive metric space with efficient training complexity w.r.t. the whole dataset. The proposed HyP$^2$ Loss focuses on optimizing the hypersphere space by learnable proxies and excavating data-to-data correlations of irrelevant pairs, which integrates sufficient data correspondence of pair-based methods and high-efficiency of proxy-based methods. Extensive experiments on four standard multi-label benchmarks show that the proposed method outperforms the state-of-the-art, is robust among different hash bits and achieves significant performance gains with a faster, more stable convergence speed. Our code is available at https://github.com/JerryXu0129/HyP2-Loss.

preprint, 2022, arXiv

IDE-3D: Interactive Disentangled Editing for High-Resolution 3D-aware Portrait Synthesis

Existing 3D-aware facial generation methods face a dilemma in quality versus editability: they either generate editable results in low resolution or high-quality ones with no editing flexibility. In this work, we propose a new approach that brings the best of both worlds together. Our system consists of three major components: (1) a 3D-semantics-aware generative model that produces view-consistent, disentangled face images and semantic masks; (2) a hybrid GAN inversion approach that initializes the latent codes from the semantic and texture encoders and further optimizes them for faithful reconstruction; and (3) a canonical editor that enables efficient manipulation of semantic masks in canonical view and produces high-quality editing results. Our approach is suitable for many applications, e.g. free-view face drawing, editing, and style control. Both quantitative and qualitative results show that our method reaches the state-of-the-art in terms of photorealism, faithfulness, and efficiency.

preprint, 2022, arXiv

Improving the Latent Space of Image Style Transfer

Existing neural style transfer research has focused on matching statistical information between the deep features of content and style images, extracted by a pre-trained VGG, and has achieved significant improvement in synthesizing artistic images. However, in some cases, the feature statistics from the pre-trained encoder may not be consistent with the visual style we perceive. For example, the style distance between images of different styles can be less than that between images of the same style. In such an inappropriate latent space, the objective function of the existing methods will be optimized in the wrong direction, resulting in bad stylization results. In addition, the lack of content details in the features extracted by the pre-trained encoder also leads to the content leak problem. In order to solve these issues in the latent space used by style transfer, we propose two contrastive training schemes to get a refined encoder that is more suitable for this task. The style contrastive loss pulls the stylized result closer to the same visual style image and pushes it away from the content image. The content contrastive loss enables the encoder to retain more available details. We can directly add our training scheme to some existing style transfer methods and significantly improve their results. Extensive experimental results demonstrate the effectiveness and superiority of our methods.

preprint, 2022, arXiv

LAS-AT: Adversarial Training with Learnable Attack Strategy

Adversarial training (AT) is always formulated as a minimax problem, whose performance depends on the inner optimization that involves the generation of adversarial examples (AEs). Most previous methods adopt Projected Gradient Descent (PGD) with manually specified attack parameters for AE generation. A combination of the attack parameters can be referred to as an attack strategy. Several works have revealed that using a fixed attack strategy to generate AEs during the whole training phase limits the model robustness and propose to exploit different attack strategies at different training stages to improve robustness. But those multi-stage hand-crafted attack strategies need much domain expertise, and the robustness improvement is limited. In this paper, we propose a novel framework for adversarial training by introducing the concept of "learnable attack strategy", dubbed LAS-AT, which learns to automatically produce attack strategies to improve the model robustness. Our framework is composed of a target network that uses AEs for training to improve robustness and a strategy network that produces attack strategies to control the AE generation. Experimental evaluations on three benchmark databases demonstrate the superiority of the proposed method. The code is released at https://github.com/jiaxiaojunQAQ/LAS-AT.

preprint, 2022, arXiv

LocVTP: Video-Text Pre-training for Temporal Localization

Video-Text Pre-training (VTP) aims to learn transferable representations for various downstream tasks from large-scale web videos. To date, almost all existing VTP methods are limited to retrieval-based downstream tasks, e.g., video retrieval, whereas their transfer potentials on localization-based tasks, e.g., temporal grounding, are under-explored. In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks, and propose a novel Localization-oriented Video-Text Pre-training framework, dubbed as LocVTP. Specifically, we perform the fine-grained contrastive alignment as a complement to the coarse-grained one by a clip-word correspondence discovery scheme. To further enhance the temporal reasoning ability of the learned feature, we propose a context projection head and a temporal aware contrastive loss to perceive the contextual relationships. Extensive experiments on four downstream tasks across six datasets demonstrate that our LocVTP achieves state-of-the-art performance on both retrieval-based and localization-based tasks. Furthermore, we conduct comprehensive ablation studies and thorough analyses to explore the optimum model designs and training strategies.

preprint, 2022, arXiv

Long-Short Temporal Contrastive Learning of Video Transformers

Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par with or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K. Since transformer-based models are effective at capturing dependencies over extended temporal spans, we propose a simple learning procedure that forces the model to match a long-term view to a short-term view of the same video. Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent. To demonstrate the generality of our findings, we implement and validate our approach under three different self-supervised contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct video-transformer architectures, including an improved variant of the Swin Transformer augmented with space-time attention. We conduct a thorough ablation study and show that LSTCL achieves competitive performance on multiple video benchmarks and represents a convincing alternative to supervised image-based pretraining.

preprint, 2022, arXiv

Motion-aware Contrastive Video Representation Learning via Foreground-background Merging

In light of the success of contrastive learning in the image domain, current self-supervised video representation learning methods usually employ contrastive loss to facilitate video representation learning. When naively pulling two augmented views of a video closer, the model however tends to learn the common static background as a shortcut but fails to capture the motion information, a phenomenon dubbed as background bias. Such bias makes the model suffer from weak generalization ability, leading to worse performance on downstream tasks such as action recognition. To alleviate such bias, we propose Foreground-background Merging (FAME) to deliberately compose the moving foreground region of the selected video onto the static background of others. Specifically, without any off-the-shelf detector, we extract the moving foreground out of background regions via the frame difference and color statistics, and shuffle the background regions among the videos. By leveraging the semantic consistency between the original clips and the fused ones, the model focuses more on the motion patterns and is debiased from the background shortcut. Extensive experiments demonstrate that FAME can effectively resist background cheating and thus achieve the state-of-the-art performance on downstream tasks across UCF101, HMDB51, and Diving48 datasets. The code and configurations are released at https://github.com/Mark12Ding/FAME.

preprint, 2022, arXiv

Multi-Robot Active Mapping via Neural Bipartite Graph Matching

We study the problem of multi-robot active mapping, which aims for complete scene map construction in minimum time steps. The key to this problem lies in the goal position estimation to enable more efficient robot movements. Previous approaches either choose the frontier as the goal position via a myopic solution that hinders the time efficiency, or maximize the long-term value via reinforcement learning to directly regress the goal position, but do not guarantee the complete map construction. In this paper, we propose a novel algorithm, namely NeuralCoMapping, which takes advantage of both approaches. We reduce the problem to bipartite graph matching, which establishes the node correspondences between two graphs, denoting robots and frontiers. We introduce a multiplex graph neural network (mGNN) that learns the neural distance to fill the affinity matrix for more effective graph matching. We optimize the mGNN with a differentiable linear assignment layer by maximizing the long-term values that favor time efficiency and map completeness via reinforcement learning. We compare our algorithm with several state-of-the-art multi-robot active mapping approaches and adapted reinforcement-learning baselines. Experimental results demonstrate the superior performance and exceptional generalization ability of our algorithm on various indoor scenes and unseen numbers of robots, when only trained with 9 indoor scenes.
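
The reduction to bipartite matching can be pictured with a small sketch that assigns each robot to a distinct frontier at minimum total cost. Plain Euclidean distance stands in for the learned neural distance, and an exhaustive search stands in for the differentiable linear assignment layer; both substitutions, and all names below, are assumptions made for illustration.

```python
from itertools import permutations
import math


def match_robots_to_frontiers(robots, frontiers):
    """Brute-force minimum-cost assignment of each robot to a distinct frontier.

    Requires len(frontiers) >= len(robots); otherwise no assignment exists
    and (None, inf) is returned.
    """
    best_cost, best_assign = math.inf, None
    for perm in permutations(range(len(frontiers)), len(robots)):
        cost = sum(math.dist(robots[i], frontiers[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assign = cost, perm
    return best_assign, best_cost


robots = [(0.0, 0.0), (5.0, 5.0)]
frontiers = [(6.0, 5.0), (1.0, 0.0), (10.0, 10.0)]

# assignment[i] is the index of the frontier chosen for robot i.
assignment, total = match_robots_to_frontiers(robots, frontiers)
```

Exhaustive search is only practical for a handful of robots and frontiers; it is used here to keep the sketch dependency-free.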

preprint, 2022, arXiv

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Vision Transformers (ViTs) take all the image patches as tokens and construct multi-head self-attention (MHSA) among them. Using all of these image tokens brings redundant computation, since not all tokens are attentive in MHSA. For example, tokens containing semantically meaningless or distracting image backgrounds do not contribute positively to the ViT predictions. In this work, we propose to reorganize image tokens during the feed-forward process of ViT models, which is integrated into ViT during training. For each forward inference, we identify the attentive image tokens between MHSA and FFN (i.e., feed-forward network) modules, guided by the corresponding class token attention. Then, we reorganize image tokens by preserving attentive image tokens and fusing inattentive ones to expedite subsequent MHSA and FFN computations. In this way, our method EViT improves ViTs from two perspectives. First, under the same amount of input image tokens, our method reduces MHSA and FFN computation for efficient inference. For instance, the inference speed of DeiT-S is increased by 50% while its recognition accuracy is decreased by only 0.3% for ImageNet classification. Second, by maintaining the same computational cost, our method empowers ViTs to take more image tokens as input for recognition accuracy improvement, where the image tokens are from higher resolution images. For example, we improve the recognition accuracy of DeiT-S by 1% for ImageNet classification at the same computational cost as a vanilla DeiT-S. Meanwhile, our method does not introduce more parameters to ViTs. Experiments on the standard benchmarks show the effectiveness of our method. The code is available at https://github.com/youweiliang/evit
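
The token reorganization step can be sketched as keeping the tokens that receive the most class-token attention and merging the remainder into a single token. The attention-weighted average used for fusing is an assumption of this sketch, as are the function and parameter names; tokens are plain lists of floats and attention scores are assumed to be positive.

```python
def reorganize_tokens(tokens, cls_attention, keep):
    """Keep the `keep` most-attended tokens and fuse the rest into one token."""
    order = sorted(range(len(tokens)), key=lambda i: cls_attention[i], reverse=True)
    kept_idx, fused_idx = sorted(order[:keep]), order[keep:]
    kept = [tokens[i] for i in kept_idx]  # original token order is preserved
    if not fused_idx:
        return kept
    # Assumption: fuse inattentive tokens by an attention-weighted average.
    total = sum(cls_attention[i] for i in fused_idx)
    dim = len(tokens[0])
    fused = [
        sum(cls_attention[i] * tokens[i][d] for i in fused_idx) / total
        for d in range(dim)
    ]
    return kept + [fused]


tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
cls_attention = [0.5, 0.1, 0.3, 0.1]
reduced = reorganize_tokens(tokens, cls_attention, keep=2)
```

Four input tokens become three: the two most attended ones plus one fused token, so later MHSA and FFN layers process a shorter sequence.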

preprint, 2022, arXiv

Parallel measurements of vibrational modes in a few-layer graphene nanomechanical resonator using software-defined radio dongles

Software-defined radio dongles are small and inexpensive receivers well known to amateur radio enthusiasts. When connected to an antenna, they enable monitoring of a wide range of the radio spectrum by conditioning the input signal and transferring a downconverted version of it to a personal computer for software processing. Here, we employ a composite of two such dongles, interfaced with codes written in MATLAB and GNU Radio, as a measuring instrument to study the flexural vibrations of a few-layer graphene nanomechanical resonator. Instead of an antenna, we connect the dongles to the split output of a photodetector used to detect vibrations optically. We first perform a quantitative analysis of the dynamics of the first vibrational mode. We then measure the response of the first two vibrational modes in parallel. To illustrate our technique, we detect changes in the vibrational amplitude of both modes induced by periodic strain modulation with a delay of $\approx1$ ms between measurements. Last, we show that our software-based instrument can be employed to demodulate human voice encoded in the vibrations of our resonator. For parallel measurements of several frequency channels, and provided that the input signal is not too weak, our composite system may offer an alternative to the use of multiple lock-in amplifiers or multiple spectrum analyzers, with the distinct advantage of being cost-effective per frequency channel.
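The idea of measuring several frequency channels in parallel, as an alternative to multiple lock-in amplifiers, can be sketched in software as digital downconversion followed by averaging. The sketch below is a toy illustration only: the function names, sample rate and test frequencies are assumptions, and it does not reproduce the MATLAB and GNU Radio code used in the paper.

```python
import numpy as np

def channel_amplitude(signal, fs, f0):
    """Estimate the amplitude of the spectral component of `signal` at f0.

    The signal is mixed with a complex exponential at f0 (downconversion to
    DC) and then averaged, which acts as a crude low-pass filter.
    """
    t = np.arange(signal.size) / fs
    baseband = signal * np.exp(-2j * np.pi * f0 * t)
    return 2.0 * np.abs(baseband.mean())

def measure_channels(signal, fs, freqs):
    """Demodulate the same record at several frequencies, one per channel."""
    return [channel_amplitude(signal, fs, f0) for f0 in freqs]
```

For a record that contains an integer number of cycles of each tone, this recovers the amplitude of each tone independently of its phase. A real measurement would use a proper low-pass filter and much higher frequencies.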

preprint2022arXiv

Prior-Guided Adversarial Initialization for Fast Adversarial Training

Fast adversarial training (FAT) effectively improves the efficiency of standard adversarial training (SAT). However, initial FAT encounters catastrophic overfitting, i.e., the robust accuracy against adversarial attacks suddenly and dramatically decreases. Though several FAT variants spare no effort to prevent overfitting, they do so at considerable computational cost. In this paper, we explore the difference between the training processes of SAT and FAT and observe that the attack success rate of adversarial examples (AEs) of FAT gets worse gradually in the late training stage, resulting in overfitting. The AEs are generated by the fast gradient sign method (FGSM) with a zero or random initialization. Based on the observation, we propose a prior-guided FGSM initialization method to avoid overfitting after investigating several initialization strategies, improving the quality of the AEs during the whole training process. The initialization is formed by leveraging historically generated AEs without additional calculation cost. We further provide a theoretical analysis for the proposed initialization method. We also propose a simple yet effective regularizer based on the prior-guided initialization, i.e., the currently generated perturbation should not deviate too much from the prior-guided initialization. The regularizer adopts both historical and current adversarial perturbations to guide the model learning. Evaluations on four datasets demonstrate that the proposed method can prevent catastrophic overfitting and outperform state-of-the-art FAT methods. The code is released at https://github.com/jiaxiaojunQAQ/FGSM-PGI.
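A minimal sketch of an FGSM step that starts from a supplied initialization may help make the idea concrete. It uses a logistic-regression loss so that the input gradient has a closed form. The function names, the step size `alpha`, and the choice of reusing the previous perturbation as the prior are assumptions for illustration; the released code at the URL above is the authoritative version.

```python
import numpy as np

def logistic_loss(w, x, y):
    """Logistic loss for a linear model, with label y in {-1, +1}."""
    return float(np.log1p(np.exp(-y * (x @ w))))

def loss_grad_x(w, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * (x @ w)
    return -y * w / (1.0 + np.exp(margin))

def fgsm_step(w, x, y, eps, alpha, init):
    """One FGSM step from `init`, projected onto the L-inf ball of radius eps.

    `init` may be zeros, random noise, or (as assumed here for the
    prior-guided case) a perturbation generated earlier in training.
    """
    grad = loss_grad_x(w, x + init, y)
    return np.clip(init + alpha * np.sign(grad), -eps, eps)
```

In a training loop, the perturbation returned for a sample would be stored and passed back as `init` the next time that sample is visited, so no extra gradient computation is needed to form the initialization.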

preprint2022arXiv

Reinforcement Learning-Empowered Mobile Edge Computing for 6G Edge Intelligence

Mobile edge computing (MEC) is considered a novel paradigm for computation-intensive and delay-sensitive tasks in fifth generation (5G) networks and beyond. However, its uncertainty, referred to as dynamics and randomness, from the mobile device, wireless channel, and edge network sides, results in high-dimensional, nonconvex, nonlinear, and NP-hard optimization problems. Thanks to advances in reinforcement learning (RL), a trained agent that iteratively interacts with the dynamic and random environment can intelligently obtain the optimal policy in MEC. Furthermore, its evolved versions, such as deep RL (DRL), can achieve higher convergence speed and learning accuracy based on the parametric approximation for the large-scale state-action space. This paper provides a comprehensive research review on RL-enabled MEC and offers insight for development in this area. More importantly, associated with free mobility, dynamic channels, and distributed services, the MEC challenges that can be solved by different kinds of RL algorithms are identified, followed by how they can be solved by RL solutions in diverse mobile applications. Finally, the open challenges are discussed to provide helpful guidance for future research in RL training and learning MEC.

preprint2022arXiv

Self-supervised Learning of Adversarial Example: Towards Good Generalizations for Deepfake Detection

Recent studies in deepfake detection have yielded promising results when the training and testing face forgeries are from the same dataset. However, the problem remains challenging when one tries to generalize the detector to forgeries created by methods unseen in the training dataset. This work addresses the generalizable deepfake detection from a simple principle: a generalizable representation should be sensitive to diverse types of forgeries. Following this principle, we propose to enrich the "diversity" of forgeries by synthesizing augmented forgeries with a pool of forgery configurations and strengthen the "sensitivity" to the forgeries by enforcing the model to predict the forgery configurations. To effectively explore the large forgery augmentation space, we further propose to use the adversarial training strategy to dynamically synthesize the most challenging forgeries to the current model. Through extensive experiments, we show that the proposed strategies are surprisingly effective (see Figure 1), and they achieve performance superior to the current state-of-the-art methods. Code is available at https://github.com/liangchen527/SLADD.

preprint2022arXiv

StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN

One-shot talking face generation aims at synthesizing a high-quality talking face video from an arbitrary portrait image, driven by a video or an audio segment. One challenging quality factor is the resolution of the output video: higher resolution conveys more details. In this work, we investigate the latent feature space of a pre-trained StyleGAN and discover some excellent spatial transformation properties. Upon the observation, we explore the possibility of using a pre-trained StyleGAN to break through the resolution limit of training datasets. We propose a novel unified framework based on a pre-trained StyleGAN that enables a set of powerful functionalities, i.e., high-resolution video generation, disentangled control by driving video or audio, and flexible face editing. Our framework elevates the resolution of the synthesized talking face to 1024*1024 for the first time, even though the training dataset has a lower resolution. We design a video-based motion generation module and an audio-based one, which can be plugged into the framework either individually or jointly to drive the video generation. The predicted motion is used to transform the latent features of StyleGAN for visual animation. To compensate for the transformation distortion, we propose a calibration network as well as a domain loss to refine the features. Moreover, our framework allows two types of facial editing, i.e., global editing via GAN inversion and intuitive editing based on 3D morphable models. Comprehensive experiments show superior video quality, flexible controllability, and editability over state-of-the-art methods.

preprint2022arXiv

Towards Accurate Active Camera Localization

In this work, we tackle the problem of active camera localization, which controls the camera movements actively to achieve an accurate camera pose. The past solutions are mostly based on Markov Localization, which reduces the position-wise camera uncertainty for localization. These approaches localize the camera in the discrete pose space and are agnostic to the localization-driven scene property, which restricts the camera pose accuracy in the coarse scale. We propose to overcome these limitations via a novel active camera localization algorithm, composed of a passive and an active localization module. The former optimizes the camera pose in the continuous pose space by establishing point-wise camera-world correspondences. The latter explicitly models the scene and camera uncertainty components to plan the right path for accurate camera pose estimation. We validate our algorithm on the challenging localization scenarios from both synthetic and scanned real-world indoor scenes. Experimental results demonstrate that our algorithm outperforms both the state-of-the-art Markov Localization based approach and other compared approaches on the fine-scale camera pose accuracy. Code and data are released at https://github.com/qhFang/AccurateACL.

preprint2022arXiv

Towards Real-World Video Deblurring by Exploring Blur Formation Process

This paper explores how to synthesize close-to-real blurs so that existing video deblurring models trained on them can generalize well to real-world blurry videos. In recent years, deep learning-based approaches have achieved promising success on the video deblurring task. However, the models trained on existing synthetic datasets still suffer from generalization problems over real-world blurry scenarios with undesired artifacts. The factors accounting for the failure remain unknown. Therefore, we revisit the classical blur synthesis pipeline and figure out the possible reasons, including shooting parameters, blur formation space, and image signal processor (ISP). To analyze the effects of these potential factors, we first collect an ultra-high frame-rate (940 FPS) RAW video dataset as the data basis to synthesize various kinds of blurs. Then we propose a novel realistic blur synthesis pipeline termed RAW-Blur by leveraging blur formation cues. Through numerous experiments, we demonstrate that synthesizing blurs in the RAW space and adopting the same ISP as the real-world testing data can effectively eliminate the negative effects of synthetic data. Furthermore, the shooting parameters of the synthesized blurry video, e.g., exposure time and frame rate, play significant roles in improving the performance of deblurring models. Impressively, the models trained on the blurry data synthesized by the proposed RAW-Blur pipeline can obtain more than 5dB PSNR gain against those trained on the existing synthetic blur datasets. We believe the novel realistic synthesis pipeline and the corresponding RAW video dataset can help the community to easily construct customized blur datasets to improve real-world video deblurring performance largely, instead of laboriously collecting real data pairs.
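A common way to synthesize motion blur from high frame-rate video is to average consecutive sharp frames, so that the number of averaged frames plays the role of exposure time. The sketch below shows only that averaging step on linear (RAW-like) intensities. It is an assumption-laden illustration, not the RAW-Blur pipeline itself: the function name and the `n_avg` parameter are hypothetical, and ISP processing is omitted.

```python
import numpy as np

def synthesize_blur(frames, n_avg):
    """Average groups of n_avg consecutive frames to imitate a longer exposure.

    frames: (T, H, W) array of linear intensities from a high frame-rate clip.
    Returns an array of shape (T // n_avg, H, W); leftover frames are dropped.
    """
    usable = (frames.shape[0] // n_avg) * n_avg
    grouped = frames[:usable].reshape(-1, n_avg, *frames.shape[1:])
    return grouped.mean(axis=1)
```

At a capture rate of 940 FPS, averaging `n_avg` frames corresponds roughly to an exposure of `n_avg / 940` seconds, which is how exposure time and output frame rate can be varied in such a scheme.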

preprint2022arXiv

Truncate-Split-Contrast: A Framework for Learning from Mislabeled Videos

Learning with noisy labels (LNL) is a classic problem that has been extensively studied for image tasks, but much less for video. A straightforward migration from images to videos without considering the properties of videos, such as computational cost and redundant information, is not a sound choice. In this paper, we propose two new strategies for video analysis with noisy labels: 1) A lightweight channel selection method dubbed Channel Truncation for feature-based label noise detection. This method selects the most discriminative channels to split clean and noisy instances in each category; 2) A novel contrastive strategy dubbed Noise Contrastive Learning, which constructs the relationship between clean and noisy instances to regularize model training. Experiments on three well-known benchmark datasets for video classification show that our proposed truNcatE-split-contrAsT (NEAT) significantly outperforms the existing baselines. By reducing the feature dimension to 10% of its original size, our method achieves over 0.4 noise detection F1-score and 5% classification accuracy improvement on the Mini-Kinetics dataset under severe noise (symmetric-80%). Thanks to Noise Contrastive Learning, the average classification accuracy improvement on Mini-Kinetics and Sth-Sth-V1 is over 1.6%.

preprint2022arXiv

Unsupervised Pre-training for Temporal Action Localization Tasks

Unsupervised video representation learning has made remarkable achievements in recent years. However, most existing methods are designed and optimized for video classification. These pre-trained models can be sub-optimal for temporal localization tasks due to the inherent discrepancy between video-level classification and clip-level localization. To bridge this gap, we make the first attempt to propose a self-supervised pretext task, coined as Pseudo Action Localization (PAL) to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL). Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos. The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them. Compared to the existing unsupervised video representation learning approaches, our PAL adapts better to downstream TAL tasks by introducing a temporal equivariant contrastive learning paradigm in a temporally dense and scale-aware manner. Extensive experiments show that PAL can utilize large-scale unlabeled video data to significantly boost the performance of existing TAL methods. Our codes and models will be made publicly available at https://github.com/zhang-can/UP-TAL.

preprint2022arXiv

UPHDR-GAN: Generative Adversarial Network for High Dynamic Range Imaging with Unpaired Data

The paper proposes a method to effectively fuse multi-exposure inputs and generate high-quality high dynamic range (HDR) images with unpaired datasets. Deep learning-based HDR image generation methods rely heavily on paired datasets. The ground truth images play a leading role in generating reasonable HDR images. Datasets without ground truth are hard to use for training deep neural networks. Recently, Generative Adversarial Networks (GANs) have demonstrated their potential for translating images from source domain X to target domain Y in the absence of paired examples. In this paper, we propose a GAN-based network for solving such problems while generating enjoyable HDR results, named UPHDR-GAN. The proposed method relaxes the constraint of the paired dataset and learns the mapping from the LDR domain to the HDR domain. Although the paired data are missing, UPHDR-GAN can properly handle the ghosting artifacts caused by moving objects or misalignments with the help of the modified GAN loss, the improved discriminator network and the useful initialization phase. The proposed method preserves the details of important regions and improves the total image perceptual quality. Qualitative and quantitative comparisons against the representative methods demonstrate the superiority of the proposed UPHDR-GAN.

preprint2022arXiv

VDTR: Video Deblurring with Transformer

Video deblurring is still an unsolved problem due to the challenging spatio-temporal modeling process. Existing convolutional neural network-based methods show a limited capacity for effective spatial and temporal modeling for video deblurring. This paper presents VDTR, an effective Transformer-based model that makes the first attempt to adapt Transformer for video deblurring. VDTR exploits the superior long-range and relation modeling capabilities of Transformer for both spatial and temporal modeling. However, it is challenging to design an appropriate Transformer-based model for video deblurring due to the complicated non-uniform blurs, misalignment across multiple frames and the high computational costs for high-resolution spatial modeling. To address these problems, VDTR advocates performing attention within non-overlapping windows and exploiting the hierarchical structure for long-range dependencies modeling. For frame-level spatial modeling, we propose an encoder-decoder Transformer that utilizes multi-scale features for deblurring. For multi-frame temporal modeling, we adapt Transformer to fuse multiple spatial features efficiently. Compared with CNN-based methods, the proposed method achieves highly competitive results on both synthetic and real-world video deblurring benchmarks, including DVD, GOPRO, REDS and BSD. We hope such a Transformer-based architecture can serve as a powerful alternative baseline for video deblurring and other video restoration tasks. The source code will be available at https://github.com/ljzycmd/VDTR.

preprint2021arXiv

VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

We present Vx2Text, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. In order to leverage transformer networks, which have been shown to be effective at modeling language, each modality is first converted into a set of language embeddings by a learnable tokenizer. This allows our approach to perform multimodal fusion in the language space, thus eliminating the need for ad-hoc cross-modal fusion modules. To address the non-differentiability of tokenization on continuous inputs (e.g., video or audio), we utilize a relaxation scheme that enables end-to-end training. Furthermore, unlike prior encoder-only models, our network includes an autoregressive decoder to generate open-ended text from the multimodal embeddings fused by the language encoder. This renders our approach fully generative and makes it directly applicable to different "video+x to text" problems without the need to design specialized network heads for each task. The proposed framework is not only conceptually simple but also remarkably effective: experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks -- captioning, question answering and audio-visual scene-aware dialog.

preprint2020arXiv

Content-Aware Unsupervised Deep Homography Estimation

Homography estimation is a basic image alignment method in many applications. It is usually conducted by extracting and matching sparse feature points, which are error-prone in low-light and low-texture images. On the other hand, previous deep homography approaches use either synthetic images for supervised learning or aerial images for unsupervised learning, both ignoring the importance of handling depth disparities and moving objects in real world applications. To overcome these problems, in this work we propose an unsupervised deep homography method with a new architecture design. In the spirit of the RANSAC procedure in traditional methods, we specifically learn an outlier mask to only select reliable regions for homography estimation. We calculate loss with respect to our learned deep features instead of directly comparing image content as was done previously. To achieve the unsupervised training, we also formulate a novel triplet loss customized for our network. We verify our method by conducting comprehensive comparisons on a new dataset that covers a wide range of scenes with varying degrees of difficulty for the task. Experimental results reveal that our method outperforms the state-of-the-art including deep solutions and feature-based solutions.

preprint2020arXiv

Contrastive Video Representation Learning via Adversarial Perturbations

Adversarial perturbations are noise-like patterns that can subtly change the data while causing an otherwise accurate classifier to fail. In this paper, we propose to use such perturbations within a novel contrastive learning setup to build negative samples, which are then used to produce improved video representations. To this end, given a well-trained deep model for per-frame video recognition, we first generate adversarial noise adapted to this model. Positive and negative bags are produced using the original data features from the full video sequence and their perturbed counterparts, respectively. Unlike the classic contrastive learning methods, we develop a binary classification problem that learns a set of discriminative hyperplanes -- as a subspace -- that will separate the two bags from each other. This subspace is then used as a descriptor for the video, dubbed discriminative subspace pooling. As the perturbed features belong to data classes that are likely to be confused with the original features, the discriminative subspace will characterize parts of the feature space that are more representative of the original data, and thus may provide robust video representations. To learn such descriptors, we formulate a subspace learning objective on the Stiefel manifold and resort to Riemannian optimization methods for solving it efficiently. We provide experiments on several video datasets and demonstrate state-of-the-art results.

preprint2020arXiv

Enabling 5G on the Ocean: A Hybrid Satellite-UAV-Terrestrial Network Solution

Current fifth generation (5G) cellular networks mainly focus on the terrestrial scenario. Due to the difficulty of deploying communications infrastructure on the ocean, the performance of existing maritime communication networks (MCNs) is far behind 5G. This problem can be solved by using unmanned aerial vehicles (UAVs) as agile aerial platforms to enable on-demand maritime coverage, as a supplement to marine satellites and shore-based terrestrial base stations (TBSs). In this paper, we study the integration of UAVs with existing MCNs, and investigate the potential gains of hybrid satellite-UAV-terrestrial networks for maritime coverage. Unlike the terrestrial scenario, vessels on the ocean keep to sea lanes and are sparsely distributed. This provides new opportunities to ease the scheduling of UAVs. Also, new challenges arise due to the more complicated maritime propagation environment, as well as the mutual interference between UAVs and existing satellites/TBSs. We discuss these issues and show possible solutions considering practical constraints.

preprint2020arXiv

Energy Efficiency Optimization for Downlink Massive MIMO With Statistical CSIT

We investigate energy efficiency (EE) optimization for single-cell massive multiple-input multiple-output (MIMO) downlink transmission with only statistical channel state information (CSI) available at the base station. We first show that beam domain transmission is favorable for energy efficiency in the massive MIMO downlink, by deriving a closed-form solution for the eigenvectors of the optimal transmit covariance matrix. With this conclusion, the EE optimization problem is reduced to a real-valued power allocation problem, which is much easier to tackle than the original large-dimensional complex matrix-valued precoding design problem. We further propose an iterative water-filling-structured beam domain power allocation algorithm with low complexity and guaranteed convergence, exploiting the techniques from sequential optimization, fractional optimization, and random matrix theory. Numerical results demonstrate the near-optimal performance of our proposed statistical CSI aided EE optimization approach.
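For readers unfamiliar with the water-filling structure mentioned in this abstract, the sketch below shows the classical rate-maximizing water-filling power allocation over parallel channels. This is the textbook technique, swapped in for illustration; it is not the paper's energy-efficiency algorithm, and the function name and the bisection approach are assumptions.

```python
import numpy as np

def water_filling(gains, total_power, iters=100):
    """Classical water-filling: maximize sum(log(1 + g_i * p_i))
    subject to sum(p_i) = total_power and p_i >= 0.

    The optimum has the form p_i = max(0, mu - 1/g_i); the water level mu
    is found by bisection on the total-power constraint.
    """
    inv = 1.0 / np.asarray(gains, dtype=float)
    lo, hi = 0.0, inv.max() + total_power   # hi always over-fills the budget
    for _ in range(iters):
        mu = 0.5 * (lo + hi)
        if np.maximum(mu - inv, 0.0).sum() > total_power:
            hi = mu
        else:
            lo = mu
    return np.maximum(0.5 * (lo + hi) - inv, 0.0)
```

With gains `[2.0, 1.0, 0.1]` and a unit power budget, the water level settles at 1.25, so the two strong channels receive 0.75 and 0.25 while the weak channel receives no power.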

preprint2020arXiv

Hysteresis in anesthesia and recovery: Experimental observation and dynamical mechanism

The dynamical mechanism underlying the processes of anesthesia-induced loss of consciousness and recovery is key to gaining insights into the working of the nervous system. Previous experiments revealed an asymmetry between neural signals during the anesthesia and recovery processes. Here we obtain experimental evidence for the hysteresis loop and articulate the dynamical mechanism based on percolation on multilayer complex networks with self-similarity. Model analysis reveals that, during anesthesia, the network is able to maintain its neural pathways despite the loss of a substantial fraction of the edges. A predictive and potentially testable result is that, in the forward process of anesthesia, the average shortest path and the clustering coefficient of the neural network are markedly smaller than those associated with the recovery process. This suggests that the network strives to maintain certain neurological functions by adapting to a relatively more compact structure in response to anesthesia.

preprint2020arXiv

Learning Color Compatibility in Fashion Outfits

Color compatibility is important for evaluating the compatibility of a fashion outfit, yet it was neglected in previous studies. We bring this important problem to researchers' attention and present a compatibility learning framework as a solution to various fashion tasks. The framework consists of a novel way to model outfit compatibility and an innovative learning scheme. Specifically, we model the outfits as graphs and propose a novel graph construction to better utilize the power of graph neural networks. Then we utilize both ground-truth labels and pseudo labels to train the compatibility model in a weakly-supervised manner. Extensive experimental results verify the importance of color compatibility along with the effectiveness of our framework. With color information alone, our model's performance is already comparable to previous methods that use deep image features. Our full model combining the aforementioned contributions sets the new state-of-the-art in fashion compatibility prediction.

preprint2020arXiv

New quasi-universal relations for static and rapid rotating neutron stars

In the last few decades, many universal relations between different global physical quantities of neutron stars have been proposed to constrain properties of neutron stars that are unobservable or hard to observe. But few of them are related to the gravitational redshift or the gravitational binding energy, especially for fast rotating neutron stars. Here we focus on the universal relations related to these two quantities. Based on 11 equations of state (EOSs) from the predictions of microscopic nuclear many-body theories for normal or hybrid neutron stars, we propose a set of new quasi-universal relations under three rotating cases: static, general rotating and Keplerian rotating. These new quasi-universal relations provide a potential way to constrain or estimate properties of neutron stars that are unobservable or hard to observe.

preprint2020arXiv

OccInpFlow: Occlusion-Inpainting Optical Flow Estimation by Unsupervised Learning

Occlusion is an inevitable and critical problem in unsupervised optical flow learning. Existing methods either treat occlusions equally as non-occluded regions or simply remove them to avoid incorrectness. However, the occlusion regions can provide effective information for optical flow learning. In this paper, we present OccInpFlow, an occlusion-inpainting framework to make full use of occlusion regions. Specifically, a new appearance-flow network is proposed to inpaint occluded flows based on the image content. Moreover, a boundary warp is proposed to deal with occlusions caused by displacement beyond image border. We conduct experiments on multiple leading flow benchmark data sets such as Flying Chairs, KITTI and MPI-Sintel, which demonstrate that the performance is significantly improved by our proposed occlusion handling framework.

preprint2020arXiv

One-Dimensional Moiré Excitons in Transition-Metal Dichalcogenide Heterobilayers

The formation of interfacial moiré patterns from angular and/or lattice mismatch has become a powerful approach to engineer a range of quantum phenomena in van der Waals heterostructures. For long-lived and valley-polarized interlayer excitons in transition-metal dichalcogenide (TMDC) heterobilayers, signatures of quantum confinement by the moiré landscape have been reported in recent experimental studies. Such moiré confinement has offered the exciting possibility to tailor new excitonic systems, such as ordered arrays of zero-dimensional (0D) quantum emitters and their coupling into topological superlattices. A remarkable nature of the moiré potential is its dramatic response to strain, where a small uniaxial strain can tune the array of quantum-dot-like 0D traps into parallel stripes of one-dimensional (1D) quantum wires. Here, we present direct evidence for the 1D moiré potentials from real space imaging and the corresponding 1D moiré excitons from photoluminescence (PL) emission in MoSe2/WSe2 heterobilayers. Whereas the 0D moiré excitons display quantum emitter-like sharp PL peaks with circular polarization, the PL emission from 1D moiré excitons has linear polarization and two orders of magnitude higher intensity. The results presented here establish strain engineering as a powerful new method to tailor moiré potentials as well as their optical and electronic responses on demand.

preprint2020arXiv

Outage Analysis for Intelligent Reflecting Surface Assisted Vehicular Communication Networks

Vehicular communication is an important application of the fifth generation of mobile communication systems (5G). Due to its low cost and energy efficiency, intelligent reflecting surface (IRS) has been envisioned as a promising technique that can enhance the coverage performance significantly by passive beamforming. In this paper, we analyze the outage probability performance in IRS-assisted vehicular communication networks. We derive the expression of outage probability by utilizing series expansion and central limit theorem. Numerical results show that the IRS can significantly reduce the outage probability for vehicles in its vicinity. The outage probability is closely related to the vehicle density and the number of IRS elements, and better performance is achieved with more reflecting elements.
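The qualitative claim that more reflecting elements lower the outage probability can be checked with a toy Monte Carlo simulation. Everything in the sketch below is an illustrative assumption rather than the paper's model: it uses independent unit-power Rayleigh fading on both hops, ideal phase alignment at the IRS, no direct link, and hypothetical parameter names. The paper itself derives an analytical expression using series expansion and the central limit theorem.

```python
import numpy as np

def irs_outage_mc(n_elements, snr_ref, snr_threshold, trials=20_000, seed=0):
    """Monte Carlo estimate of P(SNR < snr_threshold) for a toy IRS link.

    Assumed model: received amplitude = sum over elements of |h_n| * |g_n|,
    with |h_n| and |g_n| i.i.d. Rayleigh (unit average power) and perfect
    phase alignment, so that SNR = snr_ref * amplitude**2.
    """
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(2.0)  # gives E[|h|^2] = 1
    h = rng.rayleigh(scale, size=(trials, n_elements))
    g = rng.rayleigh(scale, size=(trials, n_elements))
    amplitude = (h * g).sum(axis=1)
    snr = snr_ref * amplitude ** 2
    return float(np.mean(snr < snr_threshold))
```

Under these assumptions, comparing `irs_outage_mc(8, 1.0, 40.0)` with `irs_outage_mc(64, 1.0, 40.0)` shows the estimated outage probability dropping sharply as the number of elements grows.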

preprint2020arXiv

Spatio-Temporal Ranked-Attention Networks for Video Captioning

Generating video descriptions automatically is a challenging task that involves a complex interplay between spatio-temporal visual features and language models. Given that videos consist of spatial (frame-level) features and their temporal evolutions, an effective captioning model should be able to attend to these different cues selectively. To this end, we propose a Spatio-Temporal and Temporo-Spatial (STaTS) attention model which, conditioned on the language state, hierarchically combines spatial and temporal attention to videos in two different orders: (i) a spatio-temporal (ST) sub-model, which first attends to regions that have temporal evolution, then temporally pools the features from these regions; and (ii) a temporo-spatial (TS) sub-model, which first decides a single frame to attend to, then applies spatial attention within that frame. We propose a novel LSTM-based temporal ranking function, which we call ranked attention, for the ST model to capture action dynamics. Our entire framework is trained end-to-end. We provide experiments on two benchmark datasets: MSVD and MSR-VTT. Our results demonstrate the synergy between the ST and TS modules, outperforming recent state-of-the-art methods.

preprint2020arXiv

Surjectivity of Convolution Operators on Noncompact Symmetric Spaces

Let $\mu$ be a $K$-invariant compactly supported distribution on a noncompact Riemannian symmetric space $X=G/K$. If the spherical Fourier transform $\widetilde{\mu}(\lambda)$ is slowly decreasing, it is known that the right convolution operator $c_\mu\colon f\mapsto f*\mu$ maps $\mathcal E(X)$ onto $\mathcal E(X)$. In this paper, we prove the converse of this result. We also prove that $c_\mu$ has a fundamental solution if and only if $\widetilde{\mu}(\lambda)$ is slowly decreasing.

preprint2016arXiv

Appearance Harmonization for Single Image Shadow Removal

Shadows often create unwanted artifacts in photographs, and removing them can be very challenging. Previous shadow removal methods often produce de-shadowed regions that are visually inconsistent with the rest of the image. In this work we propose a fully automatic shadow region harmonization approach that improves the appearance compatibility of the de-shadowed region as typically produced by previous methods. It is based on a shadow-guided patch-based image synthesis approach that reconstructs the shadow region using patches sampled from non-shadowed regions. The result is then refined based on the reconstruction confidence to handle unique image patterns. Many shadow removal results and comparisons show the effectiveness of our improvement. Quantitative evaluation on a benchmark dataset suggests that our automatic shadow harmonization approach effectively improves upon the state-of-the-art.

preprint2016arXiv

Deep Video Deblurring

Motion blur from camera shake is a major problem in videos captured by hand-held devices. Unlike single-image deblurring, video-based approaches can take advantage of the abundant information that exists across neighboring frames. As a result the best performing methods rely on aligning nearby frames. However, aligning images is a computationally expensive and fragile procedure, and methods that aggregate information must therefore be able to identify which regions have been accurately aligned and which have not, a task which requires high level scene understanding. In this work, we introduce a deep learning solution to video deblurring, where a CNN is trained end-to-end to learn how to accumulate information across frames. To train this network, we collected a dataset of real videos recorded with a high framerate camera, which we use to generate synthetic motion blur for supervision. We show that the features learned from this dataset extend to deblurring motion blur that arises due to camera shake in a wide range of videos, and compare the quality of results to a number of other baselines.

preprint2016arXiv

Large-Scale MIMO Secure Transmission with Finite Alphabet Inputs

In this paper, we investigate secure transmission over the large-scale multiple-antenna wiretap channel with finite alphabet inputs. First, we show analytically that a generalized singular value decomposition (GSVD) based design, which is optimal for Gaussian inputs, may exhibit a severe performance loss for finite alphabet inputs in the high signal-to-noise ratio (SNR) regime. In light of this, we propose a novel Per-Group-GSVD (PG-GSVD) design which can effectively compensate for the performance loss caused by the GSVD design. More importantly, the computational complexity of the PG-GSVD design is orders of magnitude lower than that of the existing design for finite alphabet inputs in [1], while the resulting performance loss is minimal. Numerical results indicate that the proposed PG-GSVD design can be efficiently implemented in large-scale multiple-antenna systems and achieves significant performance gains compared to the GSVD design.

preprint2016arXiv

Quantum oscillation and nontrivial transport in the Dirac Semimetal Cd3As2 nanodevice

Here we demonstrate Shubnikov-de Haas oscillations in high-quality Cd3As2 nanowires grown by a chemical vapor deposition approach. The dominant transport of topological Dirac fermions is evidenced by the nontrivial Berry phase in the Landau fan diagram. The quantum oscillations emerge at a low field of 2 T and persist up to 100 K, revealing a sizeable Landau level gap and a mobility of over 2000 cm$^2$ V$^{-1}$ s$^{-1}$. The angle-dependent oscillations indicate the isotropy of the bulk Dirac transport. The large estimated mean free path suggests one-dimensional transport in Dirac semimetals.

preprint2016arXiv

Segmentation Rectification for Video Cutout via One-Class Structured Learning

Recent works on interactive video object cutout mainly focus on designing dynamic foreground-background (FB) classifiers for segmentation propagation. However, research on optimally removing errors from the FB classification is sparse, and the errors often accumulate rapidly, causing significant errors in the propagated frames. In this work, we take initial steps toward addressing this problem, and we call this new task segmentation rectification. Our key observation is that false positive and false negative errors, which may be asymmetrically distributed, were handled equally in conventional methods. We instead propose to optimally remove these two types of errors. To this end, we propose a novel bilayer Markov Random Field (MRF) model for this new task. We also adopt the well-established structured learning framework to learn the optimal model from data. Additionally, we propose a novel one-class structured SVM (OSSVM) which greatly speeds up the structured learning process. Our method naturally extends to RGB-D videos as well. Comprehensive experiments on both RGB and RGB-D data demonstrate that our simple and effective method significantly outperforms the segmentation propagation methods adopted in the state-of-the-art video cutout systems, and the results also suggest the potential usefulness of our method in image cutout systems.

preprint2015arXiv

CamSwarm: Instantaneous Smartphone Camera Arrays for Collaborative Photography

Camera arrays (CamArrays) are widely used in commercial filming projects to achieve special visual effects such as the bullet-time effect, but are very expensive to set up. We propose CamSwarm, a low-cost and lightweight alternative to professional CamArrays for consumer applications. It allows the construction of a collaborative photography platform from multiple mobile devices anywhere and anytime, enabling new capturing and editing experiences that a single camera cannot provide. Our system allows easy team formation; uses real-time visualization and feedback to guide camera positioning; provides a mechanism for synchronized capturing; and finally allows the user to efficiently browse and edit the captured imagery. Our user study suggests that CamSwarm is easy to use; the provided real-time guidance is helpful; and the full system achieves high-quality results that are promising for non-professional use. A demo video is provided at https://www.youtube.com/watch?v=LgkHcvcyTTM.

preprint2015arXiv

Jamming-Aided Secure Communication in Massive MIMO Rician Channels

In this paper, we investigate the artificial noise-aided jamming design for a transmitter equipped with a large antenna array in Rician fading channels. We find that when the number of transmit antennas tends to infinity, whether secrecy outage happens in a Rician channel depends on the geometric locations of eavesdroppers. In this light, we first define and analytically describe the secrecy outage region (SOR), indicating all possible locations of an eavesdropper that can cause secrecy outage. After that, the secrecy outage probability (SOP) is derived, and a jamming-beneficial range, i.e., the distance range of eavesdroppers which enables uniform jamming to reduce the SOP, is determined. Then, the optimal power allocation between messages and artificial noise is investigated for different scenarios. Furthermore, to use the jamming power more efficiently and further reduce the SOP, we propose directional jamming that generates jamming signals at selected beams (mapped to physical angles) only, and power allocation algorithms are proposed for the cases with and without information about the suspicious area, i.e., possible locations of eavesdroppers. We further extend the discussion to multiuser and multi-cell scenarios. Finally, numerical results validate our conclusions and show the effectiveness of our proposed jamming power allocation schemes.

preprint2015arXiv

PanoSwarm: Collaborative and Synchronized Multi-Device Panoramic Photography

Taking a picture has traditionally been a one-person task. In this paper we present a novel system that allows multiple mobile devices to work collaboratively in a synchronized fashion to capture a panorama of a highly dynamic scene, creating an entirely new photography experience that encourages social interactions and teamwork. Our system contains two components: a client app that runs on all participating devices, and a server program that monitors and communicates with each device. In a capturing session, the server collects in real time the viewfinder images of all devices and stitches them on-the-fly to create a panorama preview, which is then streamed to all devices as visual guidance. The system also allows one camera to be the host and to send direct visual instructions to others to guide camera adjustment. When ready, all devices take pictures at the same time for panorama stitching. Our preliminary study suggests that the proposed system can help users capture high-quality panoramas with an enjoyable teamwork experience. A demo video of the system in action is provided at http://youtu.be/PwQ6k_ZEQSs.

preprint2013arXiv

Classification of Indecomposable Flows of Signed Graphs

An indecomposable flow $f$ on a signed graph $Σ$ is a nontrivial integral flow that cannot be decomposed into $f=f_1+f_2$, where $f_1,f_2$ are nontrivial integral flows having the same sign (both $\geq 0$ or both $\leq 0$) at each edge of $Σ$. This paper classifies indecomposable flows into characteristic vectors of circuits and Eulerian cycle-trees, a class of signed graphs having a kind of tree structure in which all cycles can be viewed as vertices of a tree. Moreover, each indecomposable flow other than circuit characteristic vectors can be further decomposed into a sum of certain half circuit characteristic vectors having the same sign at each edge. The variety of indecomposable flows of signed graphs is much richer than that of ordinary unsigned graphs.

preprint2012arXiv

The Bayesian process control with multiple assignable causes

We study an optimal process control problem with multiple assignable causes. The process is initially in-control but is subject to random transition to one of multiple out-of-control states due to assignable causes. The objective is to find an optimal stopping rule under partial observation that maximizes the total expected reward over an infinite horizon. The problem is formulated as a partially observable Markov decision process (POMDP) with the belief space consisting of state probability vectors. New observations are obtained at a fixed sampling interval to update the belief vector using Bayes' theorem. Under standard assumptions, we show that a conditional control limit policy is optimal and that there exists a convex, non-increasing control limit that partitions the belief space into two individually connected control regions: a stopping region and a continuation region. We further derive analytical bounds for the control limit. An algorithm is devised based on the structural results, which considerably reduces the computation. We also shed light on the selection of the optimal fixed sampling interval.
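
The Bayesian belief update mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the function name `update_belief`, the three-state transition matrix, and the likelihood values are all hypothetical, and the out-of-control states are assumed to be absorbing purely for illustration.

```python
def update_belief(belief, transition, likelihood):
    """One Bayes step for a belief vector over process states.

    belief[i]        : prior probability of state i
    transition[i][j] : assumed probability of moving from state i to state j
                       during one sampling interval
    likelihood[j]    : assumed probability (or density) of the new observation
                       given that the process is in state j
    """
    n = len(belief)
    # Prediction: propagate the prior through the transition matrix.
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    # Correction: weight each state by how well it explains the observation.
    unnormalized = [p * l for p, l in zip(predicted, likelihood)]
    total = sum(unnormalized)
    if total <= 0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalized]


# Hypothetical example: state 0 is in-control, states 1 and 2 are
# out-of-control states caused by two different assignable causes.
transition = [
    [0.90, 0.06, 0.04],
    [0.00, 1.00, 0.00],  # assumed absorbing
    [0.00, 0.00, 1.00],  # assumed absorbing
]
belief = [1.0, 0.0, 0.0]        # the process starts in-control
likelihood = [0.2, 0.7, 0.5]    # made-up likelihoods of one observation

posterior = update_belief(belief, transition, likelihood)
print(posterior)
```

A stopping rule of the kind the abstract describes would then compare the updated belief against a control limit; that limit is derived in the paper and is not reproduced here.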

preprint2011arXiv

The anomalous top quark coupling tqg and tW production at the LHC

Many new physics models beyond the standard model ($SM$) can give rise to large anomalous top couplings $tqg$ ($q=u$ and $c$). We focus our attention on these couplings induced by the topcolor-assisted technicolor ($TC2$) model and the littlest Higgs model with $T$-parity (called the $LHT$ model), and consider their contributions to the production cross section and the charge asymmetry for $tW$ production at the $LHC$. We find that the anomalous top coupling $tqg$ induced by these two kinds of new physics models can indeed generate sizable charge asymmetry. The corrections of the $LHT$ model to the production cross sections of the processes $pp\rightarrow tW^-+X$ and $pp\rightarrow \bar{t}W^++X$ are significantly large, and might be detected at the $LHC$.

preprint2010arXiv

Density, structure and dynamics of water: the effect of Van der Waals interactions

It is known that ab initio molecular dynamics (AIMD) simulations of liquid water, based on the generalized gradient approximation (GGA) to density functional theory (DFT), yield structural and diffusive properties in reasonable agreement with experiment only if artificially high temperatures are used in the simulations. The equilibrium density, at normal conditions, of DFT water has been recently shown by Schmidt et al. [J. Phys. Chem. B, 113, 11959 (2009)] to be underestimated by different GGA functionals for exchange and correlation, and corrected by the addition of interatomic pair potentials to describe van der Waals (vdW) interactions. In this contribution we present a DFT-AIMD study of liquid water using several GGA functionals as well as the van der Waals density functional (vdW-DF) of Dion et al. [Phys. Rev. Lett. 92, 246401 (2004)]. As expected, we find that the density of water is grossly underestimated by GGA functionals. When a vdW-DF is used, the density improves drastically and the experimental diffusivity is reproduced without the need for thermal corrections. We analyze the origin of the density differences between all the functionals. We show that the vdW-DF increases the population of non-H-bonded interstitial sites, at distances between the first and second coordination shells. However, it excessively weakens the H-bond network, collapsing the second coordination shell. This structural problem is partially associated with the choice of GGA exchange in the vdW-DF. We show that a different choice for the exchange functional is enough to achieve an overall improvement both in structure and diffusivity.

preprint2010arXiv

Single production of the doubly charged Higgs boson via eγ collision in the Higgs triplet model

The Higgs triplet model (HTM) predicts the existence of a pair of doubly charged Higgs bosons $H^{\pm \pm}$. Single production of $H^{\pm \pm}$ via eγ collision at the next generation $e^+e^-$ International Linear Collider (ILC) and the Large Hadron electron Collider (LHeC) is considered. The numerical results show that the production cross sections are very sensitive to the neutrino oscillation parameters. Their values for the inverted hierarchy mass spectrum are larger than those for the normal hierarchy mass spectrum at both of these collider experiments. With reasonable values of the relevant free parameters, the possible signals of the doubly charged Higgs bosons predicted by the HTM might be detected in future ILC experiments.

59 published item(s)

Search Your Block Floating Point Scales!

Evidence for Exciton Crystals in a 2D Semiconductor Heterotrilayer

Fluid Antenna-Assisted MIMO Transmission Exploiting Statistical CSI

Boosting Fast Adversarial Training with Learnable Adversarial Initialization

Control-Oriented Power Allocation for Integrated Satellite-UAV Networks

Deblur-NeRF: Neural Radiance Fields from Blurry Images

Deformable Video Transformer

Energy Efficiency Maximization of Massive MIMO Communications With Dynamic Metasurface Antennas

Fashion Captioning: Towards Generating Accurate Descriptions with Semantic Rewards

Fast Adversarial Training with Adaptive Step Size

FENeRF: Face Editing in Neural Radiance Fields

Hallucinated Neural Radiance Fields in the Wild

Hybrid RIS and DMA Assisted Multiuser MIMO Uplink Transmission With Electromagnetic Exposure Constraints

HyP$^2$ Loss: Beyond Hypersphere Metric Space for Multi-label Image Retrieval

IDE-3D: Interactive Disentangled Editing for High-Resolution 3D-aware Portrait Synthesis

Improving the Latent Space of Image Style Transfer

LAS-AT: Adversarial Training with Learnable Attack Strategy

LocVTP: Video-Text Pre-training for Temporal Localization

Long-Short Temporal Contrastive Learning of Video Transformers

Motion-aware Contrastive Video Representation Learning via Foreground-background Merging

Multi-Robot Active Mapping via Neural Bipartite Graph Matching

Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations

Parallel measurements of vibrational modes in a few-layer graphene nanomechanical resonator using software-defined radio dongles

Prior-Guided Adversarial Initialization for Fast Adversarial Training

Reinforcement Learning-Empowered Mobile Edge Computing for 6G Edge Intelligence

Self-supervised Learning of Adversarial Example: Towards Good Generalizations for Deepfake Detection

StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN

Towards Accurate Active Camera Localization

Towards Real-World Video Deblurring by Exploring Blur Formation Process

Truncate-Split-Contrast: A Framework for Learning from Mislabeled Videos

Unsupervised Pre-training for Temporal Action Localization Tasks

UPHDR-GAN: Generative Adversarial Network for High Dynamic Range Imaging with Unpaired Data

VDTR: Video Deblurring with Transformer

VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs

Content-Aware Unsupervised Deep Homography Estimation

Contrastive Video Representation Learning via Adversarial Perturbations

Enabling 5G on the Ocean: A Hybrid Satellite-UAV-Terrestrial Network Solution

Energy Efficiency Optimization for Downlink Massive MIMO With Statistical CSIT

Hysteresis in anesthesia and recovery: Experimental observation and dynamical mechanism

Learning Color Compatibility in Fashion Outfits

New quasi-universal relations for static and rapid rotating neutron stars

OccInpFlow: Occlusion-Inpainting Optical Flow Estimation by Unsupervised Learning

One-Dimensional Moiré Excitons in Transition-Metal Dichalcogenide Heterobilayers

Outage Analysis for Intelligent Reflecting Surface Assisted Vehicular Communication Networks

Spatio-Temporal Ranked-Attention Networks for Video Captioning

Surjectivity of Convolution Operators on Noncompact Symmetric Spaces

Appearance Harmonization for Single Image Shadow Removal

Deep Video Deblurring

Large-Scale MIMO Secure Transmission with Finite Alphabet Inputs

Quantum oscillation and nontrivial transport in the Dirac Semimetal Cd3As2 nanodevice

Segmentation Rectification for Video Cutout via One-Class Structured Learning

CamSwarm: Instantaneous Smartphone Camera Arrays for Collaborative Photography

Jamming-Aided Secure Communication in Massive MIMO Rician Channels

PanoSwarm: Collaborative and Synchronized Multi-Device Panoramic Photography

Classification of Indecomposable Flows of Signed Graphs

The Bayesian process control with multiple assignable causes

The anomalous top quark coupling tqg and tW production at the LHC

Density, structure and dynamics of water: the effect of Van der Waals interactions

Single production of the doubly charged Higgs boson via eγ collision in the Higgs triplet model