Source author record

Hao Tang

Hao Tang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Catalog footprint

What is connected

62works

21topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

A framework for analyzing concept representations in neural models

Understanding how neural models represent human-interpretable concepts is challenging. Prior work has explored linear concept subspaces from diverse perspectives, such as probing and concept erasure. We introduce a unified framework to study these subspaces along two axes: \textit{containment}, which tests if a concept is fully represented in a subspace but not outside it, and \textit{disentanglement}, which tests for isolation from other concepts. In experiments on both text and speech models, we first highlight that concept subspaces may not be uniquely determined, and discuss the implications for concept subspace analysis. Then, we compare properties of concept subspaces estimated using five estimators, proposed in different communities. We find that (1) the choice of estimator impacts the containment and disentanglement properties; (2) the state-of-the-art concept erasure method, LEACE, performs well on both testing axes, but still struggles to generalize to unseen data; and (3) in HuBERT speech representations, phone information is both contained and disentangled from speaker information, while speaker information is hard to contain in a compact subspace, despite being disentangled from phones.

preprint2023arXiv

Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment

Few-shot fine-grained recognition (FS-FGR) aims to recognize novel fine-grained categories with the help of limited available samples. Undoubtedly, this task inherits the main challenges from both few-shot learning and fine-grained recognition. First, the lack of labeled samples makes the learned model easy to overfit. Second, it also suffers from high intra-class variance and low inter-class differences in the datasets. To address this challenging task, we propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local-to-local (L2L) similarity metric. Specifically, the BAS is introduced to generate a foreground mask for localization to weaken background disturbance and enhance dominative foreground objects. The FOA then reconstructs the feature map of each support sample according to its correction to the query ones, which addresses the problem of misalignment between support-query image pairs. To enable the proposed method to have the ability to capture subtle differences in confused samples, we present a novel L2L similarity metric to further measure the local similarity between a pair of aligned spatial features in the embedding space. What's more, considering that background interference brings poor robustness, we infer the pairwise similarity of feature maps using both the raw image and the refined image. Extensive experiments conducted on multiple popular fine-grained benchmarks demonstrate that our method outperforms the existing state of the art by a large margin. The source codes are available at: https://github.com/CSer-Tang-hao/BSFA-FSFG.

preprint2023arXiv

Controllable Person Image Synthesis with Spatially-Adaptive Warped Normalization

Controllable person image generation aims to produce realistic human images with desirable attributes such as a given pose, cloth textures, or hairstyles. However, the large spatial misalignment between source and target images makes the standard image-to-image translation architectures unsuitable for this task. Most state-of-the-art methods focus on alignment for global pose-transfer tasks. However, they fail to deal with region-specific texture-transfer tasks, especially for person images with complex textures. To solve this problem, we propose a novel Spatially-Adaptive Warped Normalization (SAWN) which integrates a learned flow-field to warp modulation parameters. It allows us to efficiently align person spatially-adaptive styles with pose features. Moreover, we propose a novel Self-Training Part Replacement (STPR) strategy to refine the model for the texture-transfer task, which improves the quality of the generated clothes and the preservation ability of non-target regions. Our experimental results on the widely used DeepFashion dataset demonstrate a significant improvement of the proposed method over the state-of-the-art methods on pose-transfer and texture-transfer tasks. The code is available at https://github.com/zhangqianhui/Sawn.

preprint2022arXiv

3D-Aware Semantic-Guided Generative Model for Human Synthesis

Generative Neural Radiance Field (GNeRF) models, which extract implicit 3D representations from 2D images, have recently been shown to produce realistic images representing rigid/semi-rigid objects, such as human faces or cars. However, they usually struggle to generate high-quality images representing non-rigid objects, such as the human body, which is of a great interest for many computer graphics applications. This paper proposes a 3D-aware Semantic-Guided Generative Model (3D-SGAN) for human image synthesis, which combines a GNeRF with a texture generator. The former learns an implicit 3D representation of the human body and outputs a set of 2D semantic segmentation masks. The latter transforms these semantic masks into a real image, adding a realistic texture to the human appearance. Without requiring additional 3D information, our model can learn 3D human representations with a photo-realistic, controllable generation. Our experiments on the DeepFashion dataset show that 3D-SGAN significantly outperforms the most recent baselines. The code is available at https://github.com/zhangqianhui/3DSGAN

preprint2022arXiv

AO2-DETR: Arbitrary-Oriented Object Detection Transformer

Arbitrary-oriented object detection (AOOD) is a challenging task to detect objects in the wild with arbitrary orientations and cluttered arrangements. Existing approaches are mainly based on anchor-based boxes or dense points, which rely on complicated hand-designed processing steps and inductive bias, such as anchor generation, transformation, and non-maximum suppression reasoning. Recently, the emerging transformer-based approaches view object detection as a direct set prediction problem that effectively removes the need for hand-designed components and inductive biases. In this paper, we propose an Arbitrary-Oriented Object DEtection TRansformer framework, termed AO2-DETR, which comprises three dedicated components. More precisely, an oriented proposal generation mechanism is proposed to explicitly generate oriented proposals, which provides better positional priors for pooling features to modulate the cross-attention in the transformer decoder. An adaptive oriented proposal refinement module is introduced to extract rotation-invariant region features and eliminate the misalignment between region features and objects. And a rotation-aware set matching loss is used to ensure the one-to-one matching process for direct set prediction without duplicate predictions. Our method considerably simplifies the overall pipeline and presents a new AOOD paradigm. Comprehensive experiments on several challenging datasets show that our method achieves superior performance on the AOOD task.

preprint2022arXiv

Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization

Vision transformers (ViTs) are emerging with significantly improved accuracy in computer vision tasks. However, their complex architecture and enormous computation/storage demand impose urgent needs for new hardware accelerator design methodology. This work proposes an FPGA-aware automatic ViT acceleration framework based on the proposed mixed-scheme quantization. To the best of our knowledge, this is the first FPGA-based ViT acceleration framework exploring model quantization. Compared with state-of-the-art ViT quantization work (algorithmic approach only without hardware acceleration), our quantization achieves 0.47% to 1.36% higher Top-1 accuracy under the same bit-width. Compared with the 32-bit floating-point baseline FPGA accelerator, our accelerator achieves around 5.6x improvement on the frame rate (i.e., 56.8 FPS vs. 10.0 FPS) with 0.71% accuracy drop on ImageNet dataset for DeiT-base.

preprint2022arXiv

CharFormer: A Glyph Fusion based Attentive Framework for High-precision Character Image Denoising

Degraded images commonly exist in the general sources of character images, leading to unsatisfactory character recognition results. Existing methods have dedicated efforts to restoring degraded character images. However, the denoising results obtained by these methods do not appear to improve character recognition performance. This is mainly because current methods only focus on pixel-level information and ignore critical features of a character, such as its glyph, resulting in character-glyph damage during the denoising process. In this paper, we introduce a novel generic framework based on glyph fusion and attention mechanisms, i.e., CharFormer, for precisely recovering character images without changing their inherent glyphs. Unlike existing frameworks, CharFormer introduces a parallel target task for capturing additional information and injecting it into the image denoising backbone, which will maintain the consistency of character glyphs during character image denoising. Moreover, we utilize attention-based networks for global-local feature interaction, which will help to deal with blind denoising and enhance denoising performance. We compare CharFormer with state-of-the-art methods on multiple datasets. The experimental results show the superiority of CharFormer quantitatively and qualitatively.

preprint2022arXiv

Compiler-Aware Neural Architecture Search for On-Mobile Real-time Super-Resolution

Deep learning-based super-resolution (SR) has gained tremendous popularity in recent years because of its high image quality performance and wide application scenarios. However, prior methods typically suffer from large amounts of computations and huge power consumption, causing difficulties for real-time inference, especially on resource-limited platforms such as mobile devices. To mitigate this, we propose a compiler-aware SR neural architecture search (NAS) framework that conducts depth search and per-layer width search with adaptive SR blocks. The inference speed is directly taken into the optimization along with the SR loss to derive SR models with high image quality while satisfying the real-time inference requirement. Instead of measuring the speed on mobile devices at each iteration during the search process, a speed model incorporated with compiler optimizations is leveraged to predict the inference latency of the SR block with various width configurations for faster convergence. With the proposed framework, we achieve real-time SR inference for implementing 720p resolution with competitive SR performance (in terms of PSNR and SSIM) on GPU/DSP of mobile platforms (Samsung Galaxy S21).

preprint2022arXiv

Continual Attentive Fusion for Incremental Learning in Semantic Segmentation

Over the past years, semantic segmentation, as many other tasks in computer vision, benefited from the progress in deep neural networks, resulting in significantly improved performance. However, deep architectures trained with gradient-based techniques suffer from catastrophic forgetting, which is the tendency to forget previously learned knowledge while learning new tasks. Aiming at devising strategies to counteract this effect, incremental learning approaches have gained popularity over the past years. However, the first incremental learning methods for semantic segmentation appeared only recently. While effective, these approaches do not account for a crucial aspect in pixel-level dense prediction problems, i.e. the role of attention mechanisms. To fill this gap, in this paper we introduce a novel attentive feature distillation approach to mitigate catastrophic forgetting while accounting for semantic spatial- and channel-level dependencies. Furthermore, we propose a {continual attentive fusion} structure, which takes advantage of the attention learned from the new and the old tasks while learning features for the new task. Finally, we also introduce a novel strategy to account for the background class in the distillation loss, thus preventing biased predictions. We demonstrate the effectiveness of our approach with an extensive evaluation on Pascal-VOC 2012 and ADE20K, setting a new state of the art.

preprint2022arXiv

Contrastive Learning from Spatio-Temporal Mixed Skeleton Sequences for Self-Supervised Skeleton-Based Action Recognition

Self-supervised skeleton-based action recognition with contrastive learning has attracted much attention. Recent literature shows that data augmentation and large sets of contrastive pairs are crucial in learning such representations. In this paper, we found that directly extending contrastive pairs based on normal augmentations brings limited returns in terms of performance, because the contribution of contrastive pairs from the normal data augmentation to the loss get smaller as training progresses. Therefore, we delve into hard contrastive pairs for contrastive learning. Motivated by the success of mixing augmentation strategy which improves the performance of many tasks by synthesizing novel samples, we propose SkeleMixCLR: a contrastive learning framework with a spatio-temporal skeleton mixing augmentation (SkeleMix) to complement current contrastive learning approaches by providing hard contrastive samples. First, SkeleMix utilizes the topological information of skeleton data to mix two skeleton sequences by randomly combing the cropped skeleton fragments (the trimmed view) with the remaining skeleton sequences (the truncated view). Second, a spatio-temporal mask pooling is applied to separate these two views at the feature level. Third, we extend contrastive pairs with these two views. SkeleMixCLR leverages the trimmed and truncated views to provide abundant hard contrastive pairs since they involve some context information from each other due to the graph convolution operations, which allows the model to learn better motion representations for action recognition. Extensive experiments on NTU-RGB+D, NTU120-RGB+D, and PKU-MMD datasets show that SkeleMixCLR achieves state-of-the-art performance. Codes are available at https://github.com/czhaneva/SkeleMixCLR.

preprint2022arXiv

DE-Net: Dynamic Text-guided Image Editing Adversarial Networks

Text-guided image editing models have shown remarkable results. However, there remain two problems. First, they employ fixed manipulation modules for various editing requirements (e.g., color changing, texture changing, content adding and removing), which results in over-editing or insufficient editing. Second, they do not clearly distinguish between text-required and text-irrelevant parts, which leads to inaccurate editing. To solve these limitations, we propose: (i) a Dynamic Editing Block (DEBlock) which composes different editing modules dynamically for various editing requirements. (ii) a Composition Predictor (Comp-Pred) which predicts the composition weights for DEBlock according to the inference on target texts and source images. (iii) a Dynamic text-adaptive Convolution Block (DCBlock) which queries source image features to distinguish text-required parts and text-irrelevant parts. Extensive experiments demonstrate that our DE-Net achieves excellent performance and manipulates source images more correctly and accurately. Code is available at \url{https://github.com/tobran/DE-Net}.

preprint2022arXiv

Deep Algebraic Fitting for Multiple Circle Primitives Extraction from Raw Point Clouds

The shape of circle is one of fundamental geometric primitives of man-made engineering objects. Thus, extraction of circles from scanned point clouds is a quite important task in 3D geometry data processing. However, existing circle extraction methods either are sensitive to the quality of raw point clouds when classifying circle-boundary points, or require well-designed fitting functions when regressing circle parameters. To relieve the challenges, we propose an end-to-end Point Cloud Circle Algebraic Fitting Network (Circle-Net) based on a synergy of deep circle-boundary point feature learning and weighted algebraic fitting. First, we design a circle-boundary learning module, which considers local and global neighboring contexts of each point, to detect all potential circle-boundary points. Second, we develop a deep feature based circle parameter learning module for weighted algebraic fitting, without designing any weight metric, to avoid the influence of outliers during fitting. Unlike most of the cutting-edge circle extraction wisdoms, the proposed classification-and-fitting modules are originally co-trained with a comprehensive loss to enhance the quality of extracted circles.Comparisons on the established dataset and real-scanned point clouds exhibit clear improvements of Circle-Net over SOTAs in terms of both noise-robustness and extraction accuracy. We will release our code, model, and data for both training and evaluation on GitHub upon publication.

preprint2022arXiv

Disentangle Saliency Detection into Cascaded Detail Modeling and Body Filling

Salient object detection has been long studied to identify the most visually attractive objects in images/videos. Recently, a growing amount of approaches have been proposed all of which rely on the contour/edge information to improve detection performance. The edge labels are either put into the loss directly or used as extra supervision. The edge and body can also be learned separately and then fused afterward. Both methods either lead to high prediction errors near the edge or cannot be trained in an end-to-end manner. Another problem is that existing methods may fail to detect objects of various sizes due to the lack of efficient and effective feature fusion mechanisms. In this work, we propose to decompose the saliency detection task into two cascaded sub-tasks, \emph{i.e.}, detail modeling and body filling. Specifically, the detail modeling focuses on capturing the object edges by supervision of explicitly decomposed detail label that consists of the pixels that are nested on the edge and near the edge. Then the body filling learns the body part which will be filled into the detail map to generate more accurate saliency map. To effectively fuse the features and handle objects at different scales, we have also proposed two novel multi-scale detail attention and body attention blocks for precise detail and body modeling. Experimental results show that our method achieves state-of-the-art performances on six public datasets.

preprint2022arXiv

Distribution-Path Dependent Nonlinear SPDEs with Application to Stochastic Transport Type Equations

By using a regularity approximation argument, the global existence and uniqueness are derived for a class of nonlinear SPDEs depending on both the whole history and the distribution under strong enough noise. As applications, the global existence and uniqueness are proved for distribution-path dependent stochastic transport type equations, which are arising from stochastic fluid mechanics with forces depending on the history and the environment. In particular, the distribution-path dependent stochastic Camassa--Holm equation with or without Coriolis effect has a unique global solution when the noise is strong enough, whereas for the deterministic model wave-breaking may occur. This indicates that the noise may prevent blow-up almost surely.

preprint2022arXiv

First-principles Calculation of the Temperature-dependent Transition Energies in Spin Defects

Spin qubits associated with color centers are promising platforms for various quantum technologies. However, to be deployed in robust quantum devices, the variations of their intrinsic properties with the external conditions, and in particular temperature, should be known with high precision. Unfortunately, a predictive theory on the temperature dependence of the resonance frequency of electron and nuclear spin defects in solids remains lacking. In this work, we develop a first-principles method for the temperature dependence of zero phonon line, zero-field splitting, hyperfine interaction, and nuclear quadrupole interaction of color centers. As a testbed, we compare our ab-initio calculation results with experiments in the Nitrogen-Vacancy (NV) center finding good agreement. Interestingly, we identify the major origin of temperature dependence as a second-order effect of phonon vibration. The method is generally applicable to different color centers and provides a theoretical tool for designing high-precision quantum sensors.

preprint2022arXiv

Graph-based Generative Face Anonymisation with Pose Preservation

We propose AnonyGAN, a GAN-based solution for face anonymisation which replaces the visual information corresponding to a source identity with a condition identity provided as any single image. With the goal to maintain the geometric attributes of the source face, i.e., the facial pose and expression, and to promote more natural face generation, we propose to exploit a Bipartite Graph to explicitly model the relations between the facial landmarks of the source identity and the ones of the condition identity through a deep model. We further propose a landmark attention model to relax the manual selection of facial landmarks, allowing the network to weight the landmarks for the best visual naturalness and pose preservation. Finally, to facilitate the appearance learning, we propose a hybrid training strategy to address the challenge caused by the lack of direct pixel-level supervision. We evaluate our method and its variants on two public datasets, CelebA and LFW, in terms of visual naturalness, facial pose preservation and of its impacts on face detection and re-identification. We prove that AnonyGAN significantly outperforms the state-of-the-art methods in terms of visual naturalness, face detection and pose preservation.

preprint2022arXiv

Hierarchical Sketch Induction for Paraphrase Generation

We propose a generative model of paraphrase generation, that encourages syntactic diversity by conditioning on an explicit syntactic sketch. We introduce Hierarchical Refinement Quantized Variational Autoencoders (HRQ-VAE), a method for learning decompositions of dense encodings as a sequence of discrete latent variables that make iterative refinements of increasing granularity. This hierarchy of codes is learned through end-to-end training, and represents fine-to-coarse grained information about the input. We use HRQ-VAE to encode the syntactic form of an input sentence as a path through the hierarchy, allowing us to more easily predict syntactic sketches at test time. Extensive experiments, including a human evaluation, confirm that HRQ-VAE learns a hierarchical representation of the input space, and generates paraphrases of higher quality than previous systems.

preprint2022arXiv

Identity-Sensitive Knowledge Propagation for Cloth-Changing Person Re-identification

Cloth-changing person re-identification (CC-ReID), which aims to match person identities under clothing changes, is a new rising research topic in recent years. However, typical biometrics-based CC-ReID methods often require cumbersome pose or body part estimators to learn cloth-irrelevant features from human biometric traits, which comes with high computational costs. Besides, the performance is significantly limited due to the resolution degradation of surveillance images. To address the above limitations, we propose an effective Identity-Sensitive Knowledge Propagation framework (DeSKPro) for CC-ReID. Specifically, a Cloth-irrelevant Spatial Attention module is introduced to eliminate the distraction of clothing appearance by acquiring knowledge from the human parsing module. To mitigate the resolution degradation issue and mine identity-sensitive cues from human faces, we propose to restore the missing facial details using prior facial knowledge, which is then propagated to a smaller network. After training, the extra computations for human parsing or face restoration are no longer required. Extensive experiments show that our framework outperforms state-of-the-art methods by a large margin. Our code is available at https://github.com/KimbingNg/DeskPro.

preprint2022arXiv

Local and Global GANs with Semantic-Aware Upsampling for Image Generation

In this paper, we address the task of semantic-guided image generation. One challenge common to most existing image-level generation methods is the difficulty in generating small objects and detailed local textures. To address this, in this work we consider generating images using local context. As such, we design a local class-specific generative network using semantic maps as guidance, which separately constructs and learns subgenerators for different classes, enabling it to capture finer details. To learn more discriminative class-specific feature representations for the local generation, we also propose a novel classification module. To combine the advantages of both global image-level and local class-specific generation, a joint generation network is designed with an attention fusion module and a dual-discriminator structure embedded. Lastly, we propose a novel semantic-aware upsampling method, which has a larger receptive field and can take far-away pixels that are semantically related for feature upsampling, enabling it to better preserve semantic consistency for instances with the same semantic labels. Extensive experiments on two image generation tasks show the superior performance of the proposed method. State-of-the-art results are established by large margins on both tasks and on nine challenging public benchmarks. The source code and trained models are available at https://github.com/Ha0Tang/LGGAN.

preprint2022arXiv

Looking Outside the Window: Wide-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images

Long-range contextual information is crucial for the semantic segmentation of High-Resolution (HR) Remote Sensing Images (RSIs). However, image cropping operations, commonly used for training neural networks, limit the perception of long-range contexts in large RSIs. To overcome this limitation, we propose a Wide-Context Network (WiCoNet) for the semantic segmentation of HR RSIs. Apart from extracting local features with a conventional CNN, the WiCoNet has an extra context branch to aggregate information from a larger image area. Moreover, we introduce a Context Transformer to embed contextual information from the context branch and selectively project it onto the local features. The Context Transformer extends the Vision Transformer, an emerging kind of neural network, to model the dual-branch semantic correlations. It overcomes the locality limitation of CNNs and enables the WiCoNet to see the bigger picture before segmenting the land-cover/land-use (LCLU) classes. Ablation studies and comparative experiments conducted on several benchmark datasets demonstrate the effectiveness of the proposed method. In addition, we present a new Beijing Land-Use (BLU) dataset. This is a large-scale HR satellite dataset with high-quality and fine-grained reference labels, which can facilitate future studies in this field.

preprint2022arXiv

MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation

Estimating 3D human poses from monocular videos is a challenging task due to depth ambiguity and self-occlusion. Most existing works attempt to solve both issues by exploiting spatial and temporal relationships. However, those works ignore the fact that it is an inverse problem where multiple feasible solutions (i.e., hypotheses) exist. To relieve this limitation, we propose a Multi-Hypothesis Transformer (MHFormer) that learns spatio-temporal representations of multiple plausible pose hypotheses. In order to effectively model multi-hypothesis dependencies and build strong relationships across hypothesis features, the task is decomposed into three stages: (i) Generate multiple initial hypothesis representations; (ii) Model self-hypothesis communication, merge multiple hypotheses into a single converged representation and then partition it into several diverged hypotheses; (iii) Learn cross-hypothesis communication and aggregate the multi-hypothesis features to synthesize the final 3D pose. Through the above processes, the final representation is enhanced and the synthesized pose is much more accurate. Extensive experiments show that MHFormer achieves state-of-the-art results on two challenging datasets: Human3.6M and MPI-INF-3DHP. Without bells and whistles, its performance surpasses the previous best result by a large margin of 3% on Human3.6M. Code and models are available at \url{https://github.com/Vegetebird/MHFormer}.

preprint2022arXiv

Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation

The essence of video semantic segmentation (VSS) is how to leverage temporal information for prediction. Previous efforts are mainly devoted to developing new techniques to calculate the cross-frame affinities such as optical flow and attention. Instead, this paper contributes from a different angle by mining relations among cross-frame affinities, upon which better temporal information aggregation could be achieved. We explore relations among affinities in two aspects: single-scale intrinsic correlations and multi-scale relations. Inspired by traditional feature processing, we propose Single-scale Affinity Refinement (SAR) and Multi-scale Affinity Aggregation (MAA). To make it feasible to execute MAA, we propose a Selective Token Masking (STM) strategy to select a subset of consistent reference tokens for different scales when calculating affinities, which also improves the efficiency of our method. At last, the cross-frame affinities strengthened by SAR and MAA are adopted for adaptively aggregating temporal information. Our experiments demonstrate that the proposed method performs favorably against state-of-the-art VSS methods. The code is publicly available at https://github.com/GuoleiSun/VSS-MRCFA

preprint2022arXiv

Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model

To achieve disentangled image manipulation, previous works depend heavily on manual annotation. Meanwhile, the available manipulations are limited to a pre-defined set the models were trained for. We propose a novel framework, i.e., Predict, Prevent, and Evaluate (PPE), for disentangled text-driven image manipulation that requires little manual annotation while being applicable to a wide variety of manipulations. Our method approaches the targets by deeply exploiting the power of the large-scale pre-trained vision-language model CLIP. Concretely, we firstly Predict the possibly entangled attributes for a given text command. Then, based on the predicted attributes, we introduce an entanglement loss to Prevent entanglements during training. Finally, we propose a new evaluation metric to Evaluate the disentangled image manipulation. We verify the effectiveness of our method on the challenging face editing task. Extensive experiments show that the proposed PPE framework achieves much better quantitative and qualitative results than the up-to-date StyleCLIP baseline.

preprint2022arXiv

Quantum Computation for Pricing Caps using the LIBOR Market Model

The LIBOR Market Model (LMM) is a widely used model for pricing interest rate derivatives. While the Black-Scholes model is well-known for pricing stock derivatives such as stock options, a larger portion of derivatives are based on interest rates instead of stocks. Pricing interest rate derivatives used to be challenging, as their previous models employed either the instantaneous interest or forward rate that could not be directly observed in the market. This has been much improved since LMM was raised, as it uses directly observable interbank offered rates and is expected to be more precise. Recently, quantum computing has been used to speed up option pricing tasks, but rarely on structured interest rate derivatives. Given the size of the interest rate derivatives market and the widespread use of LMM, we employ quantum computing to price an interest rate derivative, caps, based on the LMM. As caps pricing relates to path-dependent Monte Carlo iterations for different tenors, which is common for many complex structured derivatives, we developed our hybrid classical-quantum approach that applies the quantum amplitude estimation algorithm to estimate the expectation for the last tenor. We show that our hybrid approach still shows better convergence than pure classical Monte Carlo methods, providing a useful case study for quantum computing with a greater diversity of derivatives.

preprint2022arXiv

Quantum Deep Learning for Mutant COVID-19 Strain Prediction

New COVID-19 epidemic strains like Delta and Omicron with increased transmissibility and pathogenicity emerge and spread across the whole world rapidly while causing high mortality during the pandemic period. Early prediction of possible variants (especially spike protein) of COVID-19 epidemic strains based on available mutated SARS-CoV-2 RNA sequences may lead to early prevention and treatment. Here, combining the advantage of quantum and quantum-inspired algorithms with the wide application of deep learning, we propose a development tool named DeepQuantum, and use this software to realize the goal of predicting spike protein variation structure of COVID-19 epidemic strains. In addition, this hybrid quantum-classical model for the first time achieves quantum-inspired blur convolution similar to classical depthwise convolution and also successfully applies quantum progressive training with quantum circuits, both of which guarantee that our model is the quantum counterpart of the famous style-based GAN. The results state that the fidelities of random generating spike protein variation structure are always beyond 96% for Delta, 94% for Omicron. The training loss curve is more stable and converges better with multiple loss functions compared with the corresponding classical algorithm. At last, evidences that quantum-inspired algorithms promote the classical deep learning and hybrid models effectively predict the mutant strains are strong.

preprint2022arXiv

RCRN: Real-world Character Image Restoration Network via Skeleton Extraction

Constructing high-quality character image datasets is challenging because real-world images are often affected by image degradation. There are limitations when applying current image restoration methods to such real-world character images, since (i) the categories of noise in character images are different from those in general images; (ii) real-world character images usually contain more complex image degradation, e.g., mixed noise at different noise levels. To address these problems, we propose a real-world character restoration network (RCRN) to effectively restore degraded character images, where character skeleton information and scale-ensemble feature extraction are utilized to obtain better restoration performance. The proposed method consists of a skeleton extractor (SENet) and a character image restorer (CiRNet). SENet aims to preserve the structural consistency of the character and normalize complex noise. Then, CiRNet reconstructs clean images from degraded character images and their skeletons. Due to the lack of benchmarks for real-world character image restoration, we constructed a dataset containing 1,606 character images with real-world degradation to evaluate the validity of the proposed method. The experimental results demonstrate that RCRN outperforms state-of-the-art methods quantitatively and qualitatively.

preprint2022arXiv

Real-Time Portrait Stylization on the Edge

In this work we demonstrate real-time portrait stylization, specifically, translating self-portrait into cartoon or anime style on mobile devices. We propose a latency-driven differentiable architecture search method, maintaining realistic generative quality. With our framework, we obtain $10\times$ computation reduction on the generative model and achieve real-time video stylization on off-the-shelf smartphone using mobile GPUs.

preprint2022arXiv

Supervised Attention in Sequence-to-Sequence Models for Speech Recognition

Attention mechanism in sequence-to-sequence models is designed to model the alignments between acoustic features and output tokens in speech recognition. However, attention weights produced by models trained end to end do not always correspond well with actual alignments, and several studies have further argued that attention weights might not even correspond well with the relevance attribution of frames. Regardless, visual similarity between attention weights and alignments is widely used during training as an indicator of the models quality. In this paper, we treat the correspondence between attention weights and alignments as a learning problem by imposing a supervised attention loss. Experiments have shown significant improved performance, suggesting that learning the alignments well during training critically determines the performance of sequence-to-sequence models.

preprint2022arXiv

Topology-Preserving Shape Reconstruction and Registration via Neural Diffeomorphic Flow

Deep Implicit Functions (DIFs) represent 3D geometry with continuous signed distance functions learned through deep neural nets. Recently DIFs-based methods have been proposed to handle shape reconstruction and dense point correspondences simultaneously, capturing semantic relationships across shapes of the same class by learning a DIFs-modeled shape template. These methods provide great flexibility and accuracy in reconstructing 3D shapes and inferring correspondences. However, the point correspondences built from these methods do not intrinsically preserve the topology of the shapes, unlike mesh-based template matching methods. This limits their applications on 3D geometries where underlying topological structures exist and matter, such as anatomical structures in medical images. In this paper, we propose a new model called Neural Diffeomorphic Flow (NDF) to learn deep implicit shape templates, representing shapes as conditional diffeomorphic deformations of templates, intrinsically preserving shape topologies. The diffeomorphic deformation is realized by an auto-decoder consisting of Neural Ordinary Differential Equation (NODE) blocks that progressively map shapes to implicit templates. We conduct extensive experiments on several medical image organ segmentation datasets to evaluate the effectiveness of NDF on reconstructing and aligning shapes. NDF achieves consistently state-of-the-art organ shape reconstruction and registration results in both accuracy and quality. The source code is publicly available at https://github.com/Siwensun/Neural_Diffeomorphic_Flow--NDF.

preprint2022arXiv

Towards Interpretable Video Super-Resolution via Alternating Optimization

In this paper, we study a practical space-time video super-resolution (STVSR) problem which aims at generating a high-framerate high-resolution sharp video from a low-framerate low-resolution blurry video. Such problem often occurs when recording a fast dynamic event with a low-framerate and low-resolution camera, and the captured video would suffer from three typical issues: i) motion blur occurs due to object/camera motions during exposure time; ii) motion aliasing is unavoidable when the event temporal frequency exceeds the Nyquist limit of temporal sampling; iii) high-frequency details are lost because of the low spatial sampling rate. These issues can be alleviated by a cascade of three separate sub-tasks, including video deblurring, frame interpolation, and super-resolution, which, however, would fail to capture the spatial and temporal correlations among video sequences. To address this, we propose an interpretable STVSR framework by leveraging both model-based and learning-based methods. Specifically, we formulate STVSR as a joint video deblurring, frame interpolation, and super-resolution problem, and solve it as two sub-problems in an alternate way. For the first sub-problem, we derive an interpretable analytical solution and use it as a Fourier data transform layer. Then, we propose a recurrent video enhancement layer for the second sub-problem to further recover high-frequency details. Extensive experiments demonstrate the superiority of our method in terms of quantitative metrics and visual quality.

preprint2022arXiv

Unsupervised High-Resolution Portrait Gaze Correction and Animation

This paper proposes a gaze correction and animation method for high-resolution, unconstrained portrait images, which can be trained without the gaze angle and the head pose annotations. Common gaze-correction methods usually require annotating training data with precise gaze, and head pose information. Solving this problem using an unsupervised method remains an open problem, especially for high-resolution face images in the wild, which are not easy to annotate with gaze and head pose labels. To address this issue, we first create two new portrait datasets: CelebGaze and high-resolution CelebHQGaze. Second, we formulate the gaze correction task as an image inpainting problem, addressed using a Gaze Correction Module (GCM) and a Gaze Animation Module (GAM). Moreover, we propose an unsupervised training strategy, i.e., Synthesis-As-Training, to learn the correlation between the eye region features and the gaze angle. As a result, we can use the learned latent space for gaze animation with semantic interpolation in this space. Moreover, to alleviate both the memory and the computational costs in the training and the inference stage, we propose a Coarse-to-Fine Module (CFM) integrated with GCM and GAM. Extensive experiments validate the effectiveness of our method for both the gaze correction and the gaze animation tasks in both low and high-resolution face datasets in the wild and demonstrate the superiority of our method with respect to the state of the arts. Code is available at https://github.com/zhangqianhui/GazeAnimationV2

preprint2020arXiv

A Hybrid Quantum Memory Enabled Network at Room Temperature

Quantum memory capable of storage and retrieval of flying photons on demand is crucial for developing quantum information technologies. However, the devices needed for long-distance links are quite different from those envisioned for local processing. Here, we present the first hybrid quantum memory enabled network by demonstrating the interconnection and simultaneous operation of two types of quantum memory: an atomic-ensemble-based memory and an all-optical loop memory. The former generates and stores single atomic excitations that can then be converted to single photons; and the latter maps incoming photons in and out on demand, at room-temperature and with a broad acceptance bandwidth. Interfacing these two types of quantum memories, we observe a well-preserved quantum cross-correlation, reaching a value of 22, and a violation of the Cauchy-Schwarz inequality up to 549 standard deviations. Furthermore, we demonstrate the creation and storage of a fully operable heralded photon chain state that can achieve memory-built-in combining, swapping, splitting, tuning and chopping single photons in a chain temporally. Such a quantum network allows atomic excitations to be generated, stored, and converted to broadband photons, which are then transferred to the next node, stored, and faithfully retrieved, all at high speed and in a programmable fashion.

preprint2020arXiv

A Study of Bug Resolution Characteristics in Popular Programming Languages

This paper presents a large-scale study that investigates the bug resolution characteristics among popular Github projects written in different programming languages. We explore correlations but, of course, we cannot infer causation. Specifically, we analyse bug resolution data from approximately 70 million Source Line of Code, drawn from 3 million commits to 600 GitHub projects, primarily written in 10 programming languages. We find notable variations in apparent bug resolution time and patch (fix) size. While interpretation of results from such large-scale empirical studies is inherently difficult, we believe that the differences in medians are sufficiently large to warrant further investigation, replication, re-analysis and follow up research. For example, in our corpus, the median apparent bug resolution time (elapsed time from raise to resolve) for Ruby was 4X that for Go and 2.5X for Java. We also found that patches tend to touch more files for the corpus of strongly typed and for statically typed programs. However, we also found evidence for a lower elapsed resolution time for bug resolution committed to projects constructed from statically typed languages. These findings, if replicated in subsequent follow on studies, may shed further empirical light on the debate about the importance of static typing.

preprint2020arXiv

AttentionAnatomy: A unified framework for whole-body organs at risk segmentation using multiple partially annotated datasets

Organs-at-risk (OAR) delineation in computed tomography (CT) is an important step in Radiation Therapy (RT) planning. Recently, deep learning based methods for OAR delineation have been proposed and applied in clinical practice for separate regions of the human body (head and neck, thorax, and abdomen). However, there are few researches regarding the end-to-end whole-body OARs delineation because the existing datasets are mostly partially or incompletely annotated for such task. In this paper, our proposed end-to-end convolutional neural network model, called \textbf{AttentionAnatomy}, can be jointly trained with three partially annotated datasets, segmenting OARs from whole body. Our main contributions are: 1) an attention module implicitly guided by body region label to modulate the segmentation branch output; 2) a prediction re-calibration operation, exploiting prior information of the input images, to handle partial-annotation(HPA) problem; 3) a new hybrid loss function combining batch Dice loss and spatially balanced focal loss to alleviate the organ size imbalance problem. Experimental results of our proposed framework presented significant improvements in both Sørensen-Dice coefficient (DSC) and 95\% Hausdorff distance compared to the baseline model.

preprint2020arXiv

Audio-Visual Calibration with Polynomial Regression for 2-D Projection Using SVD-PHAT

This paper proposes a straightforward 2-D method to spatially calibrate the visual field of a camera with the auditory field of an array microphone by generating and overlaying an acoustic image over an optical image. Using a low-cost microphone array and an off-the-shelf camera, we show that polynomial regression can deal efficiently with non-linear camera distortion, and that a recently proposed sound source localization method for real-time processing, SVD-PHAT, can be adapted for this task.

preprint2020arXiv

Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention

The major challenge in audio-visual event localization task lies in how to fuse information from multiple modalities effectively. Recent works have shown that attention mechanism is beneficial to the fusion process. In this paper, we propose a novel joint attention mechanism with multimodal fusion methods for audio-visual event localization. Particularly, we present a concise yet valid architecture that effectively learns representations from multiple modalities in a joint manner. Initially, visual features are combined with auditory features and then turned into joint representations. Next, we make use of the joint representations to attend to visual features and auditory features, respectively. With the help of this joint co-attention, new visual and auditory features are produced, and thus both features can enjoy the mutually improved benefits from each other. It is worth noting that the joint co-attention unit is recursive meaning that it can be performed multiple times for obtaining better joint representations progressively. Extensive experiments on the public AVE dataset have shown that the proposed method achieves significantly better results than the state-of-the-art methods.

preprint2020arXiv

Belief Propagation Neural Networks

Learned neural solvers have successfully been used to solve combinatorial optimization and decision problems. More general counting variants of these problems, however, are still largely solved with hand-crafted solvers. To bridge this gap, we introduce belief propagation neural networks (BPNNs), a class of parameterized operators that operate on factor graphs and generalize Belief Propagation (BP). In its strictest form, a BPNN layer (BPNN-D) is a learned iterative operator that provably maintains many of the desirable properties of BP for any choice of the parameters. Empirically, we show that by training BPNN-D learns to perform the task better than the original BP: it converges 1.7x faster on Ising models while providing tighter bounds. On challenging model counting problems, BPNNs compute estimates 100's of times faster than state-of-the-art handcrafted methods, while returning an estimate of comparable quality.

preprint2020arXiv

Bipartite Graph Reasoning GANs for Person Image Generation

We present a novel Bipartite Graph Reasoning GAN (BiGraphGAN) for the challenging person image generation task. The proposed graph generator mainly consists of two novel blocks that aim to model the pose-to-pose and pose-to-image relations, respectively. Specifically, the proposed Bipartite Graph Reasoning (BGR) block aims to reason the crossing long-range relations between the source pose and the target pose in a bipartite graph, which mitigates some challenges caused by pose deformation. Moreover, we propose a new Interaction-and-Aggregation (IA) block to effectively update and enhance the feature representation capability of both person's shape and appearance in an interactive way. Experiments on two challenging and public datasets, i.e., Market-1501 and DeepFashion, show the effectiveness of the proposed BiGraphGAN in terms of objective quantitative scores and subjective visual realness. The source code and trained models are available at https://github.com/Ha0Tang/BiGraphGAN.

preprint2020arXiv

Cross-View Image Synthesis with Deformable Convolution and Attention Mechanism

Learning to generate natural scenes has always been a daunting task in computer vision. This is even more laborious when generating images with very different views. When the views are very different, the view fields have little overlap or objects are occluded, leading the task very challenging. In this paper, we propose to use Generative Adversarial Networks(GANs) based on a deformable convolution and attention mechanism to solve the problem of cross-view image synthesis (see Fig.1). It is difficult to understand and transform scenes appearance and semantic information from another view, thus we use deformed convolution in the U-net network to improve the network's ability to extract features of objects at different scales. Moreover, to better learn the correspondence between images from different views, we apply an attention mechanism to refine the intermediate feature map thus generating more realistic images. A large number of experiments on different size images on the Dayton dataset[1] show that our model can produce better results than state-of-the-art methods.

preprint2020arXiv

Cycle In Cycle Generative Adversarial Networks for Keypoint-Guided Image Generation

In this work, we propose a novel Cycle In Cycle Generative Adversarial Network (C$^2$GAN) for the task of keypoint-guided image generation. The proposed C$^2$GAN is a cross-modal framework exploring a joint exploitation of the keypoint and the image data in an interactive manner. C$^2$GAN contains two different types of generators, i.e., keypoint-oriented generator and image-oriented generator. Both of them are mutually connected in an end-to-end learnable fashion and explicitly form three cycled sub-networks, i.e., one image generation cycle and two keypoint generation cycles. Each cycle not only aims at reconstructing the input domain, and also produces useful output involving in the generation of another cycle. By so doing, the cycles constrain each other implicitly, which provides complementary information from the two different modalities and brings extra supervision across cycles, thus facilitating more robust optimization of the whole network. Extensive experimental results on two publicly available datasets, i.e., Radboud Faces and Market-1501, demonstrate that our approach is effective to generate more photo-realistic images compared with state-of-the-art models.

preprint2020arXiv

Direct Observation of Quantum Percolation Dynamics

Percolation, describing critical behaviors of phase transition in a geometrical context, prompts wide investigations in natural and social networks as a fundamental model. The introduction of quantum-intrinsic interference and tunneling brings percolation into quantum regime with more fascinating phenomena and unique features, which, however, hasn't been experimentally explored yet. Here we present an experimental demonstration of quantum transport in hexagonal percolation lattices by successfully mapping such large-scale porous structures into a photonic chip using femtosecond laser direct writing techniques. A quantum percolation threshold of 80% is observed in the prototyped laser-written lattices with up to 1,600 waveguides, which is significantly larger than the classical counterpart of 63%. We also investigate the spatial confinement by localization parameters and exhibit the transition from ballistic to diffusive propagation with the decrease of the occupation probability. Direct observation of quantum percolation may deepen the understanding of the relation among materials, quantum transport, geometric quenching, disorder and localization, and inspire applications for quantum technologies.

preprint2020arXiv

Dual Attention GANs for Semantic Image Synthesis

In this paper, we focus on the semantic image synthesis task that aims at transferring semantic label maps to photo-realistic images. Existing methods lack effective semantic constraints to preserve the semantic information and ignore the structural correlations in both spatial and channel dimensions, leading to unsatisfactory blurry and artifact-prone results. To address these limitations, we propose a novel Dual Attention GAN (DAGAN) to synthesize photo-realistic and semantically-consistent images with fine details from the input layouts without imposing extra training overhead or modifying the network architectures of existing methods. We also propose two novel modules, i.e., position-wise Spatial Attention Module (SAM) and scale-wise Channel Attention Module (CAM), to capture semantic structure attention in spatial and channel dimensions, respectively. Specifically, SAM selectively correlates the pixels at each position by a spatial attention map, leading to pixels with the same semantic label being related to each other regardless of their spatial distances. Meanwhile, CAM selectively emphasizes the scale-wise features at each channel by a channel attention map, which integrates associated features among all channel maps regardless of their scales. We finally sum the outputs of SAM and CAM to further improve feature representation. Extensive experiments on four challenging datasets show that DAGAN achieves remarkably better results than state-of-the-art methods, while using fewer model parameters. The source code and trained models are available at https://github.com/Ha0Tang/DAGAN.

preprint2020arXiv

Dual In-painting Model for Unsupervised Gaze Correction and Animation in the Wild

In this paper we address the problem of unsupervised gaze correction in the wild, presenting a solution that works without the need for precise annotations of the gaze angle and the head pose. We have created a new dataset called CelebAGaze, which consists of two domains X, Y, where the eyes are either staring at the camera or somewhere else. Our method consists of three novel modules: the Gaze Correction module (GCM), the Gaze Animation module (GAM), and the Pretrained Autoencoder module (PAM). Specifically, GCM and GAM separately train a dual in-painting network using data from the domain $X$ for gaze correction and data from the domain $Y$ for gaze animation. Additionally, a Synthesis-As-Training method is proposed when training GAM to encourage the features encoded from the eye region to be correlated with the angle information, resulting in a gaze animation which can be achieved by interpolation in the latent space. To further preserve the identity information~(e.g., eye shape, iris color), we propose the PAM with an Autoencoder, which is based on Self-Supervised mirror learning where the bottleneck features are angle-invariant and which works as an extra input to the dual in-painting models. Extensive experiments validate the effectiveness of the proposed method for gaze correction and gaze animation in the wild and demonstrate the superiority of our approach in producing more compelling results than state-of-the-art baselines. Our code, the pretrained models and the supplementary material are available at: https://github.com/zhangqianhui/GazeAnimation.

preprint2020arXiv

Exocentric to Egocentric Image Generation via Parallel Generative Adversarial Network

Cross-view image generation has been recently proposed to generate images of one view from another dramatically different view. In this paper, we investigate exocentric (third-person) view to egocentric (first-person) view image generation. This is a challenging task since egocentric view sometimes is remarkably different from exocentric view. Thus, transforming the appearances across the two views is a non-trivial task. To this end, we propose a novel Parallel Generative Adversarial Network (P-GAN) with a novel cross-cycle loss to learn the shared information for generating egocentric images from exocentric view. We also incorporate a novel contextual feature loss in the learning procedure to capture the contextual information in images. Extensive experiments on the Exo-Ego datasets show that our model outperforms the state-of-the-art approaches.

preprint2020arXiv

Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation

In this paper, we address the task of semantic-guided scene generation. One open challenge in scene generation is the difficulty of the generation of small objects and detailed local texture, which has been widely observed in global image-level generation methods. To tackle this issue, in this work we consider learning the scene generation in a local context, and correspondingly design a local class-specific generative network with semantic maps as a guidance, which separately constructs and learns sub-generators concentrating on the generation of different classes, and is able to provide more scene details. To learn more discriminative class-specific feature representations for the local generation, a novel classification module is also proposed. To combine the advantage of both the global image-level and the local class-specific generation, a joint generation network is designed with an attention fusion module and a dual-discriminator structure embedded. Extensive experiments on two scene image generation tasks show superior generation performance of the proposed model. The state-of-the-art results are established by large margins on both tasks and on challenging public benchmarks. The source code and trained models are available at https://github.com/Ha0Tang/LGGAN.

preprint2020arXiv

On a stochastic Camassa-Holm type equation with higher order nonlinearities

The subject of this paper is a generalized Camassa-Holm equation under random perturbation. We first establish local existence and uniqueness results as well as blow-up criteria for pathwise solutions in the Sobolev spaces $H^s$ with $s>3/2$. Then we analyze how noise affects the dependence of solutions on initial data. Even though the noise has some already known regularization effects, much less is known concerning the dependence on initial data. As a new concept we introduce the notion of stability of exiting times and construct an example showing that multiplicative noise (in Itô sense) cannot improve the stability of the exiting time, and simultaneously improve the continuity of the dependence on initial data. Finally, we obtain global existence theorems and estimate associated probabilities.

preprint2020arXiv

Quantum Go Machine

Go has long been considered as a testbed for artificial intelligence. By introducing certain quantum features, such as superposition and collapse of wavefunction, we experimentally demonstrate a quantum version of Go by using correlated photon pairs entangled in polarization degree of freedom. The total dimension of Hilbert space of the generated states grows exponentially as two players take turns to place the stones in time series. As nondeterministic and imperfect information games are more difficult to solve using nowadays technology, we excitedly find that the inherent randomness in quantum physics can bring the game nondeterministic trait, which does not exist in the classical counterpart. Some quantum resources, like coherence or entanglement, can also be encoded to represent the state of quantum stones. Adjusting the quantum resource may vary the average imperfect information (as comparison classical Go is a perfect information game) of a single game. We further verify its non-deterministic feature by showing the unpredictability of the time series data obtained from different classes of quantum state. Finally, by comparing quantum Go with a few typical games that are widely studied in artificial intelligence, we find that quantum Go can cover a wide range of game difficulties rather than a single point. Our results establish a paradigm of inventing new games with quantum-enabled difficulties by harnessing inherent quantum features and resources, and provide a versatile platform for the test of new algorithms to both classical and quantum machine learning.

preprint2020arXiv

Relevant Region Prediction for Crowd Counting

Crowd counting is a concerned and challenging task in computer vision. Existing density map based methods excessively focus on the individuals' localization which harms the crowd counting performance in highly congested scenes. In addition, the dependency between the regions of different density is also ignored. In this paper, we propose Relevant Region Prediction (RRP) for crowd counting, which consists of the Count Map and the Region Relation-Aware Module (RRAM). Each pixel in the count map represents the number of heads falling into the corresponding local area in the input image, which discards the detailed spatial information and forces the network pay more attention to counting rather than localizing individuals. Based on the Graph Convolutional Network (GCN), Region Relation-Aware Module is proposed to capture and exploit the important region dependency. The module builds a fully connected directed graph between the regions of different density where each node (region) is represented by weighted global pooled feature, and GCN is learned to map this region graph to a set of relation-aware regions representations. Experimental results on three datasets show that our method obviously outperforms other existing state-of-the-art methods.

preprint2020arXiv

TensorFlow Solver for Quantum PageRank in Large-Scale Networks

Google PageRank is a prevalent and useful algorithm for ranking the significance of nodes or websites in a network, and a recent quantum counterpart for PageRank algorithm has been raised to suggest a higher accuracy of ranking comparing to Google PageRank. The quantum PageRank algorithm is essentially based on quantum stochastic walks and can be expressed using Lindblad master equation, which, however, needs to solve the Kronecker products of an O(N^4) dimension and requires severely large memory and time when the number of nodes N in a network increases above 150. Here, we present an efficient solver for quantum PageRank by using the Runge-Kutta method to reduce the matrix dimension to O(N^2) and employing TensorFlow to conduct GPU parallel computing. We demonstrate its performance in solving quantum PageRank for the USA major airline network with up to 922 nodes. Compared with the previous quantum PageRank solver, our solver dramatically reduces the required memory and time to only 1% and 0.2%, respectively, making it practical to work in a normal computer with a memory of 4-8 GB in no more than 100 seconds. This efficient solver for large-scale quantum PageRank and quantum stochastic walks would greatly facilitate studies of quantum information in real-life applications.

preprint2020arXiv

The Habitable Exoplanet Observatory (HabEx) Mission Concept Study Final Report

The Habitable Exoplanet Observatory, or HabEx, has been designed to be the Great Observatory of the 2030s. For the first time in human history, technologies have matured sufficiently to enable an affordable space-based telescope mission capable of discovering and characterizing Earthlike planets orbiting nearby bright sunlike stars in order to search for signs of habitability and biosignatures. Such a mission can also be equipped with instrumentation that will enable broad and exciting general astrophysics and planetary science not possible from current or planned facilities. HabEx is a space telescope with unique imaging and multi-object spectroscopic capabilities at wavelengths ranging from ultraviolet (UV) to near-IR. These capabilities allow for a broad suite of compelling science that cuts across the entire NASA astrophysics portfolio. HabEx has three primary science goals: (1) Seek out nearby worlds and explore their habitability; (2) Map out nearby planetary systems and understand the diversity of the worlds they contain; (3) Enable new explorations of astrophysical systems from our own solar system to external galaxies by extending our reach in the UV through near-IR. This Great Observatory science will be selected through a competed GO program, and will account for about 50% of the HabEx primary mission. The preferred HabEx architecture is a 4m, monolithic, off-axis telescope that is diffraction-limited at 0.4 microns and is in an L2 orbit. HabEx employs two starlight suppression systems: a coronagraph and a starshade, each with their own dedicated instrument.

preprint2020arXiv

Vector Vortex Beam Emitter Embedded in a Photonic Chip

Vector vortex beams simultaneously carrying spin and orbital angular momentum of light promise additional degrees of freedom for modern optics and emerging resources for both classical and quantum information technologies. The inherently infinite dimensions can be exploited to enhance data capacity for sustaining the unprecedented growth in big data and internet traffic, and can be encoded to build quantum computing machines in high-dimensional Hilbert space. So far much progress has been made in the emission of vector vortex beams from a chip surface into free space, however, the generation of vector vortex beams inside a photonic chip hasn't been realized yet. Here, we demonstrate the first vector vortex beam emitter embedded in a photonic chip by using femtosecond laser direct writing. We achieve a conversion of vector vortex beams with an efficiency up to 30% and scalar vortex beams with an efficiency up to 74% from Gaussian beams. We also present an expanded coupled-mode model for understanding the mode conversion and the influence of the imperfection in fabrication. The fashion of embedded generation makes vector vortex beams directly ready for further transmission, manipulation and emission without any additional interconnection. Together with the ability to be integrated as an array, our results may enable vector vortex beams become accessible inside a photonic chip for high-capacity communication and high-dimensional quantum information processing.

preprint2020arXiv

Vector-Quantized Autoregressive Predictive Coding

Autoregressive Predictive Coding (APC), as a self-supervised objective, has enjoyed success in learning representations from large amounts of unlabeled data, and the learned representations are rich for many downstream tasks. However, the connection between low self-supervised loss and strong performance in downstream tasks remains unclear. In this work, we propose Vector-Quantized Autoregressive Predictive Coding (VQ-APC), a novel model that produces quantized representations, allowing us to explicitly control the amount of information encoded in the representations. By studying a sequence of increasingly limited models, we reveal the constituents of the learned representations. In particular, we confirm the presence of information with probing tasks, while showing the absence of information with mutual information, uncovering the model's preference in preserving speech information as its capacity becomes constrained. We find that there exists a point where phonetic and speaker information are amplified to maximize a self-supervised objective. As a byproduct, the learned codes for a particular model capacity correspond well to English phones.

preprint2020arXiv

When Dictionary Learning Meets Deep Learning: Deep Dictionary Learning and Coding Network for Image Recognition with Limited Data

We present a new Deep Dictionary Learning and Coding Network (DDLCN) for image recognition tasks with limited data. The proposed DDLCN has most of the standard deep learning layers (e.g., input/output, pooling, fully connected, etc.), but the fundamental convolutional layers are replaced by our proposed compound dictionary learning and coding layers. The dictionary learning learns an over-complete dictionary for input training data. At the deep coding layer, a locality constraint is added to guarantee that the activated dictionary bases are close to each other. Then the activated dictionary atoms are assembled and passed to the compound dictionary learning and coding layers. In this way, the activated atoms in the first layer can be represented by the deeper atoms in the second dictionary. Intuitively, the second dictionary is designed to learn the fine-grained components shared among the input dictionary atoms, thus a more informative and discriminative low-level representation of the dictionary atoms can be obtained. We empirically compare DDLCN with several leading dictionary learning methods and deep learning models. Experimental results on five popular datasets show that DDLCN achieves competitive results compared with state-of-the-art methods when the training data is limited. Code is available at https://github.com/Ha0Tang/DDLCN.

preprint2020arXiv

XingGAN for Person Image Generation

We propose a novel Generative Adversarial Network (XingGAN or CrossingGAN) for person image generation tasks, i.e., translating the pose of a given person to a desired one. The proposed Xing generator consists of two generation branches that model the person's appearance and shape information, respectively. Moreover, we propose two novel blocks to effectively transfer and update the person's shape and appearance embeddings in a crossing way to mutually improve each other, which has not been considered by any other existing GAN-based image generation work. Extensive experiments on two challenging datasets, i.e., Market-1501 and DeepFashion, demonstrate that the proposed XingGAN advances the state-of-the-art performance both in terms of objective quantitative scores and subjective visual realness. The source code and trained models are available at https://github.com/Ha0Tang/XingGAN.

preprint2016arXiv

Discriminative Segmental Cascades for Feature-Rich Phone Recognition

Discriminative segmental models, such as segmental conditional random fields (SCRFs) and segmental structured support vector machines (SSVMs), have had success in speech recognition via both lattice rescoring and first-pass decoding. However, such models suffer from slow decoding, hampering the use of computationally expensive features, such as segment neural networks or other high-order features. A typical solution is to use approximate decoding, either by beam pruning in a single pass or by beam pruning to generate a lattice followed by a second pass. In this work, we study discriminative segmental models trained with a hinge loss (i.e., segmental structured SVMs). We show that beam search is not suitable for learning rescoring models in this approach, though it gives good approximate decoding performance when the model is already well-trained. Instead, we consider an approach inspired by structured prediction cascades, which use max-marginal pruning to generate lattices. We obtain a high-accuracy phonetic recognition system with several expensive feature types: a segment neural network, a second-order language model, and second-order phone boundary features.

preprint2016arXiv

Efficient Segmental Cascades for Speech Recognition

Discriminative segmental models offer a way to incorporate flexible feature functions into speech recognition. However, their appeal has been limited by their computational requirements, due to the large number of possible segments to consider. Multi-pass cascades of segmental models introduce features of increasing complexity in different passes, where in each pass a segmental model rescores lattices produced by a previous (simpler) segmental model. In this paper, we explore several ways of making segmental cascades efficient and practical: reducing the feature set in the first pass, frame subsampling, and various pruning approaches. In experiments on phonetic recognition, we find that with a combination of such techniques, it is possible to maintain competitive performance while greatly reducing decoding, pruning, and training time.

preprint2016arXiv

End-to-End Training Approaches for Discriminative Segmental Models

Recent work on discriminative segmental models has shown that they can achieve competitive speech recognition performance, using features based on deep neural frame classifiers. However, segmental models can be more challenging to train than standard frame-based approaches. While some segmental models have been successfully trained end to end, there is a lack of understanding of their training under different settings and with different losses. We investigate a model class based on recent successful approaches, consisting of a linear model that combines segmental features based on an LSTM frame classifier. Similarly to hybrid HMM-neural network models, segmental models of this class can be trained in two stages (frame classifier training followed by linear segmental model weight training), end to end (joint training of both frame classifier and linear weights), or with end-to-end fine-tuning after two-stage training. We study segmental models trained end to end with hinge loss, log loss, latent hinge loss, and marginal log loss. We consider several losses for the case where training alignments are available as well as where they are not. We find that in general, marginal log loss provides the most consistent strong performance without requiring ground-truth alignments. We also find that training with dropout is very important in obtaining good performance with end-to-end training. Finally, the best results are typically obtained by a combination of two-stage training and fine-tuning.

preprint2016arXiv

Lexicon-Free Fingerspelling Recognition from Video: Data, Models, and Signer Adaptation

We study the problem of recognizing video sequences of fingerspelled letters in American Sign Language (ASL). Fingerspelling comprises a significant but relatively understudied part of ASL. Recognizing fingerspelling is challenging for a number of reasons: It involves quick, small motions that are often highly coarticulated; it exhibits significant variation between signers; and there has been a dearth of continuous fingerspelling data collected. In this work we collect and annotate a new data set of continuous fingerspelling videos, compare several types of recognizers, and explore the problem of signer variation. Our best-performing models are segmental (semi-Markov) conditional random fields using deep neural network-based features. In the signer-dependent setting, our recognizers achieve up to about 92% letter accuracy. The multi-signer setting is much more challenging, but with neural network adaptation we achieve up to 83% letter accuracies in this setting.

preprint2016arXiv

Non-classical photon correlation in a two-dimensional photonic lattice

Quantum interference and quantum correlation, as two main features of quantum optics, play an essential role in quantum information applications, such as multi-particle quantum walk and boson sampling. While many experimental demonstrations have been done in one-dimensional waveguide arrays, it remains unexplored in higher dimensions due to tight requirement of manipulating and detecting photons in large-scale. Here, we experimentally observe non-classical correlation of two identical photons in a fully coupled two-dimensional structure, i.e. photonic lattice manufactured by three-dimensional femtosecond laser writing. Photon interference consists of 36 Hong-Ou-Mandel interference and 9 bunching. The overlap between measured and simulated distribution is up to $0.890\pm0.001$. Clear photon correlation is observed in the two-dimensional photonic lattice. Combining with controllably engineered disorder, our results open new perspectives towards large-scale implementation of quantum simulation on integrated photonic chips.

preprint2016arXiv

Signer-independent Fingerspelling Recognition with Deep Neural Network Adaptation

We study the problem of recognition of fingerspelled letter sequences in American Sign Language in a signer-independent setting. Fingerspelled sequences are both challenging and important to recognize, as they are used for many content words such as proper nouns and technical terms. Previous work has shown that it is possible to achieve almost 90% accuracies on fingerspelling recognition in a signer-dependent setting. However, the more realistic signer-independent setting presents challenges due to significant variations among signers, coupled with the dearth of available training data. We investigate this problem with approaches inspired by automatic speech recognition. We start with the best-performing approaches from prior work, based on tandem models and segmental conditional random fields (SCRFs), with features based on deep neural network (DNN) classifiers of letters and phonological features. Using DNN adaptation, we find that it is possible to bridge a large part of the gap between signer-dependent and signer-independent performance. Using only about 115 transcribed words for adaptation from the target signer, we obtain letter accuracies of up to 82.7% with frame-level adaptation labels and 69.7% with only word labels.

preprint2011arXiv

Two-dimensional Transport Induced Linear Magneto-Resistance in Topological Insulator Bi$_2$Se$_3$ Nanoribbons

We report the study of a novel linear magneto-resistance (MR) under perpendicular magnetic fields in Bi2Se3 nanoribbons. Through angular dependence magneto-transport experiments, we show that this linear MR is purely due to two-dimensional (2D) transport, in agreement with the recently discovered linear MR from 2D topological surface state in bulk Bi2Te3, and the linear MR of other gapless semiconductors and graphene. We further show that the linear MR of Bi2Se3 nanoribbons persists to room temperature, underscoring the potential of exploiting topological insulator nanomaterials for room temperature magneto-electronic applications.

preprint2010arXiv

Magneto-transport Effects in Topological Insulator Bi$_2$Se$_3$ Nanoribbons

Magneto-resistance (MR) of Bi$_2$Se$_3$ nanoribbons is studied over a broad range of temperature ($T$=300K-2K) and under various magnetic field ($B$) orientations. The MR is strongly anisotropic with the perpendicular MR much larger than the longitudinal and transverse MRs. The perpendicular MR exhibits quadratic $B$-dependence in low fields and becomes linear at high $B$. However, when $T$ increases, the perpendicular MR becomes linear over the whole magnetic field range (0-9T) up to room temperature. This unusual linear MR is discussed in the context of the linear quantum MR of the topological surface-states. We also observe the boundary-scattering effect in MR at low temperatures, which indicates that the out-of-plane Fermi momentum is much smaller the in-plane Fermi momentum, excluding the simple three-dimensional Fermi surface picture.

Hao Tang

What is connected

Connect this record

See the researcher in context

Building this map preview

62 published item(s)

A framework for analyzing concept representations in neural models

Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment

Controllable Person Image Synthesis with Spatially-Adaptive Warped Normalization

3D-Aware Semantic-Guided Generative Model for Human Synthesis

AO2-DETR: Arbitrary-Oriented Object Detection Transformer

Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization

CharFormer: A Glyph Fusion based Attentive Framework for High-precision Character Image Denoising

Compiler-Aware Neural Architecture Search for On-Mobile Real-time Super-Resolution

Continual Attentive Fusion for Incremental Learning in Semantic Segmentation

Contrastive Learning from Spatio-Temporal Mixed Skeleton Sequences for Self-Supervised Skeleton-Based Action Recognition

DE-Net: Dynamic Text-guided Image Editing Adversarial Networks

Deep Algebraic Fitting for Multiple Circle Primitives Extraction from Raw Point Clouds

Disentangle Saliency Detection into Cascaded Detail Modeling and Body Filling

Distribution-Path Dependent Nonlinear SPDEs with Application to Stochastic Transport Type Equations

First-principles Calculation of the Temperature-dependent Transition Energies in Spin Defects

Graph-based Generative Face Anonymisation with Pose Preservation

Hierarchical Sketch Induction for Paraphrase Generation

Identity-Sensitive Knowledge Propagation for Cloth-Changing Person Re-identification

Local and Global GANs with Semantic-Aware Upsampling for Image Generation

Looking Outside the Window: Wide-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images

MHFormer: Multi-Hypothesis Transformer for 3D Human Pose Estimation

Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation

Predict, Prevent, and Evaluate: Disentangled Text-Driven Image Manipulation Empowered by Pre-Trained Vision-Language Model

Quantum Computation for Pricing Caps using the LIBOR Market Model

Quantum Deep Learning for Mutant COVID-19 Strain Prediction

RCRN: Real-world Character Image Restoration Network via Skeleton Extraction

Real-Time Portrait Stylization on the Edge

Supervised Attention in Sequence-to-Sequence Models for Speech Recognition

Topology-Preserving Shape Reconstruction and Registration via Neural Diffeomorphic Flow

Towards Interpretable Video Super-Resolution via Alternating Optimization

Unsupervised High-Resolution Portrait Gaze Correction and Animation

A Hybrid Quantum Memory Enabled Network at Room Temperature

A Study of Bug Resolution Characteristics in Popular Programming Languages

AttentionAnatomy: A unified framework for whole-body organs at risk segmentation using multiple partially annotated datasets

Audio-Visual Calibration with Polynomial Regression for 2-D Projection Using SVD-PHAT

Audio-Visual Event Localization via Recursive Fusion by Joint Co-Attention

Belief Propagation Neural Networks

Bipartite Graph Reasoning GANs for Person Image Generation

Cross-View Image Synthesis with Deformable Convolution and Attention Mechanism

Cycle In Cycle Generative Adversarial Networks for Keypoint-Guided Image Generation

Direct Observation of Quantum Percolation Dynamics

Dual Attention GANs for Semantic Image Synthesis

Dual In-painting Model for Unsupervised Gaze Correction and Animation in the Wild

Exocentric to Egocentric Image Generation via Parallel Generative Adversarial Network

Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation

On a stochastic Camassa-Holm type equation with higher order nonlinearities

Quantum Go Machine

Relevant Region Prediction for Crowd Counting

TensorFlow Solver for Quantum PageRank in Large-Scale Networks

The Habitable Exoplanet Observatory (HabEx) Mission Concept Study Final Report

Vector Vortex Beam Emitter Embedded in a Photonic Chip

Vector-Quantized Autoregressive Predictive Coding

When Dictionary Learning Meets Deep Learning: Deep Dictionary Learning and Coding Network for Image Recognition with Limited Data

XingGAN for Person Image Generation

Discriminative Segmental Cascades for Feature-Rich Phone Recognition

Efficient Segmental Cascades for Speech Recognition

End-to-End Training Approaches for Discriminative Segmental Models

Lexicon-Free Fingerspelling Recognition from Video: Data, Models, and Signer Adaptation

Non-classical photon correlation in a two-dimensional photonic lattice

Signer-independent Fingerspelling Recognition with Deep Neural Network Adaptation

Two-dimensional Transport Induced Linear Magneto-Resistance in Topological Insulator Bi$_2$Se$_3$ Nanoribbons

Magneto-transport Effects in Topological Insulator Bi$_2$Se$_3$ Nanoribbons