Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
51 works
0 followers
27 topics
4 close collaborators

Published work

51 published item(s)

preprint, 2026, arXiv

A Proof-of-Concept Study of Multitask Learning for Cranial Synthetic CT Generation Across Heterogeneous MRI Field Strengths

Accurate synthesis of computed tomography (CT) images from magnetic resonance imaging (MRI) is clinically valuable for cranial applications such as attenuation correction, radiotherapy planning, and image-guided interventions. However, heterogeneity across MRI field strengths and acquisition protocols limits the generalizability of existing methods. In this study, we formulate cranial CT synthesis as a modular, structurally coupled problem and propose a deep learning framework to improve robustness across heterogeneous MRI conditions. The model is designed to adapt to variations in field strength and imaging protocols while preserving anatomical consistency. Experiments on multi-site datasets demonstrate improved performance and generalization compared with conventional approaches. The proposed method enables reliable CT synthesis across heterogeneous MRI settings, supporting broader clinical translation.

preprint, 2025, arXiv

Ultrahigh-Energy Gamma-ray Emission Associated with Black Hole-Jet Systems

Black holes (BHs), among the most intriguing objects in the universe, can manifest themselves through electromagnetic radiation initiated by the accretion flow. Some stellar-mass BHs drive relativistic jets when accreting matter from their companion stars, forming microquasars. Non-thermal emission from the radio to tera-electronvolt (TeV) gamma-ray band has been observed from microquasars, indicating the acceleration of relativistic particles. Here we report the detection of four microquasars (SS 433, V4641 Sgr, GRS 1915+105, MAXI J1820+070) with spectra extending to the ultrahigh-energy (UHE; photon energy $E>100$ TeV) band and one microquasar (Cygnus X-1) with a spectrum approaching 100 TeV, using the Large High Altitude Air Shower Observatory (LHAASO). Notably, the total emission associated with SS 433 cannot be interpreted with a single leptonic component. In the UHE band, its emission is in spatial coincidence with a giant atomic cloud, which is consistent with a hadronic origin. An elongated source is discovered from V4641 Sgr with the spectrum continuing up to 800 TeV. The detection of UHE gamma rays demonstrates that accreting BHs and their environments can operate as extremely efficient accelerators of particles beyond 1 peta-electronvolt (PeV), suggesting that microquasars are important contributors to Galactic cosmic rays, especially around the 'knee' region.

preprint, 2024, arXiv

Rotating black hole mimicker surrounded by the string cloud

Traversable wormholes and regular black holes usually represent completely different scenarios. In the black bounce spacetime, however, they can be described by the same line element, which is very attractive. Furthermore, the black hole images taken by the EHT show that black holes have spin, so spin is an indispensable intrinsic property of black holes in the actual universe. In this work, we derive a rotating black hole mimicker surrounded by a string cloud (SC), which can be interpolated to represent a regular black hole spacetime and a traversable wormhole spacetime. We investigate the effect of the spin $a$ and the SC parameter $L$ on the observables (shadow radius $R_s$ and distortion $δ_s$) and the energy emission rate of the black hole mimicker surrounded by the SC. We find that the shadow for this spacetime is very sensitive to $L$, i.e., the SC parameter can significantly enlarge the boundary of the shadow.

preprint, 2022, arXiv

A nonlinear weighted anisotropic total variation regularization for electrical impedance tomography

This paper proposes a nonlinear weighted anisotropic total variation (NWATV) regularization technique for electrical impedance tomography (EIT). The key idea is to incorporate the internal inhomogeneity information (e.g., edges of the detected objects) into the EIT reconstruction process, aiming to preserve the conductivity profiles (to be detected). We study the NWATV image reconstruction by employing a novel soft thresholding based reformulation included in the alternating direction method of multipliers (ADMM). To evaluate the proposed approach, 2D and 3D numerical experiments and human EIT lung imaging are carried out. It is demonstrated that the properties of the internal inhomogeneity are well preserved and improved with the proposed regularization approach, in comparison to traditional total variation (TV) and recently proposed fidelity embedded regularization approaches. Owing to the simplicity of the proposed method, the computational cost is significantly decreased compared with the well established primal-dual algorithm. Meanwhile, it was found that the proposed regularization method is quite robust to the measurement noise, which is one of the main uncertainties in EIT.
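The abstract mentions a soft-thresholding step inside ADMM without giving its form. As a rough illustration only, the sketch below shows the standard scalar soft-thresholding (shrinkage) operator, which is the proximal operator of a scaled absolute value. The function name, the threshold `kappa`, and the sample values are assumptions made for this example; the paper's NWATV reformulation may differ.

```python
def soft_threshold(v, kappa):
    """Standard scalar shrinkage: move v toward zero by kappa, clamping at zero."""
    if v > kappa:
        return v - kappa
    if v < -kappa:
        return v + kappa
    return 0.0


# Hypothetical sample values: entries with magnitude at or below kappa become zero.
values = [-2.0, -0.3, 0.0, 0.5, 1.5]
shrunk = [soft_threshold(v, kappa=0.5) for v in values]
print(shrunk)
```

In a typical ADMM splitting for total variation, an operator of this kind is applied elementwise to the auxiliary gradient variable at each iteration, which is what keeps the per-iteration cost low.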

preprint, 2022, arXiv

Attribute Artifacts Removal for Geometry-based Point Cloud Compression

Geometry-based point cloud compression (G-PCC) can achieve remarkable compression efficiency for point clouds. However, it still leads to serious attribute compression artifacts, especially in low-bitrate scenarios. In this paper, we propose a Multi-Scale Graph Attention Network (MS-GAT) to remove the artifacts of point cloud attributes compressed by G-PCC. We first construct a graph based on point cloud geometry coordinates and then use Chebyshev graph convolutions to extract features of point cloud attributes. Considering that one point may be correlated with points both near and far away from it, we propose a multi-scale scheme to capture the short- and long-range correlations between the current point and its neighboring and distant points. To address the problem that different points may have different degrees of artifacts caused by adaptive quantization, we introduce the quantization step per point as an extra input to the proposed network. We also incorporate a weighted graph attentional layer into the network to pay special attention to the points with more attribute artifacts. To the best of our knowledge, this is the first attribute artifacts removal method for G-PCC. We validate the effectiveness of our method over various point clouds. Objective comparison results show that our proposed method achieves an average of 9.74% BD-rate reduction compared with Predlift and 10.13% BD-rate reduction compared with RAHT. Subjective comparison results show that visual artifacts such as color shifting, blurring, and quantization noise are reduced.

preprint, 2022, arXiv

CERL: A Unified Optimization Framework for Light Enhancement with Realistic Noise

Low-light images captured in the real world are inevitably corrupted by sensor noise. Such noise is spatially variant and highly dependent on the underlying pixel intensity, deviating from the oversimplified assumptions in conventional denoising. Existing light enhancement methods either overlook the important impact of real-world noise during enhancement, or treat noise removal as a separate pre- or post-processing step. We present Coordinated Enhancement for Real-world Low-light Noisy Images (CERL), which seamlessly integrates light enhancement and noise suppression parts into a unified and physics-grounded optimization framework. For the real low-light noise removal part, we customize a self-supervised denoising model that can easily be adapted without referring to clean ground-truth images. For the light enhancement part, we also improve the design of a state-of-the-art backbone. The two parts are then jointly formulated into one principled plug-and-play optimization. Our approach is compared against state-of-the-art low-light enhancement methods both qualitatively and quantitatively. Besides standard benchmarks, we further collect and test on a new realistic low-light mobile photography dataset (RLMP), whose mobile-captured photos display heavier realistic noise than those taken by high-quality cameras. CERL consistently produces the most visually pleasing and artifact-free results across all experiments. Our RLMP dataset and codes are available at: https://github.com/VITA-Group/CERL.

preprint, 2022, arXiv

Design, Uncertainty Analysis and Measurement of a Silicon-based Platelet THz Corrugated Horn

Platelet corrugated horns are a promising technology owing to their scalability to large corrugated horn arrays. In this paper, we present the design, fabrication, measurement and uncertainty analysis of a wideband 170-320 GHz platelet corrugated horn that features low sidelobes across the band (<-30 dB). We also propose, for the first time, an accurate and universal method to analyze the axial misalignment of the platelets. It is based on the mode matching (MM) method with a closed-form solution to off-axis circular waveguide discontinuities obtained by using the Graf addition theorem for Bessel functions. The uncertainties introduced in the fabrication have been quantitatively analyzed using the Monte Carlo method. The analysis shows that the cross-polarization of the corrugated horn degrades significantly with the axial misalignment. This explains well the discrepancy between the designed and the measured cross-polarization of platelet corrugated horns fabricated in the THz band. The method can be used to determine the fabrication tolerance needed for other THz corrugated horns and to evaluate the impact of the corrugated horn on astronomical observations.

preprint, 2022, arXiv

Flow-Guided Transformer for Video Inpainting

We propose a flow-guided transformer, which leverages the motion discrepancy exposed by optical flows to instruct the attention retrieval in the transformer for high-fidelity video inpainting. More specifically, we design a novel flow completion network to complete the corrupted flows by exploiting the relevant flow features in a local temporal window. With the completed flows, we propagate the content across video frames, and adopt the flow-guided transformer to synthesize the remaining corrupted regions. We decouple transformers along the temporal and spatial dimensions, so that we can easily integrate the locally relevant completed flows to instruct spatial attention only. Furthermore, we design a flow-reweight module to precisely control the impact of completed flows on each spatial transformer. For the sake of efficiency, we introduce a window partition strategy to both spatial and temporal transformers. In the spatial transformer in particular, we design a dual-perspective spatial MHSA, which integrates the global tokens into the window-based attention. Extensive experiments demonstrate the effectiveness of the proposed method qualitatively and quantitatively. Codes are available at https://github.com/hitachinsk/FGT.

preprint, 2022, arXiv

Local discontinuous Galerkin method for the Backward Feynman-Kac Equation

Anomalous diffusions are ubiquitous in nature, whose functional distributions are governed by the backward Feynman-Kac equation. In this paper, the local discontinuous Galerkin (LDG) method is used to solve the 2D backward Feynman-Kac equation in a rectangular domain. The spatial semi-discrete LDG scheme of the equivalent form (obtained by Laplace transform) of the original equation is established. After discussing the properties of the fractional substantial calculus, the stability and optimal convergence rates $O(h^{k+1})$ of the semi-discrete scheme are proved by choosing an appropriate generalized numerical flux. The $L1$ scheme on the graded meshes is used to deal with the weak singularity of the solution near the initial time. Based on the theoretical results of a semi-discrete scheme, we investigate the stability and convergence of the fully discrete scheme, which shows the optimal convergence rates $O(h^{k+1}+τ^{\min\{2-α,γδ\}})$. Numerical experiments are carried out to show the efficiency and accuracy of the proposed scheme. In addition, we also verify the effect of the central numerical flux on the convergence rates and the condition number of the coefficient matrix.
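The abstract refers to graded meshes near the initial time without defining them. The sketch below assumes the commonly used construction $t_n = T (n/N)^{γ}$ and reads $γ$ as the grading exponent; both are assumptions, since the abstract does not spell them out. The function name and the sample parameters are hypothetical.

```python
def graded_mesh(T, N, gamma):
    """Return N + 1 time levels t_n = T * (n / N) ** gamma on [0, T]."""
    return [T * (n / N) ** gamma for n in range(N + 1)]


# Hypothetical parameters: gamma > 1 clusters time levels near t = 0,
# where the solution is weakly singular; gamma = 1 gives a uniform mesh.
mesh = graded_mesh(T=1.0, N=4, gamma=2.0)
steps = [b - a for a, b in zip(mesh, mesh[1:])]
print(mesh)
print(steps)
```

With a grading exponent above one, the step sizes grow monotonically away from the initial time, so the finest resolution sits where the singularity is.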

preprint, 2022, arXiv

Motion-Focused Contrastive Learning of Video Representations

Motion, as the most distinct phenomenon in a video to involve changes over time, has been unique and critical to the development of video representation learning. In this paper, we ask the question: how important is motion, particularly for self-supervised video representation learning? To this end, we compose a duet of exploiting motion for data augmentation and feature learning in the regime of contrastive learning. Specifically, we present a Motion-focused Contrastive Learning (MCL) method that regards such a duet as the foundation. On one hand, MCL capitalizes on the optical flow of each frame in a video to temporally and spatially sample the tubelets (i.e., sequences of associated frame patches across time) as data augmentations. On the other hand, MCL further aligns gradient maps of the convolutional layers to optical flow maps from spatial, temporal and spatio-temporal perspectives, in order to ground motion information in feature learning. Extensive experiments conducted on the R(2+1)D backbone demonstrate the effectiveness of our MCL. On UCF101, the linear classifier trained on the representations learnt by MCL achieves 81.91% top-1 accuracy, outperforming ImageNet supervised pre-training by 6.78%. On Kinetics-400, MCL achieves 66.62% top-1 accuracy under the linear protocol. Code is available at https://github.com/YihengZhang-CV/MCL-Motion-Focused-Contrastive-Learning.

preprint, 2022, arXiv

Multiple-Objective Packet Routing Optimization for Aeronautical ad-hoc Networks

Providing Internet service above the clouds is of ever-increasing interest and in this context aeronautical ad-hoc networking (AANET) constitutes a promising solution. However, the optimization of packet routing in large ad hoc networks is quite challenging. In this paper, we develop a discrete $ε$ multi-objective genetic algorithm ($ε$-DMOGA) for jointly optimizing the end-to-end latency, the end-to-end spectral efficiency (SE), and the path expiration time (PET) that specifies how long the routing path can be relied on without re-optimizing the path. More specifically, a distance-based adaptive coding and modulation (ACM) scheme specifically designed for aeronautical communications is exploited for quantifying each link's achievable SE. Furthermore, the queueing delay at each node is also incorporated into the multiple-objective optimization metric. Our $ε$-DMOGA assisted multiple-objective routing optimization is validated by real historical flight data collected over the Australian airspace on two selected representative dates.

preprint, 2022, arXiv

Neural Compression-Based Feature Learning for Video Restoration

Efficiently utilizing temporal features is crucial, yet challenging, for video restoration. The temporal features usually contain various noisy and uncorrelated information, and they may interfere with the restoration of the current frame. This paper proposes learning noise-robust feature representations to help video restoration. We are inspired by the observation that a neural codec is a natural denoiser. In a neural codec, noisy and uncorrelated contents, which are hard to predict but cost many bits, are more likely to be discarded to save bitrate. Therefore, we design a neural compression module to filter the noise and keep the most useful information in features for video restoration. To achieve robustness to noise, our compression module adopts a spatial channel-wise quantization mechanism to adaptively determine the quantization step size for each position in the latent. Experiments show that our method can significantly boost the performance on video denoising, where we obtain 0.13 dB improvement over BasicVSR++ with only 0.23x FLOPs. Meanwhile, our method also obtains SOTA results on video deraining and dehazing.

preprint, 2022, arXiv

Recurrent Dynamic Embedding for Video Object Segmentation

Space-time memory (STM) based video object segmentation (VOS) networks usually enlarge the memory bank every several frames, which yields excellent performance. However, 1) the hardware cannot withstand the ever-increasing memory requirements as the video length increases, and 2) storing a large amount of information inevitably introduces considerable noise, which is not conducive to reading the most important information from the memory bank. In this paper, we propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size. Specifically, we explicitly generate and update RDE by the proposed Spatio-temporal Aggregation Module (SAM), which exploits the cue of historical information. To avoid error accumulation owing to the recurrent usage of SAM, we propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos. Moreover, the predicted masks in the memory bank are inaccurate due to the inaccurate network inference, which affects the segmentation of the query frame. To address this problem, we design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank. Extensive experiments show our method achieves the best tradeoff between performance and speed. Code is available at https://github.com/Limingxing00/RDE-VOS-CVPR2022.

preprint, 2022, arXiv

Retinal Vessel Segmentation with Pixel-wise Adaptive Filters

Accurate retinal vessel segmentation is challenging because of the complex texture of retinal vessels and low imaging contrast. Previous methods generally refine segmentation results by cascading multiple deep networks, which are time-consuming and inefficient. In this paper, we propose two novel methods to address these challenges. First, we devise a light-weight module, named multi-scale residual similarity gathering (MRSG), to generate pixel-wise adaptive filters (PA-Filters). Different from cascading multiple deep networks, only one PA-Filter layer can improve the segmentation results. Second, we introduce a response cue erasing (RCE) strategy to enhance the segmentation accuracy. Experimental results on the DRIVE, CHASE_DB1, and STARE datasets demonstrate that our proposed method outperforms state-of-the-art methods while maintaining a compact structure. Code is available at https://github.com/Limingxing00/Retinal-Vessel-Segmentation-ISBI20222.

preprint, 2022, arXiv

The ringing of quantum corrected Schwarzschild black hole with GUP

Schwarzschild black holes with quantum corrections are studied under scalar field perturbations and electromagnetic field perturbations to analyze the effect of the correction term on the potential function and quasinormal mode (QNM). In classical general relativity, spacetime is continuous and there is no so-called minimal length. The introduction of the correction terms of the generalized uncertainty principle (GUP), through the parameter $β$, can change the singularity structure of the black hole gauge and may lead to discretization in time and space. We apply the sixth-order WKB method to approximate the QNM of Schwarzschild black holes with quantum corrections and perform numerical analysis to derive the results of the method. We also find that the effective potential and QNM in scalar fields are larger than those in electromagnetic fields.

preprint, 2022, arXiv

Towards Hybrid-Optimization Video Coding

Video coding is essentially a mathematical optimization problem of rate and distortion. To solve this complex optimization problem, two popular video coding frameworks have been developed: block-based hybrid video coding and end-to-end learned video coding. If we rethink video coding from the perspective of optimization, we find that the two existing frameworks represent two directions of optimization solutions. Block-based hybrid coding represents the discrete optimization solution because those irrelevant coding modes are discrete in mathematics. It searches for the best one among multiple starting points (i.e., modes). However, the search is not efficient enough. On the other hand, end-to-end learned coding represents the continuous optimization solution because gradient descent is based on a continuous function. It optimizes a group of model parameters efficiently by a numerical algorithm. However, limited by only one starting point, it easily falls into a local optimum. To better solve the optimization problem, we propose to regard video coding as a hybrid of the discrete and continuous optimization problems, and use both search and a numerical algorithm to solve it. Our idea is to provide multiple discrete starting points in the global space and efficiently optimize the local optimum around each point by a numerical algorithm. Finally, we search for the global optimum among those local optima. Guided by the hybrid optimization idea, we design a hybrid optimization video coding framework, which is built entirely on continuous deep networks and also contains some discrete modes. We conduct a comprehensive set of experiments. Compared to the continuous optimization framework, our method outperforms pure learned video coding methods. Meanwhile, compared to the discrete optimization framework, our method achieves comparable performance to HEVC reference software HM16.10 in PSNR.
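The hybrid idea described in this abstract (several discrete starting points, a local numerical refinement around each, then a search over the resulting local optima) can be illustrated with a toy multi-start gradient descent. This is only a sketch under stated assumptions: the one-dimensional objective `f`, the starting points, the learning rate, and the step count are all hypothetical and stand in for a rate-distortion cost, coding modes, and network training, none of which are modeled here.

```python
def f(x):
    """Toy non-convex objective with two local minima (hypothetical stand-in for a cost)."""
    return x**4 - 3 * x**2 + x


def grad_f(x):
    """Analytic derivative of f."""
    return 4 * x**3 - 6 * x + 1


def descend(x, lr=0.01, steps=2000):
    """Continuous step: plain gradient descent from a single starting point."""
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x


# Discrete step: several starting points play the role of discrete modes.
starts = [-2.0, 0.0, 2.0]
local_optima = [descend(x0) for x0 in starts]

# Final search: keep the best of the refined local optima.
best_x = min(local_optima, key=f)
print(local_optima, best_x)
```

A single start at 2.0 settles in the shallower minimum on the right, while the search over all refined candidates selects the deeper minimum on the left, which is the failure mode and the remedy the abstract describes.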

preprint, 2021, arXiv

Engineered Raman Lasing in Photonic Integrated Chalcogenide Microresonators

Chalcogenide glass (ChG) is an attractive material for integrated nonlinear photonics due to its wide transparency and high nonlinearity, and its capability of being directly deposited and patterned on silicon wafer substrates. It has a singular Raman effect among amorphous materials. Yet, the Raman lasing performance in high-quality, chip-integrated ChG microresonators remains unexplored. Here, we demonstrate engineered Raman lasing dynamics based on home-developed photonic integrated high-Q ChG microresonators. With a quality factor above 10^6, we achieve a record-low lasing threshold of 3.25 mW among integrated planar photonic platforms. Both single-mode Raman lasers and a broadband Raman-Kerr comb are observed and characterized; which regime appears depends on the dispersion of our flexible photonic platform and is engineered by tuning the waveguide geometric size. The tunability of such a chip-scale Raman laser is also demonstrated through tuning the pump wavelength and tuning the operating temperature on the chip. This allows access to single-mode lasing at arbitrary wavelengths in the range 1615-1755 nm. Our results may contribute to the understanding of rich Raman and Kerr nonlinear interactions in dissipative and nonlinear microresonators and, on the application side, may pave the way to efficient chip-scale Raman lasers that are highly desired in spectroscopic applications in the infrared.

preprint, 2021, arXiv

Marangoni Convection-Driven Laser Fountains and Waves on Free Surfaces of Liquids

It is well accepted that an outward Marangoni convection from a low surface tension region will depress the surface. Here, we report that this established perception is only valid for thin liquid films. Using surface laser heating, we show that in deep liquids a laser beam actually pulls the fluid up above the free surface, generating fountains with different shapes. With decreasing liquid depth, by contrast, a transition from fountain to indentation, with fountain-in-indentation, is observed. Further, high-speed imaging reveals a transient surface process before steady elevation is formed, and this dynamic deformation is subsequently utilized to resonantly excite giant surface waves by a modulated laser beam. Computational fluid dynamics models reveal the underlying flow patterns and quantify the depth-dependent and time-resolved surface deformations. Our discoveries and techniques have upended the century-old perception and opened up a new regime of interdisciplinary research and applications of Marangoni-induced interface phenomena and optocapillary fluidic surfaces: the control of fluids with light.

preprint, 2021, arXiv

Robust Classification using Hidden Markov Models and Mixtures of Normalizing Flows

We test the robustness of a maximum-likelihood (ML) based classifier where sequential data as observation is corrupted by noise. The hypothesis is that a generative model, that combines the state transitions of a hidden Markov model (HMM) and the neural network based probability distributions for the hidden states of the HMM, can provide a robust classification performance. The combined model is called normalizing-flow mixture model based HMM (NMM-HMM). It can be trained using a combination of expectation-maximization (EM) and backpropagation. We verify the improved robustness of NMM-HMM classifiers in an application to speech recognition.

preprint, 2021, arXiv

Soft magnetic microrobot doped with porous silica for stability-enhanced multimodal locomotion in nonideal environment

As an emerging field of robotics, magnetic-field-controlled soft microrobots have broad application prospects owing to their flexibility, locomotion diversity, and remote controllability. Magnetic soft microrobots can perform multimodal locomotion under the control of a magnetic field, which may have potential applications in precision medicine. However, previous research has mainly focused on new locomotion in a relatively ideal environment, with little exploration of the ability of magnetic microrobot locomotion to resist external disturbances and proceed in a nonideal environment. Here, a porous silica-doped soft magnetic microrobot is constructed for enhanced stability of multimodal locomotion in the nonideal biological environment. Porous silica spheres are doped into a NdFeB-silicone elastomer base, improving adhesion properties as well as refining the comprehensive mechanical properties of the microrobot. Multimodal locomotion is achieved, and the influence of porous silica doping on the stability of each locomotion mode in a nonideal environment is explored in depth. Motions in nonideal circumstances such as climbing, loading, current rushing, wind blowing, and obstacle hindering are conducted successfully with porous silica doping. Such a stability-enhanced multimodal locomotion system can be used in biocatalysis as well as thrombus removal, and its prospect for precision medicine is highlighted by an in vivo demonstration of multimodal locomotion with nonideal disturbance.

preprint, 2021, arXiv

Structural engineering from an inverse problems perspective

The field of structural engineering is vast, spanning areas from the design of new infrastructure to the assessment of existing infrastructure. From the outset, traditional entry-level university courses teach students to analyse structural response given data including external forces, geometry, member sizes, restraint, etc., which characterises a forward problem (structural causalities $\to$ structural response). Shortly thereafter, junior engineers are introduced to structural design where they aim to, for example, select an appropriate structural form for members based on design criteria, which is the inverse of what they previously learned. Similar inverse realisations also hold true in structural health monitoring and a number of structural engineering sub-fields (response $\to$ structural causalities). In this light, we aim to demonstrate that many structural engineering sub-fields may be fundamentally or partially viewed as inverse problems and thus benefit from the rich and established methodologies of the inverse problems community. To this end, we conclude that the future of inverse problems in structural engineering is inexorably linked to engineering education and machine learning developments.

preprint, 2021, arXiv

Synergy Between Semantic Segmentation and Image Denoising via Alternate Boosting

The capability of image semantic segmentation may deteriorate when the input image is noisy, in which case image denoising prior to segmentation helps. Both image denoising and semantic segmentation have advanced significantly with the progress of deep learning. Thus, we are interested in the synergy between them by using a holistic deep model. We observe that not only does denoising help combat the drop in segmentation accuracy due to noise, but pixel-wise semantic information also boosts the capability of denoising. We then propose a boosting network to perform denoising and segmentation alternately. The proposed network is composed of multiple segmentation and denoising blocks (SDBs), each of which estimates a semantic map and then uses the map to regularize denoising. Experimental results show that the denoised image quality is improved substantially and the segmentation accuracy is improved to close to that of clean images. Our code and models will be made publicly available.

preprint, 2020, arXiv

$α$ Belief Propagation for Approximate Inference

The belief propagation (BP) algorithm is a widely used message-passing method for inference in graphical models. BP on loop-free graphs converges in linear time. But for graphs with loops, BP's performance is uncertain, and the understanding of its solution is limited. To gain a better understanding of BP in general graphs, we derive an interpretable belief propagation algorithm that is motivated by minimization of a localized $α$-divergence. We term this algorithm $α$ belief propagation ($α$-BP). It turns out that $α$-BP generalizes standard BP. In addition, this work studies the convergence properties of $α$-BP. We prove and offer the convergence conditions for $α$-BP. Experimental simulations on random graphs validate our theoretical results. The application of $α$-BP to practical problems is also demonstrated.

preprint, 2020, arXiv

A Game Theoretic Analysis of LQG Control under Adversarial Attack

Motivated by recent works addressing adversarial attacks on deep reinforcement learning, a deception attack on linear quadratic Gaussian control is studied in this paper. In the considered attack model, the adversary can manipulate the observation of the agent subject to a mutual information constraint. The adversarial problem is formulated as a novel dynamic cheap talk game to capture the strategic interaction between the adversary and the agent, the asymmetry of information availability, and the system dynamics. Necessary and sufficient conditions are provided for subgame perfect equilibria to exist in pure strategies and in behavioral strategies; and characteristics of the equilibria and the resulting control rewards are given. The results show that pure strategy equilibria are informative, while only babbling equilibria exist in behavioral strategies. Numerical results are shown to illustrate the impact of strategic adversarial interaction.

preprint, 2020, arXiv

Bottom-Up Human Pose Estimation by Ranking Heatmap-Guided Adaptive Keypoint Estimates

The typical bottom-up human pose estimation framework includes two stages, keypoint detection and grouping. Most existing works focus on developing grouping algorithms, e.g., associative embedding, and pixel-wise keypoint regression that we adopt in our approach. We present several schemes that have rarely or not thoroughly been studied before for improving keypoint detection and grouping (keypoint regression) performance. First, we exploit the keypoint heatmaps for pixel-wise keypoint regression instead of separating them, in order to improve keypoint regression. Second, we adopt a pixel-wise spatial transformer network to learn adaptive representations for handling the scale and orientation variance to further improve keypoint regression quality. Last, we present a joint shape and heatvalue scoring scheme to promote the estimated poses that are more likely to be true poses. Together with the tradeoff heatmap estimation loss for balancing the background and keypoint pixels and thus improving heatmap estimation quality, we obtain the state-of-the-art bottom-up human pose estimation result. Code is available at https://github.com/HRNet/HRNet-Bottom-up-Pose-Estimation.

preprint2020arXiv

Deep High-Resolution Representation Learning for Visual Recognition

High-resolution representations are essential for position-sensitive vision problems, such as human pose estimation, semantic segmentation, and object detection. Existing state-of-the-art frameworks first encode the input image as a low-resolution representation through a subnetwork that is formed by connecting high-to-low resolution convolutions in series (e.g., ResNet, VGGNet), and then recover the high-resolution representation from the encoded low-resolution representation. Instead, our proposed network, named High-Resolution Network (HRNet), maintains high-resolution representations through the whole process. There are two key characteristics: (i) connect the high-to-low resolution convolution streams in parallel; (ii) repeatedly exchange the information across resolutions. The benefit is that the resulting representation is semantically richer and spatially more precise. We show the superiority of the proposed HRNet in a wide range of applications, including human pose estimation, semantic segmentation, and object detection, suggesting that the HRNet is a stronger backbone for computer vision problems. All the code is available at https://github.com/HRNet.

preprint2020arXiv

Dual Temporal Memory Network for Efficient Video Object Segmentation

Video Object Segmentation (VOS) is typically formulated in a semi-supervised setting. Given the ground-truth segmentation mask on the first frame, the task of VOS is to track and segment the single or multiple objects of interest in the remaining frames of the video at the pixel level. One of the fundamental challenges in VOS is how to make the best use of the temporal information to boost the performance. We present an end-to-end network which stores short- and long-term video sequence information preceding the current frame as the temporal memories to address the temporal modeling in VOS. Our network consists of two temporal sub-networks: a short-term memory sub-network and a long-term memory sub-network. The short-term memory sub-network models the fine-grained spatial-temporal interactions between local regions across neighboring frames in video via a graph-based learning framework, which can well preserve the visual consistency of local regions over time. The long-term memory sub-network models the long-range evolution of objects via a Simplified-Gated Recurrent Unit (S-GRU), making the segmentation robust against occlusions and drift errors. In our experiments, we show that our proposed method achieves favorable and competitive performance on three frequently used VOS datasets, including DAVIS 2016, DAVIS 2017 and Youtube-VOS, in terms of both speed and accuracy.

preprint2020arXiv

Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation

The dominant speech separation models are based on complex recurrent or convolutional neural networks that model speech sequences indirectly conditioned on context, such as passing information through many intermediate states in a recurrent neural network, leading to suboptimal separation performance. In this paper, we propose a dual-path transformer network (DPTNet) for end-to-end speech separation, which introduces direct context-awareness in the modeling of speech sequences. By introducing an improved transformer, elements in speech sequences can interact directly, which enables DPTNet to model speech sequences with direct context-awareness. The improved transformer in our approach learns the order information of the speech sequences without positional encodings by incorporating a recurrent neural network into the original transformer. In addition, the dual-path structure makes our model efficient for extremely long speech sequence modeling. Extensive experiments on benchmark datasets show that our approach outperforms the current state of the art (20.6 dB SDR on the public WSJ0-2mix data corpus).

preprint2020arXiv

Efficient Integer-Arithmetic-Only Convolutional Neural Networks

Integer-arithmetic-only networks have been demonstrated to be effective in reducing computational cost and ensuring cross-platform consistency. However, previous works usually report a decline in the inference accuracy when converting well-trained floating-point-number (FPN) networks into integer networks. We analyze this phenomenon and find that the decline is due to activation quantization. Specifically, when we replace conventional ReLU with Bounded ReLU, how to set the bound for each neuron is a key problem. Considering the tradeoff between activation quantization error and network learning ability, we set an empirical rule to tune the bound of each Bounded ReLU. We also design a mechanism to handle the cases of feature map addition and feature map concatenation. Based on the proposed method, our trained 8-bit integer ResNet outperforms the 8-bit networks of Google's TensorFlow and NVIDIA's TensorRT for image recognition. We also experiment on VDSR for image super-resolution and on VRCNN for compression artifact reduction, both of which are regression tasks that natively require high inference accuracy. Our integer networks achieve performance equivalent to that of the corresponding FPN networks, but have only 1/4 of the memory cost and run 2x faster on modern GPUs. Our code and models can be found at github.com/HengRuiZ/brelu.
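
To illustrate the role of the bound mentioned in this abstract, here is a minimal sketch of a Bounded ReLU followed by uniform 8-bit quantization. It is not the authors' implementation: the function names, the bound value 6.0 and the uniform rounding scheme are assumptions made for illustration, and the paper's empirical rule for choosing the bound is not reproduced.

```python
def bounded_relu(x, bound):
    """Clamp x to the range [0, bound]."""
    return min(max(x, 0.0), bound)


def quantize_activation(x, bound, bits=8):
    """Map an activation to an unsigned integer code in [0, 2**bits - 1].

    Returns the integer code and the scale (real value of one integer step).
    """
    levels = (1 << bits) - 1
    scale = bound / levels
    code = round(bounded_relu(x, bound) / scale)
    return code, scale


def dequantize(code, scale):
    """Recover the approximate real-valued activation from its integer code."""
    return code * scale


# Hypothetical bound of 6.0: a smaller bound gives finer steps but clips more values.
code, scale = quantize_activation(3.0, bound=6.0)
approx = dequantize(code, scale)
```

A tighter bound reduces the quantization step (and so the rounding error) but clips more activations, which is the tradeoff the abstract refers to.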

preprint2020arXiv

Foreground-Background Imbalance Problem in Deep Object Detectors: A Review

Recent years have witnessed the remarkable developments made by deep learning techniques for object detection, a fundamentally challenging problem of computer vision. Nevertheless, there are still difficulties in training accurate deep object detectors, one of which is owing to the foreground-background imbalance problem. In this paper, we survey the recent advances about the solutions to the imbalance problem. First, we analyze the characteristics of the imbalance problem in different kinds of deep detectors, including one-stage and two-stage ones. Second, we divide the existing solutions into two categories: sampling heuristics and non-sampling schemes, and review them in detail. Third, we experimentally compare the performance of some state-of-the-art solutions on the COCO benchmark. Promising directions for future work are also discussed.

preprint2020arXiv

Graph Neural Networks for Massive MIMO Detection

In this paper, we innovatively use graph neural networks (GNNs) to learn a message-passing solution for the inference task of massive multiple-input multiple-output (MIMO) detection in wireless communication. We adopt a graphical model based on the Markov random field (MRF) where belief propagation (BP) yields poor results when it assumes a uniform prior over the transmitted symbols. Numerical simulations show that, under the uniform prior assumption, our GNN-based MIMO detection solution outperforms the minimum mean-squared error (MMSE) baseline detector, in contrast to BP. Furthermore, experiments demonstrate that the performance of the algorithm slightly improves by incorporating MMSE information into the prior.

preprint2020arXiv

Is There Tradeoff between Spatial and Temporal in Video Super-Resolution?

Recent advances in deep learning have led to great success for image and video super-resolution (SR) methods that are based on convolutional neural networks (CNNs). For video SR, advanced algorithms have been proposed to exploit the temporal correlation between low-resolution (LR) video frames, and/or to super-resolve a frame with multiple LR frames. These methods pursue higher quality of super-resolved frames, where the quality is usually measured frame by frame using, e.g., PSNR. However, frame-wise quality may not reveal the consistency between frames. If an algorithm is applied to each frame independently (which is the case for most previous methods), the algorithm may cause temporal inconsistency, which can be observed as flickering. It is a natural requirement to improve both frame-wise fidelity and between-frame consistency, which are termed spatial quality and temporal quality, respectively. Then we may ask, is a method optimized for spatial quality also optimized for temporal quality? Can we optimize the two quality metrics jointly?

preprint2020arXiv

Learning Trailer Moments in Full-Length Movies

A movie's key moments stand out of the screenplay to grab an audience's attention and make movie browsing efficient. But a lack of annotations makes the existing approaches not applicable to movie key moment detection. To get rid of human annotations, we leverage the officially released trailers as weak supervision to learn a model that can detect the key moments from full-length movies. We introduce a novel ranking network that utilizes the Co-Attention between movies and trailers as guidance to generate the training pairs, where the moments highly correlated with trailers are expected to be scored higher than the uncorrelated moments. Additionally, we propose a Contrastive Attention module to enhance the feature representations such that the comparative contrast between features of the key and non-key moments is maximized. We construct the first movie-trailer dataset, and the proposed Co-Attention assisted ranking network shows superior performance even over the supervised approach. The effectiveness of our Contrastive Attention module is also demonstrated by the performance improvement over the state of the art on the public benchmarks.

preprint2020arXiv

Neural Network based Explicit Mixture Models and Expectation-maximization based Learning

We propose two neural network based mixture models in this article. The proposed mixture models are explicit in nature. The explicit models have analytical forms with the advantages of computing likelihood and efficiency of generating samples. Computation of likelihood is an important aspect of our models. Expectation-maximization based algorithms are developed for learning parameters of the proposed models. We provide sufficient conditions to realize the expectation-maximization based learning. The main requirements are invertibility of neural networks that are used as generators and Jacobian computation of functional form of the neural networks. The requirements are practically realized using a flow-based neural network. In our first mixture model, we use multiple flow-based neural networks as generators. Naturally the model is complex. A single latent variable is used as the common input to all the neural networks. The second mixture model uses a single flow-based neural network as a generator to reduce complexity. The single generator has a latent variable input that follows a Gaussian mixture distribution. We demonstrate efficiency of proposed mixture models through extensive experiments for generating samples and maximum likelihood based classification.
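
The two requirements named above, invertibility and a computable Jacobian, are what make the likelihood explicit. As a hedged sketch in notation that is assumed here rather than taken from the paper (generator $g$, latent density $p_Z$, mixture weights $\pi_k$), the standard change-of-variables identity and a $K$-component mixture built on it read:

```latex
% Change of variables for an invertible generator x = g(z)
p_X(x) = p_Z\!\left(g^{-1}(x)\right)\,\left|\det \frac{\partial g^{-1}(x)}{\partial x}\right|

% Mixture over K invertible generators g_1, ..., g_K sharing one latent density p_Z
p(x) = \sum_{k=1}^{K} \pi_k \, p_Z\!\left(g_k^{-1}(x)\right)\,\left|\det \frac{\partial g_k^{-1}(x)}{\partial x}\right|,
\qquad \pi_k \ge 0,\quad \sum_{k=1}^{K} \pi_k = 1
```

The second form corresponds loosely to the first model described in the abstract (several generators, one common latent input); the exact parameterization used by the authors may differ.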

preprint2020arXiv

On Dominant Interference in Random Networks and Communication Reliability

In this paper, we study the characteristics of dominant interference power with directional reception in a random network modelled by a Poisson Point Process. Additionally, the Laplace functional of cumulative interference excluding the $n$ dominant interferers is also derived, which turns out to be a generalization of omni-directional reception and complete accumulative interference. As an application of these results, we study the impact of directional receivers in random networks in terms of outage probability and error probability with queue length constraint.

preprint2020arXiv

Optimizing electrode positions in 2D Electrical Impedance Tomography using deep learning

Electrical Impedance Tomography (EIT) is a powerful tool for non-destructive evaluation, state estimation, and process tomography, among numerous other use cases. For these applications, and in order to reliably reconstruct images of a given process using EIT, we must obtain high-quality voltage measurements from the target of interest. As such, it is obvious that the locations of the electrodes used for measuring play a key role in this task. Yet, to date, methods for optimally placing electrodes either require knowledge of the EIT target (which is, in practice, never fully known) or are computationally difficult to implement numerically. In this paper, we circumvent these challenges and present a straightforward deep learning based approach for optimizing electrode positions. It is found that the optimized electrode positions outperformed "standard" uniformly distributed electrode layouts in all test cases. Further, it is found that the use of optimized electrode positions computed using the approach derived herein can reduce errors in EIT reconstructions as well as improve the distinguishability of EIT measurements.

preprint2020arXiv

Optimizing Wireless Systems Using Unsupervised and Reinforced-Unsupervised Deep Learning

Resource allocation and transceivers in wireless networks are usually designed by solving optimization problems subject to specific constraints, which can be formulated as variable or functional optimization. If the objective and constraint functions of a variable optimization problem can be derived, standard numerical algorithms can be applied for finding the optimal solution, which however incur high computational cost when the dimension of the variable is high. To reduce the on-line computational complexity, learning the optimal solution as a function of the environment's status by deep neural networks (DNNs) is an effective approach. DNNs can be trained under the supervision of optimal solutions, which however is not applicable to scenarios without models or to functional optimization where the optimal solutions are hard to obtain. If the objective and constraint functions are unavailable, reinforcement learning can be applied to find the solution of a functional optimization problem, which is however not tailored to optimization problems in wireless networks. In this article, we introduce unsupervised and reinforced-unsupervised learning frameworks for solving both variable and functional optimization problems without the supervision of the optimal solutions. When the mathematical model of the environment is completely known and the distribution of the environment's status is known or unknown, we can invoke an unsupervised learning algorithm. When the mathematical model of the environment is incomplete, we introduce reinforced-unsupervised learning algorithms that learn the model by interacting with the environment. Our simulation results confirm the applicability of these learning frameworks by taking a user association problem as an example.

preprint2020arXiv

Powering Hidden Markov Model by Neural Network based Generative Models

The hidden Markov model (HMM) has been successfully used for sequential data modeling problems. In this work, we propose to power the modeling capacity of HMM by bringing in neural network based generative models. The proposed model is termed GenHMM. In the proposed GenHMM, each HMM hidden state is associated with a neural network based generative model that has tractability of exact likelihood and provides efficient likelihood computation. A generative model in GenHMM consists of a mixture of generators that are realized by flow models. A learning algorithm for GenHMM is proposed in the expectation-maximization framework. The convergence of GenHMM learning is analyzed. We demonstrate the efficiency of GenHMM by classification tasks on practical sequential data. Code available at https://github.com/FirstHandScientist/genhmm.

preprint2020arXiv

Propagation of a plane-strain hydraulic fracture accounting for a rough cohesive zone

The quasi-brittle nature of rocks challenges the basic assumptions of linear hydraulic fracture mechanics (LHFM): linear elastic fracture mechanics and smooth parallel plates lubrication fluid flow. We relax these hypotheses and investigate the growth of a plane-strain hydraulic fracture in an impermeable medium accounting for a rough cohesive zone and a fluid lag. In addition to a dimensionless toughness and the time-scale of coalescence of the fluid and fracture fronts as in the LHFM case, the solution now also depends on the in-situ-to-cohesive stress ratio and the intensity of the flow deviation induced by aperture roughness. The solution is appropriately described by a nucleation time-scale, which delineates the fracture growth into a nucleation phase, an intermediate stage and a late time stage where convergence toward LHFM predictions finally occurs. A highly non-linear hydro-mechanical coupling takes place as the fluid front enters the rough cohesive zone which itself evolves during the nucleation and intermediate stages. This coupling leads to significant additional viscous flow dissipation. As a result, the fracture evolution deviates from LHFM solutions with shorter fracture lengths, larger widths and net pressures. These deviations ultimately decrease at late times as the lag and cohesive zone fractions both become smaller. The deviations increase with larger dimensionless toughness and in-situ-to-cohesive stress ratio, as both further localize viscous dissipation near the fluid front located in the rough cohesive zone. The convergence toward LHFM can occur at very late time for realistic values of in-situ-to-cohesive stress ratio encountered at depth. The impact of a rough cohesive zone appears to be prominent for laboratory experiments and short in-situ injections in quasi-brittle rocks with ultimately a larger energy demand compared to LHFM predictions.

preprint2020arXiv

Region-based Energy Neural Network for Approximate Inference

Region-based free energy was originally proposed for generalized belief propagation (GBP) to improve loopy belief propagation (loopy BP). In this paper, we propose a neural network based energy model for inference in general Markov random fields (MRFs), which directly minimizes the region-based free energy defined on region graphs. We term our model Region-based Energy Neural Network (RENN). Unlike message-passing algorithms, RENN avoids iterative message propagation and is faster. Also different from recent deep neural network based models, inference by RENN does not require sampling, and RENN works on general MRFs. RENN can also be employed for MRF learning. Our experiments on marginal distribution estimation, partition function estimation, and learning of MRFs show that RENN outperforms the mean field method, loopy BP, GBP, and the state-of-the-art neural network based model.

preprint2020arXiv

SSFN -- Self Size-estimating Feed-forward Network with Low Complexity, Limited Need for Human Intervention, and Consistent Behaviour across Trials

We design a self size-estimating feed-forward network (SSFN) using a joint optimization approach for estimation of the number of layers, the number of nodes and learning of weight matrices. The learning algorithm has a low computational complexity, preferably within a few minutes using a laptop. In addition, the algorithm has a limited need for human intervention to tune parameters. SSFN grows from a small-size network to a large-size network, guaranteeing a monotonically non-increasing cost with addition of nodes and layers. The learning approach uses a judicious combination of the 'lossless flow property' of some activation functions, convex optimization and random matrix instances. Consistent performance -- low variation across Monte-Carlo trials -- is found for inference performance (classification accuracy) and estimation of network size.

preprint2020arXiv

Time-lapse reconstruction of the fracture front from diffracted waves arrivals in laboratory hydraulic fracture experiments

4D acoustic imaging via an array of 32 sources / 32 receivers is used to monitor a hydraulic fracture propagating in a 250 mm cubic specimen under a true-triaxial state of stress. We present a method based on the arrivals of diffracted waves to reconstruct the fracture geometry (and fluid front when distinct from the fracture front). Using Bayesian model selection, we rank different possible fracture geometries (radial, elliptical, tilted or not) and estimate model error. The imaging is repeated every 4 seconds and provides a quantitative measurement of the growth of these low velocity fractures. We test the proposed method on two experiments performed in two different rocks (marble and gabbro) under experimental conditions characteristic respectively of the fluid lag-viscosity (marble) and toughness (gabbro) dominated hydraulic fracture propagation regimes. In both experiments, about 150 to 200 source-receiver combinations exhibit clear diffracted wave arrivals. The results of the inversion indicate a radial geometry evolving slightly into an ellipse towards the end of the experiment when the fractures feel the specimen boundaries. The estimated modelling error with all models is of the order of the wave arrival picking error. Posterior estimates indicate an uncertainty of the order of a millimeter on the fracture front location for a given acquisition sequence. The reconstructed fracture evolution from diffracted waves is shown to be consistent with the analysis of $90^{\circ}$ incidence transmitted waves across the growing fracture.

preprint2020arXiv

Transferring and Regularizing Prediction for Semantic Segmentation

Semantic segmentation often requires a large set of images with pixel-level annotations. In view of extremely expensive expert labeling, recent research has shown that models trained on photo-realistic synthetic data (e.g., computer games) with computer-generated annotations can be adapted to real images. Despite this progress, without constraining the prediction on real images, the models will easily overfit on synthetic data due to severe domain mismatch. In this paper, we exploit the intrinsic properties of semantic segmentation in a novel way to alleviate this problem for model transfer. Specifically, we present a Regularizer of Prediction Transfer (RPT) that imposes the intrinsic properties as constraints to regularize model transfer in an unsupervised fashion. These constraints include patch-level, cluster-level and context-level semantic prediction consistencies at different levels of image formation. As the transfer is label-free and data-driven, the robustness of prediction is addressed by selectively involving a subset of image regions for model regularization. Extensive experiments are conducted to verify the proposal of RPT on the transfer of models trained on GTA5 and SYNTHIA (synthetic data) to the Cityscapes dataset (urban street scenes). RPT shows consistent improvements when injecting the constraints on several neural networks for semantic segmentation. More remarkably, when integrating RPT into the adversarial-based segmentation framework, we report the best results to date: mIoU of 53.2%/51.7% when transferring from GTA5/SYNTHIA to Cityscapes, respectively.

preprint2020arXiv

Will Scale-free Popularity Develop Scale-free Geo-social Networks?

Empirical results show that spatial factors such as distance, population density and communication range affect our social activities, also reflected by the development of ties in social networks. This motivates the need for social network models that take these spatial factors into account. Therefore, in this paper we propose a gravity-law-based geo-social network model, where connections develop according to the popularity of the individuals, but are constrained through their geographic distance and the surrounding population density. Specifically, we consider a power-law distributed popularity, and random node positions governed by a Poisson point process. We evaluate the characteristics of the emerging networks, considering the degree distribution, the average degree of neighbors and the local clustering coefficient. These local metrics reflect the robustness of the network, the information dissemination speed and the communication locality. We show that unless the communication range is strictly limited, the emerging networks are scale-free, with a rank exponent affected by the spatial factors. Even the average neighbor degree and the local clustering coefficient show tendencies known in non-geographic scale-free networks, at least when considering individuals with low popularity. At high-popularity values, however, the spatial constraints lead to popularity-independent average neighbor degrees and clustering coefficients.

preprint2019arXiv

A Comprehensive Benchmark for Single Image Compression Artifacts Reduction

We present a comprehensive study and evaluation of existing single image compression artifacts removal algorithms, using a new 4K resolution benchmark including diversified foreground objects and background scenes with rich structures, called Large-scale Ideal Ultra high definition 4K (LIU4K) benchmark. Compression artifacts removal, as a common post-processing technique, aims at alleviating undesirable artifacts such as blockiness, ringing, and banding caused by quantization and approximation in the compression process. In this work, a systematic listing of the reviewed methods is presented based on their basic models (handcrafted models and deep networks). The main contributions and novelties of these methods are highlighted, and the main development directions, including architectures, multi-domain sources, signal structures, and new targeted units, are summarized. Furthermore, based on a unified deep learning configuration (i.e. same training data, loss function, optimization algorithm, etc.), we evaluate recent deep learning-based methods based on diversified evaluation measures. The experimental results show the state-of-the-art performance comparison of existing methods based on both full-reference, non-reference and task-driven metrics. Our survey would give a comprehensive reference source for future research on single image compression artifacts removal and inspire new directions of the related fields.

preprint2019arXiv

Deep Learning-Based Video Coding: A Review and A Case Study

The past decade has witnessed great success of deep learning technology in many disciplines, especially in computer vision and image processing. However, deep learning-based video coding remains in its infancy. This paper reviews the representative works on using deep learning for image/video coding, which has been an actively developing research area since 2015. We divide the related works into two categories: new coding schemes that are built primarily upon deep networks (deep schemes), and deep network-based coding tools (deep tools) that are used within traditional coding schemes or together with traditional coding tools. For deep schemes, pixel probability modeling and auto-encoders are the two approaches, which can be viewed as a predictive coding scheme and a transform coding scheme, respectively. For deep tools, there have been several proposed techniques using deep learning to perform intra-picture prediction, inter-picture prediction, cross-channel prediction, probability distribution prediction, transform, post- or in-loop filtering, down- and up-sampling, as well as encoding optimizations. In the hope of advocating the research of deep learning-based video coding, we present a case study of our developed prototype video codec, namely Deep Learning Video Coding (DLVC). DLVC features two deep tools that are both based on convolutional neural networks (CNNs), namely a CNN-based in-loop filter (CNN-ILF) and CNN-based block adaptive resolution coding (CNN-BARC). Both tools help improve the compression efficiency by a significant margin. With the two deep tools as well as other non-deep coding tools, DLVC is able to achieve on average 39.6% and 33.0% bit savings relative to HEVC, under random-access and low-delay configurations, respectively. The source code of DLVC has been released for future research.

preprint2019arXiv

On The Classification-Distortion-Perception Tradeoff

Signal degradation is ubiquitous and computational restoration of degraded signals has been investigated for many years. Recently, it has been reported that the capability of signal restoration is fundamentally limited by the perception-distortion tradeoff, i.e. the distortion and the perceptual difference between the restored signal and the ideal 'original' signal cannot both be made minimal simultaneously. Distortion corresponds to signal fidelity and perceptual difference corresponds to perceptual naturalness, both of which are important metrics in practice. Besides, there is another dimension worthy of consideration, namely the semantic quality, or the utility for recognition purposes, of the restored signal. In this paper, we extend the previous perception-distortion tradeoff to the case of the classification-distortion-perception (CDP) tradeoff, where we introduce the classification error rate of the restored signal in addition to distortion and perceptual difference. Two versions of the CDP tradeoff are considered, one using a predefined classifier and the other dealing with the optimal classifier for the restored signal. For both versions, we can rigorously prove the existence of the CDP tradeoff, i.e. the distortion, perceptual difference, and classification error rate cannot all be made minimal simultaneously. Our findings can be useful especially for computer vision research where some low-level vision tasks (signal restoration) serve high-level vision tasks (visual understanding).

preprint2019arXiv

Photoluminescence mapping and time-domain thermo-photoluminescence for rapid imaging and measurement of thermal conductivity of boron arsenide

Cubic boron arsenide (BAs) is attracting greater attention due to the recent experimental demonstration of ultrahigh thermal conductivity κ above 1000 W/mK. However, its bandgap has not been settled and a simple yet effective method to probe its crystal quality is missing. Furthermore, traditional κ measurement methods are destructive and time consuming, thus they cannot meet the urgent demand for fast screening of high κ materials. After we experimentally established 1.82 eV as the indirect bandgap of BAs and observed room-temperature band-edge photoluminescence, we developed two new optical techniques that can provide rapid and non-destructive characterization of κ with little sample preparation: photoluminescence mapping (PL-mapping) and time-domain thermo-photoluminescence (TDTP). PL-mapping provides nearly real-time image of crystal quality and κ over mm-sized crystal surfaces; while TDTP allows us to pick up any spot on the sample surface and measure its κ using nanosecond laser pulses. These new techniques reveal that the apparent single crystals are not only non-uniform in κ, but also are made of domains of very distinct κ. Because PL-mapping and TDTP are based on the band-edge PL and its dependence on temperature, they can be applied to other semiconductors, thus paving the way for rapid identification and development of high-κ semiconducting materials.

preprint2019arXiv

Two-Stream Action Recognition-Oriented Video Super-Resolution

We study the video super-resolution (SR) problem for facilitating video analytics tasks, e.g. action recognition, instead of for visual quality. The popular action recognition methods based on convolutional networks, exemplified by two-stream networks, are not directly applicable to video of low spatial resolution. This can be remedied by performing video SR prior to recognition, which motivates us to improve the SR procedure for recognition accuracy. Tailored for two-stream action recognition networks, we propose two video SR methods for the spatial and temporal streams respectively. On the one hand, we observe that regions with action are more important to recognition, and we propose an optical-flow guided weighted mean-squared-error loss for our spatial-oriented SR (SoSR) network to emphasize the reconstruction of moving objects. On the other hand, we observe that existing video SR methods incur temporal discontinuity between frames, which also worsens the recognition accuracy, and we propose a siamese network for our temporal-oriented SR (ToSR) training that emphasizes the temporal continuity between consecutive frames. We perform experiments using two state-of-the-art action recognition networks and two well-known datasets, UCF101 and HMDB51. Results demonstrate the effectiveness of our proposed SoSR and ToSR in improving recognition accuracy.
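
The idea of an optical-flow guided weighted mean-squared-error loss can be sketched as follows. This is only an illustration under assumptions: the weighting formula (1 plus a normalized flow magnitude scaled by a hypothetical parameter `alpha`), the function name and the plain-list data layout are not taken from the paper, whose actual loss may be defined differently.

```python
import math


def flow_weighted_mse(sr, hr, flow, alpha=1.0):
    """Weighted MSE that emphasizes pixels with large optical-flow magnitude.

    sr, hr: 2-D lists of pixel values (super-resolved and ground truth).
    flow:   2-D list of (dx, dy) motion vectors, one per pixel.
    alpha:  hypothetical strength of the motion emphasis.
    """
    mags = [[math.hypot(dx, dy) for dx, dy in row] for row in flow]
    peak = max(max(row) for row in mags) or 1.0  # avoid dividing by zero
    total = 0.0
    weight_sum = 0.0
    for sr_row, hr_row, mag_row in zip(sr, hr, mags):
        for s, h, m in zip(sr_row, hr_row, mag_row):
            w = 1.0 + alpha * m / peak  # static pixels keep weight 1
            total += w * (s - h) ** 2
            weight_sum += w
    return total / weight_sum


# The same reconstruction error costs more on a moving pixel than on a static one.
flow = [[(3.0, 4.0), (0.0, 0.0)]]
loss_moving = flow_weighted_mse([[1.0, 0.0]], [[0.0, 0.0]], flow)
loss_static = flow_weighted_mse([[0.0, 1.0]], [[0.0, 0.0]], flow)
```

With zero flow everywhere, all weights equal 1 and the function reduces to the ordinary mean-squared error.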