Trust snapshot

Quick read

Trust 21 - Emerging · Verification L1 · Unclaimed author
58 works
0 followers
23 topics
4 close collaborators

Published work

58 published item(s)

preprint · 2026 · arXiv

SnapGuard: Lightweight Prompt Injection Detection for Screenshot-Based Web Agents

Web agents have emerged as an effective paradigm for automating interactions with complex web environments, yet remain vulnerable to prompt injection attacks that embed malicious instructions into webpage content to induce unintended actions. This threat is further amplified for screenshot-based web agents, which operate on rendered visual webpages rather than structured textual representations, making predominant text-centric defenses ineffective. Although multimodal detection methods have been explored, they often rely on large vision-language models (VLMs), incurring significant computational overhead. The bottleneck lies in the complexity of modern webpages: VLMs must comprehend the global semantics of an entire page, resulting in substantial inference time and GPU memory usage. This raises a critical question: can we detect prompt injection attacks from screenshots in a lightweight manner? In this paper, we observe that injected webpages exhibit distinct characteristics compared to benign ones from both visual and textual perspectives. Building on this insight, we propose SnapGuard, a lightweight yet accurate method that reformulates prompt injection detection as multimodal representation analysis over webpage screenshots. SnapGuard leverages two complementary signals: a visual stability indicator that identifies abnormally smooth gradient distributions induced by malicious content, and action-oriented textual signals recovered via contrast-polarity reversal. Extensive evaluations across eight attacks and two benign settings demonstrate that SnapGuard achieves an F1 score of 0.75, outperforming GPT-4o-prompt while being 8x faster (1.81s vs. 14.50s) and introducing no additional memory overhead.

preprint · 2024 · arXiv

Accurate and Fast Compressed Video Captioning

Existing video captioning approaches typically first sample video frames from a decoded video and then run a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame sampling may ignore key information in videos and thus degrade performance. Additionally, redundant information in the sampled frames may result in low efficiency in the inference of video captioning. To address this, we study video captioning from a different perspective, in the compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient in inference as smaller and less redundant information is processed. We propose a simple yet effective end-to-end transformer in the compressed domain for video captioning that enables learning from the compressed video for captioning. We show that even with a simple design, our method can achieve state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches. Code is available at https://github.com/acherstyx/CoCap.

preprint · 2024 · arXiv

DiffusionEdge: Diffusion Probabilistic Model for Crisp Edge Detection

Limited by the encoder-decoder architecture, learning-based edge detectors usually have difficulty predicting edge maps that satisfy both correctness and crispness. With the recent success of the diffusion probabilistic model (DPM), we found it is especially suitable for accurate and crisp edge detection since the denoising process is directly applied to the original image size. Therefore, we propose the first diffusion model for the task of general edge detection, which we call DiffusionEdge. To avoid expensive computational resources while retaining the final performance, we apply DPM in the latent space and enable the classic cross-entropy loss which is uncertainty-aware in pixel level to directly optimize the parameters in latent space in a distillation manner. We also adopt a decoupled architecture to speed up the denoising process and propose a corresponding adaptive Fourier filter to adjust the latent features of specific frequencies. With all the technical designs, DiffusionEdge can be stably trained with limited resources, predicting crisp and accurate edge maps with much fewer augmentation strategies. Extensive experiments on four edge detection benchmarks demonstrate the superiority of DiffusionEdge both in correctness and crispness. On the NYUDv2 dataset, compared to the second best, we increase the ODS, OIS (without post-processing) and AC by 30.2%, 28.1% and 65.1%, respectively. Code: https://github.com/GuHuangAI/DiffusionEdge.

preprint · 2023 · arXiv

DAP: Domain-aware Prompt Learning for Vision-and-Language Navigation

Following language instructions to navigate in unseen environments is a challenging task for autonomous embodied agents. With strong representation capabilities, pretrained vision-and-language models are widely used in VLN. However, most of them are trained on web-crawled general-purpose datasets, which incurs a considerable domain gap when used for VLN tasks. To address the problem, we propose a novel and model-agnostic domain-aware prompt learning (DAP) framework. For equipping the pretrained models with specific object-level and scene-level cross-modal alignment in VLN tasks, DAP applies a low-cost prompt tuning paradigm to learn soft visual prompts for extracting in-domain image semantics. Specifically, we first generate a set of in-domain image-text pairs with the help of the CLIP model. Then we introduce soft visual prompts in the input space of the visual encoder in a pretrained model. DAP injects in-domain visual knowledge into the visual encoder of the pretrained model in an efficient way. Experimental results on both R2R and REVERIE show the superiority of DAP compared to existing state-of-the-art methods.

preprint · 2023 · arXiv

Edge Preserving Implicit Surface Representation of Point Clouds

Learning implicit surfaces directly from raw data has recently become a very attractive representation method for 3D reconstruction tasks due to its excellent performance. However, as the raw data quality deteriorates, the implicit functions often lead to unsatisfactory reconstruction results. To this end, we propose a novel edge-preserving implicit surface reconstruction method, which mainly consists of a differentiable Laplacian regularizer and a dynamic edge sampling strategy. The differentiable Laplacian regularizer can effectively alleviate the lack of smoothness in the implicit surface caused by deteriorating point cloud quality. Meanwhile, to reduce excessive smoothing at the edge regions of the implicit surface, we propose a dynamic edge extraction strategy for sampling near the sharp edges of the point cloud, which effectively prevents the Laplacian regularizer from smoothing all regions. Finally, we combine them with a simple regularization term for robust implicit surface reconstruction. Compared with the state-of-the-art methods, experimental results show that our method significantly improves the quality of 3D reconstruction results. Moreover, we demonstrate through several experiments that our method can be conveniently and effectively applied to some point cloud analysis tasks, including point cloud edge feature extraction, normal estimation, etc.

preprint · 2023 · arXiv

Exceptional entanglement phenomena: non-Hermiticity meeting non-classicality

The non-Hermitian (NH) extension of quantum-mechanical Hamiltonians represents one of the most significant advancements in physics. During the past two decades, numerous captivating NH phenomena have been revealed and demonstrated, but all of them can appear in both quantum and classical systems. This leads to the fundamental question: what NH signature presents a radical departure from classical physics? The solution of this problem is indispensable for exploring genuine NH quantum mechanics, but remains experimentally untouched so far. Here, we resolve this basic issue by unveiling distinct exceptional entanglement phenomena, exemplified by an entanglement transition, occurring at the exceptional point of NH interacting quantum systems. We illustrate and demonstrate such purely quantum-mechanical NH effects with a naturally dissipative light-matter system, engineered in a circuit quantum electrodynamics architecture. Our results lay the foundation for studies of genuinely quantum-mechanical NH physics, signified by exceptional-point-enabled entanglement behaviors.

preprint · 2022 · arXiv

Accelerating Video Object Segmentation with Compressed Video

We propose an efficient plug-and-play acceleration framework for semi-supervised video object segmentation by exploiting the temporal redundancies in videos presented by the compressed bitstream. Specifically, we propose a motion vector-based warping method for propagating segmentation masks from keyframes to other frames in a bi-directional and multi-hop manner. Additionally, we introduce a residual-based correction module that can fix wrongly propagated segmentation masks from noisy or erroneous motion vectors. Our approach is flexible and can be added on top of several existing video object segmentation algorithms. We achieved highly competitive results on DAVIS17 and YouTube-VOS on various base models with substantial speed-ups of up to 3.5X with minor drops in accuracy.
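The motion-vector warping mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one integer `(dy, dx)` offset per pixel (real codecs store motion vectors per block), the function name `warp_mask` and the list-of-lists layout are hypothetical, and the residual-based correction module is not shown.

```python
def warp_mask(key_mask, motion):
    """Propagate a keyframe mask to another frame by backward warping.

    key_mask: H x W list of lists of integer labels (hypothetical layout).
    motion:   H x W list of lists of (dy, dx) offsets; each offset points from
              a pixel in the target frame to its source pixel in the keyframe
              (an assumption made for this sketch).
    """
    height, width = len(key_mask), len(key_mask[0])
    warped = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dy, dx = motion[y][x]
            # Clamp the source location so out-of-frame vectors stay valid.
            src_y = min(max(y + dy, 0), height - 1)
            src_x = min(max(x + dx, 0), width - 1)
            warped[y][x] = key_mask[src_y][src_x]
    return warped


key_mask = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
# Every target pixel looks one column to the left, so the object shifts right.
motion = [[(0, -1)] * 3 for _ in range(3)]
shifted = warp_mask(key_mask, motion)
```

In this toy case the labelled column moves from the middle to the right edge. Noisy vectors would copy wrong labels, which is the failure mode the abstract's correction module is said to address.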

preprint · 2022 · arXiv

Active Scene Understanding via Online Semantic Reconstruction

We propose a novel approach to robot-operated active understanding of unknown indoor scenes, based on online RGBD reconstruction with semantic segmentation. In our method, the exploratory robot scanning is both driven by and targeting at the recognition and segmentation of semantic objects from the scene. Our algorithm is built on top of the volumetric depth fusion framework (e.g., KinectFusion) and performs real-time voxel-based semantic labeling over the online reconstructed volume. The robot is guided by an online estimated discrete viewing score field (VSF) parameterized over the 3D space of 2D location and azimuth rotation. VSF stores for each grid the score of the corresponding view, which measures how much it reduces the uncertainty (entropy) of both geometric reconstruction and semantic labeling. Based on VSF, we select the next best views (NBV) as the target for each time step. We then jointly optimize the traverse path and camera trajectory between two adjacent NBVs, through maximizing the integral viewing score (information gain) along path and trajectory. Through extensive evaluation, we show that our method achieves efficient and accurate online scene parsing during exploratory scanning.
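The idea of scoring views by how much they reduce entropy, then picking the next best view, can be sketched as follows. This is only a toy illustration under stated assumptions: it scores semantic label distributions alone (the abstract also includes geometric uncertainty), the predicted post-view distributions are given as inputs rather than estimated, and all names (`entropy`, `view_score`, `next_best_view`) are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def view_score(before, after):
    """Total entropy reduction over voxels if the view were taken.

    before, after: lists of per-voxel label distributions (assumed inputs).
    """
    return sum(entropy(b) - entropy(a) for b, a in zip(before, after))


def next_best_view(before, predicted_after):
    """Return the candidate view with the largest predicted entropy reduction."""
    return max(predicted_after, key=lambda v: view_score(before, predicted_after[v]))


# Two voxels, each initially maximally uncertain between two labels.
before = [[0.5, 0.5], [0.5, 0.5]]
candidates = {
    "view_a": [[0.9, 0.1], [0.5, 0.5]],  # sharpens only the first voxel
    "view_b": [[0.9, 0.1], [0.8, 0.2]],  # sharpens both voxels
}
best = next_best_view(before, candidates)
```

Here `view_b` wins because it reduces uncertainty in both voxels. The path and camera-trajectory optimization between adjacent views described in the abstract is outside the scope of this sketch.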

preprint · 2022 · arXiv

AutoTransition: Learning to Recommend Video Transition Effects

Video transition effects are widely used in video editing to connect shots for creating cohesive and visually appealing videos. However, it is challenging for non-professionals to choose the best transitions due to the lack of cinematographic knowledge and design skills. In this paper, we present the first work on automatic video transition recommendation (VTR): given a sequence of raw video shots and companion audio, recommend video transitions for each pair of neighboring shots. To solve this task, we collect a large-scale video transition dataset using publicly available video templates from editing software. Then we formulate VTR as a multi-modal retrieval problem from vision/audio to video transitions and propose a novel multi-modal matching framework which consists of two parts. First we learn the embedding of video transitions through a video transition classification task. Then we propose a model to learn the matching correspondence from vision/audio inputs to video transitions. Specifically, the proposed model employs a multi-modal transformer to fuse vision and audio information, as well as capture the context cues in sequential transition outputs. Through both quantitative and qualitative experiments, we clearly demonstrate the effectiveness of our method. Notably, in the comprehensive user study, our method receives scores comparable to those of professional editors while improving the video editing efficiency by 300×. We hope our work serves to inspire other researchers to work on this new task. The dataset and code are public at https://github.com/acherstyx/AutoTransition.

preprint · 2022 · arXiv

Box2Seg: Learning Semantics of 3D Point Clouds with Box-Level Supervision

Learning dense point-wise semantics from unstructured 3D point clouds with fewer labels, although a realistic problem, has been under-explored in literature. While existing weakly supervised methods can effectively learn semantics with only a small fraction of point-level annotations, we find that the vanilla bounding box-level annotation is also informative for semantic segmentation of large-scale 3D point clouds. In this paper, we introduce a neural architecture, termed Box2Seg, to learn point-level semantics of 3D point clouds with bounding box-level supervision. The key to our approach is to generate accurate pseudo labels by exploring the geometric and topological structure inside and outside each bounding box. Specifically, an attention-based self-training (AST) technique and Point Class Activation Mapping (PCAM) are utilized to estimate pseudo-labels. The network is further trained and refined with pseudo labels. Experiments on two large-scale benchmarks including S3DIS and ScanNet demonstrate the competitive performance of the proposed method. In particular, the proposed network can be trained with cheap, or even off-the-shelf bounding box-level annotations and subcloud-level tags.

preprint · 2022 · arXiv

Decoupling Features and Coordinates for Few-shot RGB Relocalization

Cross-scene model adaptation is crucial for camera relocalization in real scenarios. It is often preferable that a pre-learned model can be quickly adapted to a novel scene with as few training samples as possible. The existing state-of-the-art approaches, however, can hardly support such few-shot scene adaptation due to the entangling of image feature extraction and scene coordinate regression. To address this issue, we approach camera relocalization with a decoupled solution where feature extraction, coordinate regression, and pose estimation are performed separately. Our key insight is that the feature encoder used for coordinate regression should be learned by removing the distracting factor of coordinate systems, such that the feature encoder is learned from multiple scenes for general feature representation and, more importantly, view-insensitive capability. With this feature prior, and combined with a coordinate regressor, few-shot observations in a new scene are much easier to connect with the 3D world than with existing integrated solutions. Experiments have shown the superiority of our approach compared to the state-of-the-art methods, producing higher accuracy on several scenes with diverse visual appearance and viewpoint distribution.

preprint · 2022 · arXiv

Decoupling Makes Weakly Supervised Local Feature Better

Weakly supervised learning can help local feature methods to overcome the obstacle of acquiring a large-scale dataset with densely labeled correspondences. However, since weak supervision cannot distinguish the losses caused by the detection and description steps, directly conducting weakly supervised learning within a joint describe-then-detect pipeline suffers limited performance. In this paper, we propose a decoupled describe-then-detect pipeline tailored for weakly supervised local feature learning. Within our pipeline, the detection step is decoupled from the description step and postponed until discriminative and robust descriptors are learned. In addition, we introduce a line-to-window search strategy to explicitly use the camera pose information for better descriptor learning. Extensive experiments show that our method, namely PoSFeat (Camera Pose Supervised Feature), outperforms previous fully and weakly supervised methods and achieves state-of-the-art performance on a wide range of downstream tasks.

preprint · 2022 · arXiv

DisARM: Displacement Aware Relation Module for 3D Detection

We introduce Displacement Aware Relation Module (DisARM), a novel neural network module for enhancing the performance of 3D object detection in point cloud scenes. The core idea of our method is that contextual information is critical to tell the difference when the instance geometry is incomplete or featureless. We find that relations between proposals provide a good representation to describe the context. However, adopting relations between all the object or patch proposals for detection is inefficient, and an imbalanced combination of local and global relations brings extra noise that could mislead the training. Rather than working with all relations, we found that training with relations only between the most representative ones, or anchors, can significantly boost the detection performance. A good anchor should be semantic-aware with no ambiguity and independent with other anchors as well. To find the anchors, we first perform a preliminary relation anchor module with an objectness-aware sampling approach and then devise a displacement-based module for weighing the relation importance for better utilization of contextual information. This lightweight relation module leads to significantly higher accuracy of object instance detection when being plugged into the state-of-the-art detectors. Evaluations on the public benchmarks of real-world scenes show that our method achieves state-of-the-art performance on both SUN RGB-D and ScanNet V2.

preprint · 2022 · arXiv

dpart: Differentially Private Autoregressive Tabular, a General Framework for Synthetic Data Generation

We propose a general, flexible, and scalable framework dpart, an open source Python library for differentially private synthetic data generation. Central to the approach is autoregressive modelling -- breaking the joint data distribution to a sequence of lower-dimensional conditional distributions, captured by various methods such as machine learning models (logistic/linear regression, decision trees, etc.), simple histogram counts, or custom techniques. The library has been created with a view to serve as a quick and accessible baseline as well as to accommodate a wide audience of users, from those making their first steps in synthetic data generation, to more experienced ones with domain expertise who can configure different aspects of the modelling and contribute new methods/mechanisms. Specific instances of dpart include Independent, an optimized version of PrivBayes, and a newly proposed model, dp-synthpop. Code: https://github.com/hazy/dpart
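The autoregressive factorization the abstract refers to is the standard chain rule of probability. Written out for a table with $d$ columns (the symbols $x_1, \dots, x_d$ and the column ordering are notation assumed here, not taken from the source):

```latex
p(x_1, \dots, x_d) \;=\; \prod_{i=1}^{d} p\bigl(x_i \mid x_1, \dots, x_{i-1}\bigr)
```

Each factor is a lower-dimensional conditional that can be fitted by a separate model (for example a regression, a decision tree, or histogram counts, as the abstract lists). How the library orders the columns or divides the privacy budget across the factors is not stated in the abstract and is not assumed here.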

preprint · 2022 · arXiv

Fusion-Aware Point Convolution for Online Semantic 3D Scene Segmentation

Online semantic 3D segmentation alongside real-time RGB-D reconstruction poses special challenges such as how to perform 3D convolution directly over the progressively fused 3D geometric data, and how to smartly fuse information from frame to frame. We propose a novel fusion-aware 3D point convolution which operates directly on the geometric surface being reconstructed and effectively exploits the inter-frame correlation for high-quality 3D feature learning. This is enabled by a dedicated dynamic data structure which organizes the online acquired point cloud with global-local trees. Globally, we compile the online reconstructed 3D points into an incrementally growing coordinate interval tree, enabling fast point insertion and neighborhood query. Locally, we maintain the neighborhood information for each point using an octree whose construction benefits from the fast query of the global tree. Both levels of trees update dynamically and help the 3D convolution effectively exploit the temporal coherence for effective information fusion across RGB-D frames.

preprint · 2022 · arXiv

HybridGNN: Learning Hybrid Representation in Multiplex Heterogeneous Networks

Recently, graph neural networks have shown the superiority of modeling the complex topological structures in heterogeneous network-based recommender systems. Due to the diverse interactions among nodes and abundant semantics emerging from diverse types of nodes and edges, there is a bursting research interest in learning expressive node representations in multiplex heterogeneous networks. One of the most important tasks in recommender systems is to predict the potential connection between two nodes under a specific edge type (i.e., relationship). Although existing studies utilize explicit metapaths to aggregate neighbors, practically they only consider intra-relationship metapaths and thus fail to leverage the potential uplift by inter-relationship information. Moreover, it is not always straightforward to exploit inter-relationship metapaths comprehensively under diverse relationships, especially with the increasing number of node and edge types. In addition, contributions of different relationships between two nodes are difficult to measure. To address the challenges, we propose HybridGNN, an end-to-end GNN model with hybrid aggregation flows and hierarchical attentions to fully utilize the heterogeneity in the multiplex scenarios. Specifically, HybridGNN applies a randomized inter-relationship exploration module to exploit the multiplexity property among different relationships. Then, our model leverages hybrid aggregation flows under intra-relationship metapaths and randomized exploration to learn the rich semantics. To explore the importance of different aggregation flow and take advantage of the multiplexity property, we bring forward a novel hierarchical attention module which leverages both metapath-level attention and relationship-level attention. Extensive experimental results suggest that HybridGNN achieves the best performance compared to several state-of-the-art baselines.

preprint · 2022 · arXiv

Learning Fine-Grained Segmentation of 3D Shapes without Part Labels

Learning-based 3D shape segmentation is usually formulated as a semantic labeling problem, assuming that all parts of training shapes are annotated with a given set of tags. This assumption, however, is impractical for learning fine-grained segmentation. Although most off-the-shelf CAD models are, by construction, composed of fine-grained parts, they usually miss semantic tags and labeling those fine-grained parts is extremely tedious. We approach the problem with deep clustering, where the key idea is to learn part priors from a shape dataset with fine-grained segmentation but no part labels. Given point sampled 3D shapes, we model the clustering priors of points with a similarity matrix and achieve part segmentation through minimizing a novel low rank loss. To handle highly densely sampled point sets, we adopt a divide-and-conquer strategy. We partition the large point set into a number of blocks. Each block is segmented using a deep-clustering-based part prior network trained in a category-agnostic manner. We then train a graph convolution network to merge the segments of all blocks to form the final segmentation result. Our method is evaluated with a challenging benchmark of fine-grained segmentation, showing state-of-the-art performance.

preprint · 2022 · arXiv

Learning High-DOF Reaching-and-Grasping via Dynamic Representation of Gripper-Object Interaction

We approach the problem of high-DOF reaching-and-grasping via learning joint planning of grasp and motion with deep reinforcement learning. To resolve the sample efficiency issue in learning the high-dimensional and complex control of dexterous grasping, we propose an effective representation of grasping state characterizing the spatial interaction between the gripper and the target object. To represent gripper-object interaction, we adopt the Interaction Bisector Surface (IBS), which is the Voronoi diagram between two close-by 3D geometric objects and has been successfully applied in characterizing spatial relations between 3D objects. We found that IBS is surprisingly effective as a state representation since it informs the fine-grained control of each finger well through its spatial relation to the target object. This novel grasp representation, together with several technical contributions including a fast IBS approximation, a novel vector-based reward and an effective training strategy, facilitates learning a strong control model of high-DOF grasping with good sample efficiency, dynamic adaptability, and cross-category generality. Experiments show that it generates high-quality dexterous grasps for complex shapes with smooth grasping motions.

preprint · 2022 · arXiv

Measuring Small Longitudinal Phase Shifts via Weak Measurement Amplification

Weak measurement amplification, considered a very promising scheme in precision measurement, has been applied to the estimation of various small physical quantities. Since many quantities can be converted to a phase signal, it is interesting and important to consider measuring ultra-small longitudinal phase shifts by using weak measurement. Here, we propose and experimentally demonstrate a novel scheme for ultra-small longitudinal phase estimation based on weak measurement amplification, which is suitable for polarization interferometry. We realize an amplified measurement, by one order of magnitude, of a small phase signal directly introduced by a liquid crystal variable retarder and show its robustness to finite visibility of interference. Our results may find important applications in high-precision measurements, such as gravitational wave detection.

preprint · 2022 · arXiv

Natural selection from the perspective of mathematical and physical laws

The theory of evolution by natural selection cannot be used to evaluate the truth value of the following proposition: Through evolution, there exists at least one species that can adapt to any one given environment. To address this issue, this study attempted to define natural selection from the perspective of mathematical and physical laws. This study roughly classified biological activities into molecular, cellular, individual, ecological, and biogeochemical (atomic) levels according to scale and complexity, and selected typical phenomena from each level to analyze the relationship between adaptive evolution and several laws of mathematics and physics. Then, we proposed that natural selection favors heritable variations that allow organisms to better use and/or hide the laws in a certain environment. Reproductive advantage is by far the most obvious consequence of natural selection. Moreover, adaptive evolution can lead to the emergence of laws that ensure that each law only controls some properties of organisms. This study found that organisms significantly influence themselves at all five levels of biological activities, and the whole biosphere can be considered as a huge and evolution-driven feedback loop. Organisms can carry more laws than non-living matter, but the carrying capacity is limited. Therefore, this study's findings suggest that adaptive evolution makes organisms subject to more laws of mathematics and physics until the highest carrying capacity is reached.

preprint · 2022 · arXiv

NeoNav: Improving the Generalization of Visual Navigation via Generating Next Expected Observations

We propose improving the cross-target and cross-scene generalization of visual navigation through learning an agent that is guided by conceiving the next observations it expects to see. This is achieved by learning a variational Bayesian model, called NeoNav, which generates the next expected observations (NEO) conditioned on the current observations of the agent and the target view. Our generative model is learned through optimizing a variational objective encompassing two key designs. First, the latent distribution is conditioned on current observations and the target view, leading to a model-based, target-driven navigation. Second, the latent space is modeled with a Mixture of Gaussians conditioned on the current observation and the next best action. Our use of mixture-of-posteriors prior effectively alleviates the issue of over-regularized latent space, thus significantly boosting the model generalization for new targets and in novel scenes. Moreover, the NEO generation models the forward dynamics of agent-environment interaction, which improves the quality of approximate inference and hence benefits data efficiency. We have conducted extensive evaluations on both real-world and synthetic benchmarks, and show that our model consistently outperforms the state-of-the-art models in terms of success rate, data efficiency, and generalization.

preprint · 2022 · arXiv

New Formulation of Mixed-Integer Conic Programming for Globally Optimal Grasp Planning

We present a two-level branch-and-bound (BB) algorithm to compute the optimal gripper pose that maximizes a grasp metric in a restricted search space. Our method can take the gripper's kinematics feasibility into consideration to ensure that a given gripper can reach the set of grasp points without collisions or predict infeasibility with finite-time termination when no pose exists for a given set of grasp points. Our main technical contribution is a novel mixed-integer conic programming (MICP) formulation for the inverse kinematics of the gripper that uses a small number of binary variables and tightened constraints, which can be efficiently solved via a low-level BB algorithm. Our experiments show that optimal gripper poses for various target objects can be computed taking 20-180 minutes of computation on a desktop machine and the computed grasp quality, in terms of the Q1 metric, is better than those generated using sampling-based planners.

preprint · 2022 · arXiv

Objective-aware Traffic Simulation via Inverse Reinforcement Learning

Traffic simulators act as an essential component in the operating and planning of transportation systems. Conventional traffic simulators usually employ a calibrated physical car-following model to describe vehicles' behaviors and their interactions with traffic environment. However, there is no universal physical model that can accurately predict the pattern of vehicle's behaviors in different situations. A fixed physical model tends to be less effective in a complicated environment given the non-stationary nature of traffic dynamics. In this paper, we formulate traffic simulation as an inverse reinforcement learning problem, and propose a parameter sharing adversarial inverse reinforcement learning model for dynamics-robust simulation learning. Our proposed model is able to imitate a vehicle's trajectories in the real world while simultaneously recovering the reward function that reveals the vehicle's true objective which is invariant to different dynamics. Extensive experiments on synthetic and real-world datasets show the superior performance of our approach compared to state-of-the-art methods and its robustness to variant dynamics of traffic.

preprint2022arXiv

Observation of Emergent $\mathbb{Z}_2$ Gauge Invariance in a Superconducting Circuit

Lattice gauge theories (LGTs) are one of the most fundamental subjects in many-body physics, and have recently attracted considerable research interest in quantum simulations. Here we experimentally investigate the emergent $\mathbb{Z}_2$ gauge invariance in a 1D superconducting circuit with 10 transmon qubits. By precisely adjusting staggered longitudinal and transverse fields to each qubit, we construct an effective Hamiltonian containing an LGT and gauge-broken terms. The corresponding matter sector can exhibit localization, and there also exists a 3-qubit operator whose expectation value can remain nonzero for a long time in low-energy regimes. The above localization can be regarded as the confinement of matter fields, and the 3-body operator is the $\mathbb{Z}_2$ gauge generator. These experimental results demonstrate that, despite the absence of gauge structure in the effective Hamiltonian, $\mathbb{Z}_2$ gauge invariance can still emerge in low-energy regimes. Our work provides a method for both theoretically and experimentally studying the rich physics in quantum many-body systems with emergent gauge invariance.

preprint2022arXiv

On Learning the Right Attention Point for Feature Enhancement

We present a novel attention-based mechanism to learn enhanced point features for point cloud processing tasks, e.g., classification and segmentation. Unlike prior works, which were trained to optimize the weights of a pre-selected set of attention points, our approach learns to locate the best attention points to maximize the performance of a specific task, e.g., point cloud classification. Importantly, we advocate the use of a single attention point to facilitate semantic understanding in point feature learning. Specifically, we formulate a new and simple convolution, which combines convolutional features from an input point and its corresponding learned attention point, or LAP for short. Our attention mechanism can be easily incorporated into state-of-the-art point cloud classification and segmentation networks. Extensive experiments on common benchmarks such as ModelNet40, ShapeNetPart, and S3DIS all demonstrate that our LAP-enabled networks consistently outperform the respective original networks, as well as other competitive alternatives, which employ multiple attention points, either pre-selected or learned under our LAP framework.

preprint2022arXiv

On Upper Bounds in Dimension Gaps of CFT's

We consider CFT's arising from branes probing singularities of internal manifolds. We focus on holographic models with internal space including arbitrary Sasaki-Einstein manifolds coming from CY as well as arbitrary sphere quotients. In all these cases we show that there is a universal upper bound (depending only on the spacetime dimension) for the conformal dimension of the first non-trivial spin 2 operator in the dual CFT and a minimal diameter (in AdS units) for the internal space of the holographic dual, and we conjecture that it holds for all CFT's.

preprint2022arXiv

Online 3D Bin Packing with Constrained Deep Reinforcement Learning

We solve a challenging yet practically useful variant of the 3D Bin Packing Problem (3D-BPP). In our problem, the agent has limited information about the items to be packed into the bin, and an item must be packed immediately after its arrival without buffering or readjusting. The item's placement is also subject to the constraints of collision avoidance and physical stability. We formulate this online 3D-BPP as a constrained Markov decision process. To solve the problem, we propose an effective and easy-to-implement constrained deep reinforcement learning (DRL) method under the actor-critic framework. In particular, we introduce a feasibility predictor to predict the feasibility mask for the placement actions and use it to modulate the action probabilities output by the actor during training. Such supervision and transformations of DRL help the agent learn feasible policies efficiently. Our method can also be generalized, e.g., with the ability to handle lookahead or items with different orientations. We have conducted an extensive evaluation showing that the learned policy significantly outperforms the state-of-the-art methods. A user study suggests that our method attains human-level performance.
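
As a rough illustration of how a feasibility mask can modulate an actor's action probabilities, the sketch below zeroes out infeasible placements and renormalizes the rest. It is a minimal sketch under stated assumptions, not the paper's implementation: the function name `masked_action_probs`, the 0/1 mask encoding, and the softmax-then-renormalize scheme are all hypothetical choices made for this example.

```python
import math


def masked_action_probs(logits, feasibility_mask):
    """Turn actor logits into probabilities restricted to feasible placements.

    Assumption for this sketch: feasibility_mask holds 1 for a feasible
    placement and 0 for an infeasible one. Masking the exponentiated logits
    and renormalizing is equivalent to applying softmax, multiplying by the
    mask, and renormalizing.
    """
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(x - peak) * m for x, m in zip(logits, feasibility_mask)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no feasible placement in the mask")
    return [w / total for w in weights]


# Three candidate placements; the second one is marked infeasible.
probs = masked_action_probs([1.0, 2.0, 3.0], [1, 0, 1])
```

In this toy call the infeasible placement receives probability zero, and the remaining probability mass is shared between the two feasible placements in proportion to their original softmax weights.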

preprint2022arXiv

PartNet: A Recursive Part Decomposition Network for Fine-grained and Hierarchical Shape Segmentation

Deep learning approaches to 3D shape segmentation are typically formulated as a multi-class labeling problem. Existing models are trained for a fixed set of labels, which greatly limits their flexibility and adaptivity. We opt for top-down recursive decomposition and develop the first deep learning model for hierarchical segmentation of 3D shapes, based on recursive neural networks. Starting from a full shape represented as a point cloud, our model performs recursive binary decomposition, where the decomposition networks at all nodes in the hierarchy share weights. At each node, a node classifier is trained to determine the type (adjacency or symmetry) and stopping criteria of its decomposition. The features extracted in higher-level nodes are recursively propagated to lower-level ones. Thus, the meaningful decompositions in higher levels provide strong contextual cues constraining the segmentations in lower levels. Meanwhile, to increase the segmentation accuracy at each node, we enhance the recursive contextual feature with the shape feature extracted for the corresponding part. Our method segments a 3D shape given as a point cloud into an unfixed number of parts, depending on the shape complexity, showing strong generality and flexibility. It achieves state-of-the-art performance, both for fine-grained and semantic segmentation, on the public benchmark and a new benchmark of fine-grained segmentation proposed in this work. We also demonstrate its application for fine-grained part refinements in image-to-shape reconstruction.

preprint2022arXiv

RayMVSNet: Learning Ray-based 1D Implicit Fields for Accurate Multi-View Stereo

Learning-based multi-view stereo (MVS) has so far centered around 3D convolution on cost volumes. Due to the high computation and memory consumption of 3D CNNs, the resolution of the output depth is often considerably limited. Different from most existing works dedicated to adaptive refinement of cost volumes, we opt to directly optimize the depth value along each camera ray, mimicking the range (depth) finding of a laser scanner. This reduces the MVS problem to ray-based depth optimization, which is much more lightweight than full cost volume optimization. In particular, we propose RayMVSNet, which learns sequential prediction of a 1D implicit field along each camera ray, with the zero-crossing point indicating scene depth. This sequential modeling, conducted based on transformer features, essentially learns the epipolar line search in traditional multi-view stereo. We also devise a multi-task learning scheme for better optimization convergence and depth accuracy. Our method ranks top on both the DTU and the Tanks & Temples datasets over all previous learning-based methods, achieving an overall reconstruction score of 0.33mm on DTU and an f-score of 59.48% on Tanks & Temples.
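
To make the zero-crossing idea concrete, the sketch below reads a depth off a sampled 1D field along a single ray. It is only an illustration under stated assumptions: the function name `zero_crossing_depth`, the discrete sampling, and the linear interpolation between neighbouring samples are hypothetical choices for this example and are not taken from RayMVSNet.

```python
def zero_crossing_depth(depths, field_values):
    """Return the depth of the first zero crossing of a sampled 1D field.

    depths: increasing sample depths along one camera ray.
    field_values: field value at each sample (assumed to change sign at the
    surface). The crossing is located by linear interpolation between the
    two samples that bracket the sign change. Returns None if the field
    never reaches zero.
    """
    for i in range(len(depths) - 1):
        f0, f1 = field_values[i], field_values[i + 1]
        if f0 == 0.0:
            return depths[i]  # sample lies exactly on the surface
        if f0 * f1 < 0.0:
            t = f0 / (f0 - f1)  # fraction of the interval up to the crossing
            return depths[i] + t * (depths[i + 1] - depths[i])
    if field_values and field_values[-1] == 0.0:
        return depths[-1]
    return None


# The field changes sign between depth 2.0 and depth 3.0.
depth = zero_crossing_depth([1.0, 2.0, 3.0], [0.5, 0.25, -0.25])
```

Here the field values 0.25 and -0.25 bracket the crossing symmetrically, so the interpolated depth falls halfway between the two samples.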

preprint2022arXiv

Recurrent 3D Attentional Networks for End-to-End Active Object Recognition

Active vision is inherently attention-driven: the agent actively selects views to attend to in order to quickly achieve the vision task while improving its internal representation of the scene being observed. Inspired by the recent success of attention-based models in 2D vision tasks based on single RGB images, we propose to address multi-view depth-based active object recognition using an attention mechanism, by developing an end-to-end recurrent 3D attentional network. The architecture takes advantage of a recurrent neural network (RNN) to store and update an internal representation. Our model, trained with 3D shape datasets, is able to iteratively attend to the best views targeting an object of interest for recognizing it. To realize 3D view selection, we derive a 3D spatial transformer network which is differentiable for training with backpropagation, achieving much faster convergence than the reinforcement learning employed by most existing attention-based models. Experiments show that our method, with only depth input, achieves state-of-the-art next-best-view performance in time efficiency and recognition accuracy.

preprint2022arXiv

Reinforcement Learning-based Visual Navigation with Information-Theoretic Regularization

To enhance the cross-target and cross-scene generalization of target-driven visual navigation based on deep reinforcement learning (RL), we introduce an information-theoretic regularization term into the RL objective. The regularization maximizes the mutual information between navigation actions and visual observation transforms of an agent, thus promoting more informed navigation decisions. This way, the agent models the action-observation dynamics by learning a variational generative model. Based on the model, the agent generates (imagines) the next observation from its current observation and navigation target. This way, the agent learns to understand the causality between navigation actions and the changes in its observations, which allows the agent to predict the next action for navigation by comparing the current and the imagined next observations. Cross-target and cross-scene evaluations on the AI2-THOR framework show that our method attains at least a $10\%$ improvement of average success rate over some state-of-the-art models. We further evaluate our model in two real-world settings: navigation in unseen indoor scenes from a discrete Active Vision Dataset (AVD) and continuous real-world environments with a TurtleBot. We demonstrate that our navigation model is able to successfully achieve navigation tasks in these scenarios. Videos and models can be found in the supplementary material.

preprint2022arXiv

Repairing Systematic Outliers by Learning Clean Subspaces in VAEs

Data cleaning often comprises outlier detection and data repair. Systematic errors result from nearly deterministic transformations that occur repeatedly in the data, e.g. specific image pixels being set to default values or watermarks. Consequently, models with enough capacity easily overfit to these errors, making detection and repair difficult. Seeing as a systematic outlier is a combination of patterns of a clean instance and systematic error patterns, our main insight is that inliers can be modelled by a smaller representation (subspace) in a model than outliers. By exploiting this, we propose Clean Subspace Variational Autoencoder (CLSVAE), a novel semi-supervised model for detection and automated repair of systematic errors. The main idea is to partition the latent space and model inlier and outlier patterns separately. CLSVAE is effective with much less labelled data compared to previous related models, often with less than 2% of the data. We provide experiments using three image datasets in scenarios with different levels of corruption and labelled set sizes, comparing to relevant baselines. CLSVAE provides superior repairs without human intervention, e.g. with just 0.25% of labelled data we see a relative error decrease of 58% compared to the closest baseline.

preprint2022arXiv

RIM-Net: Recursive Implicit Fields for Unsupervised Learning of Hierarchical Shape Structures

We introduce RIM-Net, a neural network which learns recursive implicit fields for unsupervised inference of hierarchical shape structures. Our network recursively decomposes an input 3D shape into two parts, resulting in a binary tree hierarchy. Each level of the tree corresponds to an assembly of shape parts, represented as implicit functions, to reconstruct the input shape. At each node of the tree, simultaneous feature decoding and shape decomposition are carried out by their respective feature and part decoders, with weight sharing across the same hierarchy level. As an implicit field decoder, the part decoder is designed to decompose a sub-shape, via a two-way branched reconstruction, where each branch predicts a set of parameters defining a Gaussian to serve as a local point distribution for shape reconstruction. With reconstruction losses accounted for at each hierarchy level and a decomposition loss at each node, our network training does not require any ground-truth segmentations, let alone hierarchies. Through extensive experiments and comparisons to state-of-the-art alternatives, we demonstrate the quality, consistency, and interpretability of hierarchical structural inference by RIM-Net.

preprint2022arXiv

Scalable Spike Source Localization in Extracellular Recordings using Amortized Variational Inference

Determining the positions of neurons in an extracellular recording is useful for investigating functional properties of the underlying neural circuitry. In this work, we present a Bayesian modelling approach for localizing the source of individual spikes on high-density, microelectrode arrays. To allow for scalable inference, we implement our model as a variational autoencoder and perform amortized variational inference. We evaluate our method on both biophysically realistic simulated and real extracellular datasets, demonstrating that it is more accurate than and can improve spike sorting performance over heuristic localization methods such as center of mass.

preprint2022arXiv

Semi-Supervised Co-Analysis of 3D Shape Styles from Projected Lines

We present a semi-supervised co-analysis method for learning 3D shape styles from projected feature lines, achieving style patch localization with only weak supervision. Given a collection of 3D shapes spanning multiple object categories and styles, we perform style co-analysis over projected feature lines of each 3D shape and then backproject the learned style features onto the 3D shapes. Our core analysis pipeline starts with mid-level patch sampling and pre-selection of candidate style patches. Projective features are then encoded via patch convolution. Multi-view feature integration and style clustering are carried out under the framework of partially shared latent factor (PSLF) learning, a multi-view feature learning scheme. PSLF achieves effective multi-view feature fusion by distilling and exploiting consistent and complementary feature information from multiple views, while also selecting style patches from the candidates. Our style analysis approach supports both unsupervised and semi-supervised analysis. For the latter, our method accepts both user-specified shape labels and style-ranked triplets as clustering constraints. We demonstrate results from 3D shape style analysis and patch localization as well as improvements over state-of-the-art methods. We also present several applications enabled by our style analysis.

preprint2022arXiv

The complex mKdV equation with step-like initial data: Large time asymptotic analysis

In this paper, we study large-time asymptotics for the complex modified Korteweg-de Vries equation \begin{equation} u_t + \frac{1}{2}u_{xxx}+3|u|^2 u_x=0, \end{equation} with the step-like initial data \begin{equation} u(x,0)=u_0(x)= \begin{cases} 0, & {x \ge 0,}\\ A e^{iBx}, &{x < 0.} \end{cases} \end{equation} It is shown that the step-like initial problem can be described by a matrix Riemann-Hilbert problem. We apply the steepest descent method to obtain different large-time asymptotics in the Zakharov-Manakov region, a plane wave region and a slow decay region.

preprint2022arXiv

Towards Target-Driven Visual Navigation in Indoor Scenes via Generative Imitation Learning

We present a target-driven navigation system to improve mapless visual navigation in indoor scenes. Our method takes a multi-view observation of a robot and a target as inputs at each time step to provide a sequence of actions that move the robot to the target without relying on odometry or GPS at runtime. The system is learned by optimizing a combinational objective encompassing three key designs. First, we propose that an agent conceives the next observation before making an action decision. This is achieved by learning a variational generative module from expert demonstrations. We then propose predicting static collision in advance, as an auxiliary task to improve safety during navigation. Moreover, to alleviate the training data imbalance problem of termination action prediction, we also introduce a target checking module to differentiate from augmenting navigation policy with a termination action. The three proposed designs all contribute to the improved training data efficiency, static collision avoidance, and navigation generalization performance, resulting in a novel target-driven mapless navigation system. Through experiments on a TurtleBot, we provide evidence that our model can be integrated into a robotic system and navigate in the real world. Videos and models can be found in the supplementary material.

preprint2022arXiv

Visualization for Epidemiological Modelling: Challenges, Solutions, Reflections & Recommendations

We report on an ongoing collaboration between epidemiological modellers and visualization researchers by documenting and reflecting upon knowledge constructs -- a series of ideas, approaches and methods taken from existing visualization research and practice -- deployed and developed to support modelling of the COVID-19 pandemic. Structured independent commentary on these efforts is synthesized through iterative reflection to develop: evidence of the effectiveness and value of visualization in this context; open problems upon which the research communities may focus; guidance for future activity of this type; and recommendations to safeguard the achievements and promote, advance, secure and prepare for future collaborations of this kind. In describing and comparing a series of related projects that were undertaken in unprecedented conditions, our hope is that this unique report, and its rich interactive supplementary materials, will guide the scientific community in embracing visualization in its observation, analysis and modelling of data as well as in disseminating findings. Equally we hope to encourage the visualization community to engage with impactful science in addressing its emerging data challenges. If we are successful, this showcase of activity may stimulate mutually beneficial engagement between communities with complementary expertise to address problems of significance in epidemiology and beyond. https://ramp-vis.github.io/RAMPVIS-PhilTransA-Supplement/

preprint2021arXiv

A geometry-based relaxation algorithm for equilibrating a trivalent polygonal network in two dimensions and its implications

The equilibration of a trivalent polygonal network in two dimensions (2D) is a universal phenomenon in nature, but the underlying mathematical mechanism remains unclear. In this study, a relaxation algorithm based on a simple geometrical rule was developed to simulate the equilibration. The proposed algorithm was implemented in Python. The simulated relaxation changed the polygonal cell of the Voronoi network from an ellipse's inscribed polygon toward the ellipse's maximal inscribed polygon. Meanwhile, the Aboav-Weaire law, which describes the neighboring relationship between cells, still holds statistically. The success of the simulation strongly supports the ellipse packing hypothesis that was proposed to explain the dynamic behaviors of a trivalent 2D structure. The simulation results also showed that the edge of large cells tends to be shorter than edges of small cells, and vice versa. In addition, the relaxation increases the area and edge length of large cells, and it decreases the area and edge length of small cells. The pattern of changes in the area of different-edged cells due to relaxation is almost the same as the growth pattern described by the von Neumann-Mullins law. The results presented in this work can help to understand the mathematical mechanisms of the dynamic behaviors of trivalent 2D structures.

preprint2021arXiv

A hydrodynamic study of hyperon spin polarization in relativistic heavy ion collisions

We perform a systematic study of the spin polarization of hyperons in heavy-ion collisions using the MUSIC hydrodynamic model with A Multi-Phase Transport (AMPT) pre-equilibrium dynamics. Our model calculations nicely describe the measured collision-energy, centrality, rapidity, and $p_T$ dependence of $Λ$ polarization. We also study and predict the global spin polarization of $Ξ^-$ and $Ω^-$ as a function of collision energy, which provides a baseline for the studies of the magnetic moment, spin, and mass dependence of the spin polarization. For the local spin polarization, we calculate the radial and azimuthal components of the transverse $Λ$ polarization and find specific modulating behavior which could reflect the circular vortical structure. However, our model fails to describe the azimuthal-angle dependence of the longitudinal and transverse $Λ$ polarization, which indicates that the hydrodynamic framework with the spin Cooper-Frye formula under the assumption of thermal equilibrium of spin degree of freedom needs to be improved.

preprint2021arXiv

Higher-Rank Tensor Non-Abelian Field Theory: Higher-Moment or Subdimensional Polynomial Global Symmetry, Algebraic Variety, Noether's Theorem, and Gauging

With a view toward a fracton theory in condensed matter, we introduce a higher-moment polynomial degree-p global symmetry, acting on complex scalar/vector/tensor fields (e.g., ordinary or vector global symmetry for p$=0$ and p$=1$ respectively). We relate this higher-moment global symmetry of $n$-dimensional space, to a lower degree (either ordinary or higher-moment, e.g., degree-(p-$\ell$)) subdimensional or subsystem global symmetry on layers of $(n-\ell)$-submanifolds. These submanifolds are algebraic affine varieties (i.e., solutions of polynomials). The structure of layers of submanifolds as subvarieties can be studied via mathematical tools of embedding, foliation, and algebraic geometry. We also generalize Noether's theorem for this higher-moment polynomial global symmetry. We can promote the higher-moment global symmetry to a local symmetry, and derive a new family of higher-rank-m symmetric tensor gauge theory by gauging, with m = p$+1$. By further gauging a discrete $\mathbb{Z}_2^C$ charge conjugation (particle-hole) symmetry, we derive a new general class of rank-m tensor non-abelian gauge field theory (the gauge structure is non-commutative thus non-abelian but not an ordinary group): a hybrid class of (symmetric or non-symmetric) higher-rank-m tensor gauge theory and anti-symmetric tensor topological field theory, generalizing [arXiv:1909.13879], interplaying between gapless and gapped sectors.

preprint2021arXiv

Metrological characterisation of non-Gaussian entangled states of superconducting qubits

Multipartite entangled states are significant resources for both quantum information processing and quantum metrology. In particular, non-Gaussian entangled states are predicted to achieve a higher sensitivity of precision measurements than Gaussian states. On the basis of metrological sensitivity, the conventional linear Ramsey squeezing parameter (RSP) efficiently characterises the Gaussian entangled atomic states but fails for much wider classes of highly sensitive non-Gaussian states. These complex non-Gaussian entangled states can be classified by the nonlinear squeezing parameter (NLSP), as a generalisation of the RSP with respect to nonlinear observables, and identified via the Fisher information. However, the NLSP has never been measured experimentally. Using a 19-qubit programmable superconducting processor, here we report the characterisation of multiparticle entangled states generated during its nonlinear dynamics. First, selecting 10 qubits, we measure the RSP and the NLSP by single-shot readouts of collective spin operators in several different directions. Then, by extracting the Fisher information of the time-evolved state of all 19 qubits, we observe a large metrological gain of 9.89$^{+0.28}_{-0.29}$ dB over the standard quantum limit, indicating a high level of multiparticle entanglement for quantum-enhanced phase sensitivity. Benefiting from high-fidelity full controls and addressable single-shot readouts, the superconducting processor with interconnected qubits provides an ideal platform for engineering and benchmarking non-Gaussian entangled states that are useful for quantum-enhanced metrology.

preprint2021arXiv

Quantum speedup dynamics process in Schwarzschild space-time

Quantum speed limit time (QSLT) can be used to characterize the intrinsic minimal time interval for a quantum system evolving from an initial state to a target state. We investigate the QSLT of an open system in Schwarzschild space-time. We show that, in some typical noisy channels, the Hawking effect can be beneficial to the evolution of the system. For an initial entangled state, the evolution speed of the system can be enhanced in the depolarizing, bit flip, and bit-phase flip channels as the Hawking temperature increases, which is in sharp contrast to the phase flip channel. Moreover, an optimal initial entanglement exists in the noise channels other than the phase flip channel, which minimizes the QSLT of the system and thus leads to the maximum evolution speed of the system.

preprint2020arXiv

Basketball Player's Value Evaluation by a Networks-based Variant Parameter Hidden Markov Model

Determining the value of basketball players through analyzing the players' behavior is important for the managers of modern basketball teams. However, conventional methods always utilize isolated statistical data, leading to ineffective and inaccurate evaluations. Existing models based on dynamic network theory offer major improvements to the results of such evaluations, but said models remain imprecise because they focus merely on evaluating the values of individual players rather than considering them within their current teams. To solve this problem, we propose an analysis and evaluation model based on networks and a hidden Markov model. To the best of our knowledge, we are the first to combine a network form representing the players who are playing with the use of a hidden Markov model to mine the network and generate the desired results. Applying our approach to SportVU data collected from the National Basketball Association shows that this analysis and evaluation model can effectively analyze the performance of each player in a game and provides an assistive tool for team managers.

preprint2020arXiv

Deep Differentiable Grasp Planner for High-DOF Grippers

We present an end-to-end algorithm for training deep neural networks to grasp novel objects. Our algorithm builds all the essential components of a grasping system using a forward-backward automatic differentiation approach, including the forward kinematics of the gripper, the collision between the gripper and the target object, and the metric for grasp poses. In particular, we show that a generalized Q1 grasp metric is defined and differentiable for inexact grasps generated by a neural network, and the derivatives of our generalized Q1 metric can be computed from a sensitivity analysis of the induced optimization problem. We show that the derivatives of the (self-)collision terms can be efficiently computed from a watertight triangle mesh of low-quality. Altogether, our algorithm allows for the computation of grasp poses for high-DOF grippers in an unsupervised mode with no ground truth data, or it improves the results in a supervised mode using a small dataset. Our new learning algorithm significantly simplifies the data preparation for learning-based grasping systems and leads to higher qualities of learned grasps on common 3D shape datasets [7, 49, 26, 25], achieving a 22% higher success rate on physical hardware and a 0.12 higher value on the Q1 grasp quality metric.

preprint2020arXiv

DynamicPPL: Stan-like Speed for Dynamic Probabilistic Models

We present the preliminary high-level design and features of DynamicPPL.jl, a modular library providing a lightning-fast infrastructure for probabilistic programming. Besides a computational performance that is often close to or better than Stan, DynamicPPL provides an intuitive DSL that allows the rapid development of complex dynamic probabilistic programs. Being entirely written in Julia, a high-level dynamic programming language for numerical computing, DynamicPPL inherits a rich set of features available through the Julia ecosystem. Since DynamicPPL is a modular, stand-alone library, any probabilistic programming system written in Julia, such as Turing.jl, can use DynamicPPL to specify models and trace their model parameters. The main features of DynamicPPL are: 1) a meta-programming based DSL for specifying dynamic models using an intuitive tilde-based notation; 2) a tracing data-structure for tracking RVs in dynamic probabilistic models; 3) a rich contextual dispatch system allowing tailored behaviour during model execution; and 4) a user-friendly syntax for probabilistic queries. Finally, we show in a variety of experiments that DynamicPPL, in combination with Turing.jl, achieves computational performance that is often close to or better than Stan.

preprint2020arXiv

Finite-system Multicriticality at the Superradiant Quantum Phase Transition

We demonstrate the existence of finite-system multicriticality in a qubit-boson model where biased qubits are collectively coupled to a single-mode bosonic field. The interplay between biases and boson-qubit coupling produces a rich phase diagram showing multiple superradiant phases and phase boundaries of different orders. In particular, multiple phases can become indistinguishable in appropriate bias configurations, which is the signature of multicriticality. A series of universality classes characterizing these multicritical points are identified. Moreover, we present a trapped-ion realization with the potential to explore the multicritical phenomena experimentally using a small number of ions. The results open a novel way to probe multicritical universality classes in experiments.

preprint2020arXiv

Generating Grasp Poses for a High-DOF Gripper Using Neural Networks

We present a learning-based method for representing grasp poses of a high-DOF hand using neural networks. Due to redundancy in such high-DOF grippers, there exists a large number of equally effective grasp poses for a given target object, making it difficult for the neural network to find consistent grasp poses. We resolve this ambiguity by generating an augmented dataset that covers many possible grasps for each target object and train our neural networks using a consistency loss function to identify a one-to-one mapping from objects to grasp poses. We further enhance the quality of neural-network-predicted grasp poses using a collision loss function to avoid penetrations. We use an object dataset that combines the BigBIRD Database, the KIT Database, the YCB Database, and the Grasp Dataset to show that our method can generate high-DOF grasp poses with higher accuracy than supervised learning baselines. The quality of the grasp poses is on par with the groundtruth poses in the dataset. In addition, our method is robust and can handle noisy object models such as those constructed from multi-view depth images, allowing our method to be implemented on a 25-DOF Shadow Hand hardware platform.

preprint2020arXiv

Generative Ratio Matching Networks

Deep generative models can learn to generate realistic-looking images, but many of the most effective methods are adversarial and involve a saddlepoint optimization, which requires a careful balancing of training between a generator network and a critic network. Maximum mean discrepancy networks (MMD-nets) avoid this issue by using a kernel as a fixed adversary, but unfortunately, they have not on their own been able to match the generative quality of adversarial training. In this work, we take their insight of using kernels as fixed adversaries further and present a novel method for training deep generative models that does not involve saddlepoint optimization. We call our method generative ratio matching, or GRAM for short. In GRAM, the generator and the critic networks do not play a zero-sum game against each other; instead, they do so against a fixed kernel. Thus, GRAM networks are not only stable to train like MMD-nets, but they also match and beat the generative quality of adversarially trained generative networks.

preprint2020arXiv

Learning in the Frequency Domain

Deep neural networks have achieved remarkable success in computer vision tasks. Existing neural networks mainly operate in the spatial domain with fixed input sizes. For practical applications, images are usually large and have to be downsampled to the predetermined input size of neural networks. Even though the downsampling operations reduce computation and the required communication bandwidth, they remove both redundant and salient information obliviously, which results in accuracy degradation. Inspired by digital signal processing theories, we analyze the spectral bias from the frequency perspective and propose a learning-based frequency selection method to identify the trivial frequency components which can be removed without accuracy loss. The proposed method of learning in the frequency domain leverages identical structures of the well-known neural networks, such as ResNet-50, MobileNetV2, and Mask R-CNN, while accepting the frequency-domain information as the input. Experimental results show that learning in the frequency domain with static channel selection can achieve higher accuracy than the conventional spatial downsampling approach and meanwhile further reduce the input data size. Specifically for ImageNet classification with the same input size, the proposed method achieves 1.41% and 0.66% top-1 accuracy improvements on ResNet-50 and MobileNetV2, respectively. Even with half input size, the proposed method still improves the top-1 accuracy on ResNet-50 by 1%. In addition, we observe a 0.8% average precision improvement on Mask R-CNN for instance segmentation on the COCO dataset.

preprint2020arXiv

Learning Part Generation and Assembly for Structure-aware Shape Synthesis

Learning powerful deep generative models for 3D shape synthesis is largely hindered by the difficulty in ensuring plausibility encompassing correct topology and reasonable geometry. Indeed, learning the distribution of plausible 3D shapes seems a daunting task for holistic approaches, given the significant topological variations of 3D objects even within the same category. Inspired by the fact that 3D shape structure is characterized by part composition and placement, we propose to model 3D shape variations with a part-aware deep generative network, coined PAGENet. The network is composed of an array of per-part VAE-GANs, generating semantic parts composing a complete shape, followed by a part assembly module that estimates a transformation for each part to correlate and assemble them into a plausible structure. By delegating the learning of part composition and part placement to separate networks, the difficulty of modeling structural variations of 3D shapes is greatly reduced. We demonstrate through both qualitative and quantitative evaluations that PAGENet generates 3D shapes with plausible, diverse and detailed structure, and show two applications, i.e., semantic shape segmentation and part-based shape editing.

preprint2020arXiv

MLCVNet: Multi-Level Context VoteNet for 3D Object Detection

In this paper, we address the 3D object detection task by capturing multi-level contextual information with the self-attention mechanism and multi-scale feature fusion. Most existing 3D object detection methods recognize objects individually, without giving any consideration to contextual information between these objects. In contrast, we propose Multi-Level Context VoteNet (MLCVNet) to recognize 3D objects correlatively, building on the state-of-the-art VoteNet. We introduce three context modules into the voting and classifying stages of VoteNet to encode contextual information at different levels. Specifically, a Patch-to-Patch Context (PPC) module is employed to capture contextual information between the point patches, before voting for their corresponding object centroid points. Subsequently, an Object-to-Object Context (OOC) module is incorporated before the proposal and classification stage, to capture the contextual information between object candidates. Finally, a Global Scene Context (GSC) module is designed to learn the global scene context. We demonstrate these by capturing contextual information at patch, object and scene levels. Our method is an effective way to promote detection accuracy, achieving new state-of-the-art detection performance on challenging 3D object detection datasets, i.e., SUN RGBD and ScanNet. We also release our code at https://github.com/NUAAXQ/MLCVNet.

preprint2020arXiv

Pentaquark components in low-lying baryon resonances

We study pentaquark states of both light $q^4\bar q$ and hidden heavy $q^3 Q\bar Q$ (q = u,d,s quark in SU(3) flavor symmetry; Q = c, b quark) systems with a general group theory approach in the constituent quark model, and the spectrum of light baryon resonances in the ansatz that the $l=1$ baryon states may consist of the $q^3$ as well as $q^4\bar q$ pentaquark component. The model is fitted to ground state baryons and light baryon resonances which are believed to be normal three-quark states. The work reveals that the $N(1535)1/2^{-}$ and $N(1520)3/2^-$ may consist of a large $q^4\bar q$ component while the $N(1895)1/2^{-}$ and $N(1875)3/2^-$ are respectively their partners, and the $N^+(1685)$ might be a $q^4\bar q$ state. In addition, a new set of color-spin-flavor-spatial wave functions for $q^3 Q\bar Q$ systems in the compact pentaquark picture is constructed systematically for studying hidden charm pentaquark states.

preprint2020arXiv

PQ-NET: A Generative Part Seq2Seq Network for 3D Shapes

We introduce PQ-NET, a deep neural network which represents and generates 3D shapes via sequential part assembly. The input to our network is a 3D shape segmented into parts, where each part is first encoded into a feature representation using a part autoencoder. The core component of PQ-NET is a sequence-to-sequence or Seq2Seq autoencoder which encodes a sequence of part features into a latent vector of fixed size, and the decoder reconstructs the 3D shape, one part at a time, resulting in a sequential assembly. The latent space formed by the Seq2Seq encoder encodes both part structure and fine part geometry. The decoder can be adapted to perform several generative tasks including shape autoencoding, interpolation, novel shape generation, and single-view 3D reconstruction, where the generated shapes are all composed of meaningful parts.

preprint2020arXiv

SymmetryNet: Learning to Predict Reflectional and Rotational Symmetries of 3D Shapes from Single-View RGB-D Images

We study the problem of symmetry detection of 3D shapes from single-view RGB-D images, where severely missing data renders geometric detection approaches infeasible. We propose an end-to-end deep neural network which is able to predict both reflectional and rotational symmetries of 3D objects present in the input RGB-D image. Directly training a deep model for symmetry prediction, however, can quickly run into the issue of overfitting. We therefore adopt a multi-task learning approach. Aside from symmetry axis prediction, our network is also trained to predict symmetry correspondences. In particular, given the 3D points present in the RGB-D image, our network outputs for each 3D point its symmetric counterpart corresponding to a specific predicted symmetry. In addition, our network is able to detect for a given shape multiple symmetries of different types. We also contribute a benchmark of 3D symmetry detection based on single-view RGB-D images. Extensive evaluation on the benchmark demonstrates the strong generalization ability of our method, in terms of high accuracy of both symmetry axis prediction and counterpart estimation. In particular, our method is robust in handling unseen object instances with large variation in shape, multi-symmetry composition, as well as novel object categories.

preprint2019arXiv

Multiple instance dense connected convolution neural network for aerial image scene classification

With the development of deep learning, many state-of-the-art natural image scene classification methods have demonstrated impressive performance. While current convolutional neural networks tend to extract global features and global semantic information in a scene, the geo-spatial objects can be located anywhere in an aerial image scene and their spatial arrangement tends to be more complicated. One possible solution is to preserve more local semantic information and enhance feature propagation. In this paper, an end-to-end multiple instance dense connected convolution neural network (MIDCCNN) is proposed for aerial image scene classification. First, a 23-layer dense connected convolution neural network (DCCNN) is built and serves as a backbone to extract convolution features. It is capable of preserving middle- and low-level convolution features. Then, an attention-based multiple instance pooling is proposed to highlight the local semantics in an aerial image scene. Finally, we minimize the loss between the bag-level predictions and the ground truth labels so that the whole framework can be trained directly. Experiments on three aerial image datasets demonstrate that our proposed methods can outperform current baselines by a large margin.

preprint2019arXiv

Probing the dynamical phase transition with a superconducting quantum simulator

Non-equilibrium quantum many-body systems, which are difficult to study via classical computation, have attracted wide interest. Quantum simulation can provide insights into these problems. Here, using a programmable quantum simulator with 16 all-to-all connected superconducting qubits, we investigate the dynamical phase transition in the Lipkin-Meshkov-Glick model with a quenched transverse field. Clear signatures of the dynamical phase transition, merging different concepts of dynamical criticality, are observed by measuring the non-equilibrium order parameter, nonlocal correlations, and the Loschmidt echo. Moreover, near the dynamical critical point, we obtain the optimal spin squeezing of $-7.0\pm 0.8$ decibels, showing multipartite entanglement useful for measurements with precision five-fold beyond the standard quantum limit. Based on the capability of entangling qubits simultaneously and the accurate single-shot readout of multi-qubit states, this superconducting quantum simulator can be used to study other problems in non-equilibrium quantum many-body systems.
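
The link between the quoted squeezing of $-7.0\pm 0.8$ decibels and the "five-fold" figure can be checked with a short calculation. The sketch below assumes the usual decibel convention of $10\log_{10}$ applied to the squeezing parameter (a variance ratio relative to the standard quantum limit); the function name is hypothetical.

```python
def db_to_variance_ratio(squeezing_db):
    # Assumed convention: squeezing_db = 10 * log10(variance ratio),
    # where the ratio is taken relative to the standard quantum limit.
    return 10 ** (squeezing_db / 10)

central = db_to_variance_ratio(-7.0)
improvement = 1 / central  # factor by which the ratio falls below the limit
print(round(improvement, 2))

# Range implied by the quoted uncertainty of 0.8 dB.
low = 1 / db_to_variance_ratio(-7.0 + 0.8)
high = 1 / db_to_variance_ratio(-7.0 - 0.8)
print(round(low, 2), round(high, 2))
```

Under this assumption, $-7.0$ dB corresponds to a factor of $10^{0.7}\approx 5.01$, consistent with the five-fold improvement stated in the abstract.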

preprint2018arXiv

Dynamical speedup of a two-level system induced by coupling in the hierarchical environment

We investigate the dynamics of a two-level system in the presence of an overall environment composed of two layers. The first layer is a single-mode cavity which decays to a memoryless reservoir, while the second layer consists of two coupled single-mode cavities which decay to memoryless or memory-keeping reservoirs. In the weak-coupling regime between the qubit and the first-layer environment, we focus on the effects of the coupling in the hierarchical environment on the non-Markovian speedup dynamics of the system. We show that, by controlling the coupling in the second-layer environment, multiple dynamical crossovers from Markovian to non-Markovian and from no-speedup to speedup can be realized. These results hold independently of the nature of the second-layer environment. In contrast, we find that how the coupling between the two layers affects the non-Markovian speedup dynamics depends on the nature of the second-layer environment.