Source author record

Jun Xiao

Jun Xiao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Computer Vision; Computation and Language; cond-mat.mtrl-sci; Machine Learning; Artificial Intelligence; Multimedia; cond-mat.mes-hall; Information Retrieval; physics.atom-ph; Distributed, Parallel, and Cluster Computing; eess.IV; Multiagent Systems; nucl-th; physics.atm-clus; physics.chem-ph; physics.optics; q-fin.PM

Catalog footprint

What is connected

35 works

17 topics

4 close collaborators

preprint, 2026, arXiv

Milestone-Guided Policy Learning for Long-Horizon Language Agents

While long-horizon agentic tasks require language agents to perform dozens of sequential decisions, training such agents with reinforcement learning remains challenging. We identify two root causes: credit misattribution, where correct early actions are penalized due to terminal failures, and sample inefficiency, where scarce successful trajectories result in near-total loss of learning signal. We introduce a milestone-guided policy learning framework, BEACON, that leverages the compositional structure of long-horizon tasks to ensure precise credit assignment. BEACON partitions trajectories at milestone boundaries, applies temporal reward shaping within segments to credit partial progress, and estimates advantages at dual scales to prevent distant failures from corrupting the evaluation of local actions. On ALFWorld, WebShop, and ScienceWorld, BEACON consistently outperforms GRPO and GiGPO. Notably, on long-horizon ALFWorld tasks, BEACON achieves 92.9% success rate, nearly doubling GRPO's 53.5%, while improving effective sample utilization from 23.7% to 82.0%. These results establish milestone-anchored credit assignment as an effective paradigm for training long-horizon language agents. Code is available at https://github.com/ZJU-REAL/BEACON.

preprint, 2023, arXiv

VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching

The prevailing framework for matching multimodal inputs is based on a two-stage process: 1) detecting proposals with an object detector and 2) matching text queries with proposals. Existing two-stage solutions mostly focus on the matching step. In this paper, we argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages: they generate proposals solely based on the detection confidence (i.e., query-agnostic), hoping that the proposals contain all instances mentioned in the text query (i.e., query-aware). Due to this mismatch, chances are that proposals relevant to the text query are suppressed during the filtering process, which in turn bounds the matching performance. To this end, we propose VL-NMS, which is the first method to yield query-aware proposals at the first stage. VL-NMS regards all mentioned instances as critical objects, and introduces a lightweight module to predict a score for aligning each proposal with a critical object. These scores can guide the NMS operation to filter out proposals irrelevant to the text query, increasing the recall of critical objects and resulting in significantly improved matching performance. Since VL-NMS is agnostic to the matching step, it can be easily integrated into any state-of-the-art two-stage matching method. We validate the effectiveness of VL-NMS on two multimodal matching tasks, namely referring expression grounding and image-text matching. Extensive ablation studies on several baselines and benchmarks consistently demonstrate the superiority of VL-NMS.
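The idea of letting a query-relevance score guide NMS can be illustrated with a minimal sketch. It is not the VL-NMS implementation: the function name `query_aware_nms`, the `rel_scores` input, and the choice to rank proposals by the product of detection confidence and relevance are all assumptions made for illustration, since the abstract does not specify how the two scores are combined.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def query_aware_nms(
    boxes: List[Box],
    det_scores: List[float],
    rel_scores: List[float],
    iou_thresh: float = 0.5,
) -> List[int]:
    """Greedy NMS ranked by detection confidence times query relevance.

    Assumption: the product is a hypothetical way to combine the scores.
    Returns the indices of the kept boxes, best first.
    """
    ranking = [d * r for d, r in zip(det_scores, rel_scores)]
    order = sorted(range(len(boxes)), key=lambda i: ranking[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap a higher-ranked kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


# Toy data: boxes 0 and 1 overlap heavily; box 1 matches the query better.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
det = [0.9, 0.8, 0.3]
rel = [0.1, 0.9, 0.9]
kept_query_aware = query_aware_nms(boxes, det, rel)
kept_plain = query_aware_nms(boxes, det, [1.0] * len(boxes))
```

With uniform relevance the function reduces to ordinary confidence-ranked NMS, which keeps box 0 and suppresses the query-relevant box 1. With the relevance scores applied, box 1 survives instead.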

preprint, 2022, arXiv

A Knowledge-Enhanced Adversarial Model for Cross-lingual Structured Sentiment Analysis

Structured sentiment analysis, which aims to extract complex semantic structures such as holders, expressions, targets, and polarities, has obtained widespread attention from both industry and academia. Unfortunately, the existing structured sentiment analysis datasets cover only a few languages and are relatively small, limiting neural network models' performance. In this paper, we focus on the cross-lingual structured sentiment analysis task, which aims to transfer the knowledge from the source language to the target one. Notably, we propose a Knowledge-Enhanced Adversarial Model (KEAM) with both implicit distributed and explicit structural knowledge to enhance the cross-lingual transfer. First, we design an adversarial embedding adapter for learning an informative and robust representation by capturing implicit semantic information from diverse multi-lingual embeddings adaptively. Then, we propose a syntax GCN encoder to transfer the explicit semantic information (e.g., universal dependency tree) among multiple languages. We conduct experiments on five datasets and compare KEAM with both the supervised and unsupervised methods. The extensive experimental results show that our KEAM model outperforms all the unsupervised baselines in various metrics.

preprint, 2022, arXiv

Accurate Lung Nodules Segmentation with Detailed Representation Transfer and Soft Mask Supervision

Accurate lung lesion segmentation from Computed Tomography (CT) images is crucial to the analysis and diagnosis of lung diseases such as COVID-19 and lung cancer. However, the small size and variety of lung nodules and the lack of high-quality labeling make accurate lung nodule segmentation difficult. To address these issues, we first introduce a novel segmentation mask named Soft Mask, which describes edge details more richly and accurately and offers better visualization, and we develop a universal automatic Soft Mask annotation pipeline to handle different datasets. Then, a novel Network with detailed representation transfer and Soft Mask supervision (DSNet) is proposed to process the input low-resolution images of lung nodules into high-quality segmentation results. Our DSNet contains a special Detail Representation Transfer Module (DRTM) for reconstructing the detailed representation to compensate for the small size of lung nodule images, and an adversarial training framework with Soft Mask for further improving the accuracy of segmentation. Extensive experiments validate that our DSNet outperforms other state-of-the-art methods for accurate lung nodule segmentation and has strong generalization ability in other accurate medical segmentation tasks with competitive results. In addition, we provide a new challenging lung nodule segmentation dataset for further studies.

preprint, 2022, arXiv

ACDNet: Adaptively Combined Dilated Convolution for Monocular Panorama Depth Estimation

Depth estimation is a crucial step for 3D reconstruction with panorama images. Panorama images maintain the complete spatial information but introduce distortion with equirectangular projection. In this paper, we propose ACDNet, based on adaptively combined dilated convolution, to predict the dense depth map for a monocular panoramic image. Specifically, we combine the convolution kernels with different dilations to extend the receptive field in the equirectangular projection. Meanwhile, we introduce an adaptive channel-wise fusion module to summarize the feature maps and get diverse attention areas in the receptive field along the channels. Because the adaptive channel-wise fusion module is built on channel-wise attention, the network can capture and leverage the cross-channel contextual information efficiently. Finally, we conduct depth estimation experiments on three datasets (both virtual and real-world), and the experimental results demonstrate that our proposed ACDNet substantially outperforms the current state-of-the-art (SOTA) methods. Our code and model parameters are available at https://github.com/zcq15/ACDNet.

preprint, 2022, arXiv

Active Learning for Point Cloud Semantic Segmentation via Spatial-Structural Diversity Reasoning

The expensive annotation cost is notoriously known as the main constraint for the development of the point cloud semantic segmentation technique. Active learning methods endeavor to reduce such cost by selecting and labeling only a subset of the point clouds, yet previous attempts ignore the spatial-structural diversity of the selected samples, inducing the model to select clustered candidates with similar shapes in a local area while missing other representative ones in the global environment. In this paper, we propose a new 3D region-based active learning method to tackle this problem. Dubbed SSDR-AL, our method groups the original point clouds into superpoints and incrementally selects the most informative and representative ones for label acquisition. We achieve the selection mechanism via a graph reasoning network that considers both the spatial and structural diversities of superpoints. To deploy SSDR-AL in a more practical scenario, we design a noise-aware iterative labeling strategy to confront the "noisy annotation" problem introduced by the previous "dominant labeling" strategy in superpoints. Extensive experiments on two point cloud benchmarks demonstrate the effectiveness of SSDR-AL in the semantic segmentation task. Particularly, SSDR-AL significantly outperforms the baseline method and reduces the annotation cost by up to 63.0% and 24.0% when achieving 90% performance of fully supervised learning, respectively.

preprint, 2022, arXiv

Bidirectional Self-Training with Multiple Anisotropic Prototypes for Domain Adaptive Semantic Segmentation

A thriving trend in domain adaptive segmentation is to generate high-quality pseudo labels for the target domain and retrain the segmentor on them. Under this self-training paradigm, some competitive methods have turned to latent-space information: they establish the feature centroids (a.k.a. prototypes) of the semantic classes and determine the pseudo label candidates by their distances from these centroids. In this paper, we argue that the latent space contains more information to be exploited, and we take one step further to capitalize on it. Firstly, instead of merely using the source-domain prototypes to determine the target pseudo labels as most of the traditional methods do, we bidirectionally produce the target-domain prototypes to degrade those source features which might be too hard or disturbed for the adaptation. Secondly, existing attempts simply model each category as a single and isotropic prototype while ignoring the variance of the feature distribution, which could lead to the confusion of similar categories. To cope with this issue, we propose to represent each category with multiple and anisotropic prototypes via Gaussian Mixture Model, in order to fit the de facto distribution of source domain and estimate the likelihood of target samples based on the probability density. We apply our method on GTA5->Cityscapes and Synthia->Cityscapes tasks and achieve 61.2 and 62.8 respectively in terms of mean IoU, substantially outperforming other competitive self-training methods. Noticeably, in some categories which severely suffer from the categorical confusion such as "truck" and "bus", our method achieves 56.4 and 68.8 respectively, which further demonstrates the effectiveness of our design.

preprint, 2022, arXiv

Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs

Today's VidSGG models are all proposal-based methods, i.e., they first generate numerous paired subject-object snippets as proposals, and then conduct predicate classification for each proposal. In this paper, we argue that this prevalent proposal-based framework has three inherent drawbacks: 1) The ground-truth predicate labels for proposals are only partially correct. 2) They break the high-order relations among different predicate instances of the same subject-object pair. 3) VidSGG performance is upper-bounded by the quality of the proposals. To this end, we propose a new classification-then-grounding framework for VidSGG, which can avoid all three overlooked drawbacks. Meanwhile, under this framework, we reformulate the video scene graphs as temporal bipartite graphs, where the entities and predicates are two types of nodes with time slots, and the edges denote different semantic roles between these nodes. This formulation takes full advantage of our new framework. Accordingly, we further propose a novel BIpartite Graph based SGG model: BIG. It consists of a classification stage and a grounding stage, where the former aims to classify the categories of all the nodes and the edges, and the latter tries to localize the temporal location of each relation instance. Extensive ablations on two VidSGG datasets have attested to the effectiveness of our framework and BIG. Code is available at https://github.com/Dawn-LX/VidSGG-BIG.

preprint, 2022, arXiv

Consensus Graph Representation Learning for Better Grounded Image Captioning

Contemporary visual captioning models frequently hallucinate objects that are not actually in a scene, due to visual misclassification or over-reliance on priors, resulting in semantic inconsistency between the visual information and the target lexical words. The most common remedy is to encourage the captioning model to dynamically link generated object words or phrases to appropriate regions of the image, i.e., grounded image captioning (GIC). However, GIC utilizes an auxiliary task (grounding objects) that has not solved the key issue of object hallucination, i.e., the semantic inconsistency. In this paper, we take a novel perspective on the issue above: exploiting the semantic coherency between the visual and language modalities. Specifically, we propose the Consensus Graph Representation Learning framework (CGRL) for GIC that incorporates a consensus representation into the grounded captioning pipeline. The consensus is learned by aligning the visual graph (e.g., scene graph) to the language graph, considering both the nodes and edges in a graph. With the aligned consensus, the captioning model can capture both the correct linguistic characteristics and visual relevance, and then ground appropriate image regions. We validate the effectiveness of our model, with a significant decline in object hallucination (-9% CHAIRi) on the Flickr30k Entities dataset. In addition, CGRL is evaluated by several automatic metrics and human evaluation; the results indicate that the proposed approach can simultaneously improve the performance of image captioning (+2.9 Cider) and grounding (+2.3 F1LOC).

preprint, 2022, arXiv

DS-MVSNet: Unsupervised Multi-view Stereo via Depth Synthesis

In recent years, supervised and unsupervised learning-based MVS methods have achieved excellent performance compared with traditional methods. However, these methods only use the probability volume computed by cost volume regularization to predict reference depths, and this manner cannot mine enough information from the probability volume. Furthermore, the unsupervised methods usually rely on two-step training or additional inputs, which makes the procedure more complicated. In this paper, we propose the DS-MVSNet, an end-to-end unsupervised MVS structure with the source depths synthesis. To mine the information in probability volume, we creatively synthesize the source depths by splattering the probability volume and depth hypotheses to source views. Meanwhile, we propose the adaptive Gaussian sampling and improved adaptive bins sampling approach that improve the accuracy of the depth hypotheses. On the other hand, we utilize the source depths to render the reference images and propose depth consistency loss and depth smoothness loss. These can provide additional guidance according to photometric and geometric consistency in different views without additional inputs. Finally, we conduct a series of experiments on the DTU dataset and Tanks & Temples dataset that demonstrate the efficiency and robustness of our DS-MVSNet compared with the state-of-the-art methods.

preprint, 2022, arXiv

Explicit Image Caption Editing

Given an image and a reference caption, the image caption editing task aims to correct the misalignment errors and generate a refined caption. However, all existing caption editing works are implicit models, i.e., they directly produce the refined captions without explicit connections to the reference captions. In this paper, we introduce a new task: Explicit Caption Editing (ECE). ECE models explicitly generate a sequence of edit operations, and this edit operation sequence can translate the reference caption into a refined one. Compared to the implicit editing, ECE has multiple advantages: 1) Explainable: it can trace the whole editing path. 2) Editing Efficient: it only needs to modify a few words. 3) Human-like: it resembles the way that humans perform caption editing, and tries to keep original sentence structures. To solve this new task, we propose the first ECE model: TIger. TIger is a non-autoregressive transformer-based model, consisting of three modules: Tagger_del, Tagger_add, and Inserter. Specifically, Tagger_del decides whether each word should be preserved or not, Tagger_add decides where to add new words, and Inserter predicts the specific word for adding. To further facilitate ECE research, we propose two new ECE benchmarks by re-organizing two existing datasets, dubbed COCO-EE and Flickr30K-EE, respectively. Extensive ablations on both benchmarks have demonstrated the effectiveness of TIger.

preprint, 2022, arXiv

Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation

Recently, increasing efforts have been focused on Weakly Supervised Scene Graph Generation (WSSGG). The mainstream solution for WSSGG typically follows the same pipeline: they first align text entities in the weak image-level supervisions (e.g., unlocalized relation triplets or captions) with image regions, and then train SGG models in a fully-supervised manner with aligned instance-level "pseudo" labels. However, we argue that most existing WSSGG works only focus on object-consistency, which means the grounded regions should have the same object category label as text entities. While they neglect another basic requirement for an ideal alignment: interaction-consistency, which means the grounded region pairs should have the same interactions (i.e., visual relations) as text entity pairs. Hence, in this paper, we propose to enhance a simple grounding module with both object-aware and interaction-aware knowledge to acquire more reliable pseudo labels. To better leverage these two types of knowledge, we regard them as two teachers and fuse their generated targets to guide the training process of our grounding module. Specifically, we design two different strategies to adaptively assign weights to different teachers by assessing their reliability on each training sample. Extensive experiments have demonstrated that our method consistently improves WSSGG performance on various kinds of weak supervision.

preprint, 2022, arXiv

Label Semantic Knowledge Distillation for Unbiased Scene Graph Generation

The Scene Graph Generation (SGG) task aims to detect all the objects and their pairwise visual relationships in a given image. Although SGG has achieved remarkable progress over the last few years, almost all existing SGG models follow the same training paradigm: they treat both object and predicate classification in SGG as a single-label classification problem, and the ground-truths are one-hot target labels. However, this prevalent training paradigm has overlooked two characteristics of current SGG datasets: 1) For positive samples, some specific subject-object instances may have multiple reasonable predicates. 2) For negative samples, there are numerous missing annotations. When these two characteristics are disregarded, SGG models are easily confused and make wrong predictions. To this end, we propose a novel model-agnostic Label Semantic Knowledge Distillation (LS-KD) for unbiased SGG. Specifically, LS-KD dynamically generates a soft label for each subject-object instance by fusing a predicted Label Semantic Distribution (LSD) with its original one-hot target label. LSD reflects the correlations between this instance and multiple predicate categories. Meanwhile, we propose two different strategies to predict LSD: iterative self-KD and synchronous self-KD. Extensive ablations and results on three SGG tasks have attested to the superiority and generality of our proposed LS-KD, which can consistently achieve decent trade-off performance between different predicate categories.
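A minimal sketch can make the soft-label idea concrete. The abstract does not state how the LSD and the one-hot target are fused, so the convex combination below, the mixing weight `alpha`, and the function name `fuse_soft_label` are assumptions for illustration only.

```python
from typing import List


def fuse_soft_label(one_hot: List[float], lsd: List[float], alpha: float = 0.5) -> List[float]:
    """Blend a one-hot target with a predicted label distribution.

    Assumption: a convex combination with weight `alpha` on the one-hot
    target; the actual fusion rule in LS-KD may differ.
    """
    if len(one_hot) != len(lsd):
        raise ValueError("one_hot and lsd must have the same length")
    fused = [alpha * t + (1.0 - alpha) * p for t, p in zip(one_hot, lsd)]
    total = sum(fused)
    # Renormalize so the result is a valid probability distribution.
    return [v / total for v in fused]


# Hypothetical predicate classes: ["on", "riding", "near"]
target = [0.0, 1.0, 0.0]          # annotated predicate: "riding"
predicted_lsd = [0.2, 0.5, 0.3]   # hypothetical model output
soft = fuse_soft_label(target, predicted_lsd, alpha=0.5)
```

The annotated class keeps the largest mass, while other plausible predicates receive non-zero probability instead of being treated as strictly wrong.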

preprint, 2022, arXiv

Learning Regularized Multi-Scale Feature Flow for High Dynamic Range Imaging

Reconstructing ghosting-free high dynamic range (HDR) images of dynamic scenes from a set of multi-exposure images is a challenging task, especially with large object motion and occlusions, leading to visible artifacts using existing methods. To address this problem, we propose a deep network that tries to learn multi-scale feature flow guided by the regularized loss. It first extracts multi-scale features and then aligns features from non-reference images. After alignment, we use residual channel attention blocks to merge the features from different images. Extensive qualitative and quantitative comparisons show that our approach achieves state-of-the-art performance and produces excellent results where color artifacts and geometric distortions are significantly reduced.

preprint, 2022, arXiv

Online Video Super-Resolution with Convolutional Kernel Bypass Graft

Deep learning-based models have achieved remarkable performance in video super-resolution (VSR) in recent years, but most of these models are less applicable to online video applications. These methods solely consider the distortion quality and ignore crucial requirements for online applications, e.g., low latency and low model complexity. In this paper, we focus on online video transmission, in which VSR algorithms are required to generate high-resolution video sequences frame by frame in real time. To address such challenges, we propose an extremely low-latency VSR algorithm based on a novel kernel knowledge transfer method, named convolutional kernel bypass graft (CKBG). First, we design a lightweight network structure that does not require future frames as inputs and saves extra time costs for caching these frames. Then, our proposed CKBG method enhances this lightweight base model by bypassing the original network with "kernel grafts", which are extra convolutional kernels containing the prior knowledge of external pretrained image SR models. In the testing phase, we further accelerate the grafted multi-branch network by converting it into a simple single-path structure. Experiment results show that our proposed method can process online video sequences up to 110 FPS, with very low model complexity and competitive SR performance.

preprint, 2022, arXiv

Rethinking Data Augmentation for Robust Visual Question Answering

Data Augmentation (DA) -- generating extra training samples beyond the original training set -- has been widely used in today's unbiased VQA models to mitigate language biases. Current mainstream DA strategies are synthetic-based methods, which synthesize new samples by either editing some visual regions/words, or re-generating them from scratch. However, these synthetic samples are always unnatural and error-prone. To avoid this issue, a recent DA work composes new augmented samples by randomly pairing pristine images and other human-written questions. Unfortunately, to guarantee that augmented samples have reasonable ground-truth answers, it relies on a manually designed set of heuristic rules for several question types, which severely limits its generalization abilities. To this end, we propose a new Knowledge Distillation based Data Augmentation for VQA, dubbed KDDAug. Specifically, we first relax the requirements of reasonable image-question pairs, so that they can be easily applied to any question type. Then, we design a knowledge distillation (KD) based answer assignment to generate pseudo answers for all composed image-question pairs, which are robust to both in-domain and out-of-distribution settings. Since KDDAug is a model-agnostic DA strategy, it can be seamlessly incorporated into any VQA architecture. Extensive ablation studies on multiple backbones and benchmarks have demonstrated the effectiveness and generalization abilities of KDDAug.

preprint, 2022, arXiv

Shapley Counterfactual Credits for Multi-Agent Reinforcement Learning

Centralized Training with Decentralized Execution (CTDE) has been a popular paradigm in cooperative Multi-Agent Reinforcement Learning (MARL) settings and is widely used in many real applications. One of the major challenges in the training process is credit assignment, which aims to deduce the contributions of each agent according to the global rewards. Existing credit assignment methods focus on either decomposing the joint value function into individual value functions or measuring the impact of local observations and actions on the global value function. These approaches lack a thorough consideration of the complicated interactions among multiple agents, leading to an unsuitable assignment of credit and subsequently mediocre results on MARL. We propose Shapley Counterfactual Credit Assignment, a novel method for explicit credit assignment which accounts for the coalition of agents. Specifically, Shapley Value and its desired properties are leveraged in deep MARL to credit any combinations of agents, which grants us the capability to estimate the individual credit for each agent. Despite this capability, the main technical difficulty lies in the computational complexity of the Shapley Value, which grows factorially with the number of agents. We instead utilize an approximation method via Monte Carlo sampling, which reduces the sample complexity while maintaining its effectiveness. We evaluate our method on StarCraft II benchmarks across different scenarios. Our method significantly outperforms existing cooperative MARL algorithms and achieves state-of-the-art results, with especially large margins on the more difficult tasks.
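The Monte Carlo approximation mentioned above can be sketched with the standard permutation-sampling estimator of Shapley values. This is a generic illustration, not the paper's deep MARL implementation: `value_fn` is a hypothetical stand-in for whatever coalition value the method queries (for example a learned critic), and the agent names and toy games are made up.

```python
import random
from typing import Callable, Dict, FrozenSet, List


def monte_carlo_shapley(
    agents: List[str],
    value_fn: Callable[[FrozenSet[str]], float],
    num_samples: int = 1000,
    seed: int = 0,
) -> Dict[str, float]:
    """Estimate each agent's Shapley value by sampling random orderings.

    For each sampled permutation, an agent is credited with its marginal
    contribution when it joins the coalition of agents preceding it.
    """
    rng = random.Random(seed)
    credit = {a: 0.0 for a in agents}
    for _ in range(num_samples):
        order = list(agents)
        rng.shuffle(order)
        coalition = set()
        prev_value = value_fn(frozenset(coalition))
        for agent in order:
            coalition.add(agent)
            cur_value = value_fn(frozenset(coalition))
            credit[agent] += cur_value - prev_value
            prev_value = cur_value
    return {a: c / num_samples for a, c in credit.items()}


# Toy cooperative game (hypothetical): reward 1 only if both "a" and "b" act.
def pair_game(coalition: FrozenSet[str]) -> float:
    return 1.0 if {"a", "b"} <= coalition else 0.0


estimates = monte_carlo_shapley(["a", "b", "c"], pair_game, num_samples=2000)
```

In the toy game, agent "c" never changes the outcome and so receives zero credit, while "a" and "b" split the unit reward roughly evenly. The exact computation would enumerate all orderings of the agents, which is the factorial cost that sampling avoids.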

preprint, 2022, arXiv

The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation

Unbiased SGG has achieved significant progress over recent years. However, almost all existing SGG models have overlooked the ground-truth annotation qualities of prevailing SGG datasets, i.e., they always assume: 1) all the manually annotated positive samples are equally correct; 2) all the un-annotated negative samples are absolutely background. In this paper, we argue that both assumptions are inapplicable to SGG: there are numerous "noisy" groundtruth predicate labels that break these two assumptions, and these noisy samples actually harm the training of unbiased SGG models. To this end, we propose a novel model-agnostic NoIsy label CorrEction strategy for SGG: NICE. NICE can not only detect noisy samples but also reassign more high-quality predicate labels to them. After the NICE training, we can obtain a cleaner version of SGG dataset for model training. Specifically, NICE consists of three components: negative Noisy Sample Detection (Neg-NSD), positive NSD (Pos-NSD), and Noisy Sample Correction (NSC). Firstly, in Neg-NSD, we formulate this task as an out-of-distribution detection problem, and assign pseudo labels to all detected noisy negative samples. Then, in Pos-NSD, we use a clustering-based algorithm to divide all positive samples into multiple sets, and treat the samples in the noisiest set as noisy positive samples. Lastly, in NSC, we use a simple but effective weighted KNN to reassign new predicate labels to noisy positive samples. Extensive results on different backbones and tasks have attested to the effectiveness and generalization abilities of each component of NICE.
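The label reassignment step in NSC can be illustrated with a small weighted k-nearest-neighbour vote. The abstract only says "weighted KNN", so the inverse-distance weighting, the Euclidean metric, the function name `weighted_knn_label`, and the toy features below are all assumptions for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def weighted_knn_label(
    query: Sequence[float],
    neighbors: List[Tuple[Sequence[float], str]],
    k: int = 3,
    eps: float = 1e-8,
) -> str:
    """Return the label with the largest inverse-distance-weighted vote.

    `neighbors` is a list of (feature_vector, label) pairs taken from
    samples considered clean. Inverse-distance weights are an assumption.
    """
    scored = sorted(
        ((math.dist(query, feat), label) for feat, label in neighbors),
        key=lambda pair: pair[0],
    )
    votes: Dict[str, float] = defaultdict(float)
    for distance, label in scored[:k]:
        votes[label] += 1.0 / (distance + eps)
    return max(votes, key=lambda lbl: votes[lbl])


# Hypothetical 2-D features for clean samples and one noisy positive sample.
clean = [
    ((0.1, 0.0), "riding"),
    ((2.0, 0.0), "on"),
    ((2.1, 0.0), "on"),
    ((5.0, 5.0), "near"),
]
new_label = weighted_knn_label((0.0, 0.0), clean, k=3)
```

In this toy case an unweighted majority vote over the three nearest neighbours would pick "on", whereas the weighted vote favours the much closer "riding" sample.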

preprint, 2022, arXiv

Unified Group Fairness on Federated Learning

Federated learning (FL) has emerged as an important machine learning paradigm where a global model is trained based on the private data from distributed clients. However, most existing FL algorithms cannot guarantee performance fairness across different groups because of data distribution shift over groups. In this paper, we formulate the problem of unified group fairness on FL, where the groups can be formed by clients (including existing clients and newly added clients) and sensitive attribute(s). To solve this problem, we first propose a general fair federated framework. Then we construct a unified group fairness risk from the view of federated uncertainty set with theoretical analyses to guarantee unified group fairness on FL. We also develop an efficient federated optimization algorithm named Federated Mirror Descent Ascent with Momentum Acceleration (FMDA-M) with a convergence guarantee. We validate the advantages of the FMDA-M algorithm under various kinds of distribution shift settings in experiments, and the results show that the FMDA-M algorithm outperforms the existing fair FL algorithms on unified group fairness.

preprint, 2022, arXiv

Unified Normalization for Accelerating and Stabilizing Transformers

Solid results from Transformers have made them prevailing architectures in various natural language and vision tasks. As a default component in Transformers, Layer Normalization (LN) normalizes activations within each token to boost the robustness. However, LN requires on-the-fly statistics calculation in inference as well as division and square root operations, leading to inefficiency on hardware. What is more, replacing LN with other hardware-efficient normalization schemes (e.g., Batch Normalization) results in inferior performance, even collapse in training. We find that this dilemma is caused by abnormal behaviors of activation statistics, including large fluctuations over iterations and extreme outliers across layers. To tackle these issues, we propose Unified Normalization (UN), which can speed up the inference by being fused with other linear operations and achieve comparable performance on par with LN. UN strives to boost performance by calibrating the activation and gradient statistics with a tailored fluctuation smoothing strategy. Meanwhile, an adaptive outlier filtration strategy is applied to avoid collapse in training whose effectiveness is theoretically proved and experimentally verified in this paper. We demonstrate that UN can be an efficient drop-in alternative to LN by conducting extensive experiments on language and vision tasks. Besides, we evaluate the efficiency of our method on GPU. Transformers equipped with UN enjoy about 31% inference speedup and nearly 18% memory reduction. Code will be released at https://github.com/hikvision-research/Unified-Normalization.
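Why a normalization with fixed statistics can be fused with a following linear operation is easy to show in a small sketch. This is the generic folding used for BatchNorm-style layers, not the Unified Normalization code: it assumes the normalization applies precomputed per-feature `mean` and `var` at inference, and all function names here are hypothetical.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def normalize(x: Vector, gamma: Vector, beta: Vector, mean: Vector, var: Vector, eps: float = 1e-5) -> Vector:
    """Per-feature affine normalization with fixed (offline) statistics."""
    return [g * (xi - m) / math.sqrt(v + eps) + b
            for xi, g, b, m, v in zip(x, gamma, beta, mean, var)]


def linear(W: Matrix, b: Vector, x: Vector) -> Vector:
    """Dense layer: W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def fold_norm_into_linear(
    W: Matrix, b: Vector, gamma: Vector, beta: Vector, mean: Vector, var: Vector, eps: float = 1e-5
) -> Tuple[Matrix, Vector]:
    """Return (W', b') such that linear(W', b', x) == linear(W, b, normalize(x))."""
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    shift = [bt - s * m for bt, s, m in zip(beta, scale, mean)]
    # Scale each input column of W, and absorb the shift into the bias.
    W_folded = [[w * s for w, s in zip(row, scale)] for row in W]
    b_folded = [bi + sum(w * sh for w, sh in zip(row, shift)) for row, bi in zip(W, b)]
    return W_folded, b_folded


# Toy parameters (hypothetical values).
W = [[1.0, -2.0], [0.5, 3.0]]
b = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.0, 0.3]
mean, var = [0.2, -1.0], [4.0, 0.25]
x = [1.0, 2.0]

W_f, b_f = fold_norm_into_linear(W, b, gamma, beta, mean, var)
two_step = linear(W, b, normalize(x, gamma, beta, mean, var))
fused = linear(W_f, b_f, x)  # same result, without a separate normalization pass
```

After folding, inference needs no per-token statistics, division, or square root, which are the costs the abstract attributes to LN.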

preprint, 2021, arXiv

Kinetic Energy Distribution of Fragments for Thermal Neutron-Induced $^{235}$U and $^{239}$Pu Fission Reactions

Focusing on the generation and evolution of the many complementary pairs of primary fission fragments at the scission moment, a Dinuclear and Statistical Model (DSM) is proposed. (1) It is assumed that the fissile nucleus elongates along a symmetric coaxis until it breaks into two primary fission fragments. (2) Every complementary pair of the primary fission fragments is approximately described as two ellipsoids with large deformation at the scission moment. (3) The kinetic energy in every complementary pair of the primary fragments is mainly provided by Coulomb repulsion, which is explicitly expressed through strict six-dimensional integrals. (4) Only three phenomenological coefficients are obtained to globally describe the quadrupole deformation parameters of arbitrary primary fragments both for $^{235}$U($n_{th}, f$) and $^{239}$Pu($n_{th}, f$) reactions, on the basis of the common characteristics of the measured data, such as mass and charge distributions and kinetic energy distributions. In the framework of DSM, the explicit average total kinetic energy distribution $\overline{TKE}(A)$ and the average kinetic energy distribution $\overline{KE}(A)$ are consistently represented. The theoretical results in this paper agree well with the experimental data. Furthermore, this model is expected to be a reliable approach for evaluating the corresponding observables for thermal neutron-induced fission of actinides.

preprint2021arXiv

Probing Multiple Electric Dipole Forbidden Optical Transitions in Highly Charged Nickel Ions

Highly charged ions (HCIs) are promising candidates for the next generation of atomic clocks, owing to their tightly bound electron cloud, which significantly suppresses the common environmental disturbances to the quantum oscillator. Here we propose and pursue an experimental strategy that, while focusing on various HCIs of a single atomic element, keeps the number of candidate clock transitions as large as possible. Following this strategy, we identify four adjacent charge states of nickel HCIs that offer as many as six optical transitions. Experimentally, we demonstrated the essential capability of producing these ions in the low-energy compact Shanghai-Wuhan Electron Beam Ion Trap. We measured the wavelengths of four magnetic-dipole ($M$1) and one electric-quadrupole ($E$2) clock transitions with an accuracy of several ppm with a novel calibration method; two of these lines were observed and characterized for the first time in controlled laboratory settings. Compared to the earlier determinations, our measurements improved wavelength accuracy by an order of magnitude. Such measurements are crucial for constraining the range of laser wavelengths for finding the "needle in a haystack" narrow lines. In addition, we calculated frequencies and quality factors, and evaluated the sensitivity of these six transitions to the hypothetical variation of the electromagnetic fine structure constant $\alpha$ needed for fundamental physics applications. We argue that all six transitions in nickel HCIs offer intrinsic immunity to all common perturbations of quantum oscillators, and one of them has a projected fractional frequency uncertainty down to the remarkable level of 10$^{-19}$.

preprint2020arXiv

Berry curvature memory through electrically driven stacking transitions

In two-dimensional layered quantum materials, the stacking order of the layers determines both the crystalline symmetry and electronic properties such as the Berry curvature, topology and electron correlation. Electrical stimuli can influence quasiparticle interactions and the free-energy landscape, making it possible to dynamically modify the stacking order and reveal hidden structures that host different quantum properties. Here we demonstrate electrically driven stacking transitions that can be applied to design nonvolatile memory based on Berry curvature in few-layer WTe$_2$. The interplay of out-of-plane electric fields and electrostatic doping controls in-plane interlayer sliding and creates multiple polar and centrosymmetric stacking orders. In situ nonlinear Hall transport reveals such stacking rearrangements result in a layer-parity-selective Berry curvature memory in momentum space, where the sign reversal of the Berry curvature and its dipole only occurs in odd-layer crystals. Our findings open an avenue towards exploring coupling between topology, electron correlations, and ferroelectricity in hidden stacking orders and demonstrate a new low-energy-cost, electrically controlled topological memory in the atomically thin limit.

preprint2020arXiv

CIAN: Cross-Image Affinity Net for Weakly Supervised Semantic Segmentation

Weakly supervised semantic segmentation with only image-level labels saves the large human effort of annotating pixel-level labels. Cutting-edge approaches rely on various innovative constraints and heuristic rules to generate the masks for every single image. Although great progress has been achieved by these methods, they treat each image independently and do not take into account the relationships across different images. In this paper, however, we argue that the cross-image relationship is vital for weakly supervised segmentation, because it connects related regions across images, where supplementary representations can be propagated to obtain more consistent and integral regions. To leverage this information, we propose an end-to-end cross-image affinity module, which exploits pixel-level cross-image relationships with only image-level labels. By means of this, our approach achieves 64.3% and 65.3% mIoU on the Pascal VOC 2012 validation and test sets, respectively, which is a new state-of-the-art result by only using image-level labels for weakly supervised semantic segmentation, demonstrating the superiority of our approach.

preprint2020arXiv

Counterfactual Samples Synthesizing for Robust Visual Question Answering

Although Visual Question Answering (VQA) has realized impressive progress over the last few years, today's VQA models tend to capture superficial linguistic correlations in the train set and fail to generalize to the test set with different QA distributions. To reduce the language biases, several recent works introduce an auxiliary question-only model to regularize the training of the targeted VQA model, and achieve dominating performance on VQA-CP. However, due to the complexity of design, current methods are unable to equip the ensemble-based models with two indispensable characteristics of an ideal VQA model: 1) visual-explainable: the model should rely on the right visual regions when making decisions. 2) question-sensitive: the model should be sensitive to the linguistic variations in the question. To this end, we propose a model-agnostic Counterfactual Samples Synthesizing (CSS) training scheme. CSS generates numerous counterfactual training samples by masking critical objects in images or words in questions, and assigning different ground-truth answers. After training with the complementary samples (i.e., the original and generated samples), the VQA models are forced to focus on all critical objects and words, which significantly improves both visual-explainable and question-sensitive abilities. In return, the performance of these models is further boosted. Extensive ablations have shown the effectiveness of CSS. Particularly, by building on top of the model LMH, we achieve a record-breaking performance of 58.95% on VQA-CP v2, with 6.5% gains.

preprint2020arXiv

Evaluation Framework For Large-scale Federated Learning

Federated learning is proposed as a machine learning setting to enable distributed edge devices, such as mobile phones, to collaboratively learn a shared prediction model while keeping all the training data on device, which can not only take full advantage of data distributed across millions of nodes to train a good model but also protect data privacy. However, learning in the scenario above poses new challenges. In fact, data across a massive number of unreliable devices is likely to be non-IID (i.e., not independent and identically distributed), which may make the performance of models trained by federated learning unstable. In this paper, we introduce a framework designed for large-scale federated learning which consists of approaches to generating datasets and a modular evaluation framework. Firstly, we construct a suite of open-source non-IID datasets covering three aspects, namely covariate shift, prior probability shift, and concept shift, which are grounded in real-world assumptions. In addition, we design several rigorous evaluation metrics including the number of network nodes, the size of datasets, the number of communication rounds and communication resources. Finally, we present an open-source benchmark for large-scale federated learning research.

preprint2020arXiv

Hierarchical Fashion Graph Network for Personalized Outfit Recommendation

Fashion outfit recommendation has attracted increasing attention from online shopping services and fashion communities. Distinct from other scenarios (e.g., social networking or content sharing) which recommend a single item (e.g., a friend or picture) to a user, outfit recommendation predicts user preference on a set of well-matched fashion items. Hence, performing high-quality personalized outfit recommendation should satisfy two requirements -- 1) the nice compatibility of fashion items and 2) the consistency with user preference. However, present works focus mainly on one of the requirements and only consider either user-outfit or outfit-item relationships, thereby easily leading to suboptimal representations and limiting the performance. In this work, we unify two tasks, fashion compatibility modeling and personalized outfit recommendation. Towards this end, we develop a new framework, Hierarchical Fashion Graph Network (HFGN), to model relationships among users, items, and outfits simultaneously. In particular, we construct a hierarchical structure upon user-outfit interactions and outfit-item mappings. We then draw inspiration from recent graph neural networks, and employ embedding propagation on such a hierarchical graph, so as to aggregate item information into an outfit representation, and then refine a user's representation via his/her historical outfits. Furthermore, we jointly train these two tasks to optimize these representations. To demonstrate the effectiveness of HFGN, we conduct extensive experiments on a benchmark dataset, and HFGN achieves significant improvements over the state-of-the-art compatibility matching models like NGNN and outfit recommenders like FHN.

preprint2020arXiv

Reinforcement-Learning based Portfolio Management with Augmented Asset Movement Prediction States

Portfolio management (PM) is a fundamental financial planning task that aims to achieve investment goals such as maximal profits or minimal risks. Its decision process involves continuous derivation of valuable information from various data sources and sequential decision optimization, which is a prospective research direction for reinforcement learning (RL). In this paper, we propose SARL, a novel State-Augmented RL framework for PM. Our framework aims to address two unique challenges in financial PM: (1) data heterogeneity -- the collected information for each asset is usually diverse, noisy and imbalanced (e.g., news articles); and (2) environment uncertainty -- the financial market is versatile and non-stationary. To incorporate heterogeneous data and enhance robustness against environment uncertainty, our SARL augments the asset information with their price movement prediction as additional states, where the prediction can be solely based on financial data (e.g., asset prices) or derived from alternative sources such as news. Experiments on two real-world datasets, (i) Bitcoin market and (ii) HighTech stock market with 7-year Reuters news articles, validate the effectiveness of SARL over existing PM approaches, both in terms of accumulated profits and risk-adjusted profits. Moreover, extensive simulations are conducted to demonstrate the importance of our proposed state augmentation, providing new insights and boosting performance significantly over standard RL-based PM method and other baselines.

preprint2020arXiv

Strain-Induced Room-Temperature Ferroelectricity in SrTiO$_3$ Membranes

Advances in complex oxide heteroepitaxy have highlighted the enormous potential of utilizing strain engineering via lattice mismatch to control ferroelectricity in thin-film heterostructures. This approach, however, lacks the ability to produce large and continuously variable strain states, thus limiting the potential for designing and tuning the desired properties of ferroelectric films. Here, we observe and explore dynamic strain-induced ferroelectricity in SrTiO$_3$ by laminating freestanding oxide films onto a stretchable polymer substrate. Using a combination of scanning probe microscopy, optical second harmonic generation measurements, and atomistic modeling, we demonstrate robust room-temperature ferroelectricity in SrTiO$_3$ with 2.0% uniaxial tensile strain, corroborated by the notable features of 180° ferroelectric domains and an extrapolated transition temperature of 400 K. Our work reveals the enormous potential of employing oxide membranes to create and enhance ferroelectricity in environmentally benign lead-free oxides, which hold great promise for applications ranging from non-volatile memories to microwave electronics.

preprint2020arXiv

Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling

Visual Storytelling (VIST) is a task to tell a narrative story about a certain topic according to the given photo stream. The existing studies focus on designing complex models, which rely on a huge amount of human-annotated data. However, the annotation of VIST is extremely costly and many topics cannot be covered in the training dataset due to the long-tail topic distribution. In this paper, we focus on enhancing the generalization ability of the VIST model by considering the few-shot setting. Inspired by the way humans tell a story, we propose a topic adaptive storyteller to model the ability of inter-topic generalization. In practice, we apply the gradient-based meta-learning algorithm on multi-modal seq2seq models to endow the model with the ability to adapt quickly from topic to topic. Besides, we further propose a prototype encoding structure to model the ability of intra-topic derivation. Specifically, we encode and restore the few training story texts to serve as a reference to guide the generation at inference time. Experimental results show that topic adaptation and the prototype encoding structure mutually benefit the few-shot model on the BLEU and METEOR metrics. The further case study shows that the stories generated after few-shot adaptation are more relevant and expressive.

preprint2016arXiv

Ultrafast fluorescent decay induced by metal-mediated dipole-dipole interaction in two-dimensional molecular aggregates

A two-dimensional molecular aggregate (2DMA), a thin sheet of strongly interacting dipole molecules self-assembled at close distance on an ordered lattice, is a fascinating fluorescent material. It is distinctively different from the single or colloidal dye molecules or quantum dots in most previous research. In this paper, we verify for the first time that when a 2DMA is placed at a nanometric distance from a metallic substrate, the strong and coherent interaction between the dipoles inside the 2DMA dominates its fluorescent decay at the picosecond timescale. Our streak-camera lifetime measurement and interacting lattice-dipole calculation reveal that the metal-mediated dipole-dipole interaction shortens the fluorescent lifetime to about one half and increases the energy dissipation rate by ten times compared with that expected from the noninteracting single-dipole picture. Our finding can enrich our understanding of nanoscale energy transfer in molecular excitonic systems and may designate a new direction for developing fast and efficient optoelectronic devices.

preprint2015arXiv

Optical Selection Rule based on Valley-Exciton Locking for 2D Valleytronics

The optical selection rule fundamentally determines the optical transition between energy states in a variety of physical systems from hydrogen atoms to bulk crystals such as GaAs. It is important for optoelectronic applications such as lasers, energy-dispersive X-ray spectroscopy and quantum computation. Recently, single-layer transition metal dichalcogenides (TMDCs) have been found to exhibit valleys in momentum space with nontrivial Berry curvature and excitons with large binding energy. However, it is unclear how the unique valley degree of freedom combined with the strong excitonic effect influences the optical excitation. Here we discover a new set of optical selection rules in monolayer WS2, imposed by valley and exciton angular momentum. We experimentally demonstrated such a principle for second harmonic generation (SHG) and two-photon luminescence (TPL). Moreover, the two-photon induced valley populations yield net circular polarized photoluminescence after a sub-ps interexciton relaxation ($2p \rightarrow 1s$) and last for 8 ps. The discovery of this new optical selection rule in a valleytronic 2D system not only largely extends information degrees but also sets a foundation for the control of optical transitions that is crucial to valley optoelectronic device applications such as 2D valley-polarized light emitting diodes (LEDs), optical switches and coherent control for quantum computing.

preprint2015arXiv

Tungsten spectroscopy in the EUV observed in SH-HtscEBIT

We have recorded extreme ultraviolet spectra from $\mathrm{W^{11+}}$ to $\mathrm{W^{15+}}$ ions using a new flat field spectrometer installed at the Shanghai high temperature superconducting electron beam ion trap. The spectra were recorded at beam energies ranging between 200 eV and 400 eV and showed spectral lines/transition arrays in the 170 - 260 Å region. The charge states and spectral transitions were identified by comparison with calculations using a detailed relativistic configuration interaction method and collisional-radiative model, both incorporated in the Flexible Atomic Code. Atomic structure calculations showed that the dominant emission arises from $5d$ $\rightarrow$ $5p$ and $5p$ $\rightarrow$ $5s$ transitions. The work also identified the ground-state configuration of $\mathrm{W^{13+}}$ as $4f^{13}5s^2$ both theoretically and experimentally.

preprint2014arXiv

Metric Learning Driven Multi-Task Structured Output Optimization for Robust Keypoint Tracking

As an important and challenging problem in computer vision and graphics, keypoint-based object tracking is typically formulated in a spatio-temporal statistical learning framework. However, most existing keypoint trackers are incapable of effectively modeling and balancing the following three aspects in a simultaneous manner: temporal model coherence across frames, spatial model consistency within frames, and discriminative feature construction. To address this issue, we propose a robust keypoint tracker based on spatio-temporal multi-task structured output optimization driven by discriminative metric learning. Consequently, temporal model coherence is characterized by multi-task structured keypoint model learning over several adjacent frames, while spatial model consistency is modeled by solving a geometric verification based structured learning problem. Discriminative feature construction is enabled by metric learning to ensure the intra-class compactness and inter-class separability. Finally, the above three modules are simultaneously optimized in a joint learning scheme. Experimental results have demonstrated the effectiveness of our tracker.

preprint2014arXiv

Observation of Piezoelectricity in Monolayer Molybdenum Disulfide

Piezoelectricity offers precise and robust conversion between electricity and mechanical force. Here we report the first experimental evidence of piezoelectricity in a single layer of molybdenum disulfide (MoS2) crystal as a result of inversion symmetry breaking of the atomic structure, with a measured piezoelectric coefficient $e_{11} = 2.9 \times 10^{-10}$ C/m. Through the angular dependence of electro-mechanical coupling, we uniquely determined the two-dimensional (2D) crystal orientation. We observed that only MoS2 membranes with an odd number of layers exhibited piezoelectricity, in sharp contrast to the conventional materials. The piezoelectricity discovered in a single molecular membrane promises scaling down of nano-electro-mechanical systems (NEMS) to a single atomic unit cell, the ultimate material limit.

35 published item(s)

Milestone-Guided Policy Learning for Long-Horizon Language Agents

VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching

A Knowledge-Enhanced Adversarial Model for Cross-lingual Structured Sentiment Analysis

Accurate Lung Nodules Segmentation with Detailed Representation Transfer and Soft Mask Supervision

ACDNet: Adaptively Combined Dilated Convolution for Monocular Panorama Depth Estimation

Active Learning for Point Cloud Semantic Segmentation via Spatial-Structural Diversity Reasoning

Bidirectional Self-Training with Multiple Anisotropic Prototypes for Domain Adaptive Semantic Segmentation

Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs

Consensus Graph Representation Learning for Better Grounded Image Captioning

DS-MVSNet: Unsupervised Multi-view Stereo via Depth Synthesis

Explicit Image Caption Editing

Integrating Object-aware and Interaction-aware Knowledge for Weakly Supervised Scene Graph Generation

Label Semantic Knowledge Distillation for Unbiased Scene Graph Generation

Learning Regularized Multi-Scale Feature Flow for High Dynamic Range Imaging

Online Video Super-Resolution with Convolutional Kernel Bypass Graft

Rethinking Data Augmentation for Robust Visual Question Answering

Shapley Counterfactual Credits for Multi-Agent Reinforcement Learning

The Devil is in the Labels: Noisy Label Correction for Robust Scene Graph Generation

Unified Group Fairness on Federated Learning

Unified Normalization for Accelerating and Stabilizing Transformers

Kinetic Energy Distribution of Fragments for Thermal Neutron-Induced $^{235}$U and $^{239}$Pu Fission Reactions

Probing Multiple Electric Dipole Forbidden Optical Transitions in Highly Charged Nickel Ions

Berry curvature memory through electrically driven stacking transitions

CIAN: Cross-Image Affinity Net for Weakly Supervised Semantic Segmentation

Counterfactual Samples Synthesizing for Robust Visual Question Answering

Evaluation Framework For Large-scale Federated Learning

Hierarchical Fashion Graph Network for Personalized Outfit Recommendation

Reinforcement-Learning based Portfolio Management with Augmented Asset Movement Prediction States

Strain-Induced Room-Temperature Ferroelectricity in SrTiO$_3$ Membranes

Topic Adaptation and Prototype Encoding for Few-Shot Visual Storytelling

Ultrafast fluorescent decay induced by metal-mediated dipole-dipole interaction in two-dimensional molecular aggregates

Optical Selection Rule based on Valley-Exciton Locking for 2D Valleytronics

Tungsten spectroscopy in the EUV observed in SH-HtscEBIT

Metric Learning Driven Multi-Task Structured Output Optimization for Robust Keypoint Tracking

Observation of Piezoelectricity in Monolayer Molybdenum Disulfide