Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
60 works
0 followers
23 topics
4 close collaborators

Published work

60 published items

preprint · 2026 · arXiv

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to strengthen sustained, on-demand access to visual evidence. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for enhanced visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM shows improved robustness in longer generations and accelerates internal prediction convergence.
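The dilution effect described in this abstract can be illustrated with a toy calculation. The sketch below is not the paper's analysis or the PVM module. It assumes every visual token shares one attention logit and every text token shares another (a simplification introduced here), so the softmax share of attention on the visual tokens has a closed form. The function name and the token counts are hypothetical.

```python
import math


def visual_attention_mass(num_visual, num_text, visual_logit=0.0, text_logit=0.0):
    """Share of one query's softmax attention that lands on visual tokens.

    Toy assumption: all visual tokens share one logit and all text tokens
    share another, so the partition function is a sum of two terms.
    """
    visual = num_visual * math.exp(visual_logit)
    text = num_text * math.exp(text_logit)
    return visual / (visual + text)


# 256 visual tokens, with a growing textual history.
masses = [visual_attention_mass(256, n) for n in (0, 256, 1024, 4096)]
```

With equal logits the share reduces to N_v / (N_v + N_t), so it shrinks roughly in inverse proportion to the sequence length once the text history dominates.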

preprint · 2025 · arXiv

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

Recent Large Reasoning Models (LRMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. However, a growing concern lies in their tendency to produce excessively long reasoning traces, which are often filled with redundant content (e.g., repeated definitions), over-analysis of simple problems, and superficial exploration of multiple reasoning paths for harder tasks. This inefficiency introduces significant challenges for training, inference, and real-world deployment (e.g., in agent-based systems), where token economy is critical. In this survey, we provide a comprehensive overview of recent efforts aimed at improving reasoning efficiency in LRMs, with a particular focus on the unique challenges that arise in this new paradigm. We identify common patterns of inefficiency, examine methods proposed across the LRM lifecycle, i.e., from pretraining to inference, and discuss promising future directions for research. To support ongoing development, we also maintain a real-time GitHub repository tracking recent progress in the field. We hope this survey serves as a foundation for further exploration and inspires innovation in this rapidly evolving area.

preprint · 2025 · arXiv

DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed-source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.

preprint · 2025 · arXiv

MiMo-Audio: Audio Language Models are Few-Shot Learners

Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and the full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.

preprint · 2025 · arXiv

Sommerfeld Enhancement from Background Force and the Galactic Center GeV Excess

We study the impact of background-induced forces on dark matter (DM) annihilation and their implications for indirect detection. In the presence of a finite number density of background particles, loop-level interactions can generate an effective force that is significantly enhanced relative to the vacuum case. We construct a two-component DM model in which the dominant component is a fermionic particle $χ$ and the subdominant component is an ultralight pseudoscalar particle $ϕ$. The annihilation of $χ$ proceeds through the p-wave channel and produces gamma-ray emission. The finite density of $ϕ$ particles induces a background-enhanced force between $χ$ particles, leading to a sizable Sommerfeld enhancement of the annihilation. We show that a viable region of parameter space in this model can account for the gamma-ray excess observed in the Galactic Center using Fermi-LAT data. The background-induced force substantially amplifies the Sommerfeld enhancement and thus enlarges the parameter space capable of explaining the excess, highlighting the importance of background effects in astrophysical environments.

preprint · 2023 · arXiv

Hypotheses Tree Building for One-Shot Temporal Sentence Localization

Given an untrimmed video, temporal sentence localization (TSL) aims to localize a specific segment according to a given sentence query. Though respectable works have made decent achievements in this task, they severely rely on dense video frame annotations, which require a tremendous amount of human effort to collect. In this paper, we target another more practical and challenging setting: one-shot temporal sentence localization (one-shot TSL), which learns to retrieve the query information among the entire video with only one annotated frame. Particularly, we propose an effective and novel tree-structure baseline for one-shot TSL, called Multiple Hypotheses Segment Tree (MHST), to capture the query-aware discriminative frame-wise information under the insufficient annotations. Each video frame is taken as the leaf-node, and the adjacent frames sharing the same visual-linguistic semantics will be merged into the upper non-leaf node for tree building. At last, each root node is an individual segment hypothesis containing the consecutive frames of its leaf-nodes. During the tree construction, we also introduce a pruning strategy to eliminate the interference of query-irrelevant nodes. With our designed self-supervised loss functions, our MHST is able to generate high-quality segment hypotheses for ranking and selection with the query. Experiments on two challenging datasets demonstrate that MHST achieves competitive performance compared to existing methods.

preprint · 2023 · arXiv

Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding

Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to the reasoning bias during frame-query interaction, reducing the generalization ability of model. To alleviate above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
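The boundary-bias issue named in this abstract can be made concrete with a small sketch. It is not the SSRN sampling mechanism. It assumes a hypothetical 1000-frame video, evenly spaced sampling of 10 frames, and illustrative annotated boundaries at frames 420 and 610, and shows how the annotated frames drop out of the sampled set and get replaced by their nearest sampled neighbours.

```python
def uniform_sample(num_frames, num_samples):
    """Indices of num_samples frames spread evenly over the video."""
    step = num_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


def snapped_boundaries(sampled, start, end):
    """Nearest sampled frames to the annotated start and end frames."""
    def nearest(target):
        return min(sampled, key=lambda frame: abs(frame - target))
    return nearest(start), nearest(end)


# Hypothetical video and annotation, chosen only for illustration.
sampled = uniform_sample(num_frames=1000, num_samples=10)
new_start, new_end = snapped_boundaries(sampled, start=420, end=610)
```

Here neither annotated frame survives sampling, and the segment seen by the model shifts to frames 450 and 650, which is the kind of drift the abstract attributes to sparse downsampling.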

preprint · 2022 · arXiv

A Deep Reinforcement Learning based Approach for NOMA-based Random Access Network with Truncated Channel Inversion Power Control

As a main use case of 5G and beyond wireless networks, the ever-increasing number of machine-type communication (MTC) devices has posed critical challenges for MTC networks in recent years. It is imperative to support massive numbers of MTC devices with limited resources. To this end, non-orthogonal multiple access (NOMA) based random access networks have been deemed a promising candidate for MTC networks. In this paper, we propose a deep reinforcement learning (RL) based approach for a NOMA-based random access network with truncated channel inversion power control. Specifically, each MTC device randomly selects a pre-defined power level with a certain probability for data transmission. Devices use channel inversion power control, subject to an upper bound on the transmission power. Due to the stochastic nature of channel fading and the limited transmission power, devices with different achievable power levels are categorized as different types of devices. To achieve high throughput while considering fairness among all devices, two objective functions are formulated. One is to maximize the minimum long-term expected throughput of all MTC devices; the other is to maximize the geometric mean of the long-term expected throughput of all MTC devices. A policy-based deep reinforcement learning approach is further applied to tune the transmission probabilities of each device to solve the formulated optimization problems. Extensive simulations are conducted to show the merits of our proposed approach.
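A minimal sketch of the two ingredients named in this abstract, under stated assumptions. Channel inversion is modelled here as transmit power = target received power / channel gain, and a power level counts as achievable only if that transmit power stays under the cap. The function names, the linear power units, and the sample values are hypothetical, and the reinforcement learning part is not shown.

```python
import math


def achievable_levels(rx_power_levels, channel_gain, max_tx_power):
    """Received-power levels a device can reach by inverting its channel.

    A level p needs transmit power p / channel_gain; levels whose required
    transmit power exceeds max_tx_power are truncated away.
    """
    return [p for p in rx_power_levels if p / channel_gain <= max_tx_power]


def max_min_objective(throughputs):
    """First fairness objective: the worst device's long-term throughput."""
    return min(throughputs)


def geometric_mean_objective(throughputs):
    """Second fairness objective: geometric mean of device throughputs."""
    return math.exp(sum(math.log(t) for t in throughputs) / len(throughputs))


# Illustrative values only: three predefined levels, one faded channel.
levels = achievable_levels([1.0, 2.0, 4.0], channel_gain=0.5, max_tx_power=5.0)
worst = max_min_objective([1.0, 4.0])
gmean = geometric_mean_objective([1.0, 4.0])
```

Devices with a weaker channel end up with fewer achievable levels, which is one way to read the abstract's grouping of devices into types.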

preprint · 2022 · arXiv

A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models

Large pre-trained vision-language (VL) models can learn a new task with a handful of examples and generalize to a new task without fine-tuning. However, these VL models are hard to deploy for real-world applications due to their impractically huge sizes and slow inference speed. To address this limitation, we study prompt-based low-resource learning of VL tasks with our proposed method, FewVLM, which is relatively smaller than recent few-shot learners. For FewVLM, we pre-train a sequence-to-sequence transformer model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM). Furthermore, we analyze the effect of diverse prompts for few-shot tasks. Experimental results on VQA show that FewVLM with prompt-based learning outperforms Frozen, which is 31x larger than FewVLM, by 18.2 percentage points and achieves comparable results to a 246x larger model, PICa. In our analysis, we observe that (1) prompts significantly affect zero-shot performance but marginally affect few-shot performance, (2) models with noisy prompts learn as quickly as hand-crafted prompts given larger training data, and (3) MaskedLM helps VQA tasks while PrefixLM boosts captioning performance. Our code is publicly available at https://github.com/woojeongjin/FewVLM.

preprint · 2022 · arXiv

A SBP-SAT FDTD Subgridding Method Using Staggered Yee's Grids Without Modifying Field Components

A summation-by-parts simultaneous approximation term (SBP-SAT) finite-difference time-domain (FDTD) subgridding method is proposed to model geometrically fine structures in this paper. Compared with our previous work, the proposed SBP-SAT FDTD method uses the staggered Yee's grid without adding or modifying any field components through field extrapolation on the boundaries to make the discrete operators satisfy the SBP property. The accuracy of extrapolation keeps consistency with that of the second-order finite-difference scheme near the boundaries. In addition, the SATs are used to weakly enforce the tangential boundary conditions between multiple mesh blocks with different mesh sizes. With carefully designed interpolation matrices and selected free parameters of the SATs, no dissipation occurs in the whole computational domain. Therefore, its long-time stability is theoretically guaranteed. Three numerical examples are carried out to validate its effectiveness. Results show that the proposed SBP-SAT FDTD subgridding method is stable, accurate, efficient, and easy to implement based on existing FDTD codes with only a few modifications.
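The SBP property that this abstract relies on can be checked numerically on a small example. The sketch below uses the classical second-order SBP first-derivative operator on a collocated 1D grid, which is an assumption made here for illustration. It is not the staggered Yee-grid operator or the extrapolation scheme of the paper. The operator is D = H^{-1} Q with a diagonal norm H, and the SBP property is Q + Q^T = diag(-1, 0, ..., 0, 1).

```python
def sbp_operators(n, h):
    """Return (H, Q) for the classical second-order SBP first derivative.

    H is the diagonal norm, stored as a list of its diagonal entries.
    Q is a dense n x n list of lists satisfying Q + Q^T = diag(-1, 0, ..., 0, 1).
    """
    H = [h] * n
    H[0] = H[-1] = h / 2
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        Q[i][i + 1] = 0.5
        Q[i + 1][i] = -0.5
    Q[0][0] = -0.5
    Q[-1][-1] = 0.5
    return H, Q


def derivative(H, Q, u):
    """Apply D = H^{-1} Q to the grid function u."""
    return [sum(q * x for q, x in zip(row, u)) / hi for row, hi in zip(Q, H)]


n, h = 5, 0.25
H, Q = sbp_operators(n, h)
boundary = [[Q[i][j] + Q[j][i] for j in range(n)] for i in range(n)]
slope = derivative(H, Q, [2 * (i * h) + 1 for i in range(n)])
```

Because Q + Q^T picks out only the two boundary values, the discrete operator mimics integration by parts, which is what energy-stability arguments for SBP-SAT schemes build on.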

preprint · 2022 · arXiv

Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Large-scale pre-trained language models have achieved tremendous success across a wide range of natural language understanding (NLU) tasks, even surpassing human performance. However, recent studies reveal that the robustness of these models can be challenged by carefully crafted textual adversarial examples. While several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing. In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. Our findings are summarized as follows. (i) Most existing adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples, with around 90% of them either changing the original semantic meanings or misleading human annotators as well. Therefore, we perform a careful filtering process to curate a high-quality benchmark. (ii) All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy. We hope our work will motivate the development of new adversarial attacks that are more stealthy and semantic-preserving, as well as new robust language models against sophisticated adversarial attacks. AdvGLUE is available at https://adversarialglue.github.io.

preprint · 2022 · arXiv

Backdoor Attacks on Crowd Counting

Crowd counting is a regression task that estimates the number of people in a scene image, which plays a vital role in a range of safety-critical applications, such as video surveillance, traffic monitoring and flow control. In this paper, we investigate the vulnerability of deep learning based crowd counting models to backdoor attacks, a major security threat to deep learning. A backdoor attack implants a backdoor trigger into a target model via data poisoning so as to control the model's predictions at test time. Different from image classification models on which most of existing backdoor attacks have been developed and tested, crowd counting models are regression models that output multi-dimensional density maps, thus requiring different techniques to manipulate. In this paper, we propose two novel Density Manipulation Backdoor Attacks (DMBA$^{-}$ and DMBA$^{+}$) to attack the model to produce arbitrarily large or small density estimations. Experimental results demonstrate the effectiveness of our DMBA attacks on five classic crowd counting models and four types of datasets. We also provide an in-depth analysis of the unique challenges of backdooring crowd counting models and reveal two key elements of effective attacks: 1) full and dense triggers and 2) manipulation of the ground truth counts or density maps. Our work could help evaluate the vulnerability of crowd counting models to potential backdoor attacks.

preprint · 2022 · arXiv

CP violating dark photon kinetic mixing and type-III seesaw model

The hypothetical dark photon portal connecting the visible and dark sectors of the Universe has received considerable attention in recent years, with a focus on CP-conserving kinetic mixing between the Standard Model (SM) hypercharge gauge boson and a new U(1)$_X$ gauge boson. In the effective field theory context, one may write down non-renormalizable CP-violating kinetic mixing interactions involving the $X$ and SU(2)$_L$ gauge bosons. We construct for the first time a renormalizable model for CP-violating kinetic mixing that induces CP-violating non-Abelian kinetic mixing at mass dimension five. The model grows out of the type-III seesaw model, with the lepton triplets containing right-handed neutrinos playing a crucial role in making the model renormalizable and providing a bridge to the origin of neutrino mass. This scenario also accommodates electron electric dipole moments (EDM) as large as current experimental bound, making future EDM searches an important probe of this scenario.

preprint · 2022 · arXiv

Disks and Outflows in the Intermediate-mass Star Forming Region NGC 2071 IR

We present ALMA band 6/7 (1.3 mm/0.87 mm) and VLA Ka band (9 mm) observations toward NGC 2071 IR, an intermediate-mass star forming region. We characterize the continuum and associated molecular line emission towards the most luminous protostars, i.e., IRS1 and IRS3, on ~100 au (0.2") scales. IRS1 is partly resolved in millimeter and centimeter continuum, which shows a potential disk. IRS3 has a well resolved disk appearance in millimeter continuum and is further resolved into a close binary system separated by ~40 au at 9 mm. Both sources exhibit clear velocity gradients across their disk major axes in multiple spectral lines including C18O, H2CO, SO, SO2, and complex organic molecules like CH3OH, 13CH3OH and CH3OCHO. We use an analytic method to fit the Keplerian rotation of the disks, and give constraints on physical parameters with a MCMC routine. The IRS3 binary system is estimated to have a total mass of 1.4-1.5 $M_\odot$. IRS1 has a central mass of 3-5 $M_\odot$ based on both kinematic modeling and its spectral energy distribution, assuming that it is dominated by a single protostar. For both IRS1 and IRS3, the inferred ejection directions from different tracers, including radio jet, water maser, molecular outflow, and H2 emission, are not always consistent, and for IRS1, these can be misaligned by ~50$^{\circ}$. IRS3 is better explained by a single precessing jet. A similar mechanism may be present in IRS1 as well but an unresolved multiple system in IRS1 is also possible.

preprint · 2022 · arXiv

Dual networks based 3D Multi-Person Pose Estimation from Monocular Video

Monocular 3D human pose estimation has made progress in recent years. Most of the methods focus on single persons, which estimate the poses in the person-centric coordinates, i.e., the coordinates based on the center of the target person. Hence, these methods are inapplicable for multi-person 3D pose estimation, where the absolute coordinates (e.g., the camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single pose estimation, due to inter-person occlusion and close human interactions. Existing top-down multi-person methods rely on human detection (i.e., top-down approach), and thus suffer from the detection errors and cannot produce reliable pose estimation in multi-person scenes. Meanwhile, existing bottom-up methods that do not use human detection are not affected by detection errors, but since they process all persons in a scene at once, they are prone to errors, particularly for persons in small scales. To address all these challenges, we propose the integration of top-down and bottom-up approaches to exploit their strengths. Our top-down network estimates human joints from all persons instead of one in an image patch, making it robust to possible erroneous bounding boxes. Our bottom-up network incorporates human-detection based normalized heatmaps, allowing the network to be more robust in handling scale variations. Finally, the estimated 3D poses from the top-down and bottom-up networks are fed into our integration network for final 3D poses. To address the common gaps between training and testing data, we do optimization during the test time, by refining the estimated 3D human poses using high-order temporal constraint, re-projection loss, and bone length regularizations. Our evaluations demonstrate the effectiveness of the proposed method. Code and models are available: https://github.com/3dpose/3D-Multi-Person-Pose.

preprint · 2022 · arXiv

Efficient Algorithms for Planning with Participation Constraints

We consider the problem of planning with participation constraints introduced in [Zhang et al., 2022]. In this problem, a principal chooses actions in a Markov decision process, resulting in separate utilities for the principal and the agent. However, the agent can and will choose to end the process whenever his expected onward utility becomes negative. The principal seeks to compute and commit to a policy that maximizes her expected utility, under the constraint that the agent should always want to continue participating. We provide the first polynomial-time exact algorithm for this problem for finite-horizon settings, where previously only an additive $\varepsilon$-approximation algorithm was known. Our approach can also be extended to the (discounted) infinite-horizon case, for which we give an algorithm that runs in time polynomial in the size of the input and $\log(1/\varepsilon)$, and returns a policy that is optimal up to an additive error of $\varepsilon$.

preprint · 2022 · arXiv

Flavor Specific $U(1)_{B_q-L_μ}$ Gauge Model for Muon $g-2$ and $b \to s \bar μμ$ Anomalies

The muon $(g-2)_μ$ and $b\to s \bar μμ$ induced $B$ anomalies as hints of new physics beyond the standard model (SM) have attracted much attention. These two anomalies indicate that there may exist new interaction specifically related to muon. A lot of theoretical ideas have been proposed to explain these anomalies. Gauged flavor specific $U(1)_{B_q-L_μ}$ is among the promising ones. The new gauge boson $Z'$ from $U(1)_{B_q-L_μ}$ interacts with muon and provides necessary ingredient to solve the $(g-2)_μ$ anomaly. The $Z'$-quark coupling can generate flavor changing interactions after diagonalization of quark mass matrix between weak eigen-state and mass eigen-state basis. We revisit challenges for such models attempting to explain the $(g-2)_μ$ and $B$ anomalies separately or simultaneously. We find although for $U(1)_{B_q-L_μ}$ models there is still parameter space to provide solutions for separately explaining the $(g-2)_μ$ and $B$ anomalies, there exists no parameter space for such models to solve both the anomalies simultaneously, after taking into account existing constraints from $τ\to μγ$, $τ\to 3 μ$, neutrino trident and $B_s - \bar B_s$ data. Among them leptonic processes restrict $Z^\prime$ mass to be less than a few hundred MeV if required to solve the $(g-2)_μ$ anomaly, which causes conflict between data from $B_s - \bar B_s$, $D^0 - \bar D^0$ mixing and also hadron decays with $Z^\prime$ in the final states. The effects of $U(1)_Y$ and $U(1)_{B_q-L_μ}$ kinetic mixing on these anomalies are also studied. We find that neither can these effects do much to bring the two anomalies together to be solved simultaneously.

preprint · 2022 · arXiv

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

Large-scale multi-modal contrastive pre-training has demonstrated great utility to learn transferable features for a range of downstream tasks by mapping multiple modalities into a shared embedding space. Typically, this has employed separate encoders for each modality. However, recent work suggests that transformers can support learning across multiple modalities and allow knowledge sharing. Inspired by this, we investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks. More specifically, we question how many parameters of a transformer model can be shared across modalities during contrastive pre-training, and rigorously examine architectural design choices that position the proportion of parameters shared along a spectrum. In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters. Additionally, we find that light-weight modality-specific parallel modules further improve performance. Experimental results show that the proposed MS-CLIP approach outperforms vanilla CLIP by up to 13% relative in zero-shot ImageNet classification (pre-trained on YFCC-100M), while simultaneously supporting a reduction of parameters. In addition, our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks. Furthermore, we discover that sharing parameters leads to semantic concepts from different modalities being encoded more closely in the embedding space, facilitating the transferring of common semantic structure (e.g., attention patterns) from language to vision. Code is available at https://github.com/Hxyou/MSCLIP.

preprint · 2022 · arXiv

Memory-Guided Semantic Learning Network for Temporal Sentence Grounding

Temporal sentence grounding (TSG) is crucial and fundamental for video understanding. Although the existing methods train well-designed deep networks with a large amount of data, we find that they can easily forget rarely appearing cases in the training stage due to the imbalanced data distribution, which harms model generalization and leads to undesirable performance. To tackle this issue, we propose a memory-augmented network, called Memory-Guided Semantic Learning Network (MGSL-Net), that learns and memorizes the rarely appearing content in TSG tasks. Specifically, MGSL-Net consists of three main parts: a cross-modal interaction module, a memory augmentation module, and a heterogeneous attention module. We first align the given video-query pair by a cross-modal graph convolutional network, and then utilize a memory module to record the cross-modal shared semantic features in the domain-specific persistent memory. During training, the memory slots are dynamically associated with both common and rare cases, alleviating the forgetting issue. In testing, the rare cases can thus be enhanced by retrieving the stored memories, resulting in better generalization. Finally, the heterogeneous attention module is utilized to integrate the enhanced multi-modal features in both video and query domains. Experimental results on three benchmarks show the superiority of our method in both effectiveness and efficiency, substantially improving the accuracy not only on the entire dataset but also on rare cases.

preprint · 2022 · arXiv

The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy

Vision transformers (ViTs) have gained increasing popularity as they are commonly believed to have higher modeling capacity and representation flexibility than traditional convolutional networks. However, it is questionable whether such potential has been fully unleashed in practice, as the learned ViTs often suffer from over-smoothening, yielding likely redundant models. Recent works made preliminary attempts to identify and alleviate such redundancy, e.g., via regularizing embedding similarity or re-injecting convolution-like structures. However, a "head-to-toe assessment" regarding the extent of redundancy in ViTs, and how much we could gain by thoroughly mitigating such, has been absent for this field. This paper, for the first time, systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space. In view of them, we advocate a principle of diversity for training ViTs, by presenting corresponding regularizers that encourage representation diversity and coverage at each of those levels, enabling the capture of more discriminative information. Extensive experiments on ImageNet with a number of ViT backbones validate the effectiveness of our proposals, largely eliminating the observed ViT redundancy and significantly boosting the model generalization. For example, our diversified DeiT obtains 0.70%~1.76% accuracy boosts on ImageNet with highly reduced similarity. Our code is fully available at https://github.com/VITA-Group/Diverse-ViT.

preprint · 2022 · arXiv

Towards the Development of A Three-Dimensional SBP-SAT FDTD Method: Theory and Validation

To enhance the scalability and performance of the traditional finite-difference time-domain (FDTD) methods, a three-dimensional summation-by-parts simultaneous approximation term (SBP-SAT) FDTD method is developed to solve complex electromagnetic problems. It is theoretically stable and can be further used for multiple mesh blocks with different mesh sizes. This paper mainly focuses on the fundamental theoretical aspects of its three-dimensional implementation, the SAT for various boundary conditions, the numerical dispersion properties, and the comparison with the FDTD method. The proposed SBP-SAT FDTD method inherits all the merits of the FDTD method, which is matrix-free, easy to implement, and has the same level of accuracy with a negligible overhead of runtime (0.13%) and memory usage (1.2%). Four numerical examples are carried out to validate the effectiveness of the proposed method.

preprint2022arXiv

Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Although existing works have achieved decent results on this task, they rely heavily on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this paper is the first work to address TVG in an unsupervised setting. Considering that there is no paired supervision, we propose a novel Deep Semantic Clustering Network (DSCNet) to leverage all semantic information from the whole query set to compose the possible activity in each video for grounding. Specifically, we first develop a language semantic mining module, which extracts implicit semantic features from the whole query set. Then, these language semantic features serve as the guidance to compose the activity in video via a video-based semantic aggregation module. Finally, we utilize a foreground attention branch to filter out the redundant background activities and refine the grounding results. To validate the effectiveness of our DSCNet, we conduct experiments on both ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance, and even outperforms most weakly-supervised approaches.

preprint2022arXiv

Widening the $U(1)_{L_μ-L_τ}$ $Z^\prime$ mass range for resolving the muon $g-2$ anomaly

Exchanging a $Z^\prime$ gauge boson is a favored mechanism to solve the muon $(g-2)_μ$ anomaly. Among such models, the $Z^\prime$ from the $U(1)_{L_μ- L_τ}$ gauge group has been extensively studied. In this model, the same interaction that addresses $(g-2)_μ$ leads to an enhanced muon neutrino trident (MNT) process $ν_μN \to ν_μμ\bar μN$, constraining the $Z^\prime$ mass to be less than a few hundred MeV. Many other $Z^\prime$ models face the same problem. It has long been realized that the coupling of $Z^\prime$ in the model can admit a $(\bar μγ^μτ+ \bar ν_μγ^μL ν_τ)Z^\prime_μ$ interaction, which does not contribute to the MNT process. It can solve the $(g-2)_μ$ anomaly for a much wider $Z^\prime$ mass range. However, this new interaction induces $τ\to μ\barν_μν_τ$, which rules it out as a solution to the $(g-2)_μ$ anomaly. Here we propose a mechanism that introduces type-II seesaw $SU(2)_L$ triplet scalars to evade constraints from all known data and allow a wide $Z^\prime$ mass range to solve the $(g-2)_μ$ anomaly. This mechanism opens a new window for $Z^\prime$ physics.

preprint2022arXiv

ZOOMER: Boosting Retrieval on Web-scale Graphs by Regions of Interest

We introduce ZOOMER, a system deployed at Taobao, the largest e-commerce platform in China, for training and serving GNN-based recommendations over web-scale graphs. ZOOMER is designed to tackle two challenges presented by the massive user data at Taobao: low training/serving efficiency due to the huge scale of the graphs, and low recommendation quality due to information overload, which distracts the recommendation model from specific user intentions. ZOOMER achieves this by introducing a key concept, the Region of Interest (ROI) in GNNs for recommendations, i.e., a neighborhood region in the graph with significant relevance to a strong user intention. ZOOMER narrows the focus from the whole graph and "zooms in" on the more relevant ROIs, thereby reducing the training/serving cost and mitigating the information overload at the same time. With carefully designed mechanisms, ZOOMER identifies the interest expressed by each recommendation request, constructs an ROI subgraph by sampling with respect to the interest, and guides the GNN to reweigh different parts of the ROI towards the interest by a multi-level attention module. Deployed as a large-scale distributed system, ZOOMER supports graphs with billions of nodes for training and thousands of requests per second for serving. ZOOMER achieves up to 14x speedup when downsizing sampling scales, with AUC comparable to or even better than that of baseline methods. In addition, both the offline evaluation and the online A/B test demonstrate the effectiveness of ZOOMER.

preprint2021arXiv

Collider search of light dark matter model with dark sector decay

We explore the possibility that the dark matter relic density is not produced by a thermal mechanism directly, but by the decay of other, heavier dark sector particles, which in turn can be produced by the thermal freeze-out mechanism. Using a concrete model with light dark matter from dark sector decay, we study the collider signature of the dark sector particles in association with Higgs production processes. We find that future lepton colliders can be a better place to probe the signature of this kind of light dark matter model than hadron colliders such as the LHC. Meanwhile, we find that a Higgs factory with a center-of-mass energy of 250 GeV has better potential to resolve the signature of this kind of light dark matter model than a Higgs factory with a center-of-mass energy of 350 GeV.

preprint2021arXiv

Deep Co-Attention Network for Multi-View Subspace Learning

Many real-world applications involve data from multiple modalities and thus exhibit view heterogeneity. For example, user modeling on social media might leverage both the topology of the underlying social network and the content of the users' posts; in the medical domain, multiple views could be X-ray images taken at different poses. To date, various techniques have been proposed to achieve promising results, such as methods based on canonical correlation analysis. Meanwhile, it is critical for decision-makers to be able to understand the prediction results from these methods. For example, given the diagnostic result that a model provided based on the X-ray images of a patient at different poses, the doctor needs to know why the model made such a prediction. However, state-of-the-art techniques usually suffer from the inability to utilize the complementary information of each view and to explain the predictions in an interpretable manner. To address these issues, in this paper, we propose a deep co-attention network for multi-view subspace learning, which aims to extract both the common information and the complementary information in an adversarial setting and to provide robust interpretations behind the prediction to the end-users via the co-attention mechanism. In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation by incorporating the classifier into our model. This improves the quality of the latent representation and accelerates convergence. Finally, we develop an efficient iterative algorithm to find the optimal encoders and discriminator, which are evaluated extensively on synthetic and real-world data sets. We also conduct a case study to demonstrate how the proposed method robustly interprets the predictions on an image data set.

preprint2021arXiv

Deuterium Chemodynamics of Massive Pre-Stellar Cores

High levels of deuterium fractionation of $\rm N_2H^+$ (i.e., $\rm D_{frac}^{N_2H^+} \gtrsim 0.1$) are often observed in pre-stellar cores (PSCs) and detection of $\rm N_2D^+$ is a promising method to identify elusive massive PSCs. However, the physical and chemical conditions required to reach such high levels of deuteration are still uncertain, as is the diagnostic utility of $\rm N_2H^+$ and $\rm N_2D^+$ observations of PSCs. We perform 3D magnetohydrodynamics simulations of a massive, turbulent, magnetised PSC, coupled with a sophisticated deuteration astrochemical network. Although the core has some magnetic/turbulent support, it collapses under gravity in about one freefall time, which marks the end of the simulations. Our fiducial model achieves relatively low $\rm D_{frac}^{N_2H^+} \sim 0.002$ during this time. We then investigate effects of initial ortho-para ratio of $\rm H_2$ ($\rm OPR^{H_2}$), temperature, cosmic ray (CR) ionization rate, CO and N-species depletion factors and prior PSC chemical evolution. We find that high CR ionization rates and high depletion factors allow the simulated $\rm D_{frac}^{N_2H^+}$ and absolute abundances to match observational values within one freefall time. For $\rm OPR^{H_2}$, while a lower initial value helps the growth of $\rm D_{frac}^{N_2H^+}$, the spatial structure of deuteration is too widespread compared to observed systems. For an example model with elevated CR ionization rates and significant heavy element depletion, we then study the kinematic and dynamic properties of the core as traced by its $\rm N_2D^+$ emission. The core, undergoing quite rapid collapse, exhibits disturbed kinematics in its average velocity map. Still, because of magnetic support, the core often appears kinematically sub-virial based on its $\rm N_2D^+$ velocity dispersion.

preprint2021arXiv

Efficient Robust Training via Backward Smoothing

Adversarial training is so far the most effective strategy for defending against adversarial examples. However, it suffers from high computational costs due to the iterative adversarial attacks in each training step. Recent studies show that it is possible to achieve fast adversarial training by performing a single-step attack with random initialization. However, such an approach still lags behind state-of-the-art adversarial training algorithms in both stability and model robustness. In this work, we develop a new understanding of fast adversarial training by viewing random initialization as performing randomized smoothing for better optimization of the inner maximization problem. Following this new perspective, we also propose a new initialization strategy, backward smoothing, to further improve the stability and model robustness over single-step robust training methods. Experiments on multiple benchmarks demonstrate that our method achieves model robustness similar to that of the original TRADES method while using much less training time ($\sim$3x improvement with the same training schedule).
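As a rough illustration of the "single-step attack with random initialization" mentioned in the abstract above, the sketch below perturbs one input of a toy logistic-regression model. The model, the function name `single_step_attack`, and the values of `eps` and `alpha` are assumptions made for illustration only; this is not the paper's backward-smoothing strategy.

```python
import math
import random

def _sign(g):
    # Returns -1, 0, or 1 depending on the sign of g.
    return (g > 0) - (g < 0)

def single_step_attack(x, y, w, b, eps, alpha, rng):
    """One signed-gradient step from a random start inside the L-inf ball.

    Toy setting (assumed): binary logistic regression with weights w,
    bias b, label y in {0, 1}, and cross-entropy loss.
    """
    # Random initialization: uniform noise in [-eps, eps] per coordinate.
    delta = [rng.uniform(-eps, eps) for _ in x]
    # Gradient of the cross-entropy loss with respect to the input,
    # evaluated at the randomly perturbed point x + delta.
    logit = sum(wi * (xi + di) for wi, xi, di in zip(w, x, delta)) + b
    prob = 1.0 / (1.0 + math.exp(-logit))
    grad = [(prob - y) * wi for wi in w]
    # Single ascent step, then projection back onto the eps-ball.
    stepped = [di + alpha * _sign(gi) for di, gi in zip(delta, grad)]
    clipped = [max(-eps, min(eps, si)) for si in stepped]
    return [xi + ci for xi, ci in zip(x, clipped)]

# Hypothetical example values.
rng = random.Random(0)
w = [1.0, -2.0, 0.5]
x = [0.2, 0.1, -0.3]
x_adv = single_step_attack(x, y=1, w=w, b=0.0, eps=0.1, alpha=0.125, rng=rng)
```

In adversarial training, a perturbed input of this kind would replace the clean input when computing the training loss; the step size `alpha` relative to `eps` is a tuning choice rather than a value taken from the abstract.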

preprint2021arXiv

Electronic controllable broadband and robust terahertz surface plasmon-polaritons switch based on hybrid ITO waveguide coupler

The surface plasmon-polariton (SPP) switch is a key element of integrated devices in optical computation and terahertz (THz) communications. In this paper, we propose a novel design of a THz SPP switch based on quantum engineering. Due to the robustness of the coherent quantum control technique, our switch is very robust against perturbations of geometrical parameters and performs well in both the on-state and the off-state from 0.5 THz to 0.7 THz. The on-state and off-state of our device can be controlled by the external voltage. We believe this finding will be a great improvement for integrated optical computing and THz communications.

preprint2021arXiv

EnlightenGAN: Deep Light Enhancement without Paired Supervision

Deep learning-based methods have achieved remarkable success in image restoration and enhancement, but are they still competitive when there is a lack of paired training data? As one such example, this paper explores the low-light image enhancement problem, where in practice it is extremely challenging to simultaneously take a low-light and a normal-light photo of the same visual scene. We propose a highly effective unsupervised generative adversarial network, dubbed EnlightenGAN, that can be trained without low/normal-light image pairs, yet proves to generalize very well on various real-world test images. Instead of supervising the learning using ground truth data, we propose to regularize the unpaired training using the information extracted from the input itself, and benchmark a series of innovations for the low-light image enhancement problem, including a global-local discriminator structure, a self-regularized perceptual loss fusion, and an attention mechanism. Through extensive experiments, our proposed approach outperforms recent methods under a variety of metrics in terms of visual quality and a subjective user study. Thanks to the great flexibility brought by unpaired training, EnlightenGAN is demonstrated to be easily adaptable to enhancing real-world images from various domains. The code is available at https://github.com/yueruchen/EnlightenGAN.

preprint2021arXiv

Fair for All: Best-effort Fairness Guarantees for Classification

Standard approaches to group-based notions of fairness, such as parity and equalized odds, try to equalize absolute measures of performance across known groups (based on race, gender, etc.). Consequently, a group that is inherently harder to classify may hold back the performance on other groups; and no guarantees can be provided for unforeseen groups. Instead, we propose a fairness notion whose guarantee, on each group $g$ in a class $\mathcal{G}$, is relative to the performance of the best classifier on $g$. We apply this notion to broad classes of groups, in particular, where (a) $\mathcal{G}$ consists of all possible groups (subsets) in the data, and (b) $\mathcal{G}$ is more streamlined. For the first setting, which is akin to groups being completely unknown, we devise the PF (Proportional Fairness) classifier, which guarantees, on any possible group $g$, an accuracy that is proportional to that of the optimal classifier for $g$, scaled by the relative size of $g$ in the data set. Due to including all possible groups, some of which could be too complex to be relevant, the worst-case theoretical guarantees here have to be proportionally weaker for smaller subsets. For the second setting, we devise the BeFair (Best-effort Fair) framework which seeks an accuracy, on every $g \in \mathcal{G}$, which approximates that of the optimal classifier on $g$, independent of the size of $g$. Aiming for such a guarantee results in a non-convex problem, and we design novel techniques to get around this difficulty when $\mathcal{G}$ is the set of linear hypotheses. We test our algorithms on real-world data sets, and present interesting comparative insights on their performance.

preprint2021arXiv

Surveys of Clumps, Cores, and Condensations in the Cygnus X: II. Radio Properties of the Massive Dense Cores

We have carried out a high-sensitivity and high-resolution radio continuum study towards a sample of 47 massive dense cores (MDCs) in the Cygnus X star-forming complex using the Karl G. Jansky Very Large Array, aiming to detect and characterize the radio emission associated with star-forming activities down to ~0.01 pc scales. We have detected 64 radio sources within or closely around the full width at half-maximum (FWHM) of the MDCs, of which 37 are reported for the first time. The majority of the detected radio sources are associated with dust condensations embedded within the MDCs, and they are mostly weak and compact. We are able to build spectral energy distributions for 8 sources. Two of them indicate non-thermal emission and the other six indicate thermal free-free emission. We have determined that most of the radio sources are ionized jets or winds originating from massive young stellar objects, whereas only a few sources are likely to be ultra-compact HII regions. Further quantitative analyses indicate that the radio luminosity of the detected radio sources increases along the evolution path of the MDCs.

preprint2021arXiv

Topology Aware Deep Learning for Wireless Network Optimization

Data-driven machine learning approaches have recently been proposed to facilitate wireless network optimization by learning latent knowledge from historical optimization instances. However, existing methods do not handle well the topology information that directly impacts the network optimization results. Directly operating on simple representations, e.g., adjacency matrices, results in poor generalization performance, as the learned results depend on the specific ordering of the network elements in the training data. To address this issue, we propose a two-stage topology-aware machine learning framework (TALF), which trains a graph embedding unit and a deep feed-forward network (DFN) jointly. By propagating and summarizing the underlying graph topological information, TALF encodes the topology in the vector representation of the optimization instance, which is used by the later DFN to infer critical structures of an optimal or near-optimal solution. The proposed approach is evaluated on a canonical wireless network flow problem with diverse network topologies and flow deployments. An in-depth study of the trade-off between the efficiency and effectiveness of the inference results is also conducted, and we show that our approach is better at differentiating links, saving up to 60% of computation time at over 90% solution quality.

preprint2020arXiv

3D Human Pose Estimation using Spatio-Temporal Networks with Explicit Occlusion Training

Estimating 3D poses from a monocular video is still a challenging task, despite the significant progress that has been made in recent years. Generally, the performance of existing methods drops when the target person is too small/large, or the motion is too fast/slow relative to the scale and speed of the training data. Moreover, to our knowledge, many of these methods are not explicitly designed or trained for severe occlusion, which compromises their performance in handling occlusion. To address these problems, we introduce a spatio-temporal network for robust 3D human pose estimation. As humans in videos may appear at different scales and have various motion speeds, we apply multi-scale spatial features for 2D joint or keypoint prediction in each individual frame, and multi-stride temporal convolutional networks (TCNs) to estimate 3D joints or keypoints. Furthermore, we design a spatio-temporal discriminator based on body structures as well as limb motions to assess whether the predicted pose forms a valid pose and a valid movement. During training, we explicitly mask out some keypoints to simulate various occlusion cases, from minor to severe occlusion, so that our network can learn better and become robust to various degrees of occlusion. As there are limited 3D ground-truth data, we further utilize 2D video data to add a semi-supervised learning capability to our network. Experiments on public datasets validate the effectiveness of our method, and our ablation studies show the strengths of our network's individual submodules.

preprint2020arXiv

A Survey of Model Compression and Acceleration for Deep Neural Networks

Deep neural networks (DNNs) have recently achieved great success in many visual recognition tasks. However, existing deep neural network models are computationally expensive and memory intensive, hindering their deployment in devices with low memory resources or in applications with strict latency requirements. Therefore, a natural thought is to perform model compression and acceleration in deep networks without significantly decreasing the model performance. During the past five years, tremendous progress has been made in this area. In this paper, we review the recent techniques for compacting and accelerating DNN models. In general, these techniques are divided into four categories: parameter pruning and quantization, low-rank factorization, transferred/compact convolutional filters, and knowledge distillation. Methods of parameter pruning and quantization are described first, after which the other techniques are introduced. For each category, we also provide insightful analysis of the performance, related applications, advantages, and drawbacks. Then we go through some very recent successful methods, for example, dynamic capacity networks and stochastic depth networks. After that, we survey the evaluation metrics, the main datasets used for evaluating the model performance, and recent benchmark efforts. Finally, we conclude this paper and discuss the remaining challenges and possible directions for future work.

preprint2020arXiv

Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning

Pretrained models from self-supervision are widely used to fine-tune downstream tasks faster or to better accuracy. However, gaining robustness from pretraining is left unexplored. We introduce adversarial training into self-supervision, to provide general-purpose robust pre-trained models for the first time. We find these robust pre-trained models can benefit the subsequent fine-tuning in two ways: i) boosting final model robustness; ii) saving computation cost, if proceeding towards adversarial fine-tuning. We conduct extensive experiments to demonstrate that the proposed framework achieves large performance margins (e.g., 3.83% on robust accuracy and 1.3% on standard accuracy, on the CIFAR-10 dataset), compared with the conventional end-to-end adversarial training baseline. Moreover, we find that different self-supervised pre-trained models have diverse adversarial vulnerabilities. This inspires us to ensemble several pretraining tasks, which boosts robustness further. Our ensemble strategy contributes a further improvement of 3.59% on robust accuracy, while maintaining a slightly higher standard accuracy on CIFAR-10. Our code is available at https://github.com/TAMU-VITA/Adv-SS-Pretraining.

preprint2020arXiv

ALMA Observations of Massive Clouds in the Central Molecular Zone: Jeans Fragmentation and Cluster Formation

We report ALMA Band 6 continuum observations of 2000 AU resolution toward four massive molecular clouds in the Central Molecular Zone of the Galaxy. To study gas fragmentation, we use the dendrogram method to identify cores as traced by the dust continuum emission. The four clouds exhibit different fragmentation states at the observed resolution despite having similar masses at the cloud scale ($\sim$1--5 pc). Assuming a constant dust temperature of 20 K, we construct core mass functions of the clouds and find a slightly top-heavy shape as compared to the canonical initial mass function, but we note several significant uncertainties that may affect this result. The characteristic spatial separation between the cores as identified by the minimum spanning tree method, $\sim$$10^4$ AU, and the characteristic core mass, 1--7 $M_\odot$, are consistent with predictions of thermal Jeans fragmentation. The three clouds showing fragmentation may be forming OB associations (stellar mass $\sim$$10^3$ $M_\odot$). None of the four clouds under investigation seem to be currently able to form massive star clusters like the Arches and the Quintuplet ($\sim$$10^4$ $M_\odot$), but they may form such clusters by further gas accretion onto the cores.
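For readers unfamiliar with the thermal Jeans fragmentation referred to in the abstract above, the standard textbook expressions are sketched below. The symbols ($c_s$ for the isothermal sound speed, $\rho$ for the gas density, $\mu$ for the mean molecular weight) and the convention of taking a sphere of diameter $\lambda_J$ are assumptions of this sketch; conventions differ by order-unity factors, and the abstract does not state which one the authors adopt.

```latex
% Isothermal sound speed at temperature T (the abstract assumes T = 20 K for the dust):
c_s = \sqrt{\frac{k_B T}{\mu m_{\rm H}}}

% Thermal Jeans length for gas of density \rho:
\lambda_J = c_s \sqrt{\frac{\pi}{G \rho}}

% Jeans mass, taken here as the mass inside a sphere of diameter \lambda_J:
M_J = \frac{4\pi}{3} \rho \left(\frac{\lambda_J}{2}\right)^{3}
    = \frac{\pi^{5/2}}{6} \, \frac{c_s^{3}}{G^{3/2} \rho^{1/2}}
```

Both quantities decrease with increasing density at fixed temperature, so comparing the observed core separation and core mass with $\lambda_J$ and $M_J$ requires a density estimate, which the abstract does not quote.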

preprint2020arXiv

BachGAN: High-Resolution Image Synthesis from Salient Object Layout

We propose a new task towards more practical application for image generation - high-quality image synthesis from salient object layout. This new setting allows users to provide the layout of salient objects only (i.e., foreground bounding boxes and categories), and lets the model complete the drawing with an invented background and a matching foreground. Two main challenges spring from this new task: (i) how to generate fine-grained details and realistic textures without segmentation map input; and (ii) how to create a background and weave it seamlessly into standalone objects. To tackle this, we propose Background Hallucination Generative Adversarial Network (BachGAN), which first selects a set of segmentation maps from a large candidate pool via a background retrieval module, then encodes these candidate layouts via a background fusion module to hallucinate a suitable background for the given objects. By generating the hallucinated background representation dynamically, our model can synthesize high-resolution images with both photo-realistic foreground and integral background. Experiments on Cityscapes and ADE20K datasets demonstrate the advantage of BachGAN over existing methods, measured on both visual fidelity of generated images and visual alignment between output images and input layouts.

preprint2020arXiv

Bayesian Cycle-Consistent Generative Adversarial Networks via Marginalizing Latent Sampling

Recent techniques built on Generative Adversarial Networks (GANs), such as Cycle-Consistent GANs, are able to learn mappings among different domains built from unpaired datasets, through min-max optimization games between generators and discriminators. However, it remains challenging to stabilize the training process, and thus cyclic models fall into mode collapse accompanied by the success of the discriminator. To address this problem, we propose a novel Bayesian cyclic model and an integrated cyclic framework for inter-domain mappings. The proposed method, motivated by Bayesian GAN, explores the full posteriors of the cyclic model via sampling latent variables and optimizes the model with maximum a posteriori (MAP) estimation. Hence, we name it Bayesian CycleGAN. In addition, the original CycleGAN cannot generate diversified results, but the Bayesian framework can diversify generated images by replacing restricted latent variables in the inference process. We evaluate the proposed Bayesian CycleGAN on multiple benchmark datasets, including Cityscapes, Maps, and Monet2photo. The proposed method improves the per-pixel accuracy by 15% for the Cityscapes semantic segmentation task within the original framework and by 20% within the proposed integrated framework, showing better resilience to imbalanced confrontation. The diversified results of Monet2photo style transfer also demonstrate its superiority over the original cyclic model. We provide code for all of our experiments at https://github.com/ranery/Bayesian-CycleGAN.

preprint2020arXiv

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research. Models such as ViLBERT, LXMERT and UNITER have significantly lifted the state of the art across a wide range of V+L benchmarks with joint image-text pre-training. However, little is known about the inner mechanisms that underlie their impressive success. To reveal the secrets behind the scene of these powerful models, we present VALUE (Vision-And-Language Understanding Evaluation), a set of meticulously designed probing tasks (e.g., Visual Coreference Resolution, Visual Relation Detection, Linguistic Probing Tasks) generalizable to standard pre-trained V+L models, aiming to decipher the inner workings of multimodal pre-training (e.g., the implicit knowledge garnered in individual attention heads, the inherent cross-modal alignment learned through contextualized multimodal embeddings). Through extensive analysis of each archetypal model architecture via these probing tasks, our key observations are: (i) Pre-trained models exhibit a propensity for attending over text rather than images during inference. (ii) There exists a subset of attention heads that are tailored for capturing cross-modal interactions. (iii) The learned attention matrix in pre-trained models demonstrates patterns coherent with the latent alignment between image regions and textual words. (iv) Plotted attention patterns reveal visually-interpretable relations among image regions. (v) Pure linguistic knowledge is also effectively encoded in the attention heads. These are valuable insights serving to guide future work towards designing better model architectures and objectives for multimodal pre-training.

preprint2020arXiv

Constrained Deep Reinforcement Learning for Energy Sustainable Multi-UAV based Random Access IoT Networks with NOMA

In this paper, we apply the Non-Orthogonal Multiple Access (NOMA) technique to improve the massive channel access of a wireless IoT network where solar-powered Unmanned Aerial Vehicles (UAVs) relay data from IoT devices to remote servers. Specifically, IoT devices contend for accessing the shared wireless channel using an adaptive $p$-persistent slotted Aloha protocol; and the solar-powered UAVs adopt Successive Interference Cancellation (SIC) to decode multiple received data from IoT devices to improve access efficiency. To enable an energy-sustainable capacity-optimal network, we study the joint problem of dynamic multi-UAV altitude control and multi-cell wireless channel access management of IoT devices as a stochastic control problem with multiple energy constraints. To learn an optimal control policy, we first formulate this problem as a Constrained Markov Decision Process (CMDP), and propose an online model-free Constrained Deep Reinforcement Learning (CDRL) algorithm based on Lagrangian primal-dual policy optimization to solve the CMDP. Extensive simulations demonstrate that our proposed algorithm learns a cooperative policy among UAVs in which the altitude of UAVs and channel access probability of IoT devices are dynamically and jointly controlled to attain the maximal long-term network capacity while maintaining energy sustainability of UAVs. The proposed algorithm outperforms Deep RL based solutions with reward shaping to account for energy costs, and achieves a temporal average system capacity which is $82.4\%$ higher than that of a feasible DRL based solution, and only $6.47\%$ lower compared to that of the energy-constraint-free system.
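To make the $p$-persistent slotted Aloha access scheme in the abstract above more concrete, the sketch below computes and simulates the per-slot success probability for a plain collision channel, where a slot succeeds only if exactly one device transmits. This is a baseline under stated assumptions: it ignores NOMA/SIC (which may allow more than one transmission per slot to be decoded), keeps `p` fixed rather than adaptive, and the function names and parameter values are hypothetical.

```python
import random

def success_probability(n_devices, p):
    """Probability that exactly one of n_devices transmits in a slot."""
    return n_devices * p * (1.0 - p) ** (n_devices - 1)

def simulate_slots(n_devices, p, n_slots, rng):
    """Fraction of slots with exactly one transmission (no NOMA/SIC)."""
    successes = 0
    for _ in range(n_slots):
        # Each device independently transmits with probability p.
        transmitters = sum(rng.random() < p for _ in range(n_devices))
        if transmitters == 1:
            successes += 1
    return successes / n_slots

# Hypothetical example: 10 devices, each transmitting with probability 0.1.
rng = random.Random(0)
analytic = success_probability(10, 0.1)
empirical = simulate_slots(10, 0.1, 20000, rng)
```

For this collision-channel baseline, the analytic expression is maximized at `p = 1 / n_devices`, which is one reason an access probability that adapts to the number of contending devices can be useful.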

preprint2020arXiv

Contextual Text Style Transfer

We introduce a new task, Contextual Text Style Transfer - translating a sentence into a desired style with its surrounding context taken into account. This brings two key challenges to existing style transfer approaches: ($i$) how to preserve the semantic meaning of target sentence and its consistency with surrounding context during transfer; ($ii$) how to train a robust model with limited labeled data accompanied with context. To realize high-quality style transfer with natural context preservation, we propose a Context-Aware Style Transfer (CAST) model, which uses two separate encoders for each input sentence and its surrounding context. A classifier is further trained to ensure contextual consistency of the generated sentence. To compensate for the lack of parallel data, additional self-reconstruction and back-translation losses are introduced to leverage non-parallel data in a semi-supervised fashion. Two new benchmarks, Enron-Context and Reddit-Context, are introduced for formality and offensiveness style transfer. Experimental results on these datasets demonstrate the effectiveness of the proposed CAST model over state-of-the-art methods across style accuracy, content preservation and contextual consistency metrics.

preprint2020arXiv

Discourse-Aware Neural Extractive Text Summarization

Recently BERT has been adopted for document encoding in state-of-the-art text summarization models. However, sentence-based extractive models often result in redundant or uninformative phrases in the extracted summaries. Also, long-range dependencies throughout a document are not well captured by BERT, which is pre-trained on sentence pairs instead of documents. To address these issues, we present a discourse-aware neural summarization model - DiscoBert. DiscoBert extracts sub-sentential discourse units (instead of sentences) as candidates for extractive selection on a finer granularity. To capture the long-range dependencies among discourse units, structural discourse graphs are constructed based on RST trees and coreference mentions, encoded with Graph Convolutional Networks. Experiments show that the proposed model outperforms state-of-the-art methods by a significant margin on popular summarization benchmarks compared to other BERT-base models.

preprint2020arXiv

Distilling Knowledge Learned in BERT for Text Generation

Large-scale pre-trained language models such as BERT have achieved great success in language understanding tasks. However, it remains an open question how to utilize BERT for language generation. In this paper, we present a novel approach, Conditional Masked Language Modeling (C-MLM), to enable the finetuning of BERT on target generation tasks. The finetuned BERT (teacher) is exploited as extra supervision to improve conventional Seq2Seq models (student) for better text generation performance. By leveraging BERT's idiosyncratic bidirectional nature, distilling knowledge learned in BERT can encourage auto-regressive Seq2Seq models to plan ahead, imposing global sequence-level supervision for coherent text generation. Experiments show that the proposed approach significantly outperforms strong Transformer baselines on multiple language generation tasks such as machine translation and text summarization. Our proposed model also achieves new state of the art on IWSLT German-English and English-Vietnamese MT datasets. Code is available at https://github.com/ChenRocks/Distill-BERT-Textgen.

preprint2020arXiv

Fate of false vacuum in singlet-doublet fermion extension model with RG improved effective action

We study the effective potential and the Renormalization Group (RG) improvement to the effective potential of the Higgs boson in a singlet-doublet fermion dark matter extension of the Standard Model (SM), and in general singlet-doublet fermion extension models with several copies of doublet fermions or singlet fermions. We study the stability of the electroweak vacuum with the RG improved effective potential in these models beyond the SM. We study the decay of the electroweak vacuum using the RG improved effective potential in these models beyond the SM. In this study we consider the quantum correction to the kinetic term in the effective action and consider the RG improvement of the kinetic term. Combining all these effects, we find that the decay rate of the false vacuum is slightly changed when calculated using the RG improved effective action in the singlet-doublet fermion dark matter model. In general singlet-doublet fermion extension models, we find that the presence of several copies of doublet fermions can make the electroweak vacuum stable if the new Yukawa couplings are not large. If the new Yukawa couplings are large, the electroweak vacuum can be turned into metastable or unstable again by the presence of extra fermions.

preprint2020arXiv

Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query. To tackle this task, designing an effective model to extract grounding information from both visual and textual modalities is crucial. However, most previous attempts in this field only focus on unidirectional interactions from video to query, which emphasizes which words to listen to and attends to sentence information via vanilla soft attention, but clues from query-by-video interactions implying where to look are not taken into consideration. In this paper, we propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction. Specifically, in the iterative attention module, each word in the query is first enhanced by attending to each frame in the video through fine-grained attention, then video iteratively attends to the integrated query. Finally, both video and query information is utilized to provide robust cross-modal representation for further moment localization. In addition, to better predict the target segment, we propose a content-oriented localization strategy instead of applying recent anchor-based localization. We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA. FIAN significantly outperforms the state-of-the-art approaches.

preprint2020arXiv

FreeLB: Enhanced Adversarial Training for Natural Language Understanding

Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models. In this work, we propose a novel adversarial training algorithm, FreeLB, that promotes higher invariance in the embedding space, by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples. To validate the effectiveness of the proposed approach, we apply it to Transformer-based models for natural language understanding and commonsense reasoning tasks. Experiments on the GLUE benchmark show that when applied only to the finetuning stage, it is able to improve the overall test scores of BERT-base model from 78.3 to 79.4, and RoBERTa-large model from 88.5 to 88.8. In addition, the proposed approach achieves state-of-the-art single-model test accuracies of 85.44% and 67.75% on ARC-Easy and ARC-Challenge. Experiments on CommonsenseQA benchmark further demonstrate that FreeLB can be generalized and boost the performance of RoBERTa-large model on other tasks as well. Code is available at https://github.com/zhuchen03/FreeLB.

preprint2020arXiv

Gas Kinematics of the Massive Protocluster G286.21+0.17 Revealed by ALMA

We study the gas kinematics and dynamics of the massive protocluster G286.21+0.17 with the Atacama Large Millimeter/submillimeter Array using spectral lines of $C^{18}O$(2-1), $N_2D^+$(3-2), $DCO^+$(3-2) and $DCN$(3-2). On the parsec clump scale, $C^{18}O$ emission appears highly filamentary around the systemic velocity. $N_2D^+$ and $DCO^+$ are more closely associated with the dust continuum. $DCN$ is strongly concentrated towards the protocluster center, where no or only weak detection is seen for $N_2D^+$ and $DCO^+$, possibly due to this region being at a relatively evolved evolutionary stage. Spectra of 76 continuum defined dense cores, typically a few 1000 AU in size, are analysed to measure their centroid velocities and internal velocity dispersions. There are no statistically significant velocity offsets of the cores among the different dense gas tracers. Furthermore, the majority (71%) of the dense cores have subthermal velocity offsets with respect to their surrounding, lower density $C^{18}O$ emitting gas. Within the uncertainties, the dense cores in G286 show internal kinematics that are consistent with being in virial equilibrium. On clump scales, the core to core velocity dispersion is also similar to that required for virial equilibrium in the protocluster potential. However, the distribution in velocity of the cores is largely composed of two spatially distinct groups, which indicates that the dense molecular gas has not yet relaxed to virial equilibrium, perhaps due to there being recent/continuous infall into the system.

preprint2020arXiv

Graph Optimal Transport for Cross-Domain Alignment

Cross-domain alignment between two sets of entities (e.g., objects in an image, words in a sentence) is fundamental to both computer vision and natural language processing. Existing methods mainly focus on designing advanced attention mechanisms to simulate soft alignment, with no training signals to explicitly encourage alignment. The learned attention matrices are also dense and lack interpretability. We propose Graph Optimal Transport (GOT), a principled framework that germinates from recent advances in Optimal Transport (OT). In GOT, cross-domain alignment is formulated as a graph matching problem, by representing entities into a dynamically-constructed graph. Two types of OT distances are considered: (i) Wasserstein distance (WD) for node (entity) matching; and (ii) Gromov-Wasserstein distance (GWD) for edge (structure) matching. Both WD and GWD can be incorporated into existing neural network models, effectively acting as a drop-in regularizer. The inferred transport plan also yields sparse and self-normalized alignment, enhancing the interpretability of the learned model. Experiments show consistent outperformance of GOT over baselines across a wide range of tasks, including image-text retrieval, visual question answering, image captioning, machine translation, and text summarization.

preprint2020arXiv

High-Dimensional Robust Mean Estimation via Gradient Descent

We study the problem of high-dimensional robust mean estimation in the presence of a constant fraction of adversarial outliers. A recent line of work has provided sophisticated polynomial-time algorithms for this problem with dimension-independent error guarantees for a range of natural distribution families. In this work, we show that a natural non-convex formulation of the problem can be solved directly by gradient descent. Our approach leverages a novel structural lemma, roughly showing that any approximate stationary point of our non-convex objective gives a near-optimal solution to the underlying robust estimation task. Our work establishes an intriguing connection between algorithmic high-dimensional robust statistics and non-convex optimization, which may have broader applications to other robust estimation tasks.

preprint2020arXiv

INSET: Sentence Infilling with INter-SEntential Transformer

Missing sentence generation (or sentence infilling) fosters a wide range of applications in natural language generation, such as document auto-completion and meeting note expansion. This task asks the model to generate intermediate missing sentences that can syntactically and semantically bridge the surrounding context. Solving the sentence infilling task requires techniques in natural language processing ranging from understanding to discourse-level planning to generation. In this paper, we propose a framework to decouple the challenge and address these three aspects respectively, leveraging the power of existing large-scale pre-trained models such as BERT and GPT-2. We empirically demonstrate the effectiveness of our model in learning a sentence representation for generation and further generating a missing sentence that fits the context.

preprint2020arXiv

Investigation of Numerical Dispersion with Time Step of The FDTD Methods: Avoiding Erroneous Conclusions

It is widely thought that small time steps lead to small numerical errors in finite-difference time-domain (FDTD) simulations. In this paper, we investigate how the time step affects the numerical dispersion of two FDTD methods: the FDTD(2,2) method and the FDTD(2,4) method. Through rigorous analytical and numerical analysis, it is found that small time steps in the FDTD methods do not always yield small numerical errors. Our findings reveal that these two FDTD methods behave differently with respect to the time step: (1) for the FDTD(2,2) method, smaller time steps limited by the Courant-Friedrichs-Lewy (CFL) condition increase numerical dispersion and lead to larger simulation errors; (2) for the FDTD(2,4) method, as the time step increases, numerical dispersion errors first decrease and then increase. Our findings are also comprehensively validated from one- to three-dimensional cases through several numerical examples including wave propagation, resonant frequencies of cavities and a practical electromagnetic compatibility (EMC) problem.

preprint2020arXiv

Optimizing Non-Orthogonal Multiple Access in Random Access Networks

Non-orthogonal multiple access (NOMA) has been considered a promising solution for improving the spectrum efficiency of next-generation wireless networks. In this paper, the performance of a p-persistent slotted ALOHA system in support of NOMA transmissions is investigated. Specifically, wireless users can choose to use high or low power for data transmissions with certain probabilities. To achieve the maximum network throughput, an analytical framework is developed to analyze the successful transmission probability of NOMA and the long-term average throughput of users involved in the non-orthogonal transmissions. The feasible region of the maximum number of concurrent users using high and low power to ensure successful NOMA transmissions is quantified. Based on the analysis, an algorithm is proposed to find the optimal transmission probabilities for users to choose high and low power to achieve the maximum system throughput. In addition, the impact of power settings on the network performance is further investigated. Simulations are conducted to validate the analysis.

preprint2020arXiv

Sequential Attention GAN for Interactive Image Editing

Most existing text-to-image synthesis tasks are static single-turn generation, based on pre-defined textual descriptions of images. To explore more practical and interactive real-life applications, we introduce a new task - Interactive Image Editing, where users can guide an agent to edit images via multi-turn textual commands on-the-fly. In each session, the agent takes a natural language description from the user as the input and modifies the image generated in the previous turn to a new design, following the user description. The main challenges in this sequential and interactive image generation task are two-fold: 1) contextual consistency between a generated image and the provided textual description; 2) step-by-step region-level modification to maintain visual consistency across the generated image sequence in each session. To address these challenges, we propose a novel Sequential Attention Generative Adversarial Network (SeqAttnGAN), which applies a neural state tracker to encode the previous image and the textual description in each turn of the sequence, and uses a GAN framework to generate a modified version of the image that is consistent with the preceding images and coherent with the description. To achieve better region-specific refinement, we also introduce a sequential attention mechanism into the model. To benchmark on the new task, we introduce two new datasets, Zap-Seq and DeepFashion-Seq, which contain multi-turn sessions with image-description sequences in the fashion domain. Experiments on both datasets show that the proposed SeqAttnGAN model outperforms state-of-the-art approaches on the interactive image editing task across all evaluation metrics including visual quality, image sequence coherence, and text-image consistency.

preprint2020arXiv

Stellar Variability in a Forming Massive Star Cluster

We present a near-infrared (NIR) variability analysis for a 6′ × 6′ region, which encompasses the massive protocluster G286.21+0.17. The total sample comprises more than 5000 objects, of which 562 show signs of a circumstellar disk based on their infrared colors. The data include HST observations taken in two epochs separated by 3 years in the F110W and F160W bands. 363 objects (7% of the sample) exhibit NIR variability at a significant level (Stetson index >1.7), and a higher variability fraction (14%) is found for the young stellar objects (YSOs) with disk excesses. We identified 4 high amplitude (>0.6 mag) variables seen in both NIR bands. Follow up and archival observations of the most variable object in this survey (G286.2032+0.1740) reveal a rising light curve over 8 years from 2011 to 2019, with a K band brightening of 3.5 mag. Overall, the temporal behavior of G286.2032+0.1740 resembles that of typical FU Ori objects; however, its pre-burst luminosity indicates it has a very low mass ($<0.12\:M_\odot$), making it an extreme case of an outburst event that is still ongoing.

preprint2020arXiv

Towards Better Understanding of Disentangled Representations via Mutual Information

Most existing works on disentangled representation learning are solely built upon a marginal independence assumption: all factors in disentangled representations should be statistically independent. This assumption is necessary but definitely not sufficient for the disentangled representations without additional inductive biases in the modeling process, which is shown theoretically in recent studies. We argue in this work that disentangled representations should be characterized by their relation with observable data. In particular, we formulate such a relation through the concept of mutual information: the mutual information between each factor of the disentangled representations and data should be invariant conditioned on values of the other factors. Together with the widely accepted independence assumption, we further bridge it with the conditional independence of factors in representations conditioned on data. Moreover, we note that conditional independence of latent variables has been imposed on most VAE-type models and InfoGAN due to the artificial choice of factorized approximate posterior $q(\mathbf{z}|\mathbf{x})$ in the encoders. Such an arrangement of encoders introduces a crucial inductive bias for disentangled representations. To demonstrate the importance of our proposed assumption and the related inductive bias, we show in experiments that violating the assumption leads to a decline in disentanglement among factors in the learned representations.

preprint2020arXiv

UNITER: UNiversal Image-TExt Representation Learning

Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodality inputs are simultaneously processed for joint visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design four pre-training tasks: Masked Language Modeling (MLM), Masked Region Modeling (MRM, with three variants), Image-Text Matching (ITM), and Word-Region Alignment (WRA). Different from previous work that applies joint random masking to both modalities, we use conditional masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of image/text). In addition to ITM for global image-text alignment, we also propose WRA via the use of Optimal Transport (OT) to explicitly encourage fine-grained alignment between words and image regions during pre-training. Comprehensive analysis shows that both conditional masking and OT-based WRA contribute to better pre-training. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks. Extensive experiments show that UNITER achieves new state of the art across six V+L tasks (over nine datasets), including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR$^2$. Code is available at https://github.com/ChenRocks/UNITER.

preprint2020arXiv

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

We introduce a new task, Video-and-Language Inference, for joint multimodal understanding of video and text. Given a video clip with aligned subtitles as premise, paired with a natural language hypothesis based on the video content, a model needs to infer whether the hypothesis is entailed or contradicted by the given video clip. A new large-scale dataset, named Violin (VIdeO-and-Language INference), is introduced for this task, which consists of 95,322 video-hypothesis pairs from 15,887 video clips, spanning over 582 hours of video. These video clips contain rich content with diverse temporal dynamics, event shifts, and people interactions, collected from two sources: (i) popular TV shows, and (ii) movie clips from YouTube channels. In order to address our new multimodal inference task, a model is required to possess sophisticated reasoning skills, from surface-level grounding (e.g., identifying objects and characters in the video) to in-depth commonsense reasoning (e.g., inferring causal relations of events in the video). We present a detailed analysis of the dataset and an extensive evaluation over many strong baselines, providing valuable insights on the challenges of this new task.

preprint2020arXiv

What Makes A Good Story? Designing Composite Rewards for Visual Storytelling

Previous storytelling approaches mostly focused on optimizing traditional metrics such as BLEU, ROUGE and CIDEr. In this paper, we re-examine this problem from a different angle, by looking deep into what defines a realistically-natural and topically-coherent story. To this end, we propose three assessment criteria: relevance, coherence and expressiveness, which we observe through empirical analysis could constitute a "high-quality" story to the human eye. Following this quality guideline, we propose a reinforcement learning framework, ReCo-RL, with reward functions designed to capture the essence of these quality criteria. Experiments on the Visual Storytelling Dataset (VIST) with both automatic and human evaluations demonstrate that our ReCo-RL model achieves better performance than state-of-the-art baselines on both traditional metrics and the proposed new criteria.

preprint2019arXiv

Discovery of a Photoionized Bipolar Outflow towards the Massive Protostar G45.47+0.05

Massive protostars generate strong radiation feedback, which may help set the mass they achieve by the end of the accretion process. Studying such feedback is therefore crucial for understanding the formation of massive stars. We report the discovery of a photoionized bipolar outflow towards the massive protostar G45.47+0.05 using high-resolution observations at 1.3 mm with the Atacama Large Millimeter/Submillimeter Array (ALMA) and at 7 mm with the Karl G. Jansky Very Large Array (VLA). By modeling the free-free continuum, the ionized outflow is found to be a photoevaporation flow with an electron temperature of 10,000 K and an electron number density of ~1.5x10^7 cm^-3 at the center, launched from a disk of radius of 110 au. H30alpha hydrogen recombination line emission shows strong maser amplification, with G45 being one of very few sources to show such millimeter recombination line masers. The mass of the driving source is estimated to be 30-50 Msun based on the derived ionizing photon rate, or 30-40 Msun based on the H30alpha kinematics. The kinematics of the photoevaporated material is dominated by rotation close to the disk plane, while accelerated to outflowing motion above the disk plane. The mass loss rate of the photoevaporation outflow is estimated to be ~(2-3.5)x10^-5 Msun/yr. We also found hints of a possible jet embedded inside the wide-angle ionized outflow with non-thermal emissions. The possible co-existence of a jet and a massive photoevaporation outflow suggests that, in spite of the strong photoionization feedback, accretion is still on-going.