Source author record

Yu Cheng

Yu Cheng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Catalog footprint

What is connected

84 works

34 topics

4 close collaborators

preprint, 2026, arXiv

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

While autoregressive Large Vision-Language Models (LVLMs) demonstrate remarkable proficiency in multimodal tasks, they face a "Visual Signal Dilution" phenomenon, where the accumulation of textual history expands the attention partition function, causing visual attention to decay inversely with generated sequence length. To counteract this, we propose Persistent Visual Memory (PVM), a lightweight learnable module designed to strengthen sustained, on-demand access to visual evidence. Integrated as a parallel branch alongside the Feed-Forward Network (FFN) in LVLMs, PVM establishes a distance-agnostic retrieval pathway that directly provides visual embeddings for enhanced visual perception, thereby structurally mitigating the signal suppression inherent to deep generation. Extensive experiments on Qwen3-VL models demonstrate that PVM brings notable improvements with negligible parameter overhead, delivering consistent average accuracy gains across both 4B and 8B scales, particularly in complex reasoning tasks that demand persistent visual perception. Furthermore, in-depth analysis reveals that PVM shows improved robustness in longer generations and accelerates internal prediction convergence.
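The dilution effect described in this abstract can be illustrated with a toy calculation. The sketch below is not the paper's analysis: it assumes every visual token and every text token receives the same attention logit, so the softmax share going to visual tokens depends only on the token counts. The function name `visual_attention_mass` and all parameter values are hypothetical.

```python
import math


def visual_attention_mass(num_visual, num_text, visual_logit=0.0, text_logit=0.0):
    """Fraction of softmax attention that lands on visual tokens.

    Toy assumption: all visual tokens share one logit and all text
    tokens share another, so the partition function is a two-term sum.
    """
    visual = num_visual * math.exp(visual_logit)
    text = num_text * math.exp(text_logit)
    return visual / (visual + text)


# The visual token count stays fixed while the generated text history grows.
for generated in (0, 256, 1024, 4096):
    mass = visual_attention_mass(num_visual=256, num_text=generated)
    print(f"{generated:>5} text tokens -> visual attention mass {mass:.3f}")
```

Under these assumptions the visual share shrinks roughly as 1 / (sequence length) once text tokens dominate, which is the qualitative behavior the abstract calls Visual Signal Dilution.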

preprint, 2025, arXiv

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

Recent Large Reasoning Models (LRMs), such as DeepSeek-R1 and OpenAI o1, have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. However, a growing concern lies in their tendency to produce excessively long reasoning traces, which are often filled with redundant content (e.g., repeated definitions), over-analysis of simple problems, and superficial exploration of multiple reasoning paths for harder tasks. This inefficiency introduces significant challenges for training, inference, and real-world deployment (e.g., in agent-based systems), where token economy is critical. In this survey, we provide a comprehensive overview of recent efforts aimed at improving reasoning efficiency in LRMs, with a particular focus on the unique challenges that arise in this new paradigm. We identify common patterns of inefficiency, examine methods proposed across the LRM lifecycle, i.e., from pretraining to inference, and discuss promising future directions for research. To support ongoing development, we also maintain a real-time GitHub repository tracking recent progress in the field. We hope this survey serves as a foundation for further exploration and inspires innovation in this rapidly evolving area.

preprint, 2025, arXiv

DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

While recent Multimodal Large Language Models (MLLMs) have made significant strides in multimodal reasoning, their reasoning processes remain predominantly text-centric, leading to suboptimal performance in complex long-horizon, vision-centric tasks. In this paper, we establish a novel Generative Multimodal Reasoning paradigm and introduce DiffThinker, a diffusion-based reasoning framework. Conceptually, DiffThinker reformulates multimodal reasoning as a native generative image-to-image task, achieving superior logical consistency and spatial precision in vision-centric tasks. We perform a systematic comparison between DiffThinker and MLLMs, providing the first in-depth investigation into the intrinsic characteristics of this paradigm, revealing four core properties: efficiency, controllability, native parallelism, and collaboration. Extensive experiments across four domains (sequential planning, combinatorial optimization, constraint satisfaction, and spatial configuration) demonstrate that DiffThinker significantly outperforms leading closed-source models including GPT-5 (+314.2%) and Gemini-3-Flash (+111.6%), as well as the fine-tuned Qwen3-VL-32B baseline (+39.0%), highlighting generative multimodal reasoning as a promising approach for vision-centric reasoning.

preprint, 2025, arXiv

MiMo-Audio: Audio Language Models are Few-Shot Learners

Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and the full evaluation suite are available at https://github.com/XiaomiMiMo/MiMo-Audio.

preprint, 2025, arXiv

Sommerfeld Enhancement from Background Force and the Galactic Center GeV Excess

We study the impact of background-induced forces on dark matter (DM) annihilation and their implications for indirect detection. In the presence of a finite number density of background particles, loop-level interactions can generate an effective force that is significantly enhanced relative to the vacuum case. We construct a two-component DM model in which the dominant component is a fermionic particle $χ$ and the subdominant component is an ultralight pseudoscalar particle $ϕ$. The annihilation of $χ$ proceeds through the p-wave channel and produces gamma-ray emission. The finite density of $ϕ$ particles induces a background-enhanced force between $χ$ particles, leading to a sizable Sommerfeld enhancement of the annihilation. We show that a viable region of parameter space in this model can account for the gamma-ray excess observed in the Galactic Center using Fermi-LAT data. The background-induced force substantially amplifies the Sommerfeld enhancement and thus enlarges the parameter space capable of explaining the excess, highlighting the importance of background effects in astrophysical environments.

preprint, 2023, arXiv

Hypotheses Tree Building for One-Shot Temporal Sentence Localization

Given an untrimmed video, temporal sentence localization (TSL) aims to localize a specific segment according to a given sentence query. Though respectable works have made decent achievements in this task, they severely rely on dense video frame annotations, which require a tremendous amount of human effort to collect. In this paper, we target another more practical and challenging setting: one-shot temporal sentence localization (one-shot TSL), which learns to retrieve the query information among the entire video with only one annotated frame. Particularly, we propose an effective and novel tree-structure baseline for one-shot TSL, called Multiple Hypotheses Segment Tree (MHST), to capture the query-aware discriminative frame-wise information under the insufficient annotations. Each video frame is taken as the leaf-node, and the adjacent frames sharing the same visual-linguistic semantics will be merged into the upper non-leaf node for tree building. At last, each root node is an individual segment hypothesis containing the consecutive frames of its leaf-nodes. During the tree construction, we also introduce a pruning strategy to eliminate the interference of query-irrelevant nodes. With our designed self-supervised loss functions, our MHST is able to generate high-quality segment hypotheses for ranking and selection with the query. Experiments on two challenging datasets demonstrate that MHST achieves competitive performance compared to existing methods.

preprint, 2023, arXiv

Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding

Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query. All existing works first utilize a sparse sampling strategy to extract a fixed number of video frames and then conduct multi-modal interactions with the query sentence for reasoning. However, we argue that these methods have overlooked two indispensable issues: 1) Boundary-bias: The annotated target segment generally refers to two specific frames as the corresponding start and end timestamps. The video downsampling process may lose these two frames and take the adjacent irrelevant frames as new boundaries. 2) Reasoning-bias: Such incorrect new boundary frames also lead to reasoning bias during frame-query interaction, reducing the generalization ability of the model. To alleviate the above limitations, in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames to enrich and refine the new boundaries. Specifically, a reasoning strategy is developed to learn the inter-relationship among these frames and generate soft labels on boundaries for more accurate frame-query reasoning. Such a mechanism is also able to supplement the absent consecutive visual semantics to the sampled sparse frames for fine-grained activity understanding. Extensive experiments demonstrate the effectiveness of SSRN on three challenging datasets.
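The boundary-bias issue named in this abstract can be reproduced with a few lines of arithmetic. The sketch below only illustrates how sparse uniform sampling can drop the annotated start and end frames. It is not the SSRN method, and the helper names, frame counts, and annotation values are all hypothetical.

```python
def uniform_sample(num_frames, num_samples):
    """Indices of num_samples frames spread evenly over the video."""
    step = num_frames / num_samples
    # Take the frame at the centre of each equal-width chunk.
    return [int(step * i + step / 2) for i in range(num_samples)]


def snapped_boundaries(start, end, sampled):
    """Sampled frames closest to the annotated start and end frames."""
    def nearest(target):
        return min(sampled, key=lambda frame: abs(frame - target))
    return nearest(start), nearest(end)


# Hypothetical video of 3000 frames, downsampled to 64 frames.
sampled = uniform_sample(num_frames=3000, num_samples=64)
annotated_start, annotated_end = 1010, 1490
new_start, new_end = snapped_boundaries(annotated_start, annotated_end, sampled)
print("annotated:", (annotated_start, annotated_end))
print("after sampling:", (new_start, new_end))
```

In this example neither annotated boundary survives sampling, so the nearest retained frames become the effective boundaries, which is the shift the abstract describes.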

preprint, 2022, arXiv

A Deep Reinforcement Learning based Approach for NOMA-based Random Access Network with Truncated Channel Inversion Power Control

As a main use case of 5G and beyond wireless networks, the ever-increasing number of machine-type communication (MTC) devices has posed critical challenges to MTC networks in recent years. It is imperative to support massive numbers of MTC devices with limited resources. To this end, non-orthogonal multiple access (NOMA) based random access has been regarded as a promising candidate for MTC networks. In this paper, we propose a deep reinforcement learning (RL) based approach for NOMA-based random access networks with truncated channel inversion power control. Specifically, each MTC device randomly selects a pre-defined power level with a certain probability for data transmission. Devices use channel inversion power control, subject to an upper bound on the transmission power. Due to the stochastic nature of channel fading and the limited transmission power, devices with different achievable power levels are categorized as different types of devices. To achieve high throughput while considering fairness among all devices, two objective functions are formulated. One maximizes the minimum long-term expected throughput of all MTC devices; the other maximizes the geometric mean of the long-term expected throughput of all MTC devices. A policy-based deep reinforcement learning approach is then applied to tune the transmission probabilities of each device to solve the formulated optimization problems. Extensive simulations are conducted to show the merits of our proposed approach.
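The power control rule and the two fairness objectives mentioned in this abstract can be written down compactly. The sketch below is a minimal illustration under stated assumptions, not the paper's system model: "truncated" is taken to mean that the inverted power is capped at the maximum transmit power, and the function names and numeric values are hypothetical.

```python
import math


def truncated_inversion_power(target_rx_power, channel_gain, max_tx_power):
    """Transmit power that inverts the channel, capped at max_tx_power.

    Assumption: truncation is modelled as a simple cap on transmit power.
    """
    return min(target_rx_power / channel_gain, max_tx_power)


def max_min_objective(throughputs):
    """Fairness objective 1: the worst device's long-term throughput."""
    return min(throughputs)


def geometric_mean_objective(throughputs):
    """Fairness objective 2: geometric mean of per-device throughputs."""
    log_sum = sum(math.log(t) for t in throughputs)
    return math.exp(log_sum / len(throughputs))


# Hypothetical per-device long-term throughputs.
throughputs = [0.20, 0.35, 0.50]
print("max-min objective:", max_min_objective(throughputs))
print("geometric mean objective:", geometric_mean_objective(throughputs))
print("strong channel power:", truncated_inversion_power(1.0, 0.5, 10.0))
print("weak channel power:", truncated_inversion_power(1.0, 0.01, 10.0))
```

A device with a weak channel hits the power cap and cannot reach the target received power, which is why devices end up with different achievable power levels.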

preprint, 2022, arXiv

A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models

Large pre-trained vision-language (VL) models can learn a new task with a handful of examples and generalize to a new task without fine-tuning. However, these VL models are hard to deploy for real-world applications due to their impractically huge sizes and slow inference speed. To solve this limitation, we study prompt-based low-resource learning of VL tasks with our proposed method, FewVLM, which is relatively smaller than recent few-shot learners. For FewVLM, we pre-train a sequence-to-sequence transformer model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM). Furthermore, we analyze the effect of diverse prompts for few-shot tasks. Experimental results on VQA show that FewVLM with prompt-based learning outperforms Frozen, which is 31x larger than FewVLM, by 18.2 percentage points and achieves comparable results to a 246x larger model, PICa. In our analysis, we observe that (1) prompts significantly affect zero-shot performance but marginally affect few-shot performance, (2) models with noisy prompts learn as quickly as hand-crafted prompts given larger training data, and (3) MaskedLM helps VQA tasks while PrefixLM boosts captioning performance. Our code is publicly available at https://github.com/woojeongjin/FewVLM.

preprint, 2022, arXiv

A SBP-SAT FDTD Subgridding Method Using Staggered Yee's Grids Without Modifying Field Components

A summation-by-parts simultaneous approximation term (SBP-SAT) finite-difference time-domain (FDTD) subgridding method is proposed to model geometrically fine structures in this paper. Compared with our previous work, the proposed SBP-SAT FDTD method uses the staggered Yee's grid without adding or modifying any field components through field extrapolation on the boundaries to make the discrete operators satisfy the SBP property. The accuracy of extrapolation keeps consistency with that of the second-order finite-difference scheme near the boundaries. In addition, the SATs are used to weakly enforce the tangential boundary conditions between multiple mesh blocks with different mesh sizes. With carefully designed interpolation matrices and selected free parameters of the SATs, no dissipation occurs in the whole computational domain. Therefore, its long-time stability is theoretically guaranteed. Three numerical examples are carried out to validate its effectiveness. Results show that the proposed SBP-SAT FDTD subgridding method is stable, accurate, efficient, and easy to implement based on existing FDTD codes with only a few modifications.

preprint, 2022, arXiv

Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Large-scale pre-trained language models have achieved tremendous success across a wide range of natural language understanding (NLU) tasks, even surpassing human performance. However, recent studies reveal that the robustness of these models can be challenged by carefully crafted textual adversarial examples. While several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing. In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. Our findings are summarized as follows. (i) Most existing adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples, with around 90% of them either changing the original semantic meanings or misleading human annotators as well. Therefore, we perform a careful filtering process to curate a high-quality benchmark. (ii) All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy. We hope our work will motivate the development of new adversarial attacks that are more stealthy and semantic-preserving, as well as new robust language models against sophisticated adversarial attacks. AdvGLUE is available at https://adversarialglue.github.io.

preprint, 2022, arXiv

Backdoor Attacks on Crowd Counting

Crowd counting is a regression task that estimates the number of people in a scene image, which plays a vital role in a range of safety-critical applications, such as video surveillance, traffic monitoring and flow control. In this paper, we investigate the vulnerability of deep learning based crowd counting models to backdoor attacks, a major security threat to deep learning. A backdoor attack implants a backdoor trigger into a target model via data poisoning so as to control the model's predictions at test time. Different from image classification models on which most of existing backdoor attacks have been developed and tested, crowd counting models are regression models that output multi-dimensional density maps, thus requiring different techniques to manipulate. In this paper, we propose two novel Density Manipulation Backdoor Attacks (DMBA$^{-}$ and DMBA$^{+}$) to attack the model to produce arbitrarily large or small density estimations. Experimental results demonstrate the effectiveness of our DMBA attacks on five classic crowd counting models and four types of datasets. We also provide an in-depth analysis of the unique challenges of backdooring crowd counting models and reveal two key elements of effective attacks: 1) full and dense triggers and 2) manipulation of the ground truth counts or density maps. Our work could help evaluate the vulnerability of crowd counting models to potential backdoor attacks.

preprint, 2022, arXiv

CP violating dark photon kinetic mixing and type-III seesaw model

The hypothetical dark photon portal connecting the visible and dark sectors of the Universe has received considerable attention in recent years, with a focus on CP-conserving kinetic mixing between the Standard Model (SM) hypercharge gauge boson and a new U(1)$_X$ gauge boson. In the effective field theory context, one may write down non-renormalizable CP-violating kinetic mixing interactions involving the $X$ and SU(2)$_L$ gauge bosons. We construct for the first time a renormalizable model for CP-violating kinetic mixing that induces CP-violating non-Abelian kinetic mixing at mass dimension five. The model grows out of the type-III seesaw model, with the lepton triplets containing right-handed neutrinos playing a crucial role in making the model renormalizable and providing a bridge to the origin of neutrino mass. This scenario also accommodates electron electric dipole moments (EDM) as large as current experimental bound, making future EDM searches an important probe of this scenario.

preprint, 2022, arXiv

Disks and Outflows in the Intermediate-mass Star Forming Region NGC 2071 IR

We present ALMA band 6/7 (1.3 mm/0.87 mm) and VLA Ka band (9 mm) observations toward NGC 2071 IR, an intermediate-mass star forming region. We characterize the continuum and associated molecular line emission towards the most luminous protostars, i.e., IRS1 and IRS3, on ~100 au (0.2") scales. IRS1 is partly resolved in millimeter and centimeter continuum, which shows a potential disk. IRS3 has a well resolved disk appearance in millimeter continuum and is further resolved into a close binary system separated by ~40 au at 9 mm. Both sources exhibit clear velocity gradients across their disk major axes in multiple spectral lines including C18O, H2CO, SO, SO2, and complex organic molecules like CH3OH, 13CH3OH and CH3OCHO. We use an analytic method to fit the Keplerian rotation of the disks, and give constraints on physical parameters with an MCMC routine. The IRS3 binary system is estimated to have a total mass of 1.4-1.5 $M_\odot$. IRS1 has a central mass of 3-5 $M_\odot$ based on both kinematic modeling and its spectral energy distribution, assuming that it is dominated by a single protostar. For both IRS1 and IRS3, the inferred ejection directions from different tracers, including radio jet, water maser, molecular outflow, and H2 emission, are not always consistent, and for IRS1, these can be misaligned by ~50$^{\circ}$. IRS3 is better explained by a single precessing jet. A similar mechanism may be present in IRS1 as well but an unresolved multiple system in IRS1 is also possible.

preprint, 2022, arXiv

Dual networks based 3D Multi-Person Pose Estimation from Monocular Video

Monocular 3D human pose estimation has made progress in recent years. Most of the methods focus on single persons, which estimate the poses in the person-centric coordinates, i.e., the coordinates based on the center of the target person. Hence, these methods are inapplicable for multi-person 3D pose estimation, where the absolute coordinates (e.g., the camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single pose estimation, due to inter-person occlusion and close human interactions. Existing top-down multi-person methods rely on human detection (i.e., top-down approach), and thus suffer from the detection errors and cannot produce reliable pose estimation in multi-person scenes. Meanwhile, existing bottom-up methods that do not use human detection are not affected by detection errors, but since they process all persons in a scene at once, they are prone to errors, particularly for persons in small scales. To address all these challenges, we propose the integration of top-down and bottom-up approaches to exploit their strengths. Our top-down network estimates human joints from all persons instead of one in an image patch, making it robust to possible erroneous bounding boxes. Our bottom-up network incorporates human-detection based normalized heatmaps, allowing the network to be more robust in handling scale variations. Finally, the estimated 3D poses from the top-down and bottom-up networks are fed into our integration network for final 3D poses. To address the common gaps between training and testing data, we do optimization during the test time, by refining the estimated 3D human poses using high-order temporal constraint, re-projection loss, and bone length regularizations. Our evaluations demonstrate the effectiveness of the proposed method. Code and models are available: https://github.com/3dpose/3D-Multi-Person-Pose.

preprint, 2022, arXiv

Efficient Algorithms for Planning with Participation Constraints

We consider the problem of planning with participation constraints introduced in [Zhang et al., 2022]. In this problem, a principal chooses actions in a Markov decision process, resulting in separate utilities for the principal and the agent. However, the agent can and will choose to end the process whenever his expected onward utility becomes negative. The principal seeks to compute and commit to a policy that maximizes her expected utility, under the constraint that the agent should always want to continue participating. We provide the first polynomial-time exact algorithm for this problem for finite-horizon settings, where previously only an additive $\varepsilon$-approximation algorithm was known. Our approach can also be extended to the (discounted) infinite-horizon case, for which we give an algorithm that runs in time polynomial in the size of the input and $\log(1/\varepsilon)$, and returns a policy that is optimal up to an additive error of $\varepsilon$.

preprint, 2022, arXiv

Flavor Specific $U(1)_{B_q-L_μ}$ Gauge Model for Muon $g-2$ and $b \to s \bar μμ$ Anomalies

The muon $(g-2)_μ$ and $b\to s \bar μμ$ induced $B$ anomalies as hints of new physics beyond the standard model (SM) have attracted much attention. These two anomalies indicate that there may exist new interaction specifically related to muon. A lot of theoretical ideas have been proposed to explain these anomalies. Gauged flavor specific $U(1)_{B_q-L_μ}$ is among the promising ones. The new gauge boson $Z'$ from $U(1)_{B_q-L_μ}$ interacts with muon and provides necessary ingredient to solve the $(g-2)_μ$ anomaly. The $Z'$-quark coupling can generate flavor changing interactions after diagonalization of quark mass matrix between weak eigen-state and mass eigen-state basis. We revisit challenges for such models attempting to explain the $(g-2)_μ$ and $B$ anomalies separately or simultaneously. We find although for $U(1)_{B_q-L_μ}$ models there is still parameter space to provide solutions for separately explaining the $(g-2)_μ$ and $B$ anomalies, there exists no parameter space for such models to solve both the anomalies simultaneously, after taking into account existing constraints from $τ\to μγ$, $τ\to 3 μ$, neutrino trident and $B_s - \bar B_s$ data. Among them leptonic processes restrict $Z^\prime$ mass to be less than a few hundred MeV if required to solve the $(g-2)_μ$ anomaly, which causes conflict between data from $B_s - \bar B_s$, $D^0 - \bar D^0$ mixing and also hadron decays with $Z^\prime$ in the final states. The effects of $U(1)_Y$ and $U(1)_{B_q-L_μ}$ kinetic mixing on these anomalies are also studied. We find that neither can these effects do much to bring the two anomalies together to be solved simultaneously.

preprint, 2022, arXiv

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

Large-scale multi-modal contrastive pre-training has demonstrated great utility to learn transferable features for a range of downstream tasks by mapping multiple modalities into a shared embedding space. Typically, this has employed separate encoders for each modality. However, recent work suggests that transformers can support learning across multiple modalities and allow knowledge sharing. Inspired by this, we investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks. More specifically, we question how many parameters of a transformer model can be shared across modalities during contrastive pre-training, and rigorously examine architectural design choices that position the proportion of parameters shared along a spectrum. In studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters. Additionally, we find that light-weight modality-specific parallel modules further improve performance. Experimental results show that the proposed MS-CLIP approach outperforms vanilla CLIP by up to 13% relative in zero-shot ImageNet classification (pre-trained on YFCC-100M), while simultaneously supporting a reduction of parameters. In addition, our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks. Furthermore, we discover that sharing parameters leads to semantic concepts from different modalities being encoded more closely in the embedding space, facilitating the transferring of common semantic structure (e.g., attention patterns) from language to vision. Code is available at https://github.com/Hxyou/MSCLIP.

preprint, 2022, arXiv

Memory-Guided Semantic Learning Network for Temporal Sentence Grounding

Temporal sentence grounding (TSG) is crucial and fundamental for video understanding. Although the existing methods train well-designed deep networks with a large amount of data, we find that they can easily forget the rarely appearing cases in the training stage due to the imbalanced data distribution, which influences the model generalization and leads to undesirable performance. To tackle this issue, we propose a memory-augmented network, called Memory-Guided Semantic Learning Network (MGSL-Net), that learns and memorizes the rarely appearing content in TSG tasks. Specifically, MGSL-Net consists of three main parts: a cross-modal interaction module, a memory augmentation module, and a heterogeneous attention module. We first align the given video-query pair by a cross-modal graph convolutional network, and then utilize a memory module to record the cross-modal shared semantic features in the domain-specific persistent memory. During training, the memory slots are dynamically associated with both common and rare cases, alleviating the forgetting issue. In testing, the rare cases can thus be enhanced by retrieving the stored memories, resulting in better generalization. Finally, the heterogeneous attention module is utilized to integrate the enhanced multi-modal features in both video and query domains. Experimental results on three benchmarks show the superiority of our method in both effectiveness and efficiency, which substantially improves the accuracy not only on the entire dataset but also on rare cases.

preprint, 2022, arXiv

The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy

Vision transformers (ViTs) have gained increasing popularity as they are commonly believed to have higher modeling capacity and representation flexibility than traditional convolutional networks. However, it is questionable whether such potential has been fully unleashed in practice, as the learned ViTs often suffer from over-smoothening, yielding likely redundant models. Recent works made preliminary attempts to identify and alleviate such redundancy, e.g., via regularizing embedding similarity or re-injecting convolution-like structures. However, a "head-to-toe assessment" regarding the extent of redundancy in ViTs, and how much we could gain by thoroughly mitigating it, has been absent for this field. This paper, for the first time, systematically studies the ubiquitous existence of redundancy at all three levels: patch embedding, attention map, and weight space. In view of them, we advocate a principle of diversity for training ViTs, by presenting corresponding regularizers that encourage the representation diversity and coverage at each of those levels, enabling the capture of more discriminative information. Extensive experiments on ImageNet with a number of ViT backbones validate the effectiveness of our proposals, largely eliminating the observed ViT redundancy and significantly boosting the model generalization. For example, our diversified DeiT obtains 0.70%~1.76% accuracy boosts on ImageNet with highly reduced similarity. Our code is fully available at https://github.com/VITA-Group/Diverse-ViT.

preprint, 2022, arXiv

Towards the Development of A Three-Dimensional SBP-SAT FDTD Method: Theory and Validation

To enhance the scalability and performance of the traditional finite-difference time-domain (FDTD) methods, a three-dimensional summation-by-parts simultaneous approximation term (SBP-SAT) FDTD method is developed to solve complex electromagnetic problems. It is theoretically stable and can be further used for multiple mesh blocks with different mesh sizes. This paper mainly focuses on the fundamental theoretical aspects upon its three-dimensional implementation, the SAT for various boundary conditions, and the numerical dispersion properties and the comparison with the FDTD method. The proposed SBP-SAT FDTD method inherits all the merits of the FDTD method, which is matrix-free, easy to implement, and has the same level of accuracy with a negligible overhead of runtime (0.13%) and memory usage (1.2%). Four numerical examples are carried out to validate the effectiveness of the proposed method.

preprint, 2022, arXiv

Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Temporal video grounding (TVG) aims to localize a target segment in a video according to a given sentence query. Although prior works have made decent progress on this task, they rely heavily on abundant video-query paired data, which is expensive and time-consuming to collect in real-world scenarios. In this paper, we explore whether a video grounding model can be learned without any paired annotations. To the best of our knowledge, this paper is the first work trying to address TVG in an unsupervised setting. Considering there is no paired supervision, we propose a novel Deep Semantic Clustering Network (DSCNet) to leverage all semantic information from the whole query set to compose the possible activity in each video for grounding. Specifically, we first develop a language semantic mining module, which extracts implicit semantic features from the whole query set. Then, these language semantic features serve as the guidance to compose the activity in video via a video-based semantic aggregation module. Finally, we utilize a foreground attention branch to filter out the redundant background activities and refine the grounding results. To validate the effectiveness of our DSCNet, we conduct experiments on both ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance, and even outperforms most weakly-supervised approaches.

preprint2022arXiv

Widening the $U(1)_{L_μ-L_τ}$ $Z^\prime$ mass range for resolving the muon $g-2$ anomaly

Exchanging a $Z^\prime$ gauge boson is a favored mechanism to solve the muon $(g-2)_μ$ anomaly. Among such models the $Z^\prime$ from $U(1)_{L_μ- L_τ}$ gauge group has been extensively studied. In this model the same interaction addressing $(g-2)_μ$ leads to an enhanced muon neutrino trident (MNT) process $ν_μN \to ν_μμ\bar μN$, constraining the $Z^\prime$ mass to be less than a few hundred MeV. Many other $Z^\prime$ models face the same problem. It has long been realized that the coupling of $Z^\prime$ in the model can admit $(\bar μγ^μτ+ \bar ν_μγ^μL ν_τ)Z^\prime_μ$ interaction which does not contribute to the MNT process. It can solve the $(g-2)_μ$ anomaly for a much wider $Z^\prime$ mass range. However, this new interaction induces $τ\to μ\barν_μν_τ$, which rules it out as a solution to the $(g-2)_μ$ anomaly. Here we propose a mechanism by introducing type-II seesaw $SU(2)_L$ triplet scalars to evade constraints from all known data to allow a wide $Z^\prime$ mass range to solve the $(g-2)_μ$ anomaly. This mechanism opens a new window for $Z^\prime$ physics.

preprint2022arXiv

ZOOMER: Boosting Retrieval on Web-scale Graphs by Regions of Interest

We introduce ZOOMER, a system deployed at Taobao, the largest e-commerce platform in China, for training and serving GNN-based recommendations over web-scale graphs. ZOOMER is designed for tackling two challenges presented by the massive user data at Taobao: low training/serving efficiency due to the huge scale of the graphs, and low recommendation quality due to the information overload which distracts the recommendation model from specific user intentions. ZOOMER achieves this by introducing a key concept, Region of Interest (ROI) in GNNs for recommendations, i.e., a neighborhood region in the graph with significant relevance to a strong user intention. ZOOMER narrows the focus from the whole graph and "zooms in" on the more relevant ROIs, thereby reducing the training/serving cost and mitigating the information overload at the same time. With carefully designed mechanisms, ZOOMER identifies the interest expressed by each recommendation request, constructs an ROI subgraph by sampling with respect to the interest, and guides the GNN to reweigh different parts of the ROI towards the interest by a multi-level attention module. Deployed as a large-scale distributed system, ZOOMER supports graphs with billions of nodes for training and thousands of requests per second for serving. ZOOMER achieves up to 14x speedup when downsizing sampling scales, with AUC performance comparable to (or even better than) baseline methods. Besides, both the offline evaluation and online A/B test demonstrate the effectiveness of ZOOMER.

preprint2021arXiv

Collider search of light dark matter model with dark sector decay

We explore the possibility that the dark matter relic density is not produced directly by a thermal mechanism, but by the decay of other, heavier dark sector particles, which in turn can be produced by the thermal freeze-out mechanism. Using a concrete model with a light dark matter from dark sector decay, we study the collider signature of the dark sector particles in association with Higgs production processes. We find that future lepton colliders can be a better place to probe the signature of this kind of light dark matter model than hadron colliders such as the LHC. Meanwhile, it is found that a Higgs factory with center of mass energy 250 GeV has a better potential to resolve the signature of this kind of light dark matter model than a Higgs factory with center of mass energy 350 GeV.

preprint2021arXiv

Deep Co-Attention Network for Multi-View Subspace Learning

Many real-world applications involve data from multiple modalities and thus exhibit view heterogeneity. For example, user modeling on social media might leverage both the topology of the underlying social network and the content of the users' posts; in the medical domain, multiple views could be X-ray images taken at different poses. To date, various techniques have been proposed to achieve promising results, such as canonical correlation analysis based methods. Meanwhile, it is critical for decision-makers to be able to understand the prediction results from these methods. For example, given the diagnostic result that a model provided based on the X-ray images of a patient at different poses, the doctor needs to know why the model made such a prediction. However, state-of-the-art techniques usually suffer from the inability to utilize the complementary information of each view and to explain the predictions in an interpretable manner. To address these issues, in this paper, we propose a deep co-attention network for multi-view subspace learning, which aims to extract both the common information and the complementary information in an adversarial setting and provide robust interpretations behind the prediction to the end-users via the co-attention mechanism. In particular, it uses a novel cross reconstruction loss and leverages the label information to guide the construction of the latent representation by incorporating the classifier into our model. This improves the quality of latent representation and accelerates the convergence speed. Finally, we develop an efficient iterative algorithm to find the optimal encoders and discriminator, which are evaluated extensively on synthetic and real-world data sets. We also conduct a case study to demonstrate how the proposed method robustly interprets the predictions on an image data set.

preprint2021arXiv

Deuterium Chemodynamics of Massive Pre-Stellar Cores

High levels of deuterium fractionation of $\rm N_2H^+$ (i.e., $\rm D_{frac}^{N_2H^+} \gtrsim 0.1$) are often observed in pre-stellar cores (PSCs) and detection of $\rm N_2D^+$ is a promising method to identify elusive massive PSCs. However, the physical and chemical conditions required to reach such high levels of deuteration are still uncertain, as is the diagnostic utility of $\rm N_2H^+$ and $\rm N_2D^+$ observations of PSCs. We perform 3D magnetohydrodynamics simulations of a massive, turbulent, magnetised PSC, coupled with a sophisticated deuteration astrochemical network. Although the core has some magnetic/turbulent support, it collapses under gravity in about one freefall time, which marks the end of the simulations. Our fiducial model achieves relatively low $\rm D_{frac}^{N_2H^+} \sim 0.002$ during this time. We then investigate effects of initial ortho-para ratio of $\rm H_2$ ($\rm OPR^{H_2}$), temperature, cosmic ray (CR) ionization rate, CO and N-species depletion factors and prior PSC chemical evolution. We find that high CR ionization rates and high depletion factors allow the simulated $\rm D_{frac}^{N_2H^+}$ and absolute abundances to match observational values within one freefall time. For $\rm OPR^{H_2}$, while a lower initial value helps the growth of $\rm D_{frac}^{N_2H^+}$, the spatial structure of deuteration is too widespread compared to observed systems. For an example model with elevated CR ionization rates and significant heavy element depletion, we then study the kinematic and dynamic properties of the core as traced by its $\rm N_2D^+$ emission. The core, undergoing quite rapid collapse, exhibits disturbed kinematics in its average velocity map. Still, because of magnetic support, the core often appears kinematically sub-virial based on its $\rm N_2D^+$ velocity dispersion.

preprint2021arXiv

Efficient Robust Training via Backward Smoothing

Adversarial training is so far the most effective strategy in defending against adversarial examples. However, it suffers from high computational costs due to the iterative adversarial attacks in each training step. Recent studies show that it is possible to achieve fast Adversarial Training by performing a single-step attack with random initialization. However, such an approach still lags behind state-of-the-art adversarial training algorithms on both stability and model robustness. In this work, we develop a new understanding towards Fast Adversarial Training, by viewing random initialization as performing randomized smoothing for better optimization of the inner maximization problem. Following this new perspective, we also propose a new initialization strategy, backward smoothing, to further improve the stability and model robustness over single-step robust training methods. Experiments on multiple benchmarks demonstrate that our method achieves similar model robustness as the original TRADES method while using much less training time ($\sim$3x improvement with the same training schedule).
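
The single-step attack with random initialization mentioned in this abstract can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the paper's backward smoothing method: it assumes a logistic-regression classifier, an L-infinity perturbation budget `eps`, and a step size `alpha = 1.25 * eps`; all names and numbers are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logistic_loss(w, b, x, y):
    """Binary cross-entropy of a logistic-regression model, label y in {0, 1}."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def loss_grad_wrt_input(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def random_start_single_step(w, b, x, y, eps, alpha, rng):
    """One signed-gradient step from a random point in the eps-ball around x."""
    # Random initialization inside the L-infinity ball of radius eps.
    delta = [rng.uniform(-eps, eps) for _ in x]
    x_start = [xi + di for xi, di in zip(x, delta)]
    grad = loss_grad_wrt_input(w, b, x_start, y)
    # Ascend the loss once, then clip the perturbation back into the eps-ball.
    delta = [max(-eps, min(eps, di + alpha * sign(gi)))
             for di, gi in zip(delta, grad)]
    return [xi + di for xi, di in zip(x, delta)]


# Hypothetical toy model and input.
w, b = [1.0, -2.0], 0.0
x, y = [0.5, -0.5], 1
eps = 0.1
x_adv = random_start_single_step(w, b, x, y, eps, 1.25 * eps, random.Random(0))

print(logistic_loss(w, b, x, y), logistic_loss(w, b, x_adv, y))
```

In adversarial training, perturbed inputs such as `x_adv` would replace or augment the clean inputs in each training step; the sketch only covers the attack, not the training loop.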

preprint2021arXiv

Electronic controllable broadband and robust terahertz surface plasmon-polaritons switch based on hybrid ITO waveguide coupler

The surface plasmon-polaritons (SPPs) switch is the key element of the integrated devices in optical computation and terahertz (THz) communications. In this paper, we propose a novel design of THz SPPs switch based on quantum engineering. Due to the robustness of the coherent quantum control technique, our switch is very robust against perturbations of geometrical parameters and presents a good performance at on-state (and off-state) from 0.5 THz to 0.7 THz. The on-state and off-state of our device can be controlled by the external voltage. We believe this finding will be a great improvement for integrated optical computing and THz communications.

preprint2021arXiv

EnlightenGAN: Deep Light Enhancement without Paired Supervision

Deep learning-based methods have achieved remarkable success in image restoration and enhancement, but are they still competitive when there is a lack of paired training data? As one such example, this paper explores the low-light image enhancement problem, where in practice it is extremely challenging to simultaneously take a low-light and a normal-light photo of the same visual scene. We propose a highly effective unsupervised generative adversarial network, dubbed EnlightenGAN, that can be trained without low/normal-light image pairs, yet proves to generalize very well on various real-world test images. Instead of supervising the learning using ground truth data, we propose to regularize the unpaired training using the information extracted from the input itself, and benchmark a series of innovations for the low-light image enhancement problem, including a global-local discriminator structure, a self-regularized perceptual loss fusion, and attention mechanism. Through extensive experiments, our proposed approach outperforms recent methods under a variety of metrics in terms of visual quality and subjective user study. Thanks to the great flexibility brought by unpaired training, EnlightenGAN is demonstrated to be easily adaptable to enhancing real-world images from various domains. The code is available at https://github.com/yueruchen/EnlightenGAN.

preprint2021arXiv

Fair for All: Best-effort Fairness Guarantees for Classification

Standard approaches to group-based notions of fairness, such as parity and equalized odds, try to equalize absolute measures of performance across known groups (based on race, gender, etc.). Consequently, a group that is inherently harder to classify may hold back the performance on other groups; and no guarantees can be provided for unforeseen groups. Instead, we propose a fairness notion whose guarantee, on each group $g$ in a class $\mathcal{G}$, is relative to the performance of the best classifier on $g$. We apply this notion to broad classes of groups, in particular, where (a) $\mathcal{G}$ consists of all possible groups (subsets) in the data, and (b) $\mathcal{G}$ is more streamlined. For the first setting, which is akin to groups being completely unknown, we devise the PF (Proportional Fairness) classifier, which guarantees, on any possible group $g$, an accuracy that is proportional to that of the optimal classifier for $g$, scaled by the relative size of $g$ in the data set. Due to including all possible groups, some of which could be too complex to be relevant, the worst-case theoretical guarantees here have to be proportionally weaker for smaller subsets. For the second setting, we devise the BeFair (Best-effort Fair) framework which seeks an accuracy, on every $g \in \mathcal{G}$, which approximates that of the optimal classifier on $g$, independent of the size of $g$. Aiming for such a guarantee results in a non-convex problem, and we design novel techniques to get around this difficulty when $\mathcal{G}$ is the set of linear hypotheses. We test our algorithms on real-world data sets, and present interesting comparative insights on their performance.

preprint2021arXiv

Surveys of Clumps, Cores, and Condensations in the Cygnus X: II. Radio Properties of the Massive Dense Cores

We have carried out a high-sensitivity and high-resolution radio continuum study towards a sample of 47 massive dense cores (MDCs) in the Cygnus X star-forming complex using the Karl G. Jansky Very Large Array, aiming to detect and characterize the radio emission associated with star-forming activities down to ~0.01 pc scales. We have detected 64 radio sources within or closely around the full width at half-maximum (FWHM) of the MDCs, of which 37 are reported for the first time. The majority of the detected radio sources are associated with dust condensations embedded within the MDCs, and they are mostly weak and compact. We are able to build spectral energy distributions for 8 sources. Two of them indicate non-thermal emission and the other six indicate thermal free-free emission. We have determined that most of the radio sources are ionized jets or winds originating from massive young stellar objects, whereas only a few sources are likely to be ultra-compact HII regions. Further quantitative analyses indicate that the radio luminosity of the detected radio sources increases along the evolution path of the MDCs.

preprint2021arXiv

Topology Aware Deep Learning for Wireless Network Optimization

Data-driven machine learning approaches have recently been proposed to facilitate wireless network optimization by learning latent knowledge from historical optimization instances. However, existing methods do not handle well the topology information that directly impacts the network optimization results. Directly operating on simple representations, e.g., adjacency matrices, results in poor generalization performance as the learned results depend on specific ordering of the network elements in the training data. To address this issue, we propose a two-stage topology-aware machine learning framework (TALF), which trains a graph embedding unit and a deep feed-forward network (DFN) jointly. By propagating and summarizing the underlying graph topological information, TALF encodes the topology in the vector representation of the optimization instance, which is used by the later DFN to infer critical structures of an optimal or near-optimal solution. The proposed approach is evaluated on a canonical wireless network flow problem with diverse network topologies and flow deployments. An in-depth study on the trade-off between efficiency and effectiveness of the inference results is also conducted, and we show that our approach is better at differentiating links, saving up to 60% computation time at over 90% solution quality.

84 published item(s)

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

DiffThinker: Towards Generative Multimodal Reasoning with Diffusion Models

MiMo-Audio: Audio Language Models are Few-Shot Learners

Sommerfeld Enhancement from Background Force and the Galactic Center GeV Excess

Hypotheses Tree Building for One-Shot Temporal Sentence Localization

Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding

A Deep Reinforcement Learning based Approach for NOMA-based Random Access Network with Truncated Channel Inversion Power Control

A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models

A SBP-SAT FDTD Subgridding Method Using Staggered Yee's Grids Without Modifying Field Components

Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

Backdoor Attacks on Crowd Counting

CP violating dark photon kinetic mixing and type-III seesaw model

Disks and Outflows in the Intermediate-mass Star Forming Region NGC 2071 IR

Dual networks based 3D Multi-Person Pose Estimation from Monocular Video

Efficient Algorithms for Planning with Participation Constraints

Flavor Specific $U(1)_{B_q-L_μ}$ Gauge Model for Muon $g-2$ and $b \to s \bar μμ$ Anomalies

Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training

Memory-Guided Semantic Learning Network for Temporal Sentence Grounding

The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy

Towards the Development of A Three-Dimensional SBP-SAT FDTD Method: Theory and Validation

Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Widening the $U(1)_{L_μ-L_τ}$ $Z^\prime$ mass range for resolving the muon $g-2$ anomaly

ZOOMER: Boosting Retrieval on Web-scale Graphs by Regions of Interest

Collider search of light dark matter model with dark sector decay

Deep Co-Attention Network for Multi-View Subspace Learning

Deuterium Chemodynamics of Massive Pre-Stellar Cores

Efficient Robust Training via Backward Smoothing

Electronic controllable broadband and robust terahertz surface plasmon-polaritons switch based on hybrid ITO waveguide coupler

EnlightenGAN: Deep Light Enhancement without Paired Supervision

Fair for All: Best-effort Fairness Guarantees for Classification

Surveys of Clumps, Cores, and Condensations in the Cygnus X: II. Radio Properties of the Massive Dense Cores

Topology Aware Deep Learning for Wireless Network Optimization

3D Human Pose Estimation using Spatio-Temporal Networks with Explicit Occlusion Training

A Survey of Model Compression and Acceleration for Deep Neural Networks

Adversarial Robustness: From Self-Supervised Pre-Training to Fine-Tuning

ALMA Observations of Massive Clouds in the Central Molecular Zone: Jeans Fragmentation and Cluster Formation

BachGAN: High-Resolution Image Synthesis from Salient Object Layout

Bayesian Cycle-Consistent Generative Adversarial Networks via Marginalizing Latent Sampling

Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models

Constrained Deep Reinforcement Learning for Energy Sustainable Multi-UAV based Random Access IoT Networks with NOMA

Contextual Text Style Transfer

Discourse-Aware Neural Extractive Text Summarization

Distilling Knowledge Learned in BERT for Text Generation

Fate of false vacuum in singlet-doublet fermion extension model with RG improved effective action

Fine-grained Iterative Attention Network for Temporal Language Localization in Videos

FreeLB: Enhanced Adversarial Training for Natural Language Understanding

Gas Kinematics of the Massive Protocluster G286.21+0.17 Revealed by ALMA

Graph Optimal Transport for Cross-Domain Alignment

High-Dimensional Robust Mean Estimation via Gradient Descent

INSET: Sentence Infilling with INter-SEntential Transformer

Investigation of Numerical Dispersion with Time Step of The FDTD Methods: Avoiding Erroneous Conclusions

Optimizing Non-Orthogonal Multiple Access in Random Access Networks

Sequential Attention GAN for Interactive Image Editing

Stellar Variability in a Forming Massive Star Cluster

Towards Better Understanding of Disentangled Representations via Mutual Information

UNITER: UNiversal Image-TExt Representation Learning

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference

What Makes A Good Story? Designing Composite Rewards for Visual Storytelling

Discovery of a Photoionized Bipolar Outflow towards the Massive Protostar G45.47+0.05

A Distributed Secure Outsourcing Scheme for Solving Linear Algebraic Equations in Ad Hoc Clouds

Deep Structured Energy Based Models for Anomaly Detection

Doubly Convolutional Neural Networks

Fully-adaptive Feature Sharing in Multi-Task Networks with Applications in Person Attribute Classification

Generative Adversarial Networks as Variational Training of Energy Based Models

Hardness Results for Signaling in Bayesian Zero-Sum and Network Routing Games

Playing Anonymous Games using Simple Strategies

S3Pool: Pooling with Stochastic Spatial Sampling

Walk and Learn: Facial Attribute Representation Learning from Egocentric Video and Contextual Data

An exploration of parameter redundancy in deep networks with circulant projections

Mixture Selection, Mechanism Design, and Signaling

Spectral Sparsification of Random-Walk Matrix Polynomials

Well-Supported versus Approximate Nash Equilibria: Query Complexity of Large Games

Workload-Driven Vertical Partitioning for Effective Query Processing over Raw Data