Source author record

Ying Cheng

Ying Cheng appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher, unclaimed source record

Computer Vision, Artificial Intelligence, Machine Learning, Multimedia, Applications, Computation and Language, cond-mat.mes-hall, physics.ao-ph, physics.optics, physics.soc-ph, quant-ph

Catalog footprint

What is connected

9 works

11 topics

4 close collaborators

preprint, 2026, arXiv

MAGA-Bench: Machine-Augment-Generated Text via Alignment Detection Benchmark

Large Language Model (LLM) alignment is constantly evolving, and Machine-Generated Text (MGT) is becoming increasingly difficult to distinguish from Human-Written Text (HWT). This has exacerbated abuse issues such as fake news and online fraud. The generalization ability of fine-tuned detectors depends heavily on dataset quality, and simply expanding the sources of MGT is insufficient; further augmentation of the generation process is required. According to HC-Var's theory, enhancing the alignment of generated text can not only facilitate attacks on existing detectors to test their robustness, but also help improve the generalization ability of detectors fine-tuned on it. Therefore, we propose Machine-Augment-Generated Text via Alignment (MAGA). MAGA's pipeline achieves comprehensive alignment from prompt construction to the reasoning process, in which Reinforced Learning from Detectors Feedback (RLDF), which we propose systematically, serves as a key component. In our experiments, the RoBERTa detector fine-tuned on the MAGA training set achieved an average improvement of 4.60% in generalization detection AUC. The MAGA dataset caused an average decrease of 8.13% in the AUC of the selected detectors, which we expect to provide indicative significance for future research on the generalization detection ability of detectors.

preprint, 2025, arXiv

Entanglement dynamics driven by topology and non-Hermiticity

The interplay between topology and non-Hermiticity gives rise to exotic dynamic phenomena that challenge conventional wave-packet propagation and entanglement dynamics. While recent studies have established the non-Hermitian skin effect (NHSE) as a key mechanism for anomalous wave dynamics, a unified framework for characterizing and controlling entanglement evolution in non-Hermitian topological systems remains underdeveloped. Here, by combining theory and experiments, we demonstrate that entanglement entropy (EE) and transport currents serve as robust dynamic probes to distinguish various non-Hermitian topological regimes. Using a generalized non-Hermitian Su-Schrieffer-Heeger model implemented in an acoustic analog platform, we identify three dynamic phases, bulk-like, edge-like, and skin-like regimes, each exhibiting unique EE signatures and transport characteristics. In particular, skin-like dynamics exhibit periodic information shuttling with finite, oscillatory EE, while edge-like dynamics lead to complete EE suppression. We further map the dynamic phase diagram and show that EE scaling and temporal profiles directly reflect the competition between coherent delocalization and NHSE-driven localization. Our results establish a programmable approach to steering entanglement and transport via tailored non-Hermitian couplings, offering a pathway for engineering quantum information dynamics in synthetic phononic, photonic, and quantum simulators.

preprint, 2022, arXiv

IDEA: Increasing Text Diversity via Online Multi-Label Recognition for Vision-Language Pre-training

Vision-Language Pre-training (VLP) with large-scale image-text pairs has demonstrated superior performance in various fields. However, the image-text pairs that co-occur on the Internet typically lack explicit alignment information, which is suboptimal for VLP. Existing methods adopt an off-the-shelf object detector to utilize additional image tag information. However, the object detector is time-consuming and can only identify pre-defined object categories, limiting the model capacity. Inspired by the observation that the texts incorporate incomplete fine-grained image information, we introduce IDEA, which stands for increasing text diversity via online multi-label recognition for VLP. IDEA shows that multi-label learning with image tags extracted from the texts can be jointly optimized during VLP. Moreover, IDEA can identify valuable image tags online to provide more explicit textual supervision. Comprehensive experiments demonstrate that IDEA can significantly boost performance on multiple downstream datasets at a small extra computational cost.

preprint, 2022, arXiv

MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing

Recognizing and localizing events in videos is a fundamental task for video understanding. Since events may occur in auditory and visual modalities, multimodal detailed perception is essential for complete scene comprehension. Most previous works attempted to analyze videos from a holistic perspective. However, they do not consider semantic information at multiple scales, which makes it difficult for the model to localize events of different lengths. In this paper, we present a Multimodal Pyramid Attentional Network (MM-Pyramid) for event localization. Specifically, we first propose the attentive feature pyramid module. This module captures temporal pyramid features via several stacked pyramid units, each composed of a fixed-size attention block and a dilated convolution block. We also design an adaptive semantic fusion module, which leverages a unit-level attention block and a selective fusion block to integrate pyramid features interactively. Extensive experiments on audio-visual event localization and weakly-supervised audio-visual video parsing tasks verify the effectiveness of our approach.

preprint, 2022, arXiv

Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection

Weakly-supervised audio-visual violence detection aims to distinguish snippets containing multimodal violence events using video-level labels. Many prior works perform audio-visual integration and interaction in an early or intermediate manner, yet overlook modality heterogeneity in the weakly-supervised setting. In this paper, we analyze the modality asynchrony and undifferentiated instances phenomena of the multiple instance learning (MIL) procedure, and further investigate their negative impact on weakly-supervised audio-visual learning. To address these issues, we propose a modality-aware contrastive instance learning with self-distillation (MACIL-SD) strategy. Specifically, we leverage a lightweight two-stream network to generate audio and visual bags, in which unimodal background, violent, and normal instances are clustered into semi-bags in an unsupervised way. Then audio and visual violent semi-bag representations are assembled as positive pairs, and violent semi-bags are combined with background and normal instances in the opposite modality as contrastive negative pairs. Furthermore, a self-distillation module is applied to transfer unimodal visual knowledge to the audio-visual model, which alleviates noise and closes the semantic gap between unimodal and multimodal features. Experiments show that our framework outperforms previous methods with lower complexity on the large-scale XD-Violence dataset. Results also demonstrate that our proposed approach can be used as plug-in modules to enhance other networks. Code is available at https://github.com/JustinYuu/MACIL_SD.

preprint, 2022, arXiv

Self-Supervised Video Representation Learning with Motion-Contrastive Perception

Visual-only self-supervised learning has achieved significant improvement in video representation learning. Existing related methods encourage models to learn video representations by utilizing contrastive learning or designing specific pretext tasks. However, some models are likely to focus on the background, which is unimportant for learning video representations. To alleviate this problem, we propose a new view called long-range residual frame to obtain more motion-specific information. Based on this, we propose the Motion-Contrastive Perception Network (MCPNet), which consists of two branches, namely, Motion Information Perception (MIP) and Contrastive Instance Perception (CIP), to learn generic video representations by focusing on the changing areas in videos. Specifically, the MIP branch aims to learn fine-grained motion features, and the CIP branch performs contrastive learning to learn overall semantic information for each instance. Experiments on two benchmark datasets, UCF-101 and HMDB-51, show that our method outperforms current state-of-the-art visual-only self-supervised approaches.

preprint, 2020, arXiv

Applying the Network Item Response Model to Student Assessment Data

This study discusses an alternative tool for modeling student assessment data. The model constructs networks from a matrix of item responses and attempts to represent these data in low-dimensional Euclidean space. This procedure has advantages over common methods used for modeling student assessment data, such as Item Response Theory, because it relaxes the highly restrictive local-independence assumption. This article provides a deep discussion of the model and the steps one must take to estimate it. To enable extending an existing model by adding data, two methods for estimating the positions of new individuals in the network are discussed. A real data analysis is then provided as a case study on using the model and interpreting the results. Finally, the model is compared and contrasted with other popular models in psychological and educational measurement: item response theory (IRT) and the network psychometric Ising model for binary data.

preprint, 2020, arXiv

Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning

When watching videos, the occurrence of a visual event is often accompanied by an audio event, e.g., a voice accompanying lip motion or music accompanying the playing of instruments. There is an underlying correlation between audio and visual events, which can be utilized as free supervised information to train a neural network by solving the pretext task of audio-visual synchronization. In this paper, we propose a novel self-supervised framework with a co-attention mechanism to learn generic cross-modal representations from unlabelled videos in the wild, and further benefit downstream tasks. Specifically, we explore three different co-attention modules to focus on discriminative visual regions correlated to the sounds and introduce the interactions between them. Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods. To further evaluate the generalizability and transferability of our approach, we apply the pre-trained model to two downstream tasks, i.e., sound source localization and action recognition. Extensive experiments demonstrate that our model provides results competitive with other self-supervised methods, and also indicate that our approach can tackle challenging scenes that contain multiple sound sources.

preprint, 2020, arXiv

Significant reduced traffic in Beijing failed to relieve haze pollution during the COVID-19 lockdown: implications for haze mitigation

The COVID-19 outbreak greatly limited human activities and reduced primary emissions, particularly from urban on-road vehicles, but coincided with Beijing experiencing haze during the pandemic, raising public concern about the validity and effectiveness of the imposed traffic policies for improving air pollution. Here, we explored the relationship between local vehicle emissions and the winter haze in Beijing before and during the COVID-19 lockdown period based on an integrated analysis framework, which combines a real-time on-road emission inventory, in-situ air quality observations and a localized chemical transport modeling system. We found that traffic emissions decreased substantially as a result of the pandemic, with a higher reduction for NOx (75.9%, 125.3 Mg/day) compared to VOCs (53.1%, 52.9 Mg/day). Unexpectedly, our results show that the imbalanced emission abatement of NOx and VOCs from vehicles led to a significant rise in the atmospheric oxidizing capacity in urban areas, but resulted in only modest increases in secondary aerosols due to the inadequate precursors. However, the enhanced oxidizing capacity in the surrounding regions greatly increased the secondary particles with relatively abundant precursors, which is mainly responsible for Beijing haze during the lockdown period. Our results indicate that the winter haze in Beijing was insensitive to the local vehicular emissions reduction due to the complicated nonlinear response of the fine particle and air pollutant emissions. We suggest mitigation policies should focus on accelerating VOC and NH3 emissions reduction and synchronously controlling regional sources to realize the benefits of local traffic emission control.
