Source author record

Fengyun Rao

Fengyun Rao appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher (unclaimed source record)

Computer Vision, astro-ph.HE

Catalog footprint

What is connected

7 works

2 topics

4 close collaborators

preprint, 2026, arXiv

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding

Omni-proactive streaming video understanding, i.e., autonomously deciding when to speak and what to say from continuous audio-visual streams, is an emerging capability of omni-modal large language models. Existing benchmarks fall short in three key aspects: they rely primarily on visual signals, adopt polling or fixed-timestamp protocols instead of true proactive evaluation, and cover only a limited range of tasks, preventing reliable assessment and differentiation of omni-proactive streaming models. We present OmniPro, the first benchmark to jointly evaluate omni-modal perception, proactive responding, and diverse video understanding tasks. It comprises 2,700 human-verified samples spanning 9 sub-tasks and 3 cognitive levels, covering 6 basic video understanding capabilities. Notably, 84% of samples require audio signals (speech or non-speech), and each sample is annotated with modality-isolation labels to enable fine-grained multimodal analysis. We further introduce a dual-mode evaluation protocol: Probe mode assesses content understanding by querying the model before and after each ground-truth trigger, while Online mode evaluates full proactive ability by requiring models to autonomously decide when to respond in streaming input. Evaluating 11 representative models reveals three key findings: (1) audio provides consistent gains but with highly variable utilization across models, (2) performance degrades significantly over time, indicating limited long-horizon robustness, and (3) non-speech audio perception remains the weakest dimension.

preprint, 2026, arXiv

Semantic-Enriched Latent Visual Reasoning

Multimodal latent-space reasoning aims to replace explicit thinking with images by performing visual reasoning directly in a compact latent space. However, existing approaches largely rely on visual supervision and produce latent representations that lack sufficient semantic richness, limiting their ability to support diverse region-level reasoning tasks. In this work, we introduce Semantic-Enriched Latent Visual Reasoning (SLVR), a two-stage learning framework that enriches latent representations with attribute-level visual semantics and aligns them with diverse reasoning objectives. In the first stage, SLVR learns semantically enriched region-centric latents under fine-grained attribute supervision. In the second stage, we design Multi-query Group Relative Policy Optimization (M-GRPO) to align latent representations across multiple queries grounded in the same region. To support this framework, we construct SLV-Set, comprising approximately 400K region-level attribute annotations and 800K multi-query question answering samples, and introduce SV-QA, a benchmark that evaluates latent reasoning under semantic variation. Experiments demonstrate that SLVR improves the robustness and semantic consistency of latent visual reasoning compared to existing baselines.

preprint, 2026, arXiv

Stage-adaptive Token Selection for Efficient Omni-modal LLMs

Omni-modal large language models (om-LLMs) achieve unified audio-visual understanding by encoding video and audio into temporally aligned token sequences interleaved at the window level. However, processing these dense non-textual tokens throughout the LLM incurs substantial computational overhead. Although training-free token selection can reduce this cost, existing methods either focus on visual-only inputs or prune om-LLM tokens only before the LLM with fixed per-modality ratios, failing to capture how cross-modal token importance evolves across layers. To address this limitation, we first analyze the layer-wise token dependency of om-LLMs. We find that visual and audio dependencies follow a block-wise pattern and gradually weaken with depth, indicating that many late-layer non-textual tokens become redundant after cross-modal fusion. Motivated by this observation, we propose SEATS, a training-free, stage-adaptive token selection method for efficient om-LLM inference. Before the LLM, SEATS removes spatiotemporal redundancy via attention-weighted diversity selection. Inside the LLM, it progressively prunes tokens across blocks and dynamically allocates the retention budget from temporal windows to modalities using query relevance scores. In late layers, it removes all remaining non-textual tokens once cross-modal fusion is complete. Experiments on Qwen2.5-Omni and Qwen3-Omni demonstrate that SEATS effectively improves inference efficiency. Retaining only 10% of visual and audio tokens, it achieves a 9.3x FLOPs reduction and a 4.8x prefill speedup while preserving 96.3% of the original performance.

preprint, 2022, arXiv

CA-SSL: Class-Agnostic Semi-Supervised Learning for Detection and Segmentation

To improve instance-level detection/segmentation performance, existing self-supervised and semi-supervised methods extract either task-unrelated or task-specific training signals from unlabeled data. We show that these two approaches, at the two extreme ends of the task-specificity spectrum, are suboptimal for task performance. Utilizing too few task-specific training signals causes underfitting to the ground-truth labels of downstream tasks, while utilizing too many causes overfitting to the ground-truth labels. To this end, we propose a novel Class-Agnostic Semi-Supervised Learning (CA-SSL) framework to achieve a more favorable task-specificity balance in extracting training signals from unlabeled data. CA-SSL has three training stages that act on either ground-truth labels (labeled data) or pseudo labels (unlabeled data). This decoupling strategy avoids the complicated scheme in traditional SSL methods that balances the contributions from both data types. In particular, we introduce a warmup training stage to achieve a better balance in task specificity by ignoring class information in the pseudo labels while preserving localization training signals. As a result, our warmup model can better avoid underfitting/overfitting when fine-tuned on the ground-truth labels in detection and segmentation tasks. Using 3.6M unlabeled data, we achieve a significant performance gain of 4.7% over the ImageNet-pretrained baseline on FCOS object detection. In addition, our warmup model demonstrates excellent transferability to other detection and segmentation frameworks.

preprint, 2010, arXiv

Detection of Strong Short-Term Variability in NGC 6946 X-1

Using two archival XMM-Newton observations, we identify strong X-ray flux variations in NGC 6946 X-1, indicating it is the most variable ultraluminous X-ray source (ULX) on mHz time scales known so far. The 1-10 keV lightcurve exhibits variability with a fractional rms amplitude of 60% integrated in the frequency range of 1-100 mHz. The power spectral density of the source shows a flat-topped spectrum that breaks at about 3 mHz with possible quasi-periodic oscillations (QPOs) near 8.5 mHz. Black hole binaries usually produce strong fast variability in the hard or intermediate state. The energy spectrum of NGC 6946 X-1 is dominated by two components, a 0.18 keV thermal disk and a power law with a photon index of ~2.2, which is consistent with the intermediate state. The characteristic time scales of the X-ray emission suggest that the ULX may contain a black hole with a mass on the order of 10^3 solar masses.
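As a rough illustration of the "fractional rms amplitude integrated in the frequency range of 1-100 mHz" quoted above: for a power spectrum in fractional rms-squared normalization, (rms/mean)^2 per Hz, the fractional rms over a band is the square root of the integrated power. The sketch below assumes that normalization and uses a made-up flat spectrum; the function name `fractional_rms` and the input values are hypothetical and are not taken from the paper.

```python
import math

def fractional_rms(freqs_hz, power, f_lo, f_hi):
    """Fractional rms from an rms-normalized PSD.

    freqs_hz: ascending frequencies in Hz.
    power:    PSD values in (rms/mean)^2 per Hz (assumed normalization).
    Integrates with the trapezoidal rule over grid segments that lie
    entirely inside [f_lo, f_hi], then takes the square root.
    """
    total = 0.0
    for i in range(len(freqs_hz) - 1):
        f0, f1 = freqs_hz[i], freqs_hz[i + 1]
        if f0 >= f_lo and f1 <= f_hi:
            total += 0.5 * (power[i] + power[i + 1]) * (f1 - f0)
    return math.sqrt(total)

# Illustrative input only: a flat PSD on a 1 mHz grid from 1 to 100 mHz,
# with its level chosen so the integrated variance is 0.36.
freqs = [i / 1000 for i in range(1, 101)]
flat_level = 0.36 / 0.099  # (rms/mean)^2 per Hz
psd = [flat_level] * len(freqs)
rms = fractional_rms(freqs, psd, 0.001, 0.1)
```

With these synthetic inputs the integrated variance is 0.36, so `rms` comes out at about 0.6, i.e. 60%. A real analysis would start from a measured, noise-subtracted power spectrum rather than a flat one.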

preprint, 2010, arXiv

Discovery of mHz X-ray Oscillations in a Transient Ultraluminous X-ray Source in M82

We report the discovery of X-ray quasi-periodic oscillations (QPOs) at frequencies of 3-4 mHz from a transient ultraluminous X-ray source (ULX) X42.3+59 in M82. The QPOs are strong and broad and appear with weak or absent red noise, and are detected only in Chandra observations when the source is brighter than 10^40 ergs/s. The QPO behavior is similar to the type A-I QPOs found in XTE J1550-564, which is a subclass of low frequency QPOs with properties in between type A and B. Therefore, we identify the QPOs in X42.3+59 as type A or B, and rule out the possibility of type C. With this identification, the mass of the black hole in X42.3+59 can be inferred to lie in the range of 12,000-43,000 solar masses by scaling the QPO frequency to that of the type A/B QPOs in stellar mass black holes. Cool disk emission is detected in one Chandra observation, and the disk inner radius suggests a similar black hole mass range. Black holes of such a high mass are able to produce an energy output in a manner similar to X42.3+59 by accreting from the interstellar medium directly.
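The mass estimate above rests on scaling a QPO frequency against a reference source. A minimal sketch of that kind of scaling follows, assuming the characteristic frequency is inversely proportional to black hole mass. The function name `scale_mass_from_qpo` and the reference values are hypothetical placeholders, not the calibration used in the paper.

```python
def scale_mass_from_qpo(ref_mass_msun, ref_freq_hz, obs_freq_hz):
    """Estimate a black hole mass in solar masses from a QPO frequency.

    Assumes frequency scales as 1/mass, so
        M_obs = M_ref * (f_ref / f_obs).
    All three inputs must be positive.
    """
    if ref_mass_msun <= 0 or ref_freq_hz <= 0 or obs_freq_hz <= 0:
        raise ValueError("masses and frequencies must be positive")
    return ref_mass_msun * ref_freq_hz / obs_freq_hz

# Placeholder reference: a 10 solar-mass black hole with a 5 Hz QPO,
# compared against an observed 5 mHz QPO.
estimate = scale_mass_from_qpo(10.0, 5.0, 0.005)
```

With these placeholder inputs the estimate is about 10,000 solar masses. The published 12,000-43,000 range depends on the reference masses and frequencies the authors adopted, which this record does not list.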

preprint, 2010, arXiv

Low-frequency oscillations in XTE J1550-564

We present the results of timing analysis of the low-frequency Quasi-Periodic Oscillation (QPO) in the Rossi X-Ray Timing Explorer data of the black hole binary XTE J1550-564 during its 1998 outburst. The QPO frequency is observed to vary on timescales between ~100 s and days, correlated with the count rate contribution from the optically thick accretion disk: we studied this correlation and discuss its influence on the QPO width. In all observations, the quality factors (ν_0/FWHM) of the fundamental and second harmonic peaks were observed to be consistent, suggesting that the quasi-periodic nature of the oscillation is due to frequency modulation. In addition to the QPO and its harmonic peaks, a new 1.5ν component was detected in the power spectra. This component is broad, with a quality factor of ~0.6. From this, we argue that the peak observed at half the QPO frequency, usually referred to as "sub-harmonic", could be the fundamental frequency, leading to the sequence 1:2:3:4. We also studied the energy dependence of the timing features and conclude that the two continuum components observed in the power spectrum, although both more intense at high energies, show a different dependence on energy. At low energies, the lowest-frequency component dominates, while at high energies the higher-frequency one has a higher fractional rms. An interplay between these two components was also observed as a function of their characteristic frequency. In this source, the transition between low/hard state and hard-intermediate state appears to be a smooth process.

Institution

Affiliation not imported yet

This author record came from a source that does not expose affiliation metadata. Once the author claims the profile or we enrich the record from another provider, this section will link to the concrete institution.

Topic footprint