Source author record

Yitong Wang

Yitong Wang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Artificial Intelligence eess.SP

Catalog footprint

What is connected

4works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

DreamSR: Towards Ultra-High-Resolution Image Super-Resolution via a Receptive-Field Enhanced Diffusion Transformer

Large-scale pre-trained diffusion models have been extensively adopted for real-world image Super-Resolution because of their powerful generative priors through textual guidance. However, when super-resolving high-resolution images with patch-wise inference strategy, most existing diffusion-based SR methods tend to suffer from over-generation, due to the misalignment between the global prompt from LR image and the incomplete semantic information of local patches during each inference step. On the other hand, most existing methods also failed to generate detailed texture in local patches due to the overemphasis on global generation capabilities in network designs and training strategies. To address this issue, we present DreamSR, a novel SR model that suppresses local over-generation and improves fine-detail synthesis, thereby achieving visually faithful results with ultra-high-quality details. Specifically, we propose a dual-branch MM-ControlNet, where the ControlNet generates local textual feature with patch-level prompts while the pre-trained DiT provides global textual feature with global prompts, thereby mitigating over-generation and ensuring semantic consistency across patches. We also design a comprehensive training strategy with stage-specific data processing pipelines and a Receptive-Field Enhancement strategy, enhancing the model's capability to capture patch information and effectively restore local textures. Extensive experiments demonstrate that DreamSR outperforms state-of-the-art methods, providing high-quality SR results. Code and model are available at https://github.com/jerrydong0219/DreamSR.

preprint2026arXiv

NextFlow: Unified Sequential Modeling Activates Multimodal Understanding and Generation

We present NextFlow, a unified decoder-only autoregressive transformer trained on 6 trillion interleaved text-image discrete tokens. By leveraging a unified vision representation within a unified autoregressive architecture, NextFlow natively activates multimodal understanding and generation capabilities, unlocking abilities of image editing, interleaved content and video generation. Motivated by the distinct nature of modalities - where text is strictly sequential and images are inherently hierarchical - we retain next-token prediction for text but adopt next-scale prediction for visual generation. This departs from traditional raster-scan methods, enabling the generation of 1024x1024 images in just 5 seconds - orders of magnitude faster than comparable AR models. We address the instabilities of multi-scale generation through a robust training recipe. Furthermore, we introduce a prefix-tuning strategy for reinforcement learning. Experiments demonstrate that NextFlow achieves state-of-the-art performance among unified models and rivals specialized diffusion baselines in visual quality.

preprint2024arXiv

1st Place Solution for 5th LSVOS Challenge: Referring Video Object Segmentation

The recent transformer-based models have dominated the Referring Video Object Segmentation (RVOS) task due to the superior performance. Most prior works adopt unified DETR framework to generate segmentation masks in query-to-instance manner. In this work, we integrate strengths of that leading RVOS models to build up an effective paradigm. We first obtain binary mask sequences from the RVOS models. To improve the consistency and quality of masks, we propose Two-Stage Multi-Model Fusion strategy. Each stage rationally ensembles RVOS models based on framework design as well as training strategy, and leverages different video object segmentation (VOS) models to enhance mask coherence by object propagation mechanism. Our method achieves 75.7% J&F on Ref-Youtube-VOS validation set and 70% J&F on test set, which ranks 1st place on 5th Large-scale Video Object Segmentation Challenge (ICCV 2023) track 3. Code is available at https://github.com/RobertLuo1/iccv2023_RVOS_Challenge.

preprint2020arXiv

Experimental Demonstration of Millimeter-Wave Radio-over-Fiber System with Convolutional Neural Network (CNN) and Binary Convolutional Neural Network (BCNN)

The millimeter-wave (mm-wave) radio-over-fiber (RoF) systems have been widely studied as promising solutions to deliver high-speed wireless signals to end users, and neural networks have been studied to solve various linear and nonlinear impairments. However, high computation cost and large amounts of training data are required to effectively improve the system performance. In this paper, we propose and demonstrate highly computation efficient convolutional neural network (CNN) and binary convolutional neural network (BCNN) based decision schemes to solve these limitations. The proposed CNN and BCNN based decision schemes are demonstrated in a 5 Gbps 60 GHz RoF system for up to 20 km fiber distance. Compared with previously demonstrated neural networks, results show that the bit error rate (BER) performance and the computation intensive training process are improved. The number of training iterations needed is reduced by about 50 % and the amount of required training data is reduced by over 30 %. In addition, only one training is required for the entire measured received optical power range over 3.5 dB in the proposed CNN and BCNN schemes, to further reduce the computation cost of implementing neural networks decision schemes in mm-wave RoF systems.