Source author record

Ying Shan

Ying Shan appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Artificial Intelligence Machine Learning eess.AS eess.IV Multimedia Sound Computation and Language

Catalog footprint

What is connected

33works

8topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Pixal3D: Pixel-Aligned 3D Generation from Images

Recent advances in 3D generative models have rapidly improved image-to-3D synthesis quality, enabling higher-resolution geometry and more realistic appearance. Yet fidelity, which measures pixel-level faithfulness of the generated 3D asset to the input image, still remains a central bottleneck. We argue this stems from an implicit 2D-3D correspondence issue: most 3D-native generators synthesize shape in canonical space and inject image cues via attention, leaving pixel-to-3D associations ambiguous. To tackle this issue, we draw inspiration from 3D reconstruction and propose Pixal3D, a pixel-aligned 3D generation paradigm for high-fidelity 3D asset creation from images. Instead of generating in a canonical pose, Pixal3D directly generates 3D in a pixel-aligned way, consistent with the input view. To enable this, we introduce a pixel back-projection conditioning scheme that explicitly lifts multi-scale image features into a 3D feature volume, establishing direct pixel-to-3D correspondence without ambiguity. We show that Pixal3D is not only scalable and capable of producing high-quality 3D assets, but also substantially improves fidelity, approaching the fidelity level of reconstruction. Furthermore, Pixal3D naturally extends to multi-view generation by aggregating back-projected feature volumes across views. Finally, we show pixel-aligned generation benefits scene synthesis, and present a modular pipeline that produces high-fidelity, object-separated 3D scenes from images. Pixal3D for the first time demonstrates 3D-native pixel-aligned generation at scale, and provides a new inspiring way towards high-fidelity 3D generation of object or scene from single or multi-view images. Project page: https://ldyang694.github.io/projects/pixal3d/

preprint2026arXiv

Semantic Generative Tuning for Unified Multimodal Models

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation pattern. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available on the https://song2yu.github.io/SGT/.

preprint2024arXiv

Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos

Generating text-editable and pose-controllable character videos have an imperious demand in creating various digital human. Nevertheless, this task has been restricted by the absence of a comprehensive dataset featuring paired video-pose captions and the generative prior models for videos. In this work, we design a novel two-stage training scheme that can utilize easily obtained datasets (i.e.,image pose pair and pose-free video) and the pre-trained text-to-image (T2I) model to obtain the pose-controllable character videos. Specifically, in the first stage, only the keypoint-image pairs are used only for a controllable text-to-image generation. We learn a zero-initialized convolutional encoder to encode the pose information. In the second stage, we finetune the motion of the above network via a pose-free video dataset by adding the learnable temporal self-attention and reformed cross-frame self-attention blocks. Powered by our new designs, our method successfully generates continuously pose-controllable character videos while keeps the editing and concept composition ability of the pre-trained T2I model. The code and models will be made publicly available.

preprint2024arXiv

Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion

Any-to-any singing voice conversion (SVC) is confronted with the challenge of ``timbre leakage'' issue caused by inadequate disentanglement between the content and the speaker timbre. To address this issue, this study introduces NeuCoSVC, a novel neural concatenative SVC framework. It consists of a self-supervised learning (SSL) representation extractor, a neural harmonic signal generator, and a waveform synthesizer. The SSL extractor condenses audio into fixed-dimensional SSL features, while the harmonic signal generator leverages linear time-varying filters to produce both raw and filtered harmonic signals for pitch information. The synthesizer reconstructs waveforms using SSL features, harmonic signals, and loudness information. During inference, voice conversion is performed by substituting source SSL features with their nearest counterparts from a matching pool which comprises SSL features extracted from the reference audio, while preserving raw harmonic signals and loudness from the source audio. By directly utilizing SSL features from the reference audio, the proposed framework effectively resolves the ``timbre leakage" issue caused by previous disentanglement-based approaches. Experimental results demonstrate that the proposed NeuCoSVC system outperforms the disentanglement-based speaker embedding approach in one-shot SVC across intra-language, cross-language, and cross-domain evaluations.

preprint2023arXiv

AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos

This paper studies the problem of real-world video super-resolution (VSR) for animation videos, and reveals three key improvements for practical animation VSR. First, recent real-world super-resolution approaches typically rely on degradation simulation using basic operators without any learning capability, such as blur, noise, and compression. In this work, we propose to learn such basic operators from real low-quality animation videos, and incorporate the learned ones into the degradation generation pipeline. Such neural-network-based basic operators could help to better capture the distribution of real degradations. Second, a large-scale high-quality animation video dataset, AVC, is built to facilitate comprehensive training and evaluations for animation VSR. Third, we further investigate an efficient multi-scale network structure. It takes advantage of the efficiency of unidirectional recurrent networks and the effectiveness of sliding-window-based methods. Thanks to the above delicate designs, our method, AnimeSR, is capable of restoring real-world low-quality animation videos effectively and efficiently, achieving superior performance to previous state-of-the-art methods. Codes and models are available at https://github.com/TencentARC/AnimeSR.

Ying Shan

What is connected

Connect this record

See the researcher in context

Building this map preview

33 published item(s)

Pixal3D: Pixel-Aligned 3D Generation from Images

Semantic Generative Tuning for Unified Multimodal Models

Follow Your Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos

Neural Concatenative Singing Voice Conversion: Rethinking Concatenation-Based Approach for One-Shot Singing Voice Conversion

AnimeSR: Learning Real-World Super-Resolution Models for Animation Videos

A Hierarchical Speaker Representation Framework for One-shot Singing Voice Conversion

Accelerating the Training of Video Super-Resolution Models

All in One: Exploring Unified Video-Language Pre-training

Audio-to-symbolic Arrangement via Cross-modal Music Representation Learning

Bridging Video-text Retrieval with Multiple Choice Questions

CREATE: A Benchmark for Chinese Short Video Retrieval and Title Generation

DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes

Hot-Refresh Model Upgrades with Regression-Alleviating Compatible Training in Image Retrieval

mc-BEiT: Multi-choice Discretization for Image BERT Pre-training

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

MM-RealSR: Metric Learning based Interactive Modulation for Real-World Super-Resolution

Music-driven Dance Regeneration with Controllable Key Pose Constraints

Not All Models Are Equal: Predicting Model Transferability in a Self-challenging Fisher Space

Object-aware Video-language Pre-training for Retrieval

Privacy-Preserving Model Upgrades with Bidirectional Compatible Training in Image Retrieval

RepSR: Training Efficient VGG-style Super-Resolution Networks with Structural Re-Parameterization and Batch Normalization

Temporally Efficient Vision Transformer for Video Instance Segmentation

Towards Universal Backward-Compatible Representation Learning

Towards Vivid and Diverse Image Colorization with Generative Color Prior

UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection

Uncertainty Modeling for Out-of-Distribution Generalization

Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection

VFHQ: A High-Quality Dataset and Benchmark for Video Face Super-Resolution

VQFR: Blind Face Restoration with Vector-Quantized Dictionary and Parallel Decoder

A Generic Object Re-identification System for Short Videos

Open-book Video Captioning with Retrieve-Copy-Generate Network

Dual Semantic Fusion Network for Video Object Detection

Fast Video Object Segmentation using the Global Context Module