Source author record

Xuanchi Ren

Xuanchi Ren appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Computer Vision, Machine Learning, Artificial Intelligence, eess.AS, Sound

Catalog footprint

What is connected

5 works

5 topics

4 close collaborators

Preprint, 2022, arXiv

Learning Disentangled Representation by Exploiting Pretrained Generative Models: A Contrastive Learning View

Under the intuitive notion of disentanglement, the image variations corresponding to different factors should be distinct from each other, and the disentangled representation should reflect those variations with separate dimensions. To discover the factors and learn a disentangled representation, previous methods typically add an extra regularization term when learning to generate realistic images. However, this term usually results in a trade-off between disentanglement and generation quality. For generative models pretrained without any disentanglement term, the generated images show semantically meaningful variations when traversing along different directions in the latent space. Based on this observation, we argue that it is possible to mitigate the trade-off by (i) leveraging pretrained generative models with high generation quality, and (ii) focusing on discovering the traversal directions as factors for disentangled representation learning. To achieve this, we propose Disentanglement via Contrast (DisCo), a framework that models the variations based on the target disentangled representations and contrasts the variations to jointly discover disentangled directions and learn disentangled representations. DisCo achieves state-of-the-art disentangled representation learning and distinct direction discovery, given pretrained non-disentangled generative models including GAN, VAE, and Flow. Source code is at https://github.com/xrenaa/DisCo.
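The idea of contrasting variations along latent directions can be illustrated with a toy sketch. This is not the DisCo implementation: the function names, the identity generator and encoder, and the InfoNCE-style loss are all assumptions chosen for illustration, and the abstract does not specify the actual loss. The sketch only shows that variations produced by the same direction score as a better match than variations from different directions.

```python
import math

def traverse(z, direction, step):
    """Shift latent code z by `step` along `direction`."""
    return [zi + step * di for zi, di in zip(z, direction)]

def variation(generator, encoder, z, direction, step):
    """Change in the encoded representation caused by moving z along one direction."""
    before = encoder(generator(z))
    after = encoder(generator(traverse(z, direction, step)))
    return [a - b for a, b in zip(after, before)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def contrastive_loss(query, positive, negatives, temperature=0.5):
    """InfoNCE-style loss (an assumption): low when query matches positive only."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

# Hypothetical stand-ins: a real setup would use a pretrained generator
# (GAN, VAE, or Flow) and a learned encoder.
identity = lambda x: list(x)
axis0, axis1 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]

v_a = variation(identity, identity, [0.3, -0.1, 0.5], axis0, 1.0)
v_b = variation(identity, identity, [-0.7, 0.2, 0.0], axis0, 1.0)
v_c = variation(identity, identity, [0.4, 0.4, -0.2], axis1, 1.0)

loss_matched = contrastive_loss(v_a, v_b, [v_c])     # same direction as positive
loss_mismatched = contrastive_loss(v_a, v_c, [v_b])  # different direction as positive
```

In this toy setting the matched pairing yields the smaller loss, which is the signal a contrastive objective would use to pull same-direction variations together.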

Preprint, 2022, arXiv

Look Outside the Room: Synthesizing A Consistent Long-Term 3D Scene Video from A Single Image

Novel view synthesis from a single image has recently attracted a lot of attention, and it has been primarily advanced by 3D deep learning and rendering techniques. However, most work is still limited to synthesizing new views within relatively small camera motions. In this paper, we propose a novel approach to synthesize a consistent long-term video given a single scene image and a trajectory of large camera motions. Our approach utilizes an autoregressive Transformer to perform sequential modeling of multiple frames, which reasons about the relations between multiple frames and the corresponding cameras to predict the next frame. To facilitate learning and ensure consistency among generated frames, we introduce a locality constraint based on the input cameras to guide self-attention among a large number of patches across space and time. Our method outperforms state-of-the-art view synthesis approaches by a large margin, especially when synthesizing long-term future in indoor 3D scenes. Project page at https://xrenaa.github.io/look-outside-room/.

Preprint, 2022, arXiv

Retriever: Learning Content-Style Representation as a Token-Level Bipartite Graph

This paper addresses the unsupervised learning of content-style decomposed representations. We first give a definition of style and then model the content-style representation as a token-level bipartite graph. An unsupervised framework, named Retriever, is proposed to learn such representations. First, a cross-attention module is employed to retrieve permutation-invariant (P.I.) information, defined as style, from the input data. Second, a vector quantization (VQ) module is used, together with human-induced constraints, to produce interpretable content tokens. Last, an innovative link attention module serves as the decoder to reconstruct data from the decomposed content and style, with the help of the linking keys. Being modality-agnostic, the proposed Retriever is evaluated in both speech and image domains. The state-of-the-art zero-shot voice conversion performance confirms the disentangling ability of our framework. Top performance is also achieved in the part discovery task for images, verifying the interpretability of our representation. In addition, the vivid part-based style transfer quality demonstrates the potential of Retriever to support various fascinating generative tasks. Project page at https://ydcustc.github.io/retriever-demo/.
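For readers unfamiliar with vector quantization, the following is a generic nearest-codeword lookup, the basic operation a VQ module performs. It is a minimal sketch and not Retriever's implementation: the `quantize` helper, the toy codebook, and the feature values are hypothetical, and the constraints mentioned in the abstract are not modeled.

```python
def quantize(vector, codebook):
    """Return (index, codeword) of the codebook entry nearest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]

# Hypothetical 2-D codebook and continuous features.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.9]]

# Each feature is replaced by the index of its nearest codeword,
# giving a discrete token sequence.
tokens = [quantize(f, codebook)[0] for f in features]
```

Each continuous feature maps to one discrete index, which is what makes the resulting content tokens countable and inspectable.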

Preprint, 2022, arXiv

Towards Building A Group-based Unsupervised Representation Disentanglement Framework

Disentangled representation learning is one of the major goals of deep learning, and is a key step for achieving explainable and generalizable models. A well-defined theoretical guarantee is still lacking for the VAE-based unsupervised methods, which are a set of popular methods to achieve unsupervised disentanglement. The group-theory-based definition of representation disentanglement mathematically connects the data transformations to the representations using the formalism of groups. In this paper, built on the group-based definition and inspired by the n-th dihedral group, we first propose a theoretical framework towards achieving unsupervised representation disentanglement. We then propose a model, based on existing VAE-based methods, to tackle the unsupervised learning problem of the framework. In the theoretical framework, we prove three sufficient conditions on model, group structure, and data respectively in an effort to achieve, in an unsupervised way, disentangled representation per the group-based definition. With the first two of the conditions satisfied and a necessary condition derived for the third one, we offer additional constraints, from the perspective of the group-based definition, for the existing VAE-based models. Experimentally, we train 1800 models covering the most prominent VAE-based methods on five datasets to verify the effectiveness of our theoretical framework. Compared to the original VAE-based methods, these Groupified VAEs consistently achieve better mean performance with smaller variances.
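As background for the dihedral group mentioned above, its standard presentation is given here. This is textbook group theory, not a formula taken from the paper; the generator names $r$ (rotation) and $s$ (reflection) are the conventional ones, and how the paper maps them to latent dimensions is not stated in the abstract.

```latex
D_n = \langle\, r, s \;\mid\; r^n = s^2 = e,\ s r s^{-1} = r^{-1} \,\rangle,
\qquad |D_n| = 2n
```

The relation $r^n = e$ makes the rotation cyclic with period $n$, and $s$ reverses the direction of rotation.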

Preprint, 2021, arXiv

Video Deblurring by Fitting to Test Data

Motion blur in videos captured by autonomous vehicles and robots can degrade their perception capability. In this work, we present a novel approach to video deblurring by fitting a deep network to the test video. Our key observation is that some frames in a video with motion blur are much sharper than others, and thus we can transfer the texture information in those sharp frames to blurry frames. Our approach heuristically selects sharp frames from a video and then trains a convolutional neural network on these sharp frames. The trained network often absorbs enough details in the scene to perform deblurring on all the video frames. As an internal learning method, our approach has no domain gap between training and test data, which is a problem for existing video deblurring approaches. Experiments on real-world video data show that our model can reconstruct clearer and sharper videos than state-of-the-art video deblurring approaches. Code and data are available at https://github.com/xrenaa/Deblur-by-Fitting.
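The sharp-frame selection step can be sketched with a common sharpness proxy. The abstract does not say which heuristic the authors use, so the variance-of-Laplacian score, the `keep_ratio` parameter, and both function names below are assumptions for illustration only. Frames are plain nested lists of grayscale values to keep the example dependency-free.

```python
def laplacian_variance(frame):
    """Sharpness score: variance of a 4-neighbor Laplacian over interior pixels."""
    h, w = len(frame), len(frame[0])
    responses = [
        frame[y - 1][x] + frame[y + 1][x] + frame[y][x - 1] + frame[y][x + 1]
        - 4 * frame[y][x]
        for y in range(1, h - 1)
        for x in range(1, w - 1)
    ]
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

def select_sharp_frames(frames, keep_ratio=0.25):
    """Return indices (in temporal order) of the highest-scoring frames."""
    scores = [laplacian_variance(f) for f in frames]
    k = max(1, int(len(frames) * keep_ratio))
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

# Toy frames: a checkerboard has strong edges, a flat frame has none.
sharp = [[(x + y) % 2 for x in range(5)] for y in range(5)]
flat = [[0.5] * 5 for _ in range(5)]
selected = select_sharp_frames([flat, sharp, flat, flat])
```

In the approach described above, the selected frames would then serve as training data for the network fitted to the test video.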
