Researcher profile

Cheng Sun

Cheng Sun contributes to research discovery and scholarly infrastructure.

Researcher. Affiliation not imported. Open to collaborate.

Trust snapshot

Quick read

Trust 21 - Emerging
Verification L1
Unclaimed author
8 works
0 followers
3 topics
4 close collaborators

Published work

8 published items

preprint, 2026, arXiv

OpenVoxel: Training-Free Grouping and Captioning Voxels for Open-Vocabulary 3D Scene Understanding

We propose OpenVoxel, a training-free algorithm for grouping and captioning sparse voxels for open-vocabulary 3D scene understanding tasks. Given the sparse voxel rasterization (SVR) model obtained from multi-view images of a 3D scene, OpenVoxel produces meaningful groups that describe the different objects in the scene. By leveraging powerful Vision Language Models (VLMs) and Multi-modal Large Language Models (MLLMs), OpenVoxel also builds an informative scene map by captioning each group, enabling further 3D scene understanding tasks such as open-vocabulary segmentation (OVS) and referring expression segmentation (RES). Unlike previous methods, our method is training-free and does not introduce embeddings from a CLIP/BERT text encoder. Instead, we directly proceed with text-to-text search using MLLMs. Through extensive experiments, our method demonstrates superior performance compared to recent studies, particularly in complex referring expression segmentation (RES) tasks. The code will be made openly available.

preprint, 2022, arXiv

A bijection between the sets of $(a,b,b^2)$-Generalized Motzkin paths avoiding $\mathbf{uvv}$-patterns and $\mathbf{uvu}$-patterns

A generalized Motzkin path, called G-Motzkin path for short, of length $n$ is a lattice path from $(0, 0)$ to $(n, 0)$ in the first quadrant of the XOY-plane that consists of up steps $\mathbf{u}=(1, 1)$, down steps $\mathbf{d}=(1, -1)$, horizontal steps $\mathbf{h}=(1, 0)$ and vertical steps $\mathbf{v}=(0, -1)$. An $(a,b,c)$-G-Motzkin path is a weighted G-Motzkin path such that the $\mathbf{u}$-steps, $\mathbf{h}$-steps, $\mathbf{v}$-steps and $\mathbf{d}$-steps are weighted respectively by $1, a, b$ and $c$. Let $τ$ be a word on $\{\mathbf{u}, \mathbf{h}, \mathbf{v}, \mathbf{d}\}$, and denote by $\mathcal{G}_n^τ(a,b,c)$ the set of $τ$-avoiding $(a,b,c)$-G-Motzkin paths of length $n$. In this paper, we consider the $\mathbf{uvv}$-avoiding $(a,b,c)$-G-Motzkin paths and provide a direct bijection $σ$ between $\mathcal{G}_n^{\mathbf{uvv}}(a,b,b^2)$ and $\mathcal{G}_n^{\mathbf{uvu}}(a,b,b^2)$. Finally, the set of fixed points of $σ$ is also described and counted.
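The definitions in this abstract can be checked by brute force for small $n$. The following is a minimal sketch, not the paper's bijection $σ$: it enumerates G-Motzkin paths as words over u, d, h, v and sums their weights. The function names are hypothetical, and it assumes that "avoiding a pattern" means the pattern never occurs as a run of consecutive steps.

```python
from typing import Iterator

# Step vectors as given in the abstract: (dx, dy).
STEPS = {"u": (1, 1), "d": (1, -1), "h": (1, 0), "v": (0, -1)}


def g_motzkin_paths(n: int) -> Iterator[str]:
    """Yield every G-Motzkin path of length n as a word over u, d, h, v."""

    def extend(word: str, x: int, y: int) -> Iterator[str]:
        if x == n and y == 0:
            # No step can follow: u, d, h would pass x = n, v would go below y = 0.
            yield word
            return
        for name, (dx, dy) in STEPS.items():
            nx, ny = x + dx, y + dy
            if nx <= n and ny >= 0:  # stay in the first quadrant, never pass x = n
                yield from extend(word + name, nx, ny)

    yield from extend("", 0, 0)


def weight(word: str, a: int, b: int, c: int) -> int:
    """Product of step weights: u -> 1, h -> a, v -> b, d -> c."""
    per_step = {"u": 1, "h": a, "v": b, "d": c}
    total = 1
    for step in word:
        total *= per_step[step]
    return total


def weighted_count(n: int, pattern: str, a: int, b: int, c: int) -> int:
    """Total weight of length-n paths with no consecutive occurrence of pattern.

    Assumption: pattern avoidance is read as "not a contiguous subword".
    """
    return sum(
        weight(path, a, b, c)
        for path in g_motzkin_paths(n)
        if pattern not in path
    )


if __name__ == "__main__":
    a, b = 2, 3
    for n in range(4):
        print(n, weighted_count(n, "uvv", a, b, b * b),
              weighted_count(n, "uvu", a, b, b * b))
```

With $c=b^2$ the two weighted counts printed on each row should agree, which is what a bijection between the two sets implies for integer weights. The enumeration is exponential in $n$, so it is only suitable as a sanity check on small cases.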

preprint, 2022, arXiv

Direct Voxel Grid Optimization: Super-fast Convergence for Radiance Fields Reconstruction

We present a super-fast convergence approach to reconstructing the per-scene radiance field from a set of images that capture the scene with known poses. This task, which is often applied to novel view synthesis, has recently been revolutionized by Neural Radiance Field (NeRF) for its state-of-the-art quality and flexibility. However, NeRF and its variants require a lengthy training time ranging from hours to days for a single scene. In contrast, our approach achieves NeRF-comparable quality and converges rapidly from scratch in less than 15 minutes with a single GPU. We adopt a representation consisting of a density voxel grid for scene geometry and a feature voxel grid with a shallow network for complex view-dependent appearance. Modeling with explicit and discretized volume representations is not new, but we propose two simple yet non-trivial techniques that contribute to fast convergence speed and high-quality output. First, we introduce post-activation interpolation on voxel density, which is capable of producing sharp surfaces at lower grid resolution. Second, direct voxel density optimization is prone to suboptimal geometry solutions, so we robustify the optimization process by imposing several priors. Finally, evaluation on five inward-facing benchmarks shows that our method matches, if not surpasses, NeRF's quality, yet it takes only about 15 minutes to train from scratch for a new scene.
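The idea of post-activation interpolation can be illustrated in one dimension. The sketch below is not the authors' implementation: it assumes a softplus density activation followed by the usual alpha formula $1-\exp(-\text{density}\cdot\text{step})$, and the function names, voxel values and step size are all made up for the example. It compares activating the two neighboring voxel values before interpolating with interpolating the raw values first and activating afterwards.

```python
import math


def softplus(x: float) -> float:
    """Smooth, non-negative activation: log(1 + exp(x))."""
    return math.log1p(math.exp(x))


def alpha(raw_density: float, step: float = 1.0) -> float:
    """Opacity of one ray segment (assumed form: softplus, then 1 - exp(-density * step))."""
    return 1.0 - math.exp(-softplus(raw_density) * step)


def lerp(v0: float, v1: float, t: float) -> float:
    """Linear interpolation between two neighboring voxel values, 0 <= t <= 1."""
    return (1.0 - t) * v0 + t * v1


def pre_activation(v0: float, v1: float, t: float) -> float:
    """Activate each voxel first, then interpolate the opacities."""
    return lerp(alpha(v0), alpha(v1), t)


def post_activation(v0: float, v1: float, t: float) -> float:
    """Interpolate the raw voxel values first, then activate."""
    return alpha(lerp(v0, v1, t))


if __name__ == "__main__":
    empty, solid = -10.0, 10.0  # hypothetical raw values of two adjacent voxels
    for t in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"t={t:.2f}  pre={pre_activation(empty, solid, t):.3f}  "
              f"post={post_activation(empty, solid, t):.3f}")
```

Under these assumptions the first variant can only produce a linear ramp of opacity between the two voxels, while the second produces a much steeper transition inside a single cell, which is consistent with the abstract's claim about sharp surfaces at lower grid resolution.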

preprint, 2022, arXiv

Improved Direct Voxel Grid Optimization for Radiance Fields Reconstruction

In this technical report, we improve the DVGO framework (called DVGOv2), which is based on PyTorch and uses the simplest dense grid representation. First, we re-implement part of the PyTorch operations with CUDA, achieving a 2-3x speedup. The CUDA extension is automatically compiled just in time. Second, we extend DVGO to support forward-facing and unbounded inward-facing captures. Third, we improve the space and time complexity of the distortion loss proposed by mip-NeRF 360 from O(N^2) to O(N). The distortion loss improves our quality and training speed. Our efficient implementation could allow more future works to benefit from the loss.
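One way to see how a pairwise loss can drop from O(N^2) to O(N) is with prefix sums. The sketch below is an illustration in plain Python, not the DVGOv2 CUDA code. It assumes the distortion loss along one ray has the form $\sum_{i,j} w_i w_j |m_i - m_j| + \frac{1}{3}\sum_i w_i^2 (s_{i+1}-s_i)$, where $s$ are sorted interval endpoints, $m_i$ are interval midpoints and $w_i$ are the rendering weights; the function names are hypothetical.

```python
from typing import List, Sequence


def midpoints(s: Sequence[float]) -> List[float]:
    """Midpoint of each interval [s[i], s[i+1]]."""
    return [(s[i] + s[i + 1]) / 2.0 for i in range(len(s) - 1)]


def distortion_loss_quadratic(w: Sequence[float], s: Sequence[float]) -> float:
    """Reference version: explicit double loop over all interval pairs, O(N^2)."""
    m = midpoints(s)
    n = len(w)
    pair = sum(w[i] * w[j] * abs(m[i] - m[j]) for i in range(n) for j in range(n))
    intra = sum(w[i] ** 2 * (s[i + 1] - s[i]) for i in range(n)) / 3.0
    return pair + intra


def distortion_loss_linear(w: Sequence[float], s: Sequence[float]) -> float:
    """Same value in one pass, O(N), assuming s is sorted in ascending order.

    For sorted midpoints, sum_{i,j} w_i w_j |m_i - m_j|
    = 2 * sum_i w_i * (m_i * sum_{j<i} w_j - sum_{j<i} w_j m_j),
    so two running prefix sums replace the inner loop.
    """
    m = midpoints(s)
    pair = 0.0
    cum_w = 0.0   # running sum of w_j for j < i
    cum_wm = 0.0  # running sum of w_j * m_j for j < i
    intra = 0.0
    for i in range(len(w)):
        pair += 2.0 * w[i] * (m[i] * cum_w - cum_wm)
        intra += w[i] ** 2 * (s[i + 1] - s[i]) / 3.0
        cum_w += w[i]
        cum_wm += w[i] * m[i]
    return pair + intra


if __name__ == "__main__":
    weights = [0.1, 0.3, 0.4, 0.2]        # made-up rendering weights for one ray
    edges = [0.0, 0.2, 0.5, 0.9, 1.0]     # made-up sorted interval endpoints
    print(distortion_loss_quadratic(weights, edges))
    print(distortion_loss_linear(weights, edges))
```

The linear version also needs only O(N) memory because it never materializes the N-by-N matrix of pairwise distances, which is the part that dominates the cost in the naive form.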

preprint, 2022, arXiv

Multiview Regenerative Morphing with Dual Flows

This paper aims to address a new task of image morphing under a multiview setting, which takes two sets of multiview images as the input and generates intermediate renderings that not only exhibit smooth transitions between the two input sets but also ensure visual consistency across different views at any transition state. To achieve this goal, we propose a novel approach called Multiview Regenerative Morphing that formulates the morphing process as an optimization to solve for rigid transformation and optimal-transport interpolation. Given the multiview input images of the source and target scenes, we first learn a volumetric representation that models the geometry and appearance for each scene to enable the rendering of novel views. Then, the morphing between the two scenes is obtained by solving optimal transport between the two volumetric representations in Wasserstein metrics. Our approach does not rely on user-specified correspondences or 2D/3D input meshes, and we do not assume any predefined categories of the source and target scenes. The proposed view-consistent interpolation scheme directly works on multiview images to yield a novel and visually plausible effect of multiview free-form morphing.

preprint, 2022, arXiv

Self-supervised 360$^{\circ}$ Room Layout Estimation

We present the first self-supervised method to train panoramic room layout estimation models without any labeled data. Unlike per-pixel dense depth that provides abundant correspondence constraints, layout representation is sparse and topological, hindering the use of self-supervised reprojection consistency on images. To address this issue, we propose Differentiable Layout View Rendering, which can warp a source image to the target camera pose given the estimated layout from the target image. As each rendered pixel is differentiable with respect to the estimated layout, we can now train the layout estimation model by minimizing reprojection loss. In addition, we introduce regularization losses to encourage Manhattan alignment, ceiling-floor alignment, cycle consistency, and layout stretch consistency, which further improve our predictions. Finally, we present the first self-supervised results on the ZillowIndoor and MatterportLayout datasets. Our approach also shows promising results in data-scarce scenarios and active learning, which would have immediate value in real estate virtual tour software. Code is available at https://github.com/joshua049/Stereo-360-Layout.

preprint, 2022, arXiv

SVMAC: Unsupervised 3D Human Pose Estimation from a Single Image with Single-view-multi-angle Consistency

Recovering 3D human pose from 2D joints is still a challenging problem, especially without any 3D annotation, video information, or multi-view information. In this paper, we present an unsupervised GAN-based model consisting of multiple weight-sharing generators to estimate a 3D human pose from a single image without 3D annotations. In our model, we introduce single-view-multi-angle consistency (SVMAC) to significantly improve the estimation performance. With 2D joint locations as input, our model estimates a 3D pose and a camera simultaneously. During training, the estimated 3D pose is rotated by random angles and the estimated camera projects the rotated 3D poses back to 2D. The 2D reprojections are then fed into weight-sharing generators to estimate the corresponding 3D poses and cameras, which are mixed to impose SVMAC constraints that self-supervise the training process. The experimental results show that our method outperforms the state-of-the-art unsupervised methods on Human3.6M and MPI-INF-3DHP. Moreover, qualitative results on MPII and LSP show that our method can generalize well to unknown data.

preprint, 2022, arXiv

The $\mathbf{uvu}$-avoiding $(a,b,c)$-Generalized Motzkin paths with vertical steps: bijections and statistic enumerations

A generalized Motzkin path, called G-Motzkin path for short, of length $n$ is a lattice path from $(0, 0)$ to $(n, 0)$ in the first quadrant of the XOY-plane that consists of up steps $\mathbf{u}=(1, 1)$, down steps $\mathbf{d}=(1, -1)$, horizontal steps $\mathbf{h}=(1, 0)$ and vertical steps $\mathbf{v}=(0, -1)$. An $(a,b,c)$-G-Motzkin path is a weighted G-Motzkin path such that the $\mathbf{u}$-steps, $\mathbf{h}$-steps, $\mathbf{v}$-steps and $\mathbf{d}$-steps are weighted respectively by $1, a, b$ and $c$. In this paper, we first give three families of bijections: between the set of $\mathbf{uvu}$-avoiding $(a,b,b^2)$-G-Motzkin paths of length $n$ and the set of $(a,b)$-Schröder paths as well as the set of $(a+b,b)$-Dyck paths of length $2n$; between the set of $\{\mathbf{uvu, uu}\}$-avoiding $(a,b,b^2)$-G-Motzkin paths of length $n$ and the set of $(a+b,ab)$-Motzkin paths of length $n$; and between the set of $\{\mathbf{uvu,uu}\}$-avoiding $(a,b,b^2)$-G-Motzkin paths of length $n+1$ beginning with an $\mathbf{h}$-step weighted by $a$ and the set of $(a,b)$-Dyck paths of length $2n+2$. In the last section, we focus on the enumeration of the statistics "number of $\mathbf{z}$-steps" for $\mathbf{z}\in \{\mathbf{u}, \mathbf{h}, \mathbf{v}, \mathbf{d}\}$ and "number of points" at a given level in $\mathbf{uvu}$-avoiding G-Motzkin paths. These counting results are linked with Riordan arrays.