Researcher profile

Gang Zeng

Gang Zeng contributes to research discovery and scholarly infrastructure.

ResearcherAffiliation not importedOpen to collaborate

Trust snapshot

Quick read

Trust 21 - EmergingVerification L1Unclaimed author
11works
0followers
3topics
4close collaborators

Actions

Decide how to stay connected

Follow researcher0

Identity and collaboration

How to connect with this researcher

Claiming links this public author record to a researcher profile and unlocks direct collaboration workflows.

Log in to claim

Direct collaboration

Open a focused conversation when the fit is right

Claim this author entity first to unlock direct invitations.

Research graph

See the researcher in context

Open full explorer

Inspect adjacent work, topics, institutions and collaborators without jumping out to a separate graph page.

Building this graph slice

BZPEER is loading the nearby papers, people, topics and institutions for this page.

Published work

11 published item(s)

preprint2026arXiv

GaussianFluent: Gaussian Simulation for Dynamic Scenes with Mixed Materials

3D Gaussian Splatting (3DGS) has emerged as a prominent 3D representation for high-fidelity and real-time rendering. Prior work has coupled physics simulation with Gaussians, but predominantly targets soft, deformable materials, leaving brittle fracture largely unresolved. This stems from two key obstacles: the lack of volumetric interiors with coherent textures in GS representation, and the absence of fracture-aware simulation methods for Gaussians. To address these challenges, we introduce GaussianFluent, a unified framework for realistic simulation and rendering of dynamic object states. First, it synthesizes photorealistic interiors by densifying internal Gaussians guided by generative models. Second, it integrates an optimized Continuum Damage Material Point Method (CD-MPM) to enable brittle fracture simulation at remarkably high speed. Our approach handles complex scenarios including mixed-material objects and multi-stage fracture propagation, achieving results infeasible with previous methods. Experiments clearly demonstrate GaussianFluent's capability for photo-realistic, real-time rendering with structurally consistent interiors, highlighting its potential for downstream application, such as VR and Robotics.

preprint2022arXiv

Conditional DETR V2: Efficient Detection Transformer with Box Queries

In this paper, we are interested in Detection Transformer (DETR), an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing, such as NMS. Inspired by Conditional DETR, an improved DETR with fast training convergence, that presented box queries (originally called spatial queries) for internal decoder layers, we reformulate the object query into the format of the box query that is a composition of the embeddings of the reference point and the transformation of the box with respect to the reference point. This reformulation indicates the connection between the object query in DETR and the anchor box that is widely studied in Faster R-CNN. Furthermore, we learn the box queries from the image content, further improving the detection quality of Conditional DETR still with fast training convergence. In addition, we adopt the idea of axial self-attention to save the memory cost and accelerate the encoder. The resulting detector, called Conditional DETR V2, achieves better results than Conditional DETR, saves the memory cost and runs more efficiently. For example, for the DC$5$-ResNet-$50$ backbone, our approach achieves $44.8$ AP with $16.4$ FPS on the COCO $val$ set and compared to Conditional DETR, it runs $1.6\times$ faster, saves $74$\% of the overall memory cost, and improves $1.0$ AP score.

preprint2022arXiv

JVLDLoc: a Joint Optimization of Visual-LiDAR Constraints and Direction Priors for Localization in Driving Scenario

The ability for a moving agent to localize itself in environment is the basic demand for emerging applications, such as autonomous driving, etc. Many existing methods based on multiple sensors still suffer from drift. We propose a scheme that fuses map prior and vanishing points from images, which can establish an energy term that is only constrained on rotation, called the direction projection error. Then we embed these direction priors into a visual-LiDAR SLAM system that integrates camera and LiDAR measurements in a tightly-coupled way at backend. Specifically, our method generates visual reprojection error and point to Implicit Moving Least Square(IMLS) surface of scan constraints, and solves them jointly along with direction projection error at global optimization. Experiments on KITTI, KITTI-360 and Oxford Radar Robotcar show that we achieve lower localization error or Absolute Pose Error (APE) than prior map, which validates our method is effective.

preprint2022arXiv

MaskGroup: Hierarchical Point Grouping and Masking for 3D Instance Segmentation

This paper studies the 3D instance segmentation problem, which has a variety of real-world applications such as robotics and augmented reality. Since the surroundings of 3D objects are of high complexity, the separating of different objects is very difficult. To address this challenging problem, we propose a novel framework to group and refine the 3D instances. In practice, we first learn an offset vector for each point and shift it to its predicted instance center. To better group these points, we propose a Hierarchical Point Grouping algorithm to merge the centrally aggregated points progressively. All points are grouped into small clusters, which further gradually undergo another clustering procedure to merge into larger groups. These multi-scale groups are exploited for instance prediction, which is beneficial for predicting instances with different scales. In addition, a novel MaskScoreNet is developed to produce binary point masks of these groups for further refining the segmentation results. Extensive experiments conducted on the ScanNetV2 and S3DIS benchmarks demonstrate the effectiveness of the proposed method. For instance, our approach achieves a 66.4\% mAP with the 0.5 IoU threshold on the ScanNetV2 test set, which is 1.9\% higher than the state-of-the-art method.

preprint2022arXiv

Point Scene Understanding via Disentangled Instance Mesh Reconstruction

Semantic scene reconstruction from point cloud is an essential and challenging task for 3D scene understanding. This task requires not only to recognize each instance in the scene, but also to recover their geometries based on the partial observed point cloud. Existing methods usually attempt to directly predict occupancy values of the complete object based on incomplete point cloud proposals from a detection-based backbone. However, this framework always fails to reconstruct high fidelity mesh due to the obstruction of various detected false positive object proposals and the ambiguity of incomplete point observations for learning occupancy values of complete objects. To circumvent the hurdle, we propose a Disentangled Instance Mesh Reconstruction (DIMR) framework for effective point scene understanding. A segmentation-based backbone is applied to reduce false positive object proposals, which further benefits our exploration on the relationship between recognition and reconstruction. Based on the accurate proposals, we leverage a mesh-aware latent code space to disentangle the processes of shape completion and mesh generation, relieving the ambiguity caused by the incomplete point observations. Furthermore, with access to the CAD model pool at test time, our model can also be used to improve the reconstruction quality by performing mesh retrieval without extra training. We thoroughly evaluate the reconstructed mesh quality with multiple metrics, and demonstrate the superiority of our method on the challenging ScanNet dataset. Code is available at \url{https://github.com/ashawkey/dimr}.

preprint2020arXiv

3D Sketch-aware Semantic Scene Completion via Semi-supervised Structure Prior

The goal of the Semantic Scene Completion (SSC) task is to simultaneously predict a completed 3D voxel representation of volumetric occupancy and semantic labels of objects in the scene from a single-view observation. Since the computational cost generally increases explosively along with the growth of voxel resolution, most current state-of-the-arts have to tailor their framework into a low-resolution representation with the sacrifice of detail prediction. Thus, voxel resolution becomes one of the crucial difficulties that lead to the performance bottleneck. In this paper, we propose to devise a new geometry-based strategy to embed depth information with low-resolution voxel representation, which could still be able to encode sufficient geometric information, e.g., room layout, object's sizes and shapes, to infer the invisible areas of the scene with well structure-preserving details. To this end, we first propose a novel 3D sketch-aware feature embedding to explicitly encode geometric information effectively and efficiently. With the 3D sketch in hand, we further devise a simple yet effective semantic scene completion framework that incorporates a light-weight 3D Sketch Hallucination module to guide the inference of occupancy and the semantic labels via a semi-supervised structure prior learning strategy. We demonstrate that our proposed geometric embedding works better than the depth feature learning from habitual SSC frameworks. Our final model surpasses state-of-the-arts consistently on three public benchmarks, which only requires 3D volumes of 60 x 36 x 60 resolution for both input and output. The code and the supplementary material will be available at https://charlesCXK.github.io.

preprint2020arXiv

Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation

Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images for providing a geometric counterpart to the RGB representation. Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and models the problem as a cross-modal feature fusion to obtain better feature representations to achieve more accurate segmentation. This, however, may not lead to satisfactory results as actual depth data are generally noisy, which might worsen the accuracy as the networks go deeper. In this paper, we propose a unified and efficient Cross-modality Guided Encoder to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternatively. The key of the proposed architecture is a novel Separation-and-Aggregation Gating operation that jointly filters and recalibrates both representations before cross-modality aggregation. Meanwhile, a Bi-direction Multi-step Propagation strategy is introduced, on the one hand, to help to propagate and fuse information between the two modalities, and on the other hand, to preserve their specificity along the long-term propagation process. Besides, our proposed encoder can be easily injected into the previous encoder-decoder structures to boost their performance on RGB-D semantic segmentation. Our model outperforms state-of-the-arts consistently on both in-door and out-door challenging datasets. Code of this work is available at https://charlescxk.github.io/

preprint2020arXiv

Malleable 2.5D Convolution: Learning Receptive Fields along the Depth-axis for RGB-D Scene Parsing

Depth data provide geometric information that can bring progress in RGB-D scene parsing tasks. Several recent works propose RGB-D convolution operators that construct receptive fields along the depth-axis to handle 3D neighborhood relations between pixels. However, these methods pre-define depth receptive fields by hyperparameters, making them rely on parameter selection. In this paper, we propose a novel operator called malleable 2.5D convolution to learn the receptive field along the depth-axis. A malleable 2.5D convolution has one or more 2D convolution kernels. Our method assigns each pixel to one of the kernels or none of them according to their relative depth differences, and the assigning process is formulated as a differentiable form so that it can be learnt by gradient descent. The proposed operator runs on standard 2D feature maps and can be seamlessly incorporated into pre-trained CNNs. We conduct extensive experiments on two challenging RGB-D semantic segmentation dataset NYUDv2 and Cityscapes to validate the effectiveness and the generalization ability of our method.

preprint2013arXiv

Fast Approximate $K$-Means via Cluster Closures

$K$-means, a simple and effective clustering algorithm, is one of the most widely used algorithms in multimedia and computer vision community. Traditional $k$-means is an iterative algorithm---in each iteration new cluster centers are computed and each data point is re-assigned to its nearest center. The cluster re-assignment step becomes prohibitively expensive when the number of data points and cluster centers are large. In this paper, we propose a novel approximate $k$-means algorithm to greatly reduce the computational complexity in the assignment step. Our approach is motivated by the observation that most active points changing their cluster assignments at each iteration are located on or near cluster boundaries. The idea is to efficiently identify those active points by pre-assembling the data into groups of neighboring points using multiple random spatial partition trees, and to use the neighborhood information to construct a closure for each cluster, in such a way only a small number of cluster candidates need to be considered when assigning a data point to its nearest cluster. Using complexity analysis, image data clustering, and applications to image retrieval, we show that our approach out-performs state-of-the-art approximate $k$-means algorithms in terms of clustering quality and efficiency.

preprint2013arXiv

Fast Neighborhood Graph Search using Cartesian Concatenation

In this paper, we propose a new data structure for approximate nearest neighbor search. This structure augments the neighborhood graph with a bridge graph. We propose to exploit Cartesian concatenation to produce a large set of vectors, called bridge vectors, from several small sets of subvectors. Each bridge vector is connected with a few reference vectors near to it, forming a bridge graph. Our approach finds nearest neighbors by simultaneously traversing the neighborhood graph and the bridge graph in the best-first strategy. The success of our approach stems from two factors: the exact nearest neighbor search over a large number of bridge vectors can be done quickly, and the reference vectors connected to a bridge (reference) vector near the query are also likely to be near the query. Experimental results on searching over large scale datasets (SIFT, GIST and HOG) show that our approach outperforms state-of-the-art ANN search algorithms in terms of efficiency and accuracy. The combination of our approach with the IVFADC system also shows superior performance over the BIGANN dataset of $1$ billion SIFT features compared with the best previously published result.

preprint2013arXiv

Scalable $k$-NN graph construction

The $k$-NN graph has played a central role in increasingly popular data-driven techniques for various learning and vision tasks; yet, finding an efficient and effective way to construct $k$-NN graphs remains a challenge, especially for large-scale high-dimensional data. In this paper, we propose a new approach to construct approximate $k$-NN graphs with emphasis in: efficiency and accuracy. We hierarchically and randomly divide the data points into subsets and build an exact neighborhood graph over each subset, achieving a base approximate neighborhood graph; we then repeat this process for several times to generate multiple neighborhood graphs, which are combined to yield a more accurate approximate neighborhood graph. Furthermore, we propose a neighborhood propagation scheme to further enhance the accuracy. We show both theoretical and empirical accuracy and efficiency of our approach to $k$-NN graph construction and demonstrate significant speed-up in dealing with large scale visual data.