Source author record

Lei Ke

Lei Ke appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

Computer Vision Information Theory math.IT

Catalog footprint

What is connected

7works

3topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

Unlocking Dense Metric Depth Estimation in VLMs

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.

preprint2022arXiv

Video Mask Transfiner for High-Quality Video Instance Segmentation

While Video Instance Segmentation (VIS) has seen rapid progress, current approaches struggle to predict high-quality masks with accurate boundary details. Moreover, the predicted segmentations often fluctuate over time, suggesting that temporal consistency cues are neglected or not fully utilized. In this paper, we set out to tackle these issues, with the aim of achieving highly detailed and more temporally stable mask predictions for VIS. We first propose the Video Mask Transfiner (VMT) method, capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure. Our VMT detects and groups sparse error-prone spatio-temporal regions of each tracklet in the video segment, which are then refined using both local and instance-level cues. Second, we identify that the coarse boundary annotations of the popular YouTube-VIS dataset constitute a major limiting factor. Based on our VMT architecture, we therefore design an automated annotation refinement approach by iterative training and self-correction. To benchmark high-quality mask predictions for VIS, we introduce the HQ-YTVIS dataset, consisting of a manually re-annotated test set and our automatically refined training data. We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS, as well as the Youtube-VIS, OVIS and BDD100K MOTS benchmarks. Experimental results clearly demonstrate the efficacy and effectiveness of our method on segmenting complex and dynamic objects, by capturing precise details.

preprint2020arXiv

Commonality-Parsing Network across Shape and Appearance for Partially Supervised Instance Segmentation

Partially supervised instance segmentation aims to perform learning on limited mask-annotated categories of data thus eliminating expensive and exhaustive mask annotation. The learned models are expected to be generalizable to novel categories. Existing methods either learn a transfer function from detection to segmentation, or cluster shape priors for segmenting novel categories. We propose to learn the underlying class-agnostic commonalities that can be generalized from mask-annotated categories to novel categories. Specifically, we parse two types of commonalities: 1) shape commonalities which are learned by performing supervised learning on instance boundary prediction; and 2) appearance commonalities which are captured by modeling pairwise affinities among pixels of feature maps to optimize the separability between instance and the background. Incorporating both the shape and appearance commonalities, our model significantly outperforms the state-of-the-art methods on both partially supervised setting and few-shot setting for instance segmentation on COCO dataset.

preprint2020arXiv

GSNet: Joint Vehicle Pose and Shape Reconstruction with Geometrical and Scene-aware Supervision

We present a novel end-to-end framework named as GSNet (Geometric and Scene-aware Network), which jointly estimates 6DoF poses and reconstructs detailed 3D car shapes from single urban street view. GSNet utilizes a unique four-way feature extraction and fusion scheme and directly regresses 6DoF poses and shapes in a single forward pass. Extensive experiments show that our diverse feature extraction and fusion scheme can greatly improve model performance. Based on a divide-and-conquer 3D shape representation strategy, GSNet reconstructs 3D vehicle shape with great detail (1352 vertices and 2700 faces). This dense mesh representation further leads us to consider geometrical consistency and scene context, and inspires a new multi-objective loss function to regularize network training, which in turn improves the accuracy of 6D pose estimation and validates the merit of jointly performing both tasks. We evaluate GSNet on the largest multi-task ApolloCar3D benchmark and achieve state-of-the-art performance both quantitatively and qualitatively. Project page is available at https://lkeab.github.io/gsnet/.

preprint2011arXiv

Degrees of Freedom Region for an Interference Network with General Message Demands

We consider a single hop interference network with $K$ transmitters and $J$ receivers, all having $M$ antennas. Each transmitter emits an independent message and each receiver requests an arbitrary subset of the messages. This generalizes the well-known $K$-user $M$-antenna interference channel, where each message is requested by a unique receiver. For our setup, we derive the degrees of freedom (DoF) region. The achievability scheme generalizes the interference alignment schemes proposed by Cadambe and Jafar. In particular, we achieve general points in the DoF region by using multiple base vectors and aligning all interferers at a given receiver to the interferer with the largest DoF. As a byproduct, we obtain the DoF region for the original interference channel. We also discuss extensions of our approach where the same region can be achieved by considering a reduced set of interference alignment constraints, thus reducing the time-expansion duration needed. The DoF region for the considered system depends only on a subset of receivers whose demands meet certain characteristics. The geometric shape of the DoF region is also discussed.

preprint2010arXiv

Degrees of Freedom Regions of Two-User MIMO Z and Full Interference Channels: The Benefit of Reconfigurable Antennas

We study the degrees of freedom (DoF) regions of two-user multiple-input multiple-output (MIMO) Z and full interference channels in this paper. We assume that the receivers always have perfect channel state information. We first derive the DoF region of Z interference channel with channel state information at transmitter (CSIT). For full interference channel without CSIT, the DoF region has been fully characterized recently and it is shown that the previously known outer bound is not achievable. In this work, we investigate the no-CSIT case further by assuming that the transmitter has the ability of antenna mode switching. We obtain the DoF region as a function of the number of available antenna modes and reveal the incremental gain in DoF that each extra antenna mode can bring. It is shown that in certain cases the reconfigurable antennas can bring extra DoF gains. In these cases, the DoF region is maximized when the number of modes is at least equal to the number of receive antennas at the corresponding receiver, in which case the previously outer bound is achieved. In all cases, we propose systematic constructions of the beamforming and nulling matrices for achieving the DoF region. The constructions bear an interesting space-frequency interpretation.