Source author record

Kuang Gong

Kuang Gong appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

ResearcherUnclaimed source record

physics.med-ph eess.IV Computer Vision Machine Learning Artificial Intelligence

Catalog footprint

What is connected

5works

5topics

4close collaborators

Actions

Connect this record

Open graph Browse works

Inspect adjacent papers, topics, institutions and collaborators without losing the researcher page.

Building this map preview

BZPEER is loading the nearby papers, people, topics and institutions for this page.

preprint2026arXiv

MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation

Recent progress in medical vision-language models (VLMs) has achieved strong performance on image-level text-centric tasks such as report generation and visual question answering (VQA). However, achieving fine-grained visual grounding and volumetric spatial reasoning in 3D medical VLMs remains challenging, particularly when aiming to unify these capabilities within a single, generalizable framework. To address this challenge, we proposed MedVL-SAM2, a unified 3D medical multimodal model that concurrently supports report generation, VQA, and multi-paradigm segmentation, including semantic, referring, and interactive segmentation. MedVL-SAM2 integrates image-level reasoning and pixel-level perception through a cohesive architecture tailored for 3D medical imaging, and incorporates a SAM2-based volumetric segmentation module to enable precise multi-granular spatial reasoning. The model is trained in a multi-stage pipeline: it is first pre-trained on a large-scale corpus of 3D CT image-text pairs to align volumetric visual features with radiology-language embeddings. It is then jointly optimized with both language-understanding and segmentation objectives using a comprehensive 3D CT segmentation dataset. This joint training enables flexible interaction via language, point, or box prompts, thereby unifying high-level visual reasoning with spatially precise localization. Our unified architecture delivers state-of-the-art performance across report generation, VQA, and multiple 3D segmentation tasks. Extensive analyses further show that the model provides reliable 3D visual grounding, controllable interactive segmentation, and robust cross-modal reasoning, demonstrating that high-level semantic reasoning and precise 3D localization can be jointly achieved within a unified 3D medical VLM.

preprint2022arXiv

PET image denoising based on denoising diffusion probabilistic models

Due to various physical degradation factors and limited counts received, PET image quality needs further improvements. The denoising diffusion probabilistic models (DDPM) are distribution learning-based models, which try to transform a normal distribution into a specific data distribution based on iterative refinements. In this work, we proposed and evaluated different DDPM-based methods for PET image denoising. Under the DDPM framework, one way to perform PET image denoising is to provide the PET image and/or the prior image as the network input. Another way is to supply the prior image as the input with the PET image included in the refinement steps, which can fit for scenarios of different noise levels. 120 18F-FDG datasets and 140 18F-MK-6240 datasets were utilized to evaluate the proposed DDPM-based methods. Quantification show that the DDPM-based frameworks with PET information included can generate better results than the nonlocal mean and Unet-based denoising methods. Adding additional MR prior in the model can help achieve better performance and further reduce the uncertainty during image denoising. Solely relying on MR prior while ignoring the PET information can result in large bias. Regional and surface quantification shows that employing MR prior as the network input while embedding PET image as a data-consistency constraint during inference can achieve the best performance. In summary, DDPM-based PET image denoising is a flexible framework, which can efficiently utilize prior information and achieve better performance than the nonlocal mean and Unet-based denoising methods.

preprint2020arXiv

Clinically Translatable Direct Patlak Reconstruction from Dynamic PET with Motion Correction Using Convolutional Neural Network

Patlak model is widely used in 18F-FDG dynamic positron emission tomography (PET) imaging, where the estimated parametric images reveal important biochemical and physiology information. Because of better noise modeling and more information extracted from raw sinogram, direct Patlak reconstruction gains its popularity over the indirect approach which utilizes reconstructed dynamic PET images alone. As the prerequisite of direct Patlak methods, raw data from dynamic PET are rarely stored in clinics and difficult to obtain. In addition, the direct reconstruction is time-consuming due to the bottleneck of multiple-frame reconstruction. All of these impede the clinical adoption of direct Patlak reconstruction.In this work, we proposed a data-driven framework which maps the dynamic PET images to the high-quality motion-corrected direct Patlak images through a convolutional neural network. For the patient motion during the long period of dynamic PET scan, we combined the correction with the backward/forward projection in direct reconstruction to better fit the statistical model. Results based on fifteen clinical 18F-FDG dynamic brain PET datasets demonstrates the superiority of the proposed framework over Gaussian, nonlocal mean and BM4D denoising, regarding the image bias and contrast-to-noise ratio.

preprint2020arXiv

MR-Based PET Attenuation Correction using a Combined Ultrashort Echo Time/Multi-Echo Dixon Acquisition

We propose a magnetic resonance (MR)-based method for estimation of continuous linear attenuation coefficients (LAC) in positron emission tomography (PET) using a physical compartmental model and ultrashort echo time (UTE)/multi-echo Dixon (mUTE) acquisitions. Specifically, we propose a three-dimensional (3D) mUTE sequence to acquire signals from water, fat, and short-T2 components (e.g., bones) simultaneously in a single acquisition. The proposed mUTE sequence integrates 3D UTE with multi-echo Dixon acquisitions and uses sparse radial trajectories to accelerate imaging speed. Errors in the radial k-space trajectories are measured using a special k-space trajectory mapping sequence and corrected for image reconstruction. A physical compartmental model is used to fit the measured multi-echo MR signals to obtain fractions of water, fat and bone components for each voxel, which are then used to estimate the continuous LAC map for PET attenuation correction. The performance of the proposed method was evaluated via phantom and in vivo human studies, using LACs from Computed Tomography (CT) as reference. Compared to Dixon- and atlas-based MRAC methods, the proposed method yielded PET images with higher correlation and similarity in relation to the reference. The relative absolute errors of PET activity values reconstructed by the proposed method were below 5% in all of the four lobes (frontal, temporal, parietal, occipital), cerebellum, whole white matter and gray matter regions across all subjects (n=6). The proposed mUTE method can generate subject-specific, continuous LAC map for PET attenuation correction in PET/MR.

preprint2020arXiv

Super Resolution of Arterial Spin Labeling MR Imaging Using Unsupervised Multi-Scale Generative Adversarial Network

Arterial spin labeling (ASL) magnetic resonance imaging (MRI) is a powerful imaging technology that can measure cerebral blood flow (CBF) quantitatively. However, since only a small portion of blood is labeled compared to the whole tissue volume, conventional ASL suffers from low signal-to-noise ratio (SNR), poor spatial resolution, and long acquisition time. In this paper, we proposed a super-resolution method based on a multi-scale generative adversarial network (GAN) through unsupervised training. The network only needs the low-resolution (LR) ASL image itself for training and the T1-weighted image as the anatomical prior. No training pairs or pre-training are needed. A low-pass filter guided item was added as an additional loss to suppress the noise interference from the LR ASL image. After the network was trained, the super-resolution (SR) image was generated by supplying the upsampled LR ASL image and corresponding T1-weighted image to the generator of the last layer. Performance of the proposed method was evaluated by comparing the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) using normal-resolution (NR) ASL image (5.5 min acquisition) and high-resolution (HR) ASL image (44 min acquisition) as the ground truth. Compared to the nearest, linear, and spline interpolation methods, the proposed method recovers more detailed structure information, reduces the image noise visually, and achieves the highest PSNR and SSIM when using HR ASL image as the ground-truth.

Kuang Gong

What is connected

Connect this record

See the researcher in context

Building this map preview

5 published item(s)

MedVL-SAM2: A unified 3D medical vision-language model for multimodal reasoning and prompt-driven segmentation

PET image denoising based on denoising diffusion probabilistic models

Clinically Translatable Direct Patlak Reconstruction from Dynamic PET with Motion Correction Using Convolutional Neural Network

MR-Based PET Attenuation Correction using a Combined Ultrashort Echo Time/Multi-Echo Dixon Acquisition

Super Resolution of Arterial Spin Labeling MR Imaging Using Unsupervised Multi-Scale Generative Adversarial Network