Source author record

Jincai Huang

Jincai Huang appears in the imported research catalog. Authorship, coauthor and topic links are available while profile ownership is still unclaimed.

Researcher · Unclaimed source record

Artificial Intelligence, Machine Learning, Computer Vision

Catalog footprint

What is connected

3 works

3 topics

4 close collaborators

preprint, 2026, arXiv

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs

Surgical scene understanding is a cornerstone of computer-assisted intervention. While recent advances, particularly in surgical image segmentation, have driven progress, real-world clinical applications require a more holistic understanding that jointly captures procedural context, semantic reasoning, and precise visual grounding. However, existing approaches typically address these components in isolation, leading to fragmented representations and limited semantic consistency. To address this limitation, we propose SurgMLLM, a unified surgical scene understanding framework that bridges high-level reasoning and low-level visual grounding within a single model. Given surgical videos, SurgMLLM fine-tunes a multimodal large language model (MLLM) to support structured interpretability reasoning, which is used to jointly model phases, instrument-verb-target (IVT) triplets, and triplet-entity segmentation tokens. These tokens are then temporally aggregated and serve as prompts for a segmentation network, enabling accurate pixel-wise grounding of triplet instruments and targets. The entire framework is trained end-to-end with a unified objective that couples language-based reasoning supervision with visual grounding losses, promoting coherent cross-task learning and clinically consistent scene representations. To facilitate unified evaluation, we introduce CholecT45-Scene, which extends the CholecT45 dataset with 64,299 frames of pixel-level mask annotations for instruments and targets, aligned with the existing triplet labels. Extensive experiments show that SurgMLLM significantly advances surgical scene understanding, improving the primary triplet recognition metric AP_IVT from 40.7% to 46.0% and consistently outperforming prior methods in phase recognition and segmentation. These results highlight the effectiveness of unified reasoning-and-grounding for reliable, context-aware surgical assistance.
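To put the reported AP_IVT change in perspective, the short sketch below separates the absolute gain (in percentage points) from the relative gain over the baseline. Only the two figures 40.7 and 46.0 come from the abstract; the variable names are illustrative.

```python
# Figures quoted in the abstract: AP_IVT before and after, in percent.
baseline_ap_ivt = 40.7
surgmllm_ap_ivt = 46.0

# Absolute gain is measured in percentage points.
absolute_gain = surgmllm_ap_ivt - baseline_ap_ivt
# Relative gain is expressed as a percentage of the baseline score.
relative_gain = 100.0 * absolute_gain / baseline_ap_ivt

print(f"absolute gain: {absolute_gain:.1f} points")  # absolute gain: 5.3 points
print(f"relative gain: {relative_gain:.1f}%")        # relative gain: 13.0%
```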

preprint, 2022, arXiv

Automated Dilated Spatio-Temporal Synchronous Graph Modeling for Traffic Prediction

Accurate traffic prediction is a challenging task in intelligent transportation systems because of the complex spatio-temporal dependencies in transportation networks. Many existing works combine sophisticated temporal modeling approaches with graph convolution networks (GCNs) to capture short-term and long-term spatio-temporal dependencies. However, these separate modules with complicated designs can restrict the effectiveness and efficiency of spatio-temporal representation learning. Furthermore, most previous works adopt fixed graph construction methods to characterize the global spatio-temporal relations, which limits the learning capability of the model across different time periods and even different data scenarios. To overcome these limitations, we propose an automated dilated spatio-temporal synchronous graph network, named Auto-DSTSGN, for traffic prediction. Specifically, we design an automated dilated spatio-temporal synchronous graph (Auto-DSTSG) module that captures short-term and long-term spatio-temporal correlations by stacking deeper layers with dilation factors in increasing order. Further, we propose a graph structure search approach to automatically construct the spatio-temporal synchronous graph so that it can adapt to different data scenarios. Extensive experiments on four real-world datasets demonstrate that our model achieves improvements of about 10% over state-of-the-art methods. Source code is available at https://github.com/jinguangyin/Auto-DSTSGN.
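The idea of stacking layers with increasing dilation factors can be illustrated with the standard receptive-field formula for stride-1 dilated layers. This is a generic sketch, not the Auto-DSTSGN implementation: the function name `receptive_field`, the kernel size of 2, and the dilation schedule `[1, 2, 4]` are assumptions chosen for illustration.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in time steps, of stacked stride-1 dilated layers.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field


# Hypothetical schedule: three layers with dilation doubling at each depth.
increasing = receptive_field(kernel_size=2, dilations=[1, 2, 4])
# Same depth without dilation, for comparison.
undilated = receptive_field(kernel_size=2, dilations=[1, 1, 1])

print(increasing)  # 8
print(undilated)   # 4
```

With the same number of layers, the increasing schedule covers twice as many time steps, which is why lower layers can focus on short-term patterns while deeper layers reach long-term ones.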

preprint, 2020, arXiv

Deep Multi-View Spatiotemporal Virtual Graph Neural Network for Significant Citywide Ride-hailing Demand Prediction

Urban ride-hailing demand prediction is a crucial but challenging task for intelligent transportation system construction. Predictable ride-hailing demand can facilitate more reasonable vehicle scheduling and online car-hailing platform dispatch. Conventional deep learning methods that use no external structured data can be built as hybrid models of CNNs and RNNs by meshing plentiful pixel-level labeled data, but spatial data sparsity and a limited capacity to learn long-term temporal dependencies remain two striking bottlenecks. To address these limitations, we propose a new virtual graph modeling method that focuses on significant demand regions, and a novel Deep Multi-View Spatiotemporal Virtual Graph Neural Network (DMVST-VGNN) that strengthens the learning of spatial dynamics and long-term temporal dependencies. Specifically, DMVST-VGNN integrates a 1D Convolutional Neural Network, a Multi Graph Attention Neural Network and a Transformer layer, which correspond to the short-term temporal dynamics view, the spatial dynamics view and the long-term temporal dynamics view, respectively. Experiments are conducted on two large-scale New York City datasets in fine-grained prediction scenes, and the results demonstrate the effectiveness and superiority of the DMVST-VGNN framework in significant citywide ride-hailing demand prediction.
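One way to picture a graph that focuses on significant demand regions is sketched below. The abstract does not describe the construction, so everything here is an assumption made for illustration: the helper `build_virtual_graph`, the total-demand threshold `min_total`, and the Manhattan-distance rule `max_dist` for linking regions are hypothetical and are not the paper's method.

```python
from itertools import combinations


def build_virtual_graph(demand, min_total, max_dist):
    """Keep high-demand grid cells as nodes and link the ones that are close.

    demand: dict mapping a (row, col) grid cell to a list of demand counts,
            one count per time interval.
    """
    # Nodes: cells whose total demand reaches the (assumed) threshold.
    nodes = sorted(cell for cell, series in demand.items()
                   if sum(series) >= min_total)
    # Edges: pairs of kept cells within an (assumed) Manhattan distance.
    edges = set()
    for a, b in combinations(nodes, 2):
        if abs(a[0] - b[0]) + abs(a[1] - b[1]) <= max_dist:
            edges.add((a, b))
    return nodes, edges


# Toy data: four grid cells with three time intervals each.
demand = {
    (0, 0): [5, 7, 9],
    (0, 1): [0, 0, 1],   # sparse cell, dropped from the graph
    (1, 1): [4, 6, 8],
    (5, 5): [10, 12, 9],
}
nodes, edges = build_virtual_graph(demand, min_total=10, max_dist=2)
print(nodes)  # [(0, 0), (1, 1), (5, 5)]
print(edges)  # {((0, 0), (1, 1))}
```

Dropping sparse cells before building the graph is one plausible response to the spatial data sparsity mentioned in the abstract, since the model then spends its capacity only on regions with meaningful demand.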