Related papers: SurgicalVLM-Agent: Towards an Interactive AI Co-Pilot for Pituitary Surgery

URL: http://arxiv.org/abs/2503.09474v1
Date: Wed, 12 Mar 2025 15:30:39 GMT
Title: SurgicalVLM-Agent: Towards an Interactive AI Co-Pilot for Pituitary Surgery
Authors: Jiayuan Huang, Runlong He, Danyal Z. Khan, Evangelos Mazomenos, Danail Stoyanov, Hani J. Marcus, Matthew J. Clarkson, Mobarakol Islam
Abstract summary: Large vision-language models (VLMs) offer a promising solution by enabling dynamic task planning and predictive decision support. We introduce SurgicalVLM-Agent, an AI co-pilot for image-guided pituitary surgery, capable of conversation, planning, and task execution.
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image-guided surgery demands adaptive, real-time decision support, yet static AI models struggle with structured task planning and providing interactive guidance. Large vision-language models (VLMs) offer a promising solution by enabling dynamic task planning and predictive decision support. We introduce SurgicalVLM-Agent, an AI co-pilot for image-guided pituitary surgery, capable of conversation, planning, and task execution. The agent dynamically processes surgeon queries and plans the tasks such as MRI tumor segmentation, endoscope anatomy segmentation, overlaying preoperative imaging with intraoperative views, instrument tracking, and surgical visual question answering (VQA). To enable structured task planning, we develop the PitAgent dataset, a surgical context-aware dataset covering segmentation, overlaying, instrument localization, tool tracking, tool-tissue interactions, phase identification, and surgical activity recognition. Additionally, we propose FFT-GaLore, a fast Fourier transform (FFT)-based gradient projection technique for efficient low-rank adaptation, optimizing fine-tuning for LLaMA 3.2 in surgical environments. We validate SurgicalVLM-Agent by assessing task planning and prompt generation on our PitAgent dataset and evaluating zero-shot VQA using a public pituitary dataset. Results demonstrate state-of-the-art performance in task planning and query interpretation, with highly semantically meaningful VQA responses, advancing AI-driven surgical assistance.

Related papers

Surgeons vs. Computer Vision: A comparative analysis on surgical phase recognition capabilities [65.66373425605278]
Automated Surgical Phase Recognition (SPR) uses Artificial Intelligence (AI) to segment the surgical workflow into its key events. Previous research has focused on short and linear surgical procedures and has not explored if temporal context influences experts' ability to better classify surgical phases. This research addresses these gaps, focusing on Robot-Assisted Partial Nephrectomy (RAPN) as a highly non-linear procedure.
arXiv Detail & Related papers (2025-04-26T15:37:22Z)
SurgRAW: Multi-Agent Workflow with Chain-of-Thought Reasoning for Surgical Intelligence [16.584722724845182]
Integration of Vision-Language Models in surgical intelligence is hindered by hallucinations, domain knowledge gaps, and limited understanding of task interdependencies. We present SurgRAW, a CoT-driven multi-agent framework that delivers transparent, interpretable insights for most tasks in robotic-assisted surgery.
arXiv Detail & Related papers (2025-03-13T11:23:13Z)
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery [15.47190687192761]
We introduce Surgical-LVLM, a novel personalized large vision-language model tailored for complex surgical scenarios. We demonstrate the effectiveness of Surgical-LVLM on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly introduced EndoVis Conversations dataset.
arXiv Detail & Related papers (2024-03-22T08:38:27Z)
Hypergraph-Transformer (HGT) for Interactive Event Prediction in Laparoscopic and Robotic Surgery [47.47211257890948]
We propose a predictive neural network that is capable of understanding and predicting critical interactive aspects of surgical workflow from intra-abdominal video. We verify our approach on established surgical datasets and applications, including the detection and prediction of action triplets. Our results demonstrate the superiority of our approach compared to unstructured alternatives.
arXiv Detail & Related papers (2024-02-03T00:58:05Z)
Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery [18.248882845789353]
We develop a surgical question-answering system to facilitate robot-assisted surgical scene and activity understanding from recorded videos. Most of the existing VQA methods require an object detector and regions based feature extractor to extract visual features and fuse them with the embedded text of the question for answer generation. We propose Visual Question Localized-Answering in Robotic Surgery (Surgical-VQLA) to localize the specific surgical area during the answer prediction.
arXiv Detail & Related papers (2023-05-19T14:13:47Z)
Robotic Navigation Autonomy for Subretinal Injection via Intelligent Real-Time Virtual iOCT Volume Slicing [88.99939660183881]
We propose a framework for autonomous robotic navigation for subretinal injection. Our method consists of an instrument pose estimation method, an online registration between the robotic and the iOCT system, and trajectory planning tailored for navigation to an injection target. Our experiments on ex-vivo porcine eyes demonstrate the precision and repeatability of the method.
arXiv Detail & Related papers (2023-01-17T21:41:21Z)
CholecTriplet2021: A benchmark challenge for surgical action triplet recognition [66.51610049869393]
This paper presents CholecTriplet 2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos. We present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge. A total of 4 baseline methods and 19 new deep learning algorithms are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%.
arXiv Detail & Related papers (2022-04-10T18:51:55Z)
Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views. We then introduce the Multimodal Semantic Graph Scene (MSSG) which aims at providing unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z)
Temporal Segmentation of Surgical Sub-tasks through Deep Learning with Multiple Data Sources [14.677001578868872]
We propose a unified surgical state estimation model based on the actions performed or events occurred as the task progresses. We evaluate our model on the JHU-ISI Gesture and Skill Assessment Working Set (JIGSAWS) and a more complex dataset involving robotic intra-operative ultrasound (RIOUS) imaging. Our model achieves a superior frame-wise state estimation accuracy up to 89.4%, which improves the state-of-the-art surgical state estimation models.
arXiv Detail & Related papers (2020-02-07T17:49:08Z)