VS-Assistant: Versatile Surgery Assistant on the Demand of Surgeons
- URL: http://arxiv.org/abs/2405.08272v1
- Date: Tue, 14 May 2024 02:05:36 GMT
- Title: VS-Assistant: Versatile Surgery Assistant on the Demand of Surgeons
- Authors: Zhen Chen, Xingjian Luo, Jinlin Wu, Danny T. M. Chan, Zhen Lei, Jinqiao Wang, Sebastien Ourselin, Hongbin Liu,
- Abstract summary: We propose a Versatile Surgery Assistant (VS-Assistant) that can accurately understand the surgeon's intention.
We devise a surgical-Calling Tuning strategy to enable the VS-Assistant to understand surgical intentions.
- Score: 29.783300422432763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The surgical intervention is crucial to patient healthcare, and many studies have developed advanced algorithms to provide understanding and decision-making assistance for surgeons. Despite great progress, these algorithms are developed for a single specific task and scenario, and in practice require the manual combination of different functions, thus limiting the applicability. Thus, an intelligent and versatile surgical assistant is expected to accurately understand the surgeon's intentions and accordingly conduct the specific tasks to support the surgical process. In this work, by leveraging advanced multimodal large language models (MLLMs), we propose a Versatile Surgery Assistant (VS-Assistant) that can accurately understand the surgeon's intention and complete a series of surgical understanding tasks, e.g., surgical scene analysis, surgical instrument detection, and segmentation on demand. Specifically, to achieve superior surgical multimodal understanding, we devise a mixture of projectors (MOP) module to align the surgical MLLM in VS-Assistant to balance the natural and surgical knowledge. Moreover, we devise a surgical Function-Calling Tuning strategy to enable the VS-Assistant to understand surgical intentions, and thus make a series of surgical function calls on demand to meet the needs of the surgeons. Extensive experiments on neurosurgery data confirm that our VS-Assistant can understand the surgeon's intention more accurately than the existing MLLM, resulting in overwhelming performance in textual analysis and visual tasks. Source code and models will be made public.
Related papers
- OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining [55.15365161143354]
OphCLIP is a hierarchical retrieval-augmented vision-language pretraining framework for ophthalmic surgical workflow understanding.
OphCLIP learns both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles.
Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos.
arXiv Detail & Related papers (2024-11-23T02:53:08Z) - Demystifying Large Language Models for Medicine: A Primer [50.83806796466396]
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare.
This tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice.
arXiv Detail & Related papers (2024-10-24T15:41:56Z) - Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data.
We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z) - GP-VLS: A general-purpose vision language model for surgery [0.5249805590164902]
GP-VLS is a general-purpose vision language model for surgery.
It integrates medical and surgical knowledge with visual scene understanding.
We show GP-VLS significantly outperforms open- and closed-source models on surgical vision-language tasks.
arXiv Detail & Related papers (2024-07-27T17:27:05Z) - Toward a Surgeon-in-the-Loop Ophthalmic Robotic Apprentice using Reinforcement and Imitation Learning [18.72371138886818]
We propose an image-guided approach for surgeon-centered autonomous agents during ophthalmic cataract surgery.
By integrating the surgeon's actions and preferences into the training process, our approach enables the robot to implicitly learn and adapt to the individual surgeon's unique techniques.
arXiv Detail & Related papers (2023-11-29T15:00:06Z) - SAMSNeRF: Segment Anything Model (SAM) Guides Dynamic Surgical Scene
Reconstruction by Neural Radiance Field (NeRF) [4.740415113160021]
We propose a novel approach called SAMSNeRF that combines Segment Anything Model (SAM) and Neural Radiance Field (NeRF) techniques.
Our experimental results on public endoscopy surgical videos demonstrate that our approach successfully reconstructs high-fidelity dynamic surgical scenes.
arXiv Detail & Related papers (2023-08-22T20:31:00Z) - Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [51.78027546947034]
Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics.
We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals.
We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions.
arXiv Detail & Related papers (2023-07-27T22:38:12Z) - CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual
Question Localized-Answering in Robotic Surgery [14.52406034300867]
A surgical Visual Question Localized-Answering (VQLA) system would be helpful for medical students and junior surgeons to learn and understand from recorded surgical videos.
We propose an end-to-end Transformer with the Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios.
The proposed method provides a promising solution for surgical scene understanding, and opens up a primary step in the Artificial Intelligence (AI)-based VQLA system for surgical training.
arXiv Detail & Related papers (2023-07-11T11:35:40Z) - Quantification of Robotic Surgeries with Vision-Based Deep Learning [45.165919577877695]
We propose a unified deep learning framework, entitled Roboformer, which operates exclusively on videos recorded during surgery.
We validated our framework on four video-based datasets of two commonly-encountered types of steps within minimally-invasive robotic surgeries.
arXiv Detail & Related papers (2022-05-06T06:08:35Z) - CholecTriplet2021: A benchmark challenge for surgical action triplet
recognition [66.51610049869393]
This paper presents CholecTriplet 2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos.
We present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge.
A total of 4 baseline methods and 19 new deep learning algorithms are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%.
arXiv Detail & Related papers (2022-04-10T18:51:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.