Related papers: Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models

Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models

URL: http://arxiv.org/abs/2410.09750v1
Date: Sun, 13 Oct 2024 07:12:35 GMT
Title: Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models
Authors: Juseong Jin, Chang Wook Jeong,
Abstract summary: We introduce an LVLM specifically designed for surgical scenarios. We establish a LVLM model, Surgical-LLaVA, fine-tuned on instruction following data of surgical scenarios. Experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts.
Score: 1.4042211166197214
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Conversation agents powered by large language models are revolutionizing the way we interact with visual data. Recently, large vision-language models (LVLMs) have been extensively studied for both images and videos. However, these studies typically focus on common scenarios. In this work, we introduce an LVLM specifically designed for surgical scenarios. We integrate visual representations of surgical images and videos into the language feature space. Consequently, we establish a LVLM model, Surgical-LLaVA, fine-tuned on instruction following data of surgical scenarios. Our experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts, occasionally displaying multi-modal behaviors on unseen instructions. We conduct a quantitative evaluation of visual question-answering datasets for surgical scenarios. The results show superior performance compared to previous works, indicating the potential of our model to tackle more complex surgery scenarios.

Related papers

SurgLLM: A Versatile Large Multimodal Model with Spatial Focus and Temporal Awareness for Surgical Video Understanding [75.00667948967848]
The SurgLLM framework is a large multimodal model tailored for versatile surgical video understanding tasks.<n>To empower the spatial focus of surgical videos, we first devise Surgical Context-aware Multimodal Pretraining (Surg-Pretrain) for the video encoder of SurgLLM.<n>To incorporate surgical temporal knowledge into SurgLLM, we further propose Temporal-aware Multimodal Tuning (TM-Tuning) to enhance temporal reasoning with interleaved multimodal embeddings.
arXiv Detail & Related papers (2025-08-30T04:36:41Z)
SurgVidLM: Towards Multi-grained Surgical Video Understanding with Large Language Model [55.13206879750197]
SurgVidLM is the first video language model designed to address both full and fine-grained surgical video comprehension.<n>We introduce the StageFocus mechanism which is a two-stage framework performing the multi-grained, progressive understanding of surgical videos.<n> Experimental results demonstrate that SurgVidLM significantly outperforms state-of-the-art Vid-LLMs in both full and fine-grained video understanding tasks.
arXiv Detail & Related papers (2025-06-22T02:16:18Z)
Challenging Vision-Language Models with Surgical Data: A New Dataset and Broad Benchmarking Study [0.6120768859742071]
We present the first large-scale study assessing the capabilities of Vision Language Models (VLMs) for endoscopic tasks.<n>Using a diverse set of state-of-the-art models, multiple surgical datasets, and extensive human reference annotations, we address three key research questions.<n>Our results reveal that VLMs can effectively perform basic surgical perception tasks, such as object counting and localization, with performance levels comparable to general domain tasks.
arXiv Detail & Related papers (2025-06-06T16:53:12Z)
SurgVLM: A Large Vision-Language Model and Systematic Evaluation Benchmark for Surgical Intelligence [72.10889173696928]
We propose SurgVLM, one of the first large vision-language foundation models for surgical intelligence.<n>We construct a large-scale multimodal surgical database, SurgVLM-DB, spanning more than 16 surgical types and 18 anatomical structures.<n>Building upon this comprehensive dataset, we propose SurgVLM, which is built upon Qwen2.5-VL, and undergoes instruction tuning to 10+ surgical tasks.
arXiv Detail & Related papers (2025-06-03T07:44:41Z)
EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery [52.992415247012296]
We introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks.
arXiv Detail & Related papers (2025-01-20T09:12:06Z)
OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining [55.15365161143354]
OphCLIP is a hierarchical retrieval-augmented vision-language pretraining framework for ophthalmic surgical workflow understanding. OphCLIP learns both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles. Our OphCLIP also designs a retrieval-augmented pretraining framework to leverage the underexplored large-scale silent surgical procedure videos.
arXiv Detail & Related papers (2024-11-23T02:53:08Z)
Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
LLaVA-Surg: Towards Multimodal Surgical Assistant via Structured Surgical Video Learning [15.646322352232819]
We create a new dataset, Surg-QA, consisting of 102,000 surgical video-instruction pairs. We propose a novel two-stage question-answer generation pipeline with LLM to learn surgical knowledge. We train LLaVA-Surg, a novel vision-language conversational assistant capable of answering open-ended questions about surgical videos.
arXiv Detail & Related papers (2024-08-15T07:00:20Z)
HecVL: Hierarchical Video-Language Pretraining for Zero-shot Surgical Phase Recognition [51.222684687924215]
HecVL is a novel hierarchical video-language pretraining approach for building a generalist surgical model. We propose a novel fine-to-coarse contrastive learning framework that learns separate embedding spaces for the three video-text hierarchies. By disentangling embedding spaces of different hierarchical levels, the learned multi-modal representations encode short-term and long-term surgical concepts in the same model.
arXiv Detail & Related papers (2024-05-16T13:14:43Z)
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery [15.47190687192761]
We introduce Surgical-LVLM, a novel personalized large vision-language model tailored for complex surgical scenarios. We demonstrate the effectiveness of Surgical-LVLM on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly introduced EndoVis Conversations dataset.
arXiv Detail & Related papers (2024-03-22T08:38:27Z)
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [51.78027546947034]
Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics. We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals. We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions.
arXiv Detail & Related papers (2023-07-27T22:38:12Z)
CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery [14.52406034300867]
A surgical Visual Question Localized-Answering (VQLA) system would be helpful for medical students and junior surgeons to learn and understand from recorded surgical videos. We propose an end-to-end Transformer with the Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios. The proposed method provides a promising solution for surgical scene understanding, and opens up a primary step in the Artificial Intelligence (AI)-based VQLA system for surgical training.
arXiv Detail & Related papers (2023-07-11T11:35:40Z)
XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model. It can analyze and answer open-ended questions about chest radiographs. We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures [70.69948035469467]
We take advantage of the latest computer vision methodologies for generating 3D graphs from camera views. We then introduce the Multimodal Semantic Graph Scene (MSSG) which aims at providing unified symbolic and semantic representation of surgical procedures.
arXiv Detail & Related papers (2021-06-09T14:35:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.