Can DeepSeek Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery
- URL: http://arxiv.org/abs/2503.23130v3
- Date: Fri, 04 Apr 2025 02:45:12 GMT
- Title: Can DeepSeek Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery
- Authors: Boyi Ma, Yanguang Zhao, Jie Wang, Guankun Wang, Kun Yuan, Tong Chen, Long Bai, Hongliang Ren
- Abstract summary: We investigate the dialogue capabilities of the DeepSeek model in robotic surgery scenarios. Our empirical study shows that, compared to existing general-purpose multimodal large language models, DeepSeek-VL2 performs better on complex understanding tasks. Although DeepSeek-V3 is purely a language model, we find that when image tokens are input directly, the model demonstrates better performance on single-sentence QA tasks.
- Score: 17.728772280544444
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The DeepSeek models have shown exceptional performance in general scene understanding, question-answering (QA), and text generation tasks, owing to their efficient training paradigm and strong reasoning capabilities. In this study, we investigate the dialogue capabilities of the DeepSeek model in robotic surgery scenarios, focusing on tasks such as Single Phrase QA, Visual QA, and Detailed Description. The Single Phrase QA tasks further include sub-tasks such as surgical instrument recognition, action understanding, and spatial position analysis. We conduct extensive evaluations using publicly available datasets, including EndoVis18 and CholecT50, along with their corresponding dialogue data. Our empirical study shows that, compared to existing general-purpose multimodal large language models, DeepSeek-VL2 performs better on complex understanding tasks in surgical scenes. Additionally, although DeepSeek-V3 is purely a language model, we find that when image tokens are input directly, the model demonstrates better performance on single-sentence QA tasks. However, overall, the DeepSeek models still fall short of meeting the clinical requirements for understanding surgical scenes. Under general prompts, DeepSeek models lack the ability to effectively analyze global surgical concepts and fail to provide detailed insights into surgical scenarios. Based on our observations, we argue that the DeepSeek models are not ready for vision-language tasks in surgical contexts without fine-tuning on surgery-specific datasets.
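The evaluation described in the abstract reduces to scoring a model's short answers against paired dialogue annotations for each surgical frame. Below is a minimal sketch of such a harness in Python, assuming a JSON dialogue file with hypothetical image_path, question, and answer fields and a pluggable inference callable standing in for DeepSeek-VL2 or DeepSeek-V3; the file name, field names, and exact-match scoring rule are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of a Single Phrase QA evaluation harness (illustrative only).
# Assumes JSON records with "image_path", "question", and "answer" fields;
# these names, the file name, and the exact-match rule are assumptions for
# this sketch, not details from the paper.
import json
from typing import Callable, Dict, List


def evaluate_single_phrase_qa(
    records: List[Dict[str, str]],
    answer_fn: Callable[[str, str], str],
) -> float:
    """Return exact-match accuracy of answer_fn(image_path, question)."""
    correct = 0
    for rec in records:
        prediction = answer_fn(rec["image_path"], rec["question"])
        # Single-phrase answers (instrument, action, spatial position) are
        # compared after light normalisation; the paper's rule may differ.
        if prediction.strip().lower() == rec["answer"].strip().lower():
            correct += 1
    return correct / max(len(records), 1)


if __name__ == "__main__":
    # Hypothetical dialogue file derived from EndoVis18-style annotations.
    with open("endovis18_single_phrase_qa.json") as f:
        records = json.load(f)

    def dummy_model(image_path: str, question: str) -> str:
        # Placeholder for a real VLM inference call (e.g., DeepSeek-VL2).
        return "grasper"

    accuracy = evaluate_single_phrase_qa(records, dummy_model)
    print(f"Exact-match accuracy: {accuracy:.3f}")
```

A real run would replace dummy_model with the model's generation call; the Visual QA and Detailed Description tasks would additionally need free-text metrics rather than exact match.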
Related papers
- Surgeons vs. Computer Vision: A comparative analysis on surgical phase recognition capabilities [65.66373425605278]
Automated Surgical Phase Recognition (SPR) uses Artificial Intelligence (AI) to segment the surgical workflow into its key events.
Previous research has focused on short, linear surgical procedures and has not explored whether temporal context influences experts' ability to classify surgical phases more accurately.
This research addresses these gaps, focusing on Robot-Assisted Partial Nephrectomy (RAPN) as a highly non-linear procedure.
arXiv Detail & Related papers (2025-04-26T15:37:22Z) - Systematic Evaluation of Large Vision-Language Models for Surgical Artificial Intelligence [1.1765603103920352]
Large Vision-Language Models offer a new paradigm for AI-driven image understanding.
This flexibility holds particular promise across medicine, where expert-annotated data is scarce.
Here, we present a comprehensive analysis of 11 state-of-the-art VLMs across 17 key visual understanding tasks in surgical AI.
arXiv Detail & Related papers (2025-04-03T17:42:56Z) - EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery [52.992415247012296]
We introduce EndoChat to address various dialogue paradigms and subtasks in surgical scene understanding. Our model achieves state-of-the-art performance across five dialogue paradigms and eight surgical scene understanding tasks.
arXiv Detail & Related papers (2025-01-20T09:12:06Z) - OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining [60.75854609803651]
OphCLIP is a hierarchical retrieval-augmented vision-language pretraining framework for ophthalmic surgical workflow understanding. OphCLIP learns both fine-grained and long-term visual representations by aligning short video clips with detailed narrative descriptions and full videos with structured titles. OphCLIP also incorporates a retrieval-augmented pretraining scheme to leverage large-scale, underexplored silent surgical procedure videos.
arXiv Detail & Related papers (2024-11-23T02:53:08Z) - Surgical-LLaVA: Toward Surgical Scenario Understanding via Large Language and Vision Models [1.4042211166197214]
We introduce an LVLM specifically designed for surgical scenarios.
We establish Surgical-LLaVA, an LVLM fine-tuned on instruction-following data from surgical scenarios.
Experiments demonstrate that Surgical-LLaVA exhibits impressive multi-modal chat abilities in surgical contexts.
arXiv Detail & Related papers (2024-10-13T07:12:35Z) - Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data. We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z) - Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery [12.21083362663014]
Medical visual question answering (VQA) bridges the gap between visual information and clinical decision-making.
In particular, surgical VQA can enhance the interpretation of surgical data, aiding in accurate diagnoses, effective education, and clinical interventions.
We propose surgical visual question localized-answering (VQLA) for precise and context-aware responses to specific queries about surgical images.
arXiv Detail & Related papers (2024-08-09T09:23:07Z) - GP-VLS: A general-purpose vision language model for surgery [0.5249805590164902]
GP-VLS is a general-purpose vision language model for surgery.
It integrates medical and surgical knowledge with visual scene understanding.
We show GP-VLS significantly outperforms open- and closed-source models on surgical vision-language tasks.
arXiv Detail & Related papers (2024-07-27T17:27:05Z) - Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery [15.47190687192761]
We introduce Surgical-LVLM, a novel personalized large vision-language model tailored for complex surgical scenarios.
We demonstrate the effectiveness of Surgical-LVLM on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly introduced EndoVis Conversations dataset.
arXiv Detail & Related papers (2024-03-22T08:38:27Z) - Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [50.09187683845788]
Recent advancements in surgical computer vision applications have been driven by vision-only models.
These methods rely on manually annotated surgical videos to predict a fixed set of object categories.
In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals.
arXiv Detail & Related papers (2023-07-27T22:38:12Z) - Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery [18.248882845789353]
We develop a surgical question-answering system to facilitate robot-assisted surgical scene and activity understanding from recorded videos.
Most existing VQA methods require an object detector and a region-based feature extractor to extract visual features and fuse them with the embedded question text for answer generation.
We propose Visual Question Localized-Answering in Robotic Surgery (Surgical-VQLA) to localize the specific surgical area during the answer prediction.
arXiv Detail & Related papers (2023-05-19T14:13:47Z) - CholecTriplet2021: A benchmark challenge for surgical action triplet recognition [66.51610049869393]
This paper presents CholecTriplet 2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos.
We present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge.
A total of 4 baseline methods and 19 new deep learning algorithms are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%.
arXiv Detail & Related papers (2022-04-10T18:51:55Z)