Surgical-VQA: Visual Question Answering in Surgical Scenes using
Transformer
- URL: http://arxiv.org/abs/2206.11053v1
- Date: Wed, 22 Jun 2022 13:21:31 GMT
- Title: Surgical-VQA: Visual Question Answering in Surgical Scenes using
Transformer
- Authors: Lalithkumar Seenivasan, Mobarakol Islam, Adithya Krishna and Hongliang
Ren
- Abstract summary: Expert surgeons are often overloaded with clinical and academic workloads.
Having a Surgical-VQA system as a reliable 'second opinion' could act as a backup and ease the load on the medical experts.
We design a Surgical-VQA task that answers questionnaires on surgical procedures based on the surgical scene.
- Score: 15.490603884631764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual question answering (VQA) in surgery is largely unexplored. Expert
surgeons are scarce and are often overloaded with clinical and academic
workloads. This overload often limits their time answering questionnaires from
patients, medical students or junior residents related to surgical procedures.
At times, students and junior residents also refrain from asking too many
questions during classes to reduce disruption. While computer-aided simulators
and recording of past surgical procedures have been made available for them to
observe and improve their skills, they still rely heavily on medical experts to
answer their questions. Having a Surgical-VQA system as a reliable 'second
opinion' could act as a backup and ease the load on the medical experts in
answering these questions. The lack of annotated medical data and the presence
of domain-specific terms have limited the exploration of VQA for surgical
procedures. In this work, we design a Surgical-VQA task that answers
questionnaires on surgical procedures based on the surgical scene. Extending
the MICCAI Endoscopic Vision Challenge 2018 dataset and a workflow recognition
dataset, we introduce two Surgical-VQA datasets with classification and
sentence-based answers. To perform Surgical-VQA, we employ vision-text
transformer models. We further introduce a residual MLP-based VisualBert
encoder model that enforces interaction between visual and text tokens,
improving performance in classification-based answering. Furthermore, we study
the influence of the number of input image patches and temporal visual features
on the model performance in both classification and sentence-based answering.
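The residual MLP design is only summarized in the abstract; the sketch below is an illustrative guess at how a residual MLP block could mix concatenated visual and text tokens ahead of a VisualBert-style encoder. Class names, dimensions, and the token-mixing formulation are assumptions made for this example, not the authors' implementation.

```python
# Illustrative sketch only: a residual MLP block that mixes the joint
# (visual + text) token sequence. All names and shapes are assumptions
# for this example, not the paper's code.
import torch
import torch.nn as nn


class ResidualTokenMixer(nn.Module):
    """Mixes information across the joint visual/text token sequence."""

    def __init__(self, num_tokens: int, dim: int, hidden: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # MLP applied along the token axis so every token can interact
        # with every other token (visual with text and vice versa).
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden),
            nn.GELU(),
            nn.Linear(hidden, num_tokens),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        mixed = self.token_mlp(self.norm(tokens).transpose(1, 2)).transpose(1, 2)
        return tokens + mixed  # residual connection keeps the original tokens


if __name__ == "__main__":
    batch, n_visual, n_text, dim = 2, 25, 20, 768
    visual = torch.randn(batch, n_visual, dim)  # e.g. image patch features
    text = torch.randn(batch, n_text, dim)      # embedded question tokens
    joint = torch.cat([visual, text], dim=1)    # (2, 45, 768)
    mixer = ResidualTokenMixer(num_tokens=n_visual + n_text, dim=dim)
    print(mixer(joint).shape)  # torch.Size([2, 45, 768])
```

The mixed tokens would then feed a VisualBert-style encoder, with a classification head for answer classes or a decoder for sentence-based answers.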
Related papers
- Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry [14.479606737135045]
We propose SCAN, a memory-augmented framework to improve surgical context comprehension via Self-Contained Inquiry.
SCAN operates autonomously, generating two types of memory for context augmentation: Direct Memory (DM), which provides multiple candidates (or hints) to the final answer, and Indirect Memory (IM), which consists of self-contained question-hint pairs to capture broader scene context (a minimal data-structure sketch appears after this list).
Experiments on three publicly available Surgical VQA datasets demonstrate that SCAN achieves state-of-the-art performance, offering improved accuracy and robustness across various surgical scenarios.
arXiv Detail & Related papers (2024-11-17T02:23:45Z) - Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data.
We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z) - Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery [12.21083362663014]
Medical visual question answering (VQA) bridges the gap between visual information and clinical decision-making.
In particular, surgical VQA can enhance the interpretation of surgical data, aiding in accurate diagnoses, effective education, and clinical interventions.
We propose the surgical visual question localized-answering (VQLA) for precise and context-aware responses to specific queries regarding surgical images.
arXiv Detail & Related papers (2024-08-09T09:23:07Z) - VS-Assistant: Versatile Surgery Assistant on the Demand of Surgeons [29.783300422432763]
We propose a Versatile Surgery Assistant (VS-Assistant) that can accurately understand the surgeon's intention.
We devise a Surgical-Calling Tuning strategy to enable the VS-Assistant to understand surgical intentions.
arXiv Detail & Related papers (2024-05-14T02:05:36Z) - Advancing Surgical VQA with Scene Graph Knowledge [45.05847978115387]
We aim to advance Visual Question Answering in the surgical context with scene graph knowledge.
We build surgical scene graphs using spatial and action information of instruments and anatomies.
We propose SSG-QA-Net, a novel surgical VQA model incorporating a lightweight Scene-embedded Interaction Module (SIM).
arXiv Detail & Related papers (2023-12-15T22:50:12Z) - Deep Multimodal Fusion for Surgical Feedback Classification [70.53297887843802]
We leverage a clinically-validated five-category classification of surgical feedback.
We then develop a multi-label machine learning model to classify these five categories of surgical feedback from inputs of text, audio, and video modalities.
The ultimate goal of our work is to help automate the annotation of real-time contextual surgical feedback at scale.
arXiv Detail & Related papers (2023-12-06T01:59:47Z) - Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [51.78027546947034]
Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics.
We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals.
We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions.
arXiv Detail & Related papers (2023-07-27T22:38:12Z) - CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual
Question Localized-Answering in Robotic Surgery [14.52406034300867]
A surgical Visual Question Localized-Answering (VQLA) system would be helpful for medical students and junior surgeons to learn and understand from recorded surgical videos.
We propose an end-to-end Transformer with the Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios.
The proposed method provides a promising solution for surgical scene understanding, and opens up a primary step in the Artificial Intelligence (AI)-based VQLA system for surgical training.
arXiv Detail & Related papers (2023-07-11T11:35:40Z) - Surgical-VQLA: Transformer with Gated Vision-Language Embedding for
Visual Question Localized-Answering in Robotic Surgery [18.248882845789353]
We develop a surgical question-answering system to facilitate robot-assisted surgical scene and activity understanding from recorded videos.
Most of the existing VQA methods require an object detector and a region-based feature extractor to extract visual features and fuse them with the embedded text of the question for answer generation.
We propose Visual Question Localized-Answering in Robotic Surgery (Surgical-VQLA) to localize the specific surgical area during the answer prediction.
arXiv Detail & Related papers (2023-05-19T14:13:47Z) - CholecTriplet2021: A benchmark challenge for surgical action triplet
recognition [66.51610049869393]
This paper presents CholecTriplet2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos.
We present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge.
A total of 4 baseline methods and 19 new deep learning algorithms are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%.
arXiv Detail & Related papers (2022-04-10T18:51:55Z) - Medical Visual Question Answering: A Survey [55.53205317089564]
Medical Visual Question Answering(VQA) is a combination of medical artificial intelligence and popular VQA challenges.
Given a medical image and a clinically relevant question in natural language, the medical VQA system is expected to predict a plausible and convincing answer.
arXiv Detail & Related papers (2021-11-19T05:55:15Z)
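As flagged in the SCAN entry above, the Direct/Indirect Memory split is essentially a small data-structure idea. The following minimal sketch shows one way such memories could be represented and flattened into extra textual context for a surgical VQA prompt; all class names, fields, and the flattening strategy are assumptions for illustration, not SCAN's actual interface.

```python
# Minimal sketch of SCAN-style memory entries; field names and the
# flattening strategy are assumptions for illustration, not SCAN's code.
from dataclasses import dataclass
from typing import List


@dataclass
class DirectMemory:
    """Candidate answers (or hints) proposed for the current question."""
    candidates: List[str]


@dataclass
class IndirectMemory:
    """A self-contained question-hint pair describing broader scene context."""
    question: str
    hint: str


def build_context(question: str, dm: DirectMemory,
                  im_items: List[IndirectMemory]) -> str:
    """Flatten both memory types into a text block prepended to the VQA prompt."""
    lines = [f"Question: {question}"]
    lines.append("Direct memory (candidate answers): " + "; ".join(dm.candidates))
    for item in im_items:
        lines.append(f"Indirect memory: {item.question} -> {item.hint}")
    return "\n".join(lines)


if __name__ == "__main__":
    dm = DirectMemory(candidates=["grasper", "bipolar forceps"])
    im = [IndirectMemory("What phase is this?", "Calot triangle dissection")]
    print(build_context("Which instrument is in the left field?", dm, im))
```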