Surgical-VQLA: Transformer with Gated Vision-Language Embedding for
Visual Question Localized-Answering in Robotic Surgery
- URL: http://arxiv.org/abs/2305.11692v1
- Date: Fri, 19 May 2023 14:13:47 GMT
- Title: Surgical-VQLA: Transformer with Gated Vision-Language Embedding for
Visual Question Localized-Answering in Robotic Surgery
- Authors: Long Bai, Mobarakol Islam, Lalithkumar Seenivasan, Hongliang Ren
- Abstract summary: We develop a surgical question-answering system to facilitate robot-assisted surgical scene and activity understanding from recorded videos.
Most existing VQA methods require an object detector and a region-based feature extractor to extract visual features and fuse them with the embedded question text for answer generation.
We propose Visual Question Localized-Answering in Robotic Surgery (Surgical-VQLA) to localize the specific surgical area during the answer prediction.
- Score: 18.248882845789353
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the availability of computer-aided simulators and recorded videos of
surgical procedures, junior residents still heavily rely on experts to answer
their queries. However, expert surgeons are often overloaded with clinical and
academic workloads and have limited time to answer. To address this, we
develop a surgical question-answering system to facilitate robot-assisted
surgical scene and activity understanding from recorded videos. Most existing
VQA methods require an object detector and a region-based feature extractor to
extract visual features and fuse them with the embedded question text for
answer generation. However, (1) surgical object detection models are scarce
due to small datasets and a lack of bounding box annotations; (2) current
fusion strategies for heterogeneous modalities such as text and image are
naive; and (3) localized answering, which is crucial in complex surgical
scenarios, is missing. In this paper, we propose Visual Question
Localized-Answering in Robotic Surgery (Surgical-VQLA) to localize the specific
surgical area during answer prediction. To handle the fusion of heterogeneous
modalities, we design a gated vision-language embedding (GVLE) that builds
input patches for the Language Vision Transformer (LViT) to predict the
answer. For localization, we add a detection head in parallel with the
prediction head of the LViT. We also integrate a GIoU loss to boost
localization performance while preserving the accuracy of the
question-answering model. We annotate two VQLA datasets using publicly
available surgical videos from the MICCAI challenges EndoVis-17 and
EndoVis-18. Our validation results suggest that Surgical-VQLA can better
understand the surgical scene and localize the specific area related to the
question being answered. GVLE provides an efficient vision-language embedding
technique, showing superior performance over existing benchmarks.
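The abstract outlines three architectural ingredients: a gated vision-language embedding (GVLE) that fuses image and question tokens into input patches, a Language Vision Transformer (LViT) with an answer-prediction head and a parallel detection head, and a GIoU loss used alongside the answer-classification objective. The PyTorch sketch below is a minimal illustration of that structure, not the authors' implementation; the gating formula, pooling strategy, token dimensions, class count, and loss weighting are assumptions made for the example.

```python
# Minimal sketch of the Surgical-VQLA structure described in the abstract.
# NOT the authors' code: gating form, pooling, and loss weighting are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import generalized_box_iou_loss


class GatedVisionLanguageEmbedding(nn.Module):
    """Fuse visual and text token embeddings with a learned sigmoid gate (assumed form)."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # Both inputs: (batch, seq_len, dim); assumed padded to the same sequence length.
        g = self.gate(torch.cat([vis_tokens, txt_tokens], dim=-1))
        return g * vis_tokens + (1.0 - g) * txt_tokens  # gated mixture used as fused input patches


class SurgicalVQLASketch(nn.Module):
    """Transformer encoder with an answer head and a parallel bounding-box head."""

    def __init__(self, dim: int = 768, num_answers: int = 18, num_layers: int = 6):
        super().__init__()
        self.fusion = GatedVisionLanguageEmbedding(dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.answer_head = nn.Linear(dim, num_answers)  # question-answering (classification) head
        self.bbox_head = nn.Linear(dim, 4)              # parallel detection head: (x1, y1, x2, y2)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        fused = self.fusion(vis_tokens, txt_tokens)
        pooled = self.encoder(fused).mean(dim=1)        # simple mean pooling (assumption)
        return self.answer_head(pooled), self.bbox_head(pooled)


def vqla_loss(answer_logits, pred_boxes, answer_labels, gt_boxes, giou_weight=1.0):
    """Cross-entropy for the answer plus GIoU for localization (weighting assumed)."""
    cls_loss = F.cross_entropy(answer_logits, answer_labels)
    # pred_boxes / gt_boxes: (batch, 4) in (x1, y1, x2, y2) form with x2 > x1 and y2 > y1.
    giou = generalized_box_iou_loss(pred_boxes, gt_boxes, reduction="mean")
    return cls_loss + giou_weight * giou
```

In practice the visual tokens would come from an image backbone and the text tokens from a tokenizer plus embedding layer; consult the paper for the exact GVLE formulation and training details.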
Related papers
- Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation [51.222684687924215]
Surgical video-language pretraining faces unique challenges due to the knowledge domain gap and the scarcity of multi-modal data.
We propose a hierarchical knowledge augmentation approach and a novel Procedure-Encoded Surgical Knowledge-Augmented Video-Language Pretraining framework to tackle these issues.
arXiv Detail & Related papers (2024-09-30T22:21:05Z)
- Surgical-VQLA++: Adversarial Contrastive Learning for Calibrated Robust Visual Question-Localized Answering in Robotic Surgery [12.21083362663014]
Medical visual question answering (VQA) bridges the gap between visual information and clinical decision-making.
In particular, surgical VQA can enhance the interpretation of surgical data, aiding in accurate diagnoses, effective education, and clinical interventions.
We propose the surgical visual question localized-answering (VQLA) for precise and context-aware responses to specific queries regarding surgical images.
arXiv Detail & Related papers (2024-08-09T09:23:07Z)
- PitVQA: Image-grounded Text Embedding LLM for Visual Question Answering in Pituitary Surgery [16.341966752582096]
This paper introduces PitVQA, a dataset specifically designed for Visual Question Answering (VQA) in endonasal pituitary surgery and PitVQA-Net, an adaptation of the GPT2 with a novel image-grounded text embedding for surgical VQA.
PitVQA comprises 25 procedural videos and a rich collection of question-answer pairs spanning crucial surgical aspects such as phase and step recognition, context understanding, tool detection and localization, and tool-tissue interactions.
PitVQA-Net combines a novel image-grounded text embedding, which projects image and text features into a shared embedding space, with a GPT2 backbone.
arXiv Detail & Related papers (2024-05-22T19:30:24Z)
- Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery [15.47190687192761]
We introduce Surgical-LVLM, a novel personalized large vision-language model tailored for complex surgical scenarios.
We demonstrate the effectiveness of Surgical-LVLM on several benchmarks, including EndoVis-17-VQLA, EndoVis-18-VQLA, and a newly introduced EndoVis Conversations dataset.
arXiv Detail & Related papers (2024-03-22T08:38:27Z)
- Hypergraph-Transformer (HGT) for Interactive Event Prediction in Laparoscopic and Robotic Surgery [50.3022015601057]
We propose a predictive neural network that is capable of understanding and predicting critical interactive aspects of surgical workflow from intra-abdominal video.
We verify our approach on established surgical datasets and applications, including the detection and prediction of action triplets.
Our results demonstrate the superiority of our approach compared to unstructured alternatives.
arXiv Detail & Related papers (2024-02-03T00:58:05Z)
- Advancing Surgical VQA with Scene Graph Knowledge [45.05847978115387]
We aim to advance Visual Question Answering in the surgical context with scene graph knowledge.
We build surgical scene graphs using spatial and action information of instruments and anatomies.
We propose SSG-QA-Net, a novel surgical VQA model incorporating a lightweight Scene-embedded Interaction Module (SIM)
arXiv Detail & Related papers (2023-12-15T22:50:12Z)
- Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures [51.78027546947034]
Recent advancements in surgical computer vision have been driven by vision-only models, which lack language semantics.
We propose leveraging surgical video lectures from e-learning platforms to provide effective vision and language supervisory signals.
We address surgery-specific linguistic challenges using multiple automatic speech recognition systems for text transcriptions.
arXiv Detail & Related papers (2023-07-27T22:38:12Z)
- CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery [14.52406034300867]
A surgical Visual Question Localized-Answering (VQLA) system would be helpful for medical students and junior surgeons to learn and understand from recorded surgical videos.
We propose an end-to-end Transformer with the Co-Attention gaTed Vision-Language (CAT-ViL) embedding for VQLA in surgical scenarios.
The proposed method provides a promising solution for surgical scene understanding, and opens up a primary step in the Artificial Intelligence (AI)-based VQLA system for surgical training.
arXiv Detail & Related papers (2023-07-11T11:35:40Z)
- XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z)
- Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer [15.490603884631764]
Expert surgeons are often overloaded with clinical and academic workload.
Having a Surgical-VQA system as a reliable 'second opinion' could act as a backup and ease the load on medical experts.
We design a Surgical-VQA task that answers questionnaires on surgical procedures based on the surgical scene.
arXiv Detail & Related papers (2022-06-22T13:21:31Z)
- CholecTriplet2021: A benchmark challenge for surgical action triplet recognition [66.51610049869393]
This paper presents CholecTriplet 2021: an endoscopic vision challenge organized at MICCAI 2021 for the recognition of surgical action triplets in laparoscopic videos.
We present the challenge setup and assessment of the state-of-the-art deep learning methods proposed by the participants during the challenge.
A total of 4 baseline methods and 19 new deep learning algorithms are presented to recognize surgical action triplets directly from surgical videos, achieving mean average precision (mAP) ranging from 4.2% to 38.1%.
arXiv Detail & Related papers (2022-04-10T18:51:55Z)