SurgicalGPT: End-to-End Language-Vision GPT for Visual Question
Answering in Surgery
- URL: http://arxiv.org/abs/2304.09974v2
- Date: Sat, 22 Jul 2023 15:43:46 GMT
- Title: SurgicalGPT: End-to-End Language-Vision GPT for Visual Question
Answering in Surgery
- Authors: Lalithkumar Seenivasan, Mobarakol Islam, Gokul Kannan and Hongliang
Ren
- Abstract summary: We develop an end-to-end trainable Language-Vision GPT model that expands the GPT2 model to include vision input (image).
We show that the LV-GPT model outperforms other state-of-the-art VQA models on two publicly available surgical-VQA datasets.
- Score: 15.490603884631764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in GPT-based large language models (LLMs) are revolutionizing
natural language processing, exponentially increasing its use across various
domains. Incorporating uni-directional attention, these autoregressive LLMs can
generate long and coherent paragraphs. However, for visual question answering
(VQA) tasks that require both vision and language processing, models with
bi-directional attention or models employing fusion techniques are often
employed to capture the context of multiple modalities all at once. As GPT does
not natively process vision tokens, to exploit the advancements in GPT models
for VQA in robotic surgery, we design an end-to-end trainable Language-Vision
GPT (LV-GPT) model that expands the GPT2 model to include vision input (image).
The proposed LV-GPT incorporates a feature extractor (vision tokenizer) and
vision token embedding (token type and pose). Given the unidirectional
attention in GPT models and their strength in generating long, coherent
paragraphs, we carefully sequence the word tokens before the vision tokens,
mimicking the human thought process of understanding the question before
inferring an answer from the image. Quantitatively, we show that the LV-GPT
model outperforms other state-of-the-art VQA models on two publicly available
surgical-VQA datasets (based on the endoscopic vision challenge robotic scene
segmentation 2018 and CholecTriplet2021) and on our newly annotated dataset
(based on the holistic surgical scene dataset). We further annotate all three
datasets to include question-type annotations to allow sub-type analysis.
Furthermore, we extensively study and present the effects of token sequencing,
token type and pose embedding for vision tokens in the LV-GPT model.
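
The architecture described in the abstract (a vision tokenizer whose tokens follow the question's word tokens, plus token-type and pose embeddings for the vision tokens) can be illustrated in a few lines of PyTorch. The snippet below is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a ResNet-18 backbone as the vision tokenizer, HuggingFace's GPT2Model as the language backbone, and classification-style answering; all module names and dimensions are illustrative.

```python
# Minimal sketch of the LV-GPT idea from the abstract (illustrative only):
# vision tokens are appended AFTER the question's word tokens and receive
# token-type and pose embeddings before entering GPT-2.
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import GPT2Model


class LVGPTSketch(nn.Module):
    def __init__(self, num_answer_classes: int, num_vision_tokens: int = 49):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        hidden = self.gpt2.config.n_embd  # 768 for base GPT-2

        # Vision tokenizer (assumed ResNet-18): flatten the final 7x7 feature
        # map into a sequence of 49 vision tokens and project to GPT-2 width.
        cnn = resnet18(weights="DEFAULT")
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])
        self.vis_proj = nn.Linear(512, hidden)

        # Token-type embedding (0 = word, 1 = vision) and a learned pose
        # embedding for the vision tokens.
        self.type_emb = nn.Embedding(2, hidden)
        self.vis_pose_emb = nn.Embedding(num_vision_tokens, hidden)

        self.classifier = nn.Linear(hidden, num_answer_classes)

    def forward(self, input_ids, attention_mask, image):
        # 1) Word tokens first, as in the abstract's token sequencing.
        word_emb = self.gpt2.wte(input_ids) + self.type_emb(
            torch.zeros_like(input_ids))

        # 2) Vision tokens appended after the word tokens.
        feat = self.backbone(image)                 # B x 512 x 7 x 7
        feat = feat.flatten(2).transpose(1, 2)      # B x 49 x 512
        pose_ids = torch.arange(feat.size(1), device=feat.device)
        vis = (self.vis_proj(feat)
               + self.vis_pose_emb(pose_ids)
               + self.type_emb(torch.ones_like(pose_ids)))

        embeds = torch.cat([word_emb, vis], dim=1)
        vis_mask = torch.ones(vis.shape[:2], dtype=attention_mask.dtype,
                              device=vis.device)
        mask = torch.cat([attention_mask, vis_mask], dim=1)

        out = self.gpt2(inputs_embeds=embeds, attention_mask=mask)
        # Classify the answer from the final token's hidden state.
        return self.classifier(out.last_hidden_state[:, -1])
```

Because GPT-2 attends only left-to-right, placing the word tokens first lets every vision token attend to the full question, which matches the sequencing rationale given in the abstract.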
Related papers
- Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models [24.076565048125975]
Cross-modality generation, encompassing Vision-Language Perception by large vision-language models (LVLMs) and Image-to-Image (I2I) generation models (GMs), has attracted significant attention.
Previous research indicates that printing typographic words onto input images can significantly induce LVLMs and I2I GMs to generate disruptive outputs semantically related to those words.
Visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of generative tasks when injected into images.
arXiv Detail & Related papers (2025-03-14T15:42:42Z) - OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [95.6266030753644]
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions.
Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies.
We propose OTTER, a novel VLA architecture that leverages existing alignments through explicit, text-aware visual feature extraction.
arXiv Detail & Related papers (2025-03-05T18:44:48Z) - MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification [19.29480118378639]
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels.
This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification.
arXiv Detail & Related papers (2025-02-11T09:42:13Z) - ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z) - UniRAG: Universal Retrieval Augmentation for Large Vision Language Models [76.30799731147589]
We introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference.
Contrary to the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation on the MSCOCO dataset with common entities shows that both proprietary models and smaller open-source models significantly improve their generation quality.
arXiv Detail & Related papers (2024-05-16T17:58:45Z) - VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework [47.58359136198136]
We introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models.
VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features.
This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision.
arXiv Detail & Related papers (2024-03-14T01:39:40Z) - Intensive Vision-guided Network for Radiology Report Generation [22.030289124516326]
We propose a Globally-intensive Attention (GIA) module in the medical image encoder to simulate and integrate multi-view vision perception.
We also explore how to involve multi-modal signals to generate precisely matched reports, i.e., how to integrate previously predicted words with region-aware visual content in next word prediction.
arXiv Detail & Related papers (2024-02-06T06:46:46Z) - VL-GPT: A Generative Pre-trained Transformer for Vision and Language
Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z) - APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics comparable to words and support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separate attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z) - Vision-and-Language Pretrained Models: A Survey [3.270244666687303]
We present an overview of the major advances achieved in Visual-Language Pretrained Models.
We first discuss the language and vision data encoding methods and then present the mainstream VLPM structure as the core content.
arXiv Detail & Related papers (2022-04-15T07:33:06Z) - VC-GPT: Visual Conditioned GPT for End-to-End Generative
Vision-and-Language Pre-training [9.511101155155957]
Vision-and-language pre-training models (VLMs) have achieved tremendous success in the cross-modal area, but most of them require millions of parallel image-caption pairs for pre-training.
In this work, we focus on reducing this need for generative vision-and-language pre-training by using a pre-trained visual model (CLIP-ViT) as the encoder and a pre-trained language model (GPT2) as the decoder.
arXiv Detail & Related papers (2022-01-30T04:44:54Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending to text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.