SurgicalGPT: End-to-End Language-Vision GPT for Visual Question
Answering in Surgery
- URL: http://arxiv.org/abs/2304.09974v2
- Date: Sat, 22 Jul 2023 15:43:46 GMT
- Title: SurgicalGPT: End-to-End Language-Vision GPT for Visual Question
Answering in Surgery
- Authors: Lalithkumar Seenivasan, Mobarakol Islam, Gokul Kannan and Hongliang
Ren
- Abstract summary: We develop an end-to-end trainable Language-Vision GPT model that expands the GPT2 model to include vision input (image).
We show that the LV-GPT model outperforms other state-of-the-art VQA models on two publicly available surgical-VQA datasets.
- Score: 15.490603884631764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in GPT-based large language models (LLMs) are revolutionizing
natural language processing, exponentially increasing its use across various
domains. Incorporating uni-directional attention, these autoregressive LLMs can
generate long and coherent paragraphs. However, for visual question answering
(VQA) tasks that require both vision and language processing, models with
bi-directional attention or models employing fusion techniques are often
employed to capture the context of multiple modalities all at once. As GPT does
not natively process vision tokens, to exploit the advancements in GPT models
for VQA in robotic surgery, we design an end-to-end trainable Language-Vision
GPT (LV-GPT) model that expands the GPT2 model to include vision input (image).
The proposed LV-GPT incorporates a feature extractor (vision tokenizer) and
vision token embedding (token type and pose). Given the unidirectional
attention in GPT models and their strength in generating long, coherent
paragraphs, we carefully sequence the word tokens before the vision tokens,
mimicking the human thought process of understanding the question before
inferring an answer from the image. Quantitatively, we show that the LV-GPT
model outperforms other state-of-the-art VQA models on two publicly available
surgical-VQA datasets (based on the endoscopic vision challenge robotic scene
segmentation 2018 and CholecTriplet2021) and on our newly annotated dataset
(based on the holistic surgical scene dataset). We further annotate all three
datasets to include question-type annotations to allow sub-type analysis.
Furthermore, we extensively study and present the effects of token sequencing,
token type and pose embedding for vision tokens in the LV-GPT model.
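
The architecture described in the abstract (a vision tokenizer whose tokens follow the question's word tokens, plus token-type and pose embeddings for the vision tokens) can be illustrated in a few lines of PyTorch. The snippet below is a minimal sketch under stated assumptions, not the authors' implementation: it assumes a ResNet-18 backbone as the vision tokenizer, HuggingFace's GPT2Model as the language backbone, and classification-style answering; all module names and dimensions are illustrative.

```python
# Minimal sketch of the LV-GPT idea from the abstract (illustrative only):
# vision tokens are appended AFTER the question's word tokens and receive
# token-type and pose embeddings before entering GPT-2.
import torch
import torch.nn as nn
from torchvision.models import resnet18
from transformers import GPT2Model


class LVGPTSketch(nn.Module):
    def __init__(self, num_answer_classes: int, num_vision_tokens: int = 49):
        super().__init__()
        self.gpt2 = GPT2Model.from_pretrained("gpt2")
        hidden = self.gpt2.config.n_embd  # 768 for base GPT-2

        # Vision tokenizer (assumed ResNet-18): flatten the final 7x7 feature
        # map into a sequence of 49 vision tokens and project to GPT-2 width.
        cnn = resnet18(weights="DEFAULT")
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])
        self.vis_proj = nn.Linear(512, hidden)

        # Token-type embedding (0 = word, 1 = vision) and a learned pose
        # embedding for the vision tokens.
        self.type_emb = nn.Embedding(2, hidden)
        self.vis_pose_emb = nn.Embedding(num_vision_tokens, hidden)

        self.classifier = nn.Linear(hidden, num_answer_classes)

    def forward(self, input_ids, attention_mask, image):
        # 1) Word tokens first, as in the abstract's token sequencing.
        word_emb = self.gpt2.wte(input_ids) + self.type_emb(
            torch.zeros_like(input_ids))

        # 2) Vision tokens appended after the word tokens.
        feat = self.backbone(image)                 # B x 512 x 7 x 7
        feat = feat.flatten(2).transpose(1, 2)      # B x 49 x 512
        pose_ids = torch.arange(feat.size(1), device=feat.device)
        vis = (self.vis_proj(feat)
               + self.vis_pose_emb(pose_ids)
               + self.type_emb(torch.ones_like(pose_ids)))

        embeds = torch.cat([word_emb, vis], dim=1)
        vis_mask = torch.ones(vis.shape[:2], dtype=attention_mask.dtype,
                              device=vis.device)
        mask = torch.cat([attention_mask, vis_mask], dim=1)

        out = self.gpt2(inputs_embeds=embeds, attention_mask=mask)
        # Classify the answer from the final token's hidden state.
        return self.classifier(out.last_hidden_state[:, -1])
```

Because GPT-2 attends only left-to-right, placing the word tokens first lets every vision token attend to the full question, which matches the sequencing rationale given in the abstract.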
Related papers
- Exploring Typographic Visual Prompts Injection Threats in Cross-Modality Generation Models [24.076565048125975]
Cross-modality generation, encompassing Vision-Language Perception by large vision-language models (LVLMs) and Image-to-Image (I2I) generation models (GMs), has attracted significant attention.
Previous research indicates that printing typographic words onto input images can significantly induce LVLMs and I2I GMs to generate disruptive outputs semantically related to those words.
Visual prompts, as a more sophisticated form of typography, are also revealed to pose security risks to various applications of generative tasks when injected into images.
arXiv Detail & Related papers (2025-03-14T15:42:42Z) - OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [95.6266030753644]
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions.
Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies.
We propose OTTER, a novel VLA architecture that leverages existing alignments through explicit, text-aware visual feature extraction.
arXiv Detail & Related papers (2025-03-05T18:44:48Z) - MGPATH: Vision-Language Model with Multi-Granular Prompt Learning for Few-Shot WSI Classification [19.29480118378639]
Whole slide pathology image classification presents challenges due to gigapixel image sizes and limited annotation labels.
This paper introduces a prompt learning method to adapt large vision-language models for few-shot pathology classification.
arXiv Detail & Related papers (2025-02-11T09:42:13Z) - ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features [54.37042005469384]
We announce MVKL, the first multimodal mammography dataset encompassing multi-view images, detailed manifestations and reports.
Based on this dataset, we focus on the challenging task of unsupervised pretraining.
We propose ViKL, a framework that synergizes Visual, Knowledge, and Linguistic features.
arXiv Detail & Related papers (2024-09-24T05:01:23Z) - UniRAG: Universal Retrieval Augmentation for Large Vision Language Models [76.30799731147589]
We introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference.
Contrary to the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation on the MSCOCO dataset with common entities shows that both proprietary models and smaller open-source models significantly improve their generation quality.
arXiv Detail & Related papers (2024-05-16T17:58:45Z) - VisionGPT: Vision-Language Understanding Agent Using Generalized Multimodal Framework [47.58359136198136]
We introduce VisionGPT to consolidate and automate the integration of state-of-the-art foundation models.
VisionGPT builds upon a generalized multimodal framework that distinguishes itself through three key features.
This paper outlines the architecture and capabilities of VisionGPT, demonstrating its potential to revolutionize the field of computer vision.
arXiv Detail & Related papers (2024-03-14T01:39:40Z) - Intensive Vision-guided Network for Radiology Report Generation [22.030289124516326]
We propose a Globally-intensive Attention (GIA) module in the medical image encoder to simulate and integrate multi-view vision perception.
We also explore how to involve multi-modal signals to generate precisely matched reports, i.e., how to integrate previously predicted words with region-aware visual content in next word prediction.
arXiv Detail & Related papers (2024-02-06T06:46:46Z) - VL-GPT: A Generative Pre-trained Transformer for Vision and Language
Understanding and Generation [79.02357561313785]
We introduce Vision-Language Generative Pre-trained Transformer (VL-GPT), a transformer model proficient at concurrently perceiving and generating visual and linguistic data.
VL-GPT achieves a unified pre-training approach for both image and text modalities by employing a straightforward auto-regressive objective.
arXiv Detail & Related papers (2023-12-14T18:59:43Z) - APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain up to 6.03% over MaPLe (SOTA) on novel classes for 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics comparable to words and support a dynamic sequence length that varies with the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z) - DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separate attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z) - Vision-and-Language Pretrained Models: A Survey [3.270244666687303]
We present an overview of the major advances achieved in Visual-Language Pretrained Models.
We first discuss the language and vision data encoding methods and then present the mainstream VLPM structure as the core content.
arXiv Detail & Related papers (2022-04-15T07:33:06Z) - VC-GPT: Visual Conditioned GPT for End-to-End Generative
Vision-and-Language Pre-training [9.511101155155957]
Vision-and-language pre-training models (VLMs) have achieved tremendous success in the cross-modal area, but most of them require millions of parallel image-caption pairs for pre-training.
In this work, we focus on reducing this need for generative vision-and-language pre-training by using a pre-trained visual model (CLIP-ViT) as the encoder and a pre-trained language model (GPT2) as the decoder.
arXiv Detail & Related papers (2022-01-30T04:44:54Z) - Behind the Scene: Revealing the Secrets of Pre-trained
Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending to text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.