PitVQA++: Vector Matrix-Low-Rank Adaptation for Open-Ended Visual Question Answering in Pituitary Surgery
- URL: http://arxiv.org/abs/2502.14149v1
- Date: Wed, 19 Feb 2025 23:28:39 GMT
- Title: PitVQA++: Vector Matrix-Low-Rank Adaptation for Open-Ended Visual Question Answering in Pituitary Surgery
- Authors: Runlong He, Danyal Z. Khan, Evangelos B. Mazomenos, Hani J. Marcus, Danail Stoyanov, Matthew J. Clarkson, Mobarakol Islam
- Abstract summary: Vision-Language Models (VLMs) in visual question answering (VQA) offer a unique opportunity to enhance intra-operative decision-making, promote intuitive interactions, and significantly advance surgical education. The development of VLMs for surgical VQA is challenging due to limited datasets and the risk of overfitting and catastrophic forgetting during full fine-tuning of pretrained weights. This work introduces PitVQA++ with an open-ended PitVQA dataset and an innovative VLM fine-tuning approach for adapting GPT-2 to pituitary surgery.
- Score: 16.957689975841113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) in visual question answering (VQA) offer a unique opportunity to enhance intra-operative decision-making, promote intuitive interactions, and significantly advance surgical education. However, the development of VLMs for surgical VQA is challenging due to limited datasets and the risk of overfitting and catastrophic forgetting during full fine-tuning of pretrained weights. While parameter-efficient techniques like Low-Rank Adaptation (LoRA) and Matrix of Rank Adaptation (MoRA) address adaptation challenges, their uniform parameter distribution overlooks the feature hierarchy in deep networks, where earlier layers, which learn general features, require more parameters than later ones. This work introduces PitVQA++ with an open-ended PitVQA dataset and vector matrix-low-rank adaptation (Vector-MoLoRA), an innovative VLM fine-tuning approach for adapting GPT-2 to pituitary surgery. Open-Ended PitVQA comprises 101,803 frames from 25 procedural videos with 745,972 question-answer sentence pairs, covering key surgical elements such as phase and step recognition, context understanding, tool detection, localization, and interaction recognition. Vector-MoLoRA incorporates the principles of LoRA and MoRA to develop a matrix-low-rank adaptation strategy that employs vector ranking to allocate more parameters to earlier layers and gradually fewer to later layers. Our approach, validated on the Open-Ended PitVQA and EndoVis18-VQA datasets, effectively mitigates catastrophic forgetting while significantly enhancing performance over recent baselines. Furthermore, our risk-coverage analysis highlights its enhanced reliability and trustworthiness in handling uncertain predictions. Our source code and dataset are available at https://github.com/HRL-Mike/PitVQA-Plus.
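The layer-wise rank allocation described in the abstract can be sketched briefly. The snippet below is a minimal, hypothetical PyTorch illustration rather than the released implementation: the helper `linear_rank_schedule`, its rank bounds, and the plain-LoRA adapter are assumptions for exposition (the paper's Vector-MoLoRA additionally incorporates MoRA-style matrix adaptation).

```python
import torch
import torch.nn as nn

def linear_rank_schedule(num_layers: int, r_max: int = 64, r_min: int = 4) -> list[int]:
    """Illustrative rank vector (assumed bounds): earlier layers get larger ranks."""
    if num_layers == 1:
        return [r_max]
    step = (r_max - r_min) / (num_layers - 1)
    return [round(r_max - i * step) for i in range(num_layers)]

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                 # keep pretrained weights frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))    # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

# Stand-in for the per-block projections of a 12-block GPT-2-style decoder.
blocks = nn.ModuleList([nn.Linear(768, 768) for _ in range(12)])
ranks = linear_rank_schedule(num_layers=12)                         # e.g. [64, 59, ..., 9, 4]
adapted = nn.ModuleList([LoRALinear(b, r) for b, r in zip(blocks, ranks)])
```

Earlier blocks thus receive larger adapter ranks (more trainable parameters) and later blocks progressively smaller ones, which is the vector-ranking idea the abstract describes.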
Related papers
- Multi-Modality Driven LoRA for Adverse Condition Depth Estimation [61.525312117638116]
We propose Multi-Modality Driven LoRA (MMD-LoRA) for Adverse Condition Depth Estimation.
It consists of two core components: Prompt Driven Domain Alignment (PDDA) and Visual-Text Consistent Contrastive Learning (VTCCL).
It achieves state-of-the-art performance on the nuScenes and Oxford RobotCar datasets.
arXiv Detail & Related papers (2024-12-28T14:23:58Z) - OP-LoRA: The Blessing of Dimensionality [93.08208871549557]
Low-rank adapters enable fine-tuning of large models with only a small number of parameters. However, they often pose optimization challenges, with poor convergence. We introduce an over-parameterized approach that accelerates training without increasing inference costs. We achieve improvements in vision-language tasks and especially notable increases in image generation.
arXiv Detail & Related papers (2024-12-13T18:55:19Z) - ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts.
Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z) - Adapting Vision-Language Model with Fine-grained Semantics for Open-Vocabulary Segmentation [42.020470627552136]
Open-vocabulary segmentation is primarily bottlenecked by mask classification, not mask generation.
We propose a novel Fine-grained Semantic Adaptation (FISA) method to address this limitation.
FISA enhances the extracted visual features with fine-grained semantic awareness by explicitly integrating this crucial semantic information early in the visual encoding process.
arXiv Detail & Related papers (2024-09-24T17:50:28Z) - DARES: Depth Anything in Robotic Endoscopic Surgery with Self-supervised Vector-LoRA of the Foundation Model [17.41557655783514]
We introduce Depth Anything in Robotic Endoscopic Surgery (DARES).
It applies a new adaptation technique, Low-Rank Adaptation (LoRA), to the DAM V2 foundation model to perform self-supervised monocular depth estimation.
The new method is validated as superior to recent state-of-the-art self-supervised monocular depth estimation techniques.
arXiv Detail & Related papers (2024-08-30T17:35:06Z) - Joint Admission Control and Resource Allocation of Virtual Network Embedding via Hierarchical Deep Reinforcement Learning [69.00997996453842]
We propose a hierarchical deep reinforcement learning approach to learn a joint Admission Control and Resource Allocation policy for virtual network embedding.
We show that HRL-ACRA outperforms state-of-the-art baselines in terms of both the acceptance ratio and long-term average revenue.
arXiv Detail & Related papers (2024-06-25T07:42:30Z) - OpenDAS: Open-Vocabulary Domain Adaptation for 2D and 3D Segmentation [54.98688607911399]
We propose the task of open-vocabulary domain adaptation to infuse domain-specific knowledge into Vision-Language Models (VLMs).
Existing VLM adaptation methods improve performance on base (training) queries, but fail to preserve the open-set capabilities of VLMs on novel queries.
Our approach is the only parameter-efficient method that consistently surpasses the original VLM on novel classes.
arXiv Detail & Related papers (2024-05-30T15:16:06Z) - HAAP: Vision-context Hierarchical Attention Autoregressive with Adaptive Permutation for Scene Text Recognition [17.412985505938508]
Internal Language Model (LM)-based methods use permutation language modeling (PLM) to address the error-correction problem caused by conditional independence in external LM-based methods.
This paper proposes the Hierarchical Attention autoregressive Model with Adaptive Permutation (HAAP) to enhance the location-context-image interaction capability.
arXiv Detail & Related papers (2024-05-15T06:41:43Z) - Enhancing Adversarial Robustness of Vision-Language Models through Low-Rank Adaptation [15.065302021892318]
Vision-Language Models (VLMs) play a crucial role in the advancement of Artificial General Intelligence (AGI). Addressing security concerns has emerged as one of the most significant challenges for VLMs. We propose a parameter-efficient adversarial adaptation method called AdvLoRA based on Low-Rank Adaptation.
arXiv Detail & Related papers (2024-04-20T17:19:54Z) - p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models [10.713680139939354]
Vision-Language models (VLMs) pre-trained on large corpora have demonstrated notable success across a range of downstream tasks.
Parameter-efficient transfer learning (PETL) has garnered attention as a viable alternative to full fine-tuning.
We propose a new adapter architecture, $p$-adapter, which employs $p$-Laplacian message passing in Graph Neural Networks (GNNs); a generic sketch of one such message-passing step appears after this list.
arXiv Detail & Related papers (2023-12-17T05:30:35Z) - Adversarial Feature Augmentation and Normalization for Visual Recognition [109.6834687220478]
Recent advances in computer vision take advantage of adversarial data augmentation to ameliorate the generalization ability of classification models.
Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings.
We validate the proposed approach across diverse visual recognition tasks with representative backbone networks.
arXiv Detail & Related papers (2021-03-22T20:36:34Z) - Phase Retrieval using Expectation Consistent Signal Recovery Algorithm based on Hypernetwork [73.94896986868146]
Phase retrieval is an important component in modern computational imaging systems.
Recent advances in deep learning have opened up a new possibility for robust and fast PR.
We develop a novel framework for deep unfolding to overcome the existing limitations.
arXiv Detail & Related papers (2021-01-12T08:36:23Z)
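As a closing illustration for the $p$-Laplacian adaptation entry above, the snippet below sketches one generic graph $p$-Laplacian message-passing step in PyTorch. It is a hedged, self-contained illustration of the operator itself (dense adjacency, no GNN library, illustrative function name and step size) and not the paper's $p$-adapter architecture.

```python
import torch

def p_laplacian_step(x: torch.Tensor, adj: torch.Tensor, p: float = 1.5,
                     step: float = 0.1, eps: float = 1e-8) -> torch.Tensor:
    """One graph p-Laplacian smoothing step (illustrative, not the paper's adapter).

    x   : (N, d) node features
    adj : (N, N) non-negative edge weights
    """
    diff = x.unsqueeze(1) - x.unsqueeze(0)             # (N, N, d) pairwise differences x_i - x_j
    norm = diff.norm(dim=-1, keepdim=True).clamp_min(eps)
    weights = adj.unsqueeze(-1) * norm.pow(p - 2)      # w_ij * ||x_i - x_j||^(p-2)
    message = (weights * diff).sum(dim=1)              # aggregate over neighbors j
    return x - step * message                          # gradient-descent-style smoothing update

# Tiny usage example on a 3-node path graph.
adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
x = torch.tensor([[0.0], [1.0], [4.0]])
print(p_laplacian_step(x, adj, p=1.5))
```

For $p = 2$ the update reduces to ordinary Laplacian smoothing, while $p < 2$ down-weights edges with large feature differences and therefore preserves them, which is the behaviour $p$-Laplacian message passing exploits.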