RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- URL: http://arxiv.org/abs/2406.19150v1
- Date: Thu, 27 Jun 2024 13:08:35 GMT
- Title: RAVEN: Multitask Retrieval Augmented Vision-Language Learning
- Authors: Varun Nagaraj Rao, Siddharth Choudhary, Aditya Deshpande, Ravi Kumar Satzoda, Srikar Appalaraju,
- Abstract summary: The scaling of large language models to encode all the world's knowledge is unsustainable and has exacerbated resource barriers.
Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is under explored.
This paper introduces RAVEN, a retrieval augmented VLM framework that enhances base VLMs through efficient, task specific fine-tuning.
- Score: 5.1583788731239455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The scaling of large language models to encode all the world's knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is under explored. Existing methods focus on models designed for single tasks. Furthermore, they're limited by the need for resource intensive pre training, additional parameter requirements, unaddressed modality prioritization and lack of clear benefit over non-retrieval baselines. This paper introduces RAVEN, a multitask retrieval augmented VLM framework that enhances base VLMs through efficient, task specific fine-tuning. By integrating retrieval augmented samples without the need for additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks. Our results and extensive ablations across retrieved modalities for the image captioning and VQA tasks indicate significant performance improvements compared to non retrieved baselines +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps and nearly a +3\% accuracy on specific VQA question types. This underscores the efficacy of applying RAG approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale [66.73529246309033]
multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks.
Existing instruction-tuning datasets only provide phrase-level answers without any intermediate rationales.
We introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales.
arXiv Detail & Related papers (2024-12-06T18:14:24Z) - Sparse Attention Vectors: Generative Multimodal Model Features Are Discriminative Vision-Language Classifiers [79.45405711339322]
Generative Large Multimodal Models (LMMs) excel at a wide variety of vision-language (VL) tasks such as image captioning or visual question answering.
We propose an approach for finding features in the model's latent space to more effectively leverage LMMs for discriminative tasks.
arXiv Detail & Related papers (2024-11-28T18:55:41Z) - Active Prompt Learning with Vision-Language Model Priors [9.173468790066956]
We introduce a class-guided clustering that leverages the pre-trained image and text encoders of vision-language models.
We propose a budget-saving selective querying based on adaptive class-wise thresholds.
arXiv Detail & Related papers (2024-11-23T02:34:33Z) - mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA [78.45521005703958]
multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge.
We propose a novel framework called textbfRetrieval-textbfReftextbfAugmented textbfGeneration (mR$2$AG) which achieves adaptive retrieval and useful information localization.
mR$2$AG significantly outperforms state-of-the-art MLLMs on INFOSEEK and Encyclopedic-VQA
arXiv Detail & Related papers (2024-11-22T16:15:50Z) - ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average accuracy of 0.77% on ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, which shows their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - RoRA-VLM: Robust Retrieval-Augmented Vision Language Models [41.09545760534495]
RORA-VLM is a novel and robust retrieval augmentation framework specifically tailored for vision-language models.
We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets.
arXiv Detail & Related papers (2024-10-11T14:51:00Z) - Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering [6.798129852396113]
In this paper, we explore effective prompting techniques to enhance zero- and few-shot Visual Question Answering (VQA) performance.
We identify that specific templates significantly influence VQA outcomes, underscoring the need for strategic template selection.
To mitigate the challenges associated with evaluating free-form open-ended VQA responses, we introduce a straightforward LLM-guided pre-processing technique.
arXiv Detail & Related papers (2023-06-16T17:47:57Z) - OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.