Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering
- URL: http://arxiv.org/abs/2409.07331v1
- Date: Wed, 11 Sep 2024 15:11:39 GMT
- Title: Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering
- Authors: Weixi Weng, Jieming Zhu, Hao Zhang, Xiaojun Meng, Rui Zhang, Chun Yuan,
- Abstract summary: We propose Retrieval-Augmented MLLM with Compressed Contexts (RACC)
RACC achieves a state-of-the-art (SOTA) performance of 62.9% on OK-VQA.
It significantly reduces inference latency by 22.0%-59.7% compared to the prominent RAVQA-v2.
- Score: 44.54319663913782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have demonstrated great zero-shot performance on visual question answering (VQA). However, when it comes to knowledge-based VQA (KB-VQA), MLLMs may lack human commonsense or specialized domain knowledge to answer such questions and require obtaining necessary information from external knowledge sources. Previous works like Retrival-Augmented VQA-v2 (RAVQA-v2) focus on utilizing as much input information, such as image-based textual descriptions and retrieved knowledge, as possible to improve performance, but they all overlook the issue that with the number of input tokens increasing, inference efficiency significantly decreases, which contradicts the demands of practical applications. To address this issue, we propose Retrieval-Augmented MLLM with Compressed Contexts (RACC). RACC learns to compress and aggregate retrieved contexts, from which it generates a compact modulation in the form of Key-Value (KV) cache. This modulation is then used to adapt the downstream frozen MLLM, thereby achieving effective and efficient inference. RACC achieves a state-of-the-art (SOTA) performance of 62.9% on OK-VQA. Moreover, it significantly reduces inference latency by 22.0%-59.7% compared to the prominent RAVQA-v2. Abundant experiments show RACC's broad applicability. It is compatible with various off-the-shelf MLLMs and can also handle different knowledge sources including textual and multimodal documents.
Related papers
- Declarative Knowledge Distillation from Large Language Models for Visual Question Answering Datasets [9.67464173044675]
Visual Question Answering (VQA) is the task of answering a question about an image.
We present an approach for declarative knowledge distillation from Large Language Models (LLMs)
Our results confirm that distilling knowledge from LLMs is in fact a promising direction besides data-driven rule learning approaches.
arXiv Detail & Related papers (2024-10-12T08:17:03Z) - Enhancing Robustness of Retrieval-Augmented Language Models with In-Context Learning [5.053086684547045]
This study introduces an in-context learning-based approach to enhance the reasoning capabilities of RALMs.
Our approach increases accuracy in identifying unanswerable and conflicting scenarios without requiring additional fine-tuning.
arXiv Detail & Related papers (2024-08-08T12:42:43Z) - RAVEN: Multitask Retrieval Augmented Vision-Language Learning [5.1583788731239455]
The scaling of large language models to encode all the world's knowledge is unsustainable and has exacerbated resource barriers.
Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is under explored.
This paper introduces RAVEN, a retrieval augmented VLM framework that enhances base VLMs through efficient, task specific fine-tuning.
arXiv Detail & Related papers (2024-06-27T13:08:35Z) - REAR: A Relevance-Aware Retrieval-Augmented Framework for Open-Domain
Question Answering [122.62012375722124]
In existing methods, large language models (LLMs) cannot precisely assess the relevance of retrieved documents.
We propose REAR, a RElevance-Aware Retrieval-augmented approach for open-domain question answering (QA)
arXiv Detail & Related papers (2024-02-27T13:22:51Z) - Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks [54.153914606302486]
In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs)
We propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering.
arXiv Detail & Related papers (2023-11-03T14:39:20Z) - Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, 6.41%, and 7.94% points increase on A-OKVQA, and VizWiz respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - Fine-grained Late-interaction Multi-modal Retrieval for Retrieval
Augmented Visual Question Answering [56.96857992123026]
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions.
This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA.
arXiv Detail & Related papers (2023-09-29T10:54:10Z) - Knowledge-Driven CoT: Exploring Faithful Reasoning in LLMs for
Knowledge-intensive Question Answering [17.672572064705445]
Large language models (LLMs) equipped with Chain-of-Thought (CoT) have shown impressive reasoning ability in various downstream tasks.
We propose a framework called Knowledge-Driven Chain-of-Thought (KD-CoT) to verify and modify reasoning traces in CoT via interaction with external knowledge.
arXiv Detail & Related papers (2023-08-25T09:23:55Z) - MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs are within two feature spaces.
We propose Multi-Granularity Alignment architecture for Visual Question Answering task (MGA-VQA)
Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z) - Improving and Diagnosing Knowledge-Based Visual Question Answering via
Entity Enhanced Knowledge Injection [14.678153928301493]
Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowledge in order to correctly answer a text question and associated image.
Recent single text work has shown knowledge injection into pre-trained language models, specifically entity enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks.
arXiv Detail & Related papers (2021-12-13T18:45:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.