Related papers: Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering

Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering

URL: http://arxiv.org/abs/2409.07331v1
Date: Wed, 11 Sep 2024 15:11:39 GMT
Title: Learning to Compress Contexts for Efficient Knowledge-based Visual Question Answering
Authors: Weixi Weng, Jieming Zhu, Hao Zhang, Xiaojun Meng, Rui Zhang, Chun Yuan,
Abstract summary: We propose Retrieval-Augmented MLLM with Compressed Contexts (RACC) RACC achieves a state-of-the-art (SOTA) performance of 62.9% on OK-VQA. It significantly reduces inference latency by 22.0%-59.7% compared to the prominent RAVQA-v2.
Score: 44.54319663913782
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated great zero-shot performance on visual question answering (VQA). However, when it comes to knowledge-based VQA (KB-VQA), MLLMs may lack human commonsense or specialized domain knowledge to answer such questions and require obtaining necessary information from external knowledge sources. Previous works like Retrival-Augmented VQA-v2 (RAVQA-v2) focus on utilizing as much input information, such as image-based textual descriptions and retrieved knowledge, as possible to improve performance, but they all overlook the issue that with the number of input tokens increasing, inference efficiency significantly decreases, which contradicts the demands of practical applications. To address this issue, we propose Retrieval-Augmented MLLM with Compressed Contexts (RACC). RACC learns to compress and aggregate retrieved contexts, from which it generates a compact modulation in the form of Key-Value (KV) cache. This modulation is then used to adapt the downstream frozen MLLM, thereby achieving effective and efficient inference. RACC achieves a state-of-the-art (SOTA) performance of 62.9% on OK-VQA. Moreover, it significantly reduces inference latency by 22.0%-59.7% compared to the prominent RAVQA-v2. Abundant experiments show RACC's broad applicability. It is compatible with various off-the-shelf MLLMs and can also handle different knowledge sources including textual and multimodal documents.

Related papers

SPARQL Query Generation with LLMs: Measuring the Impact of Training Data Memorization and Knowledge Injection [81.78173888579941]
Large Language Models (LLMs) are considered a well-suited method to increase the quality of the question-answering functionality.<n>LLMs are trained on web data, where researchers have no control over whether the benchmark or the knowledge graph was already included in the training data.<n>This paper introduces a novel method that evaluates the quality of LLMs by generating a SPARQL query from a natural-language question.
arXiv Detail & Related papers (2025-07-18T12:28:08Z)
GC-KBVQA: A New Four-Stage Framework for Enhancing Knowledge Based Visual Question Answering Performance [0.9208007322096533]
Knowledge-Based Visual Question Answering (KB-VQA) methods focus on tasks that demand reasoning with information extending beyond the explicit content depicted in the image.<n>Recent approaches leverage Large Language Models (LLMs) as implicit knowledge sources.<n>We introduce a novel four-stage framework called Grounding Caption-Guided Knowledge-Based Visual Question Answering (GC-KBVQA)<n> Innovations include grounding question-aware caption generation to move beyond generic descriptions and have compact, yet detailed and context-rich information.
arXiv Detail & Related papers (2025-05-25T23:00:30Z)
UniRVQA: A Unified Framework for Retrieval-Augmented Vision Question Answering via Self-Reflective Joint Training [16.14877145354785]
We propose a Unified Retrieval-Augmented VQA framework (UniRVQA) for knowledge-intensive visual questions. UniRVQA adapts general multimodal pre-trained models for fine-grained knowledge-intensive tasks within a unified framework. Our approach achieves competitive performance against state-of-the-art models, delivering a significant 4.7% improvement in answering accuracy, and brings an average 7.5% boost in base MLLMs' VQA performance.
arXiv Detail & Related papers (2025-04-05T05:42:12Z)
QuoTA: Query-oriented Token Assignment via CoT Query Decouple for Long Video Comprehension [86.0749609778104]
We propose QuoTA, an ante-hoc training-free modular that extends existing large video-language models. QuoTA strategically allocates frame-level importance scores based on query relevance. We decouple the query through Chain-of-Thoughts reasoning to facilitate more precise LVLM-based frame importance scoring.
arXiv Detail & Related papers (2025-03-11T17:59:57Z)
Fine-Grained Retrieval-Augmented Generation for Visual Question Answering [12.622529359686016]
Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images. Retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs) emerges as a promising approach. This study presents fine-grained knowledge units, which merge textual snippets with entity images stored in vector databases.
arXiv Detail & Related papers (2025-02-28T11:25:38Z)
Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation [81.18701211912779]
We introduce an Adaptive Multi-Aspect Retrieval-augmented over KGs (Amar) framework. This method retrieves knowledge including entities, relations, and subgraphs, and converts each piece of retrieved text into prompt embeddings. Our method has achieved state-of-the-art performance on two common datasets.
arXiv Detail & Related papers (2024-12-24T16:38:04Z)
mR$^2$AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA [78.45521005703958]
multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge. We propose a novel framework called textbfRetrieval-textbfReftextbfAugmented textbfGeneration (mR$2$AG) which achieves adaptive retrieval and useful information localization. mR$2$AG significantly outperforms state-of-the-art MLLMs on INFOSEEK and Encyclopedic-VQA
arXiv Detail & Related papers (2024-11-22T16:15:50Z)
Enhancing Robustness of Retrieval-Augmented Language Models with In-Context Learning [5.053086684547045]
This study introduces an in-context learning-based approach to enhance the reasoning capabilities of RALMs. Our approach increases accuracy in identifying unanswerable and conflicting scenarios without requiring additional fine-tuning.
arXiv Detail & Related papers (2024-08-08T12:42:43Z)
RAVEN: Multitask Retrieval Augmented Vision-Language Learning [5.1583788731239455]
The scaling of large language models to encode all the world's knowledge is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is under explored. This paper introduces RAVEN, a retrieval augmented VLM framework that enhances base VLMs through efficient, task specific fine-tuning.
arXiv Detail & Related papers (2024-06-27T13:08:35Z)
CuriousLLM: Elevating Multi-Document Question Answering with LLM-Enhanced Knowledge Graph Reasoning [0.9295048974480845]
We propose CuriousLLM, an enhancement that integrates a curiosity-driven reasoning mechanism into an LLM agent. This mechanism enables the agent to generate relevant follow-up questions, thereby guiding the information retrieval process more efficiently. Our experiments show that CuriousLLM significantly boosts LLM performance in multi-document question answering (MD-QA)
arXiv Detail & Related papers (2024-04-13T20:43:46Z)
Multi-Clue Reasoning with Memory Augmentation for Knowledge-based Visual Question Answering [32.21000330743921]
We propose a novel framework that endows the model with capabilities of answering more general questions. Specifically, a well-defined detector is adopted to predict image-question related relation phrases. The optimal answer is predicted by choosing the supporting fact with the highest score.
arXiv Detail & Related papers (2023-12-20T02:35:18Z)
Hint-enhanced In-Context Learning wakes Large Language Models up for knowledge-intensive tasks [54.153914606302486]
In-context learning (ICL) ability has emerged with the increasing scale of large language models (LLMs) We propose a new paradigm called Hint-enhanced In-Context Learning (HICL) to explore the power of ICL in open-domain question answering.
arXiv Detail & Related papers (2023-11-03T14:39:20Z)
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model. We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, 6.41%, and 7.94% points increase on A-OKVQA, and VizWiz respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering [56.96857992123026]
Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR) which significantly improves knowledge retrieval in RA-VQA.
arXiv Detail & Related papers (2023-09-29T10:54:10Z)
Knowledge-Driven CoT: Exploring Faithful Reasoning in LLMs for Knowledge-intensive Question Answering [17.672572064705445]
Large language models (LLMs) equipped with Chain-of-Thought (CoT) have shown impressive reasoning ability in various downstream tasks. We propose a framework called Knowledge-Driven Chain-of-Thought (KD-CoT) to verify and modify reasoning traces in CoT via interaction with external knowledge.
arXiv Detail & Related papers (2023-08-25T09:23:55Z)
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering [75.55108621064726]
Learning to answer visual questions is a challenging task since the multi-modal inputs are within two feature spaces. We propose Multi-Granularity Alignment architecture for Visual Question Answering task (MGA-VQA) Our model splits alignment into different levels to achieve learning better correlations without needing additional data and annotations.
arXiv Detail & Related papers (2022-01-25T22:30:54Z)
Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection [14.678153928301493]
Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowledge in order to correctly answer a text question and associated image. Recent single text work has shown knowledge injection into pre-trained language models, specifically entity enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks.
arXiv Detail & Related papers (2021-12-13T18:45:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.