Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries
- URL: http://arxiv.org/abs/2502.16636v1
- Date: Sun, 23 Feb 2025 16:23:50 GMT
- Title: Visual-RAG: Benchmarking Text-to-Image Retrieval Augmented Generation for Visual Knowledge Intensive Queries
- Authors: Yin Wu, Quanyu Long, Jing Li, Jianfei Yu, Wenya Wang
- Abstract summary: Retrieval-Augmented Generation (RAG) is a popular approach for enhancing Large Language Models (LLMs). Visual-RAG requires text-to-image retrieval and integration of relevant clue images to extract visual knowledge as evidence.
- Score: 30.692007887121278
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Retrieval-Augmented Generation (RAG) is a popular approach for enhancing Large Language Models (LLMs) by addressing their limitations in verifying facts and answering knowledge-intensive questions. As LLM research extends these models to handle input modalities other than text, e.g., images, several multimodal RAG benchmarks have been proposed. Nonetheless, they mainly use textual knowledge bases as the primary source of evidence for augmentation. Benchmarks are still lacking that evaluate images as augmentation in RAG systems and how these systems leverage visual knowledge. We propose Visual-RAG, a novel Question Answering benchmark that emphasizes visual knowledge intensive questions. Unlike prior works relying on text-based evidence, Visual-RAG necessitates text-to-image retrieval and integration of relevant clue images to extract visual knowledge as evidence. With Visual-RAG, we evaluate 5 open-sourced and 3 proprietary Multimodal LLMs (MLLMs), revealing that images can serve as good evidence in RAG; however, even the SoTA models struggle with effectively extracting and utilizing visual knowledge.
Related papers
- Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations [65.11348389219887]
We introduce Dialectic-RAG (DRAG), a modular approach that evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives.
We show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models.
arXiv Detail & Related papers (2025-04-07T06:55:15Z)
- A Survey of Multimodal Retrieval-Augmented Generation [3.9616308910160445]
Multimodal Retrieval-Augmented Generation (MRAG) enhances large language models (LLMs) by integrating multimodal data (text, images, videos) into retrieval and generation processes.
Recent studies show MRAG outperforms traditional Retrieval-Augmented Generation (RAG) in scenarios requiring both visual and textual understanding.
arXiv Detail & Related papers (2025-03-26T02:43:09Z)
- Retrieval Augmented Generation and Understanding in Vision: A Survey and New Outlook [85.43403500874889]
Retrieval-augmented generation (RAG) has emerged as a pivotal technique in artificial intelligence (AI).
Recent advancements in RAG for embodied AI, with a particular focus on applications in planning, task execution, multimodal perception, interaction, and specialized domains.
arXiv Detail & Related papers (2025-03-23T10:33:28Z)
- Fine-Grained Retrieval-Augmented Generation for Visual Question Answering [12.622529359686016]
Visual Question Answering (VQA) focuses on providing answers to natural language questions by utilizing information from images.
Retrieval-augmented generation (RAG) leveraging external knowledge bases (KBs) emerges as a promising approach.
This study presents fine-grained knowledge units, which merge textual snippets with entity images stored in vector databases.
arXiv Detail & Related papers (2025-02-28T11:25:38Z)
- Beyond Text: Optimizing RAG with Multimodal Inputs for Industrial Applications [3.7636375810345744]
Large Language Models (LLMs) have demonstrated impressive capabilities in answering questions, but they lack domain-specific knowledge and are prone to hallucinations.
Retrieval Augmented Generation (RAG) is one approach to address these challenges, while multimodal models are emerging as promising AI assistants for processing both text and images.
We describe a series of experiments aimed at determining how to best integrate multimodal models into RAG systems for the industrial domain.
arXiv Detail & Related papers (2024-10-29T11:03:31Z)
- VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents [66.42579289213941]
Retrieval-augmented generation (RAG) is an effective technique that enables large language models to utilize external knowledge sources for generation.
In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline.
In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM.
arXiv Detail & Related papers (2024-10-14T15:04:18Z)
- MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models [115.16022378880376]
We introduce a multimodal retrieval-augmented generation benchmark, MRAG-Bench.
MRAG-Bench consists of 16,130 images and 1,353 human-annotated multiple-choice questions.
Results show that all large vision-language models (LVLMs) exhibit greater improvements when augmented with images compared to textual knowledge.
arXiv Detail & Related papers (2024-10-10T17:55:02Z)
- GEM-RAG: Graphical Eigen Memories For Retrieval Augmented Generation [3.2027710059627545]
We introduce Graphical Eigen Memories For Retrieval Augmented Generation (GEM-RAG).
GEM-RAG works by tagging each chunk of text in a given text corpus with LLM-generated "utility" questions.
We evaluate GEM-RAG, using both UnifiedQA and GPT-3.5 Turbo as the LLMs, with SBERT, and OpenAI's text encoders on two standard QA tasks.
arXiv Detail & Related papers (2024-09-23T21:42:47Z)
- Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark [63.296342841358815]
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images.
The ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering.
We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
arXiv Detail & Related papers (2024-07-18T17:59:30Z)
- UniRAG: Universal Retrieval Augmentation for Large Vision Language Models [76.30799731147589]
We introduce UniRAG, a plug-and-play technique that adds relevant retrieved information to prompts as few-shot examples during inference.
Unlike the common belief that Retrieval Augmentation (RA) mainly improves generation or understanding of uncommon entities, our evaluation results on the MSCOCO dataset with common entities show that both proprietary models and smaller open-source models significantly enhance their generation quality.
arXiv Detail & Related papers (2024-05-16T17:58:45Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Retrieval-Augmented Generation for Large Language Models: A Survey [17.82361213043507]
Large Language Models (LLMs) showcase impressive capabilities but encounter challenges like hallucination.
Retrieval-Augmented Generation (RAG) has emerged as a promising solution by incorporating knowledge from external databases.
arXiv Detail & Related papers (2023-12-18T07:47:33Z)
- External Knowledge Augmented Text Visual Question Answering [0.6445605125467573]
We propose a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision language understanding tasks.
We generate results comparable to the state-of-the-art on two publicly available datasets.
arXiv Detail & Related papers (2021-08-22T13:21:58Z)
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.