VISA: Retrieval Augmented Generation with Visual Source Attribution
- URL: http://arxiv.org/abs/2412.14457v1
- Date: Thu, 19 Dec 2024 02:17:35 GMT
- Title: VISA: Retrieval Augmented Generation with Visual Source Attribution
- Authors: Xueguang Ma, Shengyao Zhuang, Bevan Koopman, Guido Zuccon, Wenhu Chen, Jimmy Lin,
- Abstract summary: Existing approaches in RAG primarily link generated content to document-level references.<n>We propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution.<n>To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the medical domain.
- Score: 100.78278689901593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generation with source attribution is important for enhancing the verifiability of retrieval-augmented generation (RAG) systems. However, existing approaches in RAG primarily link generated content to document-level references, making it challenging for users to locate evidence among multiple content-rich retrieved documents. To address this challenge, we propose Retrieval-Augmented Generation with Visual Source Attribution (VISA), a novel approach that combines answer generation with visual source attribution. Leveraging large vision-language models (VLMs), VISA identifies the evidence and highlights the exact regions that support the generated answers with bounding boxes in the retrieved document screenshots. To evaluate its effectiveness, we curated two datasets: Wiki-VISA, based on crawled Wikipedia webpage screenshots, and Paper-VISA, derived from PubLayNet and tailored to the medical domain. Experimental results demonstrate the effectiveness of VISA for visual source attribution on documents' original look, as well as highlighting the challenges for improvement. Code, data, and model checkpoints will be released.
Related papers
- Separate the Wheat from the Chaff: Winnowing Down Divergent Views in Retrieval Augmented Generation [61.47019392413271]
WinnowRAG is designed to systematically filter out noisy documents while preserving valuable content.<n>WinnowRAG operates in two stages: In Stage I, we perform query-aware clustering to group similar documents and form distinct topic clusters.<n>In Stage II, we perform winnowing, wherein a critic LLM evaluates the outputs of multiple agents and iteratively separates useful documents from noisy ones.
arXiv Detail & Related papers (2025-11-01T20:08:13Z) - RegionRAG: Region-level Retrieval-Augumented Generation for Visually-Rich Documents [40.107303323097646]
modelname is a novel framework that shifts the retrieval paradigm from the document level to the region level.<n> Experiments on six benchmarks demonstrate that RegionRAG achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-10-31T08:00:32Z) - VeriCite: Towards Reliable Citations in Retrieval-Augmented Generation via Rigorous Verification [107.75781898355562]
We introduce a novel framework, called VeriCite, designed to rigorously validate supporting evidence and enhance answer attribution.<n>We conduct experiments across five open-source LLMs and four datasets, demonstrating that VeriCite can significantly improve citation quality while maintaining the correctness of the answers.
arXiv Detail & Related papers (2025-10-13T13:38:54Z) - RAG-IGBench: Innovative Evaluation for RAG-based Interleaved Generation in Open-domain Question Answering [50.42577862494645]
We present RAG-IGBench, a benchmark designed to evaluate the task of Interleaved Generation based on Retrieval-Augmented Generation (RAG-IG) in open-domain question answering.<n>RAG-IG integrates multimodal large language models (MLLMs) with retrieval mechanisms, enabling the models to access external image-text information for generating coherent multimodal content.
arXiv Detail & Related papers (2025-10-11T03:06:39Z) - Enhancing Document VQA Models via Retrieval-Augmented Generation [1.6769365072542683]
Document VQA must cope with documents that span dozens of pages, yet leading systems still rely on very large vision-language models.<n>Retrieval-Augmented Generation (RAG) offers an attractive alternative, first retrieving a concise set of relevant segments before generating answers.
arXiv Detail & Related papers (2025-08-26T12:32:55Z) - LAQuer: Localized Attribution Queries in Content-grounded Generation [69.60308443863606]
Grounded text generation models often produce content that deviates from their source material, requiring user verification to ensure accuracy.<n>Existing attribution methods associate entire sentences with source documents, which can be overwhelming for users seeking to fact-check specific claims.<n>We introduce Localized Attribution Queries (LAQuer), a new task that localizes selected spans of generated output to their corresponding source spans, allowing fine-grained and user-directed attribution.
arXiv Detail & Related papers (2025-06-01T21:46:23Z) - SCAN: Semantic Document Layout Analysis for Textual and Visual Retrieval-Augmented Generation [5.458935851230595]
We present SCAN, a novel approach enhancing both textual and visual Retrieval-Augmented Generation (RAG) systems.<n>SCAN uses a coarse-grained semantic approach that divides documents into coherent regions covering continuous components.<n>Our experimental results across English and Japanese datasets demonstrate that applying SCAN improves end-to-end textual RAG performance by up to 9.0% and visual RAG performance by up to 6.4%.
arXiv Detail & Related papers (2025-05-20T14:03:24Z) - VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents [30.012487475552575]
We introduce a new RAG framework, VDocRAG, which can directly understand varied documents and modalities in a unified image format.
We also introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets.
arXiv Detail & Related papers (2025-04-14T01:50:33Z) - ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents [27.90338725230132]
ViDoSeek is a dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning.
We propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents.
Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark.
arXiv Detail & Related papers (2025-02-25T09:26:12Z) - RetroLLM: Empowering Large Language Models to Retrieve Fine-grained Evidence within Generation [21.764973680014368]
RetroLLM is a unified framework that integrates retrieval and generation into a single, cohesive process.<n>To mitigate false pruning in the process of constrained evidence generation, we introduce hierarchical FM-Index constraints.<n>Experiments on five open-domain QA datasets demonstrate RetroLLM's superior performance across both in-domain and out-of-domain tasks.
arXiv Detail & Related papers (2024-12-16T16:03:25Z) - VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings.<n>We propose VisDoMRAG, a novel multimodal Retrieval Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z) - VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents [66.42579289213941]
Retrieval-augmented generation (RAG) is an effective technique that enables large language models to utilize external knowledge sources for generation.
In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline.
In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM.
arXiv Detail & Related papers (2024-10-14T15:04:18Z) - A Survey of Generative Information Retrieval [25.1249210843116]
Generative Retrieval (GR) is an emerging paradigm in information retrieval that leverages generative models to map queries to relevant document identifiers (DocIDs) without the need for traditional query processing or document reranking.
This survey provides a comprehensive overview of GR, highlighting key developments, indexing and retrieval strategies, and challenges.
arXiv Detail & Related papers (2024-06-03T10:59:33Z) - Corrective Retrieval Augmented Generation [36.04062963574603]
Retrieval-augmented generation (RAG) relies heavily on relevance of retrieved documents, raising concerns about how the model behaves if retrieval goes wrong.
We propose the Corrective Retrieval Augmented Generation (CRAG) to improve the robustness of generation.
CRAG is plug-and-play and can be seamlessly coupled with various RAG-based approaches.
arXiv Detail & Related papers (2024-01-29T04:36:39Z) - Generate rather than Retrieve: Large Language Models are Strong Context
Generators [74.87021992611672]
We present a novel perspective for solving knowledge-intensive tasks by replacing document retrievers with large language model generators.
We call our method generate-then-read (GenRead), which first prompts a large language model to generate contextutal documents based on a given question, and then reads the generated documents to produce the final answer.
arXiv Detail & Related papers (2022-09-21T01:30:59Z) - GERE: Generative Evidence Retrieval for Fact Verification [57.78768817972026]
We propose GERE, the first system that retrieves evidences in a generative fashion.
The experimental results on the FEVER dataset show that GERE achieves significant improvements over the state-of-the-art baselines.
arXiv Detail & Related papers (2022-04-12T03:49:35Z) - SciREX: A Challenge Dataset for Document-Level Information Extraction [56.83748634747753]
It is challenging to create a large-scale information extraction dataset at the document level.
We introduce SciREX, a document level IE dataset that encompasses multiple IE tasks.
We develop a neural model as a strong baseline that extends previous state-of-the-art IE models to document-level IE.
arXiv Detail & Related papers (2020-05-01T17:30:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.