RAG-Anything: All-in-One RAG Framework
- URL: http://arxiv.org/abs/2510.12323v1
- Date: Tue, 14 Oct 2025 09:25:35 GMT
- Title: RAG-Anything: All-in-One RAG Framework
- Authors: Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang
- Abstract summary: We present RAG-Anything, a unified framework that enables comprehensive knowledge retrieval across all modalities. Our approach reconceptualizes multimodal content as interconnected knowledge entities rather than isolated data types.
- Score: 10.858282833070726
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) has emerged as a fundamental paradigm for expanding Large Language Models beyond their static training limitations. However, a critical misalignment exists between current RAG capabilities and real-world information environments. Modern knowledge repositories are inherently multimodal, containing rich combinations of textual content, visual elements, structured tables, and mathematical expressions. Yet existing RAG frameworks are limited to textual content, creating fundamental gaps when processing multimodal documents. We present RAG-Anything, a unified framework that enables comprehensive knowledge retrieval across all modalities. Our approach reconceptualizes multimodal content as interconnected knowledge entities rather than isolated data types. The framework introduces dual-graph construction to capture both cross-modal relationships and textual semantics within a unified representation. We develop cross-modal hybrid retrieval that combines structural knowledge navigation with semantic matching. This enables effective reasoning over heterogeneous content where relevant evidence spans multiple modalities. RAG-Anything demonstrates superior performance on challenging multimodal benchmarks, achieving significant improvements over state-of-the-art methods. Performance gains become particularly pronounced on long documents where traditional approaches fail. Our framework establishes a new paradigm for multimodal knowledge access, eliminating the architectural fragmentation that constrains current systems. Our framework is open-sourced at: https://github.com/HKUDS/RAG-Anything.
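The abstract's "cross-modal hybrid retrieval that combines structural knowledge navigation with semantic matching" can be illustrated with a minimal sketch. The entity class, the 1-hop structural score, and the weighted mix below are illustrative assumptions, not the paper's actual implementation; see the linked repository for the real one.

```python
# Illustrative sketch: candidates are scored both by graph proximity
# (structural navigation) and by cosine similarity over embeddings
# (semantic matching), then the two signals are blended.
from dataclasses import dataclass, field

@dataclass
class Entity:
    name: str
    modality: str              # e.g. "text", "table", "figure", "equation"
    embedding: list[float]
    neighbors: list[str] = field(default_factory=list)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def hybrid_retrieve(graph, query_emb, seeds, k=3, alpha=0.5):
    """Score each entity by a weighted mix of semantic similarity and
    structural proximity (here: 1-hop reachability from seed entities)."""
    reachable = set(seeds)
    for s in seeds:
        reachable.update(graph[s].neighbors)
    scored = []
    for name, ent in graph.items():
        sem = cosine(query_emb, ent.embedding)
        struct = 1.0 if name in reachable else 0.0
        scored.append((alpha * sem + (1 - alpha) * struct, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Toy graph: a table linked to the chart derived from it.
graph = {
    "revenue_table": Entity("revenue_table", "table", [1.0, 0.0],
                            neighbors=["growth_chart"]),
    "growth_chart": Entity("growth_chart", "figure", [0.8, 0.2]),
    "intro_text": Entity("intro_text", "text", [0.0, 1.0]),
}
print(hybrid_retrieve(graph, [1.0, 0.1], seeds=["revenue_table"], k=2))
# → ['revenue_table', 'growth_chart']
```

The graph edge pulls in the chart even though the query embedding alone would rank it lower, which is the point of combining the two signals.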
Related papers
- MegaRAG: Multimodal Knowledge Graph-Based Retrieval Augmented Generation [17.382062394739588]
Large language models (LLMs) struggle with high-level conceptual understanding and holistic comprehension due to limited context windows. We introduce a multimodal knowledge graph-based RAG that enables cross-modal reasoning for better content understanding. Our method incorporates visual cues into the construction of knowledge graphs, the retrieval phase, and the answer generation process.
arXiv Detail & Related papers (2025-11-26T05:00:03Z) - Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z) - Multimodal RAG for Unstructured Data:Leveraging Modality-Aware Knowledge Graphs with Hybrid Retrieval [1.160208922584163]
We present a Modality-Aware Hybrid retrieval Architecture (MAHA) for multimodal question answering with reasoning through a modality-aware knowledge graph. MAHA integrates dense vector retrieval with structured graph traversal, where the knowledge graph encodes cross-modal semantics and relationships. Our work establishes a scalable and interpretable retrieval framework that advances RAG systems by enabling modality-aware reasoning over unstructured multimodal data.
arXiv Detail & Related papers (2025-10-16T11:55:24Z) - UNIDOC-BENCH: A Unified Benchmark for Document-Centric Multimodal RAG [82.84014669683863]
Multimodal retrieval-augmented generation (MM-RAG) is a key approach for applying large language models to real-world knowledge bases. UniDoc-Bench is the first large-scale, realistic benchmark for MM-RAG, built from 70k real-world PDF pages. Our experiments show that multimodal text-image fusion RAG systems consistently outperform both unimodal and jointly multimodal embedding-based retrieval.
arXiv Detail & Related papers (2025-10-04T04:30:13Z) - CMRAG: Co-modality-based visual document retrieval and question answering [21.016544020685668]
The Co-Modality-based RAG (CMRAG) framework can leverage texts and images for more accurate retrieval and generation. Our framework consistently outperforms single-modality-based RAG on multiple visual document question-answering (VDQA) benchmarks.
arXiv Detail & Related papers (2025-09-02T09:17:57Z) - mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering [29.5761347590239]
Retrieval-Augmented Generation (RAG) has been proposed to expand the internal knowledge of Multimodal Large Language Models (MLLMs). In this paper, we propose a novel multimodal knowledge-augmented generation framework (mKG-RAG) based on multimodal KGs for knowledge-intensive VQA tasks.
arXiv Detail & Related papers (2025-08-07T12:22:50Z) - MHier-RAG: Multi-Modal RAG for Visual-Rich Document Question-Answering via Hierarchical and Multi-Granularity Reasoning [5.55667293255701]
The multi-modal long-context document question-answering task aims to locate and integrate multi-modal evidence distributed across multiple pages. Existing methods can be categorized into Large Vision-Language Model (LVLM)-based and Retrieval-Augmented Generation (RAG)-based methods. We propose MHier-RAG, a novel multi-modal RAG model that leverages both textual and visual information across long-range pages.
arXiv Detail & Related papers (2025-08-01T12:22:53Z) - MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs [6.165053219836395]
We propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets.
arXiv Detail & Related papers (2025-07-28T13:16:23Z) - Benchmarking Retrieval-Augmented Multimodal Generation for Document Question Answering [60.062194349648195]
Document Visual Question Answering (DocVQA) faces dual challenges in processing lengthy multimodal documents. Current document retrieval-augmented generation (DocRAG) methods remain limited by their text-centric approaches. We introduce MMDocRAG, a comprehensive benchmark featuring 4,055 expert-annotated QA pairs with multi-page, cross-modal evidence chains.
arXiv Detail & Related papers (2025-05-22T09:52:57Z) - UniversalRAG: Retrieval-Augmented Generation over Corpora of Diverse Modalities and Granularities [53.76854299076118]
UniversalRAG is a novel RAG framework designed to retrieve and integrate knowledge from heterogeneous sources with diverse modalities and granularities. We propose a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. We validate UniversalRAG on 8 benchmarks spanning multiple modalities, showing its superiority over various modality-specific and unified baselines.
arXiv Detail & Related papers (2025-04-29T13:18:58Z) - IDEA: Inverted Text with Cooperative Deformable Aggregation for Multi-modal Object Re-Identification [60.38841251693781]
We propose a novel framework for robust multi-modal object re-identification (ReID). Our framework uses Modal Prefixes and InverseNet to integrate multi-modal information with semantic guidance from inverted text. Experiments on three multi-modal object ReID benchmarks demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2025-03-13T13:00:31Z) - VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval-Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)
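Several of the papers above (most explicitly UniversalRAG) describe modality-aware routing: picking the most appropriate modality-specific corpus before retrieval runs within it. A keyword-based toy router sketches the idea; the keyword rules and corpus names below are assumptions standing in for the learned routers these papers actually use.

```python
# Toy modality router: choose which corpus ("text", "image", "table",
# "video") to search based on surface cues in the query. Real systems
# replace this with a trained classifier or LLM-based router.
def route_query(query: str) -> str:
    rules = {
        "image": ("figure", "photo", "diagram", "chart"),
        "table": ("table", "column", "row", "cell"),
        "video": ("video", "clip", "frame"),
    }
    q = query.lower()
    for corpus, keywords in rules.items():
        if any(kw in q for kw in keywords):
            return corpus
    return "text"  # default corpus when no modality cue is found

print(route_query("What does the growth chart on page 3 show?"))  # image
print(route_query("Summarize the abstract."))                     # text
```

Once a corpus is chosen, retrieval proceeds with that corpus's own embedding model and granularity, which is what lets a single system serve queries whose evidence lives in different modalities.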
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.