Related papers: VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph

URL: http://arxiv.org/abs/2602.12735v1
Date: Fri, 13 Feb 2026 09:05:09 GMT
Title: VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph
Authors: Qiuchen Wang, Shihang Wang, Yu Zeng, Qiang Zhang, Fanrui Zhang, Zhuoning Guo, Bosi Zhang, Wenxuan Huang, Lin Chen, Zehui Chen, Pengjun Xie, Ruixue Ding,
Abstract summary: VimRAG is a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. We propose a Graph-Guided Policy Optimization strategy to disentangle step-wise validity from trajectory-level rewards. Experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks.
Score: 42.348770377488094
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Effectively retrieving, reasoning, and understanding multimodal information remains a critical challenge for agentic systems. Traditional Retrieval-augmented Generation (RAG) methods rely on linear interaction histories, which struggle to handle long-context tasks, especially those involving information-sparse yet token-heavy visual data in iterative reasoning scenarios. To bridge this gap, we introduce VimRAG, a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. Inspired by our systematic study, we model the reasoning process as a dynamic directed acyclic graph that structures the agent states and retrieved multimodal evidence. Building upon this structured memory, we introduce a Graph-Modulated Visual Memory Encoding mechanism, with which the significance of memory nodes is evaluated via their topological position, allowing the model to dynamically allocate high-resolution tokens to pivotal evidence while compressing or discarding trivial clues. To implement this paradigm, we propose a Graph-Guided Policy Optimization strategy. This strategy disentangles step-wise validity from trajectory-level rewards by pruning memory nodes associated with redundant actions, thereby facilitating fine-grained credit assignment. Extensive experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks. The code is available at https://github.com/Alibaba-NLP/VRAG.
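The abstract describes scoring memory nodes by their topological position in a directed acyclic graph and spending more visual tokens on pivotal evidence. The sketch below illustrates that general idea only; it is not the VimRAG implementation. The importance score (one plus the number of descendants, zeroed for nodes with no path to the current state), the proportional split, and all names (`reachable_sets`, `allocate_tokens`, the node labels) are assumptions made for illustration.

```python
from collections import defaultdict


def reachable_sets(nodes, edges):
    """Map each node to the set of nodes reachable from it in the DAG."""
    children = defaultdict(list)
    for src, dst in edges:
        children[src].append(dst)
    cache = {}

    def visit(node):
        # Memoized depth-first traversal; terminates because the graph is acyclic.
        if node not in cache:
            found = set()
            for child in children[node]:
                found.add(child)
                found |= visit(child)
            cache[node] = found
        return cache[node]

    return {node: visit(node) for node in nodes}


def allocate_tokens(nodes, edges, current, budget):
    """Split a token budget across memory nodes by a topological score.

    Hypothetical scoring: a node that cannot reach `current` is treated as a
    dead end and gets 0 tokens; every other node is weighted by 1 plus its
    number of descendants, and the budget is divided proportionally.
    """
    reach = reachable_sets(nodes, edges)
    weights = {}
    for node in nodes:
        on_path = node == current or current in reach[node]
        weights[node] = 1 + len(reach[node]) if on_path else 0
    total = sum(weights.values())
    return {node: budget * w // total for node, w in weights.items()}


# Toy memory graph: a query, two retrieved images, one video clip, and the
# current answer state. "img2" leads nowhere, so it is discarded.
nodes = ["query", "img1", "img2", "vid1", "answer"]
edges = [("query", "img1"), ("query", "img2"), ("img1", "vid1"), ("vid1", "answer")]
allocation = allocate_tokens(nodes, edges, current="answer", budget=1100)
print(allocation)
```

In this toy graph the dead-end image receives no tokens, while nodes earlier on the path to the answer receive larger shares. A real system would likely use a learned or more nuanced importance signal than a descendant count.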

Related papers

Graph-based Agent Memory: Taxonomy, Techniques, and Applications [63.70340159016138]
Memory emerges as the core module in the Large Language Model (LLM)-based agents for long-horizon complex tasks. Among diverse paradigms, graph stands out as a powerful structure for agent memory due to the intrinsic capabilities to model relational dependencies. This survey presents a comprehensive review of agent memory from the graph-based perspective.
arXiv Detail & Related papers (2026-02-05T13:49:05Z)
Graph-Anchored Knowledge Indexing for Retrieval-Augmented Generation [53.42323544075114]
We propose GraphAnchor, a novel Graph-Anchored Knowledge Indexing approach. Experiments on four multi-hop question answering benchmarks demonstrate the effectiveness of GraphAnchor.
arXiv Detail & Related papers (2026-01-23T05:41:05Z)
Disco-RAG: Discourse-Aware Retrieval-Augmented Generation [81.53888908988756]
We propose Disco-RAG, a discourse-aware framework that injects discourse signals into the generation process. Our method constructs intra-chunk discourse trees to capture local hierarchies and builds inter-chunk rhetorical graphs to model cross-passage coherence. Experiments on question answering and long-document summarization benchmarks show the efficacy of our approach.
arXiv Detail & Related papers (2026-01-07T20:32:50Z)
Leveraging Spreading Activation for Improved Document Retrieval in Knowledge-Graph-Based RAG Systems [0.0]
Retrieval-augmented generation (RAG) systems struggle to reliably retrieve and connect the multi-step evidence required for complicated reasoning tasks. Most of the standard RAG frameworks regard all retrieved information as equally reliable, overlooking the varying credibility and interconnected nature of large textual corpora. We propose a novel RAG framework that employs the spreading activation algorithm to retrieve information from a corpus of documents interconnected by automatically constructed knowledge graphs.
arXiv Detail & Related papers (2025-12-17T19:38:35Z)
GRIL: Knowledge Graph Retrieval-Integrated Learning with Large Language Models [59.72897499248909]
We propose a novel graph retriever trained end-to-end with Large Language Models (LLMs). Within the extracted subgraph, structural knowledge and semantic features are encoded via soft tokens and the verbalized graph, respectively, which are infused into the LLM together. Our approach consistently achieves state-of-the-art performance, validating the strength of joint graph-LLM optimization for complex reasoning tasks.
arXiv Detail & Related papers (2025-09-20T02:38:00Z)
MMGraphRAG: Bridging Vision and Language with Interpretable Multimodal Knowledge Graphs [6.165053219836395]
We propose MMGraphRAG, which refines visual content through scene graphs and constructs a multimodal knowledge graph. It employs spectral clustering to achieve cross-modal entity linking and retrieves context along reasoning paths to guide the generative process. Experimental results show that MMGraphRAG achieves state-of-the-art performance on the DocBench and MMLongBench datasets.
arXiv Detail & Related papers (2025-07-28T13:16:23Z)
PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents [15.524189150821147]
Large language models (LLMs) combined with Retrieval-Augmented Generation (RAG) fail to scale in complex, long-term interactions. We propose a flexible external memory framework based on knowledge graphs, automatically constructed and updated by the LLM itself. Building upon the AriGraph architecture, we introduce a novel hybrid graph design that supports both standard edges and two types of hyperedges. We evaluate our system on three benchmarks (TriviaQA, HotpotQA, and DiaASQ), demonstrating that different memory and retrieval configurations yield optimal performance depending on the task.
arXiv Detail & Related papers (2025-06-20T13:52:15Z)
Divide by Question, Conquer by Agent: SPLIT-RAG with Question-Driven Graph Partitioning [62.640169289390535]
SPLIT-RAG is a multi-agent RAG framework that addresses the limitations with question-driven semantic graph partitioning and collaborative subgraph retrieval. The framework first creates a Semantic Partitioning of Linked Information, then uses the Type-Specialized knowledge base to achieve Multi-Agent RAG. The attribute-aware graph segmentation divides knowledge graphs into semantically coherent subgraphs, ensuring subgraphs align with different query types. A hierarchical merging module resolves inconsistencies across subgraph-derived answers through logical verifications.
arXiv Detail & Related papers (2025-05-20T06:44:34Z)
ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents [27.90338725230132]
ViDoSeek is a dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. We propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark.
arXiv Detail & Related papers (2025-02-25T09:26:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.