Model-Document Protocol for AI Search
- URL: http://arxiv.org/abs/2510.25160v2
- Date: Thu, 30 Oct 2025 08:52:17 GMT
- Title: Model-Document Protocol for AI Search
- Authors: Hongjin Qian, Zheng Liu
- Abstract summary: We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to large language models (LLMs). Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process.
- Score: 11.377241012645994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
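The map-reduce style synthesis mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`map_step`, `reduce_step`, `synthesize`) are hypothetical, and keyword overlap stands in for the LLM calls a real system would presumably use. It only shows the general shape: extract query-relevant notes per document (map), then merge and deduplicate them into a compact context (reduce).

```python
import re
from typing import List, Set


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def map_step(query: str, document: str, top_k: int = 1) -> List[str]:
    """Map: keep the top_k sentences of one document by query-token overlap.

    Assumption: token overlap is a stand-in for an LLM relevance judgment.
    """
    q = tokenize(query)
    scored = [
        (len(q & tokenize(s)), i, s)
        for i, s in enumerate(split_sentences(document))
    ]
    scored = [t for t in scored if t[0] > 0]   # drop sentences with no overlap
    scored.sort(key=lambda t: (-t[0], t[1]))   # best score first, then position
    return [s for _, _, s in scored[:top_k]]


def reduce_step(notes: List[List[str]], max_sentences: int = 3) -> str:
    """Reduce: merge per-document notes, drop duplicates, cap the size."""
    seen, merged = set(), []
    for doc_notes in notes:
        for sentence in doc_notes:
            key = sentence.lower()
            if key not in seen:
                seen.add(key)
                merged.append(sentence)
    return " ".join(merged[:max_sentences])


def synthesize(query: str, documents: List[str],
               top_k: int = 1, max_sentences: int = 3) -> str:
    """Run the map step on every document, then reduce the collected notes."""
    return reduce_step([map_step(query, d, top_k) for d in documents],
                       max_sentences)


docs = [
    "MDP defines pathways from documents to LLM inputs. The weather was mild.",
    "Lunch was served at noon. Agentic reasoning curates raw evidence into context.",
    "MDP defines pathways from documents to LLM inputs.",
]
context = synthesize("How does MDP turn raw documents into LLM inputs?", docs)
print(context)
```

In this toy run the off-topic sentences are filtered out in the map step, and the sentence repeated across two documents appears only once after the reduce step.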
Related papers
- Multi-Sourced, Multi-Agent Evidence Retrieval for Fact-Checking [47.47518672198846]
Misinformation spreading over the Internet poses a significant threat to both societies and individuals. Previous methods rely on semantic and social-contextual patterns learned from training data. We propose WKGFC, which exploits an authorized open knowledge graph as a core resource of evidence.
arXiv Detail & Related papers (2026-02-27T19:29:01Z)
- DiffuRank: Effective Document Reranking with Diffusion Language Models [71.16830004674513]
We propose DiffuRank, a reranking framework built upon diffusion language models (dLLMs). dLLMs support more flexible decoding and generation processes that are not constrained to a left-to-right order. We show dLLMs achieve performance comparable to, and in some cases exceeding, that of autoregressive LLMs with similar model sizes.
arXiv Detail & Related papers (2026-02-13T02:18:14Z)
- DeepRead: Document Structure-Aware Reasoning to Enhance Agentic Search [23.447631421934847]
DeepRead is a structure-aware document reasoning agent designed to operationalize document-native structural priors into actionable reasoning capabilities. DeepRead elicits a human-like "locate-then-read" reasoning paradigm, effectively mitigating the context fragmentation inherent in traditional retrieval methods.
arXiv Detail & Related papers (2026-02-04T20:03:28Z)
- URaG: Unified Retrieval and Generation in Multimodal LLMs for Efficient Long Document Understanding [55.45331924836242]
We present URaG, a framework that Unifies Retrieval and Generation within a single MLLM. We show that URaG achieves state-of-the-art performance while reducing computational overhead by 44-56%.
arXiv Detail & Related papers (2025-11-13T17:54:09Z)
- Scaling Beyond Context: A Survey of Multimodal Retrieval-Augmented Generation for Document Understanding [61.36285696607487]
Document understanding is critical for applications from financial analysis to scientific discovery. Current approaches, whether OCR-based pipelines feeding Large Language Models (LLMs) or native Multimodal LLMs (MLLMs), face key limitations. Retrieval-Augmented Generation (RAG) helps ground models in external data, but documents' multimodal nature, combining text, tables, charts, and layout, demands a more advanced paradigm: Multimodal RAG.
arXiv Detail & Related papers (2025-10-17T02:33:16Z)
- MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems [31.434573363421368]
The Mixtures of scenario-aware document Memories (MoM) framework is designed to efficiently handle documents from multiple domains. MoM instructs large language models (LLMs) to simulate domain experts in generating document logical outlines. We incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes.
arXiv Detail & Related papers (2025-10-16T03:09:51Z)
- The Role of Parametric Injection-A Systematic Study of Parametric Retrieval-Augmented Generation [8.544971676258971]
Parametric retrieval-augmented generation (PRAG) encodes documents as model parameters and injects these representations into the model during inference. We show that PRAG captures only partial semantic information of documents, and that relying on parameterized documents alone yields inferior performance compared to interaction at the text level. When parameterized documents are combined with textual documents, the model can leverage relevant information more effectively and become more robust to noisy inputs.
arXiv Detail & Related papers (2025-10-14T16:05:01Z)
- When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs [64.27273946787344]
Recent Long-Context Language Models can process hundreds of thousands of tokens in a single prompt. We recast reasoning as reusable thought caches, derived from prior problem solving traces. We propose an update strategy that iteratively refines templates derived from training data through natural-language feedback.
arXiv Detail & Related papers (2025-10-08T19:52:35Z)
- Beyond Isolated Dots: Benchmarking Structured Table Construction as Deep Knowledge Extraction [28.47810405584841]
The Arranged and Organized Extraction Benchmark (AOE) is designed to evaluate the ability of large language models to comprehend fragmented documents. AOE includes 11 carefully crafted tasks across three diverse domains, requiring models to generate context-specific schema tailored to varied input queries. Results show that even the most advanced models struggled significantly.
arXiv Detail & Related papers (2025-07-22T06:37:51Z)
- Beyond Chunking: Discourse-Aware Hierarchical Retrieval for Long Document Question Answering [51.7493726399073]
We present a discourse-aware hierarchical framework to enhance long document question answering. The framework involves three key innovations: specialized discourse parsing for lengthy documents, LLM-based enhancement of discourse relation nodes, and structure-guided hierarchical retrieval.
arXiv Detail & Related papers (2025-05-26T14:45:12Z)
- M-DocSum: Do LVLMs Genuinely Comprehend Interleaved Image-Text in Document Summarization? [49.53982792497275]
We investigate whether Large Vision-Language Models (LVLMs) genuinely comprehend interleaved image-text in the document. Existing document understanding benchmarks often assess LVLMs using question-answer formats. We introduce a novel and challenging Multimodal Document Summarization Benchmark (M-DocSum-Bench). M-DocSum-Bench comprises 500 high-quality arXiv papers, along with interleaved multimodal summaries aligned with human preferences.
arXiv Detail & Related papers (2025-03-27T07:28:32Z)
- Learning Refined Document Representations for Dense Retrieval via Deliberate Thinking [58.69615583599489]
Deliberate Thinking based Retriever (Debater) is a novel approach that enhances document representations by incorporating a step-by-step thinking process. Debater significantly outperforms existing methods across several retrieval benchmarks.
arXiv Detail & Related papers (2025-02-18T15:56:34Z)
- LightPAL: Lightweight Passage Retrieval for Open Domain Multi-Document Summarization [9.739781953744606]
Open-Domain Multi-Document Summarization (ODMDS) is the task of generating summaries from large document collections in response to user queries.
Traditional retrieve-then-summarize approaches fall short for open-ended queries in ODMDS tasks.
We propose LightPAL, a lightweight passage retrieval method for ODMDS.
arXiv Detail & Related papers (2024-06-18T10:57:27Z)
- Peering into the Mind of Language Models: An Approach for Attribution in Contextual Question Answering [9.86691461253151]
We introduce a novel method for attribution in contextual question answering, leveraging the hidden state representations of large language models (LLMs).
Our approach bypasses the need for extensive model retraining and retrieval model overhead, offering granular attributions and preserving the quality of generated answers.
We present Verifiability-granular, an attribution dataset which has token level annotations for LLM generations in the contextual question answering setup.
arXiv Detail & Related papers (2024-05-28T09:12:44Z)
- PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents [78.27865456183397]
We propose PEARL, a prompting framework to improve reasoning over long documents.
Each stage of PEARL is implemented via zero-shot or few-shot prompting with minimal human input.
We evaluate PEARL on a challenging subset of the QuALITY dataset, which contains questions that require complex reasoning over long narrative texts.
arXiv Detail & Related papers (2023-05-23T23:06:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.