Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations
- URL: http://arxiv.org/abs/2603.01666v1
- Date: Mon, 02 Mar 2026 09:55:00 GMT
- Title: Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations
- Authors: Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Shuliang Liu, Jiahao Huo, Yu Huang, James Kwok, Xuming Hu,
- Abstract summary: ColParse is a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings.<n>Experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains.
- Score: 39.98860473310998
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
Related papers
- Multi-Vector Index Compression in Any Modality [73.7330345057813]
Late interaction has emerged as a dominant paradigm for information retrieval in text, images, visual documents, and videos.<n>We introduce four approaches for index compression: sequence resizing, memory tokens, hierarchical pooling, and a novel attention-guided clustering (AGC)<n>AGC uses an attention-guided mechanism to identify the most semantically salient regions of a document as cluster centroids and to weight token aggregation.
arXiv Detail & Related papers (2026-02-24T18:57:33Z) - Training-Free Acceleration for Document Parsing Vision-Language Model with Hierarchical Speculative Decoding [102.88996030431662]
We propose a training-free and highly efficient acceleration method for document parsing tasks.<n>Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens.<n>We demonstrate the effectiveness of our approach on the general-purpose OmniDocBench.
arXiv Detail & Related papers (2026-02-13T14:22:10Z) - CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding [71.88471147281406]
We propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings.<n>By incorporating iterative margin loss during contrastive training, CausalEmbed encourages embedding models to learn compact and well-structured representations.<n>Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count.
arXiv Detail & Related papers (2026-01-29T04:47:27Z) - Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding [35.429403152845836]
Youtu-Parsing is an efficient and versatile document parsing model designed for high-performance content extraction.<n>The model exhibits strong robustness when handling rare characters, multilingual text, and handwritten content.<n>Youtu-Parsing achieves state-of-the-art (SOTA) performance on the OmniDocBench and olmOCR-bench benchmarks.
arXiv Detail & Related papers (2026-01-28T09:37:13Z) - PARL: Position-Aware Relation Learning Network for Document Layout Analysis [23.497081928689525]
We argue that effective layout analysis depends not on text-visual fusion, but on a deep understanding of documents' intrinsic visual structure.<n>We propose a novel OCR-free, vision-only framework that models layout through positional sensitivity and relational structure.<n>Experiments show that PARL (65M) is highly efficient, using roughly four times fewer parameters than large multimodal models.
arXiv Detail & Related papers (2026-01-12T15:05:35Z) - Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing [46.14775667559124]
Document parsing from scanned images remains a significant challenge due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables.<n>Existing supervised fine-tuning methods often struggle to generalize across diverse document types, leading to poor performance, particularly on out-of-distribution data.<n>We introduce LayoutRL, a reinforcement learning framework that optimize layout understanding through composite rewards integrating normalized edit distance count accuracy, and reading order preservation.<n>We show that Infinity-Bench consistently achieves state-of-the-art performance across a broad range of document types, languages, and structural complexities.
arXiv Detail & Related papers (2025-10-17T06:26:59Z) - CAL-RAG: Retrieval-Augmented Multi-Agent Generation for Content-Aware Layout Design [6.830055289299306]
CAL-RAG is a retrieval-augmented, agentic framework for content-aware layout generation.<n>We implement our framework using LangGraph and evaluate it on a benchmark rich in semantic variability.<n>Results demonstrate that combining retrieval augmentation with agentic multi-step reasoning yields a scalable, interpretable, and high-fidelity solution.
arXiv Detail & Related papers (2025-06-27T06:09:56Z) - OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models [58.45517851437422]
Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding.<n>Existing solutions often rely on task-specific architectures and objectives for individual tasks.<n>In this paper, we introduce Omni V2, a universal model that unifies VsTP typical tasks, including text spotting, key information extraction, table recognition, and layout analysis.
arXiv Detail & Related papers (2025-02-22T09:32:01Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.<n>We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.<n>We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.