CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
- URL: http://arxiv.org/abs/2601.21262v1
- Date: Thu, 29 Jan 2026 04:47:27 GMT
- Title: CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
- Authors: Jiahao Huo, Yu Huang, Yibo Yan, Ye Pan, Yi Cao, Mingdong Ou, Philip S. Yu, Xuming Hu
- Abstract summary: We propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count.
- Score: 71.88471147281406
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead caused by representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages the embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in terms of training efficiency and scalability at test time. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm within multimodal document retrieval.
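The abstract describes the recipe only at a high level, so the following is a minimal, hypothetical sketch of the general idea: a small decoder autoregressively rolls out a handful of latent embedding vectors from a page's visual features, scored with late interaction and trained with an in-batch contrastive loss plus a margin term. Every concrete choice below (the class name AutoRegressiveEmbedder, the number of vectors, the MaxSim scorer, and the way the margin is applied) is an illustrative assumption, not the authors' implementation.

```python
# Minimal sketch (assumption, not the authors' released code): an auto-regressive
# "latent decoder" that rolls out K embedding vectors from a page's visual
# features, plus an in-batch contrastive loss with a simple margin term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AutoRegressiveEmbedder(nn.Module):
    def __init__(self, dim=1024, num_vectors=16, num_layers=2):
        super().__init__()
        self.num_vectors = num_vectors
        self.start = nn.Parameter(torch.zeros(1, 1, dim))  # learned "begin" latent
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.proj = nn.Linear(dim, dim)

    def forward(self, visual_feats):
        """visual_feats: (B, N, dim) patch features from the MLLM backbone."""
        B = visual_feats.size(0)
        generated = self.start.expand(B, -1, -1)
        for _ in range(self.num_vectors):
            L = generated.size(1)
            causal = torch.triu(torch.full((L, L), float("-inf"),
                                           device=generated.device), diagonal=1)
            out = self.decoder(generated, visual_feats, tgt_mask=causal)
            nxt = self.proj(out[:, -1:, :])            # next latent embedding vector
            generated = torch.cat([generated, nxt], dim=1)
        return F.normalize(generated[:, 1:, :], dim=-1)  # (B, K, dim) multi-vector embedding

def late_interaction_score(q_vecs, d_vecs):
    """ColBERT-style MaxSim between query and document multi-vectors."""
    sim = torch.einsum("qkd,pnd->qpkn", q_vecs, d_vecs)
    return sim.max(dim=-1).values.sum(dim=-1)            # (Q, P) relevance scores

def contrastive_margin_loss(q_vecs, d_vecs, margin=0.1, temperature=0.05):
    """In-batch InfoNCE plus a hinge margin against the hardest in-batch negative;
    the paper's iterative margin loss is only approximated here."""
    scores = late_interaction_score(q_vecs, d_vecs)       # (B, B), diagonal = positives
    labels = torch.arange(scores.size(0), device=scores.device)
    nce = F.cross_entropy(scores / temperature, labels)
    pos = scores.diagonal()
    eye = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg = scores.masked_fill(eye, float("-inf")).max(dim=1).values
    return nce + F.relu(margin - (pos - neg)).mean()
```

Under this reading, the test-time scaling mentioned in the abstract would presumably correspond to varying how many latent vectors are rolled out at inference, trading retrieval quality against storage and scoring cost.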
Related papers
- Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations [39.98860473310998]
ColParse is a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings. Experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains.
arXiv Detail & Related papers (2026-03-02T09:55:00Z) - VimRAG: Navigating Massive Visual Context in Retrieval-Augmented Generation via Multimodal Memory Graph [42.348770377488094]
VimRAG is a framework tailored for multimodal Retrieval-augmented Reasoning across text, images, and videos. We propose a Graph-Guided Policy Optimization strategy to disentangle step-wise validity from trajectory-level rewards. Experiments demonstrate that VimRAG consistently achieves state-of-the-art performance on diverse multimodal RAG benchmarks.
arXiv Detail & Related papers (2026-02-13T09:05:09Z) - ViTCoP: Accelerating Large Vision-Language Models via Visual and Textual Semantic Collaborative Pruning [8.933549837045932]
Large Vision-Language Models incur high computational costs due to significant redundancy in their visual tokens. We propose a Visual and Textual Semantic Collaborative Pruning framework (ViTCoP) that combines redundancy filtering in the vision encoder with step-wise co-pruning within the Large Language Models.
arXiv Detail & Related papers (2026-01-25T12:47:30Z) - Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization [10.476757608225475]
Multimodal encoders have pushed the boundaries of visual document retrieval. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations. We investigate whether a lightweight dense text retriever can enhance a stronger vision-centric model.
arXiv Detail & Related papers (2025-10-06T17:12:53Z) - v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning [27.688428439248607]
We introduce v1, a lightweight extension that enables active visual referencing through a simple point-and-copy approach. This allows the model to identify relevant image patches and copy their embeddings back into the reasoning stream. Our pointing strategy lets the MLLM directly select image patches using their semantic representations as keys, keeping perceptual evidence embedded in the same space as the model's reasoning.
arXiv Detail & Related papers (2025-05-24T19:30:47Z) - CAFe: Unifying Representation and Generation with Contrastive-Autoregressive Finetuning [24.981279071712173]
We introduce CAFe, a contrastive-autoregressive fine-tuning framework that enhances LVLMs for both representation and generative tasks. Our approach unifies these traditionally separate tasks, achieving state-of-the-art results in both multimodal retrieval and multimodal generative benchmarks.
arXiv Detail & Related papers (2025-03-25T17:57:17Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which maps visual features to probability distributions over the Large Multi-modal Model's vocabulary.
We further explore the distribution of visual features in the semantic space within LMMs and the possibility of using text embeddings to represent visual information (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2024-03-12T14:58:52Z) - MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer [66.71930982549028]
Vision-Language Transformers (VLTs) have shown great success recently, but are accompanied by heavy computation costs.
We propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs.
arXiv Detail & Related papers (2024-03-05T14:13:50Z) - ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z) - $\textit{latent}$-GLAT: Glancing at Latent Variables for Parallel Text Generation [65.29170569821093]
Parallel text generation has received widespread attention due to its advantages in generation efficiency.
In this paper, we propose $\textit{latent}$-GLAT, which employs discrete latent variables to capture word categorical information.
Experiment results show that our method outperforms strong baselines without the help of an autoregressive model.
arXiv Detail & Related papers (2022-04-05T07:34:12Z)
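As a side note on the "Multi-modal Auto-regressive Modeling via Visual Words" entry above, the visual-words idea (scoring each visual feature against the language model's token-embedding matrix to obtain a per-patch probability distribution over its vocabulary) can be illustrated with a short sketch. The class name VisualWordHead, the projection layer, and the temperature are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of the visual-words idea referenced above: map each
# visual feature into the LLM embedding space, score it against every token
# embedding, and softmax to get a distribution over the vocabulary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisualWordHead(nn.Module):
    def __init__(self, visual_dim, llm_embed_weight):
        super().__init__()
        # llm_embed_weight: (vocab_size, llm_dim), e.g. model.get_input_embeddings().weight
        self.proj = nn.Linear(visual_dim, llm_embed_weight.size(1))
        self.register_buffer("vocab_embed", llm_embed_weight.detach().clone())

    def forward(self, visual_feats, temperature=1.0):
        """visual_feats: (B, N, visual_dim) -> (B, N, vocab_size) distributions."""
        v = self.proj(visual_feats)        # project into the LLM embedding space
        logits = v @ self.vocab_embed.T    # similarity to every vocabulary embedding
        return F.softmax(logits / temperature, dim=-1)
```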