ModernVBERT: Towards Smaller Visual Document Retrievers
- URL: http://arxiv.org/abs/2510.01149v1
- Date: Wed, 01 Oct 2025 17:41:17 GMT
- Title: ModernVBERT: Towards Smaller Visual Document Retrievers
- Authors: Paul Teiletche, Quentin Macé, Max Conti, Antonio Loison, Gautier Viaud, Pierre Colombo, Manuel Faysse,
- Abstract summary: ModernVBERT is a compact vision-language encoder that outperforms models up to 10 times larger when finetuned on document retrieval tasks. We measure the impact of attention masking, image resolution, modality alignment data regimes, and late-interaction-centered contrastive objectives, which emerge as central performance factors.
- Score: 8.752477008109844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal embedding models are gaining prevalence, notably for document retrieval, as efficient alternatives to text-only pipelines. These models are typically built by finetuning large vision-language decoders (VLMs) with contrastive losses on text-image pairs. In this work, we show that, while cost-efficient, this repurposing approach often bottlenecks retrieval performance. Through controlled experiments, we establish a principled recipe for improving visual document retrieval models. We notably measure the impact of attention masking, image resolution, modality alignment data regimes, and late-interaction-centered contrastive objectives, which emerge as central performance factors. Building on these insights, we release ModernVBERT, a compact 250M-parameter vision-language encoder that outperforms models up to 10 times larger when finetuned on document retrieval tasks. Models and code are made available at https://huggingface.co/ModernVBERT.
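To make the "late interaction" objective concrete: relevance is computed between all query-token and document-token embeddings, keeping each query token's best match (ColBERT-style MaxSim). The sketch below is a minimal, generic implementation; tensor shapes and names are illustrative and not taken from the ModernVBERT codebase.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) relevance score.

    query_emb: (num_query_tokens, dim) L2-normalized token embeddings.
    doc_emb:   (num_doc_tokens, dim)   L2-normalized token embeddings
               (e.g., one embedding per image patch of a document page).
    """
    # Cosine similarity between every query token and every document token.
    sim = query_emb @ doc_emb.T                # (num_q, num_d)
    # Each query token keeps only its best-matching document token;
    # the per-token maxima are summed into a single score.
    return sim.max(dim=1).values.sum()

# Illustrative usage with random unit vectors.
q = torch.nn.functional.normalize(torch.randn(16, 128), dim=-1)
d = torch.nn.functional.normalize(torch.randn(1024, 128), dim=-1)
print(maxsim_score(q, d))
```

In a contrastive objective centered on this score, it replaces the single dot product of standard bi-encoder setups, so fine-grained matches (a query word against a specific page region) drive the gradient.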
Related papers
- PARL: Position-Aware Relation Learning Network for Document Layout Analysis [23.497081928689525]
We argue that effective layout analysis depends not on text-visual fusion, but on a deep understanding of documents' intrinsic visual structure.
We propose PARL, a novel OCR-free, vision-only framework that models layout through positional sensitivity and relational structure.
Experiments show that PARL (65M) is highly efficient, using roughly four times fewer parameters than large multimodal models.
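PARL's architecture is not detailed in this summary; as a generic illustration of "relational structure" over a page layout, the sketch below computes pairwise geometric features between detected regions. The feature choice (center offsets plus log size ratios) is a common convention, not PARL's actual design.

```python
import torch

def pairwise_position_features(boxes: torch.Tensor) -> torch.Tensor:
    """Relative-position features for every ordered pair of layout regions.

    boxes: (n, 4) tensor of (x_center, y_center, width, height), normalized
    to [0, 1] page coordinates. Returns an (n, n, 4) tensor of center offsets
    and log size ratios that a relation-learning model can consume.
    """
    xy, wh = boxes[:, :2], boxes[:, 2:]
    delta_xy = xy[:, None, :] - xy[None, :, :]           # center offsets
    log_wh = torch.log(wh[:, None, :] / wh[None, :, :])  # log size ratios
    return torch.cat([delta_xy, log_wh], dim=-1)         # (n, n, 4)
```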
arXiv Detail & Related papers (2026-01-12T15:05:35Z) - Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization [10.476757608225475]
Multimodal encoders have pushed the boundaries of visual document retrieval.
Recent models relying on this paradigm have massively scaled the sizes of their query and document representations.
We investigate whether a lightweight dense text retriever can enhance a stronger vision-centric model.
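GQR refines the query at test time; as a simpler stand-in for hybrid retrieval, the sketch below merely fuses the score lists of a vision-centric retriever and a dense text retriever. The min-max normalization and the `alpha` weighting are assumptions for illustration, not the paper's method.

```python
import numpy as np

def hybrid_rerank(vision_scores: np.ndarray, text_scores: np.ndarray,
                  alpha: float = 0.7) -> np.ndarray:
    """Fuse scores from a vision-centric retriever and a dense text retriever.

    Both arrays are (num_docs,). Min-max normalization puts the two score
    scales on comparable footing before a convex combination; alpha weights
    the stronger vision model. Returns document indices, best first.
    """
    def norm(s: np.ndarray) -> np.ndarray:
        return (s - s.min()) / (s.max() - s.min() + 1e-8)
    fused = alpha * norm(vision_scores) + (1 - alpha) * norm(text_scores)
    return np.argsort(-fused)
```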
arXiv Detail & Related papers (2025-10-06T17:12:53Z) - Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions [81.33113485830711]
We introduce a vision-free, single-encoder retrieval pipeline for vision-language models.
We migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions.
Our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks.
arXiv Detail & Related papers (2025-09-23T16:22:27Z) - SERVAL: Surprisingly Effective Zero-Shot Visual Document Retrieval Powered by Large Vision and Language Models [17.85605201420847]
Visual Document Retrieval (VDR) typically operates as text-to-image retrieval using specialized bi-encoders trained to directly embed document images.
We revisit a zero-shot generate-and-encode pipeline: a vision-language model first produces a detailed textual description of each document image.
On the ViDoRe-v2 benchmark, the method reaches 63.4% nDCG@5, surpassing the strongest specialized multi-vector visual document encoder.
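Both this entry and the previous one describe generate-and-encode pipelines. A minimal, model-agnostic sketch follows; `describe` and `embed` are hypothetical wrappers around any vision-language model and any text embedding model, injected so that no specific API is assumed.

```python
from typing import Callable, Sequence
import numpy as np

def generate_and_encode(images: Sequence, describe: Callable[[object], str],
                        embed: Callable[[str], np.ndarray]) -> np.ndarray:
    """Vision-free indexing: caption each document image with a VLM,
    then embed the caption with an ordinary text encoder."""
    return np.stack([embed(describe(img)) for img in images])

def search(query: str, index: np.ndarray,
           embed: Callable[[str], np.ndarray], k: int = 5) -> np.ndarray:
    """Text-to-text retrieval over the description embeddings."""
    q = embed(query)
    # Cosine similarity between the query and every indexed description.
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q) + 1e-8)
    return np.argsort(-scores)[:k]
```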
arXiv Detail & Related papers (2025-09-18T21:11:13Z) - SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension [77.93156509994994]
We show how to represent short chunks in a way that is conditioned on a broader context window to enhance retrieval performance.
Existing embedding models are not well-equipped to encode such situated context effectively.
Our method substantially outperforms state-of-the-art embedding models.
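SitEmb's conditioning mechanism is more involved than this, but the simplest form of "situated" chunk embedding is to encode each chunk together with its neighbors; the `[CHUNK]` marker and window size below are illustrative assumptions.

```python
from typing import Callable, List
import numpy as np

def situated_chunk_embeddings(chunks: List[str],
                              embed: Callable[[str], np.ndarray],
                              window: int = 2) -> np.ndarray:
    """Embed each short chunk conditioned on a broader context window.

    Concatenates `window` neighboring chunks on each side before encoding,
    so the embedding reflects the chunk's situation in the document rather
    than the chunk text alone.
    """
    out = []
    for i, chunk in enumerate(chunks):
        ctx = chunks[max(0, i - window): i + window + 1]
        out.append(embed(" ".join(ctx) + " [CHUNK] " + chunk))
    return np.stack(out)
```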
arXiv Detail & Related papers (2025-08-03T23:59:31Z) - QID: Efficient Query-Informed ViTs in Data-Scarce Regimes for OCR-free Visual Document Understanding [53.69841526266547]
Fine-tuning a pre-trained Vision-Language Model with new datasets often falls short in optimizing the vision encoder.
We introduce QID, a novel, streamlined, architecture-preserving approach that integrates query embeddings into the vision encoder.
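QID's exact integration is not specified in this summary; one architecture-preserving way to inject query information, sketched below, is to project query embeddings and prepend them as extra tokens to the vision encoder's patch sequence. This is a generic illustration, not QID's actual mechanism.

```python
import torch
import torch.nn as nn

class QueryInformedTokens(nn.Module):
    """Prepend projected query embeddings to a ViT's patch-token sequence.

    The transformer blocks themselves are untouched: the query simply
    enters as extra tokens that the patch tokens can attend to.
    """
    def __init__(self, query_dim: int, vit_dim: int):
        super().__init__()
        self.proj = nn.Linear(query_dim, vit_dim)

    def forward(self, patch_tokens: torch.Tensor,
                query_emb: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, vit_dim)
        # query_emb:    (batch, num_query_tokens, query_dim)
        q = self.proj(query_emb)
        return torch.cat([q, patch_tokens], dim=1)
```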
arXiv Detail & Related papers (2025-04-03T18:47:16Z) - Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
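The MME's compression is learned; the sketch below illustrates only the nested-granularity idea with fixed average pooling, reducing one visual token sequence to several coarse-to-fine views that a retriever can trade off for speed versus fidelity.

```python
import torch
import torch.nn.functional as F

def matryoshka_pool(tokens: torch.Tensor,
                    granularities=(1, 4, 16)) -> list:
    """Compress a visual token sequence to several granularities.

    tokens: (num_tokens, dim). For each target count g, average-pool the
    sequence down to g tokens, yielding nested coarse-to-fine views.
    """
    seq = tokens.T.unsqueeze(0)                           # (1, dim, num_tokens)
    return [F.adaptive_avg_pool1d(seq, g).squeeze(0).T    # (g, dim) per level
            for g in granularities]
```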
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - ColPali: Efficient Document Retrieval with Vision Language Models [15.369861972085136]
We introduce the Visual Document Retrieval Benchmark ViDoRe, composed of various page-level retrieval tasks spanning multiple domains, languages, and practical settings.
The inherent complexity and performance shortcomings of modern systems motivate a new concept: doing document retrieval by directly embedding the images of document pages.
We release ColPali, a Vision Language Model trained to produce high-quality multi-vector embeddings from images of document pages.
arXiv Detail & Related papers (2024-06-27T15:45:29Z) - Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning [153.98100182439165]
We introduce a Retrieval-augmented Visual Language Model, Re-ViLM, built upon Flamingo.
By storing knowledge explicitly in an external database, our approach reduces the number of model parameters.
We demonstrate that Re-ViLM significantly boosts performance for image-to-text generation tasks.
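A minimal sketch of the retrieval-augmentation idea, with knowledge kept in an external datastore rather than in model weights; the class and its interface are hypothetical, not Re-ViLM's implementation.

```python
import numpy as np

class CaptionDatastore:
    """External (image embedding, caption) store for retrieval augmentation.

    At inference, the captions of the nearest stored images are retrieved
    and handed to the captioning model as extra conditioning text.
    """
    def __init__(self, keys: np.ndarray, captions: list):
        # keys: (n, dim) L2-normalized image embeddings.
        self.keys, self.captions = keys, captions

    def retrieve(self, query_emb: np.ndarray, k: int = 4) -> list:
        scores = self.keys @ query_emb        # cosine similarity on unit vectors
        top = np.argsort(-scores)[:k]
        return [self.captions[i] for i in top]
```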
arXiv Detail & Related papers (2023-02-09T18:57:56Z) - Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
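A sketch of the quantization step implied here: image patch features are snapped to their nearest entries in a frozen language model's token-embedding table, so the image is "spelled out" in the LM's vocabulary. LQAE trains the encoder around this lookup; only the lookup itself is shown.

```python
import torch

def quantize_to_text_tokens(patch_feats: torch.Tensor,
                            token_embeddings: torch.Tensor) -> torch.Tensor:
    """Map image patch features onto the nearest LM token embeddings.

    patch_feats:      (num_patches, dim) encoder outputs for one image.
    token_embeddings: (vocab_size, dim)  frozen LM embedding table.
    Returns one token id per patch.
    """
    # Euclidean distance from each patch to each vocabulary embedding.
    dists = torch.cdist(patch_feats, token_embeddings)  # (num_patches, vocab)
    return dists.argmin(dim=1)                          # nearest token id per patch
```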
arXiv Detail & Related papers (2023-02-02T06:38:44Z) - Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching [28.190001111358438]
We propose a Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching.
Our model contains several innovations to adapt self-attention models for longer text input.
We will open-source a Wikipedia-based benchmark dataset, code, and a pre-trained checkpoint to accelerate future research on long-form document matching.
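A generic sketch of the two-level idea: encode fixed-size sentence blocks independently, then run a second encoder over the block embeddings, so documents far beyond one transformer window reduce to a single vector. `encode_block` and `encode_doc` are hypothetical stand-ins for SMITH's two transformer stages.

```python
from typing import Callable, List
import numpy as np

def hierarchical_document_embedding(
        blocks: List[str],
        encode_block: Callable[[str], np.ndarray],
        encode_doc: Callable[[np.ndarray], np.ndarray]) -> np.ndarray:
    """Two-level encoding for documents longer than one transformer window.

    Each sentence block (e.g., <= 512 tokens) is encoded independently,
    then a second encoder runs over the sequence of block embeddings to
    produce one document vector. Matching two documents is then a single
    dot product between their vectors, as in a Siamese setup.
    """
    block_embs = np.stack([encode_block(b) for b in blocks])  # (num_blocks, dim)
    return encode_doc(block_embs)                             # (dim,)
```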
arXiv Detail & Related papers (2020-04-26T07:04:08Z)