Related papers: Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search

Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search

URL: http://arxiv.org/abs/2602.12510v1
Date: Fri, 13 Feb 2026 01:27:39 GMT
Title: Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search
Authors: Ara Yeroyan,
Abstract summary: Multi-vector visual retrievers deliver strong accuracy, but scale poorly because each page yields thousands of vectors.<n>We present Visual RAG Toolkit, a system for scaling visual multi-vector retrieval with training-free, model-aware pooling and multi-stage retrieval.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-vector visual retrievers (e.g., ColPali-style late interaction models) deliver strong accuracy, but scale poorly because each page yields thousands of vectors, making indexing and search increasingly expensive. We present Visual RAG Toolkit, a practical system for scaling visual multi-vector retrieval with training-free, model-aware pooling and multi-stage retrieval. Motivated by Matryoshka Embeddings, our method performs static spatial pooling - including a lightweight sliding-window averaging variant - over patch embeddings to produce compact tile-level and global representations for fast candidate generation, followed by exact MaxSim reranking using full multi-vector embeddings. Our design yields a quadratic reduction in vector-to-vector comparisons by reducing stored vectors per page from thousands to dozens, notably without requiring post-training, adapters, or distillation. Across experiments with interaction-style models such as ColPali and ColSmol-500M, we observe that over the limited ViDoRe v2 benchmark corpus 2-stage retrieval typically preserves NDCG and Recall @ 5/10 with minimal degradation, while substantially improving throughput (approximately 4x QPS); with sensitivity mainly at very large k. The toolkit additionally provides robust preprocessing - high resolution PDF to image conversion, optional margin/empty-region cropping and token hygiene (indexing only visual tokens) - and a reproducible evaluation pipeline, enabling rapid exploration of two-, three-, and cascaded retrieval variants. By emphasizing efficiency at common cutoffs (e.g., k <= 10), the toolkit lowers hardware barriers and makes state-of-the-art visual retrieval more accessible in practice.

Related papers

Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations [39.98860473310998]
ColParse is a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings.<n>Experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains.
arXiv Detail & Related papers (2026-03-02T09:55:00Z)
CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding [71.88471147281406]
We propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings.<n>By incorporating iterative margin loss during contrastive training, CausalEmbed encourages embedding models to learn compact and well-structured representations.<n>Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count.
arXiv Detail & Related papers (2026-01-29T04:47:27Z)
MuSASplat: Efficient Sparse-View 3D Gaussian Splats via Lightweight Multi-Scale Adaptation [92.57609195819647]
MuSASplat is a novel framework that dramatically reduces the computational burden of training pose-free feed-forward 3D Gaussian splats models.<n>Central to our approach is a lightweight Multi-Scale Adapter that enables efficient fine-tuning of ViT-based architectures with only a small fraction of training parameters.
arXiv Detail & Related papers (2025-12-08T04:56:46Z)
Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy [36.03315207229038]
HEAVEN is a two-stage hybrid-vector framework for visually rich document retrieval.<n>It efficiently retrieves candidate pages using a single-vector method over Visually-Summarized Pages.<n>It reranks candidates with a multi-vector method while filtering query tokens by linguistic importance to reduce redundant computations.
arXiv Detail & Related papers (2025-10-25T08:27:37Z)
Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization [10.476757608225475]
Multimodal encoders have pushed the boundaries of visual document retrieval.<n>Recent models relying on this paradigm have massively scaled the sizes of their query and document representations.<n>We investigate whether a lightweight dense text retriever can enhance a stronger vision-centric model.
arXiv Detail & Related papers (2025-10-06T17:12:53Z)
SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer [62.11796778482088]
We present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap of object detection between close-up and HRW shots.<n>The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects.<n> experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (up to 5.8%) and speed (up to 3x) over the state-of-the-art approaches.
arXiv Detail & Related papers (2025-02-11T03:21:25Z)
Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors. We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval. Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets. Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
Deep Forest with Hashing Screening and Window Screening [25.745779145969053]
We introduce a hashing screening mechanism for multi-grained scanning of gcForest. We propose a model called HW-Forest which adopts two strategies, hashing screening and window screening. Our experimental results show that HW-Forest has higher accuracy than other models, and the time cost is also reduced.
arXiv Detail & Related papers (2022-07-25T07:39:55Z)
Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space [35.04846842178276]
We introduce a pure vision transformer slimming (ViT-Slim) framework that can search such a sub-structure across multiple dimensions. Our method is based on a learnable and unified l1 sparsity constraint with pre-defined factors to reflect the global importance in the continuous searching space of different dimensions. Our ViT-Slim can compress up to 40% of parameters and 40% FLOPs on various vision transformers while increasing the accuracy by 0.6% on ImageNet.
arXiv Detail & Related papers (2022-01-03T18:59:54Z)
Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets. For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales. An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture. We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions. Results show our method's effectiveness in detecting small-scaled pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)

This list is automatically generated from the titles and abstracts of the papers in this site.