Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search
- URL: http://arxiv.org/abs/2602.12510v1
- Date: Fri, 13 Feb 2026 01:27:39 GMT
- Title: Visual RAG Toolkit: Scaling Multi-Vector Visual Retrieval with Training-Free Pooling and Multi-Stage Search
- Authors: Ara Yeroyan
- Abstract summary: Multi-vector visual retrievers deliver strong accuracy, but scale poorly because each page yields thousands of vectors. We present Visual RAG Toolkit, a system for scaling visual multi-vector retrieval with training-free, model-aware pooling and multi-stage retrieval.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-vector visual retrievers (e.g., ColPali-style late interaction models) deliver strong accuracy, but scale poorly because each page yields thousands of vectors, making indexing and search increasingly expensive. We present Visual RAG Toolkit, a practical system for scaling visual multi-vector retrieval with training-free, model-aware pooling and multi-stage retrieval. Motivated by Matryoshka Embeddings, our method performs static spatial pooling (including a lightweight sliding-window averaging variant) over patch embeddings to produce compact tile-level and global representations for fast candidate generation, followed by exact MaxSim reranking using full multi-vector embeddings. Our design yields a quadratic reduction in vector-to-vector comparisons by reducing stored vectors per page from thousands to dozens, notably without requiring post-training, adapters, or distillation. In experiments with interaction-style models such as ColPali and ColSmol-500M on the limited ViDoRe v2 benchmark corpus, two-stage retrieval typically preserves NDCG@5/10 and Recall@5/10 with minimal degradation while substantially improving throughput (approximately 4x QPS); sensitivity appears mainly at very large k. The toolkit additionally provides robust preprocessing (high-resolution PDF-to-image conversion, optional margin/empty-region cropping, and token hygiene, i.e., indexing only visual tokens) and a reproducible evaluation pipeline, enabling rapid exploration of two-stage, three-stage, and cascaded retrieval variants. By emphasizing efficiency at common cutoffs (e.g., k <= 10), the toolkit lowers hardware barriers and makes state-of-the-art visual retrieval more accessible in practice.
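The pooling-then-rerank idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the toolkit's code: the function names (`tile_pool`, `maxsim`, `two_stage_search`), the non-overlapping tile layout, and the toy sizes (a 32x32 patch grid, 16-dimensional embeddings) are all assumptions made for illustration, and the sliding-window variant mentioned in the abstract is not shown.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products act as cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def tile_pool(patches, grid_h, grid_w, tile):
    """Mean-pool a (grid_h * grid_w, dim) patch grid over non-overlapping
    tile x tile blocks (hypothetical layout; rows/cols that do not fill a tile are dropped)."""
    dim = patches.shape[-1]
    th, tw = grid_h // tile, grid_w // tile
    grid = patches.reshape(grid_h, grid_w, dim)[: th * tile, : tw * tile]
    pooled = grid.reshape(th, tile, tw, tile, dim).mean(axis=(1, 3))
    return l2_normalize(pooled.reshape(th * tw, dim))

def maxsim(query, doc):
    """Late-interaction score: best-matching doc vector per query vector, summed."""
    return float((query @ doc.T).max(axis=1).sum())

def two_stage_search(query, pooled_pages, full_pages, prefetch, top_k):
    """Stage 1: rank all pages by MaxSim over pooled vectors.
    Stage 2: rerank the `prefetch` best candidates with exact MaxSim."""
    coarse = np.array([maxsim(query, p) for p in pooled_pages])
    candidates = np.argsort(-coarse)[:prefetch]
    exact = {int(i): maxsim(query, full_pages[i]) for i in candidates}
    return sorted(exact, key=exact.get, reverse=True)[:top_k]

# Toy corpus: 20 pages of 32 * 32 = 1024 random patch vectors (assumed sizes).
rng = np.random.default_rng(0)
full_pages = [l2_normalize(rng.normal(size=(1024, 16))) for _ in range(20)]
# An 8 x 8 tile reduces each page from 1024 stored vectors to 16.
pooled_pages = [tile_pool(p, 32, 32, tile=8) for p in full_pages]

query = full_pages[7][:4]  # a toy query built from four patches of page 7
print(two_stage_search(query, pooled_pages, full_pages, prefetch=10, top_k=5))
```

In this toy setting, stage 1 scores 16 vectors per page instead of 1024, and only the prefetched candidates pay the full MaxSim cost. Random embeddings carry no spatial structure, so the sketch shows the mechanics only and says nothing about retrieval quality.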
Related papers
- Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations [39.98860473310998]
ColParse is a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings.
Experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains.
arXiv Detail & Related papers (2026-03-02T09:55:00Z)
- CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding [71.88471147281406]
We propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings.
By incorporating iterative margin loss during contrastive training, CausalEmbed encourages embedding models to learn compact and well-structured representations.
Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count.
arXiv Detail & Related papers (2026-01-29T04:47:27Z)
- MuSASplat: Efficient Sparse-View 3D Gaussian Splats via Lightweight Multi-Scale Adaptation [92.57609195819647]
MuSASplat is a novel framework that dramatically reduces the computational burden of training pose-free feed-forward 3D Gaussian splats models.
Central to our approach is a lightweight Multi-Scale Adapter that enables efficient fine-tuning of ViT-based architectures with only a small fraction of training parameters.
arXiv Detail & Related papers (2025-12-08T04:56:46Z)
- Hybrid-Vector Retrieval for Visually Rich Documents: Combining Single-Vector Efficiency and Multi-Vector Accuracy [36.03315207229038]
HEAVEN is a two-stage hybrid-vector framework for visually rich document retrieval.
It efficiently retrieves candidate pages using a single-vector method over Visually-Summarized Pages.
It reranks candidates with a multi-vector method while filtering query tokens by linguistic importance to reduce redundant computations.
arXiv Detail & Related papers (2025-10-25T08:27:37Z)
- Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization [10.476757608225475]
Multimodal encoders have pushed the boundaries of visual document retrieval.
Recent models relying on this paradigm have massively scaled the sizes of their query and document representations.
We investigate whether a lightweight dense text retriever can enhance a stronger vision-centric model.
arXiv Detail & Related papers (2025-10-06T17:12:53Z)
- SparseFormer: Detecting Objects in HRW Shots via Sparse Vision Transformer [62.11796778482088]
We present a novel model-agnostic sparse vision transformer, dubbed SparseFormer, to bridge the gap of object detection between close-up and HRW shots.
The proposed SparseFormer selectively uses attentive tokens to scrutinize the sparsely distributed windows that may contain objects.
Experiments on two HRW benchmarks, PANDA and DOTA-v1.0, demonstrate that the proposed SparseFormer significantly improves detection accuracy (up to 5.8%) and speed (up to 3x) over the state-of-the-art approaches.
arXiv Detail & Related papers (2025-02-11T03:21:25Z)
- Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
arXiv Detail & Related papers (2024-02-27T14:21:56Z)
- Deep Forest with Hashing Screening and Window Screening [25.745779145969053]
We introduce a hashing screening mechanism for multi-grained scanning of gcForest.
We propose a model called HW-Forest which adopts two strategies, hashing screening and window screening.
Our experimental results show that HW-Forest has higher accuracy than other models, and the time cost is also reduced.
arXiv Detail & Related papers (2022-07-25T07:39:55Z)
- Vision Transformer Slimming: Multi-Dimension Searching in Continuous Optimization Space [35.04846842178276]
We introduce a pure vision transformer slimming (ViT-Slim) framework that can search such a sub-structure across multiple dimensions.
Our method is based on a learnable and unified l1 sparsity constraint with pre-defined factors to reflect the global importance in the continuous searching space of different dimensions.
Our ViT-Slim can compress up to 40% of parameters and 40% FLOPs on various vision transformers while increasing the accuracy by 0.6% on ImageNet.
arXiv Detail & Related papers (2022-01-03T18:59:54Z)
- Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers [115.90778814368703]
Our objective is language-based search of large-scale image and video datasets.
For this task, the approach that consists of independently mapping text and vision to a joint embedding space, a.k.a. dual encoders, is attractive as retrieval scales.
An alternative approach of using vision-text transformers with cross-attention gives considerable improvements in accuracy over the joint embeddings.
arXiv Detail & Related papers (2021-03-30T17:57:08Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scaled pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.