Related papers: What's the Best Way to Retrieve Slides? A Comparative Study of Multimodal, Caption-Based, and Hybrid Retrieval Techniques

What's the Best Way to Retrieve Slides? A Comparative Study of Multimodal, Caption-Based, and Hybrid Retrieval Techniques

URL: http://arxiv.org/abs/2509.15211v1
Date: Thu, 18 Sep 2025 17:57:07 GMT
Title: What's the Best Way to Retrieve Slides? A Comparative Study of Multimodal, Caption-Based, and Hybrid Retrieval Techniques
Authors: Petros Stylianos Giouroukis, Dimitris Dimitriadis, Dimitrios Papadopoulos, Zhenwen Shao, Grigorios Tsoumakas,
Abstract summary: multimodality, combining text, images, and charts, presents challenges for retrieval-augmented generation systems.<n>This paper investigates various methodologies for effective slide retrieval, including visual late-interaction embedding models like ColPali, the use of visual rerankers, and hybrid retrieval techniques that combine dense retrieval with BM25.<n>A novel Vision-Language Models-based captioning pipeline is also evaluated, demonstrating significantly reduced embedding storage requirements compared to visual late-interaction techniques.
Score: 11.421689052786467
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Slide decks, serving as digital reports that bridge the gap between presentation slides and written documents, are a prevalent medium for conveying information in both academic and corporate settings. Their multimodal nature, combining text, images, and charts, presents challenges for retrieval-augmented generation systems, where the quality of retrieval directly impacts downstream performance. Traditional approaches to slide retrieval often involve separate indexing of modalities, which can increase complexity and lose contextual information. This paper investigates various methodologies for effective slide retrieval, including visual late-interaction embedding models like ColPali, the use of visual rerankers, and hybrid retrieval techniques that combine dense retrieval with BM25, further enhanced by textual rerankers and fusion methods like Reciprocal Rank Fusion. A novel Vision-Language Models-based captioning pipeline is also evaluated, demonstrating significantly reduced embedding storage requirements compared to visual late-interaction techniques, alongside comparable retrieval performance. Our analysis extends to the practical aspects of these methods, evaluating their runtime performance and storage demands alongside retrieval efficacy, thus offering practical guidance for the selection and development of efficient and robust slide retrieval systems for real-world applications.

Related papers

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories [52.57197752244638]
We introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task.<n>Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues.<n>We construct DISBench, a challenging benchmark built on interconnected visual data.
arXiv Detail & Related papers (2026-02-11T12:51:10Z)
Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization [10.476757608225475]
Multimodal encoders have pushed the boundaries of visual document retrieval.<n>Recent models relying on this paradigm have massively scaled the sizes of their query and document representations.<n>We investigate whether a lightweight dense text retriever can enhance a stronger vision-centric model.
arXiv Detail & Related papers (2025-10-06T17:12:53Z)
A Multi-Granularity Retrieval Framework for Visually-Rich Documents [4.804551482123172]
We propose a unified multi-granularity multimodal retrieval framework tailored for two benchmark tasks: MMDocIR and M2KR.<n>Our approach integrates hierarchical encoding strategies, modality-aware retrieval mechanisms, and vision-language model (VLM)-based candidate filtering.<n>Our framework demonstrates robust performance without the need for task-specific fine-tuning.
arXiv Detail & Related papers (2025-05-01T02:40:30Z)
Generative Compositor for Few-Shot Visual Information Extraction [60.663887314625164]
We propose a novel generative model, named Generative generative spatialtor, to address the challenge of few-shot VIE.<n>Generative generative spatialtor is a hybrid pointer-generator network that emulates the operations of a compositor by retrieving words from the source text.<n>The proposed method achieves highly competitive results in the full-sample training, while notably outperforms the baseline in the 1-shot, 5-shot, and 10-shot settings.
arXiv Detail & Related papers (2025-03-21T04:56:24Z)
Few-Shot, No Problem: Descriptive Continual Relation Extraction [27.296604792388646]
Few-shot Continual Relation Extraction is a crucial challenge for enabling AI systems to identify and adapt to evolving relationships in real-world domains.<n>Traditional memory-based approaches often overfit to limited samples, failing to reinforce old knowledge.<n>We propose a novel retrieval-based solution, starting with a large language model to generate descriptions for each relation.
arXiv Detail & Related papers (2025-02-27T23:44:30Z)
Retrieval-guided Cross-view Image Synthesis [3.7477511412024573]
Cross-view image synthesis presents significant challenges in establishing reliable correspondences.<n>We propose a retrieval-guided framework that reimagines how retrieval techniques can facilitate effective cross-view image synthesis.<n>Our work bridges information retrieval and synthesis tasks, offering insights into how retrieval techniques can address complex cross-domain synthesis challenges.
arXiv Detail & Related papers (2024-11-29T07:04:44Z)
DiSCo: LLM Knowledge Distillation for Efficient Sparse Retrieval in Conversational Search [19.694957365385896]
Conversational Search (CS) involves retrieving relevant documents from a corpus while considering context modeling.<n>Recent advancements in Large Language Models (LLMs) have significantly enhanced CS by enabling query rewriting based on context.<n>We introduceo (Distillation of Sparse Conversational retrieval), a novel approach that unifies retrieval and context modeling.
arXiv Detail & Related papers (2024-10-18T17:03:17Z)
Compressible and Searchable: AI-native Multi-Modal Retrieval System with Learned Image Compression [0.6345523830122168]
Conventional approaches struggle to cope with the escalating complexity and scale of multimedia data. We proposed framework addresses this challenge by fusing AI-native multi-modal search capabilities with neural image compression. Our work marks a significant advancement towards scalable and efficient multi-modal search systems in the era of big data.
arXiv Detail & Related papers (2024-04-16T02:29:00Z)
Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR) We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
Learning to Rank in Generative Retrieval [62.91492903161522]
Generative retrieval aims to generate identifier strings of relevant passages as the retrieval target. We propose a learning-to-rank framework for generative retrieval, dubbed LTRGR. This framework only requires an additional learning-to-rank training phase to enhance current generative retrieval systems.
arXiv Detail & Related papers (2023-06-27T05:48:14Z)
IRGen: Generative Modeling for Image Retrieval [82.62022344988993]
In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling. We develop our model, dubbed IRGen, to address the technical challenge of converting an image into a concise sequence of semantic units. Our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks and two million-scale datasets.
arXiv Detail & Related papers (2023-03-17T17:07:36Z)
Incorporating Relevance Feedback for Information-Seeking Retrieval using Few-Shot Document Re-Ranking [56.80065604034095]
We introduce a kNN approach that re-ranks documents based on their similarity with the query and the documents the user considers relevant. To evaluate our different integration strategies, we transform four existing information retrieval datasets into the relevance feedback scenario.
arXiv Detail & Related papers (2022-10-19T16:19:37Z)
Where Does the Performance Improvement Come From? - A Reproducibility Concern about Image-Text Retrieval [85.03655458677295]
Image-text retrieval has gradually become a major research direction in the field of information retrieval. We first examine the related concerns and why the focus is on image-text retrieval tasks. We analyze various aspects of the reproduction of pretrained and nonpretrained retrieval models.
arXiv Detail & Related papers (2022-03-08T05:01:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.