Related papers: Multimodal RAG Enhanced Visual Description

Multimodal RAG Enhanced Visual Description

URL: http://arxiv.org/abs/2508.09170v1
Date: Wed, 06 Aug 2025 19:04:38 GMT
Title: Multimodal RAG Enhanced Visual Description
Authors: Amit Kumar Jaiswal, Haiming Liu, Ingo Frommholz,
Abstract summary: Pre-trained large multimodal models (LMMs) encounter a modality gap, characterised by a misalignment between textual and visual representations.<n>We propose a lightweight training-free approach utilising Retrieval-Augmented Generation (RAG) to extend across the modality.<n> Experimental results on two benchmark multimodal datasets demonstrate significant improvements.
Score: 3.2771631221674333
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Textual descriptions for multimodal inputs entail recurrent refinement of queries to produce relevant output images. Despite efforts to address challenges such as scaling model size and data volume, the cost associated with pre-training and fine-tuning remains substantial. However, pre-trained large multimodal models (LMMs) encounter a modality gap, characterised by a misalignment between textual and visual representations within a common embedding space. Although fine-tuning can potentially mitigate this gap, it is typically expensive and impractical due to the requirement for extensive domain-driven data. To overcome this challenge, we propose a lightweight training-free approach utilising Retrieval-Augmented Generation (RAG) to extend across the modality using a linear mapping, which can be computed efficiently. During inference, this mapping is applied to images embedded by an LMM enabling retrieval of closest textual descriptions from the training set. These textual descriptions, in conjunction with an instruction, cater as an input prompt for the language model to generate new textual descriptions. In addition, we introduce an iterative technique for distilling the mapping by generating synthetic descriptions via the language model facilitating optimisation for standard utilised image description measures. Experimental results on two benchmark multimodal datasets demonstrate significant improvements.

Related papers

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models [84.78794648147608]
A persistent geometric anomaly, the Modality Gap, remains.<n>Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions.<n>We propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap into stable biases and anisotropic residuals.<n>We then introduce ReAlign, a training-free modality alignment strategy.
arXiv Detail & Related papers (2026-02-02T13:59:39Z)
Chat-Driven Text Generation and Interaction for Person Retrieval [16.448356660477682]
We introduce two complementary modules: Multi-Turn Text Generation (MTG) and Multi-Turn Text Interaction (MTI)<n>MTG generates rich pseudo-labels through simulated dialogues with MLLMs, producing fine-grained and diverse visual descriptions without manual supervision.<n>MTI refines user queries at inference time through dynamic, dialogue-based reasoning, enabling the system to interpret and resolve vague, incomplete, or ambiguous descriptions.
arXiv Detail & Related papers (2025-09-16T04:40:24Z)
Weighted Multi-Prompt Learning with Description-free Large Language Model Distillation [1.3381749415517021]
New approaches leveraging Large Language Models (LLM) in prompts have been proposed, enhancing robustness to unseen and diverse data.<n>Existing methods typically extract text-based responses (i.e., descriptions) from LLM to incorporate into prompts.<n>We propose Description-free Multi-prompt Learning(DeMul), a novel method that eliminates the process of extracting descriptions and instead directly distills knowledge from LLM into prompts.
arXiv Detail & Related papers (2025-07-09T07:55:25Z)
An analysis of vision-language models for fabric retrieval [4.311804611758908]
Cross-modal retrieval is essential for applications like information retrieval and recommendation systems.<n>This paper investigates the use of Vision Language Models for zero-shot text-to-image retrieval on fabric samples.
arXiv Detail & Related papers (2025-07-07T08:00:18Z)
Towards Visual Text Grounding of Multimodal Large Language Model [88.0588924255417]
We introduce TRIG, a novel task with a newly designed instruction dataset for benchmarking text-rich image grounding.<n>Specifically, we propose an OCR-LLM-human interaction pipeline to create 800 manually annotated question-answer pairs as a benchmark.<n>A comprehensive evaluation of various MLLMs on our proposed benchmark exposes substantial limitations in their grounding capability on text-rich images.
arXiv Detail & Related papers (2025-04-07T12:01:59Z)
Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.<n>We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.<n>We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
Diffusion Augmented Retrieval: A Training-Free Approach to Interactive Text-to-Image Retrieval [7.439049772394586]
Diffusion Augmented Retrieval (DAR) is a framework that generates multiple intermediate representations via dialogue refinements and DMs.<n>DAR results on par with finetuned I-TIR models, yet without incurring their tuning overhead.
arXiv Detail & Related papers (2025-01-26T03:29:18Z)
Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL) GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval. Our method achieves comparable performance with SOTA as well as being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
Few-shot Action Recognition with Captioning Foundation Models [61.40271046233581]
CapFSAR is a framework to exploit knowledge of multimodal models without manually annotating text. Visual-text aggregation module based on Transformer is further designed to incorporate cross-modal-temporal complementary information. experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods.
arXiv Detail & Related papers (2023-10-16T07:08:39Z)
SwitchGPT: Adapting Large Language Models for Non-Text Outputs [28.656227306028743]
Large Language Models (LLMs) are primarily trained on text-based datasets. LLMs exhibit exceptional proficiencies in understanding and executing complex linguistic instructions via text outputs. We propose a novel approach that evolves a text-based LLM into a multi-modal one.
arXiv Detail & Related papers (2023-09-14T11:38:23Z)
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models. Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
Enhanced Modality Transition for Image Captioning [51.72997126838352]
We build a Modality Transition Module (MTM) to transfer visual features into semantic representations before forwarding them to the language model. During the training phase, the modality transition network is optimised by the proposed modality loss. Experiments have been conducted on the MS-COCO dataset demonstrating the effectiveness of the proposed framework.
arXiv Detail & Related papers (2021-02-23T07:20:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.