DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts
- URL: http://arxiv.org/abs/2510.24813v1
- Date: Tue, 28 Oct 2025 08:29:02 GMT
- Title: DualCap: Enhancing Lightweight Image Captioning via Dual Retrieval with Similar Scenes Visual Prompts
- Authors: Binbin Li, Guimiao Yang, Zisen Qi, Haiping Wang, Yu Ding
- Abstract summary: $DualCap$ is a novel approach that enriches the visual representation by generating a visual prompt from retrieved similar images. Our model employs a dual retrieval mechanism, using standard image-to-text retrieval for text prompts and a novel image-to-image retrieval to source visually analogous scenes.
- Score: 5.621348483113461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent lightweight retrieval-augmented image captioning models often use retrieved data solely as text prompts, leaving the original visual features unenhanced and creating a semantic gap, particularly for object details or complex scenes. To address this limitation, we propose $DualCap$, a novel approach that enriches the visual representation by generating a visual prompt from retrieved similar images. Our model employs a dual retrieval mechanism, using standard image-to-text retrieval for text prompts and a novel image-to-image retrieval to source visually analogous scenes. Specifically, salient keywords and phrases are derived from the captions of visually similar scenes to capture key objects and scene details. These textual features are then encoded and integrated with the original image features through a lightweight, trainable feature fusion network. Extensive experiments demonstrate that our method achieves competitive performance while requiring fewer trainable parameters than previous visual-prompting captioning approaches.
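The abstract describes two concrete components: a dual retrieval step (image-to-text retrieval for text prompts, image-to-image retrieval for visually similar scenes) and a lightweight trainable network that fuses encoded keyword features from the similar-scene captions with the original image features. The sketch below illustrates that pipeline in PyTorch; the CLIP-style embeddings, the cross-attention fusion design, and all module names and shapes are assumptions made for illustration, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def retrieve_topk(query_emb: torch.Tensor, index_embs: torch.Tensor, k: int = 5):
    """Cosine-similarity top-k retrieval. In the dual-retrieval reading of the
    abstract, this would be called twice: once against a caption index
    (image-to-text) and once against an image index (image-to-image)."""
    query_emb = F.normalize(query_emb, dim=-1)
    index_embs = F.normalize(index_embs, dim=-1)
    scores = query_emb @ index_embs.T          # (B, |index|)
    return scores.topk(k, dim=-1).indices


class KeywordFusion(nn.Module):
    """Hypothetical lightweight fusion module: image patch tokens attend to
    encoded keyword/phrase features derived from similar-scene captions,
    producing an enriched visual prompt."""

    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim)
        )

    def forward(self, img_feats: torch.Tensor, kw_feats: torch.Tensor):
        # img_feats: (B, N_img, D) patch features of the query image
        # kw_feats:  (B, N_kw, D) encoded keywords from similar-scene captions
        attended, _ = self.cross_attn(img_feats, kw_feats, kw_feats)
        fused = self.norm(img_feats + attended)   # residual visual prompt
        return fused + self.mlp(fused)
```

Under this reading, only the fusion module's attention and MLP weights would be trained, while the retrieval indices and the image/text encoders stay frozen, which is consistent with the abstract's claim of a small trainable-parameter budget.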
Related papers
- Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration [64.12127577975696]
Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications.
Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively.
We propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration.
arXiv Detail & Related papers (2026-01-20T15:17:14Z)
- Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions [81.33113485830711]
We introduce a vision-free, single-encoder retrieval pipeline for vision-language models.
We migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions.
Our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks.
arXiv Detail & Related papers (2025-09-23T16:22:27Z)
- ViPCap: Retrieval Text-Based Visual Prompts for Lightweight Image Captioning [3.502816712907136]
We propose ViPCap, a novel retrieval text-based visual prompt for lightweight image captioning.
ViPCap leverages the retrieved text with image information as visual prompts to enhance the ability of the model to capture relevant visual information.
Experimental results demonstrate that ViPCap significantly outperforms prior lightweight captioning models in efficiency and effectiveness.
arXiv Detail & Related papers (2024-12-26T17:29:38Z)
- You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval [120.49126407479717]
We introduce a novel compositionality framework, effectively combining sketches and text using pre-trained CLIP models.
Our system extends to novel applications in composed image retrieval, domain transfer, and fine-grained generation.
arXiv Detail & Related papers (2024-03-12T00:27:18Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- Text-based Person Search without Parallel Image-Text Data [52.63433741872629]
Text-based person search (TBPS) aims to retrieve the images of the target person from a large image gallery based on a given natural language description.
Existing methods are dominated by training models with parallel image-text pairs, which are very costly to collect.
In this paper, we make the first attempt to explore TBPS without parallel image-text data.
arXiv Detail & Related papers (2023-05-22T12:13:08Z)
- ARTEMIS: Attention-based Retrieval with Text-Explicit Matching and Implicit Similarity [16.550790981646276]
Current approaches combine the features of the two elements of the query into a single representation.
Our work aims at shedding new light on the task by looking at it through the prism of two familiar and related frameworks: text-to-image and image-to-image retrieval.
arXiv Detail & Related papers (2022-03-15T17:29:20Z)
- Telling the What while Pointing the Where: Fine-grained Mouse Trace and Language Supervision for Improved Image Retrieval [60.24860627782486]
Fine-grained image retrieval often requires the ability to also express where in the image the content the user is looking for is located.
In this paper, we describe an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where").
Our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.
arXiv Detail & Related papers (2021-02-09T17:54:34Z)