RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning
- URL: http://arxiv.org/abs/2509.15883v1
- Date: Fri, 19 Sep 2025 11:29:42 GMT
- Title: RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning
- Authors: Xiaosheng Long, Hanyu Wang, Zhentao Song, Kun Luo, Hongde Liu
- Abstract summary: We propose RACap, a relation-aware retrieval-augmented model for image captioning. RACap, with only 10.8M trainable parameters, achieves superior performance compared to previous lightweight captioning models.
- Score: 8.596137792629529
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent retrieval-augmented image captioning methods incorporate external knowledge to compensate for the limitations in comprehending complex scenes. However, current approaches face challenges in relation modeling: (1) the representation of semantic prompts is too coarse-grained to capture fine-grained relationships; (2) these methods lack explicit modeling of image objects and their semantic relationships. To address these limitations, we propose RACap, a relation-aware retrieval-augmented model for image captioning, which not only mines structured relation semantics from retrieval captions, but also identifies heterogeneous objects from the image. RACap effectively retrieves structured relation features that contain heterogeneous visual information to enhance the semantic consistency and relational expressiveness. Experimental results show that RACap, with only 10.8M trainable parameters, achieves superior performance compared to previous lightweight captioning models.
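As a rough illustration of the retrieval-augmented, relation-aware prompting idea described in the abstract (not the authors' implementation), the sketch below retrieves the captions most similar to an image embedding, pulls naive (subject, relation, object) triples out of them, and serializes the triples into a relation prompt for a caption decoder. The function names, toy random embeddings, and word-order heuristic are all illustrative assumptions.

```python
# Hedged sketch of relation-aware retrieval-augmented prompting; toy embeddings
# stand in for a real CLIP-style encoder, and the triple extractor is naive.
import torch
import torch.nn.functional as F

def retrieve_captions(image_emb, caption_embs, captions, k=3):
    """Return the k captions whose embeddings are most similar to the image."""
    sims = F.cosine_similarity(image_emb.unsqueeze(0), caption_embs, dim=-1)
    topk = sims.topk(k).indices
    return [captions[i] for i in topk.tolist()]

def extract_relation_triples(caption):
    """Toy (subject, relation, object) extraction assuming 'A rel B' word order.
    A real system would use a dependency or scene-graph parser."""
    words = caption.lower().strip(".").split()
    if len(words) >= 3:
        return [(words[0], " ".join(words[1:-1]), words[-1])]
    return []

def build_relation_prompt(triples):
    """Serialize structured relations into a textual prompt for the decoder."""
    clauses = [f"{s} {r} {o}" for s, r, o in triples]
    return "Relations: " + "; ".join(clauses) + ". Caption:"

torch.manual_seed(0)
captions = ["dog chases ball", "woman rides bicycle", "cat sleeps sofa"]
caption_embs = torch.randn(len(captions), 512)  # stand-in caption datastore
image_emb = torch.randn(512)                    # stand-in image embedding

retrieved = retrieve_captions(image_emb, caption_embs, captions, k=2)
triples = [t for c in retrieved for t in extract_relation_triples(c)]
print(build_relation_prompt(triples))
```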
Related papers
- Seeing Through Words: Controlling Visual Retrieval Quality with Language Models [68.49490036960559]
We propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms. Our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries.
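A minimal sketch of the query-completion idea: a generative language model expands an underspecified query into a descriptive form before retrieval. The prompt template and the use of GPT-2 here are assumptions for illustration, not the paper's actual setup.

```python
# Hedged sketch: expand a short query with an LM before running retrieval.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

def complete_query(short_query: str, max_new_tokens: int = 30) -> str:
    """Extend a short query into a descriptive form for higher-quality retrieval."""
    prompt = f"A detailed, high-quality photo description for '{short_query}':"
    out = generator(prompt, max_new_tokens=max_new_tokens, num_return_sequences=1)
    return out[0]["generated_text"]

print(complete_query("red car"))
```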
arXiv Detail & Related papers (2026-02-24T18:20:57Z)
- SaFiRe: Saccade-Fixation Reiteration with Mamba for Referring Image Segmentation [58.80001825332851]
Referring Image Segmentation (RIS) aims to segment the target object in an image given a natural language expression. Recent methods predominantly focus on simple expressions like "red car" or "left girl".
arXiv Detail & Related papers (2025-10-11T10:50:58Z)
- RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning [1.5999407512883512]
A Retrieval-based Objects and Relations Prompt for Image Captioning (RORPCap) is proposed. RORPCap employs an Objects and Relations Extraction Model to extract object and relation words from the image. The resulting prompt embeddings and visual-text embeddings form text-enriched feature embeddings, which are fed into a GPT-2 model for caption generation.
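The sketch below mirrors the RORPCap recipe at the text level only: extracted object and relation words are formatted into a prompt that conditions GPT-2. The prompt template and word lists are assumptions; the paper fuses prompt embeddings with visual embeddings rather than prompting with plain text.

```python
# Illustrative text-level sketch of object/relation prompting with GPT-2.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

objects = ["dog", "frisbee", "park"]      # assumed extractor output
relations = ["dog catching frisbee"]       # assumed extractor output
prompt = f"Objects: {', '.join(objects)}. Relations: {', '.join(relations)}. Caption:"

ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=20, pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```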
arXiv Detail & Related papers (2025-08-10T12:27:27Z)
- VSC: Visual Search Compositional Text-to-Image Diffusion Model [15.682990658945682]
We introduce a novel compositional generation method that leverages pairwise image embeddings to improve attribute-object binding. Our approach decomposes complex prompts into sub-prompts, generates corresponding images, and computes visual prototypes that fuse with text embeddings to enhance representation. Our approach outperforms existing compositional text-to-image diffusion models on the T2I-CompBench benchmark, achieving better image quality as judged by humans and greater robustness as the number of binding pairs in the prompt scales.
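A minimal sketch of the decomposition-plus-prototype idea: split a compound prompt into attribute-object sub-prompts, look up a visual prototype for each, and fuse it with the text embedding. The toy random embeddings, the splitting rule, and the fusion weight are illustrative assumptions, not the paper's model.

```python
# Hedged sketch: sub-prompt decomposition and text/visual-prototype fusion.
import torch

def split_subprompts(prompt: str):
    """Naive decomposition on 'and'; a real system would parse the prompt."""
    return [p.strip() for p in prompt.split(" and ")]

torch.manual_seed(0)
text_encode = lambda s: torch.randn(512)   # stand-in text encoder
prototype_bank = {                          # assumed: mean embedding of images
    "a red car": torch.randn(512),          # retrieved for each sub-prompt
    "a blue bird": torch.randn(512),
}

alpha = 0.5  # fusion weight (assumption)
fused = []
for sub in split_subprompts("a red car and a blue bird"):
    t = text_encode(sub)
    v = prototype_bank.get(sub, torch.zeros(512))
    fused.append(alpha * t + (1 - alpha) * v)  # fuse text with visual prototype
conditioning = torch.stack(fused)              # per-sub-prompt conditioning
print(conditioning.shape)                      # torch.Size([2, 512])
```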
arXiv Detail & Related papers (2025-05-02T08:31:43Z)
- Dynamic Relation Inference via Verb Embeddings [2.2843519327656363]
We offer insights and practical methods to advance the field of relation inference from images. We propose Dynamic Relation Inference via Verb Embeddings (DRIVE), which augments the COCO dataset, fine-tunes CLIP with hard-negative subject-relation-object triples and corresponding images, and introduces a novel loss function to improve relation detection.
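In the spirit of DRIVE's hard-negative fine-tuning, the sketch below shows a generic contrastive loss where each image is scored against its matched caption and a hard negative whose relation verb has been swapped. This is a stock InfoNCE-style objective, not the paper's actual loss function.

```python
# Hedged sketch of contrastive training with hard-negative relation triples.
import torch
import torch.nn.functional as F

def triplet_contrastive_loss(img, pos_txt, hard_neg_txt, temp=0.07):
    """img, pos_txt, hard_neg_txt: (B, D) L2-normalized embeddings. Hard
    negatives swap the relation verb (e.g. 'dog chasing cat' -> 'dog fleeing
    cat') so the model must attend to the relation, not just the objects."""
    pos = (img * pos_txt).sum(-1) / temp            # (B,)
    neg = (img * hard_neg_txt).sum(-1) / temp       # (B,)
    logits = torch.stack([pos, neg], dim=1)         # (B, 2)
    labels = torch.zeros(img.size(0), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)

torch.manual_seed(0)
B, D = 4, 512
img = F.normalize(torch.randn(B, D), dim=-1)
pos = F.normalize(torch.randn(B, D), dim=-1)
neg = F.normalize(torch.randn(B, D), dim=-1)
print(triplet_contrastive_loss(img, pos, neg).item())
```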
arXiv Detail & Related papers (2025-03-17T10:24:27Z)
- Sentence-level Prompts Benefit Composed Image Retrieval [69.78119883060006]
Composed image retrieval (CIR) is the task of retrieving specific images by using a query that involves both a reference image and a relative caption.
We propose to leverage pretrained V-L models, e.g., BLIP-2, to generate sentence-level prompts.
Our proposed method performs favorably against the state-of-the-art CIR methods on the Fashion-IQ and CIRR datasets.
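A sketch of turning a (reference image, relative caption) pair into a sentence-level text query: caption the reference image, then splice in the requested modification. The paper uses BLIP-2; the smaller BLIP captioner below is a lighter stand-in, and the fusion template is an assumption.

```python
# Hedged sketch of sentence-level prompt construction for composed retrieval.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base")

def sentence_level_prompt(image: Image.Image, relative_caption: str) -> str:
    """Caption the reference image, then splice in the requested modification."""
    inputs = processor(images=image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=30)
    base_caption = processor.decode(out[0], skip_special_tokens=True)
    return f"{base_caption}, but {relative_caption}"

img = Image.new("RGB", (224, 224))  # placeholder image for the sketch
print(sentence_level_prompt(img, "with long sleeves"))
```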
arXiv Detail & Related papers (2023-10-09T07:31:44Z)
- Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations [67.92679668612858]
We propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals.
Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings; and (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions.
On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements. Notably, it achieves significant recall gains, with a 2.77% increase in R@10 and a 6.67% boost in R@50, underscoring its effectiveness.
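A minimal sketch of the consensus idea: each compositor fuses image and text into its own composed embedding, and a symmetric KL term pushes their retrieval distributions to agree. Two compositors instead of four, and the tiny fusion layer, are simplifications of the paper's Css-Net.

```python
# Hedged sketch of consensus learning between compositors via a KL penalty.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Compositor(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, img_emb, txt_emb):
        return self.fuse(torch.cat([img_emb, txt_emb], dim=-1))

def consensus_kl(q1, q2, gallery, temp=0.07):
    """Symmetric KL between two compositors' retrieval distributions."""
    p1 = F.log_softmax(q1 @ gallery.T / temp, dim=-1)
    p2 = F.log_softmax(q2 @ gallery.T / temp, dim=-1)
    return 0.5 * (F.kl_div(p1, p2, log_target=True, reduction="batchmean")
                  + F.kl_div(p2, p1, log_target=True, reduction="batchmean"))

torch.manual_seed(0)
c1, c2 = Compositor(), Compositor()
img, txt = torch.randn(4, 256), torch.randn(4, 256)
gallery = torch.randn(100, 256)   # candidate image embeddings
print(consensus_kl(c1(img, txt), c2(img, txt), gallery).item())
```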
arXiv Detail & Related papers (2023-06-03T11:50:44Z)
- FECANet: Boosting Few-Shot Semantic Segmentation with Feature-Enhanced Context-Aware Network [48.912196729711624]
Few-shot semantic segmentation is the task of learning to locate each pixel of a novel class in a query image with only a few annotated support images.
We propose a Feature-Enhanced Context-Aware Network (FECANet) to suppress the matching noise caused by inter-class local similarity.
In addition, we propose a novel correlation reconstruction module that encodes extra correspondence relations between foreground and background, as well as multi-scale context semantic features.
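The sketch below shows the dense query-support correlation that few-shot segmentation models like FECANet build on: a cosine similarity between every query location and every support location. The reconstruction module itself (re-encoding foreground/background correspondences) is only hinted at here.

```python
# Hedged sketch of dense query-support feature correlation.
import torch
import torch.nn.functional as F

def dense_correlation(query_feat, support_feat):
    """query_feat, support_feat: (B, C, H, W). Returns (B, H*W, H*W) cosine
    correlation between every query location and every support location."""
    q = F.normalize(query_feat.flatten(2), dim=1)    # (B, C, HW)
    s = F.normalize(support_feat.flatten(2), dim=1)  # (B, C, HW)
    return torch.einsum("bcq,bcs->bqs", q, s)        # (B, HW, HW)

torch.manual_seed(0)
q, s = torch.randn(2, 64, 16, 16), torch.randn(2, 64, 16, 16)
print(dense_correlation(q, s).shape)  # torch.Size([2, 256, 256])
```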
arXiv Detail & Related papers (2023-01-19T16:31:13Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual Cues Enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- Exploring Semantic Relationships for Unpaired Image Captioning [40.401322131624866]
We achieve unpaired image captioning by bridging the vision and the language domains with high-level semantic information.
We propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image.
The proposed approach boosts five strong baselines under the paired setting, where the most significant improvement in CIDEr score reaches 8%.
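One generic way to "explore relationships between semantic concepts" is self-attention over concept embeddings, producing features in which each concept is contextualized by the others. The sketch below is a stand-in under that assumption, not the paper's Semantic Relationship Explorer.

```python
# Illustrative stand-in: self-attention over detected concept embeddings.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

torch.manual_seed(0)
concepts = torch.randn(1, 5, 256)  # 5 detected concept embeddings (assumed)
related, weights = attn(concepts, concepts, concepts)
print(related.shape, weights.shape)  # contextualized concepts + relation weights
```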
arXiv Detail & Related papers (2021-06-20T09:10:11Z)
- Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experimental results show that the proposed method maintains robust performance and gives more flexible scores to candidate captions that use semantically similar expressions or less-aligned wording.
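I2CE itself is a learned metric; as a minimal stand-in with the same flavor, the sketch below scores a candidate caption by sentence-embedding similarity to the references, which likewise rewards semantically similar but differently worded captions. The model choice is an assumption.

```python
# Hedged stand-in for a learned caption metric: embedding similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def soft_caption_score(candidate: str, references: list[str]) -> float:
    """Max cosine similarity between the candidate and any reference caption."""
    cand = model.encode(candidate, convert_to_tensor=True)
    refs = model.encode(references, convert_to_tensor=True)
    return util.cos_sim(cand, refs).max().item()

print(soft_caption_score("a puppy chasing a ball",
                         ["a dog runs after a ball", "kids play soccer"]))
```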
arXiv Detail & Related papers (2020-12-14T08:36:05Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
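A hedged sketch of the multi-task generation step: at each position the decoder predicts the next word and, jointly, an object/predicate tag for it. The tag set, loss weighting, and tiny two-head decoder are illustrative assumptions, not the paper's model.

```python
# Hedged sketch of joint word + object/predicate-tag prediction.
import torch
import torch.nn as nn

VOCAB, TAGS = 1000, 3  # tags: 0=other, 1=object, 2=predicate (assumed)

class JointDecoderStep(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.word_head = nn.Linear(hidden, VOCAB)
        self.tag_head = nn.Linear(hidden, TAGS)

    def forward(self, h):
        return self.word_head(h), self.tag_head(h)

torch.manual_seed(0)
step = JointDecoderStep()
h = torch.randn(8, 256)                       # decoder hidden states
word_logits, tag_logits = step(h)
ce = nn.CrossEntropyLoss()
loss = ce(word_logits, torch.randint(VOCAB, (8,))) \
     + 0.5 * ce(tag_logits, torch.randint(TAGS, (8,)))  # multi-task objective
print(loss.item())
```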
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.