Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval
- URL: http://arxiv.org/abs/2512.20042v1
- Date: Tue, 23 Dec 2025 04:21:15 GMT
- Title: Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval
- Authors: Nguyen Lam Phu Quy, Pham Phu Hoa, Tran Chi Nguyen, Dao Sy Duy Minh, Nguyen Hoang Minh Ngoc, Huynh Trung Kiet
- Abstract summary: Real-world image captions often lack contextual depth. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives. We propose a multimodal pipeline that augments visual input with external textual knowledge.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world image captions often lack contextual depth, omitting crucial details such as event background, temporal cues, outcomes, and named entities that are not visually discernible. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives, where richer, more informative descriptions are essential. To address this, we propose a multimodal pipeline that augments visual input with external textual knowledge. Our system retrieves semantically similar images using BEIT-3 (Flickr30k-384 and COCO-384) and SigLIP So-384, reranks them using ORB and SIFT for geometric alignment, and extracts contextual information from related articles via semantic search. A fine-tuned Qwen3 model with QLoRA then integrates this context with base captions generated by Instruct BLIP (Vicuna-7B) to produce event-enriched, context-aware descriptions. Evaluated on the OpenEvents v1 dataset, our approach generates significantly more informative captions compared to traditional methods, showing strong potential for real-world applications requiring deeper visual-textual understanding.
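The retrieve-then-rerank stage of this pipeline lends itself to a short illustration. The sketch below is a minimal, hypothetical rendering of it: cosine retrieval over precomputed embeddings (standing in for BEIT-3 or SigLIP features) followed by ORB-plus-RANSAC geometric verification with OpenCV. The helper names and parameter choices are illustrative assumptions, not the authors' released code; a SIFT pass would follow the same pattern via `cv2.SIFT_create()` with a ratio test.

```python
import cv2
import numpy as np

def retrieve_top_k(query_emb: np.ndarray, gallery_embs: np.ndarray, k: int = 10) -> np.ndarray:
    """Cosine-similarity retrieval over precomputed image embeddings
    (e.g. BEIT-3 or SigLIP features)."""
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_embs / np.linalg.norm(gallery_embs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))[:k]

def orb_inlier_count(img_a: np.ndarray, img_b: np.ndarray) -> int:
    """Geometric-alignment score: RANSAC-homography inliers over ORB matches."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return 0
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des_a, des_b)
    if len(matches) < 4:  # a homography needs at least 4 correspondences
        return 0
    src = np.float32([kp_a[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return int(mask.sum()) if mask is not None else 0

def rerank_by_geometry(query_img, candidates):
    """Reorder embedding-retrieved candidates by geometric consistency."""
    scored = [(orb_inlier_count(query_img, img), idx) for idx, img in candidates]
    return [idx for _, idx in sorted(scored, reverse=True)]
```

The final stage, fusing retrieved article context with the base caption via a QLoRA-tuned language model, would sit downstream of this reranking step.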
Related papers
- Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration [64.12127577975696]
Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications.
Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively.
We propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration.
arXiv Detail & Related papers (2026-01-20T15:17:14Z)
- ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization [9.914251544971686]
ReCap is a novel pipeline for event-enriched image retrieval and captioning.
It incorporates broader contextual information from relevant articles to generate narrative-rich captions.
Our approach addresses the limitations of standard vision-language models.
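The entry's title mentions Semantic Gaussian Normalization, which the summary does not unpack. One plausible reading, sketched below as an assumption rather than the paper's definition, is z-scoring per-query similarity scores so that heterogeneous retrievers can be fused on a common scale.

```python
import numpy as np

def gaussian_normalize(scores: np.ndarray) -> np.ndarray:
    """Z-score a query's similarity scores: zero mean, unit variance.
    One plausible reading of 'Gaussian normalization'; the paper's
    exact formulation may differ."""
    return (scores - scores.mean()) / (scores.std() + 1e-8)

# Fuse two retrievers whose raw scores live on different scales.
dense = np.array([0.71, 0.63, 0.60])   # e.g. cosine similarities
sparse = np.array([12.4, 9.8, 11.1])   # e.g. BM25 scores
fused = 0.5 * gaussian_normalize(dense) + 0.5 * gaussian_normalize(sparse)
```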
arXiv Detail & Related papers (2025-09-01T08:48:33Z)
- EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions [11.853877966862086]
Event-based image retrieval from free-form captions presents a significant challenge.
We introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection.
Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge.
arXiv Detail & Related papers (2025-08-31T09:03:25Z)
- OmniCaptioner: One Captioner to Rule Them All [33.98387155732322]
We propose a versatile visual captioning framework for generating fine-grained textual descriptions.
By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities.
We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between language and visual modalities.
arXiv Detail & Related papers (2025-04-09T17:58:58Z)
- FLAIR: VLM with Fine-grained Language-informed Image Representations [49.2684130383925]
FLAIR is an approach that utilizes long and detailed image descriptions to learn localized image embeddings.
Our experiments demonstrate the effectiveness of FLAIR trained on 30M image-text pairs in capturing fine-grained visual information.
arXiv Detail & Related papers (2024-12-04T18:56:04Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images with captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
First, we prompt VLLMs to generate natural language descriptions of the subject's apparent emotion in relation to the visual context.
Second, the descriptions, along with the visual input, are used to train a transformer-based architecture.
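A schematic of that two-stage recipe follows. The `vlm_describe` stub stands in for whatever VLLM produces the description, and the fusion head is a generic illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

def vlm_describe(image) -> str:
    """Stage 1 (placeholder): prompt a VLLM with something like
    'Describe the apparent emotion of the subject in relation to the
    visual context.' Swap in a real model call here."""
    raise NotImplementedError

class EmotionFusionHead(nn.Module):
    """Stage 2 (illustrative): classify from concatenated text and image features."""
    def __init__(self, text_dim: int, image_dim: int, num_emotions: int):
        super().__init__()
        self.classifier = nn.Sequential(
            nn.Linear(text_dim + image_dim, 512),
            nn.ReLU(),
            nn.Linear(512, num_emotions),
        )

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        return self.classifier(torch.cat([text_feat, image_feat], dim=-1))
```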
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
- Visually-Aware Context Modeling for News Image Captioning [54.31708859631821]
News Image Captioning aims to create captions from news articles and images.
We propose a face-naming module for learning better name embeddings.
We use CLIP to retrieve sentences that are semantically close to the image.
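The CLIP retrieval step above is straightforward to sketch. This hypothetical snippet uses the public Hugging Face CLIP checkpoint as a stand-in for whatever variant the paper actually uses:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def closest_sentences(image: Image.Image, sentences: list[str], k: int = 3) -> list[str]:
    """Rank article sentences by CLIP image-text similarity, return the top k."""
    inputs = processor(text=sentences, images=image,
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        sims = model(**inputs).logits_per_image[0]  # one row per image
    top = sims.topk(min(k, len(sentences))).indices.tolist()
    return [sentences[i] for i in top]
```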
arXiv Detail & Related papers (2023-08-16T12:39:39Z)
- Generating image captions with external encyclopedic knowledge [1.452875650827562]
We create an end-to-end caption generation system that makes extensive use of image-specific encyclopedic data.
Our approach includes a novel way of using image location to identify relevant open-domain facts in an external knowledge base.
Our system is trained and tested on a new dataset with naturally produced knowledge-rich captions.
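The location-to-facts step invites a tiny illustration: given image coordinates, find nearby knowledge-base entries. Everything below, from the haversine lookup to the toy `KB` table, is a hypothetical stand-in for the paper's actual pipeline.

```python
import math

# Toy knowledge base: (lat, lon, encyclopedic snippet) -- illustrative only.
KB = [
    (48.8584, 2.2945, "The Eiffel Tower, completed in 1889 for the World's Fair."),
    (51.5007, -0.1246, "Big Ben, the Great Bell of the clock at Westminster."),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

def facts_near(lat, lon, radius_km=1.0):
    """Return knowledge-base snippets within radius_km of the image location."""
    return [fact for klat, klon, fact in KB
            if haversine_km(lat, lon, klat, klon) <= radius_km]

print(facts_near(48.858, 2.294))  # -> Eiffel Tower snippet
```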
arXiv Detail & Related papers (2022-10-10T16:09:21Z)
- Retrieval-Augmented Transformer for Image Captioning [51.79146669195357]
We develop an image captioning approach with a kNN memory, with which knowledge can be retrieved from an external corpus to aid the generation process.
Our architecture combines a knowledge retriever based on visual similarities, a differentiable encoder, and a kNN-augmented attention layer to predict tokens.
Experimental results, conducted on the COCO dataset, demonstrate that employing an explicit external memory can aid the generation process and increase caption quality.
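A toy rendering of the kNN-memory pattern described above, assuming a corpus of precomputed key/value embeddings; it shows the retrieve-then-attend mechanics, not the paper's exact layer.

```python
import torch
import torch.nn.functional as F

def knn_memory_attend(query: torch.Tensor, memory_keys: torch.Tensor,
                      memory_values: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Retrieve the k nearest memory entries for each query vector and
    attend over them with softmax-weighted similarity.

    query:         (n, d)  decoder states
    memory_keys:   (m, d)  precomputed corpus embeddings
    memory_values: (m, d)  associated value vectors
    """
    sims = query @ memory_keys.T                    # (n, m) dot-product similarity
    top_sims, top_idx = sims.topk(k, dim=-1)        # (n, k)
    weights = F.softmax(top_sims, dim=-1)           # attention over retrieved items
    gathered = memory_values[top_idx]               # (n, k, d)
    return (weights.unsqueeze(-1) * gathered).sum(dim=1)  # (n, d) context vector
```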
arXiv Detail & Related papers (2022-07-26T19:35:49Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
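The joint word/tag prediction reduces to a weighted multi-task loss; the sketch below is a generic illustration of that pattern, with the weighting and tensor shapes assumed rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def multitask_caption_loss(word_logits, word_targets,
                           tag_logits, tag_targets,
                           tag_weight: float = 0.5) -> torch.Tensor:
    """Joint objective: next-word prediction plus object/predicate tag prediction.

    word_logits: (batch, seq, vocab);  tag_logits: (batch, seq, num_tags)
    tag_weight is an assumed hyperparameter balancing the two tasks.
    """
    word_loss = F.cross_entropy(word_logits.flatten(0, 1), word_targets.flatten())
    tag_loss = F.cross_entropy(tag_logits.flatten(0, 1), tag_targets.flatten())
    return word_loss + tag_weight * tag_loss
```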
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.