ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization
- URL: http://arxiv.org/abs/2509.01259v1
- Date: Mon, 01 Sep 2025 08:48:33 GMT
- Title: ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization
- Authors: Thinh-Phuc Nguyen, Thanh-Hai Nguyen, Gia-Huy Dinh, Lam-Huy Nguyen, Minh-Triet Tran, Trung-Nghia Le
- Abstract summary: ReCap is a novel pipeline for event-enriched image retrieval and captioning. It incorporates broader contextual information from relevant articles to generate narrative-rich captions. Our approach addresses the limitations of standard vision-language models.
- Score: 9.914251544971686
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Image captioning systems often produce generic descriptions that fail to capture event-level semantics which are crucial for applications like news reporting and digital archiving. We present ReCap, a novel pipeline for event-enriched image retrieval and captioning that incorporates broader contextual information from relevant articles to generate narrative-rich, factually grounded captions. Our approach addresses the limitations of standard vision-language models that typically focus on visible content while missing temporal, social, and historical contexts. ReCap comprises three integrated components: (1) a robust two-stage article retrieval system using DINOv2 embeddings with global feature similarity for initial candidate selection followed by patch-level mutual nearest neighbor similarity re-ranking; (2) a context extraction framework that synthesizes information from article summaries, generic captions, and original source metadata; and (3) a large language model-based caption generation system with Semantic Gaussian Normalization to enhance fluency and relevance. Evaluated on the OpenEvents V1 dataset as part of Track 1 in the EVENTA 2025 Grand Challenge, ReCap achieved a strong overall score of 0.54666, ranking 2nd on the private test set. These results highlight ReCap's effectiveness in bridging visual perception with real-world knowledge, offering a practical solution for context-aware image understanding in high-stakes domains. The code is available at https://github.com/Noridom1/EVENTA2025-Event-Enriched-Image-Captioning.
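The two-stage article retrieval described in component (1) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the candidate count `k`, and the choice to score a candidate by summing similarities over mutual nearest-neighbor patch pairs are all assumptions; only the overall shape (global DINOv2 cosine similarity for candidates, then patch-level mutual-nearest-neighbor re-ranking) comes from the abstract.

```python
import numpy as np

def global_retrieve(query_vec, db_vecs, k=20):
    """Stage 1: select top-k candidates by cosine similarity of
    global (image-level) DINOv2 embeddings."""
    q = query_vec / np.linalg.norm(query_vec)
    db = db_vecs / np.linalg.norm(db_vecs, axis=1, keepdims=True)
    sims = db @ q
    return np.argsort(-sims)[:k]

def mnn_score(query_patches, cand_patches):
    """Stage 2 score: patch-level mutual nearest neighbor similarity.
    A query patch and a candidate patch form a mutual pair when each
    is the other's nearest neighbor; the score sums those similarities."""
    q = query_patches / np.linalg.norm(query_patches, axis=1, keepdims=True)
    c = cand_patches / np.linalg.norm(cand_patches, axis=1, keepdims=True)
    sim = q @ c.T                 # (Nq, Nc) patch-to-patch cosine similarities
    nn_q = sim.argmax(axis=1)     # nearest candidate patch for each query patch
    nn_c = sim.argmax(axis=0)     # nearest query patch for each candidate patch
    mutual = [sim[i, j] for i, j in enumerate(nn_q) if nn_c[j] == i]
    return float(np.sum(mutual))

def rerank(query_vec, query_patches, db_vecs, db_patches, k=20):
    """Full pipeline: global candidate selection, then MNN re-ranking."""
    cands = global_retrieve(query_vec, db_vecs, k)
    scores = [mnn_score(query_patches, db_patches[i]) for i in cands]
    return cands[np.argsort(-np.array(scores))]
```

Mutual nearest-neighbor matching is a common re-ranking heuristic because it discards one-sided matches: a background patch that happens to resemble many candidate patches rarely survives the mutuality check.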
Related papers
- Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration [64.12127577975696]
Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. We propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration.
arXiv Detail & Related papers (2026-01-20T15:17:14Z) - Beyond Vision: Contextually Enriched Image Captioning with Multi-Modal Retrieval [0.0]
Real-world image captions often lack contextual depth. This gap limits the effectiveness of image understanding in domains like journalism, education, and digital archives. We propose a multimodal pipeline that augments visual input with external textual knowledge.
arXiv Detail & Related papers (2025-12-23T04:21:15Z) - EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions [11.853877966862086]
Event-based image retrieval from free-form captions presents a significant challenge. We introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection. Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge.
arXiv Detail & Related papers (2025-08-31T09:03:25Z) - ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing [128.8346376825612]
Key challenges of high-quality image captioning lie in the inherent biases of LVLMs. We propose a scalable debiased captioning strategy, which continuously enriches and calibrates the caption with increased inference budget. Annotating 450K images with ScaleCap and using them for LVLM pretraining leads to consistent performance gains across 11 widely used benchmarks.
arXiv Detail & Related papers (2025-06-24T17:59:55Z) - DIR: Retrieval-Augmented Image Captioning with Comprehensive Understanding [10.347788969721844]
Dive Into Retrieval (DIR) is designed to enhance both the image-to-text retrieval process and the utilization of retrieved text. DIR not only maintains competitive in-domain performance but also significantly improves out-of-domain generalization, all without increasing inference costs.
arXiv Detail & Related papers (2024-12-02T04:39:17Z) - Generating image captions with external encyclopedic knowledge [1.452875650827562]
We create an end-to-end caption generation system that makes extensive use of image-specific encyclopedic data.
Our approach includes a novel way of using image location to identify relevant open-domain facts in an external knowledge base.
Our system is trained and tested on a new dataset with naturally produced knowledge-rich captions.
arXiv Detail & Related papers (2022-10-10T16:09:21Z) - Scaling Autoregressive Models for Content-Rich Text-to-Image Generation [95.02406834386814]
Parti treats text-to-image generation as a sequence-to-sequence modeling problem.
Parti uses a Transformer-based image tokenizer, ViT-VQGAN, to encode images as sequences of discrete tokens.
PartiPrompts (P2) is a new holistic benchmark of over 1600 English prompts.
arXiv Detail & Related papers (2022-06-22T01:11:29Z) - Fine-grained Image Captioning with CLIP Reward [104.71533106301598]
We propose using CLIP, a multimodal encoder trained on huge numbers of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function.
We also propose a simple fine-tuning strategy for the CLIP text encoder that improves grammar without requiring extra text annotation.
In experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than the CIDEr-optimized model.
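The reward mechanism summarized above can be sketched in a few lines. This is an illustrative outline under assumptions: the function names are invented here, and the self-critical greedy baseline is a standard pattern for caption-level REINFORCE training, not a detail stated in this summary; only "CLIP similarity as reward" comes from the abstract.

```python
import numpy as np

def clip_reward(image_emb, text_emb):
    """Reward for a generated caption: cosine similarity between the
    CLIP image embedding and the CLIP text embedding of the caption."""
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return float(i @ t)

def self_critical_advantage(sampled_rewards, greedy_reward):
    """REINFORCE with a greedy-decoding baseline: sampled captions that
    score above the baseline get positive weight, others negative."""
    return np.asarray(sampled_rewards, dtype=float) - greedy_reward
```

Because the reward is computed from embeddings rather than n-gram overlap, it can prefer a distinctive but correct caption over a generic one, which is the contrast with CIDEr optimization the summary draws.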
arXiv Detail & Related papers (2022-05-26T02:46:09Z) - CapOnImage: Context-driven Dense-Captioning on Image [13.604173177437536]
We introduce a new task called captioning on image (CapOnImage), which aims to generate dense captions at different locations of the image based on contextual information.
We propose a multi-modal pre-training model with multi-level pre-training tasks that progressively learn the correspondence between texts and image locations.
Compared with other image captioning model variants, our model achieves the best results in both captioning accuracy and diversity aspects.
arXiv Detail & Related papers (2022-04-27T14:40:31Z) - Injecting Semantic Concepts into End-to-End Image Captioning [61.41154537334627]
We propose a pure vision transformer-based image captioning model, dubbed ViTCAP, in which grid representations are used without extracting regional features.
For improved performance, we introduce a novel Concept Token Network (CTN) to predict the semantic concepts and then incorporate them into the end-to-end captioning.
In particular, the CTN is built on the basis of a vision transformer and is designed to predict the concept tokens through a classification task.
arXiv Detail & Related papers (2021-12-09T22:05:05Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.