EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions
- URL: http://arxiv.org/abs/2509.00751v1
- Date: Sun, 31 Aug 2025 09:03:25 GMT
- Title: EVENT-Retriever: Event-Aware Multimodal Image Retrieval for Realistic Captions
- Authors: Dinh-Khoi Vo, Van-Loc Nguyen, Minh-Triet Tran, Trung-Nghia Le
- Abstract summary: Event-based image retrieval from free-form captions presents a significant challenge. We introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection. Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge.
- Score: 11.853877966862086
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Event-based image retrieval from free-form captions presents a significant challenge: models must understand not only visual features but also latent event semantics, context, and real-world knowledge. Conventional vision-language retrieval approaches often fall short when captions describe abstract events, implicit causality, temporal context, or contain long, complex narratives. To tackle these issues, we introduce a multi-stage retrieval framework combining dense article retrieval, event-aware language model reranking, and efficient image collection, followed by caption-guided semantic matching and rank-aware selection. We leverage Qwen3 for article search, Qwen3-Reranker for contextual alignment, and Qwen2-VL for precise image scoring. To further enhance performance and robustness, we fuse outputs from multiple configurations using Reciprocal Rank Fusion (RRF). Our system achieves the top-1 score on the private test set of Track 2 in the EVENTA 2025 Grand Challenge, demonstrating the effectiveness of combining language-based reasoning and multimodal retrieval for complex, real-world image understanding. The code is available at https://github.com/vdkhoi20/EVENT-Retriever.
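As a point of reference, the Reciprocal Rank Fusion (RRF) step named in the abstract can be sketched in a few lines of Python. This is a minimal illustration of the standard RRF formula (score = sum of 1/(k + rank) across runs), not the authors' released code; the smoothing constant k=60 and the image IDs are illustrative assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of candidate IDs via Reciprocal Rank Fusion.

    Each list is ordered best-first; a candidate's fused score is the sum of
    1 / (k + rank) over every list in which it appears. k=60 is a common
    default in the RRF literature; the paper does not state its setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, candidate in enumerate(ranking, start=1):
            scores[candidate] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical example: fuse image rankings from two retrieval configurations.
run_a = ["img_3", "img_1", "img_7"]
run_b = ["img_1", "img_9", "img_3"]
print(reciprocal_rank_fusion([run_a, run_b]))
# -> ['img_1', 'img_3', 'img_9', 'img_7']
```

Because RRF operates only on ranks, it lets heterogeneous scorers (here, different retrieval configurations) be combined without calibrating their raw scores against one another.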
Related papers
- Seeing Through Words: Controlling Visual Retrieval Quality with Language Models [68.49490036960559]
We propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms. Our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries.
arXiv Detail & Related papers (2026-02-24T18:20:57Z) - Fine-Grained Zero-Shot Composed Image Retrieval with Complementary Visual-Semantic Integration [64.12127577975696]
Zero-shot composed image retrieval (ZS-CIR) is a rapidly growing area with significant practical applications. Existing ZS-CIR methods often struggle to capture fine-grained changes and integrate visual and semantic information effectively. We propose a novel Fine-Grained Zero-Shot Composed Image Retrieval method with Complementary Visual-Semantic Integration.
arXiv Detail & Related papers (2026-01-20T15:17:14Z) - Leveraging Lightweight Entity Extraction for Scalable Event-Based Image Retrieval [0.0]
Real-world image-text retrieval is challenging due to vague or context-dependent queries, linguistic variability, and the need for scalable solutions. We propose a lightweight two-stage retrieval pipeline that leverages event-centric entity extraction to incorporate temporal and contextual signals from real-world captions. Our method achieves a mean average precision of 0.559, substantially outperforming prior baselines.
arXiv Detail & Related papers (2025-12-24T15:02:33Z) - Vision-Free Retrieval: Rethinking Multimodal Search with Textual Scene Descriptions [81.33113485830711]
We introduce a vision-free, single-encoder retrieval pipeline for vision-language models. We migrate to a text-to-text paradigm with the assistance of VLLM-generated structured image descriptions. Our approach achieves state-of-the-art zero-shot performance on multiple retrieval and compositionality benchmarks.
arXiv Detail & Related papers (2025-09-23T16:22:27Z) - ReCap: Event-Aware Image Captioning with Article Retrieval and Semantic Gaussian Normalization [9.914251544971686]
ReCap is a novel pipeline for event-enriched image retrieval and captioning. It incorporates broader contextual information from relevant articles to generate narrative-rich captions. Our approach addresses the limitations of standard vision-language models.
arXiv Detail & Related papers (2025-09-01T08:48:33Z) - Qwen-Image Technical Report [86.46471547116158]
We present Qwen-Image, an image generation foundation model that achieves significant advances in complex text rendering and precise image editing. We design a comprehensive data pipeline that includes large-scale data collection, filtering, annotation, synthesis, and balancing. Qwen-Image performs exceptionally well in alphabetic languages such as English, and also achieves remarkable progress on more challenging logographic languages like Chinese.
arXiv Detail & Related papers (2025-08-04T11:49:20Z) - ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models [12.265270657795275]
ImageChain is a framework that enhances MLLMs with sequential reasoning capabilities over image data. Our approach improves performance on the next-scene description task. ImageChain achieves robust zero-shot out-of-domain performance in applications ranging from comics to robotics.
arXiv Detail & Related papers (2025-02-26T18:55:06Z) - Leopard: A Vision Language Model For Text-Rich Multi-Image Tasks [62.758680527838436]
We propose Leopard, an MLLM tailored for handling vision-language tasks involving multiple text-rich images. First, we curated about one million high-quality multimodal instruction-tuning samples, tailored to text-rich, multi-image scenarios. Second, we proposed an adaptive high-resolution multi-image encoding module to dynamically optimize the allocation of visual sequence length.
arXiv Detail & Related papers (2024-10-02T16:55:01Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - CLIP-Event: Connecting Text and Images with Event Structures [123.31452120399827]
We propose a contrastive learning framework that enforces vision-language pretraining models to comprehend event structures.
We take advantage of text information extraction technologies to obtain event structural knowledge.
Experiments show that our zero-shot CLIP-Event outperforms the state-of-the-art supervised model in argument extraction.
arXiv Detail & Related papers (2022-01-13T17:03:57Z)