RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval
- URL: http://arxiv.org/abs/2501.16303v1
- Date: Mon, 27 Jan 2025 18:45:07 GMT
- Title: RAPID: Retrieval-Augmented Parallel Inference Drafting for Text-Based Video Event Retrieval
- Authors: Long Nguyen, Huy Nguyen, Bao Khuu, Huy Luu, Huy Le, Tuan Nguyen, Tho Quan
- Abstract summary: Existing methods for text-based video event retrieval focus heavily on object-level descriptions, overlooking the crucial role of contextual information.
We propose a novel system called RAPID, which leverages advancements in Large Language Models (LLMs) and prompt-based learning to semantically correct and enrich user queries with relevant contextual information.
Our system was validated for both speed and accuracy through participation in the Ho Chi Minh City AI Challenge 2024, where it successfully retrieved events from over 300 hours of video.
- Abstract: Retrieving events from videos using text queries has become increasingly challenging due to the rapid growth of multimedia content. Existing methods for text-based video event retrieval often focus heavily on object-level descriptions, overlooking the crucial role of contextual information. This limitation is especially apparent when queries lack sufficient context, such as missing location details or ambiguous background elements. To address these challenges, we propose a novel system called RAPID (Retrieval-Augmented Parallel Inference Drafting), which leverages advancements in Large Language Models (LLMs) and prompt-based learning to semantically correct and enrich user queries with relevant contextual information. These enriched queries are then processed through parallel retrieval, followed by an evaluation step to select the most relevant results based on their alignment with the original query. Through extensive experiments on our custom-developed dataset, we demonstrate that RAPID significantly outperforms traditional retrieval methods, particularly for contextually incomplete queries. Our system was validated for both speed and accuracy through participation in the Ho Chi Minh City AI Challenge 2024, where it successfully retrieved events from over 300 hours of video. Further evaluation comparing RAPID with the baseline proposed by the competition organizers demonstrated its superior effectiveness, highlighting the strength and robustness of our approach.
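The abstract outlines a three-stage pipeline: an LLM corrects and enriches the user query into several context-augmented drafts, retrieval runs over all drafts in parallel, and an evaluation step keeps the candidates best aligned with the original query. The sketch below illustrates that flow under stated assumptions; the function names, the thread-pool parallelism, and the toy bag-of-words scorer are illustrative stand-ins, not the authors' implementation.

```python
# Minimal sketch of a RAPID-style pipeline as described in the abstract:
# (1) enrich an underspecified query into several context-augmented drafts,
# (2) run retrieval for every draft in parallel,
# (3) keep the results best aligned with the *original* query.
# The callables, the thread pool, and the bag-of-words scorer below are
# assumptions for illustration only.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from math import sqrt
from typing import Callable


def cosine_bow(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity used as a stand-in alignment scorer."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0


def rapid_search(
    query: str,
    enrich: Callable[[str], list[str]],    # e.g. an LLM prompted to add missing context
    retrieve: Callable[[str], list[str]],  # e.g. a lookup against a video-event index
    top_k: int = 5,
) -> list[str]:
    drafts = enrich(query) or [query]
    # Parallel retrieval over every enriched draft.
    with ThreadPoolExecutor() as pool:
        result_lists = list(pool.map(retrieve, drafts))
    # Evaluation step: rank all candidates by alignment with the original query.
    candidates = {r for results in result_lists for r in results}
    ranked = sorted(candidates, key=lambda r: cosine_bow(query, r), reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    corpus = [
        "a car crash at a rainy intersection downtown",
        "a street parade with dragon dancers at night",
        "a dog catching a frisbee in a sunny park",
    ]
    hits = rapid_search(
        "car accident",
        enrich=lambda q: [q, q + " at an intersection", q + " in the rain"],
        retrieve=lambda q: sorted(corpus, key=lambda d: cosine_bow(q, d), reverse=True)[:2],
    )
    print(hits)
```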
Related papers
- Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning
A new benchmark, HIREST, is presented, covering video retrieval, moment retrieval, moment segmentation, and step-captioning.
We propose a query-centric audio-visual cognition network that constructs a reliable multi-modal representation for the three tasks.
It captures user-preferred content and thus attains a query-centric audio-visual representation shared across the three tasks.
arXiv Detail & Related papers (2024-12-18T06:43:06Z)
- GQE: Generalized Query Expansion for Enhanced Text-Video Retrieval
This paper introduces a novel data-centric approach, Generalized Query Expansion (GQE), to address the inherent information imbalance between text and video.
By adaptively segmenting videos into short clips and employing zero-shot captioning, GQE enriches the training dataset with comprehensive scene descriptions.
GQE achieves state-of-the-art performance on several benchmarks, including MSR-VTT, MSVD, LSMDC, and VATEX.
arXiv Detail & Related papers (2024-08-14T01:24:09Z)
- Improving Retrieval in Sponsored Search by Leveraging Query Context Signals
We propose an approach to enhance query understanding by augmenting queries with rich contextual signals.
We use web search titles and snippets to ground queries in real-world information and utilize GPT-4 to generate query rewrites and explanations.
Our context-aware approach substantially outperforms context-free models.
arXiv Detail & Related papers (2024-07-19T14:28:53Z)
- EA-VTR: Event-Aware Video-Text Retrieval
The Event-Aware Video-Text Retrieval (EA-VTR) model achieves strong video-text retrieval through superior awareness of video events.
EA-VTR efficiently encodes frame-level and video-level visual representations simultaneously, enabling cross-modal alignment of both detailed event content and complex event temporal structure.
arXiv Detail & Related papers (2024-07-10T09:09:58Z)
- Analyzing Temporal Complex Events with Large Language Models? A Benchmark towards Temporal, Long Context Understanding
We refer to a complex event composed of many news articles over an extended period as a Temporal Complex Event (TCE).
This paper proposes a novel approach using Large Language Models (LLMs) to systematically extract and analyze the event chain within TCE.
arXiv Detail & Related papers (2024-06-04T16:42:17Z)
- Text-Video Retrieval with Global-Local Semantic Consistent Learning
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- Event-driven Real-time Retrieval in Web Search
This paper expands the query with event information that represents real-time search intent.
We further enhance the model's capacity for event representation through multi-task training.
Our proposed approach significantly outperforms existing state-of-the-art baseline methods.
arXiv Detail & Related papers (2023-12-01T06:30:31Z)
- Improving Query-Focused Meeting Summarization with Query-Relevant Knowledge
We propose a knowledge-enhanced two-stage framework called Knowledge-Aware Summarizer (KAS) to tackle the challenges of query-focused meeting summarization.
In the first stage, we introduce knowledge-aware scores to improve the query-relevant segment extraction.
In the second stage, we incorporate query-relevant knowledge in the summary generation.
arXiv Detail & Related papers (2023-09-05T10:26:02Z)
- Part2Whole: Iteratively Enrich Detail for Cross-Modal Retrieval with Partial Query
We propose an interactive retrieval framework called Part2Whole to tackle this problem.
An Interactive Retrieval Agent is trained to build an optimal policy to refine the initial query.
We present a weakly-supervised reinforcement learning method that needs no human-annotated data other than the text-image dataset.
arXiv Detail & Related papers (2021-03-02T11:27:05Z)