Enhanced Multimodal Video Retrieval System: Integrating Query Expansion and Cross-modal Temporal Event Retrieval
- URL: http://arxiv.org/abs/2512.06334v1
- Date: Sat, 06 Dec 2025 07:46:51 GMT
- Title: Enhanced Multimodal Video Retrieval System: Integrating Query Expansion and Cross-modal Temporal Event Retrieval
- Authors: Van-Thinh Vo, Minh-Khoi Nguyen, Minh-Huy Tran, Anh-Quan Nguyen-Tran, Duy-Tan Nguyen, Khanh-Loi Nguyen, Anh-Minh Phan
- Abstract summary: A cross-modal temporal event retrieval framework is proposed. A Kernel Density Gaussian Mixture Thresholding (KDE-GMM) algorithm is used for adaptive keyframe selection. The system incorporates a large language model (LLM) to refine and expand user queries.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimedia information retrieval from videos remains a challenging problem. While recent systems have advanced multimodal search through semantic, object, and OCR queries - and can retrieve temporally consecutive scenes - they often rely on a single query modality for an entire sequence, limiting robustness in complex temporal contexts. To overcome this, we propose a cross-modal temporal event retrieval framework that enables different query modalities to describe distinct scenes within a sequence. To adaptively determine decision thresholds for scene transitions and slide changes, we build a Kernel Density Gaussian Mixture Thresholding (KDE-GMM) algorithm, ensuring optimal keyframe selection. These extracted keyframes act as compact, high-quality visual exemplars that retain each segment's semantic essence, improving retrieval precision and efficiency. Additionally, the system incorporates a large language model (LLM) to refine and expand user queries, enhancing overall retrieval performance. The proposed system's effectiveness and robustness were demonstrated through its strong results in the Ho Chi Minh AI Challenge 2025.
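The abstract names the ingredients but not the formulation of KDE-GMM. As a hedged illustration only, one natural way to combine them is to fit a two-component Gaussian mixture to per-frame change scores and place the decision threshold at the kernel-density valley between the two estimated modes. Everything below (the function names, the bimodality assumption, the valley rule) is an assumption, not the authors' implementation.
```python
# Hypothetical sketch of KDE+GMM adaptive thresholding for keyframe
# selection; NOT the paper's actual algorithm. Assumes per-frame change
# scores (e.g., histogram differences between consecutive frames) form
# roughly two modes: "no transition" vs. "scene transition / slide change".
import numpy as np
from scipy.stats import gaussian_kde
from sklearn.mixture import GaussianMixture

def adaptive_threshold(change_scores: np.ndarray) -> float:
    """Place a decision threshold between the two modes of the scores."""
    gmm = GaussianMixture(n_components=2, random_state=0)
    gmm.fit(change_scores.reshape(-1, 1))
    lo, hi = np.sort(gmm.means_.ravel())   # the two estimated mode centers
    # Smooth the empirical distribution with a KDE and take the density
    # valley between the two GMM means as the threshold.
    kde = gaussian_kde(change_scores)
    grid = np.linspace(lo, hi, 512)
    return float(grid[np.argmin(kde(grid))])

def select_keyframes(change_scores: np.ndarray) -> np.ndarray:
    """Indices of frames whose change score crosses the adaptive threshold."""
    return np.where(change_scores > adaptive_threshold(change_scores))[0]
```
Running the same procedure separately on scene-transition scores and slide-change scores would yield the two per-signal thresholds the abstract mentions.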
Related papers
- OmniRet: Efficient and High-Fidelity Omni Modality Retrieval [51.80205678389465]
We present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our model demonstrates significant improvements on composed-query, audio, and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others.
arXiv Detail & Related papers (2026-03-02T17:19:55Z)
- Pixel-Grounded Retrieval for Knowledgeable Large Multimodal Models [58.46663983451155]
PixSearch is an end-to-end Segmenting Large Multimodal Model (LMM) that unifies region-level perception and retrieval-augmented reasoning. During encoding, PixSearch emits <search> tokens to trigger retrieval, selects query modalities (text, image, or region), and generates pixel-level masks that directly serve as visual queries. On egocentric and entity-centric VQA benchmarks, PixSearch substantially improves factual consistency and generalization. A generic sketch of such token-triggered retrieval follows this entry.
arXiv Detail & Related papers (2026-01-27T00:46:08Z)
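The PixSearch summary describes decoding that pauses when the model emits a search token. A generic, hedged sketch of such a loop follows; the model and retriever objects, the token string, and the control flow are placeholders, not PixSearch's real interface.
```python
# Generic token-triggered retrieval loop; placeholders throughout,
# not PixSearch's actual API.
SEARCH_TOKEN = "<search>"

def generate_with_retrieval(model, retriever, prompt: str, max_rounds: int = 8) -> str:
    """Decode text, pausing to retrieve whenever the model emits <search>."""
    context = prompt
    for _ in range(max_rounds):
        chunk = model.generate(context)             # decode until EOS or <search>
        if SEARCH_TOKEN not in chunk:
            return context + chunk                  # finished without retrieval
        kept, query = chunk.split(SEARCH_TOKEN, 1)
        evidence = retriever.search(query.strip())  # query may be text/image/region
        # Fold the retrieved evidence back into the context and keep decoding.
        context += kept + f"\n[evidence] {evidence}\n"
    return context
```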
- FusAD: Time-Frequency Fusion with Adaptive Denoising for General Time Series Analysis [92.23551599659186]
Time series analysis plays a vital role in fields such as finance, healthcare, industry, and meteorology. FusAD is a unified analysis framework designed for diverse time series tasks.
arXiv Detail & Related papers (2025-12-16T04:34:27Z)
- Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion [0.0]
We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search; a hedged sketch of this decay idea follows this entry. Third, agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries.
arXiv Detail & Related papers (2025-12-15T02:50:43Z)
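The exponential-decay penalty on temporal gaps can be illustrated in a few lines. The decay rate, the additive score combination, and the beam-search integration are assumptions here, not the paper's exact formulation.
```python
import math

def temporal_pair_score(score_a: float, score_b: float,
                        t_a: float, t_b: float,
                        decay_rate: float = 0.1) -> float:
    """Combine two segment scores, penalizing large temporal gaps.

    Hedged sketch only: the penalty is 1.0 for adjacent moments and
    decays exponentially as the gap (in seconds) grows.
    """
    penalty = math.exp(-decay_rate * abs(t_b - t_a))
    return (score_a + score_b) * penalty
```
A beam search over candidate moments would then expand partial event sequences with this pairwise score and prune beams whose accumulated score falls behind.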
- MADTempo: An Interactive System for Multi-Event Temporal Video Retrieval with Query Augmentation [2.819801450768979]
We introduce MADTempo, a video retrieval framework developed by our team, AIO_Trinh. Our temporal search mechanism captures event-level continuity by aggregating similarity scores across sequential video segments (a sketch of this aggregation follows the entry link below). A Google Image Search-based fallback module expands query representations with external web imagery.
arXiv Detail & Related papers (2025-12-15T02:25:46Z)
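Aggregating similarity scores across sequential segments can be sketched by scoring each candidate start position along the event-to-segment diagonal. The one-event-per-consecutive-segment alignment is an assumed simplification, not necessarily MADTempo's exact mechanism.
```python
import numpy as np

def multi_event_scores(sim: np.ndarray) -> np.ndarray:
    """Score start positions for an ordered multi-event query.

    sim[e, s] is the similarity of event-query e to video segment s.
    Assumes event e aligns with segment start+e; real systems may allow
    gaps or learned alignments.
    """
    n_events, n_segments = sim.shape
    return np.array([
        sim[np.arange(n_events), start + np.arange(n_events)].sum()
        for start in range(n_segments - n_events + 1)
    ])
```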
- X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification [79.37768038337971]
We propose a novel cross-modality feature learning framework named X-ReID for VVI-ReID. Specifically, we first propose a Cross-modality Prototype Collaboration (CPC). Then, a Multi-granularity Information Interaction (MII) is designed, incorporating short-term interactions from adjacent frames, long-term cross-frame information fusion, and cross-modality feature alignment.
arXiv Detail & Related papers (2025-11-22T07:57:15Z)
- Generalized Contrastive Learning for Universal Multimodal Retrieval [53.70202081784898]
Cross-modal retrieval models (e.g., CLIP) show degraded performance when retrieving keys composed of a fused image-text modality. This paper proposes Generalized Contrastive Learning (GCL), a novel loss formulation that improves multimodal retrieval performance without the need for new dataset curation. A minimal contrastive-loss sketch with fused keys follows this entry.
arXiv Detail & Related papers (2025-09-30T01:25:04Z)
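The summary does not give GCL's loss. For orientation only, here is a minimal InfoNCE-style contrastive loss whose retrieval keys are fused image-text embeddings; the averaging fusion and temperature are assumptions, and this is plain contrastive learning rather than the generalized formulation the paper proposes.
```python
import numpy as np

def contrastive_loss_fused_keys(queries: np.ndarray,
                                image_keys: np.ndarray,
                                text_keys: np.ndarray,
                                temperature: float = 0.07) -> float:
    """InfoNCE over fused image-text keys; matched pairs on the diagonal.

    queries, image_keys, text_keys: (N, D) arrays. Averaging the two key
    modalities is an assumed fusion, not GCL's actual construction.
    """
    keys = (image_keys + text_keys) / 2.0
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    logits = (q @ keys.T) / temperature
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```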
- CLaMR: Contextualized Late-Interaction for Multimodal Content Retrieval [70.9990850395981]
We introduce CLaMR, a multimodal, late-interaction retriever that jointly indexes four modalities: video frames, transcribed speech, on-screen text, and metadata. CLaMR is trained to enhance dynamic modality selection via two key innovations. A generic late-interaction scoring sketch follows this entry.
arXiv Detail & Related papers (2025-06-06T15:02:30Z)
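Late interaction in the ColBERT tradition lets each query token take its maximum similarity over all document-side tokens and sums the maxima. Extending the document side to a pool of tokens from several modality streams gives a rough picture; pooling by concatenation is an assumed simplification, not CLaMR's contextualized modality selection.
```python
import numpy as np

def late_interaction_score(query_tokens: np.ndarray,
                           modality_tokens: list) -> float:
    """MaxSim scoring over tokens pooled from multiple modalities.

    query_tokens: (Q, D); modality_tokens: list of (T_i, D) arrays, e.g.
    frame, speech, on-screen-text, and metadata embeddings.
    """
    doc_tokens = np.concatenate(modality_tokens, axis=0)  # (sum T_i, D)
    sims = query_tokens @ doc_tokens.T                    # (Q, sum T_i)
    return float(sims.max(axis=1).sum())  # best match per query token, summed
```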
- OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multimodal Retrieval [31.69320295943039]
Vision-language retrieval-augmented generation (RAG) has become an effective approach for tackling Knowledge-Based Visual Question Answering (KB-VQA). We propose a multimodal RAG system featuring a coarse-to-fine, multi-step retrieval that harmonizes multiple granularities and modalities to enhance efficacy.
arXiv Detail & Related papers (2025-05-10T14:24:41Z)
- Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking [3.5291730624600848]
Long-form video understanding presents significant challenges for interactive retrieval systems. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking. This paper presents a novel framework to enhance interactive video retrieval through four key innovations.
arXiv Detail & Related papers (2025-04-11T09:36:46Z)
- Everything Can Be Described in Words: A Simple Unified Multi-Modal Framework with Semantic and Temporal Alignment [0.0]
We propose UMaT, a framework that unifies visual and auditory inputs as structured text for large language models. It significantly improves state-of-the-art Long Video Question Answering accuracy. A sketch of this text unification follows this entry.
arXiv Detail & Related papers (2025-03-12T05:28:24Z)
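Unifying visual and auditory inputs as structured text can be sketched as simple prompt assembly. The tags and timestamp format below are illustrative, not UMaT's actual schema.
```python
def to_structured_text(captions, transcript):
    """Merge per-segment visual captions and ASR text into one time-ordered
    textual stream an LLM can reason over. Tags are illustrative only.

    captions, transcript: lists of (start_sec, end_sec, text) tuples.
    """
    events = [("VISION", *c) for c in captions] + [("AUDIO", *t) for t in transcript]
    events.sort(key=lambda e: e[1])  # interleave by start time (temporal alignment)
    return "\n".join(f"[{tag} {s:.1f}s-{e:.1f}s] {text}" for tag, s, e, text in events)
```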
- VisDoM: Multi-Document QA with Visually Rich Elements Using Multimodal Retrieval-Augmented Generation [100.06122876025063]
This paper introduces VisDoMBench, the first comprehensive benchmark designed to evaluate QA systems in multi-document settings. We propose VisDoMRAG, a novel multimodal Retrieval-Augmented Generation (RAG) approach that simultaneously utilizes visual and textual RAG.
arXiv Detail & Related papers (2024-12-14T06:24:55Z)