Related papers: Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion

URL: http://arxiv.org/abs/2512.12935v1
Date: Mon, 15 Dec 2025 02:50:43 GMT
Title: Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion
Authors: Toan Le Ngo Thanh, Phat Ha Huu, Tan Nguyen Dang Duy, Thong Nguyen Le Minh, Anh Nguyen Nhu Tinh,
Abstract summary: We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search. Third, Agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The exponential growth of video content has created an urgent need for efficient multimodal moment retrieval systems. However, existing approaches face three critical challenges: (1) fixed-weight fusion strategies fail under cross-modal noise and ambiguous queries, (2) temporal modeling struggles to capture coherent event sequences while penalizing unrealistic gaps, and (3) systems require manual modality selection, reducing usability. We propose a unified multimodal moment retrieval system with three key innovations. First, a cascaded dual-embedding pipeline combines BEIT-3 and SigLIP for broad retrieval, refined by BLIP-2-based reranking to balance recall and precision. Second, a temporal-aware scoring mechanism applies exponential decay penalties to large temporal gaps via beam search, constructing coherent event sequences rather than isolated frames. Third, agent-guided query decomposition (GPT-4o) automatically interprets ambiguous queries, decomposes them into modality-specific sub-queries (visual/OCR/ASR), and performs adaptive score fusion, eliminating manual modality selection. Qualitative analysis demonstrates that our system effectively handles ambiguous queries, retrieves temporally coherent sequences, and dynamically adapts fusion strategies, advancing interactive moment search capabilities.
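
The temporal-aware scoring idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the scoring formula, so the additive score, the `exp(-decay * gap)` penalty, the `decay` value, and the names `gap_penalty` and `beam_search_sequences` are all assumptions made for illustration.

```python
import math
from typing import List, Tuple

# One retrieved frame: (timestamp in seconds, relevance score). Hypothetical layout.
Candidate = Tuple[float, float]


def gap_penalty(gap: float, decay: float = 0.1) -> float:
    """Exponential decay factor in (0, 1]; larger gaps shrink the score more."""
    return math.exp(-decay * gap)


def beam_search_sequences(
    steps: List[List[Candidate]],
    beam_width: int = 3,
    decay: float = 0.1,
) -> List[Tuple[float, List[float]]]:
    """Return up to beam_width (score, timestamps) sequences, best first.

    steps[i] holds the candidate frames for the i-th sub-event of a query.
    """
    if not steps:
        return []
    # Seed one beam per candidate of the first sub-event.
    beams = [(score, [t]) for t, score in steps[0]]
    beams = sorted(beams, key=lambda b: b[0], reverse=True)[:beam_width]

    for candidates in steps[1:]:
        expanded = []
        for total, times in beams:
            for t, score in candidates:
                gap = t - times[-1]
                if gap <= 0:
                    continue  # assumed constraint: frames must advance in time
                # Assumed scoring: add the frame score, damped by the gap penalty.
                expanded.append((total + score * gap_penalty(gap, decay), times + [t]))
        beams = sorted(expanded, key=lambda b: b[0], reverse=True)[:beam_width]
    return beams


if __name__ == "__main__":
    # Two sub-events; the second has a nearby frame and a distant, higher-scoring one.
    steps = [[(10.0, 0.9), (100.0, 0.8)], [(12.0, 0.7), (300.0, 0.95)]]
    for score, times in beam_search_sequences(steps, beam_width=2):
        print(round(score, 4), times)
```

In this sketch the nearby frame wins even though the distant one has a higher raw score, because the penalty on a 290-second gap is close to zero. Keeping only `beam_width` partial sequences per step avoids enumerating every combination of candidates.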

Related papers

OmniRet: Efficient and High-Fidelity Omni Modality Retrieval [51.80205678389465]
We present OmniRet, the first retrieval model capable of handling complex, composed queries spanning three key modalities: text, vision, and audio. Our model demonstrates significant improvements on composed query, audio and video retrieval tasks, while achieving on-par performance with state-of-the-art models on others.
arXiv Detail & Related papers (2026-03-02T17:19:55Z)
Enhanced Multimodal Video Retrieval System: Integrating Query Expansion and Cross-modal Temporal Event Retrieval [0.0]
A cross-modal temporal event retrieval framework is proposed. A Kernel Density Mixture Thresholding (KDE-GMM) algorithm is used. The system incorporates a large language model (LLM) to refine and expand user queries.
arXiv Detail & Related papers (2025-12-06T07:46:51Z)
Adaptive Multi-Agent Reasoning for Text-to-Video Retrieval [12.701443847087164]
We propose an adaptive multi-agent retrieval framework that orchestrates specialized agents over multiple reasoning iterations. Our framework achieves a twofold improvement over CLIP4Clip and outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2025-12-02T09:52:51Z)
Multi-Agent Pointer Transformer: Seq-to-Seq Reinforcement Learning for Multi-Vehicle Dynamic Pickup-Delivery Problems [17.3780399150554]
This paper proposes an end-to-end centralized decision-making framework based on sequence-to-sequence, named Multi-Agent Pointer Transformer (MAPT). MAPT significantly outperforms existing baseline methods and offers substantial computation-time advantages over classical operations research methods.
arXiv Detail & Related papers (2025-11-21T17:32:10Z)
Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey [92.71325249013535]
Deliberative tree search is a cornerstone of Large Language Model (LLM) research. This paper introduces a unified framework that deconstructs search algorithms into three core components.
arXiv Detail & Related papers (2025-10-11T03:29:18Z)
Re3: Learning to Balance Relevance & Recency for Temporal Information Retrieval [10.939002113975706]
Temporal Information Retrieval is a critical yet unresolved task for modern search systems. Re3 is a framework that balances semantic and temporal information through a query-aware gating mechanism. On Re2Bench, Re3 achieves state-of-the-art results, leading in R@1 across all three subsets.
arXiv Detail & Related papers (2025-09-01T09:44:01Z)
Hybrid Deep Searcher: Integrating Parallel and Sequential Search Reasoning [57.78245296980122]
We introduce HDS-QA (Hybrid Deep Search QA), a dataset automatically generated from Natural Questions. It comprises hybrid-hop questions that combine parallelizable independent subqueries (executable simultaneously) and sequentially dependent subqueries (requiring step-by-step resolution). We name the model HybridDeepSearcher, which outperforms state-of-the-art baselines across multiple benchmarks.
arXiv Detail & Related papers (2025-08-26T15:15:17Z)
Towards Efficient and Robust Moment Retrieval System: A Unified Framework for Multi-Granularity Models and Temporal Reranking [3.5291730624600848]
Long-form video understanding presents significant challenges for interactive retrieval systems. Existing approaches often rely on single models, inefficient storage, unstable temporal search, and context-agnostic reranking. This paper presents a novel framework to enhance interactive video retrieval through four key innovations.
arXiv Detail & Related papers (2025-04-11T09:36:46Z)
BRATI: Bidirectional Recurrent Attention for Time-Series Imputation [0.14999444543328289]
Missing data in time-series analysis poses significant challenges, affecting the reliability of downstream applications. This paper introduces BRATI, a novel deep-learning model designed to address multivariate time-series imputation. BRATI processes temporal dependencies and feature correlations across long and short time horizons, utilizing two imputation blocks that operate in opposite temporal directions.
arXiv Detail & Related papers (2025-01-09T17:50:56Z)
Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection. First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network. Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving classification capacity on the multivariate time series classification task. It has three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
Action Quality Assessment with Temporal Parsing Transformer [84.1272079121699]
Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences. We propose a temporal parsing transformer to decompose the holistic feature into temporal part-level representations. Our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
arXiv Detail & Related papers (2022-07-19T13:29:05Z)
Deep Explicit Duration Switching Models for Time Series [84.33678003781908]
We propose a flexible model that is capable of identifying both state- and time-dependent switching dynamics. State-dependent switching is enabled by a recurrent state-to-switch connection. An explicit duration count variable is used to improve the time-dependent switching behavior.
arXiv Detail & Related papers (2021-10-26T17:35:21Z)

This list is automatically generated from the titles and abstracts of the papers on this site.