Fugu-MT Paper Translation (Summary): Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding

Paper summary: Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding

arxiv url: http://arxiv.org/abs/2603.22121v1
Date: Mon, 23 Mar 2026 15:44:48 GMT
Status: Translation complete
System update date: 2026-03-24 19:11:39.761046
Title: Mamba-VMR: Multimodal Query Augmentation via Generated Videos for Precise Temporal Grounding
Authors: Yunzhuo Sun, Xinyue Liu, Yanyang Li, Nanding Wu, Yifang Xu, Linlin Zong, Xianchao Zhang, Wenxin Liang
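Abstract summary: Text-driven video moment retrieval (VMR) remains challenging because hidden temporal dynamics in untrimmed videos are only partially captured. Existing approaches fail to integrate subtitle contexts and generated temporal priors effectively. The paper proposes a novel two-stage framework for enhanced temporal grounding.
Reference score (independently computed attention score): 19.92734717848329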
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Text-driven video moment retrieval (VMR) remains challenging due to limited capture of hidden temporal dynamics in untrimmed videos, leading to imprecise grounding in long sequences. Traditional methods rely on natural language queries (NLQs) or static image augmentations, overlooking motion sequences and suffering from high computational costs in Transformer-based architectures. Existing approaches fail to integrate subtitle contexts and generated temporal priors effectively; we therefore propose a novel two-stage framework for enhanced temporal grounding. In the first stage, LLM-guided subtitle matching identifies relevant textual cues from video subtitles, which are fused with the query to generate auxiliary short videos via text-to-video models, capturing implicit motion information as temporal priors. In the second stage, augmented queries are processed through a multi-modal controlled Mamba network, which extends text-controlled selection with video-guided gating for efficient fusion of generated priors and long sequences while filtering noise. Our framework is agnostic to base retrieval models and widely applicable for multimodal VMR. Experimental evaluations on the TVR benchmark demonstrate significant improvements over state-of-the-art methods, including reduced computational overhead and higher recall in long-sequence grounding.
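
Illustrative sketch (not from the paper): the abstract does not give the equations of the second stage, so the following is only a toy picture of what "text-controlled selection with video-guided gating" could look like as a gated linear recurrence scanned once over the frames. The function name `gated_scan`, the sigmoid gates, and the embeddings `text_query` and `video_prior` are all assumptions made for illustration; the actual Mamba-based network is likely quite different.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gated_scan(frames, text_query, video_prior):
    """Toy input-dependent recurrence over frame features (hypothetical).

    frames:      list of feature vectors, one per frame of the long video
    text_query:  embedding of the text query (assumed, same dimension)
    video_prior: embedding of the generated short video (assumed)
    Returns one hidden-state vector per frame, in a single linear pass.
    """
    h = [0.0] * len(text_query)
    states = []
    for x in frames:
        # Text-controlled selection: frames that match the query keep
        # less of the old state, so they overwrite more of it.
        keep = sigmoid(-dot(x, text_query))
        # Video-guided gate: frames that disagree with the generated
        # prior are damped, which acts as a crude noise filter.
        gate = sigmoid(dot(x, video_prior))
        h = [keep * hi + (1.0 - keep) * gate * xi for hi, xi in zip(h, x)]
        states.append(h)
    return states
```

The cost grows linearly with the number of frames, which is the property the abstract contrasts with Transformer-based architectures; everything else about the gates here is a placeholder.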


Related papers