Fugu-MT Paper Translation (Abstract): Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

Paper overview: Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding

arxiv url: http://arxiv.org/abs/2509.15178v1
Date: Thu, 18 Sep 2025 17:35:50 GMT
Status: Translation complete
System update date: 2025-09-19 17:26:53.365761
Title: Unleashing the Potential of Multimodal LLMs for Zero-Shot Spatio-Temporal Video Grounding
Title (reference translation): The Potential of Multimodal LLMs in Zero-Shot Spatio-Temporal Video Grounding
Authors: Zaiquan Yang, Yuhao Liu, Gerhard Hancke, Rynson W. H. Lau
Abstract summary: We explore a zero-shot solution for STVG using multimodal large language models (MLLMs). We propose an MLLM-based zero-shot framework for STVG.
Reference score (independently computed attention score): 47.400649582392255
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spatio-temporal video grounding (STVG) aims at localizing the spatio-temporal tube of a video, as specified by the input text query. In this paper, we utilize multimodal large language models (MLLMs) to explore a zero-shot solution in STVG. We reveal two key insights about MLLMs: (1) MLLMs tend to dynamically assign special tokens, referred to as grounding tokens, for grounding the text query; and (2) MLLMs often suffer from suboptimal grounding because they cannot fully integrate the cues in the text query (e.g., attributes, actions) for inference. Based on these insights, we propose an MLLM-based zero-shot framework for STVG, which includes novel decomposed spatio-temporal highlighting (DSTH) and temporal-augmented assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH strategy first decouples the original query into attribute and action sub-queries for inquiring about the existence of the target both spatially and temporally. It then uses a novel logit-guided re-attention (LRA) module to learn latent variables as spatial and temporal prompts, by regularizing token predictions for each sub-query. These prompts highlight attribute and action cues, respectively, directing the model's attention to reliable spatially and temporally relevant visual regions. In addition, as the spatial grounding by the attribute sub-query should be temporally consistent, we introduce the TAS strategy, which assembles the predictions using the original video frames and the temporal-augmented frames as inputs to help improve temporal consistency. We evaluate our method on various MLLMs, and show that it outperforms SOTA methods on three common STVG benchmarks. The code will be available at https://github.com/zaiquanyang/LLaVA_Next_STVG.
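Abstract (reference translation): Spatio-temporal video grounding (STVG) aims to localize the spatio-temporal tube of a video specified by the input text query. In this paper, we use multimodal large language models (MLLMs) to explore a zero-shot solution for STVG. MLLMs tend to dynamically assign special tokens, called grounding tokens, to ground the text query. Based on these findings, we propose an MLLM-based zero-shot framework for STVG, which includes novel decomposed spatio-temporal highlighting (DSTH) and temporal-augmented assembling (TAS) strategies to unleash the reasoning ability of MLLMs. The DSTH strategy first decomposes the original query into attribute and action sub-queries that ask about the existence of the target spatially and temporally. It then uses a novel logit-guided re-attention (LRA) module to learn latent variables as spatial and temporal prompts by regularizing the token predictions for each sub-query. These prompts highlight attribute and action cues, respectively, directing the model's attention to reliable spatially and temporally relevant visual regions. In addition, because the spatial grounding from the attribute sub-query should be temporally consistent, we introduce the TAS strategy, which assembles predictions using the original video frames and temporally augmented frames as inputs to improve temporal consistency. We evaluate our method on various MLLMs and show that it outperforms SOTA methods on STVG benchmarks. The code is available at https://github.com/zaiquanyang/LLaVA_Next_STVG.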
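
As a rough illustration of what "assembling predictions" and a "spatio-temporal tube" could look like in code, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how TAS combines the two passes, so the weighted average, the threshold, and the helper names `assemble_scores`, `longest_run`, and `build_tube` are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


def assemble_scores(original: List[float], augmented: List[float],
                    weight: float = 0.5) -> List[float]:
    """Blend per-frame relevance scores from two passes (assumed: weighted mean)."""
    if len(original) != len(augmented):
        raise ValueError("both passes must score the same number of frames")
    return [weight * o + (1.0 - weight) * a for o, a in zip(original, augmented)]


def longest_run(scores: List[float], threshold: float) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the longest contiguous run with score >= threshold."""
    best: Optional[Tuple[int, int]] = None
    start: Optional[int] = None
    for i in range(len(scores) + 1):
        inside = i < len(scores) and scores[i] >= threshold
        if inside and start is None:
            start = i  # a new run begins at frame i
        elif not inside and start is not None:
            # the run [start, i - 1] just ended; keep it if it is the longest so far
            if best is None or i - start > best[1] - best[0] + 1:
                best = (start, i - 1)
            start = None
    return best


def build_tube(boxes: Dict[int, Box], span: Tuple[int, int]) -> Dict[int, Box]:
    """Keep only the per-frame boxes that fall inside the temporal span."""
    first, last = span
    return {t: b for t, b in boxes.items() if first <= t <= last}
```

For example, blending `[0.1, 0.8, 0.9, 0.2]` with `[0.3, 0.6, 0.7, 0.4]` and thresholding at 0.5 selects frames 1 through 2 as the temporal span, and `build_tube` then keeps only the boxes for those frames.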


Related papers