Fugu-MT Paper Translation (Abstract): Temporal-Aware Reasoning Optimization for Video Temporal Grounding

Paper overview: Temporal-Aware Reasoning Optimization for Video Temporal Grounding

arxiv url: http://arxiv.org/abs/2606.09248v1
Date: Mon, 08 Jun 2026 09:21:01 GMT
Status: Translation complete
Last updated in system: 2026-06-09 14:42:06.891405
Title: Temporal-Aware Reasoning Optimization for Video Temporal Grounding
Title (reference translation): Time-Aware Reasoning Optimization for Video Temporal Grounding
Authors: Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu
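Abstract summary: This paper proposes TaRO (Temporal-Aware Reasoning Optimization), a framework that explicitly enhances a model's ability to think with time. First, it introduces Constructive Reasoning Exploration, which uses pre-generated dense captions to construct reasoning paths grounded in explicit visual cues and timestamps. Second, it designs a Temporal-Sensitivity Reward to evaluate reasoning quality. High-quality reasoning should be anchored to specific events and timestamps.
Reference score (independently computed attention measure): 55.29748680163419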
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-modal Large Language Models (MLLMs) have achieved remarkable progress in video temporal grounding with reinforcement learning for generating reasoning paths. However, existing models often produce superficial reasoning, which offers limited guidance for precise temporal localization. This limitation stems from (1) inefficient random exploration and (2) reward functions that focus solely on the answer correctness while ignoring reasoning quality. To address these issues, we propose TaRO (Temporal-Aware Reasoning Optimization), a framework that explicitly enhances the model's ability of thinking with time. First, we introduce a Constructive Reasoning Exploration that leverages pre-generated dense captions to construct reasoning paths grounded in explicit visual cues and timestamps, enabling efficient exploration of high-quality time-aware reasoning. Second, to evaluate reasoning quality, we design a Temporal-Sensitivity Reward. High-quality reasoning should be anchored to specific events and timestamps. If the event boundary under thinking is disrupted, such reasoning should become invalid, leading to a drop in the logit of the reasoning path. We utilize this drop as a critique of reasoning quality. Finally, TaRO follows a progressive curriculum, which starts by utilizing this reward to select better constructed reasoning paths, and evolves to a free exploration phase where the model autonomously generates effective reasoning. Experiments demonstrate that TaRO achieves state-of-the-art performance on VTG benchmarks. Code is available at https://github.com/oceanflowlab/TaRO.
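Abstract (reference translation): Multi-modal Large Language Models (MLLMs) have made remarkable progress in video temporal grounding through reinforcement learning for generating reasoning paths. However, existing models often produce superficial reasoning that offers only limited guidance for precise temporal localization. This limitation arises from (1) inefficient random exploration and (2) reward functions that focus solely on answer correctness while ignoring reasoning quality. To address these issues, we propose TaRO (Temporal-Aware Reasoning Optimization), a framework that explicitly enhances the model's ability to think with time. First, we propose Constructive Reasoning Exploration, which uses pre-generated dense captions to construct reasoning paths grounded in explicit visual cues and timestamps, enabling efficient exploration of high-quality time-aware reasoning. Second, to evaluate reasoning quality, we design a Temporal-Sensitivity Reward. High-quality reasoning should be anchored to specific events and timestamps. If the event boundary under consideration is disrupted, such reasoning becomes invalid and the logit of the reasoning path drops. We use this drop as a critique of reasoning quality. Finally, TaRO follows a progressive curriculum: it first uses this reward to select better constructed reasoning paths, then evolves to a free exploration phase in which the model autonomously generates effective reasoning. Experiments show that TaRO achieves state-of-the-art performance on VTG benchmarks. Code is available at https://github.com/oceanflowlab/TaRO.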
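
The Temporal-Sensitivity Reward described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of mean per-token log-probability as a stand-in for the "logit of the reasoning path", and the toy numbers are all assumptions made for illustration. The abstract does not specify how the event boundary is disrupted, so the sketch simply takes the scores under the original and the disrupted video as inputs.

```python
from typing import Sequence


def mean_logprob(token_logprobs: Sequence[float]) -> float:
    """Average per-token log-probability of a reasoning path (assumed scoring)."""
    if not token_logprobs:
        raise ValueError("reasoning path must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def temporal_sensitivity_reward(
    original_logprobs: Sequence[float],
    disrupted_logprobs: Sequence[float],
) -> float:
    """Hypothetical reward: how much the score of the same reasoning path
    drops once the event boundary it refers to has been disrupted.

    A large positive value suggests the reasoning is anchored to the event;
    a value near zero suggests the reasoning ignores the video content.
    """
    return mean_logprob(original_logprobs) - mean_logprob(disrupted_logprobs)


# Toy values (made up): the same reasoning tokens scored under the original
# video and under a version whose event boundary was disrupted.
grounded = temporal_sensitivity_reward([-0.2, -0.4, -0.3], [-1.0, -1.4, -1.2])
superficial = temporal_sensitivity_reward([-0.5, -0.6], [-0.5, -0.6])

# The grounded path loses score under disruption; the superficial one does not.
print(grounded > superficial)
```

Under these assumptions, reasoning that cites concrete events and timestamps receives a higher reward than reasoning whose likelihood is unaffected by the disruption, which matches the abstract's description of using the drop as a critique of reasoning quality.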
