Fugu-MT Paper Translation (Abstract): Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

Paper overview: Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

arxiv url: http://arxiv.org/abs/2509.03501v1
Date: Wed, 03 Sep 2025 17:33:20 GMT
Status: Translation complete
Last updated in system: 2025-09-04 21:40:46.614717
Title: Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
Title (reference translation): Strefer: Strengthening Video LLMs through Space-Time Referring and Reasoning with Synthetic Instruction Data
Authors: Honglu Zhou, Xiangyu Peng, Shrikant Kendre, Michael S. Ryoo, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
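Abstract summary: Strefer is a synthetic data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata. The approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions.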
Reference score (attention metric computed by this site): 100.5266292850922
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Next-generation AI companions must go beyond general video understanding to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs), while capable of coarse-level comprehension, struggle with fine-grained, spatiotemporal reasoning, especially when user queries rely on time-based event references for temporal anchoring, or gestural cues for spatial anchoring to clarify object references and positions. To bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework designed to equip Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata, capturing rich spatial and temporal information in a structured manner, including subjects, objects, their locations as masklets, and their action descriptions and timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. Without using proprietary models, costly human annotation, or the need to annotate large volumes of new videos, experimental evaluations show that models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. Additionally, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.
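Abstract (reference translation): Next-generation AI companions must go beyond general video understanding in order to resolve spatial and temporal references in dynamic, real-world environments. Existing Video Large Language Models (Video LLMs) are capable of coarse-level comprehension but struggle with fine-grained spatiotemporal reasoning. To bridge this critical gap, we introduce Strefer, a synthetic instruction data generation framework for equipping Video LLMs with spatiotemporal referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata and captures rich spatial and temporal information in a structured manner, including subjects, objects, their locations as masklets, and their action descriptions and timelines. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering more versatile, space-time-aware reasoning essential for real-world AI companions. Without using proprietary models, costly human annotation, or annotation of large volumes of new videos, experimental evaluations show that models trained with data produced by Strefer outperform baselines on tasks requiring spatial and temporal disambiguation. In addition, these models exhibit enhanced space-time-aware reasoning, establishing a new foundation for perceptually grounded, instruction-tuned Video LLMs.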

Related papers