Fugu-MT Paper Translation (Abstract): OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

Paper overview: OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

arxiv url: http://arxiv.org/abs/2604.25276v1
Date: Tue, 28 Apr 2026 06:34:19 GMT
Status: Translation complete
Last updated in system: 2026-04-29 16:49:17.738315
Title: OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
Authors: Minghang Zheng, Zihao Yin, Yi Yang, Yuxin Peng, Yang Liu
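Abstract summary: Video Temporal Grounding (VTG) struggles in open-world settings because existing datasets are limited in scale and semantic diversity. The paper introduces OmniVTG, a new large-scale dataset for open-world VTG. An MLLM is trained to first make a prediction, then use its understanding capabilities to reflect on and refine that prediction.
Reference score (attention score computed independently by this site): 55.29748680163419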
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Temporal Grounding (VTG), the task of localizing video segments from text queries, struggles in open-world settings due to limited dataset scale and semantic diversity, causing performance gaps between common and rare concepts. To overcome these limitations, we introduce OmniVTG, a new large-scale dataset for open-world VTG, coupled with a Self-Correction Chain-of-Thought (CoT) training paradigm designed to enhance the grounding capabilities of Multimodal Large Language Models (MLLMs). Our OmniVTG is constructed via a novel Semantic Coverage Iterative Expansion pipeline, which first identifies gaps in the vocabulary of existing datasets and collects videos that are highly likely to contain these target concepts. For high-quality annotation, we leverage the insight that modern MLLMs excel at dense captioning more than direct grounding and design a caption-centric data engine to prompt MLLMs to generate dense, timestamped descriptions. Beyond the dataset, we observe that simple supervised finetuning (SFT) is insufficient, as a performance gap between rare and common concepts still persists. We find that MLLMs' video understanding ability significantly surpasses their direct grounding ability. Based on this, we propose a Self-Correction Chain-of-Thought (CoT) training paradigm. We train the MLLM to first predict, then use its understanding capabilities to reflect on and refine its own predictions. This capability is instilled via a three-stage pipeline of SFT, CoT finetuning, and reinforcement learning. Extensive experiments show our approach not only excels at open-world grounding in our OmniVTG dataset but also achieves state-of-the-art zero-shot performance on four existing VTG benchmarks. Code is available at https://github.com/oceanflowlab/OmniVTG.
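The abstract says the Semantic Coverage Iterative Expansion pipeline "first identifies gaps in the vocabulary of existing datasets" but gives no procedure. The sketch below is only an illustration of that one idea, not the authors' method: it assumes single-word concepts, simple lowercase word matching, and a hypothetical frequency threshold `min_count`. The function name `find_vocabulary_gaps` and the sample data are invented for this example.

```python
import re
from collections import Counter


def find_vocabulary_gaps(existing_queries, target_concepts, min_count=1):
    """Return target concepts seen in fewer than `min_count` existing queries.

    Hypothetical helper: the matching rule and threshold are assumptions,
    not details taken from the paper.
    """
    query_counts = Counter()
    for query in existing_queries:
        # Count each word at most once per query (query-level frequency).
        query_counts.update(set(re.findall(r"[a-z]+", query.lower())))
    # Preserve the order of `target_concepts` in the result.
    return [c for c in target_concepts if query_counts[c.lower()] < min_count]


# Invented toy data standing in for queries from existing VTG datasets.
existing = [
    "a person opens the door",
    "the dog runs across the yard",
    "a person closes the door",
]
targets = ["door", "dog", "kayak", "welding"]

gaps = find_vocabulary_gaps(existing, targets)
print(gaps)
```

In this toy setup, concepts that never occur in the existing queries are reported as gaps. Raising `min_count` also flags concepts that occur only rarely, which is one plausible way to select targets for further video collection.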

Related papers