Fugu-MT Paper Translation (Abstract): StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

Paper overview: StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

arxiv url: http://arxiv.org/abs/2606.06338v1
Date: Thu, 04 Jun 2026 16:12:43 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.939594
Title: StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset
Title (reference translation): StoryVideoQA: Scaling Up Deep Video Understanding with a Large-Scale, Multi-Genre, Auto-Generated Dataset
Authors: Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li, Hanlin Ge, Zhongyuan Wang, Chunxia Xiao, Chao Liang
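Abstract summary: Video question answering (VideoQA) aims to answer questions about videos. Existing approaches excel at factoid VideoQA but struggle with deep video understanding (DVU). This challenge arises from inherently long-range video content, multi-faceted question types, and instance-level story elements.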
Reference score (independently computed attention score): 37.79440712452179
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. To address these difficulties, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Though it can generate high-quality question-answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework to generate high-quality DVU datasets for both TV series and movies. By integrating a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is utilized to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours of diverse story videos including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent, re-organizing long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/
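Abstract (reference translation): Video question answering (VideoQA) aims to answer questions about videos. Existing approaches excel at factoid VideoQA but struggle with deep video understanding (DVU), which requires comprehending complex storylines. This challenge arises from inherently long-range video content, multi-faceted question types, and instance-level story elements, all of which limit the scale and diversity of manually constructed DVU datasets. To address these problems, we previously introduced StoryMind, which automatically constructs DVU datasets with balanced, fine-grained topics. Although it can generate high-quality question-answer pairs (QAs) for TV series, its performance degrades significantly on longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework for generating high-quality DVU datasets for both TV series and movies. By integrating a new supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is used to construct StoryVideoQA, the largest DVU dataset to date, with over 363K QAs on 393.2 hours of diverse story videos, including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). A comprehensive evaluation of 20 state-of-the-art VideoQA methods on this large-scale benchmark shows that they cannot fully maintain long-range character associations or build a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a new video understanding agent that reorganizes long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/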

Related papers