Fugu-MT Paper Translation (Abstract): OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Paper overview: OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

arxiv url: http://arxiv.org/abs/2606.14702v2
Date: Wed, 17 Jun 2026 03:32:13 GMT
Status: Translation complete
Last updated in system: 2026-06-18 13:57:35.113688
Title: OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains
Title (reference translation): OmniVideo-100K: A Dataset for Audio-Visual Reasoning Using Structured Scripts and Evidence Chains
Authors: Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan
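Abstract summary: Current automated pipelines for audio-visual question answering (QA) generally adopt a "video-caption-QA" paradigm. The authors propose an automated data engine featuring two mechanisms: Entity-Anchored Video Scripting and Clue-Guided QA Generation.
Reference score (independently computed attention score): 50.186434778589415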
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.
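Abstract (reference translation): Current automated pipelines for audio-visual question answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods usually split videos into short clips and describe the audio and visual modalities separately. This decoupled processing severs the inherent associations between sounds and their visual sources, while independent clip processing often produces inconsistent descriptions of the same entity across segments. Moreover, combining long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions that lack long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms. (1) Entity-Anchored Video Scripting converts videos into structured scripts comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior that ensures cross-segment referential consistency and reconstructs audio-visual associations. (2) Clue-Guided QA Generation first mines cross-segment, multimodal clues from the script and then generates QA pairs based on these high-value clues. Using this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B, and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test and shows strong generalization (improvements of up to 12.64%) on established benchmarks such as Daily-Omni and JointAVBench.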
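
The abstract describes a structured script (summary, main entity list, segment-wise audio-visual descriptions) followed by a clue-mining step. The following is a minimal illustrative sketch of what such a script could look like as a data structure. It is not the authors' implementation: every class name, field name, and the example video are hypothetical, and the substring-matching heuristic merely stands in for the model-prompted clue mining the abstract mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One time-bounded slice of the video (hypothetical schema)."""
    start_s: float
    end_s: float
    visual: str  # what is seen in this segment
    audio: str   # what is heard in this segment


@dataclass
class VideoScript:
    """Structured script: summary, main entity list, segment descriptions."""
    summary: str
    entities: list[str]
    segments: list[Segment] = field(default_factory=list)


def cross_segment_clues(script: VideoScript) -> dict[str, list[int]]:
    """Toy stand-in for clue mining.

    Maps each listed entity to the indices of the segments whose visual or
    audio description mentions it, keeping only entities found in two or
    more segments. Substring matching is an assumption for illustration;
    the abstract says a model is prompted to mine clues.
    """
    clues = {}
    for name in script.entities:
        hits = [
            i for i, seg in enumerate(script.segments)
            if name.lower() in (seg.visual + " " + seg.audio).lower()
        ]
        if len(hits) >= 2:
            clues[name] = hits
    return clues


# Invented example data, not taken from the dataset.
script = VideoScript(
    summary="A chef cooks while a dog waits nearby.",
    entities=["chef", "dog"],
    segments=[
        Segment(0.0, 10.0, visual="The chef chops onions.",
                audio="Knife tapping on a board."),
        Segment(10.0, 20.0, visual="A dog sits by the door.",
                audio="The chef hums a tune."),
        Segment(20.0, 30.0, visual="The chef plates the dish.",
                audio="A bell rings."),
    ],
)
print(cross_segment_clues(script))  # {'chef': [0, 1, 2]}
```

In this toy example the shared entity list is what links the chef heard humming in the second segment to the chef seen in the first and third, which loosely mirrors the role of the entity list as a global prior for cross-segment, cross-modal consistency.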


Related papers