Fugu-MT paper translation (abstract): Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

Paper overview: Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

arxiv url: http://arxiv.org/abs/2604.11244v2
Date: Wed, 15 Apr 2026 07:55:01 GMT
Status: Translation complete
System update date: 2026-04-16 13:09:57.436659
Title: Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
Title (reference translation): Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
Authors: Tencent Hunyuan Team
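Abstract summary: Multi-Stream Scene Script (MTSS) is a new paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. Extensive experiments show that MTSS consistently enhances video understanding across various models. Even without architectural adaptation, replacing monolithic prompts with MTSS in multi-shot video generation yields substantial improvements.
Reference score (independently computed attention score): 0.0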
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit identity and temporal links to maintain holistic video consistency. Extensive experiments demonstrate that MTSS consistently enhances video understanding across various models, achieving an average reduction of 25% in the total error rate on Video-SALMONN-2 and an average performance gain of 67% on the Daily-Omni reasoning benchmark. It also narrows the performance gap between smaller and larger MLLMs, indicating a substantially more learnable caption interface. Finally, even without architectural adaptation, replacing monolithic prompts with MTSS in multi-shot video generation yields substantial human-rated improvements: a 45% boost in cross-shot identity consistency, a 56% boost in audio-visual alignment, and a 71% boost in temporal controllability.
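Abstract (reference translation): Advances in Multimodal Large Language Models (MLLMs) are turning video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still treats a video as a monolithic narrative paragraph in which visual, auditory, and identity information are entangled. This dense coupling not only undermines representational fidelity but also limits scalability, because even a local edit can force a global rewrite. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a new paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which separates a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these separated streams through explicit identity and temporal links to maintain overall video consistency. Extensive experiments show that MTSS consistently improves video understanding across various models, with an average 25% reduction in total error rate on Video-SALMONN-2 and an average 67% performance gain on the Daily-Omni reasoning benchmark. It also narrows the performance gap between smaller and larger MLLMs, indicating a substantially more learnable caption interface. Finally, even without architectural adaptation, replacing monolithic prompts with MTSS in multi-shot video generation yields substantial human-rated improvements: 45% in cross-shot identity consistency, 56% in audio-visual alignment, and 71% in temporal controllability.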
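
Illustrative sketch (not from the paper): the abstract names four streams (Reference, Shot, Event, Global) joined by explicit identity and temporal links, but it does not give a schema. The Python below is one hypothetical way such a factorized script could be represented. Every class name, field name, and the content assigned to each stream are assumptions made for illustration only. It shows the two ideas the abstract describes: a local edit touches one stream entry without rewriting the others, and the links between streams can be checked for consistency.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# NOTE: hypothetical schema. The abstract only names the four streams and
# the two link types (identity, temporal); all fields here are assumptions.

@dataclass
class Reference:
    ref_id: str          # identity handle shared across streams
    description: str     # appearance of the person or object

@dataclass
class Shot:
    shot_id: str
    start: float         # seconds
    end: float           # seconds
    visual: str          # what the shot shows
    ref_ids: List[str] = field(default_factory=list)  # identity links

@dataclass
class Event:
    start: float         # seconds
    end: float           # seconds
    description: str     # timed action or sound
    shot_id: str         # temporal link to the containing shot
    ref_ids: List[str] = field(default_factory=list)  # identity links

@dataclass
class SceneScript:
    global_summary: str
    references: Dict[str, Reference]
    shots: List[Shot]
    events: List[Event]

    def dangling_links(self) -> List[str]:
        """Return identity or temporal links that do not resolve."""
        problems = []
        shots_by_id = {s.shot_id: s for s in self.shots}
        for shot in self.shots:
            for rid in shot.ref_ids:
                if rid not in self.references:
                    problems.append(f"shot {shot.shot_id}: unknown reference {rid}")
        for i, ev in enumerate(self.events):
            shot = shots_by_id.get(ev.shot_id)
            if shot is None:
                problems.append(f"event {i}: unknown shot {ev.shot_id}")
            elif not (shot.start <= ev.start and ev.end <= shot.end):
                problems.append(f"event {i}: outside shot {ev.shot_id}")
            for rid in ev.ref_ids:
                if rid not in self.references:
                    problems.append(f"event {i}: unknown reference {rid}")
        return problems

# Made-up example content.
script = SceneScript(
    global_summary="Two friends talk in a cafe.",
    references={
        "R1": Reference("R1", "woman in a red coat"),
        "R2": Reference("R2", "man with glasses"),
    },
    shots=[
        Shot("S1", 0.0, 4.0, "Wide shot of the cafe.", ["R1", "R2"]),
        Shot("S2", 4.0, 7.5, "Close-up of R1.", ["R1"]),
    ],
    events=[
        Event(1.0, 3.0, "R2 greets R1.", "S1", ["R1", "R2"]),
        Event(4.5, 7.0, "R1 laughs.", "S2", ["R1"]),
    ],
)

# A local edit: only the Reference entry changes. Shots and events keep
# pointing at "R1", so nothing else has to be rewritten.
script.references["R1"].description = "woman in a blue coat"
print(script.dangling_links())  # prints []
```

In a monolithic paragraph, the same change would require finding and rewriting every mention of the red coat. Here the description is stored once and referenced by identifier, which is the kind of decoupling the abstract attributes to Stream Factorization and Relational Grounding.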

Related papers