Fugu-MT Paper Translation (Abstract): CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

Paper abstract: CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition

arxiv url: http://arxiv.org/abs/2605.19995v1
Date: Tue, 19 May 2026 15:29:33 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:09.487539
Title: CogOmniControl: Reasoning-Driven Controllable Video Generation via Creative Intent Cognition
Title (reference translation): CogOmniControl: Reasoning-Driven Controllable Video Generation through Creative Intent Recognition
Authors: Hongji Yang, Songlian Li, Yucheng Zhou, Xiaotong Zhao, Alan Zhao, Chengzhong Xu, Jianbing Shen
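Abstract summary: We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM on anime production data. Compared with generic VLMs, it produces more professional and clearer outputs and accurately recognizes the user's creative intent from sparse and abstract conditions.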
Reference score (independently computed attention score): 64.38611644311136
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent diffusion models achieve strong photorealism and fluency in video generation, yet remain fragile under abstract, sparse or complex conditions, leading to poor performance in professional production workflows such as storyboard sketches and clay render conditions. Existing video generation models, either inject conditions through adapters or couple a generic vision-language model (VLM) within a diffusion backbone, leaving a capability gap and failing to produce the videos that align with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM using authentic anime production data. Compared to generic VLMs, it generates more professional and clear outputs, accurately cognizing user creative intent from sparse and abstract conditions and tuning these cues into dense reasoning output. Besides, CogOmniDiT unifies the controls from various conditions through in-context generation and is aligned to the CogVLM reasoning outputs via reinforcement learning. Furthermore, leveraging CogVLM's robust capability in guiding video generation, we release its potential in planning specific evaluators and enable a Best-of-N selection for the generated videos. This integration transforms the entire framework into a closed-loop "harness-like" architecture. We further introduce CogReasonBench and CogControlBench, built from professional workflows data that carry genuine creative intent rather than simulated ones. Experiments on two benchmarks show that CogOmniControl surpassed the existing open-source models. The project website: https://um-lab.github.io/CogOmniControl/
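Abstract (reference translation): Recent diffusion models achieve strong photorealism and fluency in video generation, but they are fragile under abstract, sparse, or complex conditions, and their performance degrades in professional production workflows such as storyboard sketches and clay render conditions. Existing video generation models either inject conditions through adapters or couple a generic vision-language model (VLM) inside a diffusion backbone, which leaves a capability gap and fails to produce videos aligned with the user's creative intent. We present CogOmniControl, a reasoning-driven framework that factorizes controllable video generation into creative intent cognition and generation. Specifically, we train a specialized CogVLM on anime production data. Compared with generic VLMs, it produces more professional and clearer outputs, accurately recognizes the user's creative intent from sparse and abstract conditions, and converts these cues into dense reasoning output. In addition, CogOmniDiT unifies control from various conditions through in-context generation and is aligned to CogVLM's reasoning outputs through reinforcement learning. Furthermore, building on CogVLM's robust ability to guide video generation, we unlock its potential to plan specific evaluators and enable Best-of-N selection over the generated videos. This integration turns the entire framework into a closed-loop "harness-like" architecture. We also introduce CogReasonBench and CogControlBench, built from professional workflow data that carries genuine creative intent rather than simulated intent. Experiments on the two benchmarks show that CogOmniControl surpasses existing open-source models. Project website: https://um-lab.github.io/CogOmniControl/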
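
The abstract mentions Best-of-N selection over generated videos using planned evaluators, but it gives no implementation detail. The following is only a minimal sketch of generic Best-of-N selection, assuming each candidate is scored by the mean of several evaluator functions. The name `best_of_n`, the use of strings as stand-ins for videos, and the toy evaluators are all hypothetical and do not come from the paper.

```python
from typing import Callable, Sequence, Tuple


def best_of_n(
    candidates: Sequence[str],
    evaluators: Sequence[Callable[[str], float]],
) -> Tuple[str, float]:
    """Return the candidate with the highest mean evaluator score.

    Hypothetical helper: strings stand in for generated videos, and each
    evaluator is any function mapping a candidate to a numeric score.
    Ties are resolved in favor of the earliest candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    if not evaluators:
        raise ValueError("need at least one evaluator")

    def mean_score(candidate: str) -> float:
        # Average the scores from all evaluators for this candidate.
        return sum(e(candidate) for e in evaluators) / len(evaluators)

    scored = [(mean_score(c), c) for c in candidates]
    # max() keeps the first maximal element, so earlier candidates win ties.
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


if __name__ == "__main__":
    # Toy evaluators (assumptions for illustration only).
    toy_evaluators = [len, lambda s: s.count("b")]
    print(best_of_n(["a", "abc", "ab"], toy_evaluators))
```

In a real pipeline the candidates would be N sampled videos and the evaluators would be model-based scorers; how the paper plans and combines its evaluators is not specified in the abstract.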

Related papers