Fugu-MT Paper Translation (Abstract): Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation

Paper overview: Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation

arxiv url: http://arxiv.org/abs/2508.08949v1
Date: Tue, 12 Aug 2025 14:04:56 GMT
Status: Translation complete
Last updated in system: 2025-08-13 21:07:34.448963
Title: Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation
Title (reference translation): Lay2Story: Extending Diffusion Transformers for Layout-Togglable Story Generation
Authors: Ao Ma, Jiasong Feng, Ke Cao, Jing Wang, Yun Wang, Quanwei Zhang, Zhanjie Zhang,
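Abstract summary: The paper shows that layout conditions, such as the subject's position and detailed attributes, effectively facilitate fine-grained interactions between frames. Incorporating layout conditions enables precise subject control. The method outperforms previous state-of-the-art (SOTA) techniques, achieving the best results in consistency, semantic correlation, and aesthetic quality.
Reference score (independently computed attention measure): 7.280340351151054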
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Storytelling tasks involving generating consistent subjects have gained significant attention recently. However, existing methods, whether training-free or training-based, continue to face challenges in maintaining subject consistency due to the lack of fine-grained guidance and inter-frame interaction. Additionally, the scarcity of high-quality data in this field makes it difficult to precisely control storytelling tasks, including the subject's position, appearance, clothing, expression, and posture, thereby hindering further advancements. In this paper, we demonstrate that layout conditions, such as the subject's position and detailed attributes, effectively facilitate fine-grained interactions between frames. This not only strengthens the consistency of the generated frame sequence but also allows for precise control over the subject's position, appearance, and other key details. Building on this, we introduce an advanced storytelling task: Layout-Togglable Storytelling, which enables precise subject control by incorporating layout conditions. To address the lack of high-quality datasets with layout annotations for this task, we develop Lay2Story-1M, which contains over 1 million 720p and higher-resolution images, processed from approximately 11,300 hours of cartoon videos. Building on Lay2Story-1M, we create Lay2Story-Bench, a benchmark with 3,000 prompts designed to evaluate the performance of different methods on this task. Furthermore, we propose Lay2Story, a robust framework based on the Diffusion Transformers (DiTs) architecture for Layout-Togglable Storytelling tasks. Through both qualitative and quantitative experiments, we find that our method outperforms the previous state-of-the-art (SOTA) techniques, achieving the best results in terms of consistency, semantic correlation, and aesthetic quality.
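Abstract (reference translation): Storytelling tasks that involve generating consistent subjects have attracted attention in recent years. However, existing methods, whether training-free or training-based, continue to struggle to maintain subject consistency because they lack fine-grained guidance and inter-frame interaction. In addition, the scarcity of high-quality data in this field makes it difficult to precisely control storytelling tasks, including the subject's position, appearance, clothing, expression, and posture, which hinders further progress. In this paper, we show that layout conditions, such as the subject's position and detailed attributes, effectively facilitate fine-grained interactions between frames. This not only strengthens the consistency of the generated frame sequence but also enables precise control over the subject's position, appearance, and other key details. Building on this, we introduce an advanced storytelling task, Layout-Togglable Storytelling, which enables precise subject control by incorporating layout conditions. To address the lack of high-quality datasets with layout annotations, we develop Lay2Story-1M, which contains over 1 million images at 720p and higher resolution, processed from approximately 11,300 hours of cartoon videos. Building on Lay2Story-1M, we create Lay2Story-Bench, a benchmark with 3,000 prompts for evaluating the performance of different methods. Furthermore, we propose Lay2Story, a robust framework based on the Diffusion Transformers (DiT) architecture for Layout-Togglable Storytelling tasks. Qualitative and quantitative experiments show that our method outperforms previous SOTA techniques, achieving the best results in consistency, semantic correlation, and aesthetic quality.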


Related papers