Fugu-MT Paper Translation (Abstract): ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors

Paper abstract: ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors

arxiv url: http://arxiv.org/abs/2603.24270v2
Date: Thu, 26 Mar 2026 12:11:26 GMT
Status: Translation complete
Last updated in system: 2026-03-27 13:32:29.987782
Title: ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors
Title (reference translation): ScrollScape, which unlocks 32K image generation with video diffusion priors
Authors: Haodong Yu, Yabo Zhang, Donglin Di, Ruyi Zhang, Wangmeng Zuo
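Abstract summary: ScrollScape is a novel framework that reformulates extreme-aspect-ratio (EAR) image synthesis into a continuous video generation process. The authors also report that ScrollScape significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts.
Reference score (independently calculated attention score): 48.033666517340464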
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity. Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.
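Abstract (reference translation): Diffusion models are good at generating images with conventional dimensions, but when they are pushed to synthesize ultra-high-resolution images at extreme aspect ratios (EAR), they often suffer catastrophic structural failures such as object repetition and spatial fragmentation. This limitation stems fundamentally from the lack of robust spatial priors, because static text-to-image models are trained mainly on image distributions with conventional dimensions. To overcome this bottleneck, the authors propose ScrollScape, a novel framework that reformulates EAR image synthesis as a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas onto the temporal evolution of video frames, ScrollScape uses the inherent temporal consistency of video models as a strong global constraint that ensures long-range structural integrity. Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames so that they act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) uses video super-resolution priors to circumvent memory bottlenecks and efficiently scale outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations show that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. As a result, the method overcomes inherent structural bottlenecks and ensures exceptional global coherence and visual fidelity across diverse domains at extreme scales.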
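The abstract describes ScanPE only at a high level (global coordinates distributed across frames, acting like a moving camera) and gives no formulation. The sketch below is therefore not the authors' method. It only illustrates the general idea of mapping a wide canvas onto a sequence of frames, under assumptions that do not come from the paper: a purely horizontal scan, a fixed-width window, and evenly spaced window offsets. All function names and parameters are hypothetical.

```python
def frame_offsets(canvas_width, window_width, num_frames):
    """Left edge of each frame's window on the canvas (hypothetical helper).

    Assumption: offsets are evenly spaced so that frame 0 starts at column 0
    and the last frame ends exactly at the right edge of the canvas.
    """
    if window_width > canvas_width:
        raise ValueError("window must not be wider than the canvas")
    if num_frames < 1:
        raise ValueError("need at least one frame")
    if num_frames == 1:
        return [0]
    span = canvas_width - window_width
    return [round(i * span / (num_frames - 1)) for i in range(num_frames)]


def scan_coordinates(canvas_width, window_width, num_frames):
    """Global column indices seen by each frame, one list per frame.

    Each frame keeps canvas-level (global) coordinates instead of restarting
    at 0, so neighbouring frames agree on where they sit on the canvas.
    """
    offsets = frame_offsets(canvas_width, window_width, num_frames)
    return [list(range(o, o + window_width)) for o in offsets]


if __name__ == "__main__":
    # Toy sizes for illustration only: a 32-column canvas scanned by 4 frames.
    for frame_index, columns in enumerate(scan_coordinates(32, 8, 4)):
        print(frame_index, columns[0], columns[-1])
```

In this toy setting, consecutive frames tile the canvas left to right, which is one simple way to picture "spatial expansion mapped to temporal evolution"; a real positional encoding would feed such coordinates into the model rather than print them.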


Related papers