Fugu-MT paper translation (abstract): ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors

Paper summary: ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors

arxiv url: http://arxiv.org/abs/2603.24270v1
Date: Wed, 25 Mar 2026 13:03:02 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.299262
Title: ScrollScape: Unlocking 32K Image Generation With Video Diffusion Priors
Title (reference translation): ScrollScape, which unlocks 32K image generation with video diffusion priors
Authors: Haodong Yu, Yabo Zhang, Donglin Di, Ruyi Zhang, Wangmeng Zuo
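Abstract summary: ScrollScape is a novel framework that reformulates extreme-aspect-ratio (EAR) image synthesis into a continuous video generation process. The method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.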
Reference score (independently computed attention level): 48.033666517340464
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While diffusion models excel at generating images with conventional dimensions, pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures, such as object repetition and spatial fragmentation. This limitation fundamentally stems from a lack of robust spatial priors, as static text-to-image models are primarily trained on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. By mapping the spatial expansion of a massive canvas to the temporal evolution of video frames, ScrollScape leverages the inherent temporal consistency of video models as a powerful global constraint to ensure long-range structural integrity. Specifically, Scanning Positional Encoding (ScanPE) distributes global coordinates across frames to act as a flexible moving camera, while Scrolling Super-Resolution (ScrollSR) leverages video super-resolution priors to circumvent memory bottlenecks, efficiently scaling outputs to an unprecedented 32K resolution. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations demonstrate that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, our method overcomes inherent structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.
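Abstract (reference translation): Diffusion models excel at generating images with conventional dimensions, but pushing them to synthesize ultra-high-resolution imagery at extreme aspect ratios (EAR) often triggers catastrophic structural failures. This limitation fundamentally stems from a lack of robust spatial priors, because static text-to-image models are trained primarily on image distributions with conventional dimensions. To overcome this bottleneck, we present ScrollScape, a novel framework that reformulates EAR image synthesis into a continuous video generation process through two core innovations. Fine-tuned on a curated 3K multi-ratio image dataset, ScrollScape effectively aligns pre-trained video priors with the EAR generation task. Extensive evaluations show that it significantly outperforms existing image-diffusion baselines by eliminating severe localized artifacts. Consequently, the method overcomes structural bottlenecks to ensure exceptional global coherence and visual fidelity across diverse domains at extreme scales.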
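
The abstract describes mapping the spatial extent of a wide canvas onto the temporal axis of a video, with each frame carrying global coordinates like a moving camera. The following is a minimal sketch of that general idea only, not the paper's ScanPE implementation, whose details are not given here. The function names `scan_windows` and `covers_canvas`, the evenly spaced horizontal scan, and all parameter names are assumptions made for illustration.

```python
def scan_windows(canvas_width, frame_width, num_frames):
    """Assign each frame a window of global column indices on a wide canvas.

    Hypothetical illustration: the window slides left to right in evenly
    spaced steps, so frame 0 starts at column 0 and the last frame ends at
    the right edge of the canvas.
    """
    if num_frames < 1:
        raise ValueError("num_frames must be at least 1")
    if frame_width < 1 or frame_width > canvas_width:
        raise ValueError("frame_width must be in [1, canvas_width]")

    span = canvas_width - frame_width  # total distance the window travels
    if num_frames == 1:
        offsets = [0]
    else:
        offsets = [round(i * span / (num_frames - 1)) for i in range(num_frames)]

    # Each frame keeps its *global* column coordinates, not local 0..frame_width-1.
    return [list(range(o, o + frame_width)) for o in offsets]


def covers_canvas(windows, canvas_width):
    """Check that every canvas column appears in at least one frame window."""
    seen = set()
    for cols in windows:
        seen.update(cols)
    return seen == set(range(canvas_width))


if __name__ == "__main__":
    # Assumed toy sizes: a 10-column canvas scanned by 3 frames of width 4.
    frames = scan_windows(canvas_width=10, frame_width=4, num_frames=3)
    for t, cols in enumerate(frames):
        print(f"frame {t}: columns {cols[0]}..{cols[-1]}")
    print("full coverage:", covers_canvas(frames, 10))
```

Because neighbouring windows overlap whenever the step between offsets is smaller than `frame_width`, adjacent frames share columns, which is the property a temporally consistent video model could exploit to keep distant parts of the canvas coherent.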