Fugu-MT Paper Translation (Summary): FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion

Paper summary: FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion

arxiv url: http://arxiv.org/abs/2603.17555v1
Date: Wed, 18 Mar 2026 10:02:37 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.630756
Title: FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion
Title (reference translation): FrescoDiffusion: 4K Image-to-Video with Prior-Regularized Tiled Diffusion
Authors: Hugo Caselles-Dupré, Mathis Koroglu, Guillaume Jeanneret, Arnaud Dapogny, Matthieu Cord
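Abstract summary: This paper introduces FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single image. For 4K generation, it computes per-tile noise predictions and fuses them with a precomputed global latent reference at every diffusion timestep. Experiments on the VBench-I2V dataset and the proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines.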
Reference score (independently computed attention score): 46.49480145234397
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Diffusion-based image-to-video (I2V) models are increasingly effective, yet they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating videos at the model's native resolution often loses fine-grained structure, whereas high-resolution tiled denoising preserves local detail but breaks global layout consistency. This failure mode is particularly severe in the fresco animation setting: monumental artworks containing many distinct characters, objects, and semantically different sub-scenes that must remain spatially coherent over time. We introduce FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The key idea is to augment tiled denoising with a precomputed latent prior: we first generate a low-resolution video at the underlying model resolution and upsample its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, we compute per-tile noise predictions and fuse them with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with our regularization term, yielding a closed-form fusion update that strengthens global coherence while retaining fine detail. We additionally provide a spatial regularization variable that enables region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and our proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while being computationally efficient. Our regularization enables explicit controllability of the trade-off between creativity and consistency.
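Abstract (reference translation): Diffusion-based image-to-video (I2V) models are becoming increasingly effective, but they struggle to scale to ultra-high-resolution inputs (e.g., 4K). Generating video at the model's native resolution often loses fine structure, while high-resolution tiled denoising preserves local detail but breaks the consistency of the global layout. This failure mode is especially severe in fresco animation, where monumental artworks contain many distinct characters, objects, and semantically different sub-scenes that must stay spatially coherent over time. This paper introduces FrescoDiffusion, a training-free method for coherent large-format I2V generation from a single complex image. The method augments tiled denoising with a precomputed latent prior: it first generates a low-resolution video at the underlying model's resolution, then upsamples its latent trajectory to obtain a global reference that captures long-range temporal and spatial structure. For 4K generation, per-tile noise predictions are fused with this reference at every diffusion timestep by minimizing a single weighted least-squares objective in model-output space. The objective combines a standard tile-merging criterion with the proposed regularization term, giving a closed-form fusion update that strengthens global coherence while keeping fine detail. A spatial regularization variable is also provided, enabling region-level control over where motion is allowed. Experiments on the VBench-I2V dataset and the proposed fresco I2V dataset show improved global consistency and fidelity over tiled baselines, while remaining computationally efficient. The regularization makes the trade-off between creativity and consistency explicitly controllable.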
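
Note: the abstract describes the fusion step only in words. The following is a minimal sketch of what a closed-form weighted least-squares fusion can look like, not the authors' implementation. It assumes a 1-D signal instead of video latents, and an assumed per-position objective of the form sum_i w_i(p) * (x(p) - tile_i(p))^2 + lam(p) * (x(p) - reference(p))^2, whose minimizer is a weighted average. The names fuse_tiles, tile_offsets, tile_weights, and lam are hypothetical and do not come from the paper.

```python
def fuse_tiles(tile_preds, tile_offsets, reference, lam, tile_weights=None):
    """Fuse overlapping 1-D tile predictions with a global reference.

    Assumed objective (illustrative, not taken from the paper), minimized
    independently at each position p:
        sum_i w_i(p) * (x(p) - tile_i(p))**2 + lam(p) * (x(p) - reference(p))**2
    Setting the derivative to zero gives the closed form
        x(p) = (sum_i w_i(p) * tile_i(p) + lam(p) * reference(p))
               / (sum_i w_i(p) + lam(p)).
    """
    n = len(reference)
    # Start the numerator and denominator with the regularization term.
    num = [lam[p] * reference[p] for p in range(n)]
    den = [lam[p] for p in range(n)]
    # Accumulate each tile's weighted prediction at its global positions.
    for i, (pred, offset) in enumerate(zip(tile_preds, tile_offsets)):
        weights = tile_weights[i] if tile_weights is not None else [1.0] * len(pred)
        for k, value in enumerate(pred):
            num[offset + k] += weights[k] * value
            den[offset + k] += weights[k]
    # Design choice of this sketch: where no tile covers a position and
    # lam is zero, fall back to the reference value.
    return [num[p] / den[p] if den[p] > 0 else reference[p] for p in range(n)]


# Two tiles of length 4 that overlap on positions 2 and 3.
tiles = [[1.0, 1.0, 1.0, 1.0], [3.0, 3.0, 3.0, 3.0]]
offsets = [0, 2]
reference = [0.0] * 6

# lam = 0 everywhere reduces to plain tile averaging.
plain = fuse_tiles(tiles, offsets, reference, lam=[0.0] * 6)

# A positive lam pulls every position toward the reference.
regularized = fuse_tiles(tiles, offsets, reference, lam=[2.0] * 6)
```

In this sketch, a lam that varies by position is one way to mimic the region-level control the abstract mentions: a large value pins a region to the reference, while zero leaves it to the tiles alone.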

Related papers