Fugu-MT Paper Translation (Summary): ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images

Paper summary: ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images

arxiv url: http://arxiv.org/abs/2603.23326v1
Date: Tue, 24 Mar 2026 15:27:22 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.56183
Title: ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images
Title (reference translation): ViBe: Ultra-High-Resolution Video Synthesis Born from Pure Images
Authors: Yunfeng Wu, Hongying Cheng, Zihao He, Songhua Liu,
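Abstract summary: Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens. We propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos.
Reference score (independently computed attention metric): 30.646542711556787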
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training for ultra-high-resolution videos prohibitively expensive. To overcome this bottleneck, we propose a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale to synthesize higher-resolution videos. Unfortunately, naively fine-tuning with high-resolution images alone often introduces noticeable noise due to the image-video modality gap. To address this, we decouple the learning objective to separately handle modality alignment and spatial extrapolation. At the core of our approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images to bridge the modality gap. In the second stage, the model is further adapted with high-resolution images to acquire spatial extrapolation capability. During inference, only the high-resolution adaptation is retained to preserve the video generation modality while enabling high-resolution video synthesis. To enhance fine-grained detail synthesis, we further propose a High-Frequency-Awareness-Training-Objective, which explicitly encourages the model to recover high-frequency components from degraded latent representations via a dedicated reconstruction loss. Extensive experiments demonstrate that our method produces ultra-high-resolution videos with rich visual details without requiring any video training data, even outperforming previous state-of-the-art models trained on high-resolution videos by 0.8 on the VBench benchmark. Code will be available at https://github.com/WillWu111/ViBe.
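The quadratic cost mentioned in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the patch size, temporal stride, frame count, and resolutions are assumed values, not figures from the paper, and `attention_tokens` and `pairwise_cost` are hypothetical helper names. It shows that doubling the height and width quadruples the token count and multiplies the number of pairwise attention interactions by sixteen.

```python
def attention_tokens(frames: int, height: int, width: int,
                     patch: int = 16, temporal_stride: int = 4) -> int:
    """Token count for full 3D attention under an assumed tokenizer.

    `patch` and `temporal_stride` are hypothetical compression factors,
    chosen only to make the arithmetic easy to follow.
    """
    return (frames // temporal_stride) * (height // patch) * (width // patch)


def pairwise_cost(tokens: int) -> int:
    """Full self-attention scores every token against every token."""
    return tokens * tokens


# Assumed native and target resolutions (not taken from the paper).
base = attention_tokens(frames=80, height=720, width=1280)
high = attention_tokens(frames=80, height=1440, width=2560)

token_ratio = high / base
cost_ratio = pairwise_cost(high) / pairwise_cost(base)
print(f"tokens: {base} -> {high} (x{token_ratio})")
print(f"pairwise attention cost grows x{cost_ratio}")
```

Because the cost depends on the square of the token count, any increase in spatial resolution is paid for twice, which is the bottleneck that motivates adapting with images rather than training on high-resolution video.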
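Abstract (reference translation): Transformer-based video diffusion models rely on 3D attention over spatial and temporal tokens, which incurs quadratic time and memory complexity and makes end-to-end training on ultra-high-resolution video prohibitively expensive. To overcome this bottleneck, this work proposes a pure image adaptation framework that upgrades a video Diffusion Transformer pre-trained at its native scale so that it can synthesize higher-resolution video. Unfortunately, naive fine-tuning on high-resolution images alone often introduces noticeable noise because of the image-video modality gap. To address this, the learning objective is decoupled so that modality alignment and spatial extrapolation are handled separately. At the core of the approach is Relay LoRA, a two-stage adaptation strategy. In the first stage, the video diffusion model is adapted to the image domain using low-resolution images, bridging the modality gap. In the second stage, the model is further adapted with high-resolution images to acquire spatial extrapolation capability. During inference, only the high-resolution adaptation is retained, which preserves the video generation modality while enabling high-resolution video synthesis. To improve fine-grained detail, a High-Frequency-Awareness-Training-Objective is also proposed; it explicitly encourages the model to recover high-frequency components from degraded latent representations via a dedicated reconstruction loss. Extensive experiments show that the method produces ultra-high-resolution videos with rich visual detail without any video training data, outperforming previous state-of-the-art models trained on high-resolution videos by 0.8 on the VBench benchmark. Code will be available at https://github.com/WillWu111/ViBe.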
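The two-stage Relay LoRA idea can be pictured as two low-rank weight updates, of which only the second survives to inference. The following is a minimal sketch of that composition, not the authors' implementation: the matrices are toy values, the helper names (`matmul`, `add`, `effective_weight`) are hypothetical, and training the second adapter on top of a frozen first adapter is an assumed reading of "further adapted" in the abstract.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Adapter = Tuple[Matrix, Matrix]  # (down, up): low-rank factors, delta = up @ down


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of an n x k and a k x m matrix."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def effective_weight(base: Matrix, adapters: List[Adapter]) -> Matrix:
    """Base weight plus the low-rank delta of every active adapter."""
    weight = [row[:] for row in base]
    for down, up in adapters:
        weight = add(weight, matmul(up, down))
    return weight


# Toy pre-trained weight of one layer (assumed values).
base_weight: Matrix = [[1.0, 0.0], [0.0, 1.0]]

# Stage 1 adapter: fitted on low-resolution images to bridge the modality gap.
modality_adapter: Adapter = ([[1.0, 0.0]], [[0.5], [0.0]])
# Stage 2 adapter: fitted on high-resolution images for spatial extrapolation.
highres_adapter: Adapter = ([[0.0, 1.0]], [[0.0], [0.25]])

# Assumption: stage 2 trains with the stage 1 adapter active but frozen.
stage2_training_weight = effective_weight(
    base_weight, [modality_adapter, highres_adapter])

# Inference keeps only the high-resolution adapter, as the abstract states.
inference_weight = effective_weight(base_weight, [highres_adapter])

print(stage2_training_weight)
print(inference_weight)
```

Dropping the first adapter at inference is what lets the image-domain shift stay out of the deployed video model, while the resolution-specific update is kept.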

