Fugu-MT Paper Translation (Abstract): AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling

Paper overview: AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling

arxiv url: http://arxiv.org/abs/2605.16649v1
Date: Fri, 15 May 2026 21:39:46 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:46.893651
Title: AtlasVid: Efficient Ultra-High-Resolution Long Video Generation via Decoupled Global-Local Modeling
Authors: Ziyang Mai, Yuyao Zhang, Yu-Wing Tai
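Abstract summary: We propose AtlaVid, a decoupled global-local framework for ultra-high-resolution long video generation. AtlaVid generates a low-resolution, low-FPS global semantic proxy via temporally scaled RoPE, extending the temporal horizon without increasing the training token count. Experiments show that AtlaVid substantially improves the efficiency of ultra-high-resolution long video generation, achieving high-quality UHR long video generation with a 60.9x speedup, lower training cost, and better performance than native 4K video generators.
Reference score (independently computed attention score): 44.836661735003474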
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent diffusion-based video generators have achieved remarkable visual fidelity and prompt controllability, yet scaling them to ultra-high-resolution (UHR) long videos remains prohibitively expensive. The difficulty is especially pronounced for long single-shot generation, where a continuous scene must preserve both global temporal coherence and fine-grained spatial details without relying on clip transitions or autoregressive shot stitching. In this work, we revisit this challenge from the perspective of decoupled modeling. We argue that existing video diffusion models already encode strong local visual priors, while the main bottleneck lies in efficiently extending global spatiotemporal modeling as resolution and duration increase. Based on this insight, we propose AtlaVid, a decoupled global-local framework for efficient UHR long video generation. AtlaVid first generates a low-resolution, low-FPS global semantic proxy via temporally scaled RoPE, thereby extending the temporal horizon without increasing the training token count. Guided by this proxy, a high-resolution detail branch performs joint denoising with hierarchical locality-preserving attention. Reordered spatiotemporal windows preserve geometric locality, while asymmetric global-local attention injects aligned semantic guidance and preserves the model's pretrained ability. This design enables resolution-agnostic training: the model is trained only at 720P with lightweight LoRA adaptation, yet generalizes directly to 4K and beyond for longer (>10s) video synthesis. Experiments show that AtlaVid substantially improves the efficiency of UHR long video generation, achieving high-quality results with a 60.9x speedup, significantly lower training cost, and even better performance than native 4K video generators.
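The abstract does not say how the temporal scaling in "temporally scaled RoPE" is applied. One plausible reading is that the temporal position index passed to rotary position embedding (RoPE) is multiplied by a stride, so that a small number of low-FPS proxy frames spans the position range of a much longer full-rate video. The sketch below illustrates that reading only and is not the authors' implementation. The function names, the stride of 4, and the 16-frame proxy are all hypothetical.

```python
import math

def rope_angles(position, dim, base=10000.0):
    """Standard RoPE angles for one position: position / base**(2i/dim)."""
    return [position / base ** (2 * i / dim) for i in range(dim // 2)]

def apply_rope(vec, position, base=10000.0):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by its RoPE angle."""
    out = []
    for i, theta in enumerate(rope_angles(position, len(vec), base)):
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out

def proxy_frame_positions(num_proxy_frames, temporal_scale):
    """ASSUMPTION: proxy frame k is placed at temporal position k * temporal_scale."""
    return [k * temporal_scale for k in range(num_proxy_frames)]

# Hypothetical numbers: 16 proxy frames, each standing in for 4 full-rate frames.
positions = proxy_frame_positions(16, 4.0)

# Rotate a toy 4-dimensional query vector at the position of the second proxy frame.
q = apply_rope([1.0, 0.0, 1.0, 0.0], positions[1])
```

Under this assumption the token count stays at 16 frames while the positions cover the range 0 to 60, which is the sense in which the temporal horizon grows without adding training tokens. Because RoPE attention scores depend only on position differences, scaling the positions changes how far apart frames appear to the model but not the attention mechanism itself.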


Related papers