Fugu-MT paper translation (abstract): SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

Paper summary: SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation

arxiv url: http://arxiv.org/abs/2605.06356v1
Date: Thu, 07 May 2026 14:34:23 GMT
Status: Translation complete
System update date: 2026-05-08 22:27:11.910063
Title: SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
Title (reference translation): SwiftI2V: Efficient High-Resolution Image-to-Video Generation via Conditional Segment-wise Generation
Authors: YaoYang Liu, Yuechen Zhang, Wenbo Li, Yufei Zhao, Rui Liu, Long Chen
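Abstract summary: High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving the fine-grained appearance of the input image. Existing solutions have weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with generic video super-resolution tends to hallucinate details and drift from input-specific local structures. We propose SwiftI2V, an efficient framework tailored for high-resolution I2V.
Reference score (independently computed attention score): 17.915677925345722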
License: http://creativecommons.org/licenses/by/4.0/
Abstract: High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving fine-grained appearance details of the input image. At 2K resolution, it becomes extremely challenging, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with generic video super-resolution tends to hallucinate details and drift from input-specific local structures, since the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework tailored for high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency-fidelity dilemma by first generating a low-resolution motion reference to reduce token costs and ease the modeling burden, then performing a strongly image-conditioned 2K synthesis guided by the motion to recover input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG) to synthesize videos segment-by-segment with a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU-time by 202x. Particularly, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).
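Abstract (reference translation): High-resolution image-to-video (I2V) generation aims to synthesize realistic temporal dynamics while preserving the fine-grained appearance of the input image. At 2K resolution this becomes very difficult, and existing solutions suffer from various weaknesses: 1) end-to-end models are often prohibitively expensive in memory and latency; 2) cascading low-resolution generation with generic video super-resolution tends to hallucinate details and drift from input-specific local structures, because the super-resolution stage is not explicitly conditioned on the input image. To this end, we propose SwiftI2V, an efficient framework suited to high-resolution I2V. Following the widely used two-stage design, it addresses the efficiency-fidelity dilemma by first generating a low-resolution motion reference to reduce token cost and ease the modeling burden, and then performing strongly image-conditioned 2K synthesis guided by that motion, recovering input-faithful details with controlled overhead. Specifically, to make generation more scalable, SwiftI2V introduces Conditional Segment-wise Generation (CSG), which synthesizes videos segment by segment under a bounded per-step token budget, and adopts bidirectional contextual interaction within each segment to improve cross-segment coherence and input fidelity. On VBench-I2V at 2K resolution, SwiftI2V achieves performance comparable to end-to-end baselines while reducing total GPU time by 202x. In particular, it enables practical 2K I2V generation on a single datacenter GPU (e.g., H800) or consumer GPU (e.g., RTX 4090).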
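
Illustrative note (not from the paper): the abstract says that segment-wise generation keeps a "bounded per-step token budget". The sketch below only shows the arithmetic behind that idea, under assumed numbers. The latent grid size, the segment length, and the `context_frames` parameter (standing in for conditioning tokens such as the input image) are all hypothetical, and none of the names here come from SwiftI2V itself.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class LatentGrid:
    """Hypothetical latent video shape, measured in tokens per axis."""
    frames: int
    height: int
    width: int

    @property
    def tokens(self) -> int:
        # Token count if the whole video is processed in a single pass.
        return self.frames * self.height * self.width


def split_into_segments(num_frames: int, segment_len: int) -> list[range]:
    """Partition frame indices into consecutive segments of at most segment_len frames."""
    if num_frames <= 0 or segment_len <= 0:
        raise ValueError("num_frames and segment_len must be positive")
    return [
        range(start, min(start + segment_len, num_frames))
        for start in range(0, num_frames, segment_len)
    ]


def per_step_tokens(grid: LatentGrid, segment_len: int, context_frames: int = 1) -> int:
    """Largest token count seen in any one segment pass.

    context_frames is an assumption: extra frames' worth of conditioning
    tokens attached to every segment.
    """
    tokens_per_frame = grid.height * grid.width
    longest = max(len(seg) for seg in split_into_segments(grid.frames, segment_len))
    return (longest + context_frames) * tokens_per_frame


# Assumed example values, chosen only for illustration.
grid = LatentGrid(frames=21, height=90, width=160)
segments = split_into_segments(grid.frames, segment_len=4)
budget = per_step_tokens(grid, segment_len=4, context_frames=1)
print(len(segments), grid.tokens, budget)
```

With these assumed values, the per-pass token count depends on the segment length rather than on the total number of frames, which is the sense in which a segment-wise scheme can bound the budget as videos get longer.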

Related papers