Fugu-MT Paper Translation (Abstract): ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer

Paper summary: ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer

arxiv url: http://arxiv.org/abs/2603.15478v1
Date: Mon, 16 Mar 2026 16:10:46 GMT
Status: Translation complete
System update date: 2026-03-17 18:28:58.564147
Title: ViFeEdit: A Video-Free Tuner of Your Video Diffusion Transformer
Title (reference translation): ViFeEdit: A Video-Free Tuner for Video Diffusion Transformers
Authors: Ruonan Yu, Zhenxiong Tan, Zigeng Chen, Songhua Liu, Xinchao Wang
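Abstract summary: We propose ViFeEdit, a video-free tuning framework for video diffusion transformers. ViFeEdit achieves versatile video generation and editing while being adapted solely with 2D images. The method delivers promising results in controllable video generation and editing with only minimal training on 2D image data.
Reference score (independently computed attention score): 74.61793196579036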
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion Transformers (DiTs) have demonstrated remarkable scalability and quality in image and video generation, prompting growing interest in extending them to controllable generation and editing tasks. However, compared to the image counterparts, progress in video control and editing remains limited, mainly due to the scarcity of paired video data and the high computational cost of training video diffusion models. To address this issue, in this paper, we propose a video-free tuning framework termed ViFeEdit for video diffusion transformers. Without requiring any form of video training data, ViFeEdit achieves versatile video generation and editing, adapted solely with 2D images. At the core of our approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, which enables visually faithful editing while maintaining temporal consistency with only minimal additional parameters. Moreover, this design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling, exhibiting strong adaptability to diverse conditioning signals. Extensive experiments demonstrate that our method delivers promising results in controllable video generation and editing with only minimal training on 2D image data. Code is available at https://github.com/Lexie-YU/ViFeEdit.
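Abstract (reference translation): Diffusion Transformers (DiTs) have shown strong scalability and quality in image and video generation, and interest in extending them to controllable generation and editing tasks is growing. However, progress in video control and editing remains limited compared with images, because paired video data is scarce and training video diffusion models is computationally expensive. This paper therefore proposes ViFeEdit, a video-free tuning framework for video diffusion transformers. ViFeEdit requires no video training data at all and achieves versatile video generation and editing while being adapted solely with 2D images. The core of the approach is an architectural reparameterization that decouples spatial independence from the full 3D attention in modern video diffusion transformers, enabling visually faithful editing while maintaining temporal consistency with only minimal additional parameters. In addition, the design operates in a dual-path pipeline with separate timestep embeddings for noise scheduling and adapts well to diverse conditioning signals. Extensive experiments show promising results in controllable video generation and editing with only minimal training on 2D image data. Code is available at https://github.com/Lexie-YU/ViFeEdit.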
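The abstract contrasts per-frame spatial processing with the full 3D attention used in video diffusion transformers. The following is a minimal pure-Python sketch of that relationship only. It is not the ViFeEdit reparameterization, which the abstract does not specify. All function names, the toy token generator, and the tensor sizes are hypothetical and chosen for illustration. It shows two facts: full 3D attention restricted by a same-frame mask equals per-frame spatial attention, and for a single image (one frame) the two coincide without any mask.

```python
import math

def softmax(row):
    # Numerically stable softmax; -inf scores become weight 0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, mask=None):
    # Scaled dot-product attention over lists of token vectors.
    # mask[i][j] == False blocks token i from attending to token j.
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            s = sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
            if mask is not None and not mask[i][j]:
                s = float("-inf")
            scores.append(s)
        w = softmax(scores)
        out.append([sum(w[j] * v[j][c] for j in range(len(v)))
                    for c in range(len(v[0]))])
    return out

def make_tokens(frames, tokens_per_frame, dim):
    # Deterministic toy tokens, flattened frame by frame (hypothetical data).
    return [[math.sin(1 + 7 * t + 3 * p + c) for c in range(dim)]
            for t in range(frames) for p in range(tokens_per_frame)]

def full_3d_attention(x, mask=None):
    # Every token may attend to every token in every frame.
    return attention(x, x, x, mask)

def spatial_attention(x, frames, tokens_per_frame):
    # Each frame attends only within itself.
    out = []
    for t in range(frames):
        f = x[t * tokens_per_frame:(t + 1) * tokens_per_frame]
        out.extend(attention(f, f, f))
    return out

def same_frame_mask(frames, tokens_per_frame):
    # True where query token i and key token j belong to the same frame.
    n = frames * tokens_per_frame
    return [[i // tokens_per_frame == j // tokens_per_frame
             for j in range(n)] for i in range(n)]

def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Under these assumptions, a 2D image behaves like a one-frame video for the attention layer, which gives some intuition for why image-only data can exercise the spatial part of a 3D attention block while leaving cross-frame interactions untouched.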

Related papers