Fugu-MT Paper Translation (Abstract): DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer

Paper overview: DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer

arxiv url: http://arxiv.org/abs/2604.13509v1
Date: Wed, 15 Apr 2026 05:52:43 GMT
Status: Translation complete
Last updated in system: 2026-04-16 20:38:32.402137
Title: DiT as Real-Time Rerenderer: Streaming Video Stylization with Autoregressive Diffusion Transformer
Title (reference translation): DiT as a Real-Time Rerenderer: Streaming Video Stylization with an Autoregressive Diffusion Transformer
Authors: Hengye Lyu, Zisu Li, Yue Hong, Yueting Weng, Jiaxin Shi, Hanwang Zhang, Chen Liang
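Abstract summary: This paper proposes RTR-DiT (DiT as Real-Time Rerenderer), a streaming video stylization framework built on the Diffusion Transformer. A bidirectional teacher model is first fine-tuned on a video stylization dataset, supporting both text-guided and reference-guided video stylization tasks. It is then distilled into a few-step autoregressive model through post-training with Self Forcing and Distribution Matching Distillation. Experimental results show that RTR-DiT outperforms existing methods in both text-guided and reference-guided stylization.
Reference score (independently computed attention score): 53.36692512160234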
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in video generation models have significantly accelerated video generation and related downstream tasks. Among these, video stylization holds important research value in areas such as immersive applications and artistic creation, attracting widespread attention. However, existing diffusion-based video stylization methods struggle to maintain stability and consistency when processing long videos, and their high computational cost and multi-step denoising make them difficult to apply in practical scenarios. In this work, we propose RTR-DiT (DiT as Real-Time Rerenderer), a streaming video stylization framework built upon the Diffusion Transformer. We first fine-tune a bidirectional teacher model on a curated video stylization dataset, supporting both text-guided and reference-guided video stylization tasks, and subsequently distill it into a few-step autoregressive model via post-training with Self Forcing and Distribution Matching Distillation. Furthermore, we propose a reference-preserving KV cache update strategy that not only enables stable and consistent processing of long videos, but also supports real-time switching between text prompts and reference images. Experimental results show that RTR-DiT outperforms existing methods in both text-guided and reference-guided video stylization tasks, in terms of quantitative metrics and visual quality, and demonstrates excellent performance in real-time long video stylization and interactive style-switching applications.
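Abstract (reference translation): Recent progress in video generation models has markedly accelerated video generation and related downstream tasks. Among these tasks, video stylization has important research value in fields such as immersive applications and artistic creation, and it has attracted wide attention. However, existing diffusion-based video stylization methods have difficulty maintaining stability and consistency when processing long videos, and their computational cost and multi-step denoising make them hard to apply in practical scenarios. This work proposes RTR-DiT (DiT as Real-Time Rerenderer), a streaming video stylization framework built on the Diffusion Transformer. A bidirectional teacher model is first fine-tuned on a curated video stylization dataset, supporting both text-guided and reference-guided video stylization tasks, and is then distilled into a few-step autoregressive model through post-training with Self Forcing and Distribution Matching Distillation. In addition, the work proposes a reference-preserving KV cache update strategy that enables stable and consistent processing of long videos and also supports real-time switching between text prompts and reference images. Experimental results show that RTR-DiT outperforms existing methods on both text-guided and reference-guided video stylization tasks in terms of quantitative metrics and visual quality, and that it performs well in real-time long video stylization and interactive style-switching applications.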
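
Illustrative sketch (not from the paper): the abstract names a reference-preserving KV cache update strategy but gives no details. The Python below is only a guess at the general idea, assuming the cache keeps a pinned slot for the style reference and a bounded rolling window of recent frame chunks. The class name ReferencePreservingCache, its methods, the max_frame_chunks parameter, and the use of strings in place of key/value tensors are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List


@dataclass
class ReferencePreservingCache:
    """Hypothetical cache: pinned reference entries plus a rolling frame window."""

    max_frame_chunks: int  # assumed bound on how many recent chunks are kept
    reference: List[str] = field(default_factory=list)
    frames: Deque[str] = field(default_factory=deque)

    def set_reference(self, entries: List[str]) -> None:
        # Switching style replaces only the pinned slot; frame history is kept.
        self.reference = list(entries)

    def append_frame_chunk(self, entry: str) -> None:
        # Add the newest chunk, then evict the oldest chunks beyond the bound.
        self.frames.append(entry)
        while len(self.frames) > self.max_frame_chunks:
            self.frames.popleft()

    def context(self) -> List[str]:
        # Entries the next chunk would attend to: reference first, then frames.
        return self.reference + list(self.frames)


cache = ReferencePreservingCache(max_frame_chunks=2)
cache.set_reference(["ref_A"])
for i in range(4):
    cache.append_frame_chunk(f"chunk_{i}")
print(cache.context())

cache.set_reference(["ref_B"])  # style switch in the middle of the stream
print(cache.context())
```

In this sketch the reference entries are never evicted, so the style signal stays available however long the stream runs, and a style switch changes only the pinned slot. Whether RTR-DiT works this way is not stated in the abstract.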
