Fugu-MT Paper Translation (Summary): Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning

Paper summary: Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning

arxiv url: http://arxiv.org/abs/2510.19193v2
Date: Thu, 23 Oct 2025 07:07:25 GMT
Status: Translation complete
Last updated in system: 2025-10-25 03:08:14.941249
Title: Video Consistency Distance: Enhancing Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning
Title (reference translation): Video Consistency Distance: Improving Temporal Consistency for Image-to-Video Generation via Reward-Based Fine-Tuning
Authors: Takehiro Aoshima, Yusuke Shinohara, Byeongseon Park,
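Abstract summary: Reward-based fine-tuning of video diffusion models is an effective way to improve the quality of generated videos. This paper proposes Video Consistency Distance (VCD) to enhance temporal consistency.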
Reference score (attention score computed by the site): 5.847416016271551
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reward-based fine-tuning of video diffusion models is an effective approach to improve the quality of generated videos, as it can fine-tune models without requiring real-world video datasets. However, it can sometimes be limited to specific performances because conventional reward functions are mainly aimed at enhancing the quality across the whole generated video sequence, such as aesthetic appeal and overall consistency. Notably, the temporal consistency of the generated video often suffers when applying previous approaches to image-to-video (I2V) generation tasks. To address this limitation, we propose Video Consistency Distance (VCD), a novel metric designed to enhance temporal consistency, and fine-tune a model with the reward-based fine-tuning framework. To achieve coherent temporal consistency relative to a conditioning image, VCD is defined in the frequency space of video frame features to capture frame information effectively through frequency-domain analysis. Experimental results across multiple I2V datasets demonstrate that fine-tuning a video generation model with VCD significantly enhances temporal consistency without degrading other performance compared to the previous method.
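Abstract (reference translation): Reward-based fine-tuning of video diffusion models is an effective approach to improving the quality of generated videos, because it can fine-tune models without requiring real-world video datasets. However, because conventional reward functions mainly aim to improve the quality of the generated video sequence as a whole, such as aesthetic appeal and overall consistency, they can sometimes be limited to specific aspects of performance. In particular, the temporal consistency of generated videos often suffers when previous approaches are applied to image-to-video (I2V) generation tasks. To address this limitation, we propose Video Consistency Distance (VCD), a new metric designed to enhance temporal consistency, and fine-tune a model with it in a reward-based fine-tuning framework. To achieve coherent temporal consistency with respect to the conditioning image, VCD is defined in the frequency space of video frame features, so that frequency-domain analysis captures frame information effectively. Experimental results on multiple I2V datasets show that fine-tuning a video generation model with VCD significantly improves temporal consistency compared with the previous method, without degrading other aspects of performance.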
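The abstract does not give the exact formula for VCD, so the following is only a rough illustration of the general idea, not the paper's definition. It assumes that the conditioning image and each generated frame have already been encoded into equal-length 1-D feature vectors (for example, pooled embeddings from some image encoder), compares each frame to the conditioning image through the amplitudes of their discrete Fourier transforms, averages over frames, and negates the result to obtain a reward for fine-tuning. The names `amplitude_distance`, `consistency_distance`, and `consistency_reward`, as well as the amplitude-only comparison, are hypothetical choices made for this sketch.

```python
import cmath
import math
from typing import List, Sequence


def dft(x: Sequence[float]) -> List[complex]:
    """Naive O(n^2) discrete Fourier transform of a real-valued vector."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def amplitude_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared difference between the DFT amplitude spectra of a and b.

    Hypothetical choice: phase is discarded, so circularly shifted copies of
    the same feature vector have (numerically) zero distance.
    """
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("feature vectors must be non-empty and of equal length")
    fa, fb = dft(a), dft(b)
    return sum((abs(u) - abs(v)) ** 2 for u, v in zip(fa, fb)) / len(fa)


def consistency_distance(cond: Sequence[float],
                         frames: Sequence[Sequence[float]]) -> float:
    """Hypothetical VCD-style score: average frequency-domain distance
    between each frame's features and the conditioning image's features."""
    if len(frames) == 0:
        raise ValueError("need at least one frame")
    return sum(amplitude_distance(cond, f) for f in frames) / len(frames)


def consistency_reward(cond: Sequence[float],
                       frames: Sequence[Sequence[float]]) -> float:
    """Smaller distance means better consistency, so the reward is its negation."""
    return -consistency_distance(cond, frames)


if __name__ == "__main__":
    cond = [1.0, 0.5, 0.0, -0.5]
    stable = [cond, cond[1:] + cond[:1]]      # identical frame, then a circular shift
    drifting = [cond, [0.0, 0.0, 2.0, 0.0]]   # second frame's content changes
    print(consistency_reward(cond, stable))
    print(consistency_reward(cond, drifting))
```

Running the script prints a reward close to zero for the stable clip and a clearly negative reward for the drifting one. The amplitude-only comparison is deliberate: with full complex coefficients and a squared error, Parseval's theorem makes the frequency-domain distance just a scaled version of the plain feature-space distance, so some frequency-specific treatment (dropping phase here, or weighting frequency bands) is needed for the frequency domain to add anything. A real implementation would more likely operate on 2-D feature maps and use an FFT library.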

Related papers