Fugu-MT Paper Translation (Summary): EventDiff: A Unified and Efficient Diffusion Model Framework for Event-based Video Frame Interpolation

Paper summary: EventDiff: A Unified and Efficient Diffusion Model Framework for Event-based Video Frame Interpolation

arxiv url: http://arxiv.org/abs/2505.08235v1
Date: Tue, 13 May 2025 05:25:58 GMT
Status: Translation complete
System update date: 2025-05-14 20:57:54.430239
Title: EventDiff: A Unified and Efficient Diffusion Model Framework for Event-based Video Frame Interpolation
Title (reference translation): EventDiff: A Unified and Efficient Diffusion Model Framework for Event-based Video Frame Interpolation
Authors: Hanle Zheng, Xujie Han, Zegang Peng, Shangbin Zhang, Guangxun Du, Zhuo Zou, Xilin Wang, Jibin Wu, Hao Guo, Lei Deng
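Abstract summary: Video frame interpolation (VFI) is a fundamental task in computer vision. Recent advances in event cameras have opened up new opportunities for addressing its challenges. The paper proposes EventDiff, a unified and efficient event-based diffusion model framework for VFI.
Reference score (independently computed attention score): 7.969729040079355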
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Video Frame Interpolation (VFI) is a fundamental yet challenging task in computer vision, particularly under conditions involving large motion, occlusion, and lighting variation. Recent advancements in event cameras have opened up new opportunities for addressing these challenges. While existing event-based VFI methods have succeeded in recovering large and complex motions by leveraging handcrafted intermediate representations such as optical flow, these designs often compromise high-fidelity image reconstruction under subtle motion scenarios due to their reliance on explicit motion modeling. Meanwhile, diffusion models provide a promising alternative for VFI by reconstructing frames through a denoising process, eliminating the need for explicit motion estimation or warping operations. In this work, we propose EventDiff, a unified and efficient event-based diffusion model framework for VFI. EventDiff features a novel Event-Frame Hybrid AutoEncoder (HAE) equipped with a lightweight Spatial-Temporal Cross Attention (STCA) module that effectively fuses dynamic event streams with static frames. Unlike previous event-based VFI methods, EventDiff performs interpolation directly in the latent space via a denoising diffusion process, making it more robust across diverse and challenging VFI scenarios. Through a two-stage training strategy that first pretrains the HAE and then jointly optimizes it with the diffusion model, our method achieves state-of-the-art performance across multiple synthetic and real-world event VFI datasets. The proposed method outperforms existing state-of-the-art event-based VFI methods by up to 1.98dB in PSNR on Vimeo90K-Triplet and shows superior performance in SNU-FILM tasks with multiple difficulty levels. Compared to the emerging diffusion-based VFI approach, our method achieves up to 5.72dB PSNR gain on Vimeo90K-Triplet and 4.24X faster inference.
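Abstract (reference translation): Video frame interpolation (VFI) is a fundamental task in computer vision that is especially challenging under large motion, occlusion, and lighting variation. Recent advances in event cameras have opened up new opportunities for addressing these challenges. Existing event-based VFI methods have succeeded in recovering large and complex motions by leveraging handcrafted intermediate representations such as optical flow, but because they rely on explicit motion modeling, these designs often compromise high-fidelity image reconstruction in subtle-motion scenarios. Diffusion models, in contrast, offer a promising alternative for VFI: they reconstruct frames through a denoising process, which removes the need for explicit motion estimation or warping operations. This paper proposes EventDiff, a unified and efficient event-based diffusion model framework for VFI. EventDiff features a novel Event-Frame Hybrid AutoEncoder (HAE) equipped with a lightweight Spatial-Temporal Cross Attention (STCA) module that effectively fuses dynamic event streams with static frames. Unlike previous event-based VFI methods, EventDiff performs interpolation directly in the latent space via a denoising diffusion process, making it more robust across diverse and challenging VFI scenarios. With a two-stage training strategy that first pretrains the HAE and then optimizes it jointly with the diffusion model, the method achieves state-of-the-art performance across multiple synthetic and real-world event VFI datasets. It outperforms existing event-based VFI methods by up to 1.98 dB in PSNR on Vimeo90K-Triplet and shows superior performance on SNU-FILM tasks at multiple difficulty levels. Compared with a diffusion-based VFI approach, it achieves up to 5.72 dB higher PSNR on Vimeo90K-Triplet and 4.24× faster inference.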
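
Note on the PSNR figures: the abstract reports gains in decibels of PSNR. The sketch below is not taken from the paper; it uses the standard definition PSNR = 10 log10(MAX^2 / MSE) to show what a dB gain means in terms of mean squared error. The function names `psnr` and `mse_ratio_for_gain`, and the 8-bit peak value of 255, are assumptions made for illustration, and the paper's evaluation protocol may differ.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel values.

    Assumes an 8-bit peak value (255) by default; this is an illustrative
    choice, not something stated in the abstract.
    """
    if len(reference) != len(estimate) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which MSE is multiplied when PSNR rises by gain_db.

    Follows directly from the definition: a gain of g dB at a fixed peak
    value corresponds to MSE_new / MSE_old = 10 ** (-g / 10).
    """
    return 10.0 ** (-gain_db / 10.0)


if __name__ == "__main__":
    # The two gains quoted in the abstract, expressed as MSE ratios.
    for gain in (1.98, 5.72):
        ratio = mse_ratio_for_gain(gain)
        print(f"{gain} dB gain -> MSE multiplied by {ratio:.3f}")
```

Under this definition, and assuming the same peak value on both sides of the comparison, a 1.98 dB gain corresponds to roughly a one-third reduction in mean squared error, and a 5.72 dB gain to roughly a three-quarters reduction.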

Related papers