Fugu-MT Paper Translation (Abstract): EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

Paper overview: EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

arxiv url: http://arxiv.org/abs/2505.21876v1
Date: Wed, 28 May 2025 01:45:26 GMT
Status: Translation complete
Last updated in system: 2025-05-29 17:35:50.357789
Title: EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance
Title (reference translation): EPiC: Efficient Video Camera Control Learning through Precise Anchor-Video Guidance
Authors: Zun Wang, Jaemin Cho, Jialu Li, Han Lin, Jaehong Yoon, Yue Zhang, Mohit Bansal
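Abstract summary: This paper introduces EPiC, an efficient camera control learning framework. It constructs high-quality anchor videos without expensive camera trajectory annotations. EPiC achieves SOTA performance on RealEstate10K and MiraData for the I2V camera control task.
Reference score (independently computed attention level): 69.40274699401473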
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent approaches to 3D camera control in video diffusion models (VDMs) often create anchor videos to guide diffusion models as a structured prior by rendering from estimated point clouds following annotated camera trajectories. However, errors inherent in point cloud estimation often lead to inaccurate anchor videos. Moreover, the requirement for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that automatically constructs high-quality anchor videos without expensive camera trajectory annotations. Concretely, we create highly precise anchor videos for training by masking source videos based on first-frame visibility. This approach ensures high alignment, eliminates the need for camera trajectory annotations, and thus can be readily applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, we introduce Anchor-ControlNet, a lightweight conditioning module that integrates anchor video guidance in visible regions into pretrained VDMs, with less than 1% of backbone model parameters. By combining the proposed anchor video data and ControlNet module, EPiC achieves efficient training with substantially fewer parameters, fewer training steps, and less data, without requiring the modifications to the diffusion model backbone that are typically needed to mitigate rendering misalignments. Although trained on masking-based anchor videos, our method generalizes robustly to anchor videos made with point clouds during inference, enabling precise 3D-informed camera control. EPiC achieves SOTA performance on RealEstate10K and MiraData for the I2V camera control task, demonstrating precise and robust camera control ability both quantitatively and qualitatively. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video scenarios.
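Abstract (reference translation): Recent approaches to 3D camera control in video diffusion models (VDMs) often generate anchor videos that guide the diffusion model as a structured prior, by rendering from estimated point clouds along annotated camera trajectories. However, errors inherent in point cloud estimation often produce inaccurate anchor videos. In addition, the need for extensive camera trajectory annotations further increases resource demands. To address these limitations, we introduce EPiC, an efficient and accurate camera control learning framework that automatically builds high-quality anchor videos without expensive camera trajectory annotations. Specifically, we create high-precision anchor videos for training by masking source videos based on first-frame visibility. This approach guarantees high alignment, removes the need for camera trajectory annotations, and can be applied to any in-the-wild video to generate image-to-video (I2V) training pairs. Furthermore, Anchor-ControlNet is a lightweight conditioning module that integrates anchor video guidance in visible regions into pretrained VDMs, using less than 1% of the backbone model parameters. By combining the proposed anchor video data with the ControlNet module, EPiC achieves efficient training with fewer parameters, fewer training steps, and less data, without the changes to the diffusion model backbone that are usually needed to mitigate rendering misalignments. Although trained on masking-based anchor videos, the proposed method generalizes robustly at inference time to anchor videos built from point clouds, enabling precise 3D-informed camera control. EPiC achieves SOTA performance on RealEstate10K and MiraData for the I2V camera control task, showing precise and robust camera control ability both quantitatively and qualitatively. Notably, EPiC also shows strong zero-shot generalization to video-to-video scenarios.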
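
The masking step described in the abstract (building an anchor video by hiding the parts of each source frame that are not visible from the first frame) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how first-frame visibility is computed, so the per-frame visibility masks are assumed to be given, and `make_anchor_video`, `toy_visibility`, and the `fill` value are hypothetical names chosen for this example.

```python
from typing import List

Frame = List[List[int]]    # one grayscale frame, H rows x W pixel values
Mask = List[List[bool]]    # True where the pixel is visible from the first frame


def make_anchor_video(video: List[Frame], visible: List[Mask], fill: int = 0) -> List[Frame]:
    """Keep pixels marked visible and overwrite the rest with `fill`.

    Assumption: `visible` is already computed by some external method
    (the abstract does not specify which one).
    """
    if len(video) != len(visible):
        raise ValueError("video and visible must have the same number of frames")
    anchor = []
    for frame, mask in zip(video, visible):
        anchor.append([
            [px if keep else fill for px, keep in zip(row, mask_row)]
            for row, mask_row in zip(frame, mask)
        ])
    return anchor


def toy_visibility(num_frames: int, height: int, width: int, shift: int) -> List[Mask]:
    """Hypothetical stand-in for a visibility estimate: every later frame
    loses `shift` more columns on the right edge, as if the camera panned."""
    return [
        [[x < width - shift * t for x in range(width)] for _ in range(height)]
        for t in range(num_frames)
    ]


# Toy source video: 3 frames, 2 rows, 4 columns, all pixel values non-zero.
video = [[[10 * t + x + 1 for x in range(4)] for _ in range(2)] for t in range(3)]
visible = toy_visibility(num_frames=3, height=2, width=4, shift=1)
anchor = make_anchor_video(video, visible)
```

In this toy setup the first frame is left untouched, since everything in it is visible from itself, while later frames are progressively masked. Because the anchor frames are cut directly from the source video, the visible regions align with the target frames by construction, which is the property the abstract highlights.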


Related papers