Fugu-MT Paper Translation (Abstract): Taming Camera-Controlled Video Generation with Verifiable Geometry Reward

Paper abstract: Taming Camera-Controlled Video Generation with Verifiable Geometry Reward

arxiv url: http://arxiv.org/abs/2512.02870v1
Date: Tue, 02 Dec 2025 15:33:19 GMT
Status: Translation complete
System update date: 2025-12-03 21:04:45.948938
Title: Taming Camera-Controlled Video Generation with Verifiable Geometry Reward
Title (reference translation): Camera-Controlled Video Generation Using a Verifiable Geometry Reward
Authors: Zhaoqing Wang, Xiaobo Xia, Zhuolin Bie, Jinlin Liu, Dongdong Yu, Jia-Wang Bian, Changhu Wang
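Abstract summary: We introduce an online reinforcement learning framework that optimizes a pretrained video generator for precise camera control. We estimate the 3D camera trajectories of both the generated and the reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. We construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics.
Reference score (independently computed attention score): 36.31658788083449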
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in video diffusion models have remarkably improved camera-controlled video generation, but most methods rely solely on supervised fine-tuning (SFT), leaving online reinforcement learning (RL) post-training largely underexplored. In this work, we introduce an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, we design a verifiable geometry reward that delivers dense segment-level feedback to guide model optimization. Specifically, we estimate the 3D camera trajectories for both generated and reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each generated-reference segment pair and assigns an alignment score as the reward signal, which helps alleviate reward sparsity and improve optimization efficiency. Moreover, we construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Extensive experiments show that our online RL post-training clearly outperforms SFT baselines across multiple aspects, including camera-control accuracy, geometric consistency, and visual quality, demonstrating its superiority in advancing camera-controlled video generation.
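Abstract (reference translation): Recent progress in video diffusion models has markedly improved camera-controlled video generation, but most methods rely only on supervised fine-tuning (SFT), and online reinforcement learning (RL) post-training remains largely unexplored. This work proposes an online RL post-training framework that optimizes a pretrained video generator for precise camera control. To make RL effective in this setting, the authors design a verifiable geometry reward that provides dense, segment-level feedback to guide model optimization. Specifically, they estimate the 3D camera trajectories of both the generated and the reference videos, divide each trajectory into short segments, and compute segment-wise relative poses. The reward function then compares each pair of generated and reference segments and assigns an alignment score as the reward signal, which alleviates reward sparsity and improves optimization efficiency. In addition, they construct a comprehensive dataset featuring diverse large-amplitude camera motions and scenes with varied subject dynamics. Their online RL post-training clearly outperforms SFT baselines in camera-control accuracy, geometric consistency, and visual quality, demonstrating its advantage for camera-controlled video generation.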
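
The segment-level reward described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: poses are taken to be 4x4 camera-to-world matrices already estimated by some external tool and expressed at a shared scale, a segment's relative pose is taken between its first and last frame, and the alignment score is an exponential of the rotation and translation errors. The names `segment_rewards`, `seg_len`, `w_trans`, and `yaw_pose` are hypothetical and do not come from the paper.

```python
import numpy as np

def relative_pose(T_a, T_b):
    """Pose of frame b expressed in the coordinate frame of frame a."""
    return np.linalg.inv(T_a) @ T_b

def rotation_angle(R):
    """Geodesic angle (radians) of a 3x3 rotation matrix."""
    cos = (np.trace(R) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))

def segment_rewards(gen_poses, ref_poses, seg_len=4, w_trans=1.0):
    """Return one alignment score in (0, 1] per trajectory segment.

    gen_poses, ref_poses: arrays of shape (N, 4, 4), camera-to-world.
    The exp(-error) scoring is an assumed choice, not the paper's formula.
    """
    assert gen_poses.shape == ref_poses.shape
    rewards = []
    for start in range(0, len(gen_poses) - seg_len, seg_len):
        end = start + seg_len
        rel_gen = relative_pose(gen_poses[start], gen_poses[end])
        rel_ref = relative_pose(ref_poses[start], ref_poses[end])
        # Rotation error: angle of the rotation separating the two relative poses.
        rot_err = rotation_angle(rel_gen[:3, :3].T @ rel_ref[:3, :3])
        # Translation error: distance between the two relative translations.
        trans_err = np.linalg.norm(rel_gen[:3, 3] - rel_ref[:3, 3])
        rewards.append(float(np.exp(-(rot_err + w_trans * trans_err))))
    return np.array(rewards)

def yaw_pose(angle, x):
    """Toy pose: rotation about the y axis plus a shift along x."""
    c, s = np.cos(angle), np.sin(angle)
    T = np.eye(4)
    T[:3, :3] = [[c, 0, s], [0, 1, 0], [-s, 0, c]]
    T[0, 3] = x
    return T

# Toy trajectories: the "generated" camera pans faster than the reference.
ref = np.stack([yaw_pose(0.05 * i, 0.1 * i) for i in range(9)])
gen = np.stack([yaw_pose(0.08 * i, 0.1 * i) for i in range(9)])

print(segment_rewards(ref, ref))  # identical trajectories: each score is approximately 1
print(segment_rewards(gen, ref))  # mismatched pan speed: each score is below 1
```

Because each segment receives its own score, a video whose camera drifts only in its second half is still rewarded for the first half, which is the sense in which segment-level feedback is denser than a single score for the whole trajectory.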

Related papers