Fugu-MT Paper Translation (Abstract): Probing into Camera Control of Video Models

Paper summary: Probing into Camera Control of Video Models

arxiv url: http://arxiv.org/abs/2605.14815v1
Date: Thu, 14 May 2026 13:27:33 GMT
Status: Translation complete
System update date: 2026-05-15 21:45:34.841126
Title: Probing into Camera Control of Video Models
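Title (reference translation): Exploring Camera Control of Video Models
Authors: Chen Hou, Christian Rupprecht
Abstract summary: Camera control need not be modeled as an implicit mapping problem; it can instead be treated as a form of geometric guidance. We reformulate camera control into a set of displacement fields and apply them via differentiable resampling of latent features during denoising. Our simple approach achieves effective camera control with minimal degradation across diverse quality metrics compared to fine-tuned baselines.
Reference score (independently computed attention score): 42.06310116603546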
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video is a rich and scalable source of 3D/4D visual observations, and camera control is a key capability for video generation models to produce geometrically meaningful content. Existing approaches typically learn a mapping from camera motion to video using additional camera modules and paired data. However, such datasets are often limited in scale, diversity, and scene dynamics, which can bias the model toward a narrow output distribution and compromise the strong prior learned by the base model. These limitations motivate a different perspective on camera control. In this paper, we show that camera control need not be modeled as an implicit mapping problem, but can instead be treated as a form of geometric guidance that induces displacements across frames. Specifically, we reformulate camera control into a set of displacement fields and apply them via differentiable resampling of latent features during denoising. Our simple approach achieves effective camera control with minimal degradation across diverse quality metrics compared to fine-tuned baselines. Since our method is applicable to most video diffusion models without training, it can also serve as a probe to study the camera control capabilities of base models. Using this probe, we identify universal biases shared by representative video models, as well as disparities in their responses to camera control. Finally, we benchmark their performance in multi-view generation, offering insights into their potential for 3D/4D tasks.
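The resampling step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how the displacement fields are derived from camera parameters or how they are applied inside the denoising loop. The sketch assumes a single 2D feature map, a uniform shift standing in for a simple camera pan, and bilinear interpolation with border clamping. The function names `uniform_displacement` and `bilinear_resample` are hypothetical.

```python
import math


def uniform_displacement(height, width, dx, dy):
    """Hypothetical displacement field: every location shifts by (dx, dy)."""
    return [[(dx, dy) for _ in range(width)] for _ in range(height)]


def bilinear_resample(feature, field):
    """Backward-warp a 2D map: out[y][x] samples feature at (x + dx, y + dy)."""
    height, width = len(feature), len(feature[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = field[y][x]
            # Clamp the sampling position to the map (border padding, an assumption).
            sx = min(max(x + dx, 0.0), width - 1.0)
            sy = min(max(y + dy, 0.0), height - 1.0)
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
            wx, wy = sx - x0, sy - y0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = (1 - wx) * feature[y0][x0] + wx * feature[y0][x1]
            bottom = (1 - wx) * feature[y1][x0] + wx * feature[y1][x1]
            out[y][x] = (1 - wy) * top + wy * bottom
    return out


# Toy "latent" map and a half-pixel horizontal shift.
latent = [[0.0, 1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0, 7.0]]
field = uniform_displacement(2, 4, dx=0.5, dy=0.0)
warped = bilinear_resample(latent, field)
```

Because each output value is a weighted blend of neighbouring features, it varies smoothly with the displacement almost everywhere, which is what makes this kind of resampling differentiable. A real video diffusion model would apply a per-frame field to multi-channel latents, typically with a tensor library rather than nested lists.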


Related papers