Fugu-MT Paper Translation (Abstract): ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

Paper summary: ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation

arxiv url: http://arxiv.org/abs/2605.06667v1
Date: Thu, 07 May 2026 17:59:58 GMT
Status: Translation complete
Last updated in system: 2026-05-08 22:27:12.080743
Title: ActCam: Zero-Shot Joint Camera and 3D Motion Control for Video Generation
Authors: Omar El Khalifi, Thomas Rossi, Oscar Fossey, Thibault Fouque, Ulysse Mizrahi, Philip Torr, Ivan Laptev, Fabio Pizzati, Baptiste Bellot-Gurlet
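Abstract summary: ActCam is a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of camera parameters. It builds on any pretrained image-to-video diffusion model that accepts conditioning on scene depth and character pose. Compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity.
Reference score (independently computed attention score): 34.51506212196978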
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: For artistic applications, video generation requires fine-grained control over both performance and cinematography, i.e., the actor's motion and the camera trajectory. We present ActCam, a zero-shot method for video generation that jointly transfers character motion from a driving video into a new scene and enables per-frame control of intrinsic and extrinsic camera parameters. ActCam builds on any pretrained image-to-video diffusion model that accepts conditioning in terms of scene depth and character pose. Given a source video with a moving character and a target camera motion, ActCam generates pose and depth conditions that remain geometrically consistent across frames. We then run a single sampling process with a two-phase conditioning schedule: early denoising steps condition on both pose and sparse depth to enforce scene structure, after which depth is dropped and pose-only guidance refines high-frequency details without over-constraining the generation. We evaluate ActCam on multiple benchmarks spanning diverse character motions and challenging viewpoint changes. We find that, compared to pose-only control and other pose and camera methods, ActCam improves camera adherence and motion fidelity, and is preferred in human evaluations, especially under large viewpoint changes. Our results highlight that careful camera-consistent conditioning and staged guidance can enable strong joint camera and motion control without training. Project page: https://elkhomar.github.io/actcam/.
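The two-phase conditioning schedule described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `two_phase_schedule`, the `depth_fraction` parameter, and the string labels `"pose"` and `"depth"` are assumptions introduced here for illustration, and the abstract does not state at which step depth is dropped.

```python
from typing import List, Set


def two_phase_schedule(num_steps: int, depth_fraction: float) -> List[Set[str]]:
    """Return the active conditioning signals for each denoising step.

    Hypothetical helper: `depth_fraction` is the share of early steps
    that condition on both pose and sparse depth. The abstract gives no
    value for this, so the caller must choose one.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    if not 0.0 <= depth_fraction <= 1.0:
        raise ValueError("depth_fraction must be in [0, 1]")

    # Index of the first step that runs without depth conditioning.
    switch_step = int(round(num_steps * depth_fraction))

    schedule: List[Set[str]] = []
    for step in range(num_steps):
        if step < switch_step:
            # Phase 1: pose and sparse depth together enforce scene structure.
            schedule.append({"pose", "depth"})
        else:
            # Phase 2: depth is dropped and pose-only guidance refines details.
            schedule.append({"pose"})
    return schedule


if __name__ == "__main__":
    for i, active in enumerate(two_phase_schedule(10, 0.5)):
        print(i, sorted(active))
```

In a real sampler, each entry would decide which conditioning inputs are passed to the diffusion model at that step. Both phases run inside one sampling pass, which matches the abstract's description of a single sampling process.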
