Fugu-MT Paper Translation (Abstract): Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation

Paper abstract: Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation

arxiv url: http://arxiv.org/abs/2606.01900v1
Date: Mon, 01 Jun 2026 08:41:01 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:31.623595
Title: Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation
Title (reference translation): Auteur: Language-Driven Cinematographic Framing for Human-Centered Video Generation
Authors: Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Xuelin Chen, Erkut Erdem, Aykut Erdem, Duygu Ceylan
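Abstract summary: We propose Auteur, a method for language-driven, human-centric camera framing in generative video. Auteur enables cinematographic framing of human-centered scenes.
Reference score (independently computed attention score): 48.49793109378558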
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generative video models have achieved remarkable visual fidelity and temporal coherence, yet intentional camera control remains elusive. Existing frameworks treat camera motion as a byproduct of pixel synthesis, producing trajectories that are stochastic, spatially inconsistent, and indifferent to the human subject driving the scene. In this work, we present Auteur, a method for language-driven, human-centric camera framing in generative video. Our core insight is that professional filmmakers conceive shots not as world-space trajectories but as framings defined relative to the actor, encoding shot size, angle, and composition as functions of human pose and motion. We formalize this intuition as a human-centric camera parameterization and introduce a Domain-Specific Language (DSL) that is convertible to standard 6-DoF camera parameters. A fine-tuned multimodal large language model then acts as a virtual director, mapping natural language descriptions and coarse human motion to sparse DSL keyframes that are deterministically interpolated into continuous camera trajectories, which are then provided as input to video generators. We train and evaluate Auteur on a new dataset of 34K aligned text, human motion, and DSL-annotated camera trajectories drawn from procedural synthesis and real-world movie footage from the CondensedMovies dataset. Auteur enables cinematographic framing of human-centered scenes, a capability largely absent in prior generative models. To assess this behavior, we propose new framing-focused metrics, and our experiments show that Auteur consistently outperforms existing methods.
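Abstract (reference translation): Generative video models achieve remarkable visual fidelity and temporal coherence, but intentional camera control remains elusive. Existing frameworks treat camera motion as a byproduct of pixel synthesis and produce trajectories that are stochastic, spatially inconsistent, and indifferent to the human subject driving the scene. In this work, we introduce Auteur, a method for language-driven, human-centric camera framing in generative video. Our core insight is that professional filmmakers think of a shot not as a world-space trajectory but as a framing relative to the actor, encoding shot size, angle, and composition as functions of human pose and motion. We formalize this intuition as a human-centric camera parameterization and introduce a domain-specific language (DSL) that can be converted to standard 6-DoF camera parameters. A fine-tuned multimodal large language model acts as a virtual director: it maps natural language descriptions and coarse human motion to sparse DSL keyframes, which are deterministically interpolated into continuous camera trajectories and passed as input to video generators. We train and evaluate Auteur on a new dataset of 34K aligned samples of text, human motion, and DSL-annotated camera trajectories, drawn from procedural synthesis and real-world movie footage from the CondensedMovies dataset. Auteur enables cinematographic framing of human-centered scenes. To evaluate this behavior, we propose new framing-focused metrics, and our experiments show that Auteur consistently outperforms existing methods.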
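The abstract describes sparse, actor-relative keyframes that are deterministically interpolated and converted to 6-DoF camera parameters, but it does not show the DSL itself. The following is only a minimal sketch of that general idea, not the paper's method. Every name here (`Keyframe`, `interpolate`, `to_world_pose`), the choice of distance, azimuth, and elevation as framing parameters, the y-up coordinate convention, and the use of piecewise-linear interpolation are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Keyframe:
    """Hypothetical actor-relative framing; not the paper's actual DSL."""
    t: float          # time in seconds
    distance: float   # camera-to-actor distance (a rough proxy for shot size)
    azimuth: float    # degrees around the actor, 0 = directly in front
    elevation: float  # degrees above the actor's horizontal plane


def interpolate(keys, t):
    """Piecewise-linear interpolation of sparse keyframes at time t.

    Assumption: the abstract only says interpolation is deterministic;
    linear blending (without azimuth wrap-around) is chosen for simplicity.
    """
    keys = sorted(keys, key=lambda k: k.t)
    if t <= keys[0].t:
        return keys[0]
    if t >= keys[-1].t:
        return keys[-1]
    for a, b in zip(keys, keys[1:]):
        if a.t <= t < b.t:
            w = (t - a.t) / (b.t - a.t)
            return Keyframe(
                t,
                a.distance + w * (b.distance - a.distance),
                a.azimuth + w * (b.azimuth - a.azimuth),
                a.elevation + w * (b.elevation - a.elevation),
            )


def to_world_pose(kf, actor_pos, actor_heading_deg):
    """Convert an actor-relative framing to a world-space 6-DoF pose.

    Assumed convention: y is up, heading 0 means the actor faces +z.
    Returns (x, y, z, yaw_deg, pitch_deg, roll_deg); the camera looks
    at the actor and roll is fixed at 0.
    """
    ax, ay, az = actor_pos
    angle = math.radians(actor_heading_deg + kf.azimuth)
    elev = math.radians(kf.elevation)
    horizontal = kf.distance * math.cos(elev)
    cx = ax + horizontal * math.sin(angle)
    cy = ay + kf.distance * math.sin(elev)
    cz = az + horizontal * math.cos(angle)
    # Viewing direction points from the camera back to the actor.
    dx, dy, dz = ax - cx, ay - cy, az - cz
    yaw = math.degrees(math.atan2(dx, dz))
    pitch = math.degrees(math.atan2(dy, math.hypot(dx, dz)))
    return (cx, cy, cz, yaw, pitch, 0.0)


# Usage: a slow push-in that also orbits from the front to the actor's side.
keys = [Keyframe(0.0, 4.0, 0.0, 0.0), Keyframe(2.0, 2.0, 90.0, 0.0)]
mid = interpolate(keys, 1.0)
pose = to_world_pose(mid, (0.0, 0.0, 0.0), 0.0)
```

Because the framing is stored relative to the actor, the same keyframes yield a different world-space trajectory whenever the actor's position or heading changes, which is the property the abstract attributes to how filmmakers conceive shots.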
