Fugu-MT Paper Translation (Abstract): HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

Paper overview: HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation

arxiv url: http://arxiv.org/abs/2407.17438v2
Date: Sun, 28 Jul 2024 05:00:10 GMT
Status: Translation complete
Last updated in system: 2024-07-30 20:22:03.397319
Title: HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
Title (reference translation): HumanVid: Demystifying Training Data for Camera-controllable Human Image Animation
Authors: Zhenzhi Wang, Yixuan Li, Yanhong Zeng, Youqing Fang, Yuwei Guo, Wenran Liu, Jing Tan, Kai Chen, Tianfan Xue, Bo Dai, Dahua Lin
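Abstract summary: We present HumanVid, the first large-scale high-quality dataset tailored for human image animation. For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets.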
Reference score (independently computed attention score): 64.37874983401221
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human image animation involves generating videos from a character photo, allowing user control and unlocking potential for video and movie production. While recent approaches yield impressive results using high-quality training data, the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the significance of camera motions in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet. Through a carefully designed rule-based filtering strategy, we ensure the inclusion of high-quality videos, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that considers both human and camera motions as conditions. Through extensive experimentation, we demonstrate that such simple baseline training on our HumanVid achieves state-of-the-art performance in controlling both human pose and camera motions, setting a new benchmark. Code and data will be publicly available at https://github.com/zhenzhiwang/HumanVid/.
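Abstract (reference translation): Human image animation generates videos from a character photo, giving users control and unlocking potential for video and movie production. Recent approaches achieve impressive results using high-quality training data, but the inaccessibility of these datasets hampers fair and transparent benchmarking. Moreover, these approaches prioritize 2D human motion and overlook the importance of camera motion in videos, leading to limited control and unstable video generation. To demystify the training data, we present HumanVid, the first large-scale high-quality dataset tailored for human image animation, which combines crafted real-world and synthetic data. For the real-world data, we compile a vast collection of copyright-free real-world videos from the internet. A carefully designed rule-based filtering strategy ensures that high-quality videos are included, resulting in a collection of 20K human-centric videos in 1080P resolution. Human and camera motion annotation is accomplished using a 2D pose estimator and a SLAM-based method. For the synthetic data, we gather 2,300 copyright-free 3D avatar assets to augment existing available 3D assets. Notably, we introduce a rule-based camera trajectory generation method, enabling the synthetic pipeline to incorporate diverse and precise camera motion annotation, which can rarely be found in real-world data. To verify the effectiveness of HumanVid, we establish a baseline model named CamAnimate, short for Camera-controllable Human Animation, that takes both human and camera motions as conditions. Through extensive experiments, we demonstrate that this simple baseline, trained on HumanVid, achieves state-of-the-art performance in controlling both human pose and camera motion, setting a new benchmark. Code and data will be publicly available at https://github.com/zhenzhiwang/HumanVid/.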
