Fugu-MT Paper Translation (Abstract): HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

Paper overview: HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image

arxiv url: http://arxiv.org/abs/2606.02573v1
Date: Mon, 01 Jun 2026 17:58:11 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:32.565526
Title: HumanNOVA: Photorealistic, Universal and Rapid 3D Human Avatar Modeling from a Single Image
Authors: Hezhen Hu, Wangbo Zhao, Lanqing Guo, Hanwen Jiang, Jonathan C. Liu, Zhiwen Fan, Kai Wang, Zhangyang Wang, Georgios Pavlakos
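Abstract summary: We present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Architecturally, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that enables fast inference in under one second.
Reference score (independently computed attention metric): 84.81016200801153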
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we present HumanNOVA, a photorealistic, universal, and rapid model for generating 3D human avatars from a single RGB image. Achieving both photorealism and generalization is challenging due to the scarcity of diverse, high-quality 3D human data. To address this, we build a scalable data generation pipeline that follows two strategies. The first one is to leverage existing rigged assets and animate them with extensive poses from daily life. The second strategy is to utilize existing multi-camera captures of humans and employ fitting to generate more diverse views for training. These two strategies enable us to scale up to 100k assets, significantly enhancing both the quantity and the diversity of data for robust model training. In terms of the architecture, HumanNOVA adopts a feed-forward, token-conditioned avatar modeling framework that allows fast inference in less than one second and requires no test-time optimization. Given an input image and an estimated simplified human mesh (SMPL) without detailed geometry or appearance, the model first encodes both inputs into compact token representations. These tokens then act as conditioning signals and are fused through cross-attention to construct a triplane-based 3D avatar representation. Extensive experiments on multiple benchmarks demonstrate the superiority of our approach, both quantitatively and qualitatively, as well as its robustness under diverse input image conditions. Project page at https://HumanNOVA.github.io.
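The token fusion step described in the abstract can be illustrated with a minimal sketch of single-head cross-attention. This is not the authors' implementation: the function names, the token dimension, the toy values, and the use of identity projections (no learned weights) are all assumptions made purely for illustration. Hypothetical triplane query tokens attend over the concatenation of image tokens and SMPL tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, context):
    """Single-head scaled dot-product cross-attention.

    Assumption: identity projections are used in place of learned
    query/key/value weight matrices, to keep the sketch short.

    queries: list of query vectors (hypothetical triplane tokens).
    context: list of key/value vectors (image tokens + SMPL tokens).
    Returns one fused vector per query.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in queries:
        # Similarity of this query to every conditioning token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        # Weighted average of the conditioning tokens.
        fused.append(
            [sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)]
        )
    return fused


# Toy inputs; every value here is made up for illustration.
image_tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
smpl_tokens = [[0.0, 0.0, 1.0, 0.0]]
triplane_queries = [
    [0.5, 0.5, 0.0, 0.0],  # leans toward the image tokens
    [0.0, 0.0, 2.0, 0.0],  # leans toward the SMPL token
    [0.0, 0.0, 0.0, 0.0],  # attends uniformly
]

conditioning = image_tokens + smpl_tokens
triplane_tokens = cross_attention(triplane_queries, conditioning)
```

In a real model the queries would typically be learned embeddings laid out on three axis-aligned feature planes, and the attention would be multi-head with learned projections; the sketch only shows how conditioning tokens from two sources can be mixed into each output token.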


Related papers