Fugu-MT Paper Translation (Abstract): GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

Paper summary: GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion

arxiv url: http://arxiv.org/abs/2505.23085v1
Date: Thu, 29 May 2025 04:41:04 GMT
Status: Translation complete
Last updated in system: 2025-05-30 18:14:07.680435
Title: GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion
Title (reference translation): GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion
Authors: Gwanghyun Kim, Xueting Li, Ye Yuan, Koki Nagano, Tianye Li, Jan Kautz, Se Young Chun, Umar Iqbal
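Abstract summary: GeoMan is a novel architecture designed to produce accurate and temporally consistent depth and normal estimates from monocular human videos. It addresses the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. It achieves state-of-the-art performance in both qualitative and quantitative evaluations.
Reference score (independently computed attention score): 61.992868017910645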
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model's role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the limitations of traditional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from videos.
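Abstract (reference translation): Estimating accurate and temporally consistent 3D human geometry from video is a difficult problem in computer vision. Existing methods are optimized mainly for single images, so they often suffer from temporal inconsistency and fail to capture fine-grained dynamic detail. To address these limitations, we propose GeoMan, a novel architecture designed to produce accurate, temporally consistent depth and normal estimates from monocular human videos. GeoMan tackles two major challenges: the scarcity of high-quality 4D training data, and the need for metric depth estimation to model human size accurately. To overcome the first challenge, GeoMan uses an image-based model to estimate depth and normals for the first frame of a video; this estimate then conditions a video diffusion model, recasting the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model's role, letting it focus on intricate details while using priors learned from large-scale video datasets. As a result, GeoMan improves temporal consistency and generalizability while requiring only minimal 4D training data. To address accurate human size estimation, we introduce a root-relative depth representation that preserves important human-scale detail and is easier to estimate from monocular input, overcoming the limitations of conventional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from video.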
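The abstract does not define the root-relative depth representation, so the following is only a minimal sketch under a stated assumption: that each foreground pixel's depth is expressed relative to the depth of a single root point on the body (for example the pelvis). The names `to_root_relative`, `to_metric`, and `root_depth` are hypothetical and do not come from the paper. The sketch illustrates why such a representation could keep human-scale detail in metres while factoring out the absolute camera distance, which is hard to recover from monocular input.

```python
from typing import List, Optional

# A depth map as rows of metric depths in metres; None marks background pixels.
DepthMap = List[List[Optional[float]]]


def to_root_relative(metric_depth: DepthMap, root_depth: float) -> DepthMap:
    """Subtract the root depth from every foreground pixel (assumed definition)."""
    return [
        [None if d is None else d - root_depth for d in row]
        for row in metric_depth
    ]


def to_metric(root_relative: DepthMap, root_depth: float) -> DepthMap:
    """Invert the transform once a root depth is known."""
    return [
        [None if d is None else d + root_depth for d in row]
        for row in root_relative
    ]


# Toy 2x3 depth map of a person whose root point is 3 m from the camera.
near = [[None, 2.75, 3.0], [3.25, 3.0, None]]
# The same person moved 5 m farther away (root point at 8 m).
far = [[None if d is None else d + 5.0 for d in row] for row in near]

rel_near = to_root_relative(near, root_depth=3.0)
rel_far = to_root_relative(far, root_depth=8.0)

# Both views give the same root-relative map, still expressed in metres.
same_shape = rel_near == rel_far
```

Under this assumption, the representation differs from an affine-invariant depth map, which discards absolute scale, and from a full metric depth map, which also requires the camera-to-subject distance.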


Related papers