GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion
- URL: http://arxiv.org/abs/2505.23085v1
- Date: Thu, 29 May 2025 04:41:04 GMT
- Title: GeoMan: Temporally Consistent Human Geometry Estimation using Image-to-Video Diffusion
- Authors: Gwanghyun Kim, Xueting Li, Ye Yuan, Koki Nagano, Tianye Li, Jan Kautz, Se Young Chun, Umar Iqbal
- Abstract summary: GeoMan is a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. It addresses the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. It achieves state-of-the-art performance in both qualitative and quantitative evaluations.
- Score: 61.992868017910645
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating accurate and temporally consistent 3D human geometry from videos is a challenging problem in computer vision. Existing methods, primarily optimized for single images, often suffer from temporal inconsistencies and fail to capture fine-grained dynamic details. To address these limitations, we present GeoMan, a novel architecture designed to produce accurate and temporally consistent depth and normal estimations from monocular human videos. GeoMan addresses two key challenges: the scarcity of high-quality 4D training data and the need for metric depth estimation to accurately model human size. To overcome the first challenge, GeoMan employs an image-based model to estimate depth and normals for the first frame of a video, which then conditions a video diffusion model, reframing the video geometry estimation task as an image-to-video generation problem. This design offloads the heavy lifting of geometric estimation to the image model and simplifies the video model's role to focus on intricate details while using priors learned from large-scale video datasets. Consequently, GeoMan improves temporal consistency and generalizability while requiring minimal 4D training data. To address the challenge of accurate human size estimation, we introduce a root-relative depth representation that retains critical human-scale details and is easier to estimate from monocular inputs, overcoming the limitations of traditional affine-invariant and metric depth representations. GeoMan achieves state-of-the-art performance in both qualitative and quantitative evaluations, demonstrating its effectiveness in overcoming longstanding challenges in 3D human geometry estimation from videos.
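The image-to-video reframing above can be summarized in a few lines of data flow. The sketch below is schematic only and does not reproduce the released GeoMan code; the names ImageGeoModel, VideoGeoModel, and estimate_video_geometry are hypothetical stand-ins. The point it illustrates is the division of labor: a single-image model does the heavy geometric lifting on frame 0, and its output conditions a video model that only has to propagate and refine geometry over time.

```python
from typing import Callable
import numpy as np

# Hypothetical stand-ins for the two models described in the abstract; any
# per-image geometry estimator and any first-frame-conditioned video
# diffusion model would fit these shapes.
ImageGeoModel = Callable[[np.ndarray], np.ndarray]              # frame -> geometry map
VideoGeoModel = Callable[[np.ndarray, np.ndarray], np.ndarray]  # (video, first-frame geometry) -> geometry video

def estimate_video_geometry(video: np.ndarray,
                            image_model: ImageGeoModel,
                            video_model: VideoGeoModel) -> np.ndarray:
    """Image-to-video reframing: a per-image estimate for frame 0 conditions
    the video model, whose role reduces to temporal propagation of detail."""
    first_frame_geo = image_model(video[0])     # heavy lifting: single-image estimate
    return video_model(video, first_frame_geo)  # temporally consistent refinement

# Toy usage with dummy models, just to show the data flow.
video = np.random.rand(8, 64, 64, 3)                        # T x H x W x C
image_model = lambda frame: frame.mean(-1, keepdims=True)   # placeholder "depth"
video_model = lambda vid, g0: np.repeat(g0[None], len(vid), axis=0)
geo = estimate_video_geometry(video, image_model, video_model)
assert geo.shape == (8, 64, 64, 1)
```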
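The root-relative depth representation can likewise be made concrete with a minimal sketch under my own assumptions (the helpers to_root_relative and to_metric are hypothetical; the paper does not publish this exact formulation): depths inside the person mask are stored as offsets from the depth of a root point such as the pelvis, which keeps human-scale detail while factoring out the global camera distance that is hard to recover from a single view.

```python
import numpy as np

def to_root_relative(depth, mask, root_xy):
    """Express depth inside a human mask relative to the depth at the root
    joint pixel (e.g., the pelvis). Returns the relative map and the root
    depth needed to invert the transform."""
    root_depth = float(depth[root_xy[1], root_xy[0]])
    rel = np.where(mask, depth - root_depth, 0.0)  # person-centric offsets
    return rel, root_depth

def to_metric(rel, mask, root_depth):
    """Recover metric depth once a root depth is estimated elsewhere."""
    return np.where(mask, rel + root_depth, 0.0)

# Toy example: a person roughly 3 m away, one limb slightly closer.
depth = np.full((4, 4), 3.0)
depth[1, 1] = 2.9
mask = np.ones((4, 4), dtype=bool)
rel, root = to_root_relative(depth, mask, root_xy=(2, 2))
assert np.allclose(to_metric(rel, mask, root), depth)
```

One way to read this, consistent with the abstract's claim: unlike affine-invariant depth, the offsets keep metric scale, so body proportions survive; unlike full metric depth, only a single scalar (the root depth) must be resolved to place the person in the scene.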
Related papers
- E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models [78.1674905950243]
We present the first comprehensive benchmark for 3D geometric foundation models (GFMs). GFMs directly predict dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
arXiv Detail & Related papers (2025-06-02T17:53:09Z)
- GRACE: Estimating Geometry-level 3D Human-Scene Contact from 2D Images [54.602947113980655]
Geometry-level human-scene contact estimation aims to ground specific contact surface points on 3D human geometries. GRACE (Geometry-level Reasoning for 3D Human-scene Contact Estimation) is a new paradigm for 3D human contact estimation. It incorporates a point cloud encoder-decoder architecture along with a hierarchical feature extraction and fusion module.
arXiv Detail & Related papers (2025-05-10T09:25:46Z)
- Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction [72.54905331756076]
We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic prior captured by such video models, Geo4D can be trained using only synthetic data.
arXiv Detail & Related papers (2025-04-10T17:59:55Z)
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes. We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
- FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z)
- TAPE: Temporal Attention-based Probabilistic human pose and shape Estimation [7.22614468437919]
Existing methods ignore the ambiguities of the reconstruction and provide a single deterministic estimate for the 3D pose.
We present a Temporal Attention based Probabilistic human pose and shape Estimation method (TAPE) that operates on an RGB video.
We show that TAPE outperforms state-of-the-art methods in standard benchmarks.
arXiv Detail & Related papers (2023-04-29T06:08:43Z)
- Learning High Fidelity Depths of Dressed Humans by Watching Social Media Dance Videos [21.11427729302936]
We present a new method that uses local transformations to warp the predicted local geometry of a person in one image to that in another image captured at a different time instant. Our method is end-to-end trainable, resulting in high-fidelity depth estimation that predicts fine geometry faithful to the real input image.
arXiv Detail & Related papers (2021-03-04T20:46:30Z)