Related papers: HOSt3R: Keypoint-free Hand-Object 3D Reconstruction from RGB images

URL: http://arxiv.org/abs/2508.16465v2
Date: Mon, 25 Aug 2025 16:02:08 GMT
Title: HOSt3R: Keypoint-free Hand-Object 3D Reconstruction from RGB images
Authors: Anilkumar Swamy, Vincent Leroy, Philippe Weinzaepfel, Jean-Sébastien Franco, Grégory Rogez
Abstract summary: We propose a robust, keypoint detector-free approach to estimating hand-object 3D transformations from monocular motion video/images. We further integrate this with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained, does not rely on pre-scanned object templates or camera intrinsics, and reaches state-of-the-art performance.
Score: 27.025336665386735
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Hand-object 3D reconstruction has become increasingly important for applications in human-robot interaction and immersive AR/VR experiences. A common approach for object-agnostic hand-object reconstruction from RGB sequences involves a two-stage pipeline: hand-object 3D tracking followed by multi-view 3D reconstruction. However, existing methods rely on keypoint detection techniques, such as Structure from Motion (SfM) and hand-keypoint optimization, which struggle with diverse object geometries, weak textures, and mutual hand-object occlusions, limiting scalability and generalization. As a key enabler to generic and seamless, non-intrusive applicability, we propose in this work a robust, keypoint detector-free approach to estimating hand-object 3D transformations from monocular motion video/images. We further integrate this with a multi-view reconstruction pipeline to accurately recover hand-object 3D shape. Our method, named HOSt3R, is unconstrained, does not rely on pre-scanned object templates or camera intrinsics, and reaches state-of-the-art performance for the tasks of object-agnostic hand-object 3D transformation and shape estimation on the SHOWMe benchmark. We also experiment on sequences from the HO3D dataset, demonstrating generalization to unseen object categories.

Related papers

E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models [78.1674905950243]
We present the first comprehensive benchmark for 3D geometric foundation models (GFMs). GFMs directly predict dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
arXiv Detail & Related papers (2025-06-02T17:53:09Z)
SR3D: Unleashing Single-view 3D Reconstruction for Transparent and Specular Object Grasping [7.222966501323922]
We propose a training-free framework, SR3D, that enables robotic grasping of transparent and specular objects from a single-view observation. Specifically, given single-view RGB and depth images, SR3D first uses external visual models to generate a reconstructed 3D object mesh. Then, the key idea is to determine the 3D object's pose and scale to accurately localize the reconstructed object back into its original depth-corrupted 3D scene.
arXiv Detail & Related papers (2025-05-30T07:38:46Z)
Reconstructing Hand-Held Objects in 3D from Images and Videos [53.277402172488735]
Given a monocular RGB video, we aim to reconstruct hand-held object geometry in 3D, over time. We present MCC-Hand-Object (MCC-HO), which jointly reconstructs hand and object geometry given a single RGB image. We then prompt a text-to-3D generative model using GPT-4(V) to retrieve a 3D object model that matches the object in the image.
arXiv Detail & Related papers (2024-04-09T17:55:41Z)
HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
HOLD is the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video. We develop a compositional articulated implicit model that can disentangle the 3D hand and object from 2D images. Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z)
Decaf: Monocular Deformation Capture for Face and Hand Interactions [77.75726740605748]
This paper introduces the first method that allows tracking human hands interacting with human faces in 3D from single monocular RGB videos. We model hands as articulated objects inducing non-rigid face deformations during an active interaction. Our method relies on a new hand-face motion and interaction capture dataset with realistic face deformations acquired with a markerless multi-view camera system.
arXiv Detail & Related papers (2023-09-28T17:59:51Z)
SHOWMe: Benchmarking Object-agnostic Hand-Object 3D Reconstruction [13.417086460511696]
We introduce the SHOWMe dataset, which consists of 96 videos annotated with real and detailed hand-object 3D textured meshes. We consider a rigid hand-object scenario, in which the pose of the hand with respect to the object remains constant during the whole video sequence. This assumption allows us to register sub-millimetre-precise ground-truth 3D scans to the image sequences in SHOWMe.
arXiv Detail & Related papers (2023-09-19T16:48:29Z)
Anything-3D: Towards Single-view Anything Reconstruction in the Wild [61.090129285205805]
We introduce Anything-3D, a methodical framework that ingeniously combines a series of visual-language models and the Segment-Anything object segmentation model. Our approach employs a BLIP model to generate textual descriptions, utilizes the Segment-Anything model for the effective extraction of objects of interest, and leverages a text-to-image diffusion model to lift the object into a neural radiance field.
arXiv Detail & Related papers (2023-04-19T16:39:51Z)
What's in your hands? 3D Reconstruction of Generic Objects in Hands [49.12461675219253]
Our work aims to reconstruct hand-held objects given a single RGB image. In contrast to prior works that typically assume known 3D templates and reduce the problem to 3D pose estimation, our work reconstructs generic hand-held objects without knowing their 3D templates.
arXiv Detail & Related papers (2022-04-14T17:59:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
