Related papers: UniHOI: Learning Fast, Dense and Generalizable 4D Reconstruction for Egocentric Hand Object Interaction Videos

UniHOI: Learning Fast, Dense and Generalizable 4D Reconstruction for Egocentric Hand Object Interaction Videos

URL: http://arxiv.org/abs/2411.09145v2
Date: Fri, 15 Nov 2024 12:27:39 GMT
Title: UniHOI: Learning Fast, Dense and Generalizable 4D Reconstruction for Egocentric Hand Object Interaction Videos
Authors: Chengbo Yuan, Geng Chen, Li Yi, Yang Gao,
Abstract summary: We introduce UniHOI, a model that unifies the estimation of all variables necessary for dense 4D reconstruction. UniHOI is the first approach to offer fast, dense, and general monocular egocentric HOI scene reconstruction in the presence of motion.
Score: 25.41337525728398
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Egocentric Hand Object Interaction (HOI) videos provide valuable insights into human interactions with the physical world, attracting growing interest from the computer vision and robotics communities. A key task in fully understanding the geometry and dynamics of HOI scenes is dense pointclouds sequence reconstruction. However, the inherent motion of both hands and the camera makes this challenging. Current methods often rely on time-consuming test-time optimization, making them impractical for reconstructing internet-scale videos. To address this, we introduce UniHOI, a model that unifies the estimation of all variables necessary for dense 4D reconstruction, including camera intrinsic, camera poses, and video depth, for egocentric HOI scene in a fast feed-forward manner. We end-to-end optimize all these variables to improve their consistency in 3D space. Furthermore, our model could be trained solely on large-scale monocular video dataset, overcoming the limitation of scarce labeled HOI data. We evaluate UniHOI with both in-domain and zero-shot generalization setting, surpassing all baselines in pointclouds sequence reconstruction and long-term 3D scene flow recovery. UniHOI is the first approach to offer fast, dense, and generalizable monocular egocentric HOI scene reconstruction in the presence of motion. Code and trained model will be released in the future.

Related papers

LaVR: Scene Latent Conditioned Generative Video Trajectory Re-Rendering using Large 4D Reconstruction Models [52.656349227001925]
Given a monocular video, the goal of video re-rendering is to generate views of the scene from a novel camera trajectory.<n>Existing methods face two distinct challenges.<n>We propose to address these challenges by using the implicit geometric knowledge embedded in the latent space of a large 4D reconstruction model.
arXiv Detail & Related papers (2026-01-21T05:46:03Z)
EgoGrasp: World-Space Hand-Object Interaction Estimation from Egocentric Videos [25.047225764745978]
We propose EgoGrasp, the first method to reconstruct world-space hand-object interactions (W-HOI) from egocentric monocular videos with dynamic cameras in the wild.<n>In experiments, we prove our method achieving state-of-the-art performance in W-HOI reconstruction.
arXiv Detail & Related papers (2026-01-03T03:08:48Z)
EgoThinker: Unveiling Egocentric Reasoning with Spatio-Temporal CoT [56.24624833924252]
EgoThinker is a framework that endows MLs with robust egocentric reasoning capabilities through-temporal chain-of-thought supervision and a two-stage learning curriculum.<n>EgoThinker outperforms existing methods across multiple egocentric benchmarks, while achieving substantial improvements in fine-grained-temporal localization tasks.
arXiv Detail & Related papers (2025-10-27T17:38:17Z)
Fine-grained Spatiotemporal Grounding on Egocentric Videos [13.319346673043286]
We introduce EgoMask, the first pixel-level benchmark for fine-temporal grounding in egocentric videos.<n>EgoMask is constructed by our proposed automatic annotation pipeline, which annotates referring expressions and object masks.<n>We also create EgoMask-Train, a large-scale training dataset to facilitate model development.
arXiv Detail & Related papers (2025-08-01T10:53:27Z)
PlayerOne: Egocentric World Simulator [73.88786358213694]
PlayerOne is the first egocentric realistic world simulator.<n>It generates egocentric videos that are strictly aligned with the real scene human motion of the user captured by an exocentric camera.
arXiv Detail & Related papers (2025-06-11T17:59:53Z)
EgoM2P: Egocentric Multimodal Multitask Pretraining [55.259234688003545]
Building large-scale egocentric multimodal and multitask models presents unique challenges.<n> EgoM2P is a masked modeling framework that learns from temporally-aware multimodal tokens to train a large, general-purpose model for egocentric 4D understanding.<n>We will fully open-source EgoM2P to support the community and advance egocentric vision research.
arXiv Detail & Related papers (2025-06-09T15:59:25Z)
Dyn-HaMR: Recovering 4D Interacting Hand Motion from a Dynamic Camera [49.82535393220003]
Dyn-HaMR is the first approach to reconstruct 4D global hand motion from monocular videos recorded by dynamic cameras in the wild. We show that our approach significantly outperforms state-of-the-art methods in terms of 4D global mesh recovery. This establishes a new benchmark for hand motion reconstruction from monocular video with moving cameras.
arXiv Detail & Related papers (2024-12-17T12:43:10Z)
Ego3DT: Tracking Every 3D Object in Ego-centric Videos [20.96550148331019]
This paper introduces a novel zero-shot approach for the 3D reconstruction and tracking of all objects from the ego-centric video. We present Ego3DT, a novel framework that initially identifies and extracts detection and segmentation information of objects within the ego environment. We have also innovated a dynamic hierarchical association mechanism for creating stable 3D tracking trajectories of objects in ego-centric videos.
arXiv Detail & Related papers (2024-10-11T05:02:31Z)
MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA [72.47344411599322]
This research aims to explore building a multimodal foundation model for egocentric video understanding. We automatically generate 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long in Ego4D based on human-annotated data. We contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths.
arXiv Detail & Related papers (2024-10-09T17:59:59Z)
EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects [89.95728475983263]
holistic 3Dunderstanding of such interactions from egocentric views is important for tasks in robotics, AR/VR, action recognition and motion generation. We design the HANDS23 challenge based on the AssemblyHands and ARCTIC datasets with carefully designed training and testing splits. Based on the results of the top submitted methods and more recent baselines on the leaderboards, we perform a thorough analysis on 3D hand(-object) reconstruction tasks.
arXiv Detail & Related papers (2024-03-25T05:12:21Z)
DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos. Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion. Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
EgoGen: An Egocentric Synthetic Data Generator [53.32942235801499]
EgoGen is a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks. At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment. We demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras, egocentric camera tracking, and human mesh recovery from egocentric views.
arXiv Detail & Related papers (2024-01-16T18:55:22Z)
3D Human Pose Perception from Egocentric Stereo Videos [67.9563319914377]
We propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation. Our method is able to accurately estimate human poses even in challenging scenarios, such as crouching and sitting. We will release UnrealEgo2, UnrealEgo-RW, and trained models on our project page.
arXiv Detail & Related papers (2023-12-30T21:21:54Z)
HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
HOLD -- the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video. We develop a compositional articulated implicit model that can disentangled 3D hand and object from 2D images. Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z)
Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips [38.02945794078731]
We tackle the task of reconstructing hand-object interactions from short video clips. Our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape. We empirically evaluate our approach on egocentric videos, and observe significant improvements over prior single-view and multi-view methods.
arXiv Detail & Related papers (2023-09-11T17:58:30Z)
EgoHumans: An Egocentric 3D Multi-Human Benchmark [37.375846688453514]
We present EgoHumans, a new multi-view multi-human video benchmark to advance the state-of-the-art of egocentric human 3D pose estimation and tracking. We propose a novel 3D capture setup to construct a comprehensive egocentric multi-human benchmark in the wild. We leverage consumer-grade wearable camera-equipped glasses for the egocentric view, which enables us to capture dynamic activities like playing tennis, fencing, volleyball, etc.
arXiv Detail & Related papers (2023-05-25T21:37:36Z)
Learning Dynamic View Synthesis With Few RGBD Cameras [60.36357774688289]
We propose to utilize RGBD cameras to synthesize free-viewpoint videos of dynamic indoor scenes. We generate point clouds from RGBD frames and then render them into free-viewpoint videos via a neural feature. We introduce a simple Regional Depth-Inpainting module that adaptively inpaints missing depth values to render complete novel views.
arXiv Detail & Related papers (2022-04-22T03:17:35Z)
Class-agnostic Reconstruction of Dynamic Objects from Videos [127.41336060616214]
We introduce REDO, a class-agnostic framework to REconstruct the Dynamic Objects from RGBD or calibrated videos. We develop two novel modules. First, we introduce a canonical 4D implicit function which is pixel-aligned with aggregated temporal visual cues. Second, we develop a 4D transformation module which captures object dynamics to support temporal propagation and aggregation.
arXiv Detail & Related papers (2021-12-03T18:57:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.