Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
- URL: http://arxiv.org/abs/2503.00986v1
- Date: Sun, 02 Mar 2025 18:49:48 GMT
- Title: Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
- Authors: Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang
- Abstract summary: In egocentric video understanding, the motion of hands and objects, as well as their interactions, naturally plays a significant role. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. We propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information.
- Score: 71.02843679746563
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In egocentric video understanding, the motion of hands and objects, as well as their interactions, naturally plays a significant role. However, existing egocentric video representation learning methods mainly focus on aligning video representations with high-level narrations, overlooking the intricate dynamics between hands and objects. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. Since no suitable data is available, we introduce HOD, a novel pipeline that employs a hand-object detector and a large language model to generate high-quality narrations with detailed descriptions of hand-object dynamics. To learn these fine-grained dynamics, we propose EgoVideo, a model with a new lightweight motion adapter that captures fine-grained hand-object motion information. Through our co-training strategy, EgoVideo effectively and efficiently leverages the fine-grained hand-object dynamics in the HOD data. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple egocentric downstream tasks, including zero-shot improvements of 6.3% on EK-100 multi-instance retrieval, 5.7% on EK-100 classification, and 16.3% on EGTEA classification. Furthermore, our model exhibits robust generalization capabilities in hand-object interaction and robot manipulation tasks. Code and data are available at https://github.com/OpenRobotLab/EgoHOD/.
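The abstract does not detail the adapter architecture, so the following is only a rough, hypothetical sketch of what a "lightweight motion adapter" bolted onto a video backbone could look like in PyTorch; the module name, bottleneck design, temporal convolution, and all shapes are assumptions for illustration rather than the released EgoVideo implementation (see the linked repository for the actual code).

```python
# Hypothetical sketch only: a bottleneck adapter with a small temporal convolution,
# added residually to backbone features as a cheap way to inject motion cues.
# Names, shapes, and the design are assumptions, not the authors' EgoVideo adapter.
import torch
import torch.nn as nn


class MotionAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)        # project to a small bottleneck
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel_size=3, padding=1)
        self.up = nn.Linear(bottleneck, dim)          # project back to backbone width
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) features from a frozen video encoder block
        b, t, n, d = x.shape
        h = self.act(self.down(x))                    # (b, t, n, bottleneck)
        h = h.permute(0, 2, 3, 1).reshape(b * n, -1, t)
        h = self.temporal(h)                          # mix information across frames
        h = h.reshape(b, n, -1, t).permute(0, 3, 1, 2)
        return x + self.up(self.act(h))               # residual keeps original features


# Toy usage: 2 clips, 8 frames, 196 patch tokens, 768 channels.
features = torch.randn(2, 8, 196, 768)
adapter = MotionAdapter(dim=768)
print(adapter(features).shape)  # torch.Size([2, 8, 196, 768])
```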
Related papers
- TASTE-Rob: Advancing Video Generation of Task-Oriented Hand-Object Interaction for Generalizable Robotic Manipulation [18.083105886634115]
TASTE-Rob is a dataset of 100,856 egocentric hand-object interaction videos.
Each video is meticulously aligned with language instructions and recorded from a consistent camera viewpoint.
To enhance realism, we introduce a three-stage pose-refinement pipeline.
arXiv Detail & Related papers (2025-03-14T14:09:31Z)
- Improving Dynamic Object Interactions in Text-to-Video Generation with AI Feedback [130.090296560882]
We investigate the use of feedback to enhance the object dynamics in text-to-video models. We show that our method can effectively optimize a wide variety of rewards, with binary AI feedback driving the most significant improvements in video quality for dynamic interactions.
arXiv Detail & Related papers (2024-12-03T17:44:23Z)
- Dynamic Reconstruction of Hand-Object Interaction with Distributed Force-aware Contact Representation [52.36691633451968]
ViTaM-D is a visual-tactile framework for dynamic hand-object interaction reconstruction.
DF-Field is a distributed force-aware contact representation model.
Our results highlight the superior performance of ViTaM-D in both rigid and deformable object reconstruction.
arXiv Detail & Related papers (2024-11-14T16:29:45Z)
- EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting [95.44545809256473]
EgoGaussian is a method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone.
We show significant improvements in terms of both dynamic object and background reconstruction quality compared to the state-of-the-art.
arXiv Detail & Related papers (2024-06-28T10:39:36Z)
- DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
- Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation [26.292052071093945]
We propose an unsupervised method to generate videos from a single frame and a sparse motion input.
Our trained model can generate unseen realistic object-to-object interactions.
We show that YODA is on par with or better than state of the art video generation prior work in terms of both controllability and video quality.
arXiv Detail & Related papers (2023-06-06T19:50:02Z)
- Interaction Region Visual Transformer for Egocentric Action Anticipation [18.873728614415946]
We propose a novel way to represent human-object interactions for egocentric action anticipation.
We model interactions between hands and objects using Spatial Cross-Attention.
We then infuse contextual information using Trajectory Cross-Attention to obtain environment-refined interaction tokens.
Using these tokens, we construct an interaction-centric video representation for action anticipation.
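The summary names the mechanism but not its configuration, so here is a minimal, hypothetical PyTorch sketch of cross-attention in which hand tokens query object tokens; token counts, the embedding width, and the single-layer setup are assumptions for illustration, not the paper's actual model.

```python
# Hypothetical sketch: hand tokens (queries) attend to object tokens (keys/values),
# so each hand feature is refined by the object context it interacts with.
# All shapes and the single attention layer are illustrative assumptions.
import torch
import torch.nn as nn

dim = 256
hand_tokens = torch.randn(2, 4, dim)     # (batch, hand regions, channels)
object_tokens = torch.randn(2, 10, dim)  # (batch, object regions, channels)

cross_attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
interaction_tokens, attn_weights = cross_attn(
    query=hand_tokens, key=object_tokens, value=object_tokens
)
print(interaction_tokens.shape)          # torch.Size([2, 4, 256])
```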
arXiv Detail & Related papers (2022-11-25T15:00:51Z)
- Fine-Grained Egocentric Hand-Object Segmentation: Dataset, Model, and Applications [20.571026014771828]
We provide a labeled dataset consisting of 11,243 egocentric images with per-pixel segmentation labels of hands and objects being interacted with.
Our dataset is the first to label detailed hand-object contact boundaries.
We show that our robust hand-object segmentation model and dataset can serve as a foundational tool to boost or enable several downstream vision applications.
arXiv Detail & Related papers (2022-08-07T21:43:40Z)
- Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
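As intuition for what "decoupling visual representation learning and dynamics learning" can mean in code, here is a heavily simplified, hypothetical sketch: a masked autoencoder learns features from observations, while a separate dynamics model is trained on detached features so its gradients never reach the encoder. The MLP modules, sizes, and loss terms are assumptions for illustration, not the paper's recipe.

```python
# Hypothetical sketch of decoupled training: reconstruction updates the autoencoder,
# while the dynamics model learns on detached features (no gradient into the encoder).
# All architectures and sizes are illustrative assumptions, not the paper's method.
import torch
import torch.nn as nn

class MaskedAutoencoder(nn.Module):
    def __init__(self, obs_dim=3 * 64 * 64, feat_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU(), nn.Linear(512, feat_dim))
        self.decoder = nn.Sequential(nn.Linear(feat_dim, 512), nn.ReLU(), nn.Linear(512, obs_dim))

    def forward(self, obs, mask_ratio=0.75):
        mask = (torch.rand_like(obs) > mask_ratio).float()  # randomly hide most inputs
        feat = self.encoder(obs * mask)
        recon_loss = ((self.decoder(feat) - obs) ** 2).mean()
        return feat, recon_loss

class LatentDynamics(nn.Module):
    def __init__(self, feat_dim=256, act_dim=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim + act_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))

    def forward(self, feat, action):
        return self.net(torch.cat([feat, action], dim=-1))  # predict next-step features

obs, next_obs, action = torch.rand(8, 3 * 64 * 64), torch.rand(8, 3 * 64 * 64), torch.rand(8, 4)
mae, dyn = MaskedAutoencoder(), LatentDynamics()

feat, recon_loss = mae(obs)                                 # representation learning
with torch.no_grad():
    target = mae.encoder(next_obs)                          # target features, no encoder gradient
dyn_loss = ((dyn(feat.detach(), action) - target) ** 2).mean()
(recon_loss + dyn_loss).backward()                          # two losses, decoupled gradient paths
```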