Related papers: Fine-Grained Egocentric Hand-Object Segmentation: Dataset, Model, and Applications

Related papers

EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video [7.1221123957033905]
EgoDex is the largest and most diverse dataset of dexterous human manipulation to date.<n>It has 829 hours of egocentric video with paired 3D hand and finger tracking data collected at the time of recording.<n>The dataset covers a wide range of diverse manipulation behaviors with everyday household objects in 194 different tabletop tasks.
arXiv Detail & Related papers (2025-05-16T21:34:47Z)
Object-Shot Enhanced Grounding Network for Egocentric Video [60.97916755629796]
We propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video.<n>Specifically, we extract object information from videos to enrich video representation.<n>We analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer's attention information.
arXiv Detail & Related papers (2025-05-07T09:20:12Z)
SIGHT: Single-Image Conditioned Generation of Hand Trajectories for Hand-Object Interaction [86.54738165527502]
We introduce a novel task of generating realistic and diverse 3D hand trajectories given a single image of an object. Hand-object interaction trajectory priors can greatly benefit applications in robotics, embodied AI, augmented reality and related fields.
arXiv Detail & Related papers (2025-03-28T20:53:20Z)
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning [71.02843679746563]
In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature. In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process. We propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information.
arXiv Detail & Related papers (2025-03-02T18:49:48Z)
ManiVideo: Generating Hand-Object Manipulation Video with Dexterous and Generalizable Grasping [37.40475678197331]
We introduce ManiVideo, a method for generating consistent and temporally coherent bimanual hand-object manipulation videos. By embedding the MLO structure into the UNet in two forms, the model enhances the 3D consistency of dexterous hand-object manipulation. We propose an innovative training strategy that effectively integrates multiple datasets, supporting downstream tasks such as human-centric hand-object manipulation video generation.
arXiv Detail & Related papers (2024-12-18T00:37:55Z)
Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. We introduce an expertly curated dataset in the Universal Scene Description (USD) format featuring high-quality manual annotations. With its broad and high-quality annotations, the data provides the basis for holistic 3D scene understanding models.
arXiv Detail & Related papers (2024-12-02T11:33:55Z)
Dense Hand-Object(HO) GraspNet with Full Grasping Taxonomy and Dynamics [43.30868393851785]
HOGraspNet is a training dataset for 3D hand-object interaction. The dataset includes diverse hand shapes from 99 participants aged 10 to 74. It offers labels for 3D hand and object meshes, 3D keypoints, contact maps, and emphgrasp labels
arXiv Detail & Related papers (2024-09-06T05:49:38Z)
CaRe-Ego: Contact-aware Relationship Modeling for Egocentric Interactive Hand-object Segmentation [14.765419467710812]
Egocentric Interactive hand-object segmentation (EgoIHOS) is crucial for understanding human behavior in assistive systems.<n>Previous methods recognize hands and interacting objects as distinct semantic categories based solely on visual features.<n>We propose CaRe-Ego, which emphasizes the contact between hands and objects from two aspects.
arXiv Detail & Related papers (2024-07-08T03:17:10Z)
HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video [70.11702620562889]
HOLD -- the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video. We develop a compositional articulated implicit model that can disentangled 3D hand and object from 2D images. Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings.
arXiv Detail & Related papers (2023-11-30T10:50:35Z)
Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips [38.02945794078731]
We tackle the task of reconstructing hand-object interactions from short video clips. Our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape. We empirically evaluate our approach on egocentric videos, and observe significant improvements over prior single-view and multi-view methods.
arXiv Detail & Related papers (2023-09-11T17:58:30Z)
HANDAL: A Dataset of Real-World Manipulable Object Categories with Pose Annotations, Affordances, and Reconstructions [17.9178233068395]
We present the HANDAL dataset for category-level object pose estimation and affordance prediction. The dataset consists of 308k annotated image frames from 2.2k videos of 212 real-world objects in 17 categories. We outline the usefulness of our dataset for 6-DoF category-level pose+scale estimation and related tasks.
arXiv Detail & Related papers (2023-08-02T23:59:59Z)
Learning Fine-grained View-Invariant Representations from Unpaired Ego-Exo Videos via Temporal Alignment [71.16699226211504]
We propose to learn fine-grained action features that are invariant to the viewpoints by aligning egocentric and exocentric videos in time. To this end, we propose AE2, a self-supervised embedding approach with two key designs. For evaluation, we establish a benchmark for fine-grained video understanding in the ego-exo context.
arXiv Detail & Related papers (2023-06-08T19:54:08Z)
SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model. Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
Estimating 3D Motion and Forces of Human-Object Interactions from Internet Videos [49.52070710518688]
We introduce a method to reconstruct the 3D motion of a person interacting with an object from a single RGB video. Our method estimates the 3D poses of the person together with the object pose, the contact positions and the contact forces on the human body.
arXiv Detail & Related papers (2021-11-02T13:40:18Z)
H2O: Two Hands Manipulating Objects for First Person Interaction Recognition [70.46638409156772]
We present a comprehensive framework for egocentric interaction recognition using markerless 3D annotations of two hands manipulating objects. Our method produces annotations of the 3D pose of two hands and the 6D pose of the manipulated objects, along with their interaction labels for each frame. Our dataset, called H2O (2 Hands and Objects), provides synchronized multi-view RGB-D images, interaction labels, object classes, ground-truth 3D poses for left & right hands, 6D object poses, ground-truth camera poses, object meshes and scene point clouds.
arXiv Detail & Related papers (2021-04-22T17:10:42Z)
Ego-Exo: Transferring Visual Representations from Third-person to First-person Videos [92.38049744463149]
We introduce an approach for pre-training egocentric video models using large-scale third-person video datasets. Our idea is to discover latent signals in third-person video that are predictive of key egocentric-specific properties. Our experiments show that our Ego-Exo framework can be seamlessly integrated into standard video models.
arXiv Detail & Related papers (2021-04-16T06:10:10Z)
The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose. We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset. The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)
A Deep Learning Approach to Object Affordance Segmentation [31.221897360610114]
We design an autoencoder that infers pixel-wise affordance labels in both videos and static images. Our model surpasses the need for object labels and bounding boxes by using a soft-attention mechanism. We show that our model achieves competitive results compared to strongly supervised methods on SOR3D-AFF.
arXiv Detail & Related papers (2020-04-18T15:34:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.