2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos
- URL: http://arxiv.org/abs/2503.09320v2
- Date: Thu, 13 Mar 2025 06:35:58 GMT
- Title: 2HandedAfforder: Learning Precise Actionable Bimanual Affordances from Human Videos
- Authors: Marvin Heidinger, Snehal Jauhri, Vignesh Prasad, Georgia Chalvatzaki
- Abstract summary: We propose a framework for extracting affordance data from human activity video datasets. We present a VLM-based affordance prediction model, 2HandedAfforder, trained on the dataset.
- Score: 6.562256987706128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When interacting with objects, humans effectively reason about which regions of objects are viable for an intended action, i.e., the affordance regions of the object. They can also account for subtle differences in object regions based on the task to be performed and whether one or two hands need to be used. However, current vision-based affordance prediction methods often reduce the problem to naive object part segmentation. In this work, we propose a framework for extracting affordance data from human activity video datasets. Our extracted 2HANDS dataset contains precise object affordance region segmentations and affordance class-labels as narrations of the activity performed. The data also accounts for bimanual actions, i.e., two hands co-ordinating and interacting with one or more objects. We present a VLM-based affordance prediction model, 2HandedAfforder, trained on the dataset and demonstrate superior performance over baselines in affordance region segmentation for various activities. Finally, we show that our predicted affordance regions are actionable, i.e., can be used by an agent performing a task, through demonstration in robotic manipulation scenarios.
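As a rough illustration of the input/output structure described in the abstract (and not the authors' released code), the sketch below shows what a bimanual affordance prediction interface could look like: an RGB frame plus a task narration map to per-hand affordance masks and an affordance class label. The names `BimanualAffordance` and `predict_affordances` are hypothetical, introduced here only for illustration.

```python
# Minimal illustrative sketch, assuming a hypothetical interface; this is NOT
# the 2HandedAfforder implementation. It only mirrors the paper's described
# inputs (image + task narration) and outputs (per-hand affordance masks).
from dataclasses import dataclass
import numpy as np


@dataclass
class BimanualAffordance:
    left_mask: np.ndarray   # HxW boolean mask for the left-hand affordance region
    right_mask: np.ndarray  # HxW boolean mask for the right-hand affordance region
    affordance_class: str   # narration-style label, e.g. "open the jar"


def predict_affordances(image: np.ndarray, narration: str) -> BimanualAffordance:
    """Placeholder predictor: a real VLM-based model would condition both
    masks on the image content and the task narration."""
    h, w = image.shape[:2]
    return BimanualAffordance(
        left_mask=np.zeros((h, w), dtype=bool),
        right_mask=np.zeros((h, w), dtype=bool),
        affordance_class=narration,
    )


if __name__ == "__main__":
    frame = np.zeros((480, 640, 3), dtype=np.uint8)  # dummy RGB frame
    out = predict_affordances(frame, "open the jar")
    print(out.affordance_class, out.left_mask.shape, out.right_mask.shape)
```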
Related papers
- SPOC: Spatially-Progressing Object State Change Segmentation in Video [52.65373395382122]
We introduce the spatially-progressing object state change segmentation task.
The goal is to segment at the pixel-level those regions of an object that are actionable and those that are transformed.
We demonstrate useful implications for tracking activity progress to benefit robotic agents.
arXiv Detail & Related papers (2025-03-15T01:48:54Z) - Holistic Understanding of 3D Scenes as Universal Scene Description [56.69740649781989]
3D scene understanding is a long-standing challenge in computer vision and a key component in enabling mixed reality, wearable computing, and embodied AI. We introduce an expertly curated dataset in the Universal Scene Description (USD) format featuring high-quality manual annotations. With its broad and high-quality annotations, the data provides the basis for holistic 3D scene understanding models.
arXiv Detail & Related papers (2024-12-02T11:33:55Z) - ADL4D: Towards A Contextually Rich Dataset for 4D Activities of Daily Living [4.221961702292134]
ADL4D is a dataset of up to two subjects interacting with different sets of objects performing Activities of Daily Living (ADL).
Our dataset consists of 75 sequences with a total of 1.1M RGB-D frames, hand and object poses, and per-hand fine-grained action annotations.
We develop an automatic system for multi-view multi-hand 3D pose annotation capable of tracking hand poses over time.
arXiv Detail & Related papers (2024-02-27T18:51:52Z) - Multi-Granularity Hand Action Detection [58.88274905101276]
The FHA-Kitchens dataset comprises 2,377 video clips and 30,047 frames, annotated with approximately 200k bounding boxes and 880 action categories.
We propose MG-HAD, an End-to-End Multi-Granularity Hand Action Detection method.
arXiv Detail & Related papers (2023-06-19T11:21:59Z) - ATTACH Dataset: Annotated Two-Handed Assembly Actions for Human Action Understanding [8.923830513183882]
We present the ATTACH dataset, which contains 51.6 hours of assembly with 95.2k annotated fine-grained actions monitored by three cameras.
In the ATTACH dataset, more than 68% of annotations overlap with other annotations, which is many times more than in related datasets.
We report the performance of state-of-the-art methods for action recognition as well as action detection on video and skeleton-sequence inputs.
arXiv Detail & Related papers (2023-04-17T12:31:24Z) - Learning Action-Effect Dynamics from Pairs of Scene-graphs [50.72283841720014]
We propose a novel method that leverages scene-graph representation of images to reason about the effects of actions described in natural language.
Our proposed approach is effective in terms of performance, data efficiency, and generalization capability compared to existing models.
arXiv Detail & Related papers (2022-12-07T03:36:37Z) - Effective Actor-centric Human-object Interaction Detection [20.564689533862524]
We propose a novel actor-centric framework to detect Human-Object Interaction in images.
Our method achieves state-of-the-art results on the challenging V-COCO and HICO-DET benchmarks.
arXiv Detail & Related papers (2022-02-24T10:24:44Z) - Co-segmentation Inspired Attention Module for Video-based Computer Vision Tasks [11.61956970623165]
We propose a generic module called "Co-Segmentation Module Activation" (COSAM) to promote the notion of co-segmentation based attention among a sequence of video frame features.
We show the application of COSAM in three video-based tasks, namely 1) video-based person re-ID, 2) video captioning, and 3) video action classification.
arXiv Detail & Related papers (2021-11-14T15:35:37Z) - Learning Visual Affordance Grounding from Demonstration Videos [76.46484684007706]
Affordance grounding aims to segment all possible interaction regions between people and objects from an image/video.
We propose a Hand-aided Affordance Grounding Network (HAGNet) that leverages the aided clues provided by the position and action of the hand in demonstration videos.
arXiv Detail & Related papers (2021-08-12T11:45:38Z) - DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping [72.84991726271024]
We describe an unsupervised method to detect and segment portions of images of live scenes that are seen moving as a coherent whole.
Our method first partitions the motion field by minimizing the mutual information between segments.
It uses the segments to learn object models that can be used for detection in a static image.
arXiv Detail & Related papers (2020-08-16T22:05:13Z) - The IKEA ASM Dataset: Understanding People Assembling Furniture through Actions, Objects and Pose [108.21037046507483]
IKEA ASM is a three million frame, multi-view, furniture assembly video dataset that includes depth, atomic actions, object segmentation, and human pose.
We benchmark prominent methods for video action recognition, object segmentation and human pose estimation tasks on this challenging dataset.
The dataset enables the development of holistic methods, which integrate multi-modal and multi-view data to better perform on these tasks.
arXiv Detail & Related papers (2020-07-01T11:34:46Z)