Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
- URL: http://arxiv.org/abs/2408.10123v2
- Date: Mon, 15 Sep 2025 15:39:03 GMT
- Title: Learning Precise Affordances from Egocentric Videos for Robotic Manipulation
- Authors: Gen Li, Nikolaos Tsagkas, Jifei Song, Ruaridh Mon-Williams, Sethu Vijayakumar, Kun Shao, Laura Sevilla-Lara
- Abstract summary: Affordance, defined as the potential actions that an object offers, is crucial for embodied AI agents. We propose a complete affordance learning system that takes in egocentric videos and outputs precise affordance annotations without human labeling. We also introduce a framework that facilitates affordance-oriented robotic manipulation such as tool grasping and robot-to-human tool handover.
- Score: 25.929092988536087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Affordance, defined as the potential actions that an object offers, is crucial for embodied AI agents. For example, such knowledge directs an agent to grasp a knife by the handle for cutting or by the blade for safe handover. While existing approaches have made notable progress, affordance research still faces three key challenges: data scarcity, poor generalization, and real-world deployment. Specifically, there is a lack of large-scale affordance datasets with precise segmentation maps, existing models struggle to generalize across different domains or novel object and affordance classes, and little work demonstrates deployability in real-world scenarios. In this work, we address these issues by proposing a complete affordance learning system that (1) takes in egocentric videos and outputs precise affordance annotations without human labeling, (2) leverages geometric information and vision foundation models to improve generalization, and (3) introduces a framework that facilitates affordance-oriented robotic manipulation such as tool grasping and robot-to-human tool handover. Experimental results show that our model surpasses the state-of-the-art by 13.8% in mIoU, and the framework achieves 77.1% successful grasping among 179 trials, including evaluations on seen, unseen classes, and cluttered scenes. Project page: https://reagan1311.github.io/affgrasp.
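The reported 13.8% improvement is measured in mean Intersection-over-Union (mIoU) over affordance segmentation maps. As a point of reference, below is a minimal sketch of how mIoU can be computed for integer-labelled affordance maps; the class count, array shapes, and random inputs are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np

def affordance_miou(pred, gt, num_classes):
    """Mean IoU over affordance classes for integer-labelled segmentation maps.

    pred, gt: (H, W) integer arrays of affordance class ids (0 = background).
    Classes absent from both maps are skipped so they do not distort the mean.
    """
    ious = []
    for c in range(1, num_classes):               # skip the background class
        pred_c, gt_c = (pred == c), (gt == c)
        union = np.logical_or(pred_c, gt_c).sum()
        if union == 0:                             # class absent in both maps
            continue
        inter = np.logical_and(pred_c, gt_c).sum()
        ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy usage on random maps with a hypothetical 4-class affordance label space
rng = np.random.default_rng(0)
pred = rng.integers(0, 4, size=(64, 64))
gt = rng.integers(0, 4, size=(64, 64))
print(f"mIoU: {affordance_miou(pred, gt, num_classes=4):.3f}")
```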
Related papers
- Generalizable Geometric Prior and Recurrent Spiking Feature Learning for Humanoid Robot Manipulation [90.90219129619344]
This paper presents R-prior-S, a novel Recurrent Geometric-prior-modal Policy with Spiking features. To ground high-level reasoning in physical reality, we leverage lightweight 2D geometric inductive biases. To address the data efficiency issue in robotic action generation, we introduce a Recursive Adaptive Spiking Network.
arXiv Detail & Related papers (2026-01-13T23:36:30Z) - A Survey on Efficient Vision-Language-Action Models [153.11669266922993]
Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models.
arXiv Detail & Related papers (2025-10-27T17:57:33Z) - O$^3$Afford: One-Shot 3D Object-to-Object Affordance Grounding for Generalizable Robotic Manipulation [8.1159855043566]
We address the challenge of object-to-object affordance grounding under limited data constraints. Inspired by recent advances in few-shot learning with 2D vision foundation models, we propose a novel one-shot 3D object-to-object affordance learning approach for robotic manipulation. Our experiments on 3D object-to-object affordance grounding and robotic manipulation demonstrate that O$^3$Afford significantly outperforms existing baselines in terms of both accuracy and generalization capability.
arXiv Detail & Related papers (2025-09-07T22:45:06Z) - Information-Theoretic Graph Fusion with Vision-Language-Action Model for Policy Reasoning and Dual Robotic Control [22.74768543283102]
Graph-Fused Vision-Language-Action (GF-VLA) is a framework that enables dual-arm robotic systems to perform task-level reasoning and execution. GF-VLA first extracts Shannon-information-based cues to identify hands and objects with the highest task relevance. A cross-hand selection policy then infers the optimal assignment without explicit geometric reasoning.
arXiv Detail & Related papers (2025-08-07T12:48:09Z) - Object Affordance Recognition and Grounding via Multi-scale Cross-modal Representation Learning [64.32618490065117]
A core problem of Embodied AI is to learn object manipulation from observation, as humans do. We propose a novel approach that learns an affordance-aware 3D representation and employs a stage-wise inference strategy. Experiments demonstrate the effectiveness of our method, showing improved performance in both affordance grounding and classification.
arXiv Detail & Related papers (2025-08-02T04:14:18Z) - Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos [66.62109400603394]
We introduce Being-H0, a dexterous Vision-Language-Action model trained on large-scale human videos. Our approach centers on physical instruction tuning, a novel training paradigm that combines large-scale VLA pretraining from human videos, physical space alignment for 3D reasoning, and post-training adaptation for robotic tasks. We empirically demonstrate the strength of Being-H0 in hand motion generation and instruction following, and show that it scales well with model and data size.
arXiv Detail & Related papers (2025-07-21T13:19:09Z) - Web2Grasp: Learning Functional Grasps from Web Images of Hand-Object Interactions [37.334138196925025]
Functional grasp is essential for enabling dexterous multi-finger robot hands to manipulate objects effectively. We propose extracting human grasp information from web images since they depict natural and functional object interactions. We show that these relatively low-quality HOI data from inexpensive web sources can effectively train a functional grasping model.
arXiv Detail & Related papers (2025-05-07T16:13:17Z) - A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear.
We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck.
Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z) - Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning [71.02843679746563]
In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature.
In this work, we aim to integrate the modeling of fine-grained hand-object dynamics into the video representation learning process.
We propose EgoVideo, a model with a new lightweight motion adapter to capture fine-grained hand-object motion information.
arXiv Detail & Related papers (2025-03-02T18:49:48Z) - ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration [10.558622685760346]
We present a simple yet effective approach for achieving object generalization through Vision-Language-Action models. Our method provides a lightweight and scalable way to inject knowledge about the target object. We evaluate ObjectVLA on a real robotic platform, demonstrating its ability to generalize across 100 novel objects with a 64% success rate.
arXiv Detail & Related papers (2025-02-26T15:56:36Z) - Affordance-Guided Reinforcement Learning via Visual Prompting [51.361977466993345]
Keypoint-based Affordance Guidance for Improvements (KAGI) is a method leveraging rewards shaped by vision-language models (VLMs) for autonomous RL.
On real-world manipulation tasks specified by natural language descriptions, KAGI improves the sample efficiency of autonomous RL and enables successful task completion in 30K online fine-tuning steps.
arXiv Detail & Related papers (2024-07-14T21:41:29Z) - HOIMotion: Forecasting Human Motion During Human-Object Interactions Using Egocentric 3D Object Bounding Boxes [10.237077867790612]
We present HOIMotion, a novel approach for human motion forecasting during human-object interactions.
Our method integrates information about past body poses and egocentric 3D object bounding boxes.
We show that HOIMotion consistently outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2024-07-02T19:58:35Z) - Information-driven Affordance Discovery for Efficient Robotic Manipulation [14.863105174430087]
We argue that well-directed interactions with the environment can mitigate this problem.
We provide a theoretical justification of our approach and we empirically validate the approach both in simulation and real-world tasks.
Our method, which we dub IDA, enables the efficient discovery of visual affordances for several action primitives.
arXiv Detail & Related papers (2024-05-06T21:25:51Z) - Ag2Manip: Learning Novel Manipulation Skills with Agent-Agnostic Visual and Action Representations [77.31328397965653]
We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting challenges through two key innovations.
A novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability.
An agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object.
arXiv Detail & Related papers (2024-04-26T16:40:17Z) - One-Shot Open Affordance Learning with Foundation Models [54.15857111929812]
We introduce One-shot Open Affordance Learning (OOAL), where a model is trained with just one example per base object category.
We propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings.
Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data.
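The alignment between visual features and affordance text embeddings described above can be illustrated with a small sketch: patch-level features from a frozen vision backbone are compared against affordance prompt embeddings by cosine similarity to produce a coarse per-affordance map. The feature dimensions and random inputs below are placeholders, not the OOAL architecture.

```python
import numpy as np

def affordance_map_from_text(patch_feats, text_embeds):
    """Cosine similarity between visual patch features and affordance text embeddings.

    patch_feats: (H, W, D) features from a vision foundation model (assumed);
    text_embeds: (C, D) affordance prompt embeddings. Returns an (H, W, C) map.
    """
    pf = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    te = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
    return pf @ te.T  # per-patch, per-affordance similarity scores

# Toy usage with random tensors standing in for real model outputs
rng = np.random.default_rng(0)
sim = affordance_map_from_text(rng.normal(size=(16, 16, 512)),
                               rng.normal(size=(3, 512)))
print(sim.argmax(axis=-1).shape)  # coarse per-patch affordance assignment
```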
arXiv Detail & Related papers (2023-11-29T16:23:06Z) - Human Activity Recognition Using Self-Supervised Representations of Wearable Data [0.0]
Development of accurate algorithms for human activity recognition (HAR) is hindered by the lack of large real-world labeled datasets.
Here we develop a 6-class HAR model with strong performance when evaluated on real-world datasets not seen during training.
arXiv Detail & Related papers (2023-04-26T07:33:54Z) - Adversarial Auto-Augment with Label Preservation: A Representation Learning Principle Guided Approach [95.74102207187545]
We show that a prior-free autonomous data augmentation's objective can be derived from a representation learning principle.
We then propose a practical surrogate to the objective that can be efficiently optimized and integrated seamlessly into existing methods.
arXiv Detail & Related papers (2022-11-02T02:02:51Z) - H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
arXiv Detail & Related papers (2022-10-22T18:39:33Z) - Sim-to-Real 6D Object Pose Estimation via Iterative Self-training for Robotic Bin-picking [98.5984733963713]
We propose an iterative self-training framework for sim-to-real 6D object pose estimation to facilitate cost-effective robotic grasping.
We establish a photo-realistic simulator to synthesize abundant virtual data, and use this to train an initial pose estimation network.
This network then takes the role of a teacher model, which generates pose predictions for unlabeled real data.
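The teacher/student cycle sketched in this abstract follows a generic self-training pattern: train on synthetic data, pseudo-label real data with the current model, keep the most confident predictions, and retrain. The toy callables below are stand-ins for the paper's pose network and confidence measure, not its actual implementation.

```python
import numpy as np

def self_train(train_fn, predict_fn, sim_data, real_inputs, rounds=3, keep=0.5):
    """Schematic iterative self-training loop for sim-to-real transfer.

    train_fn(labelled_pairs) -> model; predict_fn(model, x) -> (label, confidence).
    """
    model = train_fn(sim_data)                        # initial training on synthetic data
    for _ in range(rounds):
        preds = [(x, *predict_fn(model, x)) for x in real_inputs]
        preds.sort(key=lambda t: -t[2])               # rank pseudo-labels by confidence
        kept = [(x, y) for x, y, _ in preds[: int(keep * len(preds))]]
        model = train_fn(sim_data + kept)             # retrain on sim + confident real labels
    return model

# Toy usage: the "model" is just a mean pose and confidence is a distance score
rng = np.random.default_rng(0)
sim = [(rng.normal(size=3), rng.normal(size=3)) for _ in range(10)]
real = [rng.normal(size=3) for _ in range(10)]
train_fn = lambda data: np.mean([y for _, y in data], axis=0)
predict_fn = lambda m, x: (m, float(-np.linalg.norm(x - m)))
print(self_train(train_fn, predict_fn, sim, real).shape)
```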
arXiv Detail & Related papers (2022-04-14T15:54:01Z) - Understanding Egocentric Hand-Object Interactions from Hand Pose Estimation [24.68535915849555]
We propose a method to label a dataset of egocentric images in a pair-wise manner.
We also use the collected pairwise data to train our encoder-decoder style network, which has proven efficient for this task.
arXiv Detail & Related papers (2021-09-29T18:34:06Z) - One-Shot Object Affordance Detection in the Wild [76.46484684007706]
Affordance detection refers to identifying the potential action possibilities of objects in an image.
We devise a One-Shot Affordance Detection Network (OSAD-Net) that estimates the human action purpose and then transfers it to help detect the common affordance from all candidate images.
With complex scenes and rich annotations, our PADv2 dataset can be used as a test bed to benchmark affordance detection methods.
arXiv Detail & Related papers (2021-08-08T14:53:10Z) - Where is my hand? Deep hand segmentation for visual self-recognition in humanoid robots [129.46920552019247]
We propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask R-CNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
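Fine-tuning an off-the-shelf Mask R-CNN for a single foreground class follows a well-known torchvision recipe: swap the box and mask prediction heads for the new class count before training. The snippet below is a generic sketch of that recipe, not the paper's exact configuration.

```python
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

NUM_CLASSES = 2  # background + robot hand (assumed label space)

# COCO-pretrained backbone and heads; weights are downloaded on first use
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box classification head for the new number of classes
in_feats = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_feats, NUM_CLASSES)

# Replace the mask prediction head likewise
in_ch = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_ch, 256, NUM_CLASSES)
# The model is now ready to be fine-tuned on hand-segmentation data.
```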
arXiv Detail & Related papers (2021-02-09T10:34:32Z) - Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences arising from its use.