AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation
- URL: http://arxiv.org/abs/2510.01433v1
- Date: Wed, 01 Oct 2025 20:13:39 GMT
- Title: AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation
- Authors: Anukriti Singh, Kasra Torshizi, Khuzema Habib, Kelin Yu, Ruohan Gao, Pratap Tokekar
- Abstract summary: AFFORD2ACT is an affordance-guided framework that distills a minimal set of semantic 2D keypoints from a text prompt and a single image. It consistently improves data efficiency, achieving an 82% success rate on unseen objects, novel categories, backgrounds, and distractors.
- Score: 19.253841162440267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-based robot learning often relies on dense image or point-cloud inputs, which are computationally heavy and entangle irrelevant background features. Existing keypoint-based approaches can focus on manipulation-centric features and be lightweight, but they either depend on manual heuristics or task-coupled selection, limiting scalability and semantic understanding. To address this, we propose AFFORD2ACT, an affordance-guided framework that distills a minimal set of semantic 2D keypoints from a text prompt and a single image. AFFORD2ACT follows a three-stage pipeline: affordance filtering, category-level keypoint construction, and transformer-based policy learning with embedded gating to reason about the most relevant keypoints. This yields a compact 38-dimensional state policy that can be trained in 15 minutes and performs well in real time without proprioception or dense representations. Across diverse real-world manipulation tasks, AFFORD2ACT consistently improves data efficiency, achieving an 82% success rate on unseen objects, novel categories, backgrounds, and distractors.
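The three-stage pipeline described in the abstract can be sketched very loosely in plain NumPy. Everything below is a hypothetical stand-in: the affordance scores would actually come from a vision-language affordance model, and the gating would be the paper's learned transformer gate rather than a fixed softmax; only the compact 38-dimensional state size (19 keypoints x 2 coordinates) is taken from the paper.

```python
import numpy as np

def affordance_filter(keypoints, scores, k):
    """Keep the k candidate keypoints with the highest affordance scores.
    (Illustrative stand-in for the affordance-filtering stage; the real
    scores come from a text-prompted vision-language model.)"""
    idx = np.argsort(scores)[::-1][:k]
    return keypoints[idx]

def gated_state(keypoints, gate_logits):
    """Soft-gate the selected 2D keypoints and flatten them into a compact
    state vector (19 keypoints x 2 coords = 38 dims, matching the paper's
    reported state size). A plain softmax stands in for the paper's
    learned embedded gating."""
    gates = np.exp(gate_logits - gate_logits.max())
    gates /= gates.sum()
    return (keypoints * gates[:, None]).reshape(-1)

rng = np.random.default_rng(0)
candidates = rng.random((50, 2))      # 50 candidate 2D keypoints (hypothetical)
scores = rng.random(50)               # hypothetical affordance relevance scores
selected = affordance_filter(candidates, scores, k=19)
state = gated_state(selected, rng.random(19))
print(state.shape)                    # (38,)
```

A policy network would then map this 38-dimensional state to actions; keeping the state this small is what allows the reported 15-minute training time.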
Related papers
- ContextFusion and Bootstrap: An Effective Approach to Improve Slot Attention-Based Object-Centric Learning [53.19029595226767]
The slot attention-based framework has emerged as a leading approach in object-centric learning.
Current methods require a stable feature space throughout training to enable reconstruction from slots.
We propose a novel ContextFusion stage and a Bootstrap Branch, both of which can be seamlessly integrated into existing slot attention models.
arXiv Detail & Related papers (2025-09-02T07:19:25Z)
- Multi-Keypoint Affordance Representation for Functional Dexterous Grasping [26.961157077703756]
We propose a multi-keypoint affordance representation for functional dexterous grasping.
Our method encodes task-driven grasp configurations by localizing functional contact points.
Our method significantly improves affordance localization accuracy, grasp consistency, and generalization to unseen tools and tasks.
arXiv Detail & Related papers (2025-02-27T11:54:53Z)
- CordViP: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World [20.52894595103719]
CordViP is a novel framework that constructs and learns correspondences by leveraging the robust 6D pose estimation of objects and robot proprioception.
Our method demonstrates exceptional dexterous manipulation capabilities, achieving state-of-the-art performance in six real-world tasks.
arXiv Detail & Related papers (2025-02-12T14:41:14Z)
- UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision [10.587237925455211]
We present UniPLV, a robust framework that unifies point clouds, images, and text within a single learning paradigm for comprehensive 3D scene understanding.
We show that UniPLV significantly surpasses state-of-the-art methods, with average improvements of 15.6% and 14.8% in semantic segmentation for the Base-Annotated and Annotation-Free tasks, respectively.
arXiv Detail & Related papers (2024-12-24T03:40:05Z)
- Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way of capturing essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z)
- VAPO: Visibility-Aware Keypoint Localization for Efficient 6DoF Object Pose Estimation [52.81869878956534]
Localizing 3D keypoints in a 2D image is an effective way to establish 3D-2D correspondences for instance-level 6DoF object pose estimation.
In this paper, we address this issue by localizing the important keypoints in terms of visibility.
We construct VAPO (Visibility-Aware POse estimator) by integrating the visibility-aware importance with a state-of-the-art pose estimation algorithm.
arXiv Detail & Related papers (2024-03-21T16:59:45Z)
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Feature Decoupling-Recycling Network for Fast Interactive Segmentation [79.22497777645806]
Recent interactive segmentation methods iteratively take the source image, user guidance, and the previously predicted mask as input.
We propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies.
arXiv Detail & Related papers (2023-08-07T12:26:34Z)
- Few-Shot Keypoint Detection as Task Adaptation via Latent Embeddings [17.04471874483516]
Existing approaches either compute dense keypoint embeddings in a single forward pass, or allocate their full capacity to a sparse set of points.
In this paper we explore a middle ground based on the observation that the number of relevant points at a given time is typically relatively small.
Our main contribution is a novel architecture, inspired by few-shot task adaptation, which allows a sparse-style network to condition on a keypoint embedding.
arXiv Detail & Related papers (2021-12-09T13:25:42Z)
- S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via Multi-View Consistency [11.357804868755155]
We advocate semantic 3D keypoints as a visual representation, and present a semi-supervised training objective.
Unlike local texture-based approaches, our model integrates contextual information from a large area.
We demonstrate that this ability to locate semantic keypoints enables high level scripting of human understandable behaviours.
arXiv Detail & Related papers (2020-09-30T14:44:54Z)
- Towards High Performance Human Keypoint Detection [87.1034745775229]
We find that context information plays an important role in reasoning about human body configuration and invisible keypoints.
Inspired by this, we propose a cascaded context mixer (CCM) which efficiently integrates spatial and channel context information.
To maximize CCM's representation capability, we develop a hard-negative person detection mining strategy and a joint-training strategy.
We present several sub-pixel refinement techniques for postprocessing keypoint predictions to improve detection accuracy.
arXiv Detail & Related papers (2020-02-03T02:24:51Z)
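The last entry above mentions sub-pixel refinement of heatmap keypoint predictions. One widely used refinement technique, which is not necessarily one of that paper's, shifts the heatmap argmax a quarter pixel toward the higher-valued neighbour along each axis; a minimal sketch:

```python
import numpy as np

def refine_keypoint(heatmap):
    """Sub-pixel refinement of a heatmap keypoint: take the integer argmax
    and shift it 0.25 px toward the higher-valued neighbour along each
    axis. (A common post-processing trick in human keypoint detection;
    shown here only as an illustration of the general idea.)"""
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    fy, fx = float(y), float(x)
    if 0 < x < w - 1:
        fx += 0.25 * np.sign(heatmap[y, x + 1] - heatmap[y, x - 1])
    if 0 < y < h - 1:
        fy += 0.25 * np.sign(heatmap[y + 1, x] - heatmap[y - 1, x])
    return float(fy), float(fx)

hm = np.zeros((5, 5))
hm[2, 2] = 1.0
hm[2, 3] = 0.5                 # pulls the peak a quarter pixel toward x = 3
print(refine_keypoint(hm))     # (2.0, 2.25)
```

The quarter-pixel shift compensates for the systematic quantization error introduced by taking an integer argmax on a low-resolution heatmap.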
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.