AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation
- URL: http://arxiv.org/abs/2510.01433v1
- Date: Wed, 01 Oct 2025 20:13:39 GMT
- Title: AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation
- Authors: Anukriti Singh, Kasra Torshizi, Khuzema Habib, Kelin Yu, Ruohan Gao, Pratap Tokekar
- Abstract summary: AFFORD2ACT is an affordance-guided framework that distills a minimal set of semantic 2D keypoints from a text prompt and a single image. It consistently improves data efficiency, achieving an 82% success rate on unseen objects, novel categories, backgrounds, and distractors.
- Score: 19.253841162440267
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-based robot learning often relies on dense image or point-cloud inputs, which are computationally heavy and entangle irrelevant background features. Existing keypoint-based approaches can focus on manipulation-centric features and be lightweight, but they either depend on manual heuristics or task-coupled selection, limiting scalability and semantic understanding. To address this, we propose AFFORD2ACT, an affordance-guided framework that distills a minimal set of semantic 2D keypoints from a text prompt and a single image. AFFORD2ACT follows a three-stage pipeline: affordance filtering, category-level keypoint construction, and transformer-based policy learning with embedded gating to reason about the most relevant keypoints. This yields a compact 38-dimensional state policy that can be trained in 15 minutes and performs well in real time without proprioception or dense representations. Across diverse real-world manipulation tasks, AFFORD2ACT consistently improves data efficiency, achieving an 82% success rate on unseen objects, novel categories, backgrounds, and distractors.
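The three-stage pipeline described in the abstract can be sketched very loosely in plain NumPy. Everything below is a hypothetical stand-in: the affordance scores would actually come from a vision-language affordance model, and the gating would be the paper's learned transformer gate rather than a fixed softmax; only the compact 38-dimensional state size (19 keypoints x 2 coordinates) is taken from the paper.

```python
import numpy as np

def affordance_filter(keypoints, scores, k):
    """Keep the k candidate keypoints with the highest affordance scores.
    (Illustrative stand-in for the affordance-filtering stage; the real
    scores come from a text-prompted vision-language model.)"""
    idx = np.argsort(scores)[::-1][:k]
    return keypoints[idx]

def gated_state(keypoints, gate_logits):
    """Soft-gate the selected 2D keypoints and flatten them into a compact
    state vector (19 keypoints x 2 coords = 38 dims, matching the paper's
    reported state size). A plain softmax stands in for the paper's
    learned embedded gating."""
    gates = np.exp(gate_logits - gate_logits.max())
    gates /= gates.sum()
    return (keypoints * gates[:, None]).reshape(-1)

rng = np.random.default_rng(0)
candidates = rng.random((50, 2))      # 50 candidate 2D keypoints (hypothetical)
scores = rng.random(50)               # hypothetical affordance relevance scores
selected = affordance_filter(candidates, scores, k=19)
state = gated_state(selected, rng.random(19))
print(state.shape)                    # (38,)
```

A policy network would then map this 38-dimensional state to actions; keeping the state this small is what allows the reported 15-minute training time.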
Related papers
- ContextFusion and Bootstrap: An Effective Approach to Improve Slot Attention-Based Object-Centric Learning [53.19029595226767]
The slot attention-based framework has emerged as a leading approach in object-centric learning.
Current methods require a stable feature space throughout training to enable reconstruction from slots.
We propose a novel ContextFusion stage and a Bootstrap Branch, both of which can be seamlessly integrated into existing slot attention models.
arXiv Detail & Related papers (2025-09-02T07:19:25Z)
- Multi-Keypoint Affordance Representation for Functional Dexterous Grasping [26.961157077703756]
We propose a multi-keypoint affordance representation for functional dexterous grasping.
Our method encodes task-driven grasp configurations by localizing functional contact points.
Our method significantly improves affordance localization accuracy, grasp consistency, and generalization to unseen tools and tasks.
arXiv Detail & Related papers (2025-02-27T11:54:53Z)
- CordViP: Correspondence-based Visuomotor Policy for Dexterous Manipulation in Real-World [20.52894595103719]
CordViP is a novel framework that constructs and learns correspondences by leveraging the robust 6D pose estimation of objects and robot proprioception.
Our method demonstrates exceptional dexterous manipulation capabilities, achieving state-of-the-art performance in six real-world tasks.
arXiv Detail & Related papers (2025-02-12T14:41:14Z)
- UniPLV: Towards Label-Efficient Open-World 3D Scene Understanding by Regional Visual Language Supervision [10.587237925455211]
We present UniPLV, a robust framework that unifies point clouds, images, and text within a single learning paradigm for comprehensive 3D scene understanding.
We show that UniPLV significantly surpasses state-of-the-art methods, with average improvements of 15.6% and 14.8% in semantic segmentation for the Base-Annotated and Annotation-Free tasks, respectively.
arXiv Detail & Related papers (2024-12-24T03:40:05Z)
- Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way of capturing essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z)
- VAPO: Visibility-Aware Keypoint Localization for Efficient 6DoF Object Pose Estimation [52.81869878956534]
Localizing 3D keypoints in a 2D image is an effective way to establish 3D-2D correspondences for instance-level 6DoF object pose estimation.
In this paper, we address this issue by localizing the important keypoints in terms of visibility.
We construct VAPO (Visibility-Aware POse estimator) by integrating the visibility-aware importance with a state-of-the-art pose estimation algorithm.
arXiv Detail & Related papers (2024-03-21T16:59:45Z)
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for dealing with 3D scene understanding when the labeled scenes are quite limited.
To extract knowledge for novel categories from pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Feature Decoupling-Recycling Network for Fast Interactive Segmentation [79.22497777645806]
Recent interactive segmentation methods iteratively take the source image, user guidance, and the previously predicted mask as input.
We propose the Feature Decoupling-Recycling Network (FDRN), which decouples the modeling components based on their intrinsic discrepancies.
arXiv Detail & Related papers (2023-08-07T12:26:34Z)
- Few-Shot Keypoint Detection as Task Adaptation via Latent Embeddings [17.04471874483516]
Existing approaches either compute dense keypoint embeddings in a single forward pass, or allocate their full capacity to a sparse set of points.
In this paper we explore a middle ground based on the observation that the number of relevant points at a given time is typically relatively small.
Our main contribution is a novel architecture, inspired by few-shot task adaptation, which allows a sparse-style network to condition on a keypoint embedding.
arXiv Detail & Related papers (2021-12-09T13:25:42Z)
- S3K: Self-Supervised Semantic Keypoints for Robotic Manipulation via Multi-View Consistency [11.357804868755155]
We advocate semantic 3D keypoints as a visual representation, and present a semi-supervised training objective.
Unlike local texture-based approaches, our model integrates contextual information from a large area.
We demonstrate that this ability to locate semantic keypoints enables high level scripting of human understandable behaviours.
arXiv Detail & Related papers (2020-09-30T14:44:54Z)
- Towards High Performance Human Keypoint Detection [87.1034745775229]
We find that context information plays an important role in reasoning about human body configuration and invisible keypoints.
Inspired by this, we propose a cascaded context mixer (CCM) which efficiently integrates spatial and channel context information.
To maximize CCM's representation capability, we develop a hard-negative person detection mining strategy and a joint-training strategy.
We present several sub-pixel refinement techniques for postprocessing keypoint predictions to improve detection accuracy.
arXiv Detail & Related papers (2020-02-03T02:24:51Z)
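The last entry above mentions sub-pixel refinement of heatmap keypoint predictions. One widely used refinement technique, which is not necessarily one of that paper's, shifts the heatmap argmax a quarter pixel toward the higher-valued neighbour along each axis; a minimal sketch:

```python
import numpy as np

def refine_keypoint(heatmap):
    """Sub-pixel refinement of a heatmap keypoint: take the integer argmax
    and shift it 0.25 px toward the higher-valued neighbour along each
    axis. (A common post-processing trick in human keypoint detection;
    shown here only as an illustration of the general idea.)"""
    h, w = heatmap.shape
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    fy, fx = float(y), float(x)
    if 0 < x < w - 1:
        fx += 0.25 * np.sign(heatmap[y, x + 1] - heatmap[y, x - 1])
    if 0 < y < h - 1:
        fy += 0.25 * np.sign(heatmap[y + 1, x] - heatmap[y - 1, x])
    return float(fy), float(fx)

hm = np.zeros((5, 5))
hm[2, 2] = 1.0
hm[2, 3] = 0.5                 # pulls the peak a quarter pixel toward x = 3
print(refine_keypoint(hm))     # (2.0, 2.25)
```

The quarter-pixel shift compensates for the systematic quantization error introduced by taking an integer argmax on a low-resolution heatmap.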
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.