FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation
- URL: http://arxiv.org/abs/2509.19102v1
- Date: Tue, 23 Sep 2025 14:49:05 GMT
- Title: FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation
- Authors: Hongli Xu, Lei Zhang, Xiaoyue Hu, Boyang Zhong, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang
- Abstract summary: We introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks. These chunks focus policy learning on the actions themselves, rather than isolated tasks. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment.
- Score: 25.631729484747087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. We therefore introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.
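For concreteness, the following is a minimal Python sketch (using NumPy) of the two mechanisms the abstract describes: an (actor, verb, object) action chunk and trajectory transfer through a shared functional frame. Every identifier here (ActionChunk, transfer_trajectory, T_world_func, and the toy poses) is a hypothetical illustration under our own assumptions, not the authors' released code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ActionChunk:
    actor: str  # e.g. "right_gripper"
    verb: str   # e.g. "pour"
    obj: str    # e.g. "teapot"

def transfer_trajectory(traj_func: np.ndarray, T_world_func: np.ndarray) -> np.ndarray:
    """Map (N, 4, 4) end-effector poses expressed in an object's functional
    frame into the world frame of a new object instance.

    T_world_func is the 4x4 pose of the new instance's functional frame in
    the world, e.g. estimated from VLM affordance cues (spout, handle, ...).
    """
    return np.einsum("ij,njk->nik", T_world_func, traj_func)

# Example: a "pour" chunk recorded on a canonical teapot, replayed on a new
# teapot whose functional frame sits at x = 0.5 m, rotated 90 degrees about z.
chunk = ActionChunk(actor="right_gripper", verb="pour", obj="teapot")
traj_func = np.tile(np.eye(4), (10, 1, 1))  # placeholder demonstration poses
c, s = np.cos(np.pi / 2), np.sin(np.pi / 2)
T_world_func = np.array([[c, -s, 0.0, 0.5],
                         [s,  c, 0.0, 0.0],
                         [0.0, 0.0, 1.0, 0.0],
                         [0.0, 0.0, 0.0, 1.0]])
traj_world = transfer_trajectory(traj_func, T_world_func)
assert traj_world.shape == (10, 4, 4)
```

The point of the functional frame is that the demonstration is stored relative to the object's affordance (e.g., a teapot's spout) rather than the world or camera frame; any new instance whose functional frame can be estimated, for example from VLM affordance cues, then inherits the recorded behavior.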
Related papers
- ImaginationPolicy: Towards Generalizable, Precise and Reliable End-to-End Policy for Robotic Manipulation [46.06124092071133]
We propose a novel Chain of Moving Oriented Keypoints (CoMOK) formulation for robotic manipulation. Our formulation is used as the action representation of a neural policy, which can be trained in an end-to-end fashion.
arXiv Detail & Related papers (2025-09-25T07:29:07Z)
- MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence [18.953496415412335]
Imitating tool manipulation from human videos offers an intuitive approach to teaching robots. We propose MimicFunc, a framework that establishes functional correspondences via a function frame. MimicFunc effectively enables the robot to generalize the skill from a single RGB-D human video to manipulating novel tools.
arXiv Detail & Related papers (2025-08-19T05:49:47Z)
- Adaptive Articulated Object Manipulation On The Fly with Foundation Model Reasoning and Part Grounding [18.52792284421002]
Articulated objects pose diverse manipulation challenges for robots. Since their internal structures are not directly observable, robots must adaptively explore and refine actions to generate successful manipulation trajectories. AdaRPG is a novel framework that leverages foundation models to extract object parts, which exhibit greater local geometric similarity than entire objects.
arXiv Detail & Related papers (2025-07-24T10:25:58Z)
- Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way of capturing essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Learning from 10 Demos: Generalisable and Sample-Efficient Policy Learning with Oriented Affordance Frames [10.738838923944876]
Existing methods require a substantial number of demonstrations to cover possible task variations. We introduce oriented affordance frames, a structured representation for state and action spaces. We show how this abstraction allows for compositional generalisation of independently trained sub-policies. We validate our method across three real-world tasks, each requiring multi-step, multi-object interactions.
arXiv Detail & Related papers (2024-10-15T23:57:35Z)
- Learning Reusable Manipulation Strategies [86.07442931141634]
Humans demonstrate an impressive ability to acquire and generalize manipulation "tricks".
We present a framework that enables machines to acquire such manipulation skills through a single demonstration and self-play.
These learned mechanisms and samplers can be seamlessly integrated into standard task and motion planners.
arXiv Detail & Related papers (2023-11-06T17:35:42Z)
- Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs [53.66070434419739]
Generalizable articulated object manipulation is essential for home-assistant robots.
We propose a kinematic-aware prompting framework that prompts Large Language Models with kinematic knowledge of objects to generate low-level motion waypoints.
Our framework outperforms traditional methods on 8 seen categories and shows powerful zero-shot capability on 8 unseen articulated object categories.
arXiv Detail & Related papers (2023-11-06T03:26:41Z)
- Programmatically Grounded, Compositionally Generalizable Robotic Manipulation [35.12811184353626]
We show that the conventional pretraining-finetuning pipeline for integrating semantic representations entangles the learning of domain-specific action information.
We propose a modular approach to better leverage pretrained models by exploiting the syntactic and semantic structures of language instructions.
Our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors.
arXiv Detail & Related papers (2023-04-26T20:56:40Z)
- Inferring Versatile Behavior from Demonstrations by Matching Geometric Descriptors [72.62423312645953]
Humans intuitively solve tasks in versatile ways, varying their behavior both in trajectory-based planning and in individual steps.
Current Imitation Learning algorithms often only consider unimodal expert demonstrations and act in a state-action-based setting.
Instead, we combine a mixture of movement primitives with a distribution matching objective to learn versatile behaviors that match the expert's behavior and versatility.
arXiv Detail & Related papers (2022-10-17T16:42:59Z)
- Plug and Play, Model-Based Reinforcement Learning [60.813074750879615]
We introduce an object-based representation that allows zero-shot integration of new objects from known object classes.
This is achieved by representing the global transition dynamics as a union of local transition functions, as sketched below this entry.
Experiments show that our representation can achieve sample-efficient learning in a variety of setups.
arXiv Detail & Related papers (2021-08-20T01:20:15Z)
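As an illustration of the last entry's "union of local transition functions" idea, here is a minimal, hypothetical Python sketch: each object class gets its own local transition model, and the global step simply applies the model registered for each object's class. The toy dynamics, class names, and all identifiers below are assumptions for illustration, not the paper's implementation.

```python
from typing import Callable, Dict

State = Dict[str, float]                      # per-object scalar state, kept simple
LocalModel = Callable[[float, float], float]  # (object state, action) -> next state

# One local transition model per known object class (toy dynamics).
LOCAL_MODELS: Dict[str, LocalModel] = {
    "door":   lambda s, a: min(1.0, max(0.0, s + 0.1 * a)),  # hinged joint opens gradually
    "button": lambda s, a: 1.0 if a > 0.5 else s,            # latching switch
}

def global_step(state: State, classes: Dict[str, str], action: float) -> State:
    """The global transition is the union of the per-class local ones:
    each object is advanced by the model registered for its class."""
    return {name: LOCAL_MODELS[classes[name]](s, action) for name, s in state.items()}

# A second door added at test time reuses the trained "door" model unchanged.
state = {"door_1": 0.0, "button_1": 0.0, "door_2": 0.3}
classes = {"door_1": "door", "button_1": "button", "door_2": "door"}
print(global_step(state, classes, action=0.8))
```

Under this factorization, adding a new object of a known class at test time requires no retraining, which matches the zero-shot integration the entry describes.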