Related papers: ImaginationPolicy: Towards Generalizable, Precise and Reliable End-to-End Policy for Robotic Manipulation

ImaginationPolicy: Towards Generalizable, Precise and Reliable End-to-End Policy for Robotic Manipulation

URL: http://arxiv.org/abs/2509.20841v1
Date: Thu, 25 Sep 2025 07:29:07 GMT
Title: ImaginationPolicy: Towards Generalizable, Precise and Reliable End-to-End Policy for Robotic Manipulation
Authors: Dekun Lu, Wei Gao, Kui Jia,
Abstract summary: We propose a novel Chain of Moving Oriented Keypoints (CoMOK) formulation for robotic manipulation.<n>Our formulation is used as the action representation of a neural policy, which can be trained in an end-to-end fashion.
Score: 46.06124092071133
License: http://creativecommons.org/licenses/by/4.0/
Abstract: End-to-end robot manipulation policies offer significant potential for enabling embodied agents to understand and interact with the world. Unlike traditional modular pipelines, end-to-end learning mitigates key limitations such as information loss between modules and feature misalignment caused by isolated optimization targets. Despite these advantages, existing end-to-end neural networks for robotic manipulation--including those based on large VLM/VLA models--remain insufficiently performant for large-scale practical deployment. In this paper, we take a step towards an end-to-end manipulation policy that is generalizable, accurate and reliable. To achieve this goal, we propose a novel Chain of Moving Oriented Keypoints (CoMOK) formulation for robotic manipulation. Our formulation is used as the action representation of a neural policy, which can be trained in an end-to-end fashion. Such an action representation is general, as it extends the standard end-effector pose action representation and supports a diverse set of manipulation tasks in a unified manner. The oriented keypoint in our method enables natural generalization to objects with different shapes and sizes, while achieving sub-centimeter accuracy. Moreover, our formulation can easily handle multi-stage tasks, multi-modal robot behaviors, and deformable objects. Extensive simulated and hardware experiments demonstrate the effectiveness of our method.

Related papers

Multi-Modal Manipulation via Multi-Modal Policy Consensus [62.49978559936122]
We propose a new approach to integrate diverse sensory modalities for robotic manipulation.<n>Our method factorizes the policy into a set of diffusion models, each specialized for a single representation.<n>We evaluate our approach on simulated manipulation tasks in RLBench, as well as real-world tasks such as occluded object picking, in-hand spoon reorientation, and puzzle insertion.
arXiv Detail & Related papers (2025-09-27T19:43:04Z)
Dexplore: Scalable Neural Control for Dexterous Manipulation from Reference-Scoped Exploration [58.4036440289082]
Hand-object motion-capture (MoCap) offer large-scale, contact-rich demonstrations and hold promise for dexterous robotic scopes.<n>We introduce Dexplore, a unified single-loop optimization that performs repositories and tracking to learn robot control policies directly from MoCap at scale.
arXiv Detail & Related papers (2025-09-11T17:59:07Z)
Sim-to-Real Reinforcement Learning for Vision-Based Dexterous Manipulation on Humanoids [56.892520712892804]
We introduce a practical sim-to-real RL recipe that trains a humanoid robot to perform three dexterous manipulation tasks.<n>We demonstrate high success rates on unseen objects and robust, adaptive policy behaviors.
arXiv Detail & Related papers (2025-02-27T18:59:52Z)
Track2Act: Predicting Point Tracks from Internet Videos enables Generalizable Robot Manipulation [65.46610405509338]
We seek to learn a generalizable goal-conditioned policy that enables zero-shot robot manipulation. Our framework,Track2Act predicts tracks of how points in an image should move in future time-steps based on a goal. We show that this approach of combining scalably learned track prediction with a residual policy enables diverse generalizable robot manipulation.
arXiv Detail & Related papers (2024-05-02T17:56:55Z)
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation [22.071450379253235]
We introduce an innovative approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs) By fine-tuning the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLMs while equipping them with the ability for manipulation. Experiments in simulator and real-world show the promising performance of ManipLLM.
arXiv Detail & Related papers (2023-12-24T06:38:11Z)
Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models.<n>Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.<n>Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
GAN-MPC: Training Model Predictive Controllers with Parameterized Cost Functions using Demonstrations from Non-identical Experts [14.291720751625585]
We propose a generative adversarial network (GAN) to minimize the Jensen-Shannon divergence between the state-trajectory distributions of the demonstrator and the imitator. We evaluate our approach on a variety of simulated robotics tasks of DeepMind Control suite.
arXiv Detail & Related papers (2023-05-30T15:15:30Z)
Programmatically Grounded, Compositionally Generalizable Robotic Manipulation [35.12811184353626]
We show that the conventional pretraining-finetuning pipeline for integrating semantic representations entangles the learning of domain-specific action information. We propose a modular approach to better leverage pretrained models by exploiting the syntactic and semantic structures of language instructions. Our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors.
arXiv Detail & Related papers (2023-04-26T20:56:40Z)
Canonical mapping as a general-purpose object descriptor for robotic manipulation [0.0]
We propose using canonical mapping as a near-universal and flexible object descriptor. We demonstrate that common object representations can be derived from a single pre-trained canonical mapping model. We perform a multi-stage experiment using two robot arms that demonstrate the robustness of the perception approach.
arXiv Detail & Related papers (2023-03-02T15:09:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.