Planning-Guided Diffusion Policy Learning for Generalizable Contact-Rich Bimanual Manipulation
- URL: http://arxiv.org/abs/2412.02676v2
- Date: Fri, 14 Feb 2025 22:01:30 GMT
- Title: Planning-Guided Diffusion Policy Learning for Generalizable Contact-Rich Bimanual Manipulation
- Authors: Xuanlin Li, Tong Zhao, Xinghao Zhu, Jiuguang Wang, Tao Pang, Kuan Fang,
- Abstract summary: Generalizable Planning-Guided Diffusion Policy Learning (GLIDE) is an approach that learns to solve contact-rich bimanual manipulation tasks.<n>We propose a set of essential design options in feature extraction, task representation, action prediction, and data augmentation.<n>Our approach can enable a bimanual robotic system to effectively manipulate objects of diverse geometries, dimensions, and physical properties.
- Score: 16.244250979166214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contact-rich bimanual manipulation involves precise coordination of two arms to change object states through strategically selected contacts and motions. Due to the inherent complexity of these tasks, acquiring sufficient demonstration data and training policies that generalize to unseen scenarios remain a largely unresolved challenge. Building on recent advances in planning through contacts, we introduce Generalizable Planning-Guided Diffusion Policy Learning (GLIDE), an approach that effectively learns to solve contact-rich bimanual manipulation tasks by leveraging model-based motion planners to generate demonstration data in high-fidelity physics simulation. Through efficient planning in randomized environments, our approach generates large-scale and high-quality synthetic motion trajectories for tasks involving diverse objects and transformations. We then train a task-conditioned diffusion policy via behavior cloning using these demonstrations. To tackle the sim-to-real gap, we propose a set of essential design options in feature extraction, task representation, action prediction, and data augmentation that enable learning robust prediction of smooth action sequences and generalization to unseen scenarios. Through experiments in both simulation and the real world, we demonstrate that our approach can enable a bimanual robotic system to effectively manipulate objects of diverse geometries, dimensions, and physical properties. Website: https://glide-manip.github.io/
Related papers
- Latent Diffusion Planning for Imitation Learning [78.56207566743154]
Latent Diffusion Planning (LDP) is a modular approach consisting of a planner and inverse dynamics model.
By separating planning from action prediction, LDP can benefit from the denser supervision signals of suboptimal and action-free data.
On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches.
arXiv Detail & Related papers (2025-04-23T17:53:34Z) - Learning Coordinated Bimanual Manipulation Policies using State Diffusion and Inverse Dynamics Models [22.826115023573205]
We infuse the predictive nature of human manipulation strategies into robot imitation learning.
We train a diffusion model to predict future states and compute robot actions that achieve the predicted states.
Our framework consistently outperforms state-of-the-art state-to-action mapping policies.
arXiv Detail & Related papers (2025-03-30T01:25:35Z) - Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences.
Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations.
Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z) - Geometrically-Aware One-Shot Skill Transfer of Category-Level Objects [18.978751760636563]
We propose a new skill transfer framework, which enables a robot to transfer complex object manipulation skills and constraints from a single human demonstration.
Our approach addresses the challenge of skill acquisition and task execution by deriving geometric representations from demonstrations focusing on object-centric interactions.
We validate the effectiveness and adaptability of our approach through extensive experiments, demonstrating successful skill transfer and task execution in diverse real-world environments without requiring additional training.
arXiv Detail & Related papers (2025-03-19T16:10:17Z) - Physics-Driven Data Generation for Contact-Rich Manipulation via Trajectory Optimization [22.234170426206987]
We present a low-cost data generation pipeline that integrates physics-based simulation, human demonstrations, and model-based planning.
We validate the pipeline's effectiveness by training diffusion policies for challenging contact-rich manipulation tasks.
The trained policies are deployed zero-shot on hardware for bimanual iiwa arms, achieving high success rates with minimal human input.
arXiv Detail & Related papers (2025-02-27T18:56:01Z) - Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs [38.281562732050084]
GenSim2 is a scalable framework for complex and realistic simulation task creation.
The pipeline can generate data for up to 100 articulated tasks with 200 objects and reduce the required human efforts.
We show a promising usage of GenSim2 that the generated data can be used for zero-shot transfer or co-train with real-world collected data.
arXiv Detail & Related papers (2024-10-04T17:51:33Z) - Learning Task Planning from Multi-Modal Demonstration for Multi-Stage Contact-Rich Manipulation [26.540648608911308]
In this paper, we introduce an in-context learning framework that incorporates tactile and force-torque information from human demonstrations.
We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan.
This task plan is then used as a reference for planning in new task configurations.
arXiv Detail & Related papers (2024-09-18T10:36:47Z) - Learning Extrinsic Dexterity with Parameterized Manipulation Primitives [8.7221770019454]
We learn a sequence of actions that utilize the environment to change the object's pose.
Our approach can control the object's state through exploiting interactions between the object, the gripper, and the environment.
We evaluate our approach on picking box-shaped objects of various weight, shape, and friction properties from a constrained table-top workspace.
arXiv Detail & Related papers (2023-10-26T21:28:23Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Model-Based Reinforcement Learning with Multi-Task Offline Pretraining [59.82457030180094]
We present a model-based RL method that learns to transfer potentially useful dynamics and action demonstrations from offline data to a novel task.
The main idea is to use the world models not only as simulators for behavior learning but also as tools to measure the task relevance.
We demonstrate the advantages of our approach compared with the state-of-the-art methods in Meta-World and DeepMind Control Suite.
arXiv Detail & Related papers (2023-06-06T02:24:41Z) - Inferring Versatile Behavior from Demonstrations by Matching Geometric
Descriptors [72.62423312645953]
Humans intuitively solve tasks in versatile ways, varying their behavior in terms of trajectory-based planning and for individual steps.
Current Imitation Learning algorithms often only consider unimodal expert demonstrations and act in a state-action-based setting.
Instead, we combine a mixture of movement primitives with a distribution matching objective to learn versatile behaviors that match the expert's behavior and versatility.
arXiv Detail & Related papers (2022-10-17T16:42:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.