Related papers: Programmatically Grounded, Compositionally Generalizable Robotic Manipulation

Programmatically Grounded, Compositionally Generalizable Robotic Manipulation

URL: http://arxiv.org/abs/2304.13826v1
Date: Wed, 26 Apr 2023 20:56:40 GMT
Title: Programmatically Grounded, Compositionally Generalizable Robotic Manipulation
Authors: Renhao Wang, Jiayuan Mao, Joy Hsu, Hang Zhao, Jiajun Wu, Yang Gao
Abstract summary: We show that the conventional pretraining-finetuning pipeline for integrating semantic representations entangles the learning of domain-specific action information. We propose a modular approach to better leverage pretrained models by exploiting the syntactic and semantic structures of language instructions. Our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors.
Score: 35.12811184353626
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Robots operating in the real world require both rich manipulation skills as well as the ability to semantically reason about when to apply those skills. Towards this goal, recent works have integrated semantic representations from large-scale pretrained vision-language (VL) models into manipulation models, imparting them with more general reasoning capabilities. However, we show that the conventional pretraining-finetuning pipeline for integrating such representations entangles the learning of domain-specific action information and domain-general visual information, leading to less data-efficient training and poor generalization to unseen objects and tasks. To this end, we propose ProgramPort, a modular approach to better leverage pretrained VL models by exploiting the syntactic and semantic structures of language instructions. Our framework uses a semantic parser to recover an executable program, composed of functional modules grounded on vision and action across different modalities. Each functional module is realized as a combination of deterministic computation and learnable neural networks. Program execution produces parameters to general manipulation primitives for a robotic end-effector. The entire modular network can be trained with end-to-end imitation learning objectives. Experiments show that our model successfully disentangles action and perception, translating to improved zero-shot and compositional generalization in a variety of manipulation behaviors. Project webpage at: \url{https://progport.github.io}.

Related papers

Improving Generalization of Language-Conditioned Robot Manipulation [29.405161073483175]
We present a framework that learns object-arrangement tasks from just a few demonstrations.<n>We validate our method on both simulation and real-world robotic environments.
arXiv Detail & Related papers (2025-08-04T13:29:26Z)
$π_{0.5}$: a Vision-Language-Action Model with Open-World Generalization [81.73746512639283]
We describe a new model based on $pi_0.5$ that uses co-training on heterogeneous tasks to enable broad generalization. We demonstrate for the first time that an end-to-end learning-enabled robotic system can perform long-horizon and dexterous manipulation skills.
arXiv Detail & Related papers (2025-04-22T17:31:29Z)
RoboAct-CLIP: Video-Driven Pre-training of Atomic Action Understanding for Robotics [22.007302996282085]
This paper presents a temporal-decoupling fine-tuning strategy based on Contrastive Language-Image Pretraining (CLIP) architecture. Results in simulated environments demonstrate that the RoboAct-CLIP pretrained model achieves a 12% higher success rate than baseline Visual Language Models.
arXiv Detail & Related papers (2025-04-02T19:02:08Z)
Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.<n>Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.<n>We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA) LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
A Pattern Language for Machine Learning Tasks [0.0]
We view objective functions as constraints on the behaviour of learners. We develop a formal graphical language that allows us to separate the core tasks of a behaviour from its implementation details. As proof-of-concept, we design a novel task that enables converting classifiers into generative models we call "manipulators"
arXiv Detail & Related papers (2024-07-02T16:50:27Z)
Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction. The experimental results demonstrate that MPI exhibits remarkable improvement by 10% to 64% compared with previous state-of-the-art in real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation [22.071450379253235]
We introduce an innovative approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs) By fine-tuning the injected adapters, we preserve the inherent common sense and reasoning ability of the MLLMs while equipping them with the ability for manipulation. Experiments in simulator and real-world show the promising performance of ManipLLM.
arXiv Detail & Related papers (2023-12-24T06:38:11Z)
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control. Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation. We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects. We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation mask generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings. We demonstrate that this framework enables effective generalization across different environments. For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation. A core challenge is to generalize the manipulation skills to objects in different locations. We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
arXiv Detail & Related papers (2020-10-11T01:40:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.