CLIPort: What and Where Pathways for Robotic Manipulation
- URL: http://arxiv.org/abs/2109.12098v1
- Date: Fri, 24 Sep 2021 17:44:28 GMT
- Title: CLIPort: What and Where Pathways for Robotic Manipulation
- Authors: Mohit Shridhar, Lucas Manuelli, Dieter Fox
- Abstract summary: We present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter.
Our framework is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: How can we imbue robots with the ability to manipulate objects precisely but
also to reason about them in terms of abstract concepts? Recent works in
manipulation have shown that end-to-end networks can learn dexterous skills
that require precise spatial reasoning, but these methods often fail to
generalize to new goals or quickly learn transferable concepts across tasks. In
parallel, there has been great progress in learning generalizable semantic
representations for vision and language by training on large-scale internet
data; however, these representations lack the spatial understanding necessary
for fine-grained manipulation. To this end, we propose a framework that
combines the best of both worlds: a two-stream architecture with semantic and
spatial pathways for vision-based manipulation. Specifically, we present
CLIPort, a language-conditioned imitation-learning agent that combines the
broad semantic understanding (what) of CLIP [1] with the spatial precision
(where) of Transporter [2]. Our end-to-end framework is capable of solving a
variety of language-specified tabletop tasks from packing unseen objects to
folding cloths, all without any explicit representations of object poses,
instance segmentations, memory, symbolic states, or syntactic structures.
Experiments in simulated and real-world settings show that our approach is
data-efficient in few-shot settings and generalizes effectively to seen and unseen
semantic concepts. We even learn one multi-task policy for 10 simulated and 9
real-world tasks that performs better than, or comparably to, single-task policies.
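To make the two-stream idea concrete, the sketch below shows a language-conditioned affordance predictor with separate semantic ("what") and spatial ("where") streams fused into a dense pick heatmap. It is a minimal illustration under stated assumptions, not the paper's architecture: the layer sizes, the multiplicative fusion, and the random language embedding stand in for frozen CLIP features and Transporter's cross-correlation place module.

```python
# Minimal sketch of a two-stream ("what"/"where") affordance predictor.
# Layer sizes, the multiplicative fusion, and the random inputs are
# illustrative assumptions; CLIPort itself conditions on frozen CLIP
# features and uses Transporter-style cross-correlation for placing.
import torch
import torch.nn as nn

class TwoStreamAffordance(nn.Module):
    def __init__(self, lang_dim=512):
        super().__init__()
        # Spatial ("where") stream: operates on raw RGB-D, keeps resolution.
        self.spatial = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        # Semantic ("what") stream: image features modulated by a language embedding.
        self.semantic = nn.Sequential(
            nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.lang_proj = nn.Linear(lang_dim, 32)  # text embedding -> per-channel gains
        self.head = nn.Conv2d(32, 1, 1)           # dense per-pixel pick affordance

    def forward(self, rgbd, lang_emb):
        # rgbd: (B, 4, H, W) top-down RGB-D; lang_emb: (B, lang_dim) sentence
        # embedding of e.g. "pack the red block into the brown box".
        gains = self.lang_proj(lang_emb)[:, :, None, None]   # (B, 32, 1, 1)
        fused = self.spatial(rgbd) * (self.semantic(rgbd) * gains)
        return self.head(fused).squeeze(1)                   # (B, H, W) logits

model = TwoStreamAffordance()
logits = model(torch.randn(1, 4, 160, 160), torch.randn(1, 512))
pick_uv = torch.nonzero(logits[0] == logits[0].max())[0]     # argmax pixel = pick point
print(pick_uv)
```

In the actual agent a second, analogous pass predicts the place location conditioned on the picked crop; the point of the sketch is only how a single language embedding can gate a spatially precise, fully convolutional output.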
Related papers
- Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
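A minimal sketch of the frozen patch-wise feature idea, under clear assumptions: the "frozen VLM" below is a stand-in patch embedder rather than an actual pretrained vision-language tower, and the policy head and its action dimensions are hypothetical.

```python
# Sketch of driving a control policy from frozen, patch-wise visual features.
# The "frozen VLM" here is a stand-in patch embedder; Flex uses a pretrained
# Vision-Language Model's patch features, which this sketch does not reproduce.
import torch
import torch.nn as nn

class FrozenPatchExtractor(nn.Module):
    """Stand-in for a frozen VLM vision tower: 16x16 patches -> d-dim tokens."""
    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        for p in self.parameters():
            p.requires_grad = False   # frozen: only the policy head is trained

    def forward(self, img):           # img: (B, 3, H, W)
        tokens = self.embed(img)      # (B, dim, H/16, W/16)
        return tokens.flatten(2).transpose(1, 2)  # (B, num_patches, dim)

class PatchPolicy(nn.Module):
    """Small trainable head mapping patch tokens to continuous actions."""
    def __init__(self, dim=256, act_dim=4):
        super().__init__()
        self.backbone = FrozenPatchExtractor(dim)
        self.head = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, act_dim))

    def forward(self, img):
        tokens = self.backbone(img)
        pooled = tokens.mean(dim=1)   # average over patches
        return self.head(pooled)      # e.g. (vx, vy, vz, yaw_rate) for a quadrotor

policy = PatchPolicy()
action = policy(torch.randn(1, 3, 224, 224))
print(action)
```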
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Object-Centric Instruction Augmentation for Robotic Manipulation
We introduce the Object-Centric Instruction Augmentation (OCI) framework to augment highly semantic and information-dense language instructions with position cues.
We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instructions.
We demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.
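As a rough illustration of position-cue augmentation (the MLLM step is replaced here by a plain string template, and the detected objects and coordinates are assumed inputs):

```python
# Sketch of augmenting an instruction with object position cues. OCI uses a
# Multi-modal LLM for this step; a simple template stands in for it, and the
# detections below are hypothetical.
def augment_instruction(instruction, detections):
    """detections: list of (object_name, (x, y)) in normalized image coordinates."""
    cues = "; ".join(f"{name} at ({x:.2f}, {y:.2f})" for name, (x, y) in detections)
    return f"{instruction} [object positions: {cues}]"

print(augment_instruction(
    "put the mug on the top shelf",
    [("mug", (0.31, 0.72)), ("top shelf", (0.55, 0.18))],
))
# -> put the mug on the top shelf [object positions: mug at (0.31, 0.72); top shelf at (0.55, 0.18)]
```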
arXiv Detail & Related papers (2024-01-05T13:54:45Z)
- Human-oriented Representation Learning for Robotic Manipulation
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with their environment in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representations of three state-of-the-art visual encoders for downstream manipulation policy learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
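One plausible way a segmentation mask yields a coarse object pose is to back-project the masked depth pixels and take their centroid and principal axis; the sketch below assumes known camera intrinsics and is not necessarily the paper's own pipeline.

```python
# Sketch of turning a foundation-model segmentation mask into a coarse object
# pose (3-D centroid + principal axis) using depth and camera intrinsics.
# The intrinsics and the PCA-based orientation estimate are assumptions.
import numpy as np

def mask_to_pose(mask, depth, fx, fy, cx, cy):
    """mask: (H, W) bool, depth: (H, W) metres. Returns centroid and major axis."""
    v, u = np.nonzero(mask)                     # pixel rows/cols inside the mask
    z = depth[v, u]
    valid = z > 0
    u, v, z = u[valid], v[valid], z[valid]
    pts = np.stack([(u - cx) * z / fx,          # back-project to the camera frame
                    (v - cy) * z / fy,
                    z], axis=1)                 # (N, 3) points in metres
    centroid = pts.mean(axis=0)
    cov = np.cov((pts - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    major_axis = eigvecs[:, np.argmax(eigvals)] # coarse orientation estimate
    return centroid, major_axis

mask = np.zeros((120, 160), dtype=bool); mask[40:80, 60:100] = True
depth = np.full((120, 160), 0.6)
print(mask_to_pose(mask, depth, fx=200.0, fy=200.0, cx=80.0, cy=60.0))
```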
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Embodied Concept Learner: Self-supervised Learning of Concepts and Mapping through Instruction Following
We propose Embodied Concept Learner (ECL) in an interactive 3D environment.
A robot agent can ground visual concepts, build semantic maps and plan actions to complete tasks.
ECL is fully transparent and step-by-step interpretable in long-term planning.
arXiv Detail & Related papers (2023-04-07T17:59:34Z)
- Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach
We present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation.
A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA) or a grasp action instruction.
We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes.
arXiv Detail & Related papers (2022-10-03T12:21:45Z)
- Self-Supervised Visual Representation Learning with Semantic Grouping
We tackle the problem of learning visual representations from unlabeled scene-centric data.
We propose contrastive learning from data-driven semantic slots, namely SlotCon, for joint semantic grouping and representation learning.
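The slot-level contrastive objective can be sketched as an InfoNCE loss in which slots from two augmented views of the same image are positives and all other slots are negatives; the slot-attention grouping itself is omitted here, and the slot features are placeholders.

```python
# Sketch of a slot-level contrastive (InfoNCE) objective. The grouping step
# that produces the slots is omitted; random tensors stand in for slot
# embeddings coming from a trainable encoder.
import torch
import torch.nn.functional as F

def slot_contrastive_loss(slots_a, slots_b, temperature=0.1):
    """slots_a, slots_b: (B, K, D) slot embeddings from two views of the same images."""
    B, K, D = slots_a.shape
    a = F.normalize(slots_a.reshape(B * K, D), dim=1)
    b = F.normalize(slots_b.reshape(B * K, D), dim=1)
    logits = a @ b.t() / temperature            # (B*K, B*K) cosine similarities
    targets = torch.arange(B * K)               # slot i in view A matches slot i in view B
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

print(slot_contrastive_loss(torch.randn(8, 4, 128), torch.randn(8, 4, 128)))
```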
arXiv Detail & Related papers (2022-05-30T17:50:59Z)
- Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations
We show the effectiveness of object-aware representation learning techniques for robotic tasks.
Our model learns control policies in a sample-efficient manner and outperforms state-of-the-art object-agnostic techniques.
arXiv Detail & Related papers (2022-05-12T19:48:11Z)
- Where2Act: From Pixels to Actions for Articulated 3D Objects
We extract highly localized actionable information related to elementary actions such as pushing or pulling for articulated objects with movable parts.
We propose a learning-from-interaction framework with an online data sampling strategy that allows us to train the network in simulation.
Our learned models even transfer to real-world data.
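A minimal sketch of per-pixel actionability scoring for a single primitive (e.g. pushing), supervised by the binary outcome of a simulated interaction at one pixel; the network and the training signal here are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of per-pixel "actionability" scoring for one primitive (pushing):
# a small fully-convolutional net maps RGB-D to a success-probability map,
# supervised by binary outcomes of simulated interactions at sampled pixels.
import torch
import torch.nn as nn

actionability = nn.Sequential(            # RGB-D in, one score per pixel out
    nn.Conv2d(4, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 1),
)

rgbd = torch.randn(1, 4, 128, 128)
scores = torch.sigmoid(actionability(rgbd))[0, 0]   # (128, 128) push-success probabilities

# Training step on one interaction: push at pixel (u, v), observe success 1/0.
u, v, success = 40, 65, 1.0
loss = nn.functional.binary_cross_entropy(scores[v, u], torch.tensor(success))
loss.backward()
print(loss.item())
```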
arXiv Detail & Related papers (2021-01-07T18:56:38Z)
- Following Instructions by Imagining and Reaching Visual Goals
We present a novel framework for learning to perform temporally extended tasks using spatial reasoning.
Our framework operates on raw pixel images, assumes no prior linguistic or perceptual knowledge, and learns via intrinsic motivation.
We validate our method in two simulated interactive 3D environments with a robot arm.
arXiv Detail & Related papers (2020-01-25T23:26:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.