Embodied Concept Learner: Self-supervised Learning of Concepts and
Mapping through Instruction Following
- URL: http://arxiv.org/abs/2304.03767v1
- Date: Fri, 7 Apr 2023 17:59:34 GMT
- Title: Embodied Concept Learner: Self-supervised Learning of Concepts and
Mapping through Instruction Following
- Authors: Mingyu Ding, Yan Xu, Zhenfang Chen, David Daniel Cox, Ping Luo, Joshua
B. Tenenbaum, Chuang Gan
- Abstract summary: We propose Embodied Concept Learner (ECL) in an interactive 3D environment.
A robot agent can ground visual concepts, build semantic maps and plan actions to complete tasks.
ECL is fully transparent and step-by-step interpretable in long-term planning.
- Score: 101.55727845195969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans, even at a very early age, can learn visual concepts and understand
geometry and layout through active interaction with the environment, and
generalize their compositions to complete tasks described by natural languages
in novel scenes. To mimic such capability, we propose Embodied Concept Learner
(ECL) in an interactive 3D environment. Specifically, a robot agent can ground
visual concepts, build semantic maps and plan actions to complete tasks by
learning purely from human demonstrations and language instructions, without
access to ground-truth semantic and depth supervision from simulations. ECL
consists of: (i) an instruction parser that translates natural language
into executable programs; (ii) an embodied concept learner that grounds visual
concepts based on language descriptions; (iii) a map constructor that estimates
depth and constructs semantic maps by leveraging the learned concepts; and (iv)
a program executor with deterministic policies to execute each program. ECL has
several appealing benefits thanks to its modularized design. First, it enables
the robotic agent to learn semantics and depth without supervision, much as
babies do, e.g., grounding concepts through active interaction and perceiving
depth from disparities while moving forward. Second, ECL is fully transparent
and step-by-step interpretable in long-term planning. Third, ECL benefits
embodied instruction following (EIF), outperforming previous work on the
ALFRED benchmark when semantic labels are not provided. The learned concepts
can also be reused for other downstream tasks, such as reasoning about object
states. Project page: http://ecl.csail.mit.edu/
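To make the four-module pipeline above concrete, here is a minimal, self-contained Python sketch. It illustrates the modular design described in the abstract and is not the authors' implementation: every name in it (parse_instruction, SemanticMap, depth_from_disparity, and so on) is hypothetical, and the parser and concept learner are toy stand-ins. The depth helper uses the standard pinhole relation Z = f * B / d, consistent with the abstract's point that depth can be perceived from disparities while moving forward.

```python
# Illustrative sketch of a four-module ECL-style pipeline.
# All names are hypothetical; this is not the authors' code.
from dataclasses import dataclass, field


@dataclass
class SemanticMap:
    """(iii) A toy semantic map: grid cell -> grounded concept label."""
    labels: dict = field(default_factory=dict)

    def update(self, cell: tuple, concept: str) -> None:
        self.labels[cell] = concept


def parse_instruction(instruction: str) -> list:
    """(i) Instruction parser: natural language -> executable program.

    A toy stand-in: e.g. 'put the apple in the fridge' becomes a
    sequence of (operation, argument) steps.
    """
    if "put the " in instruction and " in the " in instruction:
        obj, target = instruction.split("put the ")[1].split(" in the ")
        return [("GotoLocation", obj), ("PickupObject", obj),
                ("GotoLocation", target), ("PutObject", target)]
    return []


def ground_concept(image_patch, word: str) -> str:
    """(ii) Concept learner: associate a visual patch with a word.

    ECL learns this grounding from demonstrations and instructions;
    here we simply adopt the word as the concept label.
    """
    return word


def depth_from_disparity(disparity_px: float, focal_px: float,
                         baseline_m: float) -> float:
    """Depth step of (iii): pinhole relation Z = f * B / d.

    Forward motion provides the baseline, matching the abstract's
    'perceive depth from disparities while moving forward'.
    """
    return focal_px * baseline_m / max(disparity_px, 1e-6)


def execute(program: list, smap: SemanticMap) -> None:
    """(iv) Program executor: one deterministic policy per operation."""
    for op, arg in program:
        print(f"{op}({arg}) on a map with {len(smap.labels)} labeled cells")


if __name__ == "__main__":
    smap = SemanticMap()
    smap.update((3, 4), ground_concept(None, "apple"))
    print("depth at 8 px disparity:",
          depth_from_disparity(8.0, focal_px=320.0, baseline_m=0.1), "m")
    execute(parse_instruction("put the apple in the fridge"), smap)
```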
Related papers
- Can Language Models Understand Physical Concepts? [45.30953251294797]
Language models are gradually becoming general-purpose interfaces in the interactive and embodied world.
It is not yet clear whether LMs can understand physical concepts in the human world.
arXiv Detail & Related papers (2023-05-23T13:36:55Z) - Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation [124.07372905781696]
Actional Atomic-Concept Learning (AACL) maps visual observations to actional atomic concepts to facilitate alignment between observations and instructions.
AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks.
arXiv Detail & Related papers (2023-02-13T03:08:05Z) - CLIP also Understands Text: Prompting CLIP for Phrase Understanding [65.59857372525664]
Contrastive Language-Image Pretraining (CLIP) efficiently learns visual concepts by pre-training with natural language supervision.
In this paper, we find that the text encoder of CLIP demonstrates a strong ability for phrase understanding and, with a properly designed prompt, can even significantly outperform popular language models such as BERT. (A minimal prompting sketch appears after this list.)
arXiv Detail & Related papers (2022-10-11T23:35:18Z) - Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure to explore the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z) - Identifying concept libraries from language about object structure [56.83719358616503]
We leverage natural language descriptions for a diverse set of 2K procedurally generated objects to identify the parts people use.
We formalize our problem as search over a space of program libraries that contain different part concepts.
By combining naturalistic language at scale with structured program representations, we discover a fundamental information-theoretic tradeoff governing the part concepts people name.
arXiv Detail & Related papers (2022-05-11T17:49:25Z) - Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%. (A minimal LM-as-policy sketch also appears after this list.)
arXiv Detail & Related papers (2022-02-03T18:55:52Z) - Explainable Semantic Space by Grounding Language to Vision with
Cross-Modal Contrastive Learning [3.441021278275805]
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space.
arXiv Detail & Related papers (2021-11-13T19:54:15Z) - CLIPort: What and Where Pathways for Robotic Manipulation [35.505615833638124]
We present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter.
Our framework is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures.
arXiv Detail & Related papers (2021-09-24T17:44:28Z) - Language (Re)modelling: Towards Embodied Language Understanding [33.50428967270188]
This work proposes an approach to representation and learning based on the tenets of embodied cognitive linguistics (ECL).
According to ECL, natural language is inherently executable (like programming languages).
This position paper argues that the use of grounding by metaphoric inference and simulation will greatly benefit NLU systems.
arXiv Detail & Related papers (2020-05-01T10:57:02Z)