Embodied Concept Learner: Self-supervised Learning of Concepts and
Mapping through Instruction Following
- URL: http://arxiv.org/abs/2304.03767v1
- Date: Fri, 7 Apr 2023 17:59:34 GMT
- Title: Embodied Concept Learner: Self-supervised Learning of Concepts and
Mapping through Instruction Following
- Authors: Mingyu Ding, Yan Xu, Zhenfang Chen, David Daniel Cox, Ping Luo, Joshua
B. Tenenbaum, Chuang Gan
- Abstract summary: We propose the Embodied Concept Learner (ECL) in an interactive 3D environment.
A robot agent can ground visual concepts, build semantic maps, and plan actions to complete tasks.
ECL is fully transparent and interpretable step by step in long-term planning.
- Score: 101.55727845195969
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans, even at a very early age, can learn visual concepts and understand
geometry and layout through active interaction with the environment, and
generalize their compositions to complete tasks described by natural languages
in novel scenes. To mimic such capability, we propose Embodied Concept Learner
(ECL) in an interactive 3D environment. Specifically, a robot agent can ground
visual concepts, build semantic maps and plan actions to complete tasks by
learning purely from human demonstrations and language instructions, without
access to ground-truth semantic and depth supervision from the simulator. ECL
consists of: (i) an instruction parser that translates natural language
into executable programs; (ii) an embodied concept learner that grounds visual
concepts based on language descriptions; (iii) a map constructor that estimates
depth and constructs semantic maps by leveraging the learned concepts; and (iv)
a program executor with deterministic policies to execute each program. ECL has
several appealing benefits thanks to its modularized design. First, it
enables the robotic agent to learn semantics and depth without supervision,
much as infants do, e.g., grounding concepts through active interaction and
perceiving depth from disparities while moving forward. Second, ECL is fully
transparent and interpretable step by step in long-term planning. Third, ECL
benefits embodied instruction following (EIF), outperforming previous work on
the ALFRED benchmark when semantic labels are not provided. The learned
concepts can also be reused for downstream tasks such as reasoning about
object states. Project page: http://ecl.csail.mit.edu/
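The four modules (i)-(iv) compose into a perceive-map-act loop. Below is a minimal Python sketch of that control flow under assumed interfaces: every class, method, and constant here (InstructionParser, ConceptLearner, the hard-coded ALFRED-style program, the camera parameters, the env object) is a hypothetical stand-in rather than the authors' code, and the textbook stereo relation depth = f * B / d stands in for ECL's learned depth-from-disparity estimation.

```python
# Schematic of the ECL pipeline described in the abstract. All names and
# interfaces are hypothetical stand-ins; the paper does not publish this API.
from dataclasses import dataclass

@dataclass
class Step:
    op: str   # e.g. "GotoLocation", "PickupObject"
    arg: str  # e.g. "apple"

class InstructionParser:
    """(i) Translate a natural-language instruction into an executable program."""
    def parse(self, instruction: str) -> list[Step]:
        # A real parser is learned from demonstrations; this hard-codes one
        # ALFRED-style example for "put an apple in the fridge".
        return [Step("GotoLocation", "countertop"),
                Step("PickupObject", "apple"),
                Step("GotoLocation", "fridge"),
                Step("PutObject", "fridge")]

class ConceptLearner:
    """(ii) Ground visual concepts in egocentric frames without GT labels."""
    def locate(self, frame, concept: str):
        return (0, 0, 32, 32)  # placeholder region (x, y, w, h) for the concept

class MapConstructor:
    """(iii) Estimate depth and accumulate a semantic top-down map."""
    FOCAL_PX, BASELINE_M = 300.0, 0.25  # assumed camera parameters

    def depth_from_disparity(self, disparity_px: float) -> float:
        # Textbook stereo relation depth = f * B / d; ECL analogously infers
        # depth from inter-frame disparities as the agent moves forward.
        return self.FOCAL_PX * self.BASELINE_M / max(disparity_px, 1e-6)

    def update(self, frame, region) -> None:
        pass  # back-project the grounded region into the map using depth

class ProgramExecutor:
    """(iv) Execute each program step with a deterministic policy."""
    def execute(self, step: Step, mapper: MapConstructor) -> list[str]:
        return ["MoveAhead"]  # placeholder low-level actions from map planning

def run_ecl(instruction: str, env) -> None:
    parser, learner = InstructionParser(), ConceptLearner()
    mapper, executor = MapConstructor(), ProgramExecutor()
    for step in parser.parse(instruction):             # (i) parse to program
        frame = env.observe()
        region = learner.locate(frame, step.arg)       # (ii) ground concept
        mapper.update(frame, region)                   # (iii) update map
        for action in executor.execute(step, mapper):  # (iv) execute step
            env.act(action)
```

Because each intermediate product (program, grounded regions, semantic map) is an explicit object, every stage can be inspected, which is the source of the step-by-step interpretability claimed above.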
Related papers
- SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation [49.858348469657784]
We introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner.
By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints.
arXiv Detail & Related papers (2025-02-18T18:59:02Z)
- ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension [71.03445074045092]
We propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives, i.e., groups of visual tokens.
Our method unifies the prompt and answer of visual referential tasks without using additional syntax.
ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
- Can Language Models Understand Physical Concepts? [45.30953251294797]
Language models are gradually becoming general-purpose interfaces to the interactive and embodied world.
It is not yet clear whether LMs can understand physical concepts in the human world.
arXiv Detail & Related papers (2023-05-23T13:36:55Z)
- Actional Atomic-Concept Learning for Demystifying Vision-Language Navigation [124.07372905781696]
Actional Atomic-Concept Learning (AACL) maps visual observations to actional atomic concepts to facilitate alignment.
AACL establishes new state-of-the-art results on both fine-grained (R2R) and high-level (REVERIE and R2R-Last) VLN benchmarks.
arXiv Detail & Related papers (2023-02-13T03:08:05Z)
- Visual Superordinate Abstraction for Robust Concept Learning [80.15940996821541]
Concept learning constructs visual representations that are connected to linguistic semantics.
We ascribe the bottleneck to a failure to exploit the intrinsic semantic hierarchy of visual concepts.
We propose a visual superordinate abstraction framework for explicitly modeling semantic-aware visual subspaces.
arXiv Detail & Related papers (2022-05-28T14:27:38Z)
- Identifying concept libraries from language about object structure [56.83719358616503]
We leverage natural language descriptions for a diverse set of 2K procedurally generated objects to identify the parts people use.
We formalize our problem as search over a space of program libraries that contain different part concepts.
By combining naturalistic language at scale with structured program representations, we discover a fundamental information-theoretic tradeoff governing the part concepts people name.
arXiv Detail & Related papers (2022-05-11T17:49:25Z)
- Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning [3.441021278275805]
We design a two-stream model for grounding language learning in vision.
The model first learns to align visual and language representations with the MS COCO dataset.
After training, the language stream of this model is a stand-alone language model capable of embedding concepts in a visually grounded semantic space (a generic sketch of such a contrastive objective appears after this list).
arXiv Detail & Related papers (2021-11-13T19:54:15Z)
- CLIPort: What and Where Pathways for Robotic Manipulation [35.505615833638124]
We present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter.
Our framework is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures.
arXiv Detail & Related papers (2021-09-24T17:44:28Z)
- Language (Re)modelling: Towards Embodied Language Understanding [33.50428967270188]
This work proposes an approach to representation and learning based on the tenets of embodied cognitive linguistics (ECL).
According to ECL, natural language is inherently executable (like programming languages).
This position paper argues that the use of grounding by metaphoric inference and simulation will greatly benefit NLU systems.
arXiv Detail & Related papers (2020-05-01T10:57:02Z)
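For the cross-modal contrastive learning entry above, the following is a minimal PyTorch sketch of a generic symmetric InfoNCE objective for aligning the two streams, assuming paired image and caption embeddings; it illustrates the general technique, not the cited paper's exact loss or architecture.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb: torch.Tensor,
                               txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    # img_emb, txt_emb: (batch, dim) outputs of the visual and language
    # streams for paired images/captions; row i of each is a positive pair.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature  # (batch, batch) cosine similarities
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy: match each image to its caption and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Training with such an objective pulls matched image-caption pairs together on the unit sphere and pushes mismatched pairs in the batch apart, which is what lets the language stream inherit a visually grounded geometry.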
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.