EC^2: Emergent Communication for Embodied Control
- URL: http://arxiv.org/abs/2304.09448v1
- Date: Wed, 19 Apr 2023 06:36:02 GMT
- Title: EC^2: Emergent Communication for Embodied Control
- Authors: Yao Mu, Shunyu Yao, Mingyu Ding, Ping Luo, Chuang Gan
- Abstract summary: Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments.
We propose Emergent Communication for Embodied Control (EC^2), a novel scheme to pre-train video-language representations for few-shot embodied control.
EC^2 consistently outperforms previous contrastive learning methods with both videos and texts as task inputs.
- Score: 72.99894347257268
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied control requires agents to leverage multi-modal pre-training to
quickly learn how to act in new environments, where video demonstrations
contain visual and motion details needed for low-level perception and control,
and language instructions support generalization with abstract, symbolic
structures. While recent approaches apply contrastive learning to force
alignment between the two modalities, we hypothesize that better modeling their
complementary differences can lead to more holistic representations for
downstream adaptation. To this end, we propose Emergent Communication for
Embodied Control (EC^2), a novel scheme to pre-train video-language
representations for few-shot embodied control. The key idea is to learn an
unsupervised "language" of videos via emergent communication, which bridges the
semantics of video details and structures of natural language. We learn
embodied representations of video trajectories, emergent language, and natural
language using a language model, which is then used to finetune a lightweight
policy network for downstream control. Through extensive experiments in
Metaworld and Franka Kitchen embodied benchmarks, EC^2 is shown to consistently
outperform previous contrastive learning methods for both videos and texts as
task inputs. Further ablations confirm the importance of the emergent language,
which is beneficial for both video and language learning, and significantly
superior to using pre-trained video captions. We also present a quantitative
and qualitative analysis of the emergent language and discuss future directions
toward better understanding and leveraging emergent communication in embodied
tasks.
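The abstract outlines a two-stage recipe: pre-train joint representations of video trajectories, an emergent "language" of videos, and natural language with a language model, then finetune a lightweight policy network on top. As a rough, hedged illustration of the first stage only, the sketch below shows one possible emergent-communication pre-training step in PyTorch; the Speaker/Listener modules, the Gumbel-softmax message channel, the InfoNCE referential-game loss, and the cosine alignment to text embeddings are illustrative assumptions, not the paper's actual architecture or objectives.

```python
# Minimal sketch of an emergent-communication pre-training step.
# All module names, dimensions, and losses below are assumptions for
# illustration; they do not reproduce the EC^2 architecture or objectives.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Speaker(nn.Module):
    """Encodes a video embedding into a short sequence of discrete message tokens."""
    def __init__(self, video_dim=512, vocab_size=64, msg_len=8):
        super().__init__()
        self.vocab_size, self.msg_len = vocab_size, msg_len
        self.proj = nn.Linear(video_dim, msg_len * vocab_size)

    def forward(self, video_emb, tau=1.0):
        logits = self.proj(video_emb).view(-1, self.msg_len, self.vocab_size)
        # Straight-through Gumbel-softmax keeps the channel discrete but differentiable.
        return F.gumbel_softmax(logits, tau=tau, hard=True)  # (B, msg_len, vocab)

class Listener(nn.Module):
    """Embeds the discrete message so it can be compared with video/text features."""
    def __init__(self, vocab_size=64, hidden=512):
        super().__init__()
        self.embed = nn.Linear(vocab_size, hidden)
        self.gru = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, message):
        _, h = self.gru(self.embed(message))
        return h.squeeze(0)  # (B, hidden)

def ec_pretrain_step(speaker, listener, video_emb, text_emb, optimizer, tau=1.0):
    """One referential-game style update: the listener must identify the matching
    video from the message, while the message embedding is pulled toward the
    paired natural-language embedding (a stand-in for bridging the two)."""
    message = speaker(video_emb, tau)
    msg_emb = listener(message)
    # InfoNCE over the batch: message i should match video i.
    logits = msg_emb @ video_emb.t()
    targets = torch.arange(video_emb.size(0))
    game_loss = F.cross_entropy(logits, targets)
    # Keep the emergent language close to natural-language semantics.
    align_loss = 1.0 - F.cosine_similarity(msg_emb, text_emb).mean()
    loss = game_loss + align_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    B, D = 16, 512
    speaker, listener = Speaker(video_dim=D), Listener(hidden=D)
    opt = torch.optim.Adam(list(speaker.parameters()) + list(listener.parameters()), lr=1e-4)
    video_emb = torch.randn(B, D)  # placeholder for pre-extracted video features
    text_emb = torch.randn(B, D)   # placeholder for paired instruction embeddings
    print(ec_pretrain_step(speaker, listener, video_emb, text_emb, opt))
```

In the full pipeline described above, a language model would consume the emergent tokens together with natural-language instructions, and the resulting representations would condition the lightweight policy network for few-shot control.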
Related papers
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
- Accessible Instruction-Following Agent [0.0]
We introduce UVLN, a novel machine-translation instruction-augmented framework for cross-lingual vision-language navigation.
We extend the standard VLN training objectives to a multilingual setting via a cross-lingual language encoder.
Experiments on the Room Across Room dataset demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-05-08T23:57:26Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary-learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision [110.66085917826648]
We develop a technique that extrapolates multimodal alignments to language-only data by contextually mapping language tokens to their related images.
The "vokenization" model is trained on relatively small image captioning datasets and is then applied to generate vokens for large language corpora.
Trained with these contextually generated vokens, our visually-supervised language models show consistent improvements over self-supervised alternatives on multiple pure-language tasks.
arXiv Detail & Related papers (2020-10-14T02:11:51Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method that makes full use of an external language model (ELM) to integrate abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.