Think Before You Act: Unified Policy for Interleaving Language Reasoning with Actions
- URL: http://arxiv.org/abs/2304.11063v1
- Date: Tue, 18 Apr 2023 16:12:38 GMT
- Title: Think Before You Act: Unified Policy for Interleaving Language Reasoning with Actions
- Authors: Lina Mezghani and Piotr Bojanowski and Karteek Alahari and Sainbayar Sukhbaatar
- Abstract summary: Decision Transformer shows how to train transformers with a similar next-step prediction objective on offline data.
We propose a novel method for unifying language reasoning with actions in a single policy.
Specifically, we augment a transformer policy with word outputs, so it can generate textual captions interleaved with actions.
- Score: 21.72567982148215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The success of transformer models trained with a language modeling objective
brings a promising opportunity to the reinforcement learning framework.
Decision Transformer is a step in this direction, showing how to train
transformers with a similar next-step prediction objective on offline data.
Another important development in this area is the recent emergence of
large-scale datasets collected from the internet, such as the ones composed of
tutorial videos with captions where people talk about what they are doing. To
take advantage of this language component, we propose a novel method for
unifying language reasoning with actions in a single policy. Specifically, we
augment a transformer policy with word outputs, so it can generate textual
captions interleaved with actions. When tested on the most challenging task in
BabyAI, with captions describing next subgoals, our reasoning policy
consistently outperforms the caption-free baseline.
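The method described above reduces to a single autoregressive policy whose output space covers both caption words and actions, trained with next-token prediction on offline trajectories in which captions are interleaved with actions. The sketch below illustrates that idea; the vocabulary split, module sizes, and trajectory encoding are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a caption-and-action policy (illustrative only, not the
# authors' code). A single causal transformer predicts the next token, which
# may be an observation token, a caption word, or an action; training is plain
# next-token prediction on offline trajectories. All sizes are assumptions.
import torch
import torch.nn as nn

class ReasoningPolicy(nn.Module):
    def __init__(self, n_obs_tokens=128, n_words=512, n_actions=8,
                 d_model=256, n_layers=4, n_heads=4, max_len=256):
        super().__init__()
        # Shared token space: [observations | caption words | actions].
        self.vocab_size = n_obs_tokens + n_words + n_actions
        self.embed = nn.Embedding(self.vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        # One head scores words and actions jointly, so a rollout can emit
        # caption text (e.g. the next subgoal) before committing to an action.
        self.head = nn.Linear(d_model, self.vocab_size)

    def forward(self, tokens):  # tokens: (batch, time) integer ids
        t = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(t, device=tokens.device))
        causal = torch.triu(
            torch.full((t, t), float("-inf"), device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))  # (batch, time, vocab)

def training_step(policy, tokens, optimizer):
    """Next-token prediction over interleaved [obs, words..., action, obs, ...]."""
    logits = policy(tokens[:, :-1])
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, policy.vocab_size), tokens[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The shared output head is the point of the sketch: because words and actions live in one vocabulary, the same next-token loss that teaches the policy to act also teaches it to verbalize its next subgoal, and at test time the generated caption conditions the action tokens that follow it.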
Related papers
- LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning [19.801187860991117]
This work introduces LaMP, a novel Language-Motion Pretraining model.
LaMP generates motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences.
For captioning, we finetune a large language model with the language-informative motion features to develop a strong motion captioning model.
arXiv Detail & Related papers (2024-10-09T17:33:03Z)
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
- Goal Representations for Instruction Following: A Semi-Supervised Language Interface to Control [58.06223121654735]
We show a method that taps into joint image- and goal- conditioned policies with language using only a small amount of language data.
Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image, but to the desired change between the start and goal images that the instruction describes.
We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data.
arXiv Detail & Related papers (2023-06-30T20:09:39Z)
- Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation [72.74191015833397]
We propose TransFusion, a multimodal transformer-based architecture.
It exploits the representational power of language by summarizing the action context.
Our model enables more efficient end-to-end learning.
arXiv Detail & Related papers (2023-01-22T21:30:12Z)
- Instruction-Following Agents with Multimodal Transformer [95.70039658112873]
We propose a simple yet effective model for robots to solve instruction-following tasks in vision-based environments.
Our method consists of a multimodal transformer that encodes visual observations and language instructions.
We show that this unified transformer model outperforms all state-of-the-art pre-trained or trained-from-scratch methods in both single-task and multi-task settings.
arXiv Detail & Related papers (2022-10-24T17:46:47Z)
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- Survey: Transformer based Video-Language Pre-training [28.870441287367825]
This survey aims to give a comprehensive overview on transformer-based pre-training methods for Video-Language learning.
We first briefly introduce the transformer structure as background knowledge, including the attention mechanism, position encoding, etc.
We categorize transformer models into Single-Stream and Multi-Stream structures, highlight their innovations and compare their performances.
arXiv Detail & Related papers (2021-09-21T02:36:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.