VisualHints: A Visual-Lingual Environment for Multimodal Reinforcement
Learning
- URL: http://arxiv.org/abs/2010.13839v1
- Date: Mon, 26 Oct 2020 18:51:02 GMT
- Title: VisualHints: A Visual-Lingual Environment for Multimodal Reinforcement
Learning
- Authors: Thomas Carta, Subhajit Chaudhury, Kartik Talamadupula and Michiaki
Tatsubori
- Abstract summary: We present VisualHints, a novel environment for multimodal reinforcement learning (RL) involving text-based interactions along with visual hints (obtained from the environment).
We introduce an extension of the TextWorld cooking environment with the addition of visual clues interspersed throughout the environment.
The goal is to force an RL agent to use both text and visual features to predict natural language action commands for solving the final task of cooking a meal.
- Score: 14.553086325168803
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present VisualHints, a novel environment for multimodal reinforcement
learning (RL) involving text-based interactions along with visual hints
(obtained from the environment). Real-life problems often demand that agents
interact with the environment using both natural language information and
visual perception towards solving a goal. However, most traditional RL
environments either solve pure vision-based tasks like Atari games or
video-based robotic manipulation; or entirely use natural language as a mode of
interaction, like text-based games and dialog systems. In this work, we aim to
bridge this gap and unify these two approaches in a single environment for
multimodal RL. We introduce an extension of the TextWorld cooking environment
with the addition of visual clues interspersed throughout the environment. The
goal is to force an RL agent to use both text and visual features to predict
natural language action commands for solving the final task of cooking a meal.
We enable variations and difficulties in our environment to emulate various
interactive real-world scenarios. We present a baseline multimodal agent for
solving such problems using CNN-based feature extraction from visual hints and
LSTMs for textual feature extraction. We believe that our proposed
visual-lingual environment will facilitate novel problem settings for the RL
community.
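
The baseline agent described in the abstract pairs a CNN over the visual hint with LSTMs over the textual input. Below is a minimal PyTorch sketch of that kind of late-fusion command scorer; the layer sizes, the separate command encoder, and the `MultimodalAgent` class are illustrative assumptions, not the authors' exact architecture.

```python
# Minimal sketch of a CNN + LSTM multimodal agent, loosely following the
# baseline described in the abstract. Layer sizes, vocabulary handling, and
# the command-scoring head are illustrative assumptions.
import torch
import torch.nn as nn


class MultimodalAgent(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        # CNN encoder for the visual hint (e.g. a rendered clue image).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, hidden_dim),
        )
        # LSTM encoders for the textual observation and candidate commands.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.text_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.cmd_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Scores each (fused state, command) pair.
        self.scorer = nn.Sequential(
            nn.Linear(3 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, image, obs_tokens, cmd_tokens):
        # image: (B, 3, H, W); obs_tokens: (B, T); cmd_tokens: (B, C, Tc)
        vis = self.cnn(image)                                 # (B, hidden)
        _, (txt, _) = self.text_lstm(self.embed(obs_tokens))  # (1, B, hidden)
        txt = txt.squeeze(0)                                  # (B, hidden)
        B, C, Tc = cmd_tokens.shape
        _, (cmd, _) = self.cmd_lstm(self.embed(cmd_tokens.view(B * C, Tc)))
        cmd = cmd.squeeze(0).view(B, C, -1)                   # (B, C, hidden)
        state = torch.cat([vis, txt], dim=-1)                 # (B, 2*hidden)
        state = state.unsqueeze(1).expand(-1, C, -1)          # (B, C, 2*hidden)
        return self.scorer(torch.cat([state, cmd], dim=-1)).squeeze(-1)  # (B, C)


if __name__ == "__main__":
    # Dummy step: score 4 candidate text commands for one observation.
    agent = MultimodalAgent(vocab_size=1000)
    scores = agent(torch.rand(1, 3, 64, 64),
                   torch.randint(0, 1000, (1, 20)),
                   torch.randint(0, 1000, (1, 4, 6)))
    print(scores.shape)  # torch.Size([1, 4])
```

In an RL loop, the scores over admissible commands would typically be turned into a policy (e.g. via a softmax) and trained with a standard policy-gradient or Q-learning objective; that choice is left open here.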
Related papers
- Bridging Environments and Language with Rendering Functions and Vision-Language Models [7.704773649029078]
Vision-language models (VLMs) have tremendous potential for grounding language.
This paper introduces a novel decomposition of the problem of building language-conditioned agents (LCAs).
We also explore several enhancements to the speed and quality of VLM-based LCAs.
arXiv Detail & Related papers (2024-09-24T12:24:07Z)
- Scaling Instructable Agents Across Many Simulated Worlds [70.97268311053328]
Our goal is to develop an agent that can accomplish anything a human can do in any simulated 3D environment.
Our approach focuses on language-driven generality while imposing minimal assumptions.
Our agents interact with environments in real-time using a generic, human-like interface.
arXiv Detail & Related papers (2024-03-13T17:50:32Z)
- LARP: Language-Agent Role Play for Open-World Games [19.80040627487576]
Language Agent for Role-Playing (LARP) is a cognitive architecture that encompasses memory processing and a decision-making assistant.
The framework refines interactions between users and agents that are predefined with unique backgrounds and personalities.
It highlights the diverse uses of language models in a range of areas such as entertainment, education, and various simulation scenarios.
arXiv Detail & Related papers (2023-12-24T10:08:59Z)
- Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
- ScriptWorld: Text Based Environment For Learning Procedural Knowledge [2.0491741153610334]
ScriptWorld is a text-based environment for teaching agents about real-world daily chores.
We provide gaming environments for 10 daily activities and perform a detailed analysis of the proposed environment.
We leverage features obtained from pre-trained language models in the RL agents.
arXiv Detail & Related papers (2023-07-08T05:43:03Z)
- Inner Monologue: Embodied Reasoning through Planning with Language Models [81.07216635735571]
Large Language Models (LLMs) can be applied to domains beyond natural language processing.
LLMs planning in embodied environments need to consider not just what skills to do, but also how and when to do them.
We propose that by leveraging environment feedback, LLMs are able to form an inner monologue that allows them to more richly process and plan in robotic control scenarios.
arXiv Detail & Related papers (2022-07-12T15:20:48Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- SILG: The Multi-environment Symbolic Interactive Language Grounding Benchmark [62.34200575624785]
We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG).
SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack).
We evaluate recent advances such as egocentric local convolution, recurrent state-tracking, entity-centric attention, and pretrained LM using SILG.
arXiv Detail & Related papers (2021-10-20T17:02:06Z)
- Semantic Tracklets: An Object-Centric Representation for Visual Multi-Agent Reinforcement Learning [126.57680291438128]
We study whether scalability can be achieved via a disentangled representation.
We evaluate semantic tracklets on the visual multi-agent particle environment (VMPE) and on the challenging visual multi-agent GFootball environment.
Notably, this method is the first to successfully learn a strategy for five players in the GFootball environment using only visual data.
arXiv Detail & Related papers (2021-08-06T22:19:09Z)
- Zero-Shot Compositional Policy Learning via Language Grounding [13.45138913186308]
Humans can adapt to new tasks quickly by leveraging prior knowledge about the world such as language descriptions.
We introduce a new research platform BabyAI++ in which the dynamics of environments are disentangled from visual appearance.
We find that current language-guided RL/IL techniques overfit to the training environments and suffer from a huge performance drop when facing unseen combinations.
arXiv Detail & Related papers (2020-04-15T16:58:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.