Zero-shot cross-modal transfer of Reinforcement Learning policies
through a Global Workspace
- URL: http://arxiv.org/abs/2403.04588v1
- Date: Thu, 7 Mar 2024 15:35:29 GMT
- Title: Zero-shot cross-modal transfer of Reinforcement Learning policies
through a Global Workspace
- Authors: Léopold Maytié, Benjamin Devillers, Alexandre Arnold, Rufin VanRullen
- Abstract summary: We train a 'Global Workspace' to exploit information collected about the environment via two input modalities.
In two distinct environments and tasks, our results reveal the model's ability to perform zero-shot cross-modal transfer between input modalities.
- Score: 48.24821328103934
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Humans perceive the world through multiple senses, enabling them to create a
comprehensive representation of their surroundings and to generalize
information across domains. For instance, when a textual description of a scene
is given, humans can mentally visualize it. In fields like robotics and
Reinforcement Learning (RL), agents can also access information about the
environment through multiple sensors; yet redundancy and complementarity
between sensors are difficult to exploit as a source of robustness (e.g. against
sensor failure) or generalization (e.g. transfer across domains). Prior
research demonstrated that a robust and flexible multimodal representation can
be efficiently constructed based on the cognitive science notion of a 'Global
Workspace': a unique representation trained to combine information across
modalities, and to broadcast its signal back to each modality. Here, we explore
whether such a brain-inspired multimodal representation could be advantageous
for RL agents. First, we train a 'Global Workspace' to exploit information
collected about the environment via two input modalities (a visual input, or an
attribute vector representing the state of the agent and/or its environment).
Then, we train an RL agent policy using this frozen Global Workspace. In two
distinct environments and tasks, our results reveal the model's ability to
perform zero-shot cross-modal transfer between input modalities, i.e. to apply
to image inputs a policy previously trained on attribute vectors (and
vice-versa), without additional training or fine-tuning. Variants and ablations
of the full Global Workspace (including a CLIP-like multimodal representation
trained via contrastive learning) did not display the same generalization
abilities.
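To make the pipeline above concrete, here is a minimal PyTorch sketch of the general idea: per-modality encoders map a visual embedding or an attribute vector into a shared Global Workspace latent, decoders broadcast that latent back to each modality, and an RL policy is trained on top of the frozen latent so it can be driven by either modality at test time. All module names, dimensions, and the simplified reconstruction/translation/cycle objectives are illustrative assumptions, not the exact architecture or losses used in the paper.

```python
# Minimal sketch (PyTorch) of the idea described above: a shared "Global Workspace"
# (GW) latent is trained to encode either modality and broadcast back to both, then
# frozen and used as the observation encoder for an RL policy. Module names, sizes
# and the simplified losses below are illustrative, not the paper's exact setup.
import torch
import torch.nn as nn

GW_DIM = 12  # shared workspace dimension (illustrative)

def mlp(inp, out):
    return nn.Sequential(nn.Linear(inp, out), nn.ReLU(), nn.Linear(out, out))

class GlobalWorkspace(nn.Module):
    def __init__(self, vision_dim=64, attr_dim=8):
        super().__init__()
        # per-modality encoders into, and decoders out of, the shared GW latent
        self.enc = nn.ModuleDict({"vision": mlp(vision_dim, GW_DIM),
                                  "attr":   mlp(attr_dim, GW_DIM)})
        self.dec = nn.ModuleDict({"vision": mlp(GW_DIM, vision_dim),
                                  "attr":   mlp(GW_DIM, attr_dim)})

    def encode(self, x, modality):
        return self.enc[modality](x)

    def broadcast(self, z):
        # decode the GW latent back to every modality ("broadcast")
        return {m: d(z) for m, d in self.dec.items()}

def gw_losses(gw, batch):
    """Simplified demonstration-level objectives: within-modality reconstruction
    through the GW, cross-modal translation on paired data, and a full cycle."""
    v, a = batch["vision"], batch["attr"]
    zv, za = gw.encode(v, "vision"), gw.encode(a, "attr")
    mse = nn.functional.mse_loss
    recon = mse(gw.broadcast(zv)["vision"], v) + mse(gw.broadcast(za)["attr"], a)
    trans = mse(gw.broadcast(zv)["attr"], a) + mse(gw.broadcast(za)["vision"], v)
    cycle = mse(gw.broadcast(gw.encode(gw.broadcast(zv)["attr"], "attr"))["vision"], v)
    return recon + trans + cycle

# --- RL side: the policy only ever sees the (frozen) GW latent -----------------
policy = nn.Sequential(mlp(GW_DIM, 32), nn.Linear(32, 4))  # e.g. 4 discrete actions

def act(gw, policy, obs, modality):
    with torch.no_grad():
        z = gw.encode(obs, modality)     # frozen GW used as observation encoder
    return policy(z).argmax(dim=-1)

if __name__ == "__main__":
    gw = GlobalWorkspace()
    batch = {"vision": torch.randn(16, 64), "attr": torch.randn(16, 8)}
    loss = gw_losses(gw, batch)          # would be minimized before freezing the GW
    # Zero-shot cross-modal transfer: a policy trained while observing attribute
    # vectors can be applied, unchanged, to visual observations (and vice versa),
    # because both modalities map into the same GW latent space.
    a_from_attr = act(gw, policy, batch["attr"], "attr")
    a_from_vision = act(gw, policy, batch["vision"], "vision")
    print(loss.item(), a_from_attr.shape, a_from_vision.shape)
```

The design point the sketch illustrates is that the policy never observes raw inputs, only the shared workspace latent; because both modalities are encoded into that same space, swapping the input modality at test time requires no retraining or fine-tuning.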
Related papers
- Online Decision MetaMorphFormer: A Casual Transformer-Based Reinforcement Learning Framework of Universal Embodied Intelligence [2.890656584329591]
Online Decision MetaMorphFormer (ODM) aims to achieve self-awareness, environment recognition, and action planning.
ODM can be applied to any arbitrary agent with a multi-joint body, located in different environments, and trained with different types of tasks using large-scale pre-trained datasets.
arXiv Detail & Related papers (2024-09-11T15:22:43Z)
- Vision-Language Models Provide Promptable Representations for Reinforcement Learning [67.40524195671479]
We propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied reinforcement learning (RL).
We show that our approach can use chain-of-thought prompting to produce representations of common-sense semantic reasoning, improving policy performance in novel scenes by 1.5 times.
arXiv Detail & Related papers (2024-02-05T00:48:56Z)
- Agent AI: Surveying the Horizons of Multimodal Interaction [83.18367129924997]
"Agent AI" is a class of interactive systems that can perceive visual stimuli, language inputs, and other environmentally-grounded data.
We envision a future where people can easily create any virtual reality or simulated scene and interact with agents embodied within the virtual environment.
arXiv Detail & Related papers (2024-01-07T19:11:18Z)
- Invariance is Key to Generalization: Examining the Role of Representation in Sim-to-Real Transfer for Visual Navigation [35.01394611106655]
The key to generalization is a representation rich enough to capture all task-relevant information.
We experimentally study such a representation for visual navigation.
We show that our representation reduces the A-distance between the training and test domains.
arXiv Detail & Related papers (2023-10-23T15:15:19Z)
- Semi-supervised Multimodal Representation Learning through a Global Workspace [2.8948274245812335]
"Global Workspace" is a shared representation for two input modalities.
This architecture is amenable to self-supervised training via cycle-consistency.
We show that such an architecture can be trained to align and translate between two modalities with very little need for matched data.
arXiv Detail & Related papers (2023-06-27T12:41:36Z)
- Adaptive action supervision in reinforcement learning from real-world multi-agent demonstrations [10.174009792409928]
We propose a method for adaptive action supervision in RL from real-world demonstrations in multi-agent scenarios.
In experiments using chase-and-escape and football tasks with different dynamics between the unknown source and target environments, we show that our approach achieved a balance between reproducing the demonstrations and generalization ability, compared with the baselines.
arXiv Detail & Related papers (2023-05-22T13:33:37Z)
- Denoised MDPs: Learning World Models Better Than the World Itself [94.74665254213588]
This work categorizes information out in the wild into four types based on controllability and relation with reward, and formulates useful information as that which is both controllable and reward-relevant.
Experiments on variants of DeepMind Control Suite and RoboDesk demonstrate superior performance of our denoised world model over using raw observations alone.
arXiv Detail & Related papers (2022-06-30T17:59:49Z)
- Semantic Tracklets: An Object-Centric Representation for Visual Multi-Agent Reinforcement Learning [126.57680291438128]
We study whether scalability can be achieved via a disentangled representation.
We evaluate semantic tracklets on the visual multi-agent particle environment (VMPE) and on the challenging visual multi-agent GFootball environment.
Notably, this method is the first to successfully learn a strategy for five players in the GFootball environment using only visual data.
arXiv Detail & Related papers (2021-08-06T22:19:09Z)
- PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning [102.36450942613091]
We propose an inverse reinforcement learning algorithm, called inverse temporal difference learning (ITD).
We show how to seamlessly integrate ITD with learning from online environment interactions, arriving at a novel algorithm for reinforcement learning with demonstrations, called $\Psi\Phi$-learning.
arXiv Detail & Related papers (2021-02-24T21:12:09Z)