CLIP4MC: An RL-Friendly Vision-Language Model for Minecraft
- URL: http://arxiv.org/abs/2303.10571v1
- Date: Sun, 19 Mar 2023 05:20:52 GMT
- Title: CLIP4MC: An RL-Friendly Vision-Language Model for Minecraft
- Authors: Ziluo Ding, Hao Luo, Ke Li, Junpeng Yue, Tiejun Huang, and Zongqing Lu
- Abstract summary: In this paper, we propose a novel cross-modal contrastive learning framework, CLIP4MC.
We learn an RL-friendly vision-language model that serves as a reward function for open-ended tasks.
We construct a neat YouTube dataset based on the large-scale YouTube database provided by MineDojo.
- Score: 32.447102147806206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the essential missions in the AI research community is to build an
autonomous embodied agent that can attain high-level performance across a wide
spectrum of tasks. However, acquiring reward/penalty in all open-ended tasks is
unrealistic, making the Reinforcement Learning (RL) training procedure
impossible. In this paper, we propose CLIP4MC, a novel cross-modal contrastive
learning framework that aims to learn an RL-friendly vision-language model
serving as a reward function for open-ended tasks. Therefore, no
further task-specific reward design is needed. Intuitively, the model should
assess the similarity between the video snippet and the language prompt at
both the action and entity levels. To this end, a
motion encoder is proposed to capture the motion embeddings across different
intervals. The correlation scores are then used to construct the auxiliary
reward signal for RL agents. Moreover, we construct a neat YouTube dataset
based on the large-scale YouTube database provided by MineDojo. Specifically,
two rounds of filtering operations guarantee that the dataset covers enough
essential information and that the video-text pair is highly correlated.
Empirically, we show that the proposed method achieves better performance on RL
tasks compared with baselines.
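As a rough illustration of the reward idea in the abstract, the correlation score between a video-snippet embedding and a language-prompt embedding can be shaped into an auxiliary RL reward. The following is a minimal sketch only: the function names, toy embeddings, and the baseline-clipping choice are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch: turning a video-text correlation score into an auxiliary
# RL reward, CLIP4MC-style. All names and the clipping scheme are assumed.
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def auxiliary_reward(video_emb, prompt_emb, scale=1.0, baseline=0.0):
    """Shape a correlation score into a non-negative auxiliary reward.

    Scores below the baseline contribute no reward rather than a penalty,
    so unrelated snippets do not punish the agent."""
    score = cosine_similarity(video_emb, prompt_emb)
    return scale * max(score - baseline, 0.0)

# Toy vectors standing in for the motion/text encoder outputs.
video_emb = [0.2, 0.9, 0.1]
prompt_emb = [0.1, 0.95, 0.05]
reward = auxiliary_reward(video_emb, prompt_emb, baseline=0.5)
```

In practice the embeddings would come from the trained video and text encoders; the sketch only shows how a similarity score becomes a dense reward signal for the RL agent.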
Related papers
- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning [79.38140606606126]
We propose an algorithmic framework that fine-tunes vision-language models (VLMs) with reinforcement learning (RL).
Our framework provides a task description and then prompts the VLM to generate chain-of-thought (CoT) reasoning.
We demonstrate that our proposed framework enhances the decision-making capabilities of VLM agents across various tasks.
arXiv Detail & Related papers (2024-05-16T17:50:19Z)
- RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback [24.759613248409167]
Reward engineering has long been a challenge in Reinforcement Learning research.
We propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks.
We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains.
arXiv Detail & Related papers (2024-02-06T04:06:06Z)
- Deep Reinforcement Learning from Hierarchical Preference Design [99.46415116087259]
This paper shows that, by exploiting certain structures, one can ease the reward design process.
We propose a hierarchical reward modeling framework, HERON, for two scenarios: (I) the feedback signals naturally present a hierarchy; (II) the reward is sparse, but less important surrogate feedback is available to help policy learning.
arXiv Detail & Related papers (2023-09-06T00:44:29Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks that is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- INFOrmation Prioritization through EmPOWERment in Visual Model-Based RL [90.06845886194235]
We propose a modified objective for model-based reinforcement learning (RL).
We integrate a term inspired by variational empowerment into a state-space model based on mutual information.
We evaluate the approach on a suite of vision-based robot control tasks with natural video backgrounds.
arXiv Detail & Related papers (2022-04-18T23:09:23Z)
- Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a First-person Simulated 3D Environment [73.9469267445146]
First-person object-interaction tasks in high-fidelity, 3D, simulated environments such as the AI2Thor pose significant sample-efficiency challenges for reinforcement learning agents.
We show that one can learn object-interaction tasks from scratch without supervision by learning an attentive object-model as an auxiliary task.
arXiv Detail & Related papers (2020-10-28T19:27:26Z)
- Video Moment Retrieval via Natural Language Queries [7.611718124254329]
We propose a novel method for video moment retrieval (VMR) that achieves state-of-the-art (SOTA) performance on the R@1 metric.
Our model has a simple architecture, which enables faster training and inference while maintaining accuracy.
arXiv Detail & Related papers (2020-09-04T22:06:34Z)
- Forgetful Experience Replay in Hierarchical Reinforcement Learning from Demonstrations [55.41644538483948]
In this paper, we propose a combination of approaches that allow the agent to use low-quality demonstrations in complex vision-based environments.
Our proposed goal-oriented structuring of replay buffer allows the agent to automatically highlight sub-goals for solving complex hierarchical tasks in demonstrations.
The solution based on our algorithm beats all the solutions for the famous MineRL competition and allows the agent to mine a diamond in the Minecraft environment.
arXiv Detail & Related papers (2020-06-17T15:38:40Z)
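The goal-oriented replay-buffer structuring described in the last entry can be sketched as a buffer that buckets transitions by the sub-goal they achieved, so rare sub-goals from demonstrations are oversampled. The class, method names, and sampling scheme below are hypothetical illustrations, not the MineRL competition solution.

```python
# Hypothetical sketch of a goal-oriented replay buffer; all names assumed.
import random
from collections import defaultdict

class GoalOrientedReplayBuffer:
    """Stores transitions bucketed by achieved sub-goal, so sampling can
    up-weight rare but important sub-goals from demonstrations."""

    def __init__(self):
        self.buckets = defaultdict(list)

    def add(self, transition, achieved_subgoal):
        """Record a transition under the sub-goal it achieved."""
        self.buckets[achieved_subgoal].append(transition)

    def sample(self, batch_size, rng=random):
        """Sample sub-goals uniformly first, then a transition within each;
        this implicitly oversamples transitions for rare sub-goals."""
        goals = list(self.buckets)
        return [rng.choice(self.buckets[rng.choice(goals)])
                for _ in range(batch_size)]

# Toy usage with placeholder transitions and Minecraft-flavored sub-goals.
buf = GoalOrientedReplayBuffer()
buf.add(("obs1", "act1"), "chop_tree")
buf.add(("obs2", "act2"), "craft_table")
batch = buf.sample(4)
```

Sampling sub-goals uniformly rather than transitions uniformly is one simple way to keep sparse sub-goals visible to the learner; the actual algorithm in the paper may differ.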
This list is automatically generated from the titles and abstracts of the papers on this site.