Tackling Visual Control via Multi-View Exploration Maximization
- URL: http://arxiv.org/abs/2211.15233v1
- Date: Mon, 28 Nov 2022 11:29:56 GMT
- Title: Tackling Visual Control via Multi-View Exploration Maximization
- Authors: Mingqi Yuan, Xin Jin, Bo Li, Wenjun Zeng
- Abstract summary: MEM is the first approach that combines multi-view representation learning and reward-driven exploration in reinforcement learning (RL)
We evaluate MEM on various tasks from DeepMind Control Suite and Procgen games.
- Score: 64.8463574294237
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MEM: Multi-view Exploration Maximization for tackling complex
visual control tasks. To the best of our knowledge, MEM is the first approach
that combines multi-view representation learning and intrinsic reward-driven
exploration in reinforcement learning (RL). More specifically, MEM first
extracts the specific and shared information of multi-view observations to form
high-quality features before performing RL on the learned features, enabling
the agent to fully comprehend the environment and yield better actions.
Furthermore, MEM transforms the multi-view features into intrinsic rewards
based on entropy maximization to encourage exploration. As a result, MEM can
significantly promote the sample-efficiency and generalization ability of the
RL agent, facilitating solving real-world problems with high-dimensional
observations and sparse reward space. We evaluate MEM on various tasks from
DeepMind Control Suite and Procgen games. Extensive simulation results
demonstrate that MEM can achieve superior performance and outperform the
benchmarking schemes with simple architecture and higher efficiency.
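The abstract describes MEM's two ingredients only at a high level: fusing per-view features and converting them into an entropy-maximizing intrinsic reward. The sketch below illustrates that general recipe with a particle-based (k-nearest-neighbour) entropy estimate over fused features; the concatenation fusion, the k-NN estimator, and all function names are illustrative assumptions rather than MEM's actual architecture.

```python
import numpy as np

def fuse_multi_view(features_per_view):
    """Placeholder fusion: concatenate per-view feature vectors.
    MEM's encoder separates view-specific and shared information;
    this stub only illustrates the interface, not that architecture."""
    return np.concatenate(features_per_view, axis=-1)

def knn_entropy_reward(batch_features, k=3):
    """Particle-based entropy estimate used as an intrinsic reward:
    the distance from each fused feature to its k-th nearest neighbour
    in the batch. A larger distance means lower local density, i.e. a
    less-visited region of feature space, so the reward favours it."""
    diffs = batch_features[:, None, :] - batch_features[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)    # pairwise distances in the batch
    knn_dists = np.sort(dists, axis=1)[:, k]  # index 0 is the point itself
    return np.log(1.0 + knn_dists)            # per-sample intrinsic reward

# Usage: two camera views of a 32-sample batch, 16-dim features per view.
view_a = np.random.randn(32, 16)
view_b = np.random.randn(32, 16)
fused = fuse_multi_view([view_a, view_b])
r_intrinsic = knn_entropy_reward(fused, k=3)
print(r_intrinsic.shape)  # (32,)
```

In practice such an intrinsic reward would be added to the task reward with a weighting coefficient (e.g. r = r_ext + beta * r_int), but the abstract does not specify MEM's exact weighting or entropy estimator.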
Related papers
- Brain-Inspired Stepwise Patch Merging for Vision Transformers [6.108377966393714]
We propose a novel technique called Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better.
Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models.
arXiv Detail & Related papers (2024-09-11T03:04:46Z)
- MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct [148.39859547619156]
We propose MMEvol, a novel multimodal instruction data evolution framework.
MMEvol iteratively improves data quality through a refined combination of fine-grained perception, cognitive reasoning, and interaction evolution.
Our approach reaches state-of-the-art (SOTA) performance in nine tasks using significantly less data compared to state-of-the-art models.
arXiv Detail & Related papers (2024-09-09T17:44:00Z)
- SG-MIM: Structured Knowledge Guided Efficient Pre-training for Dense Prediction [17.44991827937427]
Masked Image Modeling techniques have redefined the landscape of computer vision.
Despite their success, the full potential of MIM-based methods in dense prediction tasks, particularly in depth estimation, remains untapped.
We propose SG-MIM, a novel Structured knowledge Guided Masked Image Modeling framework designed to enhance dense prediction tasks by utilizing structured knowledge alongside images.
arXiv Detail & Related papers (2024-09-04T08:24:53Z)
- Sample Efficient Myopic Exploration Through Multitask Reinforcement Learning with Diverse Tasks [53.44714413181162]
This paper shows that when an agent is trained on a sufficiently diverse set of tasks, a generic policy-sharing algorithm with myopic exploration design can be sample-efficient.
To the best of our knowledge, this is the first theoretical demonstration of the "exploration benefits" of MTRL.
arXiv Detail & Related papers (2024-03-03T22:57:44Z)
- UMIE: Unified Multimodal Information Extraction with Instruction Tuning [12.777967562175437]
We propose UMIE, a unified multimodal information extractor, to unify three MIE tasks as a generation problem using instruction tuning.
Extensive experiments show that our single UMIE outperforms various state-of-the-art (SoTA) methods across six MIE datasets on three tasks.
Our research serves as an initial step towards a unified MIE model and initiates the exploration into both instruction tuning and large language models within the MIE domain.
arXiv Detail & Related papers (2024-01-05T22:52:15Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Summary-Oriented Vision Modeling for Multimodal Abstractive Summarization [63.320005222549646]
Multimodal abstractive summarization (MAS) aims to produce a concise summary given the multimodal data (text and vision).
We propose to improve the summary quality through summary-oriented visual features.
Experiments on 44 languages, covering mid-high, low-, and zero-resource scenarios, verify the effectiveness and superiority of the proposed approach.
arXiv Detail & Related papers (2022-12-15T09:05:26Z)
- Collaborative Attention Mechanism for Multi-View Action Recognition [75.33062629093054]
We propose a collaborative attention mechanism (CAM) for solving the multi-view action recognition problem.
The proposed CAM detects the attention differences among multi-view, and adaptively integrates frame-level information to benefit each other.
Experiments on four action datasets illustrate the proposed CAM achieves better results for each view and also boosts multi-view performance.
arXiv Detail & Related papers (2020-09-14T17:33:10Z)