MOOSS: Mask-Enhanced Temporal Contrastive Learning for Smooth State Evolution in Visual Reinforcement Learning
- URL: http://arxiv.org/abs/2409.02714v1
- Date: Mon, 2 Sep 2024 18:57:53 GMT
- Title: MOOSS: Mask-Enhanced Temporal Contrastive Learning for Smooth State Evolution in Visual Reinforcement Learning
- Authors: Jiarui Sun, M. Ugur Akcal, Wei Zhang, Girish Chowdhary
- Abstract summary: In visual Reinforcement Learning (RL), learning from pixel-based observations poses significant challenges for sample efficiency.
We introduce MOOSS, a novel framework that leverages a temporal contrastive objective with the help of graph-based spatial-temporal masking.
Our evaluation on multiple continuous and discrete control benchmarks shows that MOOSS outperforms previous state-of-the-art visual RL methods in terms of sample efficiency.
- Score: 8.61492882526007
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In visual Reinforcement Learning (RL), learning from pixel-based observations poses significant challenges for sample efficiency, primarily due to the complexity of extracting informative state representations from high-dimensional data. Previous methods, such as contrastive approaches, have made strides in improving sample efficiency but fall short of modeling the nuanced evolution of states. To address this, we introduce MOOSS, a novel framework that leverages a temporal contrastive objective with the help of graph-based spatial-temporal masking to explicitly model state evolution in visual RL. Specifically, we propose a self-supervised dual-component strategy that integrates (1) a graph construction of pixel-based observations for spatial-temporal masking, coupled with (2) a multi-level contrastive learning mechanism that enriches state representations by emphasizing temporal continuity and change of states. MOOSS advances the understanding of state dynamics by disrupting and learning from spatial-temporal correlations, which facilitates policy learning. Our comprehensive evaluation on multiple continuous and discrete control benchmarks shows that MOOSS outperforms previous state-of-the-art visual RL methods in terms of sample efficiency, demonstrating the effectiveness of our method. Our code is released at https://github.com/jsun57/MOOSS.
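To make the dual-component strategy concrete, here is a minimal PyTorch-style sketch of spatio-temporal masking combined with a temporal contrastive objective. Everything below (helper names, the block mask, the tiny encoder, the single-level loss) is an illustrative assumption, not the released MOOSS implementation; the paper's graph-based masking and multi-level objective are richer.

```python
# Minimal sketch (not the authors' code): contrast masked frame-stack states
# against their unmasked counterparts. All names and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_spatiotemporal(frames, patch=8, drop_prob=0.5):
    """Zero out random spatial patches, sharing the pattern across time steps
    (a crude stand-in for the paper's graph-based spatio-temporal masking)."""
    B, T, C, H, W = frames.shape
    gh, gw = H // patch, W // patch
    keep = (torch.rand(B, 1, 1, gh, gw, device=frames.device) > drop_prob).float()
    keep = keep.repeat_interleave(patch, dim=-2).repeat_interleave(patch, dim=-1)
    return frames * keep

class Encoder(nn.Module):
    """Tiny per-frame CNN encoder; returns one embedding per time step."""
    def __init__(self, c=3, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(c, 32, 5, 2), nn.ReLU(),
            nn.Conv2d(32, 32, 5, 2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))

    def forward(self, frames):                 # (B, T, C, H, W)
        B, T = frames.shape[:2]
        z = self.net(frames.flatten(0, 1))
        return z.view(B, T, -1)                # (B, T, dim)

def temporal_contrastive_loss(z_masked, z_full, tau=0.1):
    """InfoNCE: each masked state's positive is the unmasked state at the
    same (sample, time step); every other pair in the batch is a negative."""
    q = F.normalize(z_masked.flatten(0, 1), dim=-1)
    k = F.normalize(z_full.flatten(0, 1), dim=-1)
    logits = q @ k.t() / tau
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

enc = Encoder()
obs = torch.rand(4, 3, 3, 64, 64)              # 4 sequences of 3 RGB frames
loss = temporal_contrastive_loss(enc(mask_spatiotemporal(obs)), enc(obs))
loss.backward()
```

Disrupting spatial-temporal correlations with the mask and then contrasting against the intact sequence is what pushes the encoder to capture how states evolve over time.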
Related papers
- Learning Interpretable Policies in Hindsight-Observable POMDPs through Partially Supervised Reinforcement Learning [57.67629402360924]
We introduce the Partially Supervised Reinforcement Learning (PSRL) framework.
At the heart of PSRL is the fusion of both supervised and unsupervised learning.
We show that PSRL offers a potent balance, enhancing model interpretability while matching, and often significantly exceeding, the performance benchmarks set by traditional methods.
arXiv Detail & Related papers (2024-02-14T16:23:23Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Learning Generalizable Representations for Reinforcement Learning via Adaptive Meta-learner of Behavioral Similarities [43.327357653393015]
We propose a novel meta-learner-based framework for representation learning based on behavioral similarities for reinforcement learning.
We empirically demonstrate that our proposed framework outperforms state-of-the-art baselines on several benchmarks.
arXiv Detail & Related papers (2022-12-26T11:11:23Z)
- Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning [105.70602423944148]
We propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making.
VCR first imagines a future state with a learned dynamics model; instead of aligning this imagined state with the real state returned by the environment, it applies a $Q$-value head to both states and obtains two distributions of action values.
It has been demonstrated that our methods achieve new state-of-the-art performance for search-free RL algorithms.
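A hedged sketch of this value-consistency idea, assuming a generic encoder, latent dynamics model, and shared Q-head (these stand-ins do not reflect the paper's actual architecture or training setup):

```python
# Illustrative only: align action-value distributions computed from an
# imagined next state and the encoded real next state, not the states.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_actions = 64, 6
encoder = nn.Linear(3 * 84 * 84, dim)                  # stand-in pixel encoder
dynamics = nn.Linear(dim + n_actions, dim)             # latent transition model
q_head = nn.Linear(dim, n_actions)                     # shared Q-value head

obs, next_obs = torch.rand(2, 8, 3 * 84 * 84)          # flattened pixel batches
action = F.one_hot(torch.randint(n_actions, (8,)), n_actions).float()

z_imagined = dynamics(torch.cat([encoder(obs), action], dim=-1))
z_real = encoder(next_obs)                             # state from environment

# Softmaxed Q-values act as distributions over actions; the real branch is
# detached so it serves as the target of the consistency loss.
log_p_imagined = F.log_softmax(q_head(z_imagined), dim=-1)
p_real = F.softmax(q_head(z_real), dim=-1).detach()
loss = F.kl_div(log_p_imagined, p_real, reduction="batchmean")
loss.backward()
```

Matching value distributions instead of raw latents ties the representation to what actually matters for decision-making.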
arXiv Detail & Related papers (2022-06-25T03:02:25Z)
- Integrating Contrastive Learning with Dynamic Models for Reinforcement Learning from Images [31.413588478694496]
We argue that explicitly improving the Markovianity of the learned embedding is desirable.
We propose a self-supervised representation learning method which integrates contrastive learning with dynamic models.
arXiv Detail & Related papers (2022-03-02T14:39:17Z)
- Mask-based Latent Reconstruction for Reinforcement Learning [58.43247393611453]
Mask-based Latent Reconstruction (MLR) is proposed to predict complete state representations in the latent space from observations whose pixels are spatially and temporally masked.
Extensive experiments show that our MLR significantly improves the sample efficiency in deep reinforcement learning.
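A minimal sketch of the idea under assumed names: latents of the full observation sequence serve as reconstruction targets for a spatially and temporally masked copy (the paper uses a momentum target encoder; detaching here is a simplification):

```python
# Illustrative mask-based latent reconstruction; all names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 64 * 64, 64))
predictor = nn.Linear(64, 64)                 # predicts the complete latent

frames = torch.rand(8, 4, 3, 64, 64)          # (batch, time, C, H, W)

# Spatio-temporal pixel masking: drop whole frames and a spatial block.
masked = frames.clone()
masked[:, torch.rand(4) < 0.25] = 0.0         # temporal masking
masked[..., 16:32, 16:32] = 0.0               # crude spatial masking

z_target = encoder(frames.flatten(0, 1)).detach()   # targets: full frames
z_masked = encoder(masked.flatten(0, 1))
loss = F.mse_loss(predictor(z_masked), z_target)    # reconstruct in latent space
loss.backward()
```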
arXiv Detail & Related papers (2022-01-28T13:07:11Z)
- Accelerating Representation Learning with View-Consistent Dynamics in Data-Efficient Reinforcement Learning [12.485293708638292]
We propose to accelerate state representation learning by enforcing view-consistency on the dynamics.
We introduce the Multi-view Markov Decision Process (MMDP), a formalism that incorporates multiple views of the state.
Following the structure of MMDP, our method, View-Consistent Dynamics (VCD), learns state representations by training a view-consistent dynamics model in the latent space.
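A minimal sketch of the view-consistency objective, with additive noise standing in for image augmentation and illustrative module names:

```python
# Illustrative view-consistent dynamics: predictions from two views of the
# same observation should agree on the next latent state.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim, n_actions = 64, 4
encoder = nn.Linear(3 * 64 * 64, dim)
dynamics = nn.Linear(dim + n_actions, dim)

def augment(x):                                # stand-in for a real augmentation
    return x + 0.05 * torch.randn_like(x)

obs = torch.rand(8, 3 * 64 * 64)
action = F.one_hot(torch.randint(n_actions, (8,)), n_actions).float()

pred_a = dynamics(torch.cat([encoder(augment(obs)), action], dim=-1))
pred_b = dynamics(torch.cat([encoder(augment(obs)), action], dim=-1))

# One side is detached as the target, a common stop-gradient trick.
loss = F.mse_loss(pred_a, pred_b.detach())
loss.backward()
```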
arXiv Detail & Related papers (2022-01-18T14:28:30Z)
- Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning [6.523119805288132]
We present a novel technique for self-supervised video representation learning by: (a) decoupling the learning objective into two contrastive subtasks respectively emphasizing spatial and temporal features, and (b) performing it hierarchically to encourage multi-scale understanding.
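A simplified sketch of the two decoupled subtasks (the hierarchical, multi-scale part is omitted, and all names are illustrative):

```python
# Illustrative decoupled spatial/temporal contrastive subtasks.
import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(q, k, tau=0.1):
    """Standard InfoNCE: row i of q matches row i of k."""
    q, k = F.normalize(q, dim=-1), F.normalize(k, dim=-1)
    logits = q @ k.t() / tau
    return F.cross_entropy(logits, torch.arange(q.size(0)))

encoder = nn.Sequential(nn.Flatten(1), nn.Linear(3 * 32 * 32, 64))
clips = torch.rand(8, 4, 3, 32, 32)            # (batch, time, C, H, W)
augment = lambda x: x + 0.05 * torch.randn_like(x)

# Spatial subtask: two augmentations of the SAME frame are positives.
spatial_loss = info_nce(encoder(augment(clips[:, 0])),
                        encoder(augment(clips[:, 0])))

# Temporal subtask: frames from the same clip at DIFFERENT times are positives.
temporal_loss = info_nce(encoder(clips[:, 0]), encoder(clips[:, -1]))

(spatial_loss + temporal_loss).backward()      # decoupled objectives combined
```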
arXiv Detail & Related papers (2020-11-23T08:05:39Z)
- Self-supervised learning using consistency regularization of spatio-temporal data augmentation for action recognition [15.701647552427708]
We present a novel way to obtain a surrogate supervision signal from high-level feature maps under consistency regularization.
Our method achieves substantial improvements compared with state-of-the-art self-supervised learning methods for action recognition.
arXiv Detail & Related papers (2020-08-05T12:41:59Z)
- Dynamic Dual-Attentive Aggregation Learning for Visible-Infrared Person Re-Identification [208.1227090864602]
Visible-infrared person re-identification (VI-ReID) is a challenging cross-modality pedestrian retrieval problem.
Existing VI-ReID methods tend to learn global representations, which have limited discriminability and weak robustness to noisy images.
We propose a novel dynamic dual-attentive aggregation (DDAG) learning method by mining both intra-modality part-level and cross-modality graph-level contextual cues for VI-ReID.
arXiv Detail & Related papers (2020-07-18T03:08:13Z)