Masked Contrastive Representation Learning for Reinforcement Learning
- URL: http://arxiv.org/abs/2010.07470v1
- Date: Thu, 15 Oct 2020 02:00:10 GMT
- Title: Masked Contrastive Representation Learning for Reinforcement Learning
- Authors: Jinhua Zhu, Yingce Xia, Lijun Wu, Jiajun Deng, Wengang Zhou, Tao Qin,
Houqiang Li
- Abstract summary: CURL, which uses contrastive learning to extract high-level features from the raw pixels of individual video frames, is a sample-efficient algorithm.
We propose a new algorithm, masked contrastive representation learning for RL, that takes the correlation among consecutive inputs into consideration.
Our method achieves consistent improvements over CURL on $14$ out of $16$ environments from the DMControl suite and $21$ out of $26$ environments from Atari 2600 Games.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Improving sample efficiency is a key research problem in reinforcement
learning (RL), and CURL, which uses contrastive learning to extract high-level
features from raw pixels of individual video frames, is an efficient
algorithm (Srinivas et al., 2020). We observe that consecutive video frames in
a game are highly correlated but CURL deals with them independently. To further
improve data efficiency, we propose a new algorithm, masked contrastive
representation learning for RL, that takes the correlation among consecutive
inputs into consideration. In addition to the CNN encoder and the policy
network in CURL, our method introduces an auxiliary Transformer module to
leverage the correlations among video frames. During training, we randomly mask
the features of several frames, and use the CNN encoder and Transformer to
reconstruct them based on the context frames. The CNN encoder and Transformer
are jointly trained via contrastive learning where the reconstructed features
should be similar to the ground-truth ones while dissimilar to others. During
inference, the CNN encoder and the policy network are used to take actions, and
the Transformer module is discarded. Our method achieves consistent
improvements over CURL on $14$ out of $16$ environments from the DMControl suite
and $21$ out of $26$ environments from Atari 2600 Games. The code is available
at https://github.com/teslacool/m-curl.
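The training objective described above lends itself to a compact sketch. The following PyTorch code is an illustrative reconstruction based only on the abstract, not the authors' released implementation (see the repository above): the module sizes, mask ratio, and names such as MaskedCURL and auxiliary_loss are assumptions, and a stop-gradient stands in for the momentum encoder that CURL-style methods typically use for the contrastive keys.

```python
# Minimal sketch of the masked contrastive objective, assuming a simple
# CNN encoder and a small Transformer; hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedCURL(nn.Module):
    """Mask some frame features, reconstruct them with a Transformer from
    context frames, and train contrastively against the unmasked features."""

    def __init__(self, feat_dim=128, max_frames=32, mask_ratio=0.3):
        super().__init__()
        # CNN encoder over raw pixels (any conv stack ending in feat_dim).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Auxiliary Transformer that fills in masked frame features from
        # their context; it is discarded at inference time.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.pos = nn.Parameter(torch.zeros(1, max_frames, feat_dim))
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        self.W = nn.Parameter(torch.eye(feat_dim))  # bilinear similarity, as in CURL
        self.mask_ratio = mask_ratio

    def auxiliary_loss(self, frames):
        # frames: (B, T, C, H, W), consecutive frames of one trajectory.
        B, T = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(B, T, -1)
        # Ground-truth targets; a stop-gradient stands in for a momentum
        # encoder here (an assumption, not the authors' exact setup).
        targets = feats.detach()
        # Randomly mask the features of several frames.
        mask = torch.rand(B, T, device=frames.device) < self.mask_ratio
        inp = torch.where(mask.unsqueeze(-1),
                          self.mask_token.expand_as(feats), feats)
        recon = self.transformer(inp + self.pos[:, :T])
        # InfoNCE: each reconstructed masked feature should be most similar
        # to its own ground-truth feature among all frames in the batch.
        q = recon[mask]                       # (N, D) reconstructed features
        k = targets.flatten(0, 1)             # (B*T, D) candidate keys
        logits = q @ self.W @ k.t()           # (N, B*T) similarity scores
        pos_idx = mask.flatten().nonzero(as_tuple=True)[0]
        return F.cross_entropy(logits, pos_idx)
```

Consistent with the abstract, only self.encoder would feed the policy network when taking actions; the Transformer, mask token, and contrastive head exist purely as a training-time auxiliary task, so they add no cost at inference.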
Related papers
- Progressive Fourier Neural Representation for Sequential Video
Compilation [75.43041679717376]
Motivated by continual learning, this work investigates how to accumulate and transfer neural implicit representations for multiple complex videos over sequential encoding sessions.
We propose a novel method, Progressive Fourier Neural Representation (PFNR), that aims to find an adaptive and compact sub-module in Fourier space to encode videos in each training session.
We validate our PFNR method on the UVG8/17 and DAVIS50 video sequence benchmarks and achieve impressive performance gains over strong continual learning baselines.
arXiv Detail & Related papers (2023-06-20T06:02:19Z) - ConvTransSeg: A Multi-resolution Convolution-Transformer Network for
Medical Image Segmentation [14.485482467748113]
We propose a hybrid encoder-decoder segmentation model (ConvTransSeg).
It consists of a multi-layer CNN as the encoder for feature learning and the corresponding multi-level Transformer as the decoder for segmentation prediction.
Our method achieves the best performance in terms of Dice coefficient and average symmetric surface distance measures with low model complexity and memory consumption.
arXiv Detail & Related papers (2022-10-13T14:59:23Z) - Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training open a new route for visual recognition tasks.
We present Efficient Video Learning (EVL), an efficient framework for directly training high-quality video recognition models on top of frozen CLIP features.
arXiv Detail & Related papers (2022-08-06T17:38:25Z) - Cross-Architecture Self-supervised Video Representation Learning [42.267775859095664]
We present a new cross-architecture contrastive learning framework for self-supervised video representation learning.
We introduce a temporal self-supervised learning module that explicitly predicts the edit distance between two video sequences.
We evaluate our method on the tasks of video retrieval and action recognition on UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2022-05-26T12:41:19Z) - FILM: Frame Interpolation for Large Motion [20.04001872133824]
We present a frame interpolation algorithm that synthesizes multiple intermediate frames from two input images with large in-between motion.
Our approach outperforms state-of-the-art methods on the Xiph large motion benchmark.
arXiv Detail & Related papers (2022-02-10T08:48:18Z) - Dense Interaction Learning for Video-based Person Re-identification [75.03200492219003]
We propose a hybrid framework, Dense Interaction Learning (DenseIL), to tackle video-based person re-ID difficulties.
DenseIL contains a CNN encoder and a Dense Interaction (DI) decoder.
In our experiments, DenseIL consistently and significantly outperforms all state-of-the-art methods on multiple standard video-based re-ID datasets.
arXiv Detail & Related papers (2021-03-16T12:22:08Z) - CURL: Contrastive Unsupervised Representations for Reinforcement
Learning [93.57637441080603]
CURL extracts high-level features from raw pixels using contrastive learning.
On the DeepMind Control Suite, CURL is the first image-based algorithm to nearly match the sample-efficiency of methods that use state-based features.
arXiv Detail & Related papers (2020-04-08T17:40:43Z) - Content Adaptive and Error Propagation Aware Deep Video Compression [110.31693187153084]
We propose a content adaptive and error propagation aware video compression system.
Our method employs a joint training strategy by considering the compression performance of multiple consecutive frames instead of a single frame.
Instead of using the hand-crafted coding modes in the traditional compression systems, we design an online encoder updating scheme in our system.
arXiv Detail & Related papers (2020-03-25T09:04:24Z) - Temporally Coherent Embeddings for Self-Supervised Video Representation
Learning [2.216657815393579]
This paper presents TCE: Temporally Coherent Embeddings for self-supervised video representation learning.
The proposed method exploits the inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space.
With a simple but effective 2D-CNN backbone and only RGB stream inputs, TCE pre-trained representations outperform all previous self-supervised 2D-CNN and 3D-CNN models pre-trained on UCF101.
arXiv Detail & Related papers (2020-03-21T12:25:50Z) - An Information-rich Sampling Technique over Spatio-Temporal CNN for
Classification of Human Actions in Videos [5.414308305392762]
We propose a novel scheme for human action recognition in videos, using a 3-dimensional Convolutional Neural Network (3D CNN) based classifier.
The proposed 3D CNN architecture extracts features, followed by a Long Short-Term Memory (LSTM) network to recognize human actions.
Experiments on the KTH and WEIZMANN human action datasets show results comparable to state-of-the-art techniques.
arXiv Detail & Related papers (2020-02-06T05:07:41Z)