CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition
- URL: http://arxiv.org/abs/2301.06018v1
- Date: Sun, 15 Jan 2023 05:07:41 GMT
- Title: CMAE-V: Contrastive Masked Autoencoders for Video Action Recognition
- Authors: Cheng-Ze Lu, Xiaojie Jin, Zhicheng Huang, Qibin Hou, Ming-Ming Cheng,
Jiashi Feng
- Abstract summary: CMAE for video action recognition can generate stronger feature representations than its counterpart based on pure masked autoencoders.
With a hybrid architecture, CMAE-V achieves 82.2% and 71.6% top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets, respectively.
- Score: 140.22700085735215
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Masked Autoencoder (CMAE), a new self-supervised framework,
has shown its potential for learning expressive feature representations in
visual image recognition. This work shows that CMAE also generalizes well to
video action recognition without modifying the architecture or the loss
criterion. By directly replacing the original pixel shift with a temporal
shift, our CMAE for video action recognition, CMAE-V for short, generates
stronger feature representations than its counterpart based on pure masked
autoencoders. Notably, CMAE-V with a hybrid architecture achieves 82.2% and
71.6% top-1 accuracy on the Kinetics-400 and Something-Something V2 datasets,
respectively. We hope this report provides useful inspiration for future work.
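The core change described in the abstract is easy to picture in code: the two views for the contrastive branch are offset in time rather than in space. The snippet below is only a minimal sketch of such a temporal-shift view generator, assuming a PyTorch setup; the function name, the (T, C, H, W) clip layout, and the `max_shift` parameter are illustrative assumptions, not the authors' implementation.

```python
import torch

def temporally_shifted_views(frames: torch.Tensor,
                             clip_len: int = 16,
                             max_shift: int = 4):
    """Sample two sub-clips whose start frames differ by a small temporal shift.

    Hypothetical sketch of the idea in the abstract: instead of shifting the
    two contrastive views spatially (the pixel shift used by image CMAE), the
    second view starts a few frames later than the first. `frames` is assumed
    to be a (T, C, H, W) frame buffer long enough to fit the clip plus shift.
    """
    total = frames.shape[0]
    start = int(torch.randint(0, total - clip_len - max_shift + 1, (1,)))
    shift = int(torch.randint(0, max_shift + 1, (1,)))
    view_a = frames[start:start + clip_len]                  # source view
    view_b = frames[start + shift:start + shift + clip_len]  # temporally shifted view
    return view_a, view_b


# Usage: one view would go to the online encoder and the other to the
# momentum target encoder of the contrastive branch (a CMAE-style setup;
# the buffer size and clip length here are arbitrary).
frames = torch.randn(32, 3, 224, 224)
view_a, view_b = temporally_shifted_views(frames)
```

The intuition behind this swap is that the two views now show the same scene at slightly different moments, so the contrastive objective encourages temporal rather than purely spatial invariance.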
Related papers
- MV-CC: Mask Enhanced Video Model for Remote Sensing Change Caption [8.062368743143388]
We introduce a novel video-model-based paradigm that does not require designing a fusion module.
Specifically, we use an off-the-shelf video encoder to simultaneously extract the temporal and spatial features of bi-temporal images.
Our proposed method obtains better performance than other state-of-the-art RSICC methods.
arXiv Detail & Related papers (2024-10-31T14:02:40Z)
- Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- Contrastive Masked Autoencoders are Stronger Vision Learners [114.16568579208216]
Contrastive Masked Autoencoders (CMAE) is a new self-supervised pre-training method for learning more comprehensive and capable vision representations.
CMAE achieves the state-of-the-art performance on highly competitive benchmarks of image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2022-07-27T14:04:22Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels (a minimal sketch of this masking step appears after the list below).
Coupling an asymmetric encoder-decoder design with a high masking ratio enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
- An Emerging Coding Paradigm VCM: A Scalable Coding Approach Beyond Feature and Signal [99.49099501559652]
Video Coding for Machine (VCM) aims to bridge the gap between visual feature compression and classical video coding.
We employ a conditional deep generative network to reconstruct video frames under the guidance of learned motion patterns.
By learning to extract sparse motion patterns via a predictive model, the network leverages the feature representation to generate the appearance of the frames to be coded.
arXiv Detail & Related papers (2020-01-09T14:18:18Z)
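Several of the entries above build on MAE-style random patch masking ("mask random patches of the input image and reconstruct the missing pixels"). As a reference for that step, here is a minimal sketch; the 16x16 patch size, the 75% mask ratio, and the tensor shapes follow commonly used MAE defaults and are assumptions rather than code from any of the cited papers.

```python
import torch

def random_patch_mask(images: torch.Tensor,
                      patch_size: int = 16,
                      mask_ratio: float = 0.75):
    """Minimal sketch of MAE-style random patch masking.

    Splits each image into non-overlapping patches, keeps a random subset,
    and returns the visible patches plus a boolean mask (True = masked).
    The 75% ratio and patch size are common MAE defaults, not values taken
    from the papers listed above.
    """
    b, c, h, w = images.shape
    p = patch_size
    num_patches = (h // p) * (w // p)
    # Flatten each image into (B, N, p*p*C) patch tokens.
    patches = images.unfold(2, p, p).unfold(3, p, p)           # B, C, H/p, W/p, p, p
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, num_patches, -1)

    num_keep = int(num_patches * (1 - mask_ratio))
    noise = torch.rand(b, num_patches)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]               # random subset per image

    mask = torch.ones(b, num_patches, dtype=torch.bool)
    mask.scatter_(1, keep_idx, False)                           # False = visible patch
    visible = torch.gather(
        patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, patches.shape[-1]))
    return visible, mask


# Usage: only `visible` patches would be fed to the encoder; the decoder
# reconstructs pixels at positions where `mask` is True.
imgs = torch.randn(2, 3, 224, 224)
visible, mask = random_patch_mask(imgs)
```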