MoDist: Motion Distillation for Self-supervised Video Representation
Learning
- URL: http://arxiv.org/abs/2106.09703v1
- Date: Thu, 17 Jun 2021 17:57:11 GMT
- Title: MoDist: Motion Distillation for Self-supervised Video Representation
Learning
- Authors: Fanyi Xiao and Joseph Tighe and Davide Modolo
- Abstract summary: MoDist is a novel method to distill motion information into self-supervised video representations.
We show that MoDist focuses more on foreground motion regions and thus generalizes better to downstream tasks.
- Score: 27.05772951598066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MoDist as a novel method to explicitly distill motion information
into self-supervised video representations. Compared to previous video
representation learning methods that mostly focus on learning motion cues
implicitly from RGB inputs, we show that the representation learned with our
MoDist method focuses more on foreground motion regions and thus generalizes
better to downstream tasks. To achieve this, MoDist enriches standard
contrastive learning objectives for RGB video clips with a cross-modal learning
objective between a Motion pathway and a Visual pathway. We evaluate MoDist on
several datasets for both action recognition (UCF101/HMDB51/SSv2) and
action detection (AVA), and demonstrate state-of-the-art self-supervised
performance on all datasets. Furthermore, we show that the MoDist representation
can be as effective as (in some cases even better than) representations learned
with full supervision. Given its simplicity, we hope MoDist could serve as a
strong baseline for future research in self-supervised video representation
learning.
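To make the training objective concrete, below is a minimal sketch, in PyTorch-style Python, of how a cross-modal contrastive term between a Visual pathway and a Motion pathway could be combined with a standard RGB contrastive term. The encoder names, the InfoNCE formulation, the temperature, and the equal loss weighting are illustrative assumptions, not the paper's exact recipe; in practice the motion input could be, for example, optical flow computed from the same clip.

```python
# Hedged sketch of a MoDist-style objective: a standard contrastive loss on
# RGB clips plus a cross-modal loss that pulls visual embeddings toward
# motion embeddings of the same clip. Names and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def info_nce(query, keys, temperature=0.07):
    """InfoNCE loss: each query should match the key at the same batch index."""
    logits = query @ keys.t() / temperature               # (B, B) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)
    return F.cross_entropy(logits, targets)


def modist_style_loss(visual_encoder, motion_encoder,
                      rgb_clip_a, rgb_clip_b, motion_clip):
    """Combine an RGB contrastive term with a cross-modal (visual vs. motion) term.

    rgb_clip_a / rgb_clip_b: two augmented views of the same RGB clip, (B, C, T, H, W).
    motion_clip: a motion representation (e.g., optical flow) of the same clip.
    """
    z_a = F.normalize(visual_encoder(rgb_clip_a), dim=1)
    z_b = F.normalize(visual_encoder(rgb_clip_b), dim=1)
    z_m = F.normalize(motion_encoder(motion_clip), dim=1)

    rgb_loss = info_nce(z_a, z_b)          # standard visual contrastive objective
    cross_modal_loss = info_nce(z_a, z_m)  # distill motion cues into the visual pathway
    return rgb_loss + cross_modal_loss
```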
Related papers
- DynaMo: In-Domain Dynamics Pretraining for Visuo-Motor Control [18.737628473949048]
Imitation learning has proven to be a powerful tool for training complex visuomotor policies.
Current methods often require hundreds to thousands of expert demonstrations to handle high-dimensional visual observations.
We present DynaMo, a new in-domain, self-supervised method for learning visual representations.
arXiv Detail & Related papers (2024-09-18T17:59:43Z) - HomE: Homography-Equivariant Video Representation Learning [62.89516761473129]
We propose a novel method for representation learning of multi-view videos.
Our method learns an implicit mapping between different views, culminating in a representation space that maintains the homography relationship between neighboring views.
On action classification, our method obtains 96.4% 3-fold accuracy on the UCF101 dataset, better than most state-of-the-art self-supervised learning methods.
arXiv Detail & Related papers (2023-06-02T15:37:43Z) - Video Action Recognition with Attentive Semantic Units [23.384091957466588]
We exploit the semantic units hiding behind the action labels for more accurate action recognition.
We introduce a multi-region module (MRA) into the visual branch of Visual-Language Models (VLMs).
In fully-supervised learning, our method achieved 87.8% top-1 accuracy on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T03:44:15Z) - Masked Video Distillation: Rethinking Masked Feature Modeling for
Self-supervised Video Representation Learning [123.63301596019522]
Masked video distillation (MVD) is a simple yet effective two-stage masked feature modeling framework for video representation learning.
For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks.
We design a spatial-temporal co-teaching method for MVD to leverage the advantage of different teachers.
arXiv Detail & Related papers (2022-12-08T18:59:59Z) - Self-supervised Amodal Video Object Segmentation [57.929357732733926]
Amodal perception requires inferring the full shape of an object that is partially occluded.
This paper develops a new framework for amodal video object segmentation (SaVos).
arXiv Detail & Related papers (2022-10-23T14:09:35Z) - ViA: View-invariant Skeleton Action Representation Learning via Motion
Retargeting [10.811088895926776]
ViA is a novel View-Invariant Autoencoder for self-supervised skeleton action representation learning.
We conduct a study focusing on transfer-learning for skeleton-based action recognition with self-supervised pre-training on real-world data.
Our results showcase that skeleton representations learned from ViA are generic enough to improve upon state-of-the-art action classification accuracy.
arXiv Detail & Related papers (2022-08-31T18:49:38Z) - Self-Supervised Video Representation Learning with Motion-Contrastive
Perception [13.860736711747284]
We propose the Motion-Contrastive Perception Network (MCPNet).
MCPNet consists of two branches: Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z) - CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z) - RSPNet: Relative Speed Perception for Unsupervised Video Representation
Learning [100.76672109782815]
We study unsupervised video representation learning that seeks to learn both motion and appearance features from unlabeled video only.
It is difficult to construct a self-supervised task that models both motion and appearance features well.
We propose a new way to perceive the playback speed and exploit the relative speed between two video clips as labels.
arXiv Detail & Related papers (2020-10-27T16:42:50Z) - Exploring Relations in Untrimmed Videos for Self-Supervised Learning [17.670226952829506]
Existing self-supervised learning methods mainly rely on trimmed videos for model training.
We propose a novel self-supervised method, referred to as Exploring Relations in Untrimmed Videos (ERUV).
ERUV is able to learn richer representations and it outperforms state-of-the-art self-supervised methods with significant margins.
arXiv Detail & Related papers (2020-08-06T15:29:25Z) - Memory-augmented Dense Predictive Coding for Video Representation
Learning [103.69904379356413]
We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for the task.
We investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or both.
In all cases, we demonstrate state-of-the-art or comparable performance relative to other approaches, with orders of magnitude less training data.
arXiv Detail & Related papers (2020-08-03T17:57:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.