Representation learning from videos in-the-wild: An object-centric
approach
- URL: http://arxiv.org/abs/2010.02808v2
- Date: Tue, 9 Feb 2021 17:08:17 GMT
- Title: Representation learning from videos in-the-wild: An object-centric
approach
- Authors: Rob Romijnders, Aravindh Mahendran, Michael Tschannen, Josip Djolonga,
Marvin Ritter, Neil Houlsby, Mario Lucic
- Abstract summary: We propose a method to learn image representations from uncurated videos.
We combine a supervised loss from off-the-shelf object detectors and self-supervised losses which naturally arise from the video-shot-frame-object hierarchy present in each video.
- Score: 40.46013713992305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a method to learn image representations from uncurated videos. We
combine a supervised loss from off-the-shelf object detectors and
self-supervised losses which naturally arise from the video-shot-frame-object
hierarchy present in each video. We report competitive results on 19 transfer
learning tasks of the Visual Task Adaptation Benchmark (VTAB), and on 8
out-of-distribution-generalization tasks, and discuss the benefits and
shortcomings of the proposed approach. In particular, it improves over the
baseline on 18 of the 19 few-shot learning tasks and on all 8 out-of-distribution
generalization tasks. Finally, we perform several ablation studies and analyze
the impact of the pretrained object detector on the performance across this
suite of tasks.
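The abstract describes combining a supervised loss from off-the-shelf object detectors with self-supervised losses arising from the video-shot-frame-object hierarchy. A minimal sketch of such a combination, assuming a simple weighted sum (the term names mirror the hierarchy; the weights are illustrative assumptions, not the paper's values):

```python
import numpy as np

def combined_loss(det_loss, video_loss, shot_loss, frame_loss, object_loss,
                  weights=(1.0, 0.5, 0.5, 0.5, 0.5)):
    """Weighted sum of one supervised term (from a pretrained object
    detector) and four self-supervised terms, one per hierarchy level.
    Weights are hypothetical placeholders for illustration only."""
    terms = np.array([det_loss, video_loss, shot_loss, frame_loss, object_loss])
    return float(np.dot(np.array(weights), terms))
```

In practice each term would be computed per batch (e.g., a detection distillation loss plus contrastive losses over positives drawn from the same video, shot, frame, or object track) before being summed this way.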
Related papers
- Zero Shot Open-ended Video Inference [54.04466746939197]
We introduce an adaptable framework for conducting zero-shot open-ended inference tasks.
Our experiments span various video action datasets for goal inference and action recognition tasks.
Notably, the proposed framework exhibits the capability to generalize effectively to action recognition tasks.
arXiv Detail & Related papers (2024-01-23T03:45:05Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization [27.312423653997087]
Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing.
We propose three metrics featuring a desirable key frame: local dissimilarity, global consistency, and uniqueness.
We show that by refining the pre-trained features with a lightweight contrastively learned projection module, the frame-level importance scores can be further improved.
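The three key-frame criteria above (local dissimilarity, global consistency, uniqueness) can be sketched over precomputed, L2-normalized frame features; the exact formulations below are illustrative assumptions, not the paper's definitions:

```python
import numpy as np

def keyframe_scores(feats):
    """Score each frame by three criteria, given `feats` of shape (T, D)
    with L2-normalized rows. Formulations here are hypothetical sketches."""
    sim = feats @ feats.T                      # (T, T) cosine similarities
    T = sim.shape[0]
    mean_feat = feats.mean(axis=0)
    mean_feat /= np.linalg.norm(mean_feat)
    # Local dissimilarity: penalize similarity to adjacent frames.
    local = np.ones(T)
    local[1:] -= 0.5 * sim[np.arange(1, T), np.arange(T - 1)]
    local[:-1] -= 0.5 * sim[np.arange(T - 1), np.arange(1, T)]
    # Global consistency: similarity to the video-level mean feature.
    global_c = feats @ mean_feat
    # Uniqueness: one minus the highest similarity to any *other* frame.
    off = sim - np.eye(T)                      # mask self-similarity
    unique = 1.0 - off.max(axis=1)
    return local, global_c, unique
```

Frames scoring high on all three criteria would be candidate key frames; the paper's contribution of a contrastively learned projection module refining the features is not modeled here.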
arXiv Detail & Related papers (2022-11-18T07:01:28Z)
- SS-VAERR: Self-Supervised Apparent Emotional Reaction Recognition from Video [61.21388780334379]
This work focuses on the apparent emotional reaction recognition from the video-only input, conducted in a self-supervised fashion.
The network is first pre-trained on different self-supervised pretext tasks and later fine-tuned on the downstream target task.
arXiv Detail & Related papers (2022-10-20T15:21:51Z)
- Hierarchical Self-supervised Representation Learning for Movie Understanding [24.952866206036536]
We propose a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of our hierarchical movie understanding model.
Specifically, we propose to pretrain the low-level video backbone using a contrastive learning objective, while pretraining the higher-level video contextualizer using an event mask prediction task.
We first show that our self-supervised pretraining strategies are effective and lead to improved performance on all tasks and metrics on the VidSitu benchmark [37] (e.g., improving semantic role prediction from 47% to 61% CIDEr scores).
arXiv Detail & Related papers (2022-04-06T21:28:41Z)
- Learning Actor-centered Representations for Action Localization in Streaming Videos using Predictive Learning [18.757368441841123]
Event perception tasks such as recognizing and localizing actions in streaming videos are essential for tackling visual understanding tasks.
We tackle the problem of learning actor-centered representations through the notion of continual hierarchical predictive learning.
Inspired by cognitive theories of event perception, we propose a novel, self-supervised framework.
arXiv Detail & Related papers (2021-04-29T06:06:58Z)
- Memory-augmented Dense Predictive Coding for Video Representation Learning [103.69904379356413]
We propose a new architecture and learning framework Memory-augmented Predictive Coding (MemDPC) for the task.
We investigate visual-only self-supervised video representation learning from RGB frames, or from unsupervised optical flow, or both.
In all cases, we demonstrate state-of-the-art or comparable performance over other approaches with orders of magnitude less training data.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
- Unsupervised Learning of Video Representations via Dense Trajectory Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.