Spatio-Temporal Crop Aggregation for Video Representation Learning
- URL: http://arxiv.org/abs/2211.17042v1
- Date: Wed, 30 Nov 2022 14:43:35 GMT
- Title: Spatio-Temporal Crop Aggregation for Video Representation Learning
- Authors: Sepehr Sameni, Simon Jenni, Paolo Favaro
- Abstract summary: Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone.
We demonstrate that our video representation yields state-of-the-art performance with linear, non-linear, and $k$-NN probing on common action classification datasets.
- Score: 33.296154476701055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose Spatio-temporal Crop Aggregation for video representation LEarning
(SCALE), a novel method that enjoys high scalability at both training and
inference time. Our model builds long-range video features by learning from
sets of video clip-level features extracted with a pre-trained backbone. To
train the model, we propose a self-supervised objective consisting of masked
clip feature prediction. We apply sparsity to both the input, by extracting a
random set of video clips, and to the loss function, by only reconstructing the
sparse inputs. Moreover, we use dimensionality reduction by working in the
latent space of a pre-trained backbone applied to single video clips. The video
representation is then obtained by taking the ensemble of the concatenation of
embeddings of separate video clips with a video clip set summarization token.
These techniques make our method not only extremely efficient to train, but
also highly effective in transfer learning. We demonstrate that our video
representation yields state-of-the-art performance with linear, non-linear, and
$k$-NN probing on common action classification datasets.
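Below is a minimal PyTorch sketch of the masked clip-feature prediction objective described in the abstract. It is an illustration under stated assumptions, not the authors' implementation: the `ClipSetEncoder` name, transformer size, and mask ratio are guesses, and the clip features are assumed to come from a frozen pre-trained backbone; the full method additionally ensembles representations over multiple random clip sets.

```python
import torch
import torch.nn as nn

class ClipSetEncoder(nn.Module):
    """Sketch: a small transformer over a set of clip-level features, trained
    with masked clip-feature prediction (hyperparameters are illustrative)."""
    def __init__(self, dim=768, depth=4, heads=8):
        super().__init__()
        self.summary_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.predictor = nn.Linear(dim, dim)

    def forward(self, clip_feats, mask_ratio=0.5):
        # clip_feats: (B, N, D) features of a random set of N clips,
        # extracted by a frozen, pre-trained clip-level backbone.
        B, N, D = clip_feats.shape
        mask = torch.rand(B, N, device=clip_feats.device) < mask_ratio  # True = masked
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand(B, N, D), clip_feats)
        x = torch.cat([self.summary_token.expand(B, 1, D), x], dim=1)
        z = self.encoder(x)                       # (B, 1 + N, D)
        summary, clips = z[:, 0], z[:, 1:]
        # Sparse loss: reconstruct only the masked clip features.
        loss = ((self.predictor(clips) - clip_feats) ** 2)[mask].mean()
        # Video representation: clip embeddings concatenated with the summary token.
        video_repr = torch.cat([clips.flatten(1), summary], dim=1)
        return loss, video_repr
```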
Related papers
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning, an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Probabilistic Representations for Video Contrastive Learning [64.47354178088784]
This paper presents a self-supervised representation learning method that bridges contrastive learning with probabilistic representation.
By sampling embeddings from the whole video distribution, we can circumvent the careful sampling strategies or transformations needed to generate augmented views of the clips.
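A hedged sketch of that idea follows: the video feature is mapped to a Gaussian and embeddings are sampled with the reparameterization trick. The Gaussian parameterization and the `ProbabilisticVideoHead` name are assumptions; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProbabilisticVideoHead(nn.Module):
    """Sketch: map a video feature to a Gaussian and sample several embeddings,
    which can stand in for hand-crafted augmented views."""
    def __init__(self, in_dim=2048, emb_dim=128):
        super().__init__()
        self.mu = nn.Linear(in_dim, emb_dim)
        self.log_var = nn.Linear(in_dim, emb_dim)

    def forward(self, video_feat, num_samples=4):
        # video_feat: (B, in_dim)
        mu, log_var = self.mu(video_feat), self.log_var(video_feat)
        std = torch.exp(0.5 * log_var)
        # Reparameterization: draw several embeddings per video.
        eps = torch.randn(num_samples, *mu.shape, device=mu.device)
        samples = mu.unsqueeze(0) + eps * std.unsqueeze(0)   # (S, B, emb_dim)
        return F.normalize(samples, dim=-1)
```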
arXiv Detail & Related papers (2022-04-08T09:09:30Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by recent progress in self-supervised learning, we present a novel sequence contrastive loss (SCL) applied to two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
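A hedged sketch of a frame-wise sequence contrastive loss follows: frames at the same timestamp in two correlated views are treated as positives and all other frames as negatives. This is a plain InfoNCE simplification for illustration; the paper's SCL may weight temporal neighbors differently.

```python
import torch
import torch.nn.functional as F

def sequence_contrastive_loss(z1, z2, temperature=0.1):
    """Sketch: symmetric frame-wise InfoNCE between two views of one video."""
    # z1, z2: (T, D) frame-wise embeddings of two correlated views.
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature            # (T, T) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```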
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Self-Supervised Video Representation Learning by Video Incoherence Detection [28.540645395066434]
This paper introduces a novel self-supervised method that leverages incoherence detection for video representation learning.
It stems from the observation that the human visual system can easily identify video incoherence based on its comprehensive understanding of videos.
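One way to turn that observation into a pretext task is sketched below: fabricate incoherent clips by skipping frames and train a binary head to detect them. The construction and the `make_incoherent` / `IncoherenceHead` names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def make_incoherent(frames, out_len=16, gap=8):
    """Sketch: build an incoherent clip by skipping `gap` frames in the middle,
    so the two halves do not connect smoothly (one plausible construction)."""
    # frames: (T, C, H, W) with T >= out_len + gap.
    half = out_len // 2
    return torch.cat([frames[:half], frames[half + gap: out_len + gap]], dim=0)

class IncoherenceHead(nn.Module):
    """Binary head on top of a video backbone: coherent vs. incoherent clip."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, 2)

    def forward(self, clip_feat):
        # clip_feat: (B, feat_dim) backbone features of coherent/incoherent clips.
        return self.classifier(clip_feat)
```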
arXiv Detail & Related papers (2021-09-26T04:58:13Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use a contrastive loss with video clips as instances and learn visual representations by discriminating instances from each other.
In this paper, we observe that consistency between positive samples is the key to learning robust video representations.
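A hedged sketch of such positive-sample consistency follows: embeddings of the same video sampled at different playback speeds are pulled together with a negative-free cosine loss. This is a simplification; ASCNet itself combines an appearance-consistency task with a speed-consistency task.

```python
import torch.nn.functional as F

def positive_consistency_loss(z_slow, z_fast):
    """Sketch: pull together embeddings of the same video sampled at two
    different playback speeds (negative-free cosine objective)."""
    # z_slow, z_fast: (B, D) embeddings of the two positive clips.
    z_slow = F.normalize(z_slow, dim=-1)
    z_fast = F.normalize(z_fast, dim=-1)
    return (2 - 2 * (z_slow * z_fast).sum(dim=-1)).mean()
```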
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification with negligible computational overhead.
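A hedged sketch of the idea follows: several clips sampled from one video share a video-level memory that is fused back into each clip before classification. Here the memory is a simple mean of clip features and the `CollaborativeMemoryHead` name is an assumption; the paper's mechanism is more elaborate.

```python
import torch
import torch.nn as nn

class CollaborativeMemoryHead(nn.Module):
    """Sketch: fuse a shared video-level memory back into each clip feature."""
    def __init__(self, feat_dim=512, num_classes=400):
        super().__init__()
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, clip_feats):
        # clip_feats: (B, K, D) features of K clips sampled from each video.
        memory = clip_feats.mean(dim=1, keepdim=True)               # (B, 1, D)
        fused = torch.relu(self.fuse(torch.cat(
            [clip_feats, memory.expand_as(clip_feats)], dim=-1)))   # (B, K, D)
        return self.classifier(fused).mean(dim=1)                   # video-level logits
```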
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling [98.41300980759577]
The canonical approach to video-and-language learning requires a neural model to learn from dense video features extracted offline.
We propose a generic framework ClipBERT that enables affordable end-to-end learning for video-and-language tasks.
Experiments on text-to-video retrieval and video question answering on six datasets demonstrate that ClipBERT outperforms existing methods.
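The sparse-sampling idea can be sketched as follows: instead of dense offline features, draw only a few short clips of raw frames per video and learn from those end to end, averaging per-clip predictions at inference. The helper below is illustrative only.

```python
import torch

def sparse_sample_clips(video_frames, num_clips=2, clip_len=8):
    """Sketch: randomly draw a few short clips of raw frames from one video."""
    # video_frames: (T, C, H, W) with T >= clip_len.
    T = video_frames.size(0)
    starts = torch.randint(0, max(T - clip_len, 1), (num_clips,))
    clips = [video_frames[s:s + clip_len] for s in starts.tolist()]
    return torch.stack(clips)   # (num_clips, clip_len, C, H, W)
```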
arXiv Detail & Related papers (2021-02-11T18:50:16Z)
- NUTA: Non-uniform Temporal Aggregation for Action Recognition [29.75987323741384]
We propose a method called the non-uniform temporal aggregation (NUTA), which aggregates features only from informative temporal segments.
Our model has achieved state-of-the-art performance on four widely used large-scale action-recognition datasets.
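A hedged sketch of non-uniform temporal aggregation follows: instead of uniform average pooling over time, a scalar informativeness score is learned per temporal segment and features are combined by a weighted sum. This is a simplification of the NUTA idea as summarized above.

```python
import torch
import torch.nn as nn

class NonUniformTemporalAggregation(nn.Module):
    """Sketch: weight temporal segments by a learned informativeness score."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, segment_feats):
        # segment_feats: (B, T, D) features of T temporal segments.
        weights = torch.softmax(self.score(segment_feats), dim=1)  # (B, T, 1)
        return (weights * segment_feats).sum(dim=1)                # (B, D)
```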
arXiv Detail & Related papers (2020-12-15T02:03:37Z)
- TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks [79.01176229586855]
We propose a novel supervised pretraining paradigm for clip features that considers background clips and global video information to improve temporal sensitivity.
Extensive experiments show that using features trained with our novel pretraining strategy significantly improves the performance of recent state-of-the-art methods on three tasks.
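A hedged sketch of such a pretraining head follows: each clip feature is concatenated with a pooled global video feature and trained on two supervised targets, an action class and a foreground/background label for the clip. The head structure and names are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TemporallySensitiveHead(nn.Module):
    """Sketch: joint action and foreground/background classification of clips,
    conditioned on a global video feature."""
    def __init__(self, feat_dim=512, num_classes=200):
        super().__init__()
        self.action = nn.Linear(2 * feat_dim, num_classes)
        self.foreground = nn.Linear(2 * feat_dim, 2)

    def forward(self, clip_feat, video_feats):
        # clip_feat: (B, D); video_feats: (B, K, D) features of K clips per video.
        global_feat = video_feats.max(dim=1).values          # global video feature
        x = torch.cat([clip_feat, global_feat], dim=-1)
        return self.action(x), self.foreground(x)
```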
arXiv Detail & Related papers (2020-11-23T15:40:15Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate the effectiveness of the framework.
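A hedged sketch of video-text pair discrimination follows: matched video/text embeddings in a batch are positives and all other pairings are negatives, scored with a symmetric InfoNCE loss. The details of the paper's loss may differ.

```python
import torch
import torch.nn.functional as F

def video_text_pair_discrimination(video_emb, text_emb, temperature=0.07):
    """Sketch: symmetric InfoNCE between matched video and text embeddings."""
    # video_emb, text_emb: (B, D) embeddings of paired videos and texts.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.t() / temperature
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```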
arXiv Detail & Related papers (2020-01-16T08:28:57Z)