HomE: Homography-Equivariant Video Representation Learning
- URL: http://arxiv.org/abs/2306.01623v1
- Date: Fri, 2 Jun 2023 15:37:43 GMT
- Title: HomE: Homography-Equivariant Video Representation Learning
- Authors: Anirudh Sriram, Adrien Gaidon, Jiajun Wu, Juan Carlos Niebles, Li Fei-Fei, Ehsan Adeli
- Abstract summary: We propose a novel method for representation learning of multi-view videos.
Our method learns an implicit mapping between different views, culminating in a representation space that maintains the homography relationship between neighboring views.
On action classification, our method obtains 96.4% 3-fold accuracy on the UCF101 dataset, better than most state-of-the-art self-supervised learning methods.
- Score: 62.89516761473129
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in self-supervised representation learning have enabled more
efficient and robust model performance without relying on extensive labeled
data. However, most works are still focused on images, with few working on
videos and even fewer on multi-view videos, where more powerful inductive
biases can be leveraged for self-supervision. In this work, we propose a novel
method for representation learning of multi-view videos, where we explicitly
model the representation space to maintain Homography Equivariance (HomE). Our
method learns an implicit mapping between different views, culminating in a
representation space that maintains the homography relationship between
neighboring views. We evaluate our HomE representation via action recognition
and pedestrian intent prediction as downstream tasks. On action classification,
our method obtains 96.4% 3-fold accuracy on the UCF101 dataset, better than
most state-of-the-art self-supervised learning methods. Similarly, on the STIP
dataset, we outperform the state-of-the-art by 6% for pedestrian intent
prediction one second into the future while also obtaining an accuracy of 91.2%
for pedestrian action (cross vs. not-cross) classification. Code is available
at https://github.com/anirudhs123/HomE.
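To make the equivariance idea concrete, below is a minimal, hedged sketch of one way such a constraint could be written, assuming each frame is encoded as a set of homogeneous 3-D points so that the 3x3 homography between neighboring views can act on the representation by matrix multiplication. The encoder, the point-set representation, and the normalized MSE penalty are illustrative assumptions, not the authors' implementation (see the repository above for that).
```python
# Hedged sketch (not the HomE authors' code): a homography-equivariance penalty
# between two camera views, assuming each frame is encoded as K points in
# homogeneous coordinates so a known 3x3 homography H_ij can act on them.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyViewEncoder(nn.Module):
    """Maps a frame (3xHxW) to K homogeneous 3-vectors (illustrative only)."""
    def __init__(self, k_points: int = 64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, k_points * 3)
        self.k = k_points

    def forward(self, x):                      # x: (B, 3, H, W)
        z = self.head(self.backbone(x))        # (B, K*3)
        return z.view(-1, 3, self.k)           # (B, 3, K)

def home_equivariance_loss(z_i, z_j, H_ij):
    """|| normalize(H_ij @ z_i) - normalize(z_j) ||^2, averaged over the batch."""
    warped = torch.bmm(H_ij, z_i)              # homography acts in representation space
    warped = F.normalize(warped, dim=1)        # compare up to projective scale
    target = F.normalize(z_j, dim=1)
    return ((warped - target) ** 2).mean()

# Toy usage: two synchronized views with a placeholder homography per sample.
enc = ToyViewEncoder()
frames_i, frames_j = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
H_ij = torch.eye(3).repeat(4, 1, 1)
loss = home_equivariance_loss(enc(frames_i), enc(frames_j), H_ij)
loss.backward()
```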
Related papers
- Early Action Recognition with Action Prototypes [62.826125870298306]
We propose a novel model that learns a prototypical representation of the full action for each class.
We decompose the video into short clips, where a visual encoder extracts features from each clip independently.
A decoder then aggregates the features from all the clips in an online fashion for the final class prediction.
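As a rough illustration of that clip-then-aggregate pattern (not the paper's model: the module sizes, the GRU aggregator, and the cosine readout against learned per-class prototypes are all assumptions), a sketch could look like:
```python
# Hedged sketch: encode clips independently, aggregate them online, and score
# classes by similarity to learned "prototype" vectors of the full action.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OnlinePrototypeClassifier(nn.Module):
    def __init__(self, clip_dim=128, hidden=128, num_classes=10):
        super().__init__()
        self.clip_encoder = nn.Sequential(nn.Linear(clip_dim, hidden), nn.ReLU())
        self.aggregator = nn.GRU(hidden, hidden, batch_first=True)
        self.prototypes = nn.Parameter(torch.randn(num_classes, hidden))

    def forward(self, clips):                        # clips: (B, T, clip_dim)
        feats = self.clip_encoder(clips)             # encode each clip independently
        agg, _ = self.aggregator(feats)              # causal (online) aggregation
        video_repr = agg[:, -1]                      # state after the latest clip
        return F.cosine_similarity(                  # (B, num_classes) scores
            video_repr.unsqueeze(1), self.prototypes.unsqueeze(0), dim=-1
        )

clips = torch.randn(2, 8, 128)                       # 2 videos, 8 clip features each
scores = OnlinePrototypeClassifier()(clips)
```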
arXiv Detail & Related papers (2023-12-11T18:31:13Z)
- Expanding Language-Image Pretrained Models for General Video Recognition [136.0948049010682]
Contrastive language-image pretraining has shown great success in learning joint visual-textual representations from web-scale data.
We present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly.
Our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols.
arXiv Detail & Related papers (2022-08-04T17:59:54Z)
- Auxiliary Learning for Self-Supervised Video Representation via Similarity-based Knowledge Distillation [2.6519061087638014]
We propose a novel approach to complement self-supervised pretraining with an auxiliary pretraining phase based on knowledge similarity distillation (auxSKD).
Our method deploys a teacher network that iteratively distils its knowledge to a student model by capturing the similarity information between segments of unlabelled video data.
We also introduce a novel pretext task, Video Segment Pace Prediction or VSPP, which requires our model to predict the playback speed of a randomly selected segment of the input video to provide more reliable self-supervised representations.
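A hedged sketch of how such a pace-prediction pseudo-label might be generated follows; the segment choice, the set of speeds, and the clip length are assumptions, not the paper's exact recipe.
```python
# Hedged sketch: build (clip, speed_label) training pairs by subsampling a
# randomly chosen segment at different rates, so the model must infer the pace.
import random
import torch

def sample_pace_prediction_example(video, clip_len=16, speeds=(1, 2, 4)):
    """video: (T, C, H, W) tensor. Returns (clip, speed_label)."""
    speed_label = random.randrange(len(speeds))
    speed = speeds[speed_label]
    span = clip_len * speed                       # frames covered at this speed
    start = random.randint(0, video.shape[0] - span)
    clip = video[start:start + span:speed]        # subsampling makes it look "sped up"
    return clip, speed_label                      # model is trained to predict the label

video = torch.randn(128, 3, 64, 64)
clip, label = sample_pace_prediction_example(video)
assert clip.shape[0] == 16
```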
arXiv Detail & Related papers (2021-12-07T21:50:40Z)
- Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z)
- Contrastive Learning of Image Representations with Cross-Video Cycle-Consistency [13.19476138523546]
Cross-video relations have barely been explored for visual representation learning.
We propose a novel contrastive learning method which explores the cross-video relation by using cycle-consistency for general image representation learning.
We show significant improvement over state-of-the-art contrastive learning methods.
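The cycle idea can be sketched as hopping from a frame of one video to its (soft) nearest neighbour in another video and back, and using where the cycle lands as a positive for contrastive learning. The soft-nearest-neighbour form and temperature below are assumptions about the general technique, not the paper's exact objective.
```python
# Hedged sketch: A -> B -> A cycle in feature space to mine a cross-video positive.
import torch
import torch.nn.functional as F

def soft_nearest_neighbor(query, keys, tau=0.1):
    """query: (D,), keys: (N, D) -> similarity-weighted combination of keys."""
    sims = F.cosine_similarity(query.unsqueeze(0), keys, dim=-1) / tau
    return (F.softmax(sims, dim=0).unsqueeze(-1) * keys).sum(dim=0)

def cycle_positive(frame_feat_a, feats_b, feats_a):
    """Hop into video B and back into video A; return the feature the cycle lands on."""
    hop_b = soft_nearest_neighbor(frame_feat_a, feats_b)   # into video B
    hop_a = soft_nearest_neighbor(hop_b, feats_a)          # back into video A
    return hop_a                                           # positive for frame_feat_a

feats_a = F.normalize(torch.randn(30, 128), dim=-1)        # frame features of video A
feats_b = F.normalize(torch.randn(40, 128), dim=-1)        # frame features of video B
pos = cycle_positive(feats_a[0], feats_b, feats_a)
```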
arXiv Detail & Related papers (2021-05-13T17:59:11Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Naive-Student: Leveraging Semi-Supervised Learning in Video Sequences for Urban Scene Segmentation [57.68890534164427]
In this work, we ask whether semi-supervised learning on unlabeled video sequences and extra images can improve performance on urban scene segmentation.
We simply predict pseudo-labels for the unlabeled data and train subsequent models with both human-annotated and pseudo-labeled data.
Our Naive-Student model, trained with such simple yet effective iterative semi-supervised learning, attains state-of-the-art results at all three Cityscapes benchmarks.
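A minimal, hedged sketch of that iterative pseudo-labelling loop is given below; `train`, `predict`, and the dataset objects are hypothetical placeholders, not the paper's pipeline.
```python
# Hedged sketch: alternate between pseudo-labelling unlabeled data with the
# current teacher and retraining a student on human + pseudo labels.
def naive_student_iterations(labeled, unlabeled, init_model, train, predict, rounds=3):
    """labeled: list of (x, y); unlabeled: list of x; train/predict are placeholder callables."""
    teacher = init_model
    for _ in range(rounds):
        pseudo = [(x, predict(teacher, x)) for x in unlabeled]  # teacher pseudo-labels the data
        student = train(labeled + pseudo)                       # train on both label sources
        teacher = student                                       # the student becomes the next teacher
    return teacher
```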
arXiv Detail & Related papers (2020-05-20T18:00:05Z)
- Learning Spatiotemporal Features via Video and Text Pair Discrimination [30.64670449131973]
The cross-modal pair (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (-300k) to demonstrate its effectiveness.
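A hedged sketch of a generic video-text pair-discrimination (InfoNCE-style) objective in this spirit follows; the symmetric cross-entropy form and the temperature are assumptions about the general technique, not necessarily CPD's exact loss.
```python
# Hedged sketch: matched video/text embeddings sit on the diagonal of the
# similarity matrix and are discriminated against all other pairs in the batch.
import torch
import torch.nn.functional as F

def pair_discrimination_loss(video_emb, text_emb, tau=0.07):
    """video_emb, text_emb: (B, D); matched pairs share the same row index."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / tau                        # (B, B) similarity matrix
    targets = torch.arange(v.size(0))             # true pair is on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

loss = pair_discrimination_loss(torch.randn(8, 256), torch.randn(8, 256))
```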
arXiv Detail & Related papers (2020-01-16T08:28:57Z)