Learning by Aligning Videos in Time
- URL: http://arxiv.org/abs/2103.17260v2
- Date: Thu, 17 Aug 2023 07:29:40 GMT
- Title: Learning by Aligning Videos in Time
- Authors: Sanjay Haresh and Sateesh Kumar and Huseyin Coskun and Shahram Najam Syed and Andrey Konin and Muhammad Zeeshan Zia and Quoc-Huy Tran
- Abstract summary: We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
- Score: 10.075645944474287
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a self-supervised approach for learning video representations
using temporal video alignment as a pretext task, while exploiting both
frame-level and video-level information. We leverage a novel combination of
temporal alignment loss and temporal regularization terms, which can be used as
supervision signals for training an encoder network. Specifically, the temporal
alignment loss (i.e., Soft-DTW) aims to minimize the cost of temporally
aligning videos in the embedding space. However, optimizing solely for this
term leads to trivial solutions, particularly, one where all frames get mapped
to a small cluster in the embedding space. To overcome this problem, we propose
a temporal regularization term (i.e., Contrastive-IDM) which encourages
different frames to be mapped to different points in the embedding space.
Extensive evaluations on various tasks, including action phase classification,
action phase progression, and fine-grained frame retrieval, on three datasets,
namely Pouring, Penn Action, and IKEA ASM, show superior performance of our
approach over state-of-the-art methods for self-supervised representation
learning from videos. In addition, our method provides significant performance
gains when labeled data is scarce. Our code and labels are available on our
research website: https://retrocausal.ai/research/
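As a concrete illustration of the two supervision signals described in the abstract, the following is a minimal sketch (not the authors' released implementation) of a Soft-DTW alignment cost and a Contrastive-IDM-style temporal regularizer; the function names and the gamma, window, and margin values are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def soft_dtw(x, y, gamma=0.1):
    """Soft-DTW alignment cost between frame embeddings x (n, d) and y (m, d).
    Slow reference recursion, kept simple for clarity."""
    n, m = x.size(0), y.size(0)
    # Frame-to-frame cost: pairwise squared Euclidean distances.
    dist = torch.cdist(x, y, p=2) ** 2
    r = torch.full((n + 1, m + 1), float("inf"), device=x.device)
    r[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Soft minimum over the three admissible predecessor states.
            prev = torch.stack([r[i - 1, j - 1], r[i - 1, j], r[i, j - 1]])
            r[i, j] = dist[i - 1, j - 1] - gamma * torch.logsumexp(-prev / gamma, dim=0)
    return r[n, m]


def contrastive_idm(x, window=5, margin=2.0):
    """Temporal regularizer in the spirit of Contrastive-IDM: temporally close
    frames are pulled together, temporally distant frames are pushed at least
    `margin` apart, so all frames cannot collapse to one point."""
    n = x.size(0)
    idx = torch.arange(n, device=x.device).float()
    time_gap = (idx[:, None] - idx[None, :]).abs()
    sq_dist = torch.cdist(x, x, p=2) ** 2
    close = (time_gap <= window).float()
    attract = close * sq_dist                         # keep temporal neighbors nearby
    repel = (1.0 - close) * F.relu(margin - sq_dist)  # separate distant frames
    return (attract + repel).mean()


def training_loss(emb_a, emb_b, lam=1.0):
    """Combined objective: alignment cost plus temporal regularization on both videos."""
    return soft_dtw(emb_a, emb_b) + lam * (contrastive_idm(emb_a) + contrastive_idm(emb_b))
```

In a training loop, `emb_a` and `emb_b` would be the encoder outputs for two videos of the same action; the regularization term is what prevents the trivial solution in which all frames map to a small cluster in the embedding space.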
Related papers
- Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment [3.2873782624127834]
We present a self-supervised method for representation learning based on aligning temporal video sequences.
We introduce the novel Local-Alignment Contrastive (LAC) loss, which uses a differentiable local alignment loss to capture local temporal dependencies.
We show that our learned representations outperform existing state-of-the-art approaches on action recognition tasks.
arXiv Detail & Related papers (2024-09-06T20:32:53Z)
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations [79.87044240860466]
We propose a novel approach that incorporates temporal consistency in dense self-supervised learning.
Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos.
Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images.
arXiv Detail & Related papers (2023-08-22T21:28:58Z)
- Video alignment using unsupervised learning of local and global features [0.0]
We introduce an unsupervised method for alignment that uses global and local features of the frames.
In particular, we introduce effective features for each video frame by means of three machine vision tools: person detection, pose estimation, and VGG network.
The main advantage of our approach is that no training is required, which makes it applicable for any new type of action without any need to collect training samples for it.
arXiv Detail & Related papers (2023-04-13T22:20:54Z)
- Self-supervised and Weakly Supervised Contrastive Learning for Frame-wise Action Representations [26.09611987412578]
We introduce a new framework of contrastive action representation learning (CARL) to learn frame-wise action representation in a self-supervised or weakly-supervised manner.
Specifically, we introduce a simple but effective video encoder that considers both spatial and temporal context.
Our method outperforms the previous state of the art by a large margin on downstream fine-grained action classification, while also offering faster inference.
arXiv Detail & Related papers (2022-12-06T16:42:22Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to unsupervisedly pre-train feature encoders for temporal action localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
arXiv Detail & Related papers (2022-03-25T12:13:43Z)
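As a rough sketch of the PAL pretext task just summarized (an illustrative assumption, not the UP-TAL authors' code), the region-pasting and agreement objective could look as follows; the toy tensors, `paste_region` helper, and stand-in encoder are hypothetical.

```python
import torch
import torch.nn.functional as F

def paste_region(target, region, start):
    """Overwrite target frames [start, start + len(region)) with the pasted region."""
    out = target.clone()
    out[start:start + region.size(0)] = region
    return out

# Toy frame features standing in for raw clips: (num_frames, feature_dim).
video_a, video_b, video_c = (torch.randn(64, 128) for _ in range(3))

# Randomly select a pseudo-action region from one video.
length = 16
s = torch.randint(0, video_a.size(0) - length, (1,)).item()
pseudo_action = video_a[s:s + length]

# Paste it at different temporal positions in the two other videos.
pos_b = torch.randint(0, video_b.size(0) - length, (1,)).item()
pos_c = torch.randint(0, video_c.size(0) - length, (1,)).item()
synth_b = paste_region(video_b, pseudo_action, pos_b)
synth_c = paste_region(video_c, pseudo_action, pos_c)

# Stand-in encoder; in practice this is the video backbone being pre-trained.
encoder = torch.nn.Linear(128, 64)

# Pool the encoded features of the two pasted copies and maximize their agreement.
feat_b = encoder(synth_b[pos_b:pos_b + length]).mean(dim=0)
feat_c = encoder(synth_c[pos_c:pos_c + length]).mean(dim=0)
agreement_loss = 1 - F.cosine_similarity(feat_b, feat_c, dim=0)
```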
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration.
This enables the learning of long-range dependencies beyond a single clip.
Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
- Self-supervised Temporal Discriminative Learning for Video Representation Learning [39.43942923911425]
Temporal-discriminative features can hardly be extracted without using an annotated large-scale video action dataset for training.
This paper proposes a novel Video-based Temporal-Discriminative Learning framework that is trained in a self-supervised manner.
arXiv Detail & Related papers (2020-08-05T13:36:59Z)