Weakly-supervised Representation Learning for Video Alignment and
Analysis
- URL: http://arxiv.org/abs/2302.04064v1
- Date: Wed, 8 Feb 2023 14:01:01 GMT
- Title: Weakly-supervised Representation Learning for Video Alignment and
Analysis
- Authors: Guy Bar-Shalom, George Leifman, Michael Elad, Ehud Rivlin
- Abstract summary: This paper introduces LRProp -- a novel weakly-supervised representation learning approach.
The proposed algorithm also uses a regularized SoftDTW loss to better tune the learned features.
Our novel representation learning paradigm consistently outperforms the state of the art on temporal alignment tasks.
- Score: 16.80278496414627
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many tasks in video analysis and understanding boil down to the need for
frame-based feature learning, aiming to encapsulate the relevant visual content
so as to enable simpler and easier subsequent processing. While supervised
strategies for this learning task can be envisioned, self- and weakly-supervised
alternatives are preferred due to the difficulties in getting labeled data.
This paper introduces LRProp -- a novel weakly-supervised representation
learning approach, with an emphasis on the application of temporal alignment
between pairs of videos of the same action category. The proposed approach uses
a transformer encoder for extracting frame-level features, and employs the DTW
algorithm within the training iterations in order to identify the alignment
path between video pairs. Through a process referred to as ``pair-wise position
propagation'', the probability distributions of these correspondences per
location are matched with the similarity of the frame-level features via
KL-divergence minimization. The proposed algorithm also uses a regularized
SoftDTW loss to better tune the learned features. Our novel representation
learning paradigm consistently outperforms the state of the art on temporal
alignment tasks, establishing a new performance bar over several downstream
video analysis applications.
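To make the described pipeline more concrete, below is a minimal PyTorch sketch of the two loss ingredients named in the abstract: a differentiable SoftDTW cost over a frame-level cost matrix, and a KL-divergence term that matches a per-frame target correspondence distribution against the softmax of frame-level feature similarities. This is not the authors' implementation: the shapes, the temperature `tau`, the smoothing `gamma`, and the diagonal toy target (a stand-in for the DTW-derived "pair-wise position propagation" distributions, whose exact construction the abstract does not specify) are illustrative assumptions, and the transformer encoder and the SoftDTW regularization are omitted.

```python
import torch
import torch.nn.functional as F


def soft_dtw(cost, gamma=0.1):
    """Soft-DTW value of a (Ta, Tb) frame-wise cost matrix, computed with
    the usual soft-min (log-sum-exp) recursion; differentiable w.r.t. cost."""
    Ta, Tb = cost.shape
    inf = torch.tensor(float("inf"), device=cost.device)
    # R[i][j] is the soft prefix cost; kept as Python lists of 0-dim tensors
    # so autograd can flow through the recursion without in-place writes.
    R = [[inf for _ in range(Tb + 1)] for _ in range(Ta + 1)]
    R[0][0] = torch.zeros((), device=cost.device)
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            prev = torch.stack([R[i - 1][j - 1], R[i - 1][j], R[i][j - 1]])
            soft_min = -gamma * torch.logsumexp(-prev / gamma, dim=0)
            R[i][j] = cost[i - 1, j - 1] + soft_min
    return R[Ta][Tb]


def kl_alignment_loss(feats_a, feats_b, target, tau=0.1):
    """KL divergence between a per-frame target correspondence distribution
    (rows sum to 1) and the softmax of frame-level feature similarities."""
    sim = feats_a @ feats_b.t() / tau          # (Ta, Tb) similarity logits
    log_p = F.log_softmax(sim, dim=1)          # model's correspondence distribution
    return F.kl_div(log_p, target, reduction="batchmean")


# Toy usage: two videos with Ta=20 / Tb=24 frames and 128-d frame embeddings;
# the target is a simple diagonal assignment, used only for illustration.
fa = F.normalize(torch.randn(20, 128), dim=1)
fb = F.normalize(torch.randn(24, 128), dim=1)
cost = 1.0 - fa @ fb.t()                       # cosine-distance cost matrix
idx = torch.linspace(0, 23, steps=20).round().long()
target = F.one_hot(idx, num_classes=24).float()
loss = soft_dtw(cost) + kl_alignment_loss(fa, fb, target)
```

In the paper itself the target distributions come from propagating positions along the DTW alignment path rather than from a fixed diagonal, and the SoftDTW term is regularized; those details would replace the toy choices in this sketch.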
Related papers
- Self-Supervised Contrastive Learning for Videos using Differentiable Local Alignment [3.2873782624127834]
We present a self-supervised method for representation learning based on aligning temporal video sequences.
We introduce the novel Local-Alignment Contrastive (LAC) loss, which incorporates a differentiable local alignment loss to capture local temporal dependencies.
We show that our learned representations outperform existing state-of-the-art approaches on action recognition tasks.
arXiv Detail & Related papers (2024-09-06T20:32:53Z)
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) is to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
ReLoCLNet encodes text and video separately for efficiency; experimental results show that its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)
- Learning Implicit Temporal Alignment for Few-shot Video Classification [40.57508426481838]
Few-shot video classification aims to learn new video categories with only a few labeled examples.
It is particularly challenging to learn a class-invariant spatial-temporal representation in such a setting.
We propose a novel matching-based few-shot learning strategy for video sequences in this work.
arXiv Detail & Related papers (2021-05-11T07:18:57Z)
- Adaptive Intermediate Representations for Video Understanding [50.64187463941215]
We introduce a new way to leverage semantic segmentation as an intermediate representation for video understanding.
We propose a general framework which learns the intermediate representations (optical flow and semantic segmentation) jointly with the final video understanding task.
We obtain more powerful visual representations for videos which lead to performance gains over the state-of-the-art.
arXiv Detail & Related papers (2021-04-14T21:37:23Z)
- Learning Dynamic Alignment via Meta-filter for Few-shot Learning [94.41887992982986]
Few-shot learning aims to recognise new classes by adapting the learned knowledge with extremely limited few-shot (support) examples.
We learn a dynamic alignment, which can effectively highlight both query regions and channels according to different local support information.
The resulting framework establishes the new state-of-the-arts on major few-shot visual recognition benchmarks.
arXiv Detail & Related papers (2021-03-25T03:29:33Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)