Fine-grained Temporal Contrastive Learning for Weakly-supervised
Temporal Action Localization
- URL: http://arxiv.org/abs/2203.16800v1
- Date: Thu, 31 Mar 2022 05:13:50 GMT
- Title: Fine-grained Temporal Contrastive Learning for Weakly-supervised
Temporal Action Localization
- Authors: Junyu Gao, Mengyuan Chen, Changsheng Xu
- Abstract summary: This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
- Score: 87.47977407022492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the task of weakly-supervised action localization (WSAL), where
only video-level action labels are available during model training. Despite the
recent progress, existing methods mainly embrace a
localization-by-classification paradigm and overlook the fruitful fine-grained
temporal distinctions between video sequences, thus suffering from severe
ambiguity in classification learning and classification-to-localization
adaption. This paper argues that learning by contextually comparing
sequence-to-sequence distinctions offers an essential inductive bias in WSAL
and helps identify coherent action instances. Specifically, under a
differentiable dynamic programming formulation, two complementary contrastive
objectives are designed, including Fine-grained Sequence Distance (FSD)
contrasting and Longest Common Subsequence (LCS) contrasting, where the first
one considers the relations of various action/background proposals by using
match, insert, and delete operators and the second one mines the longest common
subsequences between two videos. Both contrasting modules can enhance each
other and jointly enjoy the merits of discriminative action-background
separation and alleviated task gap between classification and localization.
Extensive experiments show that our method achieves state-of-the-art
performance on two popular benchmarks. Our code is available at
https://github.com/MengyuanChen21/CVPR2022-FTCL.
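The abstract describes FSD contrasting only at a high level: a differentiable dynamic-programming distance over two feature sequences built from match, insert, and delete operators. As a hedged illustration (not the authors' implementation, which is in the linked repository), the following sketches a generic soft-minimum edit distance over frame features in the spirit of soft-DTW-style relaxations; the gap cost `gap` and smoothing parameter `gamma` are illustrative assumptions.

```python
import numpy as np

def softmin(xs, gamma=0.1):
    # Smooth, differentiable relaxation of min: -gamma * logsumexp(-x / gamma).
    xs = np.asarray(xs) / -gamma
    m = xs.max()
    return -gamma * (m + np.log(np.exp(xs - m).sum()))

def soft_edit_distance(A, B, gamma=0.1, gap=1.0):
    """Differentiable sequence distance between feature sequences A and B
    (arrays of shape [len, dim]) using match / insert / delete transitions,
    with the hard minimum replaced by a soft minimum."""
    n, m = len(A), len(B)
    D = np.zeros((n + 1, m + 1))
    # Boundary: pure insertions / deletions accumulate the gap cost.
    for i in range(1, n + 1):
        D[i, 0] = D[i - 1, 0] + gap
    for j in range(1, m + 1):
        D[0, j] = D[0, j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = D[i - 1, j - 1] + np.linalg.norm(A[i - 1] - B[j - 1])
            delete = D[i - 1, j] + gap
            insert = D[i, j - 1] + gap
            D[i, j] = softmin([match, delete, insert], gamma)
    return D[n, m]
```

Replacing the hard minimum with `softmin` keeps the recurrence differentiable, so the resulting distance can serve as a contrastive training signal between action/background proposals; the LCS objective can be relaxed analogously with a soft maximum.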
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z) - Weakly-Supervised Action Localization by Hierarchically-structured
Latent Attention Modeling [19.683714649646603]
Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos with only video-level labels.
Most existing models rely on multiple instance learning (MIL), where predictions of unlabeled instances are supervised by classifying labeled bags.
We propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics.
arXiv Detail & Related papers (2023-08-19T08:45:49Z) - Weakly-Supervised Temporal Action Localization with Bidirectional
Semantic Consistency Constraint [83.36913240873236]
Weakly Supervised Temporal Action localization (WTAL) aims to classify and localize temporal boundaries of actions for the video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate the positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z) - Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z) - Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
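The PAL pretext task above can be sketched with a toy cut-and-paste over clip-level features. This is a minimal illustration under assumed shapes ([num_clips, dim] feature arrays), not the released UP-TAL code; in the actual method the agreement is computed on features produced by the encoder being pre-trained, not on the raw features themselves.

```python
import numpy as np

def paste(region, target, rng):
    """Paste a pseudo-action `region` at a random temporal position in a
    copy of `target` (both are [num_clips, dim] feature arrays)."""
    t = int(rng.integers(0, len(target) - len(region) + 1))
    synthetic = target.copy()
    synthetic[t:t + len(region)] = region
    return synthetic, t

rng = np.random.default_rng(0)
source = rng.normal(size=(32, 8))            # video to cut the pseudo action from
v1, v2 = rng.normal(size=(64, 8)), rng.normal(size=(64, 8))

# Cut one random pseudo-action region and paste it into two other videos.
length = 16
s = int(rng.integers(0, len(source) - length + 1))
region = source[s:s + length]
syn1, p1 = paste(region, v1, rng)
syn2, p2 = paste(region, v2, rng)

# Agreement target: cosine similarity between the pasted regions'
# mean-pooled features, to be maximized during pre-training.
f1 = syn1[p1:p1 + length].mean(axis=0)
f2 = syn2[p2:p2 + length].mean(axis=0)
cos = f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2))
```

On raw features the two pasted regions are identical, so the cosine similarity is trivially 1; with a learnable encoder in the loop, maximizing this agreement forces the encoder to represent the shared pseudo action consistently across different temporal contexts.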
arXiv Detail & Related papers (2022-03-25T12:13:43Z) - Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and
Contrastive Meta-Learning [51.03781020616402]
Fine-grained action recognition is attracting increasing attention due to the emerging demand of specific action understanding in real-world applications.
We propose a few-shot fine-grained action recognition problem, aiming to recognize novel fine-grained actions with only few samples given for each class.
Although progress has been made in coarse-grained actions, existing few-shot recognition methods encounter two issues handling fine-grained actions.
arXiv Detail & Related papers (2021-08-15T02:21:01Z) - TTAN: Two-Stage Temporal Alignment Network for Few-shot Action
Recognition [29.95184808021684]
Few-shot action recognition aims to recognize novel action classes (query) using just a few samples (support).
We devise a novel multi-shot fusion strategy, which takes the misalignment among support samples into consideration.
Experiments on benchmark datasets show the potential of the proposed method in achieving state-of-the-art performance for few-shot action recognition.
arXiv Detail & Related papers (2021-07-10T07:22:49Z) - Weakly Supervised Temporal Action Localization Through Learning Explicit
Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.