Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint
- URL: http://arxiv.org/abs/2304.12616v1
- Date: Tue, 25 Apr 2023 07:20:33 GMT
- Title: Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint
- Authors: Guozhang Li, De Cheng, Xinpeng Ding, Nannan Wang, Jie Li, Xinbo Gao
- Abstract summary: Weakly-Supervised Temporal Action Localization (WTAL) aims to classify actions and localize their temporal boundaries in a video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
- Score: 83.36913240873236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-Supervised Temporal Action Localization (WTAL) aims to classify actions and
localize their temporal boundaries in a video, given only video-level category labels in the
training datasets. Due to the lack of boundary information during training, existing approaches
formulate WTAL as a classification problem, i.e., they generate a temporal class activation map
(T-CAM) for localization. However, with only a classification loss, the model is sub-optimized:
the action-related scenes alone are enough to distinguish different class labels. Regarding other
actions in the action-related scene (i.e., the same scene as the positive actions) as co-scene
actions, this sub-optimized model misclassifies the co-scene actions as positive actions. To
address this misclassification, we propose a simple yet
efficient method, named bidirectional semantic consistency constraint (Bi-SCC),
to discriminate the positive actions from co-scene actions. The proposed Bi-SCC
first adopts a temporal context augmentation to generate an augmented video that breaks the
inter-video correlation between positive actions and their co-scene actions. Then, a semantic
consistency constraint (SCC) is used to enforce the predictions of the original and augmented
videos to be consistent, hence suppressing the co-scene actions. However, we find that this
augmented video would destroy the original temporal context. Simply applying
the consistency constraint would affect the completeness of localized positive
actions. Hence, we boost the SCC in a bidirectional way to suppress co-scene
actions while ensuring the integrity of positive actions, by cross-supervising
the original and augmented videos. Finally, our proposed Bi-SCC can be applied
to current WTAL approaches, and improve their performance. Experimental results
show that our approach outperforms the state-of-the-art methods on THUMOS14 and
ActivityNet.
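As a concrete illustration of the cross-supervision described above, here is a minimal PyTorch sketch of a bidirectional consistency term between the T-CAMs of the original and augmented videos. The function name, tensor shapes, and the choice of KL divergence with a stop-gradient on the supervising side are our assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def bi_scc_loss(tcam_orig: torch.Tensor, tcam_aug: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of a bidirectional semantic consistency term.

    tcam_orig, tcam_aug: temporal class activation maps of shape [T, C]
    for the original and the augmented video (assumed aligned in time).
    """
    log_p_orig = tcam_orig.log_softmax(dim=-1)
    log_p_aug = tcam_aug.log_softmax(dim=-1)
    # Original view supervised by the augmented view: since co-scene
    # correlations are broken in the augmented video, this direction
    # suppresses co-scene responses in the original T-CAM.
    fwd = F.kl_div(log_p_orig, log_p_aug.detach().exp(), reduction="batchmean")
    # Augmented view supervised by the original view: the original video
    # keeps the full temporal context, so this direction protects the
    # completeness of the localized positive actions.
    bwd = F.kl_div(log_p_aug, log_p_orig.detach().exp(), reduction="batchmean")
    return fwd + bwd
```

Such a term would be added to the base classification loss of whichever WTAL model Bi-SCC is plugged into.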
Related papers
- ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos [7.030989629685138]
Action Concept Enhancement (ACE) improves concept understanding of vision-language models (VLMs)
ACE continually incorporates augmented action synonyms and negatives in an auxiliary classification loss.
We show the enhanced concept understanding of our VLM, by visualizing the alignment of encoded embeddings of unseen action synonyms in the embedding space.
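A rough sketch of what such an auxiliary loss could look like, assuming CLIP-style video and text embeddings; the names (ace_aux_loss, syn_embs, neg_embs) and the multi-positive softmax are hypothetical stand-ins, not ACE's published formulation.

```python
import torch

def ace_aux_loss(video_emb: torch.Tensor,   # [B, D] video embeddings
                 syn_embs: torch.Tensor,    # [S, D] embeddings of action synonyms (positives)
                 neg_embs: torch.Tensor,    # [N, D] embeddings of hard negatives
                 tau: float = 0.07) -> torch.Tensor:
    """Multi-positive softmax over text-embedding similarities: any
    augmented synonym counts as a correct class, and sampled negatives
    act as distractor classes."""
    pos = video_emb @ syn_embs.t() / tau            # [B, S]
    neg = video_emb @ neg_embs.t() / tau            # [B, N]
    logits = torch.cat([pos, neg], dim=1)           # [B, S + N]
    # -log P(any synonym) = logsumexp over all classes minus the positive block
    loss = torch.logsumexp(logits, dim=1) - torch.logsumexp(pos, dim=1)
    return loss.mean()
```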
arXiv Detail & Related papers (2024-11-23T18:49:49Z)
- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition [57.17966905865054]
Real-life applications of action recognition often require a fine-grained understanding of subtle movements.
Existing semi-supervised action recognition has mainly focused on coarse-grained action recognition.
We propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs.
arXiv Detail & Related papers (2024-09-02T20:08:06Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
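For intuition, below is a minimal differentiable relaxation of the LCS recursion over a snippet-similarity matrix, with a temperature-scaled logsumexp standing in for the hard max. This is our illustrative stand-in, not the paper's exact FSD/LCS objectives.

```python
import torch

def soft_lcs(sim: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Smoothed longest-common-subsequence score for a similarity
    matrix sim of shape [m, n] between the snippets of two videos.

    Hard recursion: dp[i][j] = max(dp[i-1][j], dp[i][j-1],
                                   dp[i-1][j-1] + sim[i-1, j-1]);
    the max is replaced by t * logsumexp(. / t) so the score is
    differentiable and can be contrasted across video pairs.
    """
    m, n = sim.shape
    zero = sim.new_zeros(())
    prev = [zero] * (n + 1)                     # row i-1 of the DP table
    for i in range(1, m + 1):
        curr = [zero]                           # dp[i][0] = 0
        for j in range(1, n + 1):
            cands = torch.stack([prev[j],       # skip a snippet in video 1
                                 curr[j - 1],   # skip a snippet in video 2
                                 prev[j - 1] + sim[i - 1, j - 1]])  # match
            curr.append(temperature * torch.logsumexp(cands / temperature, dim=0))
        prev = curr
    return prev[n]
```

A contrastive objective would then pull this score up for same-class video pairs and push it down for different-class pairs.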
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- End-to-End Semi-Supervised Learning for Video Action Detection [23.042410033982193]
We propose a simple end-to-end approach that effectively utilizes the unlabeled data.
Video action detection requires both action class prediction and spatio-temporal consistency.
We demonstrate the effectiveness of the proposed approach on two different action detection benchmark datasets.
arXiv Detail & Related papers (2022-03-08T18:11:25Z)
- FineAction: A Fine-Grained Video Dataset for Temporal Action Localization [60.90129329728657]
FineAction is a new large-scale fine-grained video dataset collected from existing video datasets and web videos.
This dataset contains 139K fine-grained action instances densely annotated in almost 17K untrimmed videos spanning 106 action categories.
Experimental results reveal that FineAction brings new challenges for action localization on fine-grained and multi-label instances with shorter durations.
arXiv Detail & Related papers (2021-05-24T06:06:32Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
Weakly-supervised methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
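A minimal sketch of what separate action/context subspaces might look like, with an orthogonality regularizer we add purely for illustration; the class names, dimensions, and regularizer are our assumptions, and the paper's actual framework and losses are more involved.

```python
import torch
import torch.nn as nn

class ActionContextHeads(nn.Module):
    """Project shared snippet features into two learned subspaces, one
    for actions and one for their context (illustrative sketch)."""
    def __init__(self, feat_dim: int = 2048, sub_dim: int = 512):
        super().__init__()
        self.to_action = nn.Linear(feat_dim, sub_dim, bias=False)
        self.to_context = nn.Linear(feat_dim, sub_dim, bias=False)

    def forward(self, feats: torch.Tensor):          # feats: [T, feat_dim]
        f_act = self.to_action(feats)
        f_ctx = self.to_context(feats)
        # hypothetical regularizer keeping the two subspaces apart
        ortho = (self.to_action.weight @ self.to_context.weight.t()).pow(2).mean()
        return f_act, f_ctx, ortho
```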
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Weakly-Supervised Action Localization by Generative Attention Modeling [65.03548422403061]
Weakly-supervised temporal action localization is a problem of learning an action localization model with only video-level action labeling available.
We propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE).
By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated.
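A compact sketch of an attention-conditioned VAE in this spirit; the layer sizes, single-linear encoder/decoder, and loss weighting are placeholders rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCVAE(nn.Module):
    """Reconstruct a snippet feature x conditioned on its frame
    attention lam; illustrative single-layer encoder/decoder."""
    def __init__(self, feat_dim: int = 2048, z_dim: int = 128):
        super().__init__()
        self.enc = nn.Linear(feat_dim + 1, 2 * z_dim)   # q(z | x, lam)
        self.dec = nn.Linear(z_dim + 1, feat_dim)       # p(x | z, lam)

    def forward(self, x: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
        # x: [T, feat_dim] snippet features, lam: [T, 1] attention in [0, 1]
        mu, logvar = self.enc(torch.cat([x, lam], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        recon = self.dec(torch.cat([z, lam], dim=-1))
        rec = F.mse_loss(recon, x)            # -log p(x | z, lam) up to a constant
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return rec + kl                       # negative ELBO
```

Training would then alternate between fitting the CVAE and updating the attention to maximize the conditional likelihood, which is what separates action from non-action frames.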
arXiv Detail & Related papers (2020-03-27T14:02:56Z)
- Weakly Supervised Temporal Action Localization Using Deep Metric Learning [12.49814373580862]
We propose a weakly supervised temporal action localization method that requires only video-level action labels as supervision during training.
We jointly optimize a balanced binary cross-entropy loss and a metric loss using a standard backpropagation algorithm.
Our approach improves the current state-of-the-art result for THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance for ActivityNet1.2.
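A hedged sketch of such a joint objective; the "balanced" reweighting and the triplet form of the metric term are one plausible reading for illustration, not necessarily the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def balanced_bce(video_scores: torch.Tensor, video_labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over video-level class scores, reweighted so
    positive and negative classes contribute equally (our assumption)."""
    pos = video_labels.sum().clamp(min=1.0)
    neg = (1.0 - video_labels).sum().clamp(min=1.0)
    weight = video_labels / pos + (1.0 - video_labels) / neg
    return F.binary_cross_entropy_with_logits(video_scores, video_labels, weight=weight)

def metric_loss(anchor, positive, negative, margin: float = 1.0) -> torch.Tensor:
    """Stand-in metric term: pull features of the same action class
    together, push different classes apart."""
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

# total = balanced_bce(scores, labels) + lam * metric_loss(a, p, n),
# minimized jointly with standard backpropagation
```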
arXiv Detail & Related papers (2020-01-21T22:01:17Z)