Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint
- URL: http://arxiv.org/abs/2304.12616v1
- Date: Tue, 25 Apr 2023 07:20:33 GMT
- Title: Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint
- Authors: Guozhang Li, De Cheng, Xinpeng Ding, Nannan Wang, Jie Li, Xinbo Gao
- Abstract summary: Weakly-Supervised Temporal Action Localization (WTAL) aims to classify actions and localize their temporal boundaries in a video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
- Score: 83.36913240873236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-Supervised Temporal Action Localization (WTAL) aims to classify actions and
localize their temporal boundaries in a video, given only video-level category labels in the
training datasets. Due to the lack of boundary information during training, existing approaches
formulate WTAL as a classification problem, i.e., they generate a temporal class activation map
(T-CAM) for localization. However, with only a classification loss, the model is sub-optimized:
the action-related scenes alone are enough to distinguish different class labels. Regarding other
actions in the action-related scene (i.e., the same scene as the positive actions) as co-scene
actions, this sub-optimized model misclassifies the co-scene actions as positive actions. To
address this misclassification, we propose a simple yet
efficient method, named bidirectional semantic consistency constraint (Bi-SCC),
to discriminate the positive actions from co-scene actions. The proposed Bi-SCC
first adopts a temporal context augmentation to generate an augmented video that breaks the
inter-video correlation between positive actions and their co-scene actions. Then, a semantic
consistency constraint (SCC) is used to enforce the predictions of the original and augmented
videos to be consistent, hence suppressing the co-scene actions. However, we find that this
augmented video would destroy the original temporal context. Simply applying
the consistency constraint would affect the completeness of localized positive
actions. Hence, we boost the SCC in a bidirectional way to suppress co-scene
actions while ensuring the integrity of positive actions, by cross-supervising
the original and augmented videos. Finally, our proposed Bi-SCC can be applied
to current WTAL approaches, and improve their performance. Experimental results
show that our approach outperforms the state-of-the-art methods on THUMOS14 and
ActivityNet.
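As a concrete illustration of the cross-supervision described above, here is a minimal PyTorch sketch of a bidirectional consistency term between the T-CAMs of the original and augmented videos. The function name, tensor shapes, and the choice of KL divergence with a stop-gradient on the supervising side are our assumptions for illustration; the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def bi_scc_loss(tcam_orig: torch.Tensor, tcam_aug: torch.Tensor) -> torch.Tensor:
    """Hedged sketch of a bidirectional semantic consistency term.

    tcam_orig, tcam_aug: temporal class activation maps of shape [T, C]
    for the original and the augmented video (assumed aligned in time).
    """
    log_p_orig = tcam_orig.log_softmax(dim=-1)
    log_p_aug = tcam_aug.log_softmax(dim=-1)
    # Original view supervised by the augmented view: since co-scene
    # correlations are broken in the augmented video, this direction
    # suppresses co-scene responses in the original T-CAM.
    fwd = F.kl_div(log_p_orig, log_p_aug.detach().exp(), reduction="batchmean")
    # Augmented view supervised by the original view: the original video
    # keeps the full temporal context, so this direction protects the
    # completeness of the localized positive actions.
    bwd = F.kl_div(log_p_aug, log_p_orig.detach().exp(), reduction="batchmean")
    return fwd + bwd
```

Such a term would be added to the base classification loss of whichever WTAL model Bi-SCC is plugged into.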
Related papers
- ACE: Action Concept Enhancement of Video-Language Models in Procedural Videos [7.030989629685138]
Action Concept Enhancement (ACE) improves concept understanding of vision-language models (VLMs)
ACE continually incorporates augmented action synonyms and negatives in an auxiliary classification loss.
We show the enhanced concept understanding of our VLM, by visualizing the alignment of encoded embeddings of unseen action synonyms in the embedding space.
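A rough sketch of what such an auxiliary loss could look like, assuming CLIP-style video and text embeddings; the names (ace_aux_loss, syn_embs, neg_embs) and the multi-positive softmax are hypothetical stand-ins, not ACE's published formulation.

```python
import torch

def ace_aux_loss(video_emb: torch.Tensor,   # [B, D] video embeddings
                 syn_embs: torch.Tensor,    # [S, D] embeddings of action synonyms (positives)
                 neg_embs: torch.Tensor,    # [N, D] embeddings of hard negatives
                 tau: float = 0.07) -> torch.Tensor:
    """Multi-positive softmax over text-embedding similarities: any
    augmented synonym counts as a correct class, and sampled negatives
    act as distractor classes."""
    pos = video_emb @ syn_embs.t() / tau            # [B, S]
    neg = video_emb @ neg_embs.t() / tau            # [B, N]
    logits = torch.cat([pos, neg], dim=1)           # [B, S + N]
    # -log P(any synonym) = logsumexp over all classes minus the positive block
    loss = torch.logsumexp(logits, dim=1) - torch.logsumexp(pos, dim=1)
    return loss.mean()
```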
arXiv Detail & Related papers (2024-11-23T18:49:49Z)
- FinePseudo: Improving Pseudo-Labelling through Temporal-Alignablity for Semi-Supervised Fine-Grained Action Recognition [57.17966905865054]
Real-life applications of action recognition often require a fine-grained understanding of subtle movements.
Existing semi-supervised action recognition has mainly focused on coarse-grained action recognition.
We propose an Alignability-Verification-based Metric learning technique to effectively discriminate between fine-grained action pairs.
arXiv Detail & Related papers (2024-09-02T20:08:06Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
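For intuition, below is a minimal differentiable relaxation of the LCS recursion over a snippet-similarity matrix, with a temperature-scaled logsumexp standing in for the hard max. This is our illustrative stand-in, not the paper's exact FSD/LCS objectives.

```python
import torch

def soft_lcs(sim: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Smoothed longest-common-subsequence score for a similarity
    matrix sim of shape [m, n] between the snippets of two videos.

    Hard recursion: dp[i][j] = max(dp[i-1][j], dp[i][j-1],
                                   dp[i-1][j-1] + sim[i-1, j-1]);
    the max is replaced by t * logsumexp(. / t) so the score is
    differentiable and can be contrasted across video pairs.
    """
    m, n = sim.shape
    zero = sim.new_zeros(())
    prev = [zero] * (n + 1)                     # row i-1 of the DP table
    for i in range(1, m + 1):
        curr = [zero]                           # dp[i][0] = 0
        for j in range(1, n + 1):
            cands = torch.stack([prev[j],       # skip a snippet in video 1
                                 curr[j - 1],   # skip a snippet in video 2
                                 prev[j - 1] + sim[i - 1, j - 1]])  # match
            curr.append(temperature * torch.logsumexp(cands / temperature, dim=0))
        prev = curr
    return prev[n]
```

A contrastive objective would then pull this score up for same-class video pairs and push it down for different-class pairs.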
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- End-to-End Semi-Supervised Learning for Video Action Detection [23.042410033982193]
We propose a simple end-to-end approach that effectively utilizes the unlabeled data.
Video action detection requires both action class prediction and spatio-temporal consistency.
We demonstrate the effectiveness of the proposed approach on two different action detection benchmark datasets.
arXiv Detail & Related papers (2022-03-08T18:11:25Z)
- FineAction: A Fine-Grained Video Dataset for Temporal Action Localization [60.90129329728657]
FineAction is a new large-scale fine-grained video dataset collected from existing video datasets and web videos.
This dataset contains 139K fine-grained action instances densely annotated in almost 17K untrimmed videos spanning 106 action categories.
Experimental results reveal that FineAction brings new challenges for action localization on fine-grained and multi-label instances with shorter durations.
arXiv Detail & Related papers (2021-05-24T06:06:32Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
Weakly-supervised methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
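A minimal sketch of what separate action/context subspaces might look like, with an orthogonality regularizer we add purely for illustration; the class names, dimensions, and regularizer are our assumptions, and the paper's actual framework and losses are more involved.

```python
import torch
import torch.nn as nn

class ActionContextHeads(nn.Module):
    """Project shared snippet features into two learned subspaces, one
    for actions and one for their context (illustrative sketch)."""
    def __init__(self, feat_dim: int = 2048, sub_dim: int = 512):
        super().__init__()
        self.to_action = nn.Linear(feat_dim, sub_dim, bias=False)
        self.to_context = nn.Linear(feat_dim, sub_dim, bias=False)

    def forward(self, feats: torch.Tensor):          # feats: [T, feat_dim]
        f_act = self.to_action(feats)
        f_ctx = self.to_context(feats)
        # hypothetical regularizer keeping the two subspaces apart
        ortho = (self.to_action.weight @ self.to_context.weight.t()).pow(2).mean()
        return f_act, f_ctx, ortho
```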
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Weakly-Supervised Action Localization by Generative Attention Modeling [65.03548422403061]
Weakly-supervised temporal action localization is a problem of learning an action localization model with only video-level action labeling available.
We propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE).
By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated.
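A compact sketch of an attention-conditioned VAE in this spirit; the layer sizes, single-linear encoder/decoder, and loss weighting are placeholders rather than the paper's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCVAE(nn.Module):
    """Reconstruct a snippet feature x conditioned on its frame
    attention lam; illustrative single-layer encoder/decoder."""
    def __init__(self, feat_dim: int = 2048, z_dim: int = 128):
        super().__init__()
        self.enc = nn.Linear(feat_dim + 1, 2 * z_dim)   # q(z | x, lam)
        self.dec = nn.Linear(z_dim + 1, feat_dim)       # p(x | z, lam)

    def forward(self, x: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
        # x: [T, feat_dim] snippet features, lam: [T, 1] attention in [0, 1]
        mu, logvar = self.enc(torch.cat([x, lam], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterize
        recon = self.dec(torch.cat([z, lam], dim=-1))
        rec = F.mse_loss(recon, x)            # -log p(x | z, lam) up to a constant
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).mean()
        return rec + kl                       # negative ELBO
```

Training would then alternate between fitting the CVAE and updating the attention to maximize the conditional likelihood, which is what separates action from non-action frames.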
arXiv Detail & Related papers (2020-03-27T14:02:56Z)
- Weakly Supervised Temporal Action Localization Using Deep Metric Learning [12.49814373580862]
We propose a weakly supervised temporal action localization method that requires only video-level action labels as supervision during training.
We jointly optimize a balanced binary cross-entropy loss and a metric loss using a standard backpropagation algorithm.
Our approach improves the current state-of-the-art result for THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance for ActivityNet1.2.
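A hedged sketch of such a joint objective; the "balanced" reweighting and the triplet form of the metric term are one plausible reading for illustration, not necessarily the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def balanced_bce(video_scores: torch.Tensor, video_labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over video-level class scores, reweighted so
    positive and negative classes contribute equally (our assumption)."""
    pos = video_labels.sum().clamp(min=1.0)
    neg = (1.0 - video_labels).sum().clamp(min=1.0)
    weight = video_labels / pos + (1.0 - video_labels) / neg
    return F.binary_cross_entropy_with_logits(video_scores, video_labels, weight=weight)

def metric_loss(anchor, positive, negative, margin: float = 1.0) -> torch.Tensor:
    """Stand-in metric term: pull features of the same action class
    together, push different classes apart."""
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

# total = balanced_bce(scores, labels) + lam * metric_loss(a, p, n),
# minimized jointly with standard backpropagation
```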
arXiv Detail & Related papers (2020-01-21T22:01:17Z)