Weakly-Supervised Action Localization by Hierarchically-structured
Latent Attention Modeling
- URL: http://arxiv.org/abs/2308.09946v2
- Date: Tue, 26 Sep 2023 03:37:42 GMT
- Title: Weakly-Supervised Action Localization by Hierarchically-structured
Latent Attention Modeling
- Authors: Guiqin Wang and Peng Zhao and Cong Zhao and Shusen Yang and Jie Cheng
and Luziwei Leng and Jianxing Liao and Qinghai Guo
- Abstract summary: Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos with only video-level labels.
Most existing models rely on multiple instance learning (MIL), where predictions of unlabeled instances are supervised by classifying labeled bags.
We propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics.
- Score: 19.683714649646603
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised action localization aims to recognize and localize action
instances in untrimmed videos with only video-level labels. Most existing
models rely on multiple instance learning (MIL), where the predictions of
unlabeled instances are supervised by classifying labeled bags. The MIL-based
methods are relatively well studied with cogent performance achieved on
classification but not on localization. Generally, they locate temporal regions
by the video-level classification but overlook the temporal variations of
feature semantics. To address this problem, we propose a novel attention-based
hierarchically-structured latent model to learn the temporal variations of
feature semantics. Specifically, our model entails two components: the first is
an unsupervised change-points detection module that detects change-points by
learning the latent representations of video features in a temporal hierarchy
based on their rates of change, and the second is an attention-based
classification model that selects the change-points of the foreground as the
boundaries. To evaluate the effectiveness of our model, we conduct extensive
experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The
experiments show that our method outperforms current state-of-the-art methods,
and even achieves comparable performance with fully-supervised methods.
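
The two-stage idea described in the abstract (detect change-points, then keep those that bound foreground regions) can be illustrated with a small toy sketch. This is not the authors' model: the paper learns latent representations in a temporal hierarchy, whereas the sketch below assumes a plain distance threshold between consecutive snippet features, and it assumes the per-snippet foreground attention scores are already given. The function names, thresholds, and toy data are all hypothetical.

```python
import math


def detect_change_points(features, threshold):
    """Return indices t where the jump from snippet t-1 to t exceeds threshold.

    Assumption: a simple Euclidean-distance test stands in for the learned
    hierarchical latent model described in the abstract.
    """
    points = []
    for t in range(1, len(features)):
        if math.dist(features[t - 1], features[t]) > threshold:
            points.append(t)
    return points


def select_action_segments(change_points, attention, num_snippets, fg_threshold=0.5):
    """Keep segments between change-points whose mean attention looks like foreground.

    Assumption: `attention` holds one foreground score in [0, 1] per snippet.
    """
    bounds = [0] + list(change_points) + [num_snippets]
    segments = []
    for start, end in zip(bounds, bounds[1:]):
        if end <= start:
            continue  # skip empty spans (e.g. a change-point at index 0)
        mean_attention = sum(attention[start:end]) / (end - start)
        if mean_attention > fg_threshold:
            segments.append((start, end))  # half-open interval [start, end)
    return segments


# Toy input: seven snippets with 2-D features; snippets 2..4 form the "action".
features = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0],
            [2.0, 2.1], [0.0, 0.1], [0.1, 0.1]]
attention = [0.1, 0.2, 0.9, 0.8, 0.9, 0.2, 0.1]

change_points = detect_change_points(features, threshold=1.0)
segments = select_action_segments(change_points, attention, len(features))
print(change_points)  # [2, 5]
print(segments)       # [(2, 5)]
```

In this toy case the only segment retained is the span between the two detected change-points, because it is the only one whose mean attention exceeds the foreground threshold.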
Related papers
- Unsupervised Temporal Action Localization via Self-paced Incremental
Learning [57.55765505856969]
We present a novel self-paced incremental learning model to enhance clustering and localization training simultaneously.
We design two (constant- and variable-speed) incremental instance learning strategies for easy-to-hard model training, thus ensuring the reliability of these video pseudolabels.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot
End-to-End Temporal Action Detection [10.012716326383567]
Temporal action detection (TAD) involves the localization and classification of action instances within untrimmed videos.
We present ZEETAD, featuring two modules: dual-localization and zero-shot proposal classification.
We enhance discriminative capability on unseen classes by minimally updating the frozen CLIP encoder with lightweight adapters.
arXiv Detail & Related papers (2023-11-01T00:17:37Z)
- ATTA: Anomaly-aware Test-Time Adaptation for Out-of-Distribution
Detection in Segmentation [22.084967085509387]
We propose a dual-level OOD detection framework to handle domain shift and semantic shift jointly.
The first level distinguishes whether domain shift exists in the image by leveraging global low-level features.
The second level identifies pixels with semantic shift by utilizing dense high-level feature maps.
arXiv Detail & Related papers (2023-09-12T06:49:56Z)
- Weakly-Supervised Temporal Action Localization by Inferring Salient
Snippet-Feature [26.7937345622207]
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos simultaneously.
Pseudo label generation is a promising strategy to solve the challenging problem, but the current methods ignore the natural temporal structure of the video.
We propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature.
arXiv Detail & Related papers (2023-03-22T06:08:34Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised
Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal
Action Localization [36.90693762365237]
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training.
We propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods.
Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
arXiv Detail & Related papers (2022-03-29T01:59:26Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle variations between classes of actions, by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit
Subspaces for Action and Context [151.23835595907596]
Weakly-supervised temporal action localization (WS-TAL) methods learn to localize temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Unsupervised Domain Adaptation for Spatio-Temporal Action Localization [69.12982544509427]
Spatio-temporal action localization is an important problem in computer vision.
We propose an end-to-end unsupervised domain adaptation algorithm.
We show that significant performance gain can be achieved when spatial and temporal features are adapted separately or jointly.
arXiv Detail & Related papers (2020-10-19T04:25:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.