Related papers: Action Shuffling for Weakly Supervised Temporal Localization

Action Shuffling for Weakly Supervised Temporal Localization

URL: http://arxiv.org/abs/2105.04208v1
Date: Mon, 10 May 2021 09:05:58 GMT
Title: Action Shuffling for Weakly Supervised Temporal Localization
Authors: Xiao-Yu Zhang, Haichao Shi, Changsheng Li, Xinchu Shi
Abstract summary: This paper analyzes the order-sensitive and location-insensitive properties of actions. It embodies them into a self-augmented learning framework to improve the weakly supervised action localization performance.
Score: 22.43209053892713
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Weakly supervised action localization is a challenging task with extensive applications, which aims to identify actions and the corresponding temporal intervals with only video-level annotations available. This paper analyzes the order-sensitive and location-insensitive properties of actions, and embodies them into a self-augmented learning framework to improve the weakly supervised action localization performance. To be specific, we propose a novel two-branch network architecture with intra/inter-action shuffling, referred to as ActShufNet. The intra-action shuffling branch lays out a self-supervised order prediction task to augment the video representation with inner-video relevance, whereas the inter-action shuffling branch imposes a reorganizing strategy on the existing action contents to augment the training set without resorting to any external resources. Furthermore, the global-local adversarial training is presented to enhance the model's robustness to irrelevant noises. Extensive experiments are conducted on three benchmark datasets, and the results clearly demonstrate the efficacy of the proposed method.

Related papers

Object-Centric Latent Action Learning [70.3173534658611]
We propose a novel object-centric latent action learning approach, based on VideoSaur and LAPO. This method effectively disentangles causal agent-object interactions from irrelevant background noise and reduces the performance degradation caused by distractors. Our preliminary experiments with the Distracting Control Suite show that latent action pretraining based on object decompositions improve the quality of inferred latent actions by x2.7 and efficiency of downstream fine-tuning with a small set of labeled actions, increasing return by x2.6 on average.
arXiv Detail & Related papers (2025-02-13T11:27:05Z)
Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training. We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
arXiv Detail & Related papers (2023-05-29T02:48:04Z)
Leveraging triplet loss for unsupervised action segmentation [0.0]
We propose a fully unsupervised framework that learns action representations suitable for the action segmentation task from the single input video itself. Our method is a deep metric learning approach rooted in a shallow network with a triplet loss operating on similarity distributions. Under these circumstances, we successfully recover temporal boundaries in the learned action representations with higher quality compared with existing unsupervised approaches.
arXiv Detail & Related papers (2023-04-13T11:10:16Z)
Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature [26.7937345622207]
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in unsupervised videos simultaneously. Pseudo label generation is a promising strategy to solve the challenging problem, but the current methods ignore the natural temporal structure of the video. We propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature.
arXiv Detail & Related papers (2023-03-22T06:08:34Z)
Multi-modal Prompting for Low-Shot Temporal Action Localization [95.19505874963751]
We consider the problem of temporal action localization under low-shot (zero-shot & few-shot) scenario. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification.
arXiv Detail & Related papers (2023-03-21T10:40:13Z)
Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization [74.74339878286935]
Action features and co-occurrence features often dominate the actual action content in videos. We develop a novel auxiliary task by decoupling these two types of features within a video snippet. We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z)
Look at What I'm Doing: Self-Supervised Spatial Grounding of Narrations in Instructional Videos [78.34818195786846]
We introduce the task of spatially localizing narrated interactions in videos. Key to our approach is the ability to learn to spatially localize interactions with self-supervision on a large corpus of videos with accompanying transcribed narrations. We propose a multilayer cross-modal attention network that enables effective optimization of a contrastive loss during training.
arXiv Detail & Related papers (2021-10-20T14:45:13Z)
Hierarchical Modeling for Task Recognition and Action Segmentation in Weakly-Labeled Instructional Videos [6.187780920448871]
This paper focuses on task recognition and action segmentation in weakly-labeled instructional videos. We propose a two-stream framework, which exploits semantic and temporal hierarchies to recognize top-level tasks in instructional videos. We present a novel top-down weakly-supervised action segmentation approach, where the predicted task is used to constrain the inference of fine-grained action sequences.
arXiv Detail & Related papers (2021-10-12T02:32:15Z)
Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
temporal action segmentation is a task to classify each frame in the video with an action label. In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos. We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
Learning Actor-centered Representations for Action Localization in Streaming Videos using Predictive Learning [18.757368441841123]
Event perception tasks such as recognizing and localizing actions in streaming videos are essential for tackling visual understanding tasks. We tackle the problem of learning textitactor-centered representations through the notion of continual hierarchical predictive learning. Inspired by cognitive theories of event perception, we propose a novel, self-supervised framework.
arXiv Detail & Related papers (2021-04-29T06:06:58Z)
Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
Methods learn to localize temporal starts and ends of action instances in a video under only video-level supervision. We introduce a framework that learns two feature subspaces respectively for actions and their context. The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
Two-Stream Consensus Network for Weakly-Supervised Temporal Action Localization [94.37084866660238]
We present a Two-Stream Consensus Network (TSCN) to simultaneously address these challenges. The proposed TSCN features an iterative refinement training method, where a frame-level pseudo ground truth is iteratively updated. We propose a new attention normalization loss to encourage the predicted attention to act like a binary selection, and promote the precise localization of action instance boundaries.
arXiv Detail & Related papers (2020-10-22T10:53:32Z)

This list is automatically generated from the titles and abstracts of the papers in this site.