Action Shuffle Alternating Learning for Unsupervised Action Segmentation
- URL: http://arxiv.org/abs/2104.02116v1
- Date: Mon, 5 Apr 2021 18:58:57 GMT
- Title: Action Shuffle Alternating Learning for Unsupervised Action Segmentation
- Authors: Jun Li, Sinisa Todorovic
- Abstract summary: We train an RNN to recognize positive and negative action sequences, and the RNN's hidden layer is taken as our new action-level feature embedding.
As supervision of actions is not available, we specify an HMM that explicitly models action lengths, and infer a MAP action segmentation with the Viterbi algorithm.
The resulting action segmentation is used as pseudo-ground truth for estimating our action-level feature embedding and updating the HMM.
- Score: 38.32743770719661
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses unsupervised action segmentation. Prior work captures
the frame-level temporal structure of videos by a feature embedding that
encodes time locations of frames in the video. We advance prior work with new
self-supervised learning (SSL) of a feature embedding that accounts for both
frame- and action-level structure of videos. Our SSL trains an RNN to recognize
positive and negative action sequences, and the RNN's hidden layer is taken as
our new action-level feature embedding. The positive and negative sequences
consist of action segments sampled from videos, where in the former the sampled
action segments respect their time ordering in the video, and in the latter
they are shuffled. As supervision of actions is not available and our SSL
requires access to action segments, we specify an HMM that explicitly models
action lengths, and infer a MAP action segmentation with the Viterbi algorithm.
The resulting action segmentation is used as pseudo-ground truth for estimating
our action-level feature embedding and updating the HMM. We alternate the above
steps within the Generalized EM framework, which ensures convergence. Our
evaluation on the Breakfast, YouTube Instructions, and 50Salads datasets shows
superior results over the state of the art.
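To make the action-shuffle objective concrete, below is a minimal PyTorch sketch of the SSL step, assuming each video is already given as an ordered list of pseudo-labeled segment features (e.g., pooled frame features per segment). The module names, GRU size, and one-video training step are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the action-shuffle SSL (illustrative, not the authors'
# code). Positive sequences keep the segments' temporal order; negative
# sequences shuffle them. The RNN's final hidden state is taken as the
# action-level feature embedding.
import torch
import torch.nn as nn

class ActionShuffleRNN(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        self.cls = nn.Linear(hidden_dim, 1)  # ordered (1) vs. shuffled (0)

    def forward(self, segments: torch.Tensor):
        # segments: (batch, num_segments, feat_dim)
        _, h = self.rnn(segments)
        embedding = h[-1]                    # (batch, hidden_dim) embedding
        return self.cls(embedding).squeeze(-1), embedding

def positive_negative_pair(video_segments: torch.Tensor):
    """Build one in-order (positive) and one shuffled (negative) sequence."""
    perm = torch.randperm(video_segments.size(0))
    # A real implementation would re-sample if perm were the identity.
    return video_segments, video_segments[perm]

# One SSL training step on a single video's segment features.
feat_dim = 64
model = ActionShuffleRNN(feat_dim)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

video_segments = torch.randn(7, feat_dim)        # 7 pseudo-labeled segments
pos, neg = positive_negative_pair(video_segments)
logits, _ = model(torch.stack([pos, neg]))       # (2,) order scores
loss = nn.functional.binary_cross_entropy_with_logits(
    logits, torch.tensor([1.0, 0.0]))
opt.zero_grad()
loss.backward()
opt.step()
```

Note that the binary loss only supervises whether temporal order was preserved; the embedding itself is a byproduct of that task.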
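The MAP segmentation step can likewise be sketched as a semi-Markov-style Viterbi pass over an HMM whose states carry an explicit length model. Everything below (the Poisson length prior, uniform start and transitions, toy dimensions) is an assumption for illustration; the paper's actual length model and grammar may differ.

```python
# Simplified sketch of MAP action segmentation with an HMM that explicitly
# models action lengths (semi-Markov Viterbi). The Poisson length prior and
# all toy inputs are illustrative assumptions.
import numpy as np
from scipy.stats import poisson

def viterbi_segment(log_lik, log_trans, mean_len, max_len=50):
    """log_lik: (T, A) frame log-likelihoods; log_trans: (A, A) action
    transition log-probs; mean_len: (A,) expected action lengths.
    Returns the MAP segmentation as (action, start, end) triples."""
    T, A = log_lik.shape
    cum = np.vstack([np.zeros(A), np.cumsum(log_lik, axis=0)])  # prefix sums
    score = np.full((T + 1, A), -np.inf)  # best segmentation ending at t in a
    score[0, :] = 0.0                     # uniform prior over the first action
    back = {}
    for t in range(1, T + 1):
        for a in range(A):
            for l in range(1, min(max_len, t) + 1):
                s = t - l                 # candidate segment covers [s, t)
                seg = cum[t, a] - cum[s, a] + poisson.logpmf(l, mean_len[a])
                if s == 0:
                    cand, prev = seg, None
                else:
                    step = score[s] + log_trans[:, a]
                    prev = int(np.argmax(step))
                    cand = seg + step[prev]
                if cand > score[t, a]:
                    score[t, a] = cand
                    back[(t, a)] = (s, prev)
    t, a = T, int(np.argmax(score[T]))    # trace back the MAP segmentation
    segments = []
    while t > 0:
        s, prev = back[(t, a)]
        segments.append((a, s, t))
        t, a = s, prev
    return segments[::-1]

# Toy usage: 100 frames, 3 actions with expected lengths 20/30/40 frames.
T, A = 100, 3
log_lik = np.log(np.random.dirichlet(np.ones(A), size=T))
log_trans = np.log(np.full((A, A), 1.0 / A))
print(viterbi_segment(log_lik, log_trans, mean_len=np.array([20.0, 30.0, 40.0])))
```

In the full method, the segmentation decoded this way serves as pseudo-ground truth for re-estimating the embedding and the HMM, with the two steps alternated under Generalized EM.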
Related papers
- Weakly-Supervised Temporal Action Localization with Bidirectional
Semantic Consistency Constraint [83.36913240873236]
Weakly-Supervised Temporal Action Localization (WTAL) aims to classify and localize temporal boundaries of actions in a video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z)
- TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and
Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
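Of the three objectives named above, relative time prediction is the easiest to illustrate: regress each frame's normalized timestamp from its embedding, so that temporally close frames embed similarly. The layer sizes and MSE loss below are assumptions for illustration, not TAEC's actual architecture.

```python
# Hedged sketch of relative time prediction (one of the three TAEC
# objectives): embed each frame and regress its normalized timestamp t/T.
# Layer sizes and the MSE loss are assumptions, not TAEC's actual design.
import torch
import torch.nn as nn

feat_dim, emb_dim = 64, 32
embed = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU())
time_head = nn.Linear(emb_dim, 1)

frames = torch.randn(240, feat_dim)                     # T frames of one video
targets = torch.arange(240, dtype=torch.float32) / 239  # relative times in [0, 1]
pred = time_head(embed(frames)).squeeze(-1)
loss = nn.functional.mse_loss(pred, targets)
loss.backward()  # pushes temporally close frames toward similar embeddings
```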
arXiv Detail & Related papers (2023-03-09T10:46:23Z)
- Learning to Refactor Action and Co-occurrence Features for Temporal
Action Localization [74.74339878286935]
Action features are often entangled with co-occurrence features, which can dominate the actual action content in videos.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z)
- ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal
Action Localization [36.90693762365237]
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training.
We propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods.
Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
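As a rough illustration of component (ii), intra-segment attention can be sketched as self-attention restricted to the snippets of one candidate segment; the head count and dimensions below are assumptions, and ASM-Loc's exact formulation may differ.

```python
# Hedged sketch of intra-segment attention: self-attention applied only to
# the snippets inside one candidate action segment. Dimensions and head
# count are illustrative assumptions.
import torch
import torch.nn as nn

snip_dim = 128
attn = nn.MultiheadAttention(embed_dim=snip_dim, num_heads=4, batch_first=True)

segment = torch.randn(1, 12, snip_dim)        # 12 snippets in one proposal
refined, _ = attn(segment, segment, segment)  # models dynamics within the segment
```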
arXiv Detail & Related papers (2022-03-29T01:59:26Z)
- Part-level Action Parsing via a Pose-guided Coarse-to-Fine Framework [108.70949305791201]
Part-level Action Parsing (PAP) aims to not only predict the video-level action but also recognize the frame-level fine-grained actions or interactions of body parts for each person in the video.
In particular, our framework first predicts the video-level class of the input video, then localizes the body parts and predicts the part-level action.
Our framework achieves state-of-the-art performance, outperforming existing methods with a 31.10% ROC score.
arXiv Detail & Related papers (2022-03-09T01:30:57Z)
- Unsupervised Action Segmentation with Self-supervised Feature Learning
and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame of a video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
- Anchor-Constrained Viterbi for Set-Supervised Action Segmentation [38.32743770719661]
This paper is about action segmentation under weak supervision in training.
We use a Hidden Markov Model (HMM) grounded on a multilayer perceptron (MLP) to label video frames.
In testing, a Monte Carlo sampling of action sets seen in training is used to generate candidate temporal sequences of actions.
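A hedged sketch of that test-time Monte Carlo step: draw an action set seen in training, then a candidate ordering of it, to be scored by the HMM. The uniform sampling over sets and permutations and the toy action names are assumptions.

```python
# Hedged sketch of Monte Carlo sampling of training action sets to generate
# candidate temporal action sequences. The uniform sampling scheme and toy
# action names are illustrative assumptions.
import random

train_action_sets = [{"pour", "stir"}, {"crack", "fry", "plate"}]

def sample_candidate_sequence(rng: random.Random):
    actions = list(rng.choice(train_action_sets))  # pick a seen action set
    rng.shuffle(actions)                           # one candidate ordering
    return actions

rng = random.Random(0)
candidates = [sample_candidate_sequence(rng) for _ in range(5)]
print(candidates)  # each candidate would then be scored by the HMM/Viterbi
```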
arXiv Detail & Related papers (2021-04-05T18:50:21Z)
- Weakly Supervised Temporal Action Localization with Segment-Level Labels [140.68096218667162]
Temporal action localization presents a trade-off between test performance and annotation-time cost.
We introduce a new segment-level supervision setting: segments are labeled when annotators observe actions happening in them.
We devise a partial segment loss, which can be regarded as a form of loss sampling, to learn integral action parts from labeled segments.
arXiv Detail & Related papers (2020-07-03T10:32:19Z)
- Unsupervised Learning of Video Representations via Dense Trajectory
Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class - instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)
- Set-Constrained Viterbi for Set-Supervised Action Segmentation [40.22433538226469]
This paper is about weakly supervised action segmentation.
The ground truth specifies only a set of actions present in a training video, but not their true temporal ordering.
We extend this framework by specifying an HMM, which accounts for co-occurrences of action classes and their temporal lengths.
arXiv Detail & Related papers (2020-02-27T05:32:52Z)