TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and
Clustering
- URL: http://arxiv.org/abs/2303.05166v1
- Date: Thu, 9 Mar 2023 10:46:23 GMT
- Title: TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and
Clustering
- Authors: Wei Lin, Anna Kukleva, Horst Possegger, Hilde Kuehne, Horst Bischof
- Abstract summary: We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
- Score: 27.52568444236988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action segmentation in untrimmed videos has gained increased
attention recently. However, annotating action classes and frame-wise
boundaries is extremely time-consuming and cost-intensive, especially on
large-scale datasets. To address this issue, we propose an unsupervised
approach for learning action classes from untrimmed video sequences. In
particular, we propose a temporal embedding network that combines relative time
prediction, feature reconstruction, and sequence-to-sequence learning, to
preserve the spatial layout and sequential nature of the video features. A
two-step clustering pipeline on these embedded feature representations then
allows us to enforce temporal consistency within, as well as across videos.
Based on the identified clusters, we decode the video into coherent temporal
segments that correspond to semantically meaningful action classes. Our
evaluation on three challenging datasets shows the impact of each component
and, furthermore, demonstrates our state-of-the-art unsupervised action
segmentation results.
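The embedding objectives named in the abstract are concrete enough to sketch. Below is a minimal, illustrative version of two of them, relative time prediction and feature reconstruction, applied to precomputed frame features; the sequence-to-sequence term is omitted, and all names, layer sizes, and the equal loss weighting are assumptions rather than details from the paper.

```python
# Illustrative sketch only: trains frame embeddings to predict their
# relative timestamp and to reconstruct the input features. Architecture
# and hyperparameters are assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class TemporalEmbedding(nn.Module):
    def __init__(self, feat_dim: int, emb_dim: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )
        self.time_head = nn.Linear(emb_dim, 1)       # relative timestamp in [0, 1]
        self.decoder = nn.Linear(emb_dim, feat_dim)  # feature reconstruction

    def forward(self, x):
        z = self.encoder(x)
        t_hat = torch.sigmoid(self.time_head(z)).squeeze(-1)
        return z, t_hat, self.decoder(z)

def embedding_loss(model, frames):
    """frames: (T, D) precomputed features of one untrimmed video."""
    T = frames.shape[0]
    rel_time = torch.arange(T, dtype=torch.float32) / max(T - 1, 1)
    _, t_hat, x_hat = model(frames)
    return nn.functional.mse_loss(t_hat, rel_time) + \
           nn.functional.mse_loss(x_hat, frames)
```

In the pipeline the abstract describes, the learned embeddings would then go through the two-step clustering (within and across videos) and a decoding stage that turns cluster assignments into coherent temporal segments; neither step is shown here.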
Related papers
- Spatial-Temporal Multi-level Association for Video Object Segmentation [89.32226483171047]
This paper proposes spatial-temporal multi-level association, which jointly associates reference frame, test frame, and object features.
Specifically, we construct a spatial-temporal multi-level feature association module to learn better target-aware features.
arXiv Detail & Related papers (2024-04-09T12:44:34Z)
- Temporal Segment Transformer for Action Segmentation [54.25103250496069]
We propose an attention-based approach, which we call the temporal segment transformer, for joint segment relation modeling and denoising.
The main idea is to denoise segment representations using attention between segment and frame representations, and also use inter-segment attention to capture temporal correlations between segments.
We show that this novel architecture achieves state-of-the-art accuracy on the popular 50Salads, GTEA and Breakfast benchmarks.
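The segment/frame attention described above can be sketched in a few lines; this is a hedged illustration with assumed shapes and names, not the paper's architecture:

```python
# Hedged sketch: segment tokens are denoised by cross-attention over
# frame features, then exchange information via inter-segment
# self-attention. Dimensions and module names are assumptions.
import torch.nn as nn

class SegmentDenoiser(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, segments, frames):
        # segments: (B, S, dim) segment tokens; frames: (B, T, dim) frame features
        seg, _ = self.cross_attn(query=segments, key=frames, value=frames)
        seg, _ = self.self_attn(seg, seg, seg)  # temporal correlations between segments
        return seg
```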
arXiv Detail & Related papers (2023-02-25T13:05:57Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle the variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
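One plausible way to read the fixed-kernel claim, sketched below purely as an assumption (the paper's own blocks are not reproduced here): run parallel 3D convolutions with different temporal extents and fuse them, so that no single temporal kernel size is baked in.

```python
# Assumed illustration, not the paper's module: parallel 3D convolutions
# with different temporal kernel sizes, averaged to fuse temporal scales.
import torch
import torch.nn as nn

class MultiTemporalConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, t_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(in_ch, out_ch, kernel_size=(t, 3, 3), padding=(t // 2, 1, 1))
            for t in t_sizes
        ])

    def forward(self, x):  # x: (B, C, T, H, W)
        return torch.stack([branch(x) for branch in self.branches]).mean(dim=0)
```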
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame of a video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
- Action Shuffle Alternating Learning for Unsupervised Action Segmentation [38.32743770719661]
We train an RNN to recognize positive and negative action sequences, and the RNN's hidden layer is taken as our new action-level feature embedding.
As supervision of actions is not available, we specify an HMM that explicitly models action lengths, and infer a MAP action segmentation with the Viterbi algorithm.
The resulting action segmentation is used as pseudo-ground truth for estimating our action-level feature embedding and updating the HMM.
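The decoding step named here is the classical Viterbi recursion, which is standard enough to sketch. The version below takes per-frame log-likelihoods and transition scores as placeholder inputs and returns the MAP label path; it omits the explicit action-length modeling the paper builds into its HMM.

```python
# Standard Viterbi MAP decoding over frame-wise action scores.
# Inputs are placeholders; the paper's HMM additionally models lengths.
import numpy as np

def viterbi(log_emissions, log_transitions, log_prior):
    """log_emissions: (T, K) frame log-likelihoods for K actions,
    log_transitions: (K, K) prev->cur scores, log_prior: (K,)."""
    T, K = log_emissions.shape
    score = log_prior + log_emissions[0]
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_transitions   # (K, K): prev -> cur
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emissions[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]                              # frame-wise MAP labels
```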
arXiv Detail & Related papers (2021-04-05T18:58:57Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
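To make the temporal weighting concrete: the assumed sketch below appends a scaled time coordinate to each frame feature before clustering, which biases temporally close frames into the same cluster. The paper's algorithm applies its weighting inside the hierarchical clustering itself, so this only imitates the effect.

```python
# Assumed illustration: temporally biased frame clustering by feature
# augmentation, not the paper's hierarchical algorithm.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def temporally_weighted_clusters(frames: np.ndarray, k: int, w: float = 1.0):
    """frames: (T, D) per-frame features; k: expected number of actions;
    w: how strongly temporal proximity influences cluster membership."""
    T = frames.shape[0]
    rel_time = np.linspace(0.0, 1.0, T).reshape(-1, 1)
    augmented = np.hstack([frames, w * rel_time])
    return AgglomerativeClustering(n_clusters=k).fit_predict(augmented)
```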
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
- MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation [87.16030562892537]
We propose a multi-stage architecture for the temporal action segmentation task.
The first stage generates an initial prediction that is refined by the next ones.
Our models achieve state-of-the-art results on three datasets.
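The multi-stage refinement is spelled out clearly enough to sketch: each stage after the first consumes the previous stage's softmaxed frame-wise predictions and refines them with dilated residual 1D convolutions. Layer counts and widths below are illustrative defaults, not the paper's exact configuration.

```python
# Minimal multi-stage TCN sketch: stage i > 0 refines the softmaxed
# predictions of stage i - 1. Hyperparameters are illustrative.
import torch
import torch.nn as nn

class DilatedResidualLayer(nn.Module):
    def __init__(self, ch: int, dilation: int):
        super().__init__()
        self.conv = nn.Conv1d(ch, ch, 3, padding=dilation, dilation=dilation)
        self.out = nn.Conv1d(ch, ch, 1)

    def forward(self, x):
        return x + self.out(torch.relu(self.conv(x)))

class Stage(nn.Module):
    def __init__(self, in_ch: int, ch: int, n_classes: int, n_layers: int = 10):
        super().__init__()
        self.inp = nn.Conv1d(in_ch, ch, 1)
        self.layers = nn.Sequential(
            *[DilatedResidualLayer(ch, 2 ** i) for i in range(n_layers)])
        self.cls = nn.Conv1d(ch, n_classes, 1)

    def forward(self, x):  # x: (B, in_ch, T) -> (B, n_classes, T)
        return self.cls(self.layers(self.inp(x)))

class MultiStageTCN(nn.Module):
    def __init__(self, feat_dim: int, n_classes: int, n_stages: int = 4):
        super().__init__()
        self.stages = nn.ModuleList(
            [Stage(feat_dim, 64, n_classes)] +
            [Stage(n_classes, 64, n_classes) for _ in range(n_stages - 1)])

    def forward(self, x):
        outputs = []
        for i, stage in enumerate(self.stages):
            x = stage(x if i == 0 else torch.softmax(x, dim=1))
            outputs.append(x)
        return outputs  # one frame-wise prediction per stage
```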
arXiv Detail & Related papers (2020-06-16T14:50:47Z)
- Hierarchical Attention Network for Action Segmentation [45.19890687786009]
The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in video.
We propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time.
We evaluate our system on challenging public benchmarks, including the MERL Shopping, 50Salads, and Georgia Tech Egocentric Activity (GTEA) datasets.
arXiv Detail & Related papers (2020-05-07T02:39:18Z)
- Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in Untrimmed Sequences [25.299599341774204]
This paper proposes an approach for the unsupervised learning of actions in untrimmed video sequences based on a joint visual-temporal embedding space.
We show that the proposed approach is able to provide a meaningful visual and temporal embedding out of the visual cues present in contiguous video frames.
arXiv Detail & Related papers (2020-01-29T22:51:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.