MS-TCN++: Multi-Stage Temporal Convolutional Network for Action
Segmentation
- URL: http://arxiv.org/abs/2006.09220v2
- Date: Wed, 2 Sep 2020 10:18:47 GMT
- Title: MS-TCN++: Multi-Stage Temporal Convolutional Network for Action
Segmentation
- Authors: Shijie Li, Yazan Abu Farha, Yun Liu, Ming-Ming Cheng, Juergen Gall
- Abstract summary: We propose a multi-stage architecture for the temporal action segmentation task.
The first stage generates an initial prediction that is refined by the next ones.
Our models achieve state-of-the-art results on three datasets.
- Score: 87.16030562892537
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the success of deep learning in classifying short trimmed videos, more
attention has been focused on temporally segmenting and classifying activities
in long untrimmed videos. State-of-the-art approaches for action segmentation
utilize several layers of temporal convolution and temporal pooling. Despite
the capabilities of these approaches in capturing temporal dependencies, their
predictions suffer from over-segmentation errors. In this paper, we propose a
multi-stage architecture for the temporal action segmentation task that
overcomes the limitations of the previous approaches. The first stage generates
an initial prediction that is refined by the next ones. In each stage we stack
several layers of dilated temporal convolutions covering a large receptive
field with few parameters. While this architecture already performs well, lower
layers still suffer from a small receptive field. To address this limitation,
we propose a dual dilated layer that combines both large and small receptive
fields. We further decouple the design of the first stage from the refining
stages to address the different requirements of these stages. Extensive
evaluation shows the effectiveness of the proposed model in capturing
long-range dependencies and recognizing action segments. Our models achieve
state-of-the-art results on three datasets: 50Salads, Georgia Tech Egocentric
Activities (GTEA), and the Breakfast dataset.
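The abstract names three concrete design choices: stacks of dilated temporal convolutions, a dual dilated layer that fuses a large and a small receptive field, and refinement stages that re-predict from the previous stage's output. The following is a minimal PyTorch sketch of those ideas only; the layer count, channel widths, dilation schedule, and stage wiring are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the ideas in the abstract (assumed hyperparameters).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualDilatedLayer(nn.Module):
    """Combines a large and a small receptive field: one dilated conv whose
    dilation grows with depth and one whose dilation shrinks."""
    def __init__(self, channels, dilation_large, dilation_small):
        super().__init__()
        self.conv_large = nn.Conv1d(channels, channels, 3,
                                    padding=dilation_large, dilation=dilation_large)
        self.conv_small = nn.Conv1d(channels, channels, 3,
                                    padding=dilation_small, dilation=dilation_small)
        self.fuse = nn.Conv1d(2 * channels, channels, 1)  # merge the two views

    def forward(self, x):                       # x: (batch, channels, time)
        out = torch.cat([self.conv_large(x), self.conv_small(x)], dim=1)
        return x + self.fuse(F.relu(out))       # residual connection

class Stage(nn.Module):
    """One stage: a stack of dual dilated layers plus frame-wise logits."""
    def __init__(self, in_dim, channels, num_classes, num_layers=10):
        super().__init__()
        self.embed = nn.Conv1d(in_dim, channels, 1)
        self.layers = nn.ModuleList(
            DualDilatedLayer(channels, 2 ** i, 2 ** (num_layers - 1 - i))
            for i in range(num_layers))
        self.classify = nn.Conv1d(channels, num_classes, 1)

    def forward(self, x):
        x = self.embed(x)
        for layer in self.layers:
            x = layer(x)
        return self.classify(x)                 # (batch, num_classes, time)

class MultiStageModel(nn.Module):
    """First stage predicts from features; each refinement stage re-predicts
    from the previous stage's softmaxed logits."""
    def __init__(self, feat_dim, channels, num_classes, num_stages=4):
        super().__init__()
        self.first = Stage(feat_dim, channels, num_classes)
        self.refine = nn.ModuleList(
            Stage(num_classes, channels, num_classes)
            for _ in range(num_stages - 1))

    def forward(self, features):                # features: (batch, feat_dim, time)
        logits = self.first(features)
        outputs = [logits]
        for stage in self.refine:
            logits = stage(F.softmax(logits, dim=1))
            outputs.append(logits)
        return outputs                          # one prediction per stage
```

Feeding softmax probabilities rather than raw features into the refinement stages makes the later stages act as smoothers of the initial prediction, which is how the multi-stage design counters over-segmentation; the sketch's separate `Stage` instances for prediction and refinement loosely mirror the abstract's decoupling of the two stage types.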
Related papers
- BIT: Bi-Level Temporal Modeling for Efficient Supervised Action
Segmentation [34.88225099758585]
Supervised action segmentation aims to partition a video into non-overlapping segments, each representing a different action.
Recent works apply transformers to perform temporal modeling at the frame level, which incurs a high computational cost.
We propose an efficient bi-level temporal modeling framework (BIT) that learns explicit action tokens to represent action segments.
arXiv Detail & Related papers (2023-08-28T20:59:15Z)
- DIR-AS: Decoupling Individual Identification and Temporal Reasoning for
Action Segmentation [84.78383981697377]
Fully supervised action segmentation works on frame-wise action recognition with dense annotations and often suffers from the over-segmentation issue.
We develop a novel local-global attention mechanism with temporal pyramid dilation and temporal pyramid pooling for efficient multi-scale attention.
We achieve state-of-the-art accuracy, e.g., 82.8% (+2.6%) on GTEA and 74.7% (+1.2%) on Breakfast, which demonstrates the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-04-04T20:27:18Z)
- Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework based on denoising diffusion models, which share the same inherent spirit of iterative refinement.
In this framework, action predictions are iteratively generated from random noise, with input video features as conditions (a toy sketch of this denoising loop appears after this list).
arXiv Detail & Related papers (2023-03-31T10:53:24Z)
- TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and
Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
arXiv Detail & Related papers (2023-03-09T10:46:23Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are available only for the source dataset and are unavailable for the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- Skeleton-Based Action Segmentation with Multi-Stage Spatial-Temporal
Graph Convolutional Neural Networks [0.5156484100374059]
State-of-the-art action segmentation approaches use multiple stages of temporal convolutions.
We present multi-stage spatial-temporal graph convolutional neural networks (MS-GCN).
We replace the initial stage of temporal convolutions with spatial-temporal graph convolutions, which better exploit the spatial configuration of the joints (a hedged sketch of this replacement appears after this list).
arXiv Detail & Related papers (2022-02-03T17:42:04Z)
- ASFormer: Transformer for Action Segmentation [9.509416095106493]
We present an efficient Transformer-based model for the action segmentation task, named ASFormer.
It constrains the hypothesis space within a reliable scope, and is beneficial for the action segmentation task to learn a proper target function with small training sets.
We apply a pre-defined hierarchical representation pattern that efficiently handles long input sequences.
arXiv Detail & Related papers (2021-10-16T13:07:20Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-size temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better distinguish between classes of actions by enhancing their feature differences across different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Boundary-sensitive Pre-training for Temporal Localization in Videos [124.40788524169668]
We investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task.
With the synthesized boundaries, BSP can be simply conducted via classifying the boundary types.
Extensive experiments show that the proposed BSP is superior and complementary to the existing action-classification-based pre-training counterpart.
arXiv Detail & Related papers (2020-11-21T17:46:24Z)
- Hierarchical Attention Network for Action Segmentation [45.19890687786009]
The temporal segmentation of events is an essential task and a precursor to the automatic recognition of human actions in video.
We propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time.
We evaluate our system on challenging public benchmark datasets, including the MERL Shopping, 50Salads, and Georgia Tech Egocentric datasets.
arXiv Detail & Related papers (2020-05-07T02:39:18Z)
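The Diffusion Action Segmentation entry above describes predictions being iteratively generated from random noise, conditioned on video features. Below is a toy, DDPM-flavored sketch of that loop alone; the `denoiser` callable, the linear schedule, and the noise re-injection are generic assumptions for illustration, not the paper's actual formulation.

```python
# Toy iterative-denoising loop for frame-wise action logits (assumed schedule).
import torch

@torch.no_grad()
def iterative_denoise(denoiser, features, num_classes, steps=25):
    """denoiser(noisy_logits, features, t) -> estimate of clean logits."""
    batch, _, time = features.shape
    x = torch.randn(batch, num_classes, time)          # start from pure noise
    for t in reversed(range(steps)):
        x0_hat = denoiser(x, features, t)              # predicted clean logits
        alpha = t / steps                              # crude linear schedule
        x = alpha * x + (1 - alpha) * x0_hat           # move toward the estimate
        if t > 0:
            x = x + 0.1 * alpha * torch.randn_like(x)  # re-inject some noise
    return x.softmax(dim=1)                            # frame-wise probabilities
```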
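The MS-GCN entry states that the initial stage of temporal convolutions is replaced with spatial-temporal graph convolutions that exploit the spatial configuration of the joints. Here is a hedged sketch of one such layer; the adjacency handling, kernel sizes, and channel widths are assumptions rather than the MS-GCN authors' exact design.

```python
# Sketch of a spatial-temporal graph convolution over skeleton joints.
import torch
import torch.nn as nn

class SpatialTemporalGraphConv(nn.Module):
    """Spatial aggregation over the joint graph, then a temporal convolution."""
    def __init__(self, in_channels, out_channels, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)  # (joints, joints), normalized
        self.spatial = nn.Conv2d(in_channels, out_channels, 1)
        self.temporal = nn.Conv2d(out_channels, out_channels,
                                  kernel_size=(9, 1), padding=(4, 0))

    def forward(self, x):                     # x: (batch, channels, time, joints)
        x = self.spatial(x)
        x = torch.einsum("bctv,vw->bctw", x, self.A)  # aggregate joint neighbors
        return self.temporal(x).relu()
```

Pooled over the joint dimension, the output of a stack of such layers could stand in for the first stage of the multi-stage model sketched after the abstract, with the dilated refinement stages left unchanged.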