Temporal Segment Transformer for Action Segmentation
- URL: http://arxiv.org/abs/2302.13074v1
- Date: Sat, 25 Feb 2023 13:05:57 GMT
- Title: Temporal Segment Transformer for Action Segmentation
- Authors: Zhichao Liu and Leshan Wang and Desen Zhou and Jian Wang and Songyang
Zhang and Yang Bai and Errui Ding and Rui Fan
- Abstract summary: We propose an attention-based approach which we call \textit{temporal segment transformer}, for joint segment relation modeling and denoising.
The main idea is to denoise segment representations using attention between segment and frame representations, and also use inter-segment attention to capture temporal correlations between segments.
We show that this novel architecture achieves state-of-the-art accuracy on the popular 50Salads, GTEA and Breakfast benchmarks.
- Score: 54.25103250496069
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing human actions from untrimmed videos is an important task in
activity understanding, and poses unique challenges in modeling long-range
temporal relations. Recent works adopt a predict-and-refine strategy which
converts an initial prediction to action segments for global context modeling.
However, the generated segment representations are often noisy and exhibit
inaccurate segment boundaries, over-segmentation and other problems. To deal
with these issues, we propose an attention-based approach which we call
\textit{temporal segment transformer}, for joint segment relation modeling and
denoising. The main idea is to denoise segment representations using attention
between segment and frame representations, and also use inter-segment attention
to capture temporal correlations between segments. The refined segment
representations are used to predict action labels and adjust segment
boundaries, and a final action segmentation is produced based on voting from
segment masks. We show that this novel architecture achieves state-of-the-art
accuracy on the popular 50Salads, GTEA and Breakfast benchmarks. We also
conduct extensive ablations to demonstrate the effectiveness of different
components of our design.
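The two attention steps described in the abstract (segment-to-frame cross-attention for denoising, then inter-segment self-attention for temporal correlation) can be sketched with plain scaled dot-product attention. This is a minimal illustrative sketch with random features and made-up shapes, not the paper's actual model:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
d = 16          # feature dimension (illustrative)
T, S = 100, 5   # number of frames, number of predicted segments
frames = rng.standard_normal((T, d))    # frame representations
segments = rng.standard_normal((S, d))  # noisy initial segment representations

# 1) segment-to-frame cross-attention: each segment query attends over
#    frame keys/values to denoise its own representation
segments = segments + attention(segments, frames, frames)

# 2) inter-segment self-attention: capture temporal correlations
#    between the refined segments
segments = segments + attention(segments, segments, segments)

print(segments.shape)  # (5, 16)
```

In the actual architecture these steps would use learned projections and feed into label and boundary heads; the sketch only shows the attention data flow.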
Related papers
- Image Segmentation in Foundation Model Era: A Survey [99.19456390358211]
Current research in image segmentation lacks a detailed analysis of distinct characteristics, challenges, and solutions associated with these advancements.
This survey seeks to fill this gap by providing a thorough review of cutting-edge research centered around FM-driven image segmentation.
An exhaustive overview of over 300 segmentation approaches is provided to encapsulate the breadth of current research efforts.
arXiv Detail & Related papers (2024-08-23T10:07:59Z)
- Appearance-Based Refinement for Object-Centric Motion Segmentation [85.2426540999329]
We introduce an appearance-based refinement method that leverages temporal consistency in video streams to correct inaccurate flow-based proposals.
Our approach involves a sequence-level selection mechanism that identifies accurate flow-predicted masks as exemplars.
Its performance is evaluated on multiple video segmentation benchmarks, including DAVIS, YouTube, SegTrackv2, and FBMS-59.
arXiv Detail & Related papers (2023-12-18T18:59:51Z)
- RefineVIS: Video Instance Segmentation with Temporal Attention Refinement [23.720986152136785]
RefineVIS learns two separate representations on top of an off-the-shelf frame-level image instance segmentation model.
A Temporal Attention Refinement (TAR) module learns discriminative segmentation representations by exploiting temporal relationships.
It achieves state-of-the-art video instance segmentation accuracy on the YouTube-VIS 2019 (64.4 AP), YouTube-VIS 2021 (61.4 AP), and OVIS (46.1 AP) datasets.
arXiv Detail & Related papers (2023-06-07T20:45:15Z)
- Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework via denoising diffusion models, which shares the same inherent spirit of such iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
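The iterative-refinement idea behind this diffusion framework can be sketched as repeatedly denoising random per-frame logits toward a conditioning signal. The denoiser below is a hypothetical linear-interpolation stand-in (the real model is a learned network conditioned on video features):

```python
import numpy as np

rng = np.random.default_rng(0)
T, C = 50, 4                        # frames, action classes (illustrative)
cond = rng.standard_normal((T, C))  # stand-in for video-feature conditioning

def denoise_step(x, cond, t, steps):
    # hypothetical denoiser: nudge noisy logits toward the conditioning
    # signal, with later steps trusted more
    alpha = (t + 1) / steps
    return (1 - alpha) * x + alpha * cond

x = rng.standard_normal((T, C))     # start from pure noise
steps = 10
for t in range(steps):
    x = denoise_step(x, cond, t, steps)

labels = x.argmax(axis=1)           # per-frame action labels
print(labels.shape)  # (50,)
```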
arXiv Detail & Related papers (2023-03-31T10:53:24Z)
- TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
arXiv Detail & Related papers (2023-03-09T10:46:23Z)
- Monotonic segmental attention for automatic speech recognition [45.036436385637295]
We introduce a novel segmental-attention model for automatic speech recognition.
We compare global-attention and different segmental-attention modeling variants.
We observe that the segmental model generalizes much better to long sequences of up to several minutes.
arXiv Detail & Related papers (2022-10-26T14:21:23Z)
- ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization [36.90693762365237]
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training.
We propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods.
Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
arXiv Detail & Related papers (2022-03-29T01:59:26Z)
- Self-supervised Sparse to Dense Motion Segmentation [13.888344214818737]
We propose a self-supervised method to learn the densification of sparse motion segmentations from single video frames.
We evaluate our method on the well-known motion segmentation datasets FBMS59 and DAVIS16.
arXiv Detail & Related papers (2020-08-18T11:40:18Z)
- MS-TCN++: Multi-Stage Temporal Convolutional Network for Action Segmentation [87.16030562892537]
We propose a multi-stage architecture for the temporal action segmentation task.
The first stage generates an initial prediction that is refined by the next ones.
Our models achieve state-of-the-art results on three datasets.
arXiv Detail & Related papers (2020-06-16T14:50:47Z)
- Motion-supervised Co-Part Segmentation [88.40393225577088]
We propose a self-supervised deep learning method for co-part segmentation.
Our approach develops the idea that motion information inferred from videos can be leveraged to discover meaningful object parts.
arXiv Detail & Related papers (2020-04-07T09:56:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.