BIT: Bi-Level Temporal Modeling for Efficient Supervised Action
Segmentation
- URL: http://arxiv.org/abs/2308.14900v2
- Date: Sat, 7 Oct 2023 21:50:36 GMT
- Title: BIT: Bi-Level Temporal Modeling for Efficient Supervised Action
Segmentation
- Authors: Zijia Lu, Ehsan Elhamifar
- Abstract summary: Supervised action segmentation aims to partition a video into non-overlapping segments, each representing a different action.
Recent works apply transformers to perform temporal modeling at the frame level, which incurs a high computational cost.
We propose an efficient BI-level Temporal modeling framework that learns explicit action tokens to represent action segments.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We address the task of supervised action segmentation, which aims to
partition a video into non-overlapping segments, each representing a different
action. Recent works apply transformers to perform temporal modeling at the
frame level, which incurs a high computational cost and cannot adequately
capture action dependencies over long temporal horizons. To address these
issues, we propose an efficient BI-level Temporal modeling (BIT) framework that
learns explicit action tokens to represent action segments, performs temporal
modeling at the frame and action levels in parallel, and maintains a low
computational cost. Our model contains (i) a frame branch that uses convolutions
to learn frame-level relationships, (ii) an action branch that uses a
transformer to learn action-level dependencies over a small set of action
tokens, and (iii) cross-attentions that allow communication between the two
branches. We apply and extend a set-prediction objective so that each action
token can represent one or multiple action segments, avoiding the need to learn
a large number of tokens for long videos with many segments. Thanks to the
design of our action branch,
we can also seamlessly leverage textual transcripts of videos (when available)
to help action segmentation by using them to initialize the action tokens. We
evaluate our model on four video datasets (two egocentric and two third-person)
for action segmentation with and without transcripts, showing that BIT
significantly improves the state-of-the-art accuracy with much lower
computational cost (30 times faster) compared to existing transformer-based
methods.
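
To make the bi-level design concrete, below is a minimal PyTorch sketch of one
bi-level block as described in the abstract: a convolutional frame branch, a
transformer-style action branch over a small set of action tokens, and
cross-attention in both directions between the branches. The class names,
dimensions, and layer choices are illustrative assumptions, not the authors'
implementation; the set-prediction objective and transcript-based token
initialization are not shown.

```python
# Minimal sketch (not the authors' code) of one bi-level temporal modeling block.
import torch
import torch.nn as nn


class BiLevelBlock(nn.Module):
    def __init__(self, dim=256, num_heads=8, kernel_size=3, dilation=1):
        super().__init__()
        # Frame branch: dilated temporal convolution over per-frame features.
        self.frame_conv = nn.Conv1d(
            dim, dim, kernel_size,
            padding=dilation * (kernel_size - 1) // 2, dilation=dilation)
        # Action branch: self-attention among a small set of action tokens.
        self.action_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention in both directions for frame/action communication.
        self.action_from_frames = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.frames_from_actions = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frames, actions):
        # frames:  (B, T, dim) per-frame features, T can be thousands of frames
        # actions: (B, N, dim) learned action tokens, with N much smaller than T
        frames = frames + self.frame_conv(frames.transpose(1, 2)).transpose(1, 2)
        actions = actions + self.action_self_attn(actions, actions, actions)[0]
        # Action tokens read from frames; frames read back from action tokens.
        actions = actions + self.action_from_frames(actions, frames, frames)[0]
        frames = frames + self.frames_from_actions(frames, actions, actions)[0]
        return frames, actions


if __name__ == "__main__":
    B, T, N, D = 2, 2000, 20, 256
    block = BiLevelBlock(dim=D)
    frames = torch.randn(B, T, D)
    # Action tokens could be initialized from transcript text embeddings when available.
    actions = torch.randn(B, N, D)
    frames, actions = block(frames, actions)
    print(frames.shape, actions.shape)  # (2, 2000, 256) and (2, 20, 256)
```

Because attention is restricted to the small token set and to token-frame
cross-attention, the cost stays far below full frame-to-frame self-attention,
which is the efficiency argument the abstract makes.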
Related papers
- Efficient Temporal Action Segmentation via Boundary-aware Query Voting [51.92693641176378]
BaFormer is a boundary-aware Transformer network that tokenizes each video segment as an instance token.
BaFormer significantly reduces the computational costs, utilizing only 6% of the running time.
arXiv Detail & Related papers (2024-05-25T00:44:13Z) - TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and
Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
arXiv Detail & Related papers (2023-03-09T10:46:23Z) - Temporal Segment Transformer for Action Segmentation [54.25103250496069]
We propose an attention-based approach, which we call the temporal segment transformer, for joint segment relation modeling and denoising.
The main idea is to denoise segment representations using attention between segment and frame representations, and also use inter-segment attention to capture temporal correlations between segments.
We show that this novel architecture achieves state-of-the-art accuracy on the popular 50Salads, GTEA and Breakfast benchmarks.
arXiv Detail & Related papers (2023-02-25T13:05:57Z) - ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal
Action Localization [36.90693762365237]
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training.
We propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods.
Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
arXiv Detail & Related papers (2022-03-29T01:59:26Z) - Graph Convolutional Module for Temporal Action Localization in Videos [142.5947904572949]
We claim that the relations between action units play an important role in action localization.
A more powerful action detector should not only capture the local content of each action unit but also allow a wider field of view on the context related to it.
We propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods.
arXiv Detail & Related papers (2021-12-01T06:36:59Z) - ASFormer: Transformer for Action Segmentation [9.509416095106493]
We present an efficient Transformer-based model for the action segmentation task, named ASFormer.
It constrains the hypothesis space to a reliable scope, which helps the model learn a proper target function from small training sets.
We apply a pre-defined hierarchical representation pattern that efficiently handles long input sequences.
arXiv Detail & Related papers (2021-10-16T13:07:20Z) - Learning to Associate Every Segment for Video Panoptic Segmentation [123.03617367709303]
We learn coarse segment-level matching and fine pixel-level matching together.
We show that our per-frame computation model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets.
arXiv Detail & Related papers (2021-06-17T13:06:24Z) - Unsupervised Action Segmentation with Self-supervised Feature Learning
and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame in a video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z) - Modeling Multi-Label Action Dependencies for Temporal Action
Localization [53.53490517832068]
Real-world videos contain many complex actions with inherent relationships between action classes.
We propose an attention-based architecture that models these action relationships for the task of temporal action localization in untrimmed videos.
We show improved performance over state-of-the-art methods on multi-label action localization benchmarks.
arXiv Detail & Related papers (2021-03-04T13:37:28Z) - MS-TCN++: Multi-Stage Temporal Convolutional Network for Action
Segmentation [87.16030562892537]
We propose a multi-stage architecture for the temporal action segmentation task.
The first stage generates an initial prediction that is refined by the next ones.
Our models achieve state-of-the-art results on three datasets.
arXiv Detail & Related papers (2020-06-16T14:50:47Z) - Hierarchical Attention Network for Action Segmentation [45.19890687786009]
The temporal segmentation of events is an essential task and a precursor for the automatic recognition of human actions in videos.
We propose a complete end-to-end supervised learning approach that can better learn relationships between actions over time.
We evaluate our system on challenging public benchmark datasets, including the MERL Shopping, 50 Salads, and Georgia Tech Egocentric datasets.
arXiv Detail & Related papers (2020-05-07T02:39:18Z)