Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment
- URL: http://arxiv.org/abs/2403.19225v1
- Date: Thu, 28 Mar 2024 08:39:44 GMT
- Title: Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment
- Authors: Angchi Xu, Wei-Shi Zheng
- Abstract summary: Weakly-supervised action segmentation is the task of learning to partition a long video into several action segments, where training videos are accompanied only by transcripts.
Most existing methods must infer a pseudo segmentation for training via serial alignment between all frames and the transcript.
We propose a novel Action-Transition-Aware Boundary Alignment framework to efficiently and effectively filter out noisy boundaries and detect transitions.
- Score: 33.74853437611066
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly-supervised action segmentation is the task of learning to partition a long video into several action segments, where training videos are accompanied only by transcripts (ordered lists of actions). Most existing methods must infer a pseudo segmentation for training via serial alignment between all frames and the transcript, which is time-consuming and hard to parallelize during training. In this work, we aim to escape this inefficient alignment over massive but redundant frames, and instead directly localize a few action transitions for pseudo segmentation generation, where a transition refers to the change from an action segment to its next adjacent one in the transcript. As the true transitions are submerged among noisy boundaries caused by intra-segment visual variation, we propose a novel Action-Transition-Aware Boundary Alignment (ATBA) framework to efficiently and effectively filter out noisy boundaries and detect transitions. In addition, to boost semantic learning when noise is inevitably present in the pseudo segmentation, we also introduce video-level losses to exploit the trusted video-level supervision. Extensive experiments show the effectiveness of our approach in both performance and training speed.
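The abstract's core idea, localizing a few action transitions instead of aligning every frame, can be illustrated with a minimal sketch. This is not the paper's ATBA algorithm (which involves learned boundary filtering); it is a hypothetical baseline assuming one simple scoring rule: score each adjacent-frame pair by cosine dissimilarity of features, keep the top (M - 1) boundaries for a transcript of M actions, and expand the transcript into frame-wise pseudo labels. All function names and parameters here are illustrative.

```python
import numpy as np

def localize_transitions(frame_feats, num_actions):
    """Pick (num_actions - 1) transition frames from per-frame features.

    Boundary score = cosine dissimilarity between adjacent frames; the
    highest-scoring boundaries, kept in temporal order, become the
    pseudo segment transitions.
    """
    # Normalize so adjacent-frame dot products are cosine similarities.
    feats = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    # Dissimilarity between each frame and the next: shape (T-1,).
    scores = 1.0 - np.sum(feats[:-1] * feats[1:], axis=1)
    # Keep the top (num_actions - 1) boundaries, sorted back into time order.
    k = num_actions - 1
    idx = np.sort(np.argsort(scores)[::-1][:k])
    return idx + 1  # a transition at idx means a new segment starts at frame idx+1

def pseudo_segmentation(transitions, transcript, num_frames):
    """Expand a transcript plus transition points into frame-wise pseudo labels."""
    bounds = [0, *np.asarray(transitions).tolist(), num_frames]
    labels = np.empty(num_frames, dtype=int)
    for action, (start, end) in zip(transcript, zip(bounds[:-1], bounds[1:])):
        labels[start:end] = action
    return labels

# Toy example: 8 frames, clear feature change at frame 4, transcript of 2 actions.
feats = np.array([[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4)
t = localize_transitions(feats, num_actions=2)        # -> [4]
labels = pseudo_segmentation(t, transcript=[5, 9], num_frames=8)
# labels -> [5, 5, 5, 5, 9, 9, 9, 9]
```

In practice the paper replaces the hand-crafted dissimilarity with learned boundary probabilities and filters noisy candidates before selection, but the overall efficiency argument is visible even here: only M - 1 decisions are made per video instead of a full frame-to-transcript alignment.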
Related papers
- Efficient Temporal Action Segmentation via Boundary-aware Query Voting [51.92693641176378]
BaFormer is a boundary-aware Transformer network that tokenizes each video segment as an instance token.
BaFormer significantly reduces the computational costs, utilizing only 6% of the running time.
arXiv Detail & Related papers (2024-05-25T00:44:13Z)
- O-TALC: Steps Towards Combating Oversegmentation within Online Action Segmentation [0.48748194765816943]
We introduce two methods for improved training and inference of backbone action recognition models.
Firstly, we introduce dense sampling whilst training to facilitate training vs. inference clip matching and improve segment boundary predictions.
Secondly, we introduce an Online Temporally Aware Label Cleaning (O-TALC) strategy to explicitly reduce oversegmentation during online inference.
arXiv Detail & Related papers (2024-04-10T10:36:15Z)
- Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint [83.36913240873236]
Weakly Supervised Temporal Action Localization (WTAL) aims to classify actions and localize their temporal boundaries in a video.
We propose a simple yet efficient method, named bidirectional semantic consistency constraint (Bi-SCC), to discriminate positive actions from co-scene actions.
Experimental results show that our approach outperforms the state-of-the-art methods on THUMOS14 and ActivityNet.
arXiv Detail & Related papers (2023-04-25T07:20:33Z)
- Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos [29.81187528951681]
We propose entity-aware and motion-aware Transformers that progressively localize actions in videos.
The entity-aware Transformer incorporates the textual entities into visual representation learning.
The motion-aware Transformer captures fine-grained motion changes at multiple temporal scales.
arXiv Detail & Related papers (2022-05-12T03:00:40Z)
- Transformers in Action: Weakly Supervised Action Segmentation [81.18941007536468]
We show how to apply transformers to improve action alignment accuracy over the equivalent RNN-based models.
We also propose a supplemental transcript embedding approach to select transcripts more quickly at inference-time.
We evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers.
arXiv Detail & Related papers (2022-01-14T21:15:58Z)
- Contrastive Transformation for Self-supervised Correspondence Learning [120.62547360463923]
We study the self-supervised learning of visual correspondence using unlabeled videos in the wild.
Our method simultaneously considers intra- and inter-video representation associations for reliable correspondence estimation.
Our framework outperforms the recent self-supervised correspondence methods on a range of visual tasks.
arXiv Detail & Related papers (2020-12-09T14:05:06Z)
- Learning Motion Flows for Semi-supervised Instrument Segmentation from Robotic Surgical Video [64.44583693846751]
We study the semi-supervised instrument segmentation from robotic surgical videos with sparse annotations.
By exploiting generated data pairs, our framework can recover and even enhance temporal consistency of training sequences.
Results show that our method outperforms the state-of-the-art semi-supervised methods by a large margin.
arXiv Detail & Related papers (2020-07-06T02:39:32Z)
- Weakly Supervised Temporal Action Localization with Segment-Level Labels [140.68096218667162]
Temporal action localization presents a trade-off between test performance and annotation-time cost.
We introduce a new segment-level supervision setting: a segment is labeled when annotators observe an action happening within it.
We devise a partial segment loss, regarded as a form of loss sampling, to learn integral action parts from the labeled segments.
arXiv Detail & Related papers (2020-07-03T10:32:19Z)
- SCT: Set Constrained Temporal Transformer for Set Supervised Action Segmentation [22.887397951846353]
Weakly supervised approaches aim at learning temporal action segmentation from videos that are only weakly labeled.
We propose an approach that can be trained end-to-end on such data.
We evaluate our approach on three datasets where the approach achieves state-of-the-art results.
arXiv Detail & Related papers (2020-03-31T14:51:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.