Transformers in Action: Weakly Supervised Action Segmentation
- URL: http://arxiv.org/abs/2201.05675v1
- Date: Fri, 14 Jan 2022 21:15:58 GMT
- Title: Transformers in Action: Weakly Supervised Action Segmentation
- Authors: John Ridley, Huseyin Coskun, David Joseph Tan, Nassir Navab, Federico Tombari
- Abstract summary: We show how to apply transformers to improve action alignment accuracy over the equivalent RNN-based models.
We also propose a supplemental transcript embedding approach to select transcripts more quickly at inference time.
We evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers.
- Score: 81.18941007536468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The video action segmentation task is regularly explored under weaker forms
of supervision, such as transcript supervision, where a list of actions is
easier to obtain than dense frame-wise labels. In this formulation, the task
presents various challenges for sequence modeling approaches due to the
emphasis on action transition points, long sequence lengths, and frame
contextualization, making the task well-posed for transformers. Given
developments enabling transformers to scale linearly, we demonstrate through
our architecture how they can be applied to improve action alignment accuracy
over the equivalent RNN-based models, with the attention mechanism focusing
on salient action transition regions. Additionally, given the recent focus
on inference-time transcript selection, we propose a supplemental transcript
embedding approach to select transcripts more quickly at inference time. We
then demonstrate how this approach can also improve the overall segmentation
performance. Finally, we evaluate our proposed methods
across the benchmark datasets to better understand the applicability of
transformers and the importance of transcript selection on this video-driven
weakly-supervised task.
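To make the transcript-selection idea concrete, here is a minimal sketch of embedding-based transcript selection at inference time. This is not the paper's implementation: the bag-of-actions embedding, the function names, and the stand-in video encoder are illustrative assumptions in place of the learned embedding network described in the abstract.

```python
# Minimal sketch: pick the candidate transcript whose embedding lies closest
# to a video-level embedding, instead of aligning every candidate transcript
# to every frame. All names and the histogram embedding are hypothetical.
import numpy as np

def embed_transcript(transcript: list[int], num_actions: int) -> np.ndarray:
    """Toy transcript embedding: an L2-normalized bag-of-actions histogram.
    A learned embedding network would replace this in practice."""
    hist = np.bincount(np.asarray(transcript), minlength=num_actions).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-8)

def select_transcript(video_embedding: np.ndarray,
                      transcripts: list[list[int]],
                      num_actions: int) -> list[int]:
    """Rank all candidate transcripts by cosine similarity to the video
    embedding (one dot product each) and return the best match."""
    embeddings = np.stack([embed_transcript(t, num_actions) for t in transcripts])
    scores = embeddings @ video_embedding  # cosine similarity of unit vectors
    return transcripts[int(np.argmax(scores))]

# Usage: candidate transcripts drawn from the training set; the video
# embedding here reuses the toy transcript encoder as a stand-in for a
# real video encoder.
candidates = [[0, 2, 1], [0, 1, 1, 3], [2, 0, 3]]
video_emb = embed_transcript([0, 2, 1], num_actions=4)
print(select_transcript(video_emb, candidates, num_actions=4))  # -> [0, 2, 1]
```

The appeal of such a scheme, as the abstract suggests, is speed: scoring each candidate costs one dot product rather than a full frame-to-transcript alignment pass, so the expensive alignment only needs to run for the selected transcript.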
Related papers
- Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment [33.74853437611066]
Weakly-supervised action segmentation is a task of learning to partition a long video into several action segments, where training videos are only accompanied by transcripts.
Most existing methods need to infer pseudo segmentation for training via serial alignment between all frames and the transcript.
We propose a novel Action-Transition-Aware Boundary Alignment framework to efficiently and effectively filter out noisy boundaries and detect transitions.
arXiv Detail & Related papers (2024-03-28T08:39:44Z)
- POTLoc: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization [26.506893363676678]
This paper proposes POTLoc, a Pseudo-Label Oriented Transformer for point-supervised temporal action localization.
POTLoc is designed to identify and track continuous action structures via a self-training strategy.
It outperforms the state-of-the-art point-supervised methods on THUMOS'14 and ActivityNet-v1.2 datasets.
arXiv Detail & Related papers (2023-10-20T15:28:06Z)
- AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism [64.70568612993416]
We formulate a new task, Livestream Highlight Detection, analyze its difficulties, and propose a novel architecture, AntPivot, to solve this problem.
We construct a fully-annotated dataset AntHighlight to instantiate this task and evaluate the performance of our model.
arXiv Detail & Related papers (2022-06-10T05:58:11Z)
- Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos [29.81187528951681]
We propose entity-aware and motion-aware Transformers that progressively localize actions in videos.
The entity-aware Transformer incorporates the textual entities into visual representation learning.
The motion-aware Transformer captures fine-grained motion changes at multiple temporal scales.
arXiv Detail & Related papers (2022-05-12T03:00:40Z)
- SVIP: Sequence VerIfication for Procedures in Videos [68.07865790764237]
We propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones with step-level transformations.
This challenging task resides in an open-set setting without prior action detection or segmentation.
We collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments.
arXiv Detail & Related papers (2021-12-13T07:03:36Z)
- Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework, which requires similar annotation costs but steadily improves localization performance compared to conventional weakly supervised methods.
In this paper, we reveal that the performance bottleneck of existing approaches mainly comes from background errors, and we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames.
arXiv Detail & Related papers (2021-11-24T12:02:52Z)
- Temporal Action Proposal Generation with Transformers [25.66256889923748]
This paper presents a unified temporal action proposal generation framework built on the original Transformer architecture.
The Boundary Transformer captures long-term temporal dependencies to predict precise boundary information.
The Proposal Transformer learns the rich inter-proposal relationships for reliable confidence evaluation.
arXiv Detail & Related papers (2021-05-25T16:22:12Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer with a snippet actionness loss and a front block, dubbing the result the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.