Transformers in Action: Weakly Supervised Action Segmentation
- URL: http://arxiv.org/abs/2201.05675v1
- Date: Fri, 14 Jan 2022 21:15:58 GMT
- Title: Transformers in Action: Weakly Supervised Action Segmentation
- Authors: John Ridley, Huseyin Coskun, David Joseph Tan, Nassir Navab, Federico Tombari
- Abstract summary: We show how to apply transformers to improve action alignment accuracy over the equivalent RNN-based models.
We also propose a supplemental transcript embedding approach to select transcripts more quickly at inference time.
We evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers.
- Score: 81.18941007536468
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The video action segmentation task is regularly explored under weaker forms
of supervision, such as transcript supervision, where a list of actions is
easier to obtain than dense frame-wise labels. In this formulation, the task
presents various challenges for sequence modeling approaches due to the
emphasis on action transition points, long sequence lengths, and frame
contextualization, making the task well-posed for transformers. Given
developments enabling transformers to scale linearly, we demonstrate through
our architecture how they can be applied to improve action alignment accuracy
over the equivalent RNN-based models, with the attention mechanism focusing
on salient action transition regions. Additionally, given the recent focus
on inference-time transcript selection, we propose a supplemental transcript
embedding approach to select transcripts more quickly at inference time. We
then demonstrate how this approach can also improve the overall segmentation
performance. Finally, we evaluate our proposed methods
across the benchmark datasets to better understand the applicability of
transformers and the importance of transcript selection on this video-driven
weakly-supervised task.
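To make the transcript-selection idea concrete, here is a minimal sketch of embedding-based transcript selection at inference time. This is not the paper's implementation: the bag-of-actions embedding, the function names, and the stand-in video encoder are illustrative assumptions in place of the learned embedding network described in the abstract.

```python
# Minimal sketch: pick the candidate transcript whose embedding lies closest
# to a video-level embedding, instead of aligning every candidate transcript
# to every frame. All names and the histogram embedding are hypothetical.
import numpy as np

def embed_transcript(transcript: list[int], num_actions: int) -> np.ndarray:
    """Toy transcript embedding: an L2-normalized bag-of-actions histogram.
    A learned embedding network would replace this in practice."""
    hist = np.bincount(np.asarray(transcript), minlength=num_actions).astype(float)
    return hist / (np.linalg.norm(hist) + 1e-8)

def select_transcript(video_embedding: np.ndarray,
                      transcripts: list[list[int]],
                      num_actions: int) -> list[int]:
    """Rank all candidate transcripts by cosine similarity to the video
    embedding (one dot product each) and return the best match."""
    embeddings = np.stack([embed_transcript(t, num_actions) for t in transcripts])
    scores = embeddings @ video_embedding  # cosine similarity of unit vectors
    return transcripts[int(np.argmax(scores))]

# Usage: candidate transcripts drawn from the training set; the video
# embedding here reuses the toy transcript encoder as a stand-in for a
# real video encoder.
candidates = [[0, 2, 1], [0, 1, 1, 3], [2, 0, 3]]
video_emb = embed_transcript([0, 2, 1], num_actions=4)
print(select_transcript(video_emb, candidates, num_actions=4))  # -> [0, 2, 1]
```

The appeal of such a scheme, as the abstract suggests, is speed: scoring each candidate costs one dot product rather than a full frame-to-transcript alignment pass, so the expensive alignment only needs to run for the selected transcript.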
Related papers
- Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment [33.74853437611066]
Weakly-supervised action segmentation is a task of learning to partition a long video into several action segments, where training videos are only accompanied by transcripts.
Most existing methods need to infer pseudo segmentation for training via serial alignment between all frames and the transcript.
We propose a novel Action-Transition-Aware Boundary Alignment framework to efficiently and effectively filter out noisy boundaries and detect transitions.
arXiv Detail & Related papers (2024-03-28T08:39:44Z)
- POTLoc: Pseudo-Label Oriented Transformer for Point-Supervised Temporal Action Localization [26.506893363676678]
This paper proposes POTLoc, a Pseudo-Label Oriented Transformer for point-supervised temporal action localization.
POTLoc is designed to identify and track continuous action structures via a self-training strategy.
It outperforms the state-of-the-art point-supervised methods on THUMOS'14 and ActivityNet-v1.2 datasets.
arXiv Detail & Related papers (2023-10-20T15:28:06Z)
- AntPivot: Livestream Highlight Detection via Hierarchical Attention Mechanism [64.70568612993416]
We formulate a new task, Livestream Highlight Detection, analyze its difficulties, and propose a novel architecture, AntPivot, to solve this problem.
We construct a fully-annotated dataset AntHighlight to instantiate this task and evaluate the performance of our model.
arXiv Detail & Related papers (2022-06-10T05:58:11Z)
- Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos [29.81187528951681]
We propose entity-aware and motion-aware Transformers that progressively localize actions in videos.
The entity-aware Transformer incorporates the textual entities into visual representation learning.
The motion-aware Transformer captures fine-grained motion changes at multiple temporal scales.
arXiv Detail & Related papers (2022-05-12T03:00:40Z)
- SVIP: Sequence VerIfication for Procedures in Videos [68.07865790764237]
We propose a novel sequence verification task that aims to distinguish positive video pairs performing the same action sequence from negative ones with step-level transformations.
This challenging task resides in an open-set setting without prior action detection or segmentation.
We collect a scripted video dataset enumerating all kinds of step-level transformations in chemical experiments.
arXiv Detail & Related papers (2021-12-13T07:03:36Z)
- Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework, which requires similar annotation costs but steadily improves localization performance compared to conventional weakly supervised methods.
In this paper, we reveal that the performance bottleneck of existing approaches mainly comes from background errors, and we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames.
arXiv Detail & Related papers (2021-11-24T12:02:52Z)
- Temporal Action Proposal Generation with Transformers [25.66256889923748]
This paper presents a unified temporal action proposal generation framework built on the original Transformer architecture.
The Boundary Transformer captures long-term temporal dependencies to predict precise boundary information.
The Proposal Transformer learns the rich inter-proposal relationships for reliable confidence evaluation.
arXiv Detail & Related papers (2021-05-25T16:22:12Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer with a snippet actionness loss and a front block, dubbing the result the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.