Related papers: End-to-End Action Segmentation Transformer

End-to-End Action Segmentation Transformer

URL: http://arxiv.org/abs/2503.06316v3
Date: Wed, 27 Aug 2025 04:37:20 GMT
Title: End-to-End Action Segmentation Transformer
Authors: Tieqiao Wang, Sinisa Todorovic,
Abstract summary: We introduce the End-to-End Action Transformer (EAST), which processes raw video frames directly.<n>Our contributions are as follows: (1) a lightweight adapter design for effective fine-tuning of large backbones; (2) an efficient segmentation-by-detection framework for leveraging action proposals predicted over a coarsely downsampled video; and (3) a novel action-proposal-based data augmentation strategy.
Score: 13.30372897896507
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Most recent work on action segmentation relies on pre-computed frame features from models trained on other tasks and typically focuses on framewise encoding and labeling without explicitly modeling action segments. To overcome these limitations, we introduce the End-to-End Action Segmentation Transformer (EAST), which processes raw video frames directly -- eliminating the need for pre-extracted features and enabling true end-to-end training. Our contributions are as follows: (1) a lightweight adapter design for effective fine-tuning of large backbones; (2) an efficient segmentation-by-detection framework for leveraging action proposals predicted over a coarsely downsampled video; and (3) a novel action-proposal-based data augmentation strategy. EAST achieves SOTA performance on standard benchmarks, including GTEA, 50Salads, Breakfast, and Assembly-101.

Related papers

From Observation to Action: Latent Action-based Primitive Segmentation for VLA Pre-training in Industrial Settings [53.09342573704396]
We present a novel framework to unlock vast unlabeled human demonstration data from continuous industrial video streams for Vision-Language-Action (VLA) model pre-training.<n>Our method first trains a lightweight motion tokenizer to encode motion dynamics, then employs an unsupervised action segmenter to discover and segment semantically coherent action primitives.<n>This is the first fully automated end-to-end system for extracting and organizing VLA pre-training data from unstructured industrial videos, offering a scalable solution for embodied AI integration in manufacturing.
arXiv Detail & Related papers (2025-11-26T14:19:44Z)
FCA-RAC: First Cycle Annotated Repetitive Action Counting [30.253568218869237]
We propose a framework called First Cycle Annotated Repetitive Action Counting (FCA-RAC) FCA-RAC contains 4 parts: 1) a labeling technique that annotates each training video with the start and end of the first action cycle, along with the total action count. This technique enables the model to capture the correlation between the initial action cycle and subsequent actions.
arXiv Detail & Related papers (2024-06-18T01:12:43Z)
Efficient Temporal Action Segmentation via Boundary-aware Query Voting [51.92693641176378]
BaFormer is a boundary-aware Transformer network that tokenizes each video segment as an instance token. BaFormer significantly reduces the computational costs, utilizing only 6% of the running time.
arXiv Detail & Related papers (2024-05-25T00:44:13Z)
BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation [34.88225099758585]
supervised action segmentation aims to partition a video into non-overlapping segments, each representing a different action. Recent works apply transformers to perform temporal modeling at the frame-level, which suffer from high computational cost. We propose an efficient BI-level Temporal modeling framework that learns explicit action tokens to represent action segments.
arXiv Detail & Related papers (2023-08-28T20:59:15Z)
Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework via denoising diffusion models, which shares the same inherent spirit of such iterative refinement. In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
arXiv Detail & Related papers (2023-03-31T10:53:24Z)
Temporal Segment Transformer for Action Segmentation [54.25103250496069]
We propose an attention based approach which we call textittemporal segment transformer, for joint segment relation modeling and denoising. The main idea is to denoise segment representations using attention between segment and frame representations, and also use inter-segment attention to capture temporal correlations between segments. We show that this novel architecture achieves state-of-the-art accuracy on the popular 50Salads, GTEA and Breakfast benchmarks.
arXiv Detail & Related papers (2023-02-25T13:05:57Z)
Transforming the Interactive Segmentation for Medical Imaging [34.57242805353604]
The goal of this paper is to interactively refine the automatic segmentation on challenging structures that fall behind human performance. We propose a novel Transformer-based architecture for Interactive (TIS) Our proposed architecture is composed of Transformer Decoder variants, which naturally fulfills feature comparison with the attention mechanisms.
arXiv Detail & Related papers (2022-08-20T03:28:23Z)
ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization [36.90693762365237]
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training. We propose system, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods. Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
arXiv Detail & Related papers (2022-03-29T01:59:26Z)
Transformers in Action:Weakly Supervised Action Segmentation [81.18941007536468]
We show how to apply transformers to improve action alignment accuracy over the equivalent RNN-based models. We also propose a supplemental transcript embedding approach to select transcripts more quickly at inference-time. We evaluate our proposed methods across the benchmark datasets to better understand the applicability of transformers.
arXiv Detail & Related papers (2022-01-14T21:15:58Z)
ASFormer: Transformer for Action Segmentation [9.509416095106493]
We present an efficient Transformer-based model for action segmentation task, named ASFormer. It constrains the hypothesis space within a reliable scope, and is beneficial for the action segmentation task to learn a proper target function with small training sets. We apply a pre-defined hierarchical representation pattern that efficiently handles long input sequences.
arXiv Detail & Related papers (2021-10-16T13:07:20Z)
End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video. Here, we construct an end-to-end framework for TAD upon Transformer, termed textitTadTR. Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z)
Target-Aware Object Discovery and Association for Unsupervised Video Multi-Object Segmentation [79.6596425920849]
This paper addresses the task of unsupervised video multi-object segmentation. We introduce a novel approach for more accurate and efficient unseen-temporal segmentation. We evaluate the proposed approach on DAVIS$_17$ and YouTube-VIS, and the results demonstrate that it outperforms state-of-the-art methods both in segmentation accuracy and inference speed.
arXiv Detail & Related papers (2021-04-10T14:39:44Z)
Video Instance Segmentation with a Propose-Reduce Paradigm [68.59137660342326]
Video instance segmentation (VIS) aims to segment and associate all instances of predefined classes for each frame in videos. Prior methods usually obtain segmentation for a frame or clip first, and then merge the incomplete results by tracking or matching. We propose a new paradigm -- Propose-Reduce, to generate complete sequences for input videos by a single step.
arXiv Detail & Related papers (2021-03-25T10:58:36Z)
End-to-End Video Instance Segmentation with Transformers [84.17794705045333]
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video. Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem. For the first time, we demonstrate a much simpler and faster video instance segmentation framework built upon Transformers, achieving competitive accuracy.
arXiv Detail & Related papers (2020-11-30T02:03:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.