Dual-Stream Alignment for Action Segmentation
- URL: http://arxiv.org/abs/2510.07652v1
- Date: Thu, 09 Oct 2025 00:59:17 GMT
- Title: Dual-Stream Alignment for Action Segmentation
- Authors: Harshala Gammulle, Clinton Fookes, Sridha Sridharan, Simon Denman
- Abstract summary: Action segmentation involves identifying when and where specific actions occur in continuous video streams. Recent research has shifted toward two-stream methods that learn action-wise features to enhance action segmentation performance. We propose the Dual-Stream Alignment Network (DSA Net) and investigate the impact of a second stream of learned action features to guide segmentation.
- Score: 37.24437077331131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Action segmentation is a challenging yet active research area that involves identifying when and where specific actions occur in continuous video streams. Most existing work has focused on single-stream approaches that model the spatio-temporal aspects of frame sequences. However, recent research has shifted toward two-stream methods that learn action-wise features to enhance action segmentation performance. In this work, we propose the Dual-Stream Alignment Network (DSA Net) and investigate the impact of incorporating a second stream of learned action features to guide segmentation by capturing both action and action-transition cues. Communication between the two streams is facilitated by a Temporal Context (TC) block, which fuses complementary information using cross-attention and Quantum-based Action-Guided Modulation (Q-ActGM), enhancing the expressive power of the fused features. To the best of our knowledge, this is the first study to introduce a hybrid quantum-classical machine learning framework for action segmentation. Our primary objective is for the two streams (frame-wise and action-wise) to learn a shared feature space through feature alignment. This is encouraged by the proposed Dual-Stream Alignment Loss, which comprises three components: relational consistency, cross-level contrastive, and cycle-consistency reconstruction losses. Following prior work, we evaluate DSA Net on several diverse benchmark datasets: GTEA, Breakfast, 50Salads, and EgoProcel. We further demonstrate the effectiveness of each component through extensive ablation studies. Notably, DSA Net achieves state-of-the-art performance, significantly outperforming existing approaches.
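The abstract names the three components of the Dual-Stream Alignment Loss but not their exact formulations. A minimal sketch, using common formulations for each named term (pairwise-similarity matching for relational consistency, an InfoNCE-style objective for the cross-level contrastive term, and a round-trip reconstruction penalty for cycle consistency); the weighting, projection matrices `W_fa`/`W_af`, and temperature are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

def relational_consistency(F, A):
    """Match the pairwise-similarity structure of the two streams.

    F: frame-wise features (T, d); A: action-wise features (T, d).
    One common formulation; the paper's exact form is not given here.
    """
    return np.mean((F @ F.T - A @ A.T) ** 2)

def cross_level_contrastive(F, A, tau=0.1):
    """InfoNCE-style loss pulling matching frame/action features together."""
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    logits = F @ A.T / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))  # diagonal pairs are the positives

def cycle_reconstruction(F, A, W_fa, W_af):
    """Map frame features to the action space and back; penalise drift."""
    F_cycle = (F @ W_fa) @ W_af
    return np.mean((F - F_cycle) ** 2)

def dual_stream_alignment_loss(F, A, W_fa, W_af, lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three alignment terms (weights are assumptions)."""
    l1, l2, l3 = lambdas
    return (l1 * relational_consistency(F, A)
            + l2 * cross_level_contrastive(F, A)
            + l3 * cycle_reconstruction(F, A, W_fa, W_af))
```

In this sketch, the relational and cycle terms vanish when the two streams already agree, so the gradient pressure comes from whichever term still detects a mismatch between the frame-wise and action-wise feature spaces.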
Related papers
- Reframing Dense Action Detection (RefDense): A Paradigm Shift in Problem Solving & a Novel Optimization Strategy [23.100602876056165]
We argue that handling the dual challenge of temporal and class overlaps is too complex to be tackled by a single network. We propose to decompose the task of detecting dense ambiguous actions into detecting dense unambiguous sub-concepts. Our experiments demonstrate the superiority of our approach over state-of-the-art methods.
arXiv Detail & Related papers (2025-01-30T17:20:42Z) - ActFusion: a Unified Diffusion Model for Action Segmentation and Anticipation [66.8640112000444]
Temporal action segmentation and long-term action anticipation are popular vision tasks for the temporal analysis of actions in videos. We tackle these two problems, action segmentation and action anticipation, jointly using a unified diffusion model dubbed ActFusion. We introduce a new anticipative masking strategy during training in which a late part of the video frames is masked as invisible, and learnable tokens replace these frames to learn to predict the invisible future.
arXiv Detail & Related papers (2024-12-05T17:12:35Z) - Learning to Refactor Action and Co-occurrence Features for Temporal Action Localization [74.74339878286935]
Action features and co-occurrence features often dominate the actual action content in videos.
We develop a novel auxiliary task by decoupling these two types of features within a video snippet.
We term our method RefactorNet, which first explicitly factorizes the action content and regularizes its co-occurrence features.
arXiv Detail & Related papers (2022-06-23T06:30:08Z) - ACGNet: Action Complement Graph Network for Weakly-supervised Temporal Action Localization [39.377289930528555]
Weakly-supervised temporal action localization (WTAL) in untrimmed videos has emerged as a practical but challenging task since only video-level labels are available.
Existing approaches typically leverage off-the-shelf segment-level features, which suffer from spatial incompleteness and temporal incoherence.
In this paper, we tackle this problem by enhancing segment-level representations with a simple yet effective graph convolutional network.
arXiv Detail & Related papers (2021-12-21T04:18:44Z) - Graph Convolutional Module for Temporal Action Localization in Videos [142.5947904572949]
We claim that the relations between action units play an important role in action localization.
A more powerful action detector should not only capture the local content of each action unit but also allow a wider field of view on the context related to it.
We propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods.
arXiv Detail & Related papers (2021-12-01T06:36:59Z) - Weakly-Supervised Spatio-Temporal Anomaly Detection in Surveillance Video [128.41392860714635]
We introduce Weakly-Supervised Spatio-Temporal Anomaly Detection (WSSTAD) in surveillance video.
WSSTAD aims to localize a spatio-temporal tube (i.e., a sequence of bounding boxes at consecutive times) that encloses an abnormal event.
We propose a dual-branch network that takes as input proposals of multiple granularities in both the spatial and temporal domains.
arXiv Detail & Related papers (2021-08-09T06:11:14Z) - Learning Comprehensive Motion Representation for Action Recognition [124.65403098534266]
2D CNN-based methods are efficient but may yield redundant features due to applying the same 2D convolution kernel to each frame.
Recent efforts attempt to capture motion information by establishing inter-frame connections, but still suffer from a limited temporal receptive field or high latency.
We propose a Channel-wise Motion Enhancement (CME) module to adaptively emphasize the channels related to dynamic information with a channel-wise gate vector.
We also propose a Spatial-wise Motion Enhancement (SME) module to focus on the regions with the critical target in motion, according to the point-to-point similarity between adjacent feature maps.
arXiv Detail & Related papers (2021-03-23T03:06:26Z) - Two-Stream AMTnet for Action Detection [12.581710073789848]
We propose a new deep neural network architecture for online action detection, termed Two-Stream AMTnet, which adds a motion stream to the original appearance one in AMTnet.
Two-Stream AMTnet exhibits superior action detection performance over state-of-the-art approaches on the standard action detection benchmarks.
arXiv Detail & Related papers (2020-04-03T12:16:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.