Learning to Recognize Correctly Completed Procedure Steps in Egocentric Assembly Videos through Spatio-Temporal Modeling
- URL: http://arxiv.org/abs/2510.12385v1
- Date: Tue, 14 Oct 2025 11:03:30 GMT
- Title: Learning to Recognize Correctly Completed Procedure Steps in Egocentric Assembly Videos through Spatio-Temporal Modeling
- Authors: Tim J. Schoonbeek, Shao-Hsuan Hung, Dan Lehman, Hans Onvlee, Jacek Kustra, Peter H. N. de With, Fons van der Sommen,
- Abstract summary: We propose a dual-stream framework for procedure step recognition (PSR). STORM-PSR is evaluated on the MECCANO and IndustReal datasets, reducing the average delay between actual and predicted assembly step completions by 11.2% and 26.1%, respectively.
- Score: 11.720299001244976
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Procedure step recognition (PSR) aims to identify all correctly completed steps and their sequential order in videos of procedural tasks. The existing state-of-the-art models rely solely on detecting assembly object states in individual video frames. By neglecting temporal features, model robustness and accuracy are limited, especially when objects are partially occluded. To overcome these limitations, we propose Spatio-Temporal Occlusion-Resilient Modeling for Procedure Step Recognition (STORM-PSR), a dual-stream framework for PSR that leverages both spatial and temporal features. The assembly state detection stream operates effectively with unobstructed views of the object, while the spatio-temporal stream captures both spatial and temporal features to recognize step completions even under partial occlusion. This stream includes a spatial encoder, pre-trained using a novel weakly supervised approach to capture meaningful spatial representations, and a transformer-based temporal encoder that learns how these spatial features relate over time. STORM-PSR is evaluated on the MECCANO and IndustReal datasets, reducing the average delay between actual and predicted assembly step completions by 11.2% and 26.1%, respectively, compared to prior methods. We demonstrate that this reduction in delay is driven by the spatio-temporal stream, which does not rely on unobstructed views of the object to infer completed steps. The code for STORM-PSR, along with the newly annotated MECCANO labels, is made publicly available at https://timschoonbeek.github.io/stormpsr .
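The core fusion idea described in the abstract, that the spatio-temporal stream can declare a step complete while the object is occluded and the assembly state detector is silent, can be illustrated with a minimal sketch. All names, thresholds, and the either-stream fusion rule below are illustrative assumptions, not the authors' actual implementation:

```python
# Hypothetical sketch of dual-stream step-completion fusion in the spirit of
# STORM-PSR. The threshold values and the "either stream fires" rule are
# assumptions for illustration, not taken from the paper.

def fuse_streams(asd_scores, st_scores, asd_thresh=0.8, st_thresh=0.9):
    """Return the first frame index at which a step is deemed complete.

    asd_scores: per-frame confidences from the assembly state detection
                stream (reliable only with an unobstructed view).
    st_scores:  per-frame confidences from the spatio-temporal stream
                (can fire under partial occlusion).
    A step is declared complete as soon as either stream exceeds its
    threshold; returns None if neither ever does.
    """
    for t, (asd, st) in enumerate(zip(asd_scores, st_scores)):
        if asd >= asd_thresh or st >= st_thresh:
            return t
    return None


def completion_delay(predicted_frame, ground_truth_frame, fps=30.0):
    """Signed delay in seconds between predicted and actual completion."""
    return (predicted_frame - ground_truth_frame) / fps


if __name__ == "__main__":
    # The object is occluded until frame 4, so the state detector stays low;
    # the spatio-temporal stream crosses its threshold at frame 2.
    asd = [0.1, 0.2, 0.1, 0.1, 0.95]
    st = [0.2, 0.5, 0.92, 0.95, 0.97]
    print(fuse_streams(asd, st))          # fused prediction: frame 2
    print(fuse_streams(asd, [0.0] * 5))   # state-detector-only: frame 4
```

Under this toy fusion rule, the temporal stream advances the predicted completion by two frames relative to the detection-only baseline, which is the mechanism the paper credits for its reduced completion delay.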
Related papers
- PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching [51.98089287914147]
Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic stereo matching, dubbed PPMStereo.
arXiv Detail & Related papers (2025-10-23T03:52:39Z) - UST-SSM: Unified Spatio-Temporal State Space Models for Point Cloud Video Modeling [53.199942923818206]
Point cloud videos capture 3D motion while reducing the effects of lighting and viewpoint variations, making them highly effective for recognizing subtle and continuous human actions. Selective State Space Models (SSMs) have shown good performance in sequence modeling with linear complexity. We propose the Unified Spatio-Temporal State Space Model (UST-SSM), which extends the latest advancements in SSMs to point cloud videos.
arXiv Detail & Related papers (2025-08-20T10:46:01Z) - TSdetector: Temporal-Spatial Self-correction Collaborative Learning for Colonoscopy Video Detection [19.00902297385955]
We propose a novel Temporal-Spatial self-correction detector (TSdetector), which integrates temporal-level consistency learning and spatial-level reliability learning to detect objects continuously.
The experimental results on three publicly available polyp video datasets show that TSdetector achieves the highest polyp detection rate and outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-09-30T06:19:29Z) - CMR-Agent: Learning a Cross-Modal Agent for Iterative Image-to-Point Cloud Registration [2.400446821380503]
Image-to-point cloud registration aims to determine the relative camera pose of an RGB image with respect to a point cloud.
Most learning-based methods establish 2D-3D point correspondences in feature space without any feedback mechanism for iterative optimization.
We propose to reformulate the registration procedure as an iterative Markov decision process, allowing for incremental adjustments to the camera pose.
arXiv Detail & Related papers (2024-08-05T11:40:59Z) - Revisiting the Encoding of Satellite Image Time Series [2.5874041837241304]
Satellite Image Time Series (SITS) temporal learning is complex due to high spatio-temporal resolutions and irregular acquisition times.
We develop a novel perspective of SITS processing as a direct set prediction problem, inspired by the recent trend in adopting query-based transformer decoders.
We attain new state-of-the-art (SOTA) results on the PASTIS satellite benchmark dataset.
arXiv Detail & Related papers (2023-05-03T12:44:20Z) - Implicit Temporal Modeling with Learnable Alignment for Video Recognition [95.82093301212964]
We propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance.
ILA achieves a top-1 accuracy of 88.7% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H.
arXiv Detail & Related papers (2023-04-20T17:11:01Z) - Motion-inductive Self-supervised Object Discovery in Videos [99.35664705038728]
We propose a model for processing consecutive RGB frames, and infer the optical flow between any pair of frames using a layered representation.
We demonstrate superior performance over previous state-of-the-art methods on three public video segmentation datasets.
arXiv Detail & Related papers (2022-10-01T08:38:28Z) - Motion-aware Memory Network for Fast Video Salient Object Detection [15.967509480432266]
We design a space-time memory (STM)-based network, which extracts useful temporal information of the current frame from adjacent frames as the temporal branch of VSOD.
In the encoding stage, we generate high-level temporal features by using high-level features from the current and its adjacent frames.
In the decoding stage, we propose an effective fusion strategy for spatial and temporal branches.
The proposed model does not require optical flow or other preprocessing, and can reach a speed of nearly 100 FPS during inference.
arXiv Detail & Related papers (2022-08-01T15:56:19Z) - RCL: Recurrent Continuous Localization for Temporal Action Detection [33.05225035315206]
State-of-the-art methods mostly rely on a dense anchoring scheme, where anchors are sampled uniformly over the temporal domain with a discretized grid.
In this paper, we introduce Recurrent Continuous Localization (RCL), which learns a fully continuous anchoring representation.
RCL achieves 52.92% mAP@0.5 on THUMOS14 and 37.65% mAP on ActivityNet v1.3, outperforming all existing single-model detectors.
arXiv Detail & Related papers (2022-03-14T13:56:12Z) - ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer [2.4366811507669124]
We propose a Spatio-Temporal Transformer-based architecture for object detection from a sequence of temporal frames.
We employ the full attention mechanisms to take advantage of the features correlations over both dimensions.
Results show a significant 5% mAP improvement on the KITTI MOD dataset.
arXiv Detail & Related papers (2021-07-13T07:38:08Z) - Temporal Context Aggregation Network for Temporal Action Proposal Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z) - A Prospective Study on Sequence-Driven Temporal Sampling and Ego-Motion Compensation for Action Recognition in the EPIC-Kitchens Dataset [68.8204255655161]
Action recognition is one of the top-challenging research fields in computer vision.
Sequences recorded under ego-motion have become particularly relevant.
The proposed method aims to cope with it by estimating this ego-motion or camera motion.
arXiv Detail & Related papers (2020-08-26T14:44:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.