Signals vs. Videos: Advancing Motion Intention Recognition for Human-Robot Collaboration in Construction
- URL: http://arxiv.org/abs/2509.07990v1
- Date: Mon, 25 Aug 2025 14:58:07 GMT
- Title: Signals vs. Videos: Advancing Motion Intention Recognition for Human-Robot Collaboration in Construction
- Authors: Charan Gajjala Chenchu, Kinam Kim, Gao Lu, Zia Ud Din,
- Abstract summary: The study leverages deep learning to assess two different modalities in recognizing workers' motion intention at the early stage of movement in drywall installation tasks.<n>The Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM) model utilizing surface electromyography (sEMG) data achieved an accuracy of around 87%.<n>The pre-trained Video Swin Transformer combined with transfer learning harnessed video sequences as input to recognize motion intention and attained an accuracy of 94%.
- Score: 1.108292291257035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human-robot collaboration (HRC) in the construction industry depends on precise and prompt recognition of human motion intentions and actions by robots to maximize safety and workflow efficiency. There is a research gap in comparing data modalities, specifically signals and videos, for motion intention recognition. To address this, the study leverages deep learning to assess two different modalities in recognizing workers' motion intention at the early stage of movement in drywall installation tasks. The Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM) model utilizing surface electromyography (sEMG) data achieved an accuracy of around 87% with an average time of 0.04 seconds to perform prediction on a sample input. Meanwhile, the pre-trained Video Swin Transformer combined with transfer learning harnessed video sequences as input to recognize motion intention and attained an accuracy of 94% but with a longer average time of 0.15 seconds for a similar prediction. This study emphasizes the unique strengths and trade-offs of both data formats, directing their systematic deployments to enhance HRC in real-world construction projects.
Related papers
- Robotic VLA Benefits from Joint Learning with Motion Image Diffusion [114.60268819583017]
Vision-Language-Action (VLA) models have achieved remarkable progress in robotic manipulation by mapping multimodal observations and instructions directly to actions.<n>We propose joint learning with motion image diffusion, a novel strategy that enhances VLA models with motion reasoning capabilities.<n> Experiments in both simulation and real-world environments demonstrate that joint learning with motion image diffusion improves the success rate of pi-series VLAs to 97.5%.
arXiv Detail & Related papers (2025-12-19T19:07:53Z) - Physical Autoregressive Model for Robotic Manipulation without Action Pretraining [65.8971623698511]
We build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR)<n>PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining.<n>Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task.
arXiv Detail & Related papers (2025-08-13T13:54:51Z) - VidMan: Exploiting Implicit Dynamics from Video Diffusion Model for Effective Robot Manipulation [79.00294932026266]
VidMan is a novel framework that employs a two-stage training mechanism to enhance stability and improve data utilization efficiency.
Our framework outperforms state-of-the-art baseline model GR-1 on the CALVIN benchmark, achieving a 11.7% relative improvement, and demonstrates over 9% precision gains on the OXE small-scale dataset.
arXiv Detail & Related papers (2024-11-14T03:13:26Z) - Benchmarking Adaptive Intelligence and Computer Vision on Human-Robot Collaboration [0.0]
Human-Robot Collaboration (HRC) is vital in Industry 4.0, using sensors, digital twins, collaborative robots (cobots) and intention-recognition models to have efficient manufacturing processes.
We address concept drift by integrating Adaptive Intelligence and self-labeling to improve the resilience of intention-recognition in an HRC system.
arXiv Detail & Related papers (2024-09-30T01:25:48Z) - AdvMT: Adversarial Motion Transformer for Long-term Human Motion
Prediction [2.837740438355204]
We present the Adversarial Motion Transformer (AdvMT), a novel model that integrates a transformer-based motion encoder and a temporal continuity discriminator.
With adversarial training, our method effectively reduces the unwanted artifacts in predictions, thereby ensuring the learning of more realistic and fluid human motions.
arXiv Detail & Related papers (2024-01-10T09:15:50Z) - Learning Human Action Recognition Representations Without Real Humans [66.61527869763819]
We present a benchmark that leverages real-world videos with humans removed and synthetic data containing virtual humans to pre-train a model.
We then evaluate the transferability of the representation learned on this data to a diverse set of downstream action recognition benchmarks.
Our approach outperforms previous baselines by up to 5%.
arXiv Detail & Related papers (2023-11-10T18:38:14Z) - Robust Activity Recognition for Adaptive Worker-Robot Interaction using
Transfer Learning [0.0]
This paper proposes a transfer learning methodology for activity recognition of construction workers.
The developed algorithm transfers features from a model pre-trained by the original authors and fine-tunes them for the downstream task of activity recognition.
Results indicate that the fine-tuned model can recognize distinct MMH tasks in a robust and adaptive manner.
arXiv Detail & Related papers (2023-08-28T19:03:46Z) - Differentiable Frequency-based Disentanglement for Aerial Video Action
Recognition [56.91538445510214]
We present a learning algorithm for human activity recognition in videos.
Our approach is designed for UAV videos, which are mainly acquired from obliquely placed dynamic cameras.
We conduct extensive experiments on the UAV Human dataset and the NEC Drone dataset.
arXiv Detail & Related papers (2022-09-15T22:16:52Z) - Towards Domain-Independent and Real-Time Gesture Recognition Using
mmWave Signal [11.76969975145963]
DI-Gesture is a domain-independent and real-time mmWave gesture recognition system.
In real-time scenario, the accuracy of DI-Gesutre reaches over 97% with average inference time of 2.87ms.
arXiv Detail & Related papers (2021-11-11T13:28:28Z) - Domain Adaptive Robotic Gesture Recognition with Unsupervised
Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by using temporal cues in videos, and inherent correlations in multi-modal towards recognizing gesture.
Results show that our approach recovers the performance with great improvement gains, up to 12.91% in ACC and 20.16% in F1score without using any annotations in real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z) - Where is my hand? Deep hand segmentation for visual self-recognition in
humanoid robots [129.46920552019247]
We propose the use of a Convolution Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask-RCNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
arXiv Detail & Related papers (2021-02-09T10:34:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.