Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion
- URL: http://arxiv.org/abs/2602.23694v2
- Date: Thu, 05 Mar 2026 08:41:00 GMT
- Title: Interpretable Multimodal Gesture Recognition for Drone and Mobile Robot Teleoperation via Log-Likelihood Ratio Fusion
- Authors: Seungyeol Baek, Jaspreet Singh, Lala Shakti Swarup Ray, Hymalai Bello, Paul Lukowicz, Sungho Suh
- Abstract summary: Vision-based gesture recognition has been explored as one method for hands-free teleoperation. We propose a multimodal gesture recognition framework that integrates inertial data from Apple Watches on both wrists with capacitive sensing signals from custom gloves. We show that our framework achieves performance comparable to a state-of-the-art vision-based baseline.
- Score: 14.332919759770645
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human operators are still frequently exposed to hazardous environments such as disaster zones and industrial facilities, where intuitive and reliable teleoperation of mobile robots and Unmanned Aerial Vehicles (UAVs) is essential. In this context, hands-free teleoperation enhances operator mobility and situational awareness, thereby improving safety in hazardous environments. While vision-based gesture recognition has been explored as one method for hands-free teleoperation, its performance often deteriorates under occlusions, lighting variations, and cluttered backgrounds, limiting its applicability in real-world operations. To overcome these limitations, we propose a multimodal gesture recognition framework that integrates inertial data (accelerometer, gyroscope, and orientation) from Apple Watches on both wrists with capacitive sensing signals from custom gloves. We design a late fusion strategy based on the log-likelihood ratio (LLR), which not only enhances recognition performance but also provides interpretability by quantifying modality-specific contributions. To support this research, we introduce a new dataset of 20 distinct gestures inspired by aircraft marshalling signals, comprising synchronized RGB video, IMU, and capacitive sensor data. Experimental results demonstrate that our framework achieves performance comparable to a state-of-the-art vision-based baseline while significantly reducing computational cost, model size, and training time, making it well suited for real-time robot control. We therefore underscore the potential of sensor-based multimodal fusion as a robust and interpretable solution for gesture-driven mobile robot and drone teleoperation.
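The abstract describes the fusion rule only at a high level. As a rough illustration, the sketch below shows one common way an LLR-based late fusion over per-modality classifier posteriors can be set up, assuming uniform class priors and conditionally independent modalities; the function names, the one-vs-rest LLR form, and the toy inputs are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def modality_llr(posteriors: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Per-class log-likelihood ratio for one modality.

    `posteriors` holds a classifier's class probabilities p(c | x_m).
    Assuming a uniform class prior over K classes, a one-vs-rest LLR is
        log p(x_m | c) - log p(x_m | not c)
      = log p(c | x_m) - log[(1 - p(c | x_m)) / (K - 1)].
    """
    k = posteriors.shape[-1]
    p = np.clip(posteriors, eps, 1.0 - eps)
    return np.log(p) - np.log((1.0 - p) / (k - 1))

def llr_late_fusion(posteriors_per_modality):
    """Sum per-modality LLRs (conditional-independence assumption).

    Returns both the fused scores and the per-modality LLR terms; keeping
    the per-modality terms is what makes the fused decision inspectable.
    """
    contributions = [modality_llr(p) for p in posteriors_per_modality]
    fused = np.sum(contributions, axis=0)
    return fused, contributions

# Toy example: 20 gesture classes, two modalities (IMU, capacitive glove).
rng = np.random.default_rng(0)
imu_post = rng.dirichlet(np.ones(20))   # stand-in for the IMU classifier output
cap_post = rng.dirichlet(np.ones(20))   # stand-in for the capacitive classifier output

fused, contribs = llr_late_fusion([imu_post, cap_post])
pred = int(np.argmax(fused))
print("predicted gesture:", pred)
print("IMU vs capacitive LLR contribution:", contribs[0][pred], contribs[1][pred])
```

Reading off the per-modality LLR terms for the winning class is one way to quantify how much each sensor stream contributed to a given decision, which is the kind of interpretability the abstract attributes to the fusion scheme.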
Related papers
- Real-Time Human-Robot Interaction Intent Detection Using RGB-based Pose and Emotion Cues with Cross-Camera Model Generalization [0.8839687029212673]
Service robots in public spaces require real-time understanding of human behavioral intentions for natural interaction. We present a framework for frame-accurate human-robot interaction intent detection that fuses camera-invariant 2D skeletal pose and facial emotion features extracted from monocular RGB video.
arXiv Detail & Related papers (2025-12-18T08:44:22Z) - End-to-End Dexterous Arm-Hand VLA Policies via Shared Autonomy: VR Teleoperation Augmented by Autonomous Hand VLA Policy for Efficient Data Collection [10.217810309422232]
We propose a framework that divides control between macro and micro motions. A human operator guides the robot's arm pose through intuitive VR teleoperation. An autonomous DexGrasp-VLA policy handles fine-grained hand control using real-time tactile and visual feedback.
arXiv Detail & Related papers (2025-10-31T16:12:02Z) - OS-Sentinel: Towards Safety-Enhanced Mobile GUI Agents via Hybrid Validation in Realistic Workflows [77.95511352806261]
Computer-using agents powered by Vision-Language Models (VLMs) have demonstrated human-like capabilities in operating digital environments like mobile platforms. We propose OS-Sentinel, a novel hybrid safety detection framework that combines a Formal Verifier for detecting explicit system-level violations with a Contextual Judge for assessing contextual risks and agent actions.
arXiv Detail & Related papers (2025-10-28T13:22:39Z) - A Vision-Based Shared-Control Teleoperation Scheme for Controlling the Robotic Arm of a Four-Legged Robot [0.9699673328328621]
This work proposes an intuitive remote control by leveraging a vision-based pose estimation pipeline. The system maps the operator's wrist movements into robotic arm commands to control the robot's arm in real time. A trajectory planner ensures safe teleoperation by detecting and preventing collisions with both obstacles and the robotic arm itself.
arXiv Detail & Related papers (2025-08-20T18:31:57Z) - Recognizing Actions from Robotic View for Natural Human-Robot Interaction [52.00935005918032]
Natural Human-Robot Interaction (N-HRI) requires robots to recognize human actions at varying distances and states, regardless of whether the robot itself is in motion or stationary. Existing benchmarks fail to address the unique complexities of N-HRI due to limited data, modalities, task categories, and diversity of subjects and environments. We introduce a large-scale dataset (Action from Robotic View) for the perception-centric robotic views prevalent in mobile service robots.
arXiv Detail & Related papers (2025-07-30T09:48:34Z) - DiG-Net: Enhancing Quality of Life through Hyper-Range Dynamic Gesture Recognition in Assistive Robotics [2.625826951636656]
We introduce a novel approach designed specifically for assistive robotics, enabling dynamic gesture recognition at extended distances of up to 30 meters. Our proposed Distance-aware Gesture Network (DiG-Net) effectively combines Depth-Conditioned Deformable Alignment (DADA) blocks with Spatio-Temporal Graph modules. By effectively interpreting gestures from considerable distances, DiG-Net significantly enhances the usability of assistive robots in home healthcare, industrial safety, and remote assistance scenarios.
arXiv Detail & Related papers (2025-05-30T16:47:44Z) - Radar-Based Recognition of Static Hand Gestures in American Sign Language [17.021656590925005]
This study explores the efficacy of synthetic data generated by an advanced radar ray-tracing simulator.
The simulator employs an intuitive material model that can be adjusted to introduce data diversity.
Despite being trained exclusively on synthetic data, the network demonstrates promising performance when evaluated on real measurement data.
arXiv Detail & Related papers (2024-02-20T08:19:30Z) - Agile gesture recognition for capacitive sensing devices: adapting on-the-job [55.40855017016652]
We demonstrate a hand gesture recognition system that uses signals from capacitive sensors embedded into the etee hand controller.
The controller generates real-time signals from each of the wearer's five fingers.
We use a machine learning technique to analyse the time series signals and identify three features that can represent the five fingers within 500 ms.
arXiv Detail & Related papers (2023-05-12T17:24:02Z) - Bayesian Imitation Learning for End-to-End Mobile Manipulation [80.47771322489422]
Augmenting policies with additional sensor inputs, such as RGB + depth cameras, is a straightforward approach to improving robot perception capabilities.
We show that using the Variational Information Bottleneck to regularize convolutional neural networks improves generalization to held-out domains.
We demonstrate that our method is able to help close the sim-to-real gap and successfully fuse RGB and depth modalities.
arXiv Detail & Related papers (2022-02-15T17:38:30Z) - Driving-Signal Aware Full-Body Avatars [49.89791440532946]
We present a learning-based method for building driving-signal aware full-body avatars.
Our model is a conditional variational autoencoder that can be animated with incomplete driving signals.
We demonstrate the efficacy of our approach on the challenging problem of full-body animation for virtual telepresence.
arXiv Detail & Related papers (2021-05-21T16:22:38Z) - Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by exploiting temporal cues in videos and the inherent correlations across modalities for gesture recognition.
Results show that our approach recovers the performance with large improvement gains, up to 12.91% in accuracy and 20.16% in F1 score, without using any annotations on the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)