Ultra-Range Gesture Recognition using a Web-Camera in Human-Robot Interaction
- URL: http://arxiv.org/abs/2311.15361v2
- Date: Wed, 10 Apr 2024 06:46:08 GMT
- Title: Ultra-Range Gesture Recognition using a Web-Camera in Human-Robot Interaction
- Authors: Eran Bamani, Eden Nissinman, Inbar Meir, Lisa Koenigsberg, Avishai Sintov
- Abstract summary: Vision-based methods for gesture recognition have been shown to be effective only up to a user-camera distance of seven meters.
We propose a novel URGR classifier termed Graph Vision Transformer (GViT), which takes the enhanced image as input.
Evaluation of the proposed framework over diverse test data yields a high recognition rate of 98.1%.
- Score: 2.240453048130742
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hand gestures play a significant role in human interactions where non-verbal intentions, thoughts and commands are conveyed. In Human-Robot Interaction (HRI), hand gestures offer a similar and efficient medium for conveying clear and rapid directives to a robotic agent. However, state-of-the-art vision-based methods for gesture recognition have been shown to be effective only up to a user-camera distance of seven meters. Such a short distance range limits practical HRI with, for example, service robots, search and rescue robots and drones. In this work, we address the Ultra-Range Gesture Recognition (URGR) problem by aiming for a recognition distance of up to 25 meters and in the context of HRI. We propose the URGR framework, a novel deep-learning framework using solely a simple RGB camera. Gesture inference is based on a single image. First, a novel super-resolution model termed High-Quality Network (HQ-Net) uses a set of self-attention and convolutional layers to enhance the low-resolution image of the user. Then, we propose a novel URGR classifier termed Graph Vision Transformer (GViT) which takes the enhanced image as input. GViT combines the benefits of a Graph Convolutional Network (GCN) and a modified Vision Transformer (ViT). Evaluation of the proposed framework over diverse test data yields a high recognition rate of 98.1%. The framework has also exhibited superior performance compared to human recognition in ultra-range distances. With the framework, we analyze and demonstrate the performance of an autonomous quadruped robot directed by human gestures in complex ultra-range indoor and outdoor environments, achieving a 96% recognition rate on average.
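The sketch below gives a minimal PyTorch illustration of this two-stage idea: a convolution-plus-self-attention enhancer standing in for HQ-Net, followed by a ViT-style classifier standing in for GViT. Layer sizes, the gesture count, and the omission of the GCN branch are simplifications and assumptions, not details taken from the paper.

```python
# Minimal sketch of the two-stage URGR pipeline described in the abstract.
# Enhancer and GestureClassifier are simplified stand-ins for HQ-Net and GViT.
import torch
import torch.nn as nn


class Enhancer(nn.Module):
    """Stand-in for HQ-Net: convolution + self-attention, then 2x upsampling."""
    def __init__(self, channels: int = 32, num_heads: int = 4):
        super().__init__()
        self.conv_in = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.upsample = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(channels, 3, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = torch.relu(self.conv_in(x))                # (B, C, H, W)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)          # (B, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)   # self-attention over pixels
        feat = attended.transpose(1, 2).reshape(b, c, h, w)
        return self.upsample(feat)                        # enhanced image


class GestureClassifier(nn.Module):
    """Stand-in for GViT: a ViT-style encoder over the enhanced image."""
    def __init__(self, num_gestures: int = 6, dim: int = 64, patch: int = 16):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        encoder_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(dim, num_gestures)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.patchify(x).flatten(2).transpose(1, 2)  # (B, N_patches, dim)
        encoded = self.encoder(tokens)
        return self.head(encoded.mean(dim=1))                 # gesture logits


# Usage: enhance a low-resolution crop of the distant user, then classify.
low_res_user = torch.randn(1, 3, 64, 64)   # dummy RGB crop
enhancer, classifier = Enhancer(), GestureClassifier()
logits = classifier(enhancer(low_res_user))
print(logits.shape)  # torch.Size([1, 6])
```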
Related papers
- Dynamic Gesture Recognition in Ultra-Range Distance for Effective Human-Robot Interaction [2.625826951636656]
This paper presents a novel approach for ultra-range gesture recognition, addressing Human-Robot Interaction (HRI) challenges over extended distances.
By leveraging human gestures in video data, we propose the Temporal-Spatiotemporal Fusion Network (TSFN) model, which overcomes the limitations of current methods.
With applications in service robots, search and rescue operations, and drone-based interactions, our approach enhances HRI in expansive environments.
arXiv Detail & Related papers (2024-07-31T06:56:46Z)
- A Diffusion-based Data Generator for Training Object Recognition Models in Ultra-Range Distance [2.240453048130742]
Training a model to recognize barely visible objects at ultra-range distances requires an exhaustive collection of labeled samples.
We propose the Diffusion in Ultra-Range (DUR) framework based on a Diffusion model to generate labeled images of distant objects in various scenes.
DUR is compared to other types of generative models showcasing superiority both in fidelity and in recognition success rate when training a URGR model.
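As a loose illustration of diffusion-based synthetic data generation (not the DUR framework itself), the sketch below uses an off-the-shelf text-to-image pipeline in which the prompt encodes the gesture label; the model name, prompts and label set are assumptions.

```python
# Hedged sketch: generating labeled images of distant gestures with an
# off-the-shelf text-to-image diffusion pipeline (not the paper's DUR model).
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

gesture_labels = ["pointing", "waving", "thumbs up"]  # hypothetical label set
for label in gesture_labels:
    prompt = f"a person making a {label} gesture, 25 meters from the camera, outdoors"
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(f"synthetic_{label.replace(' ', '_')}.png")  # filename carries the label
```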
arXiv Detail & Related papers (2024-04-15T14:55:43Z)
- EventTransAct: A video transformer-based framework for Event-camera based action recognition [52.537021302246664]
Event cameras offer new opportunities compared to standard action recognition in RGB videos.
In this study, we employ a computationally efficient model, namely the video transformer network (VTN), which initially acquires spatial embeddings per event-frame.
To better adapt the VTN to the sparse and fine-grained nature of event data, we design an Event-Contrastive Loss ($\mathcal{L}_{EC}$) and event-specific augmentations.
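The snippet below sketches a generic InfoNCE-style contrastive loss as one plausible form such an event-contrastive objective could take; the paper's exact $\mathcal{L}_{EC}$ is not reproduced here, and the positive-pair scheme is an assumption.

```python
# Hedged sketch of a generic InfoNCE-style contrastive loss (assumption: two
# embeddings of the same event clip are treated as a positive pair).
import torch
import torch.nn.functional as F


def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """anchor, positive: (B, D) embeddings of two views of the same clip."""
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    logits = anchor @ positive.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)           # diagonal pairs are positives
```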
arXiv Detail & Related papers (2023-08-25T23:51:07Z)
- Agile gesture recognition for capacitive sensing devices: adapting on-the-job [55.40855017016652]
We demonstrate a hand gesture recognition system that uses signals from capacitive sensors embedded into the etee hand controller.
The controller generates real-time signals from each of the wearer's five fingers.
We use a machine learning technique to analyse the time series signals and identify three features that can represent 5 fingers within 500 ms.
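A minimal illustration of this windowed time-series setup is sketched below; the sampling rate, the three per-channel statistics, and the classifier are assumptions rather than the paper's actual choices.

```python
# Hedged sketch: slicing five-channel capacitive finger signals into 500 ms
# windows, computing three statistics per channel, and classifying each window.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

SAMPLE_RATE_HZ = 100                      # assumed sensor rate
WINDOW = SAMPLE_RATE_HZ // 2              # 500 ms -> 50 samples per window


def window_features(signals: np.ndarray) -> np.ndarray:
    """signals: (n_samples, 5), one channel per finger. Returns per-window features."""
    n_windows = signals.shape[0] // WINDOW
    feats = []
    for i in range(n_windows):
        w = signals[i * WINDOW:(i + 1) * WINDOW]                    # (50, 5)
        feats.append(np.concatenate([w.mean(0), w.std(0), np.ptp(w, axis=0)]))
    return np.array(feats)                                          # 15 features each


# Usage with dummy data: 10 s of signal and one placeholder label per window.
raw = np.random.rand(1000, 5)
X = window_features(raw)
y = np.random.randint(0, 3, size=len(X))
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
```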
arXiv Detail & Related papers (2023-05-12T17:24:02Z)
- Online Recognition of Incomplete Gesture Data to Interface Collaborative Robots [0.0]
This paper introduces an HRI framework to classify large vocabularies of interwoven static gestures (SGs) and dynamic gestures (DGs) captured with wearable sensors.
The recognized gestures are used to teleoperate a robot in a collaborative process that consists of preparing a breakfast meal.
arXiv Detail & Related papers (2023-04-13T18:49:08Z)
- Cross Vision-RF Gait Re-identification with Low-cost RGB-D Cameras and mmWave Radars [15.662787088335618]
This work studies the problem of cross-modal human re-identification (ReID).
We propose a first-of-its-kind vision-RF system for simultaneous cross-modal multi-person ReID.
Our proposed system achieves 92.5% top-1 accuracy and 97.5% top-5 accuracy over 56 volunteers.
arXiv Detail & Related papers (2022-07-16T10:34:25Z)
- CNN-based Omnidirectional Object Detection for HermesBot Autonomous Delivery Robot with Preliminary Frame Classification [53.56290185900837]
We propose an algorithm for optimizing a neural network for object detection using preliminary binary frame classification.
An autonomous mobile robot with 6 rolling-shutter cameras on the perimeter providing a 360-degree field of view was used as the experimental setup.
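One way to read this gating idea is sketched below with off-the-shelf torchvision models: a lightweight binary classifier decides whether the expensive detector runs on a given frame. The specific networks are stand-ins, not the ones used in the paper.

```python
# Hedged sketch: gate an object detector with a cheap binary frame classifier,
# so the detector only runs on frames that likely contain an object.
import torch
from torchvision.models import mobilenet_v3_small
from torchvision.models.detection import fasterrcnn_resnet50_fpn

frame_filter = mobilenet_v3_small(num_classes=2).eval()   # object present / absent
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()


def process_frame(frame: torch.Tensor):
    """frame: (3, H, W) tensor in [0, 1]. Returns detections, or None if skipped."""
    with torch.no_grad():
        present = frame_filter(frame.unsqueeze(0)).argmax(1).item() == 1
        if not present:
            return None                       # skip the expensive detector
        return detector([frame])[0]           # dict of boxes, labels, scores
```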
arXiv Detail & Related papers (2021-10-22T15:05:37Z)
- Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by using temporal cues in videos and inherent correlations in the multi-modal data for recognizing gestures.
Results show that our approach recovers performance with large gains, up to 12.91% in accuracy and 20.16% in F1 score, without using any annotations on the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)
- Where is my hand? Deep hand segmentation for visual self-recognition in humanoid robots [129.46920552019247]
We propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask-RCNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
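The standard torchvision recipe for this kind of fine-tuning is sketched below for a two-class setup (background plus the robot hand); the dataset handling and training loop are omitted, and the hyperparameters are assumptions.

```python
# Hedged sketch: fine-tuning a torchvision Mask R-CNN for hand segmentation.
import torch
from torchvision.models.detection import maskrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # background + hand
model = maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box and mask heads so they predict only the two classes.
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)
# Training then follows the usual torchvision detection loop:
# losses = model(images, targets); sum(losses.values()).backward(); optimizer.step()
```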
arXiv Detail & Related papers (2021-02-09T10:34:32Z)
- VGAI: End-to-End Learning of Vision-Based Decentralized Controllers for Robot Swarms [237.25930757584047]
We propose to learn decentralized controllers based solely on raw visual inputs.
For the first time, this integrates the learning of two key components: communication and visual perception.
Our proposed learning framework combines a convolutional neural network (CNN) for each robot to extract messages from the visual inputs, and a graph neural network (GNN) over the entire swarm to transmit, receive and process these messages.
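A compact sketch of this CNN-plus-GNN pattern is given below using torch_geometric; the layer sizes, action dimension, and the ring communication topology in the usage example are assumptions, not the paper's configuration.

```python
# Hedged sketch: a per-robot CNN produces a message from its camera view, and a
# graph neural network propagates messages over the swarm's communication graph.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv


class SwarmController(nn.Module):
    def __init__(self, msg_dim: int = 32, action_dim: int = 2):
        super().__init__()
        self.cnn = nn.Sequential(                     # per-robot visual encoder
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, msg_dim),
        )
        self.gnn = GCNConv(msg_dim, msg_dim)          # message passing over the swarm
        self.policy = nn.Linear(msg_dim, action_dim)  # per-robot control output

    def forward(self, images: torch.Tensor, edge_index: torch.Tensor) -> torch.Tensor:
        """images: (N_robots, 3, H, W); edge_index: (2, E) communication links."""
        messages = self.cnn(images)
        shared = torch.relu(self.gnn(messages, edge_index))
        return self.policy(shared)                    # one action per robot


# Usage: 4 robots connected in a ring.
imgs = torch.randn(4, 3, 64, 64)
edges = torch.tensor([[0, 1, 2, 3], [1, 2, 3, 0]])
actions = SwarmController()(imgs, edges)
```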
arXiv Detail & Related papers (2020-02-06T15:25:23Z)