Robotic Detection of a Human-Comprehensible Gestural Language for
Underwater Multi-Human-Robot Collaboration
- URL: http://arxiv.org/abs/2207.05331v1
- Date: Tue, 12 Jul 2022 06:04:12 GMT
- Title: Robotic Detection of a Human-Comprehensible Gestural Language for
Underwater Multi-Human-Robot Collaboration
- Authors: Sadman Sakib Enan, Michael Fulton and Junaed Sattar
- Abstract summary: We present a motion-based robotic communication framework that enables non-verbal communication among autonomous underwater vehicles (AUVs) and human divers.
We design a gestural language for AUV-to-AUV communication which can be easily understood by divers observing the conversation.
To allow AUVs to visually understand a gesture from another AUV, we propose a deep network (RRCommNet) which exploits a self-attention mechanism to learn to recognize each message by extracting maximally discriminative spatio-temporal features.
- Score: 16.823029377470363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a motion-based robotic communication framework that
enables non-verbal communication among autonomous underwater vehicles (AUVs)
and human divers. We design a gestural language for AUV-to-AUV communication
which can be easily understood by divers observing the conversation, unlike
typical radio frequency, light, or audio-based AUV communication. To allow AUVs
to visually understand a gesture from another AUV, we propose a deep network
(RRCommNet) which exploits a self-attention mechanism to learn to recognize
each message by extracting maximally discriminative spatio-temporal features.
We train this network on diverse simulated and real-world data. Our
experimental evaluations, both in simulation and in closed-water robot trials,
demonstrate that the proposed RRCommNet architecture is able to decipher
gesture-based messages with an average accuracy of 88-94% on simulated data
and 73-83% on real data, depending on the version of the model used. Further, by
performing a message transcription study with human participants, we also show
that the proposed language can be understood by humans, with an overall
transcription accuracy of 88%. Finally, we discuss the inference runtime of
RRCommNet on embedded GPU hardware, for real-time use on board AUVs in the
field.
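The abstract gives no implementation details, so the snippet below is only a rough, hypothetical sketch of the kind of model it describes: per-frame spatial features followed by self-attention over time to classify a gesture-based message. The class name, layer sizes, clip resolution, and the ten-message vocabulary are assumptions made for illustration, not details taken from RRCommNet.
```python
# Hypothetical sketch of a self-attention gesture-message classifier in the
# spirit of RRCommNet. All layer sizes and names are illustrative assumptions.
import torch
import torch.nn as nn


class GestureMessageClassifier(nn.Module):
    """Classify a short video clip of AUV motion into one of N messages."""

    def __init__(self, num_messages: int = 10, d_model: int = 256,
                 num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        # Per-frame spatial feature extractor (a small CNN stand-in).
        self.frame_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Self-attention over the temporal sequence of frame features.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, batch_first=True)
        self.temporal_attention = nn.TransformerEncoder(
            encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_messages)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, channels, height, width)
        b, t, c, h, w = clip.shape
        feats = self.frame_encoder(clip.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, -1)              # (batch, time, d_model)
        attended = self.temporal_attention(feats)    # spatio-temporal features
        return self.classifier(attended.mean(dim=1))  # message logits


# Example: a batch of two 16-frame RGB clips at 64x64 resolution.
model = GestureMessageClassifier()
logits = model(torch.randn(2, 16, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 10])
```
In practice, a model of this shape would be trained on labeled clips of AUV motion and timed on embedded GPU hardware to judge real-time feasibility, in line with the runtime discussion mentioned in the abstract.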
Related papers
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI).
The experimental results demonstrate that MPI exhibits remarkable improvement by 10% to 64% compared with previous state-of-the-art in real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- NatSGD: A Dataset with Speech, Gestures, and Demonstrations for Robot Learning in Natural Human-Robot Interaction [19.65778558341053]
Speech-gesture HRI datasets often focus on elementary tasks, like object pointing and pushing.
We introduce NatSGD, a multimodal HRI dataset encompassing human commands through speech and gestures.
We demonstrate its effectiveness in training robots to understand tasks through multimodal human commands.
arXiv Detail & Related papers (2024-03-04T18:02:41Z)
- MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World [55.878173953175356]
We propose MultiPLY, a multisensory embodied large language model.
We first collect Multisensory Universe, a large-scale multisensory interaction dataset comprising 500k data samples.
We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks.
arXiv Detail & Related papers (2024-01-16T18:59:45Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- UnLoc: A Universal Localization Method for Autonomous Vehicles using LiDAR, Radar and/or Camera Input [51.150605800173366]
UnLoc is a novel unified neural modeling approach for localization with multi-sensor input in all weather conditions.
Our method is extensively evaluated on Oxford Radar RobotCar, ApolloSouthBay and Perth-WA datasets.
arXiv Detail & Related papers (2023-07-03T04:10:55Z)
- Visual Detection of Diver Attentiveness for Underwater Human-Robot Interaction [15.64806176508126]
We present a diver attention estimation framework for autonomous underwater vehicles (AUVs).
The core element of the framework is a deep neural network (called DATT-Net) which exploits the geometric relation among 10 facial keypoints of the divers to determine their head orientation.
Our experiments demonstrate that the proposed DATT-Net architecture can determine the attentiveness of human divers with promising accuracy.
arXiv Detail & Related papers (2022-09-28T22:08:41Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Few-Shot Visual Grounding for Natural Human-Robot Interaction [0.0]
We propose a software architecture that segments a target object from a crowded scene, indicated verbally by a human user.
At the core of our system, we employ a multi-modal deep neural network for visual grounding.
We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets.
arXiv Detail & Related papers (2021-03-17T15:24:02Z)
- Decoding EEG Brain Activity for Multi-Modal Natural Language Processing [9.35961671939495]
We present the first large-scale study of systematically analyzing the potential of EEG brain activity data for improving natural language processing tasks.
We find that filtering the EEG signals into frequency bands is more beneficial than using the broadband signal.
For a range of word embedding types, EEG data improves binary and ternary sentiment classification and outperforms multiple baselines.
arXiv Detail & Related papers (2021-02-17T09:44:21Z)