NatSGD: A Dataset with Speech, Gestures, and Demonstrations for Robot
Learning in Natural Human-Robot Interaction
- URL: http://arxiv.org/abs/2403.02274v1
- Date: Mon, 4 Mar 2024 18:02:41 GMT
- Title: NatSGD: A Dataset with Speech, Gestures, and Demonstrations for Robot
Learning in Natural Human-Robot Interaction
- Authors: Snehesh Shrestha, Yantian Zha, Saketh Banagiri, Ge Gao, Yiannis
Aloimonos, Cornelia Fermuller
- Abstract summary: Speech-gesture HRI datasets often focus on elementary tasks, like object pointing and pushing.
We introduce NatSGD, a multimodal HRI dataset encompassing human commands through speech and gestures.
We demonstrate its effectiveness in training robots to understand tasks through multimodal human commands.
- Score: 19.65778558341053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in multimodal Human-Robot Interaction (HRI)
datasets have highlighted the fusion of speech and gesture, expanding robots'
capacity to absorb explicit and implicit HRI insights. However, existing
speech-gesture HRI datasets often focus on elementary tasks, like object
pointing and pushing; they scale poorly to intricate domains and prioritize
human command data over records of robot behavior. To bridge these gaps, we
introduce NatSGD, a multimodal HRI dataset of natural human commands,
delivered through speech and gestures, synchronized with robot behavior
demonstrations.
NatSGD serves as a foundational resource at the intersection of machine
learning and HRI research, and we demonstrate its effectiveness in training
robots to understand tasks through multimodal human commands, emphasizing the
significance of jointly considering speech and gestures. We have released our
dataset, simulator, and code to facilitate future research in human-robot
interaction system learning; access these resources at
https://www.snehesh.com/natsgd/
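For readers building on the released data, here is a minimal, hypothetical sketch of how a synchronized NatSGD-style sample might be represented. All field names are illustrative assumptions, not the dataset's actual schema; see the released code at the link above for the real loaders.

```python
# Hypothetical sketch of a synchronized speech + gesture + demonstration
# sample; field names are illustrative, not the dataset's actual API.
from dataclasses import dataclass
import numpy as np

@dataclass
class NatSGDSample:
    speech_audio: np.ndarray       # raw waveform of the spoken command
    transcript: str                # transcription of the command
    gesture_keypoints: np.ndarray  # (T, J, 3) human body/hand keypoints
    robot_trajectory: np.ndarray   # (T, D) demonstrated robot joint states
    task_label: str                # e.g. "wash the pan"

def fuse_command(sample: NatSGDSample) -> dict:
    """Toy pairing of the command channels with the demonstration.

    A real model would embed speech and gesture jointly; here we just
    bundle the synchronized streams for downstream training.
    """
    return {
        "command": (sample.transcript, sample.gesture_keypoints),
        "demonstration": sample.robot_trajectory,
        "task": sample.task_label,
    }
```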
Related papers
- Imitation of human motion achieves natural head movements for humanoid robots in an active-speaker detection task [2.8220015774219567]
Head movements are crucial for social human-human interaction.
In this work, we employed a generative AI pipeline to produce human-like head movements for a Nao humanoid robot.
The results show that the Nao robot successfully imitates human head movements in a natural manner while actively tracking the speakers during the conversation.
arXiv Detail & Related papers (2024-07-16T17:08:40Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by
Predicting the Interaction (MPI).
Experimental results show that MPI improves on the previous state of the art
by 10% to 64% on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Teaching Unknown Objects by Leveraging Human Gaze and Augmented Reality in Human-Robot Interaction [3.1473798197405953]
This dissertation aims to teach a robot unknown objects in the context of
Human-Robot Interaction (HRI).
The combination of eye tracking and Augmented Reality empowered the human
teacher to communicate with the robot naturally.
The robot's object detection capabilities exhibited comparable performance to state-of-the-art object detectors trained on extensive datasets.
arXiv Detail & Related papers (2023-12-12T11:34:43Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
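A minimal sketch of this style of LLM-in-the-loop planning under partial observability; `query_llm` and the `env.observe()`/`env.execute()` interface are illustrative stand-ins, not the paper's actual API.

```python
# Sketch: the LLM proposes the next action from the observation/action
# history; acting reveals new information about the partially observed scene.

def query_llm(prompt: str) -> str:
    """Placeholder: always stops; replace with a real LLM API call."""
    return "DONE"

def interactive_plan(goal: str, env, max_steps: int = 20):
    history = []  # observations and actions gathered so far
    for _ in range(max_steps):
        obs = env.observe()            # partial observation only
        history.append(f"observed: {obs}")
        prompt = (
            f"Goal: {goal}\n"
            "History:\n" + "\n".join(history) +
            "\nPropose the single next action, or DONE."
        )
        action = query_llm(prompt).strip()
        if action == "DONE":
            break
        env.execute(action)            # acting reveals new information
        history.append(f"executed: {action}")
    return history
```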
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- Learning Multimodal Latent Dynamics for Human-Robot Interaction [19.803547418450236]
This article presents a method for learning well-coordinated Human-Robot Interaction (HRI) from Human-Human Interactions (HHI).
We devise a hybrid approach using Hidden Markov Models (HMMs) as the latent space priors for a Variational Autoencoder to model a joint distribution over the interacting agents.
We find that users perceive our method as more human-like, timely, and accurate, and prefer it over the other baselines.
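To make the latent prior concrete, here is the standard HMM forward recursion such a prior relies on, in plain NumPy. In the paper's setting the emissions would be scored in the VAE's latent space, so this is a generic sketch rather than their implementation.

```python
# Forward recursion over discrete latent states: after normalization,
# alpha[t, k] = p(state_t = k | z_1..z_t), the filtered posterior.
import numpy as np

def hmm_forward(pi, A, emission_logprob):
    """pi: (K,) initial state distribution
    A:  (K, K) transition matrix, A[i, j] = p(j | i)
    emission_logprob: (T, K) log p(z_t | state k)
    """
    T, K = emission_logprob.shape
    alpha = np.zeros((T, K))
    alpha[0] = pi * np.exp(emission_logprob[0])
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * np.exp(emission_logprob[t])
        alpha[t] /= alpha[t].sum()  # normalize for numerical stability
    return alpha
```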
arXiv Detail & Related papers (2023-11-27T23:56:59Z)
- RH20T: A Comprehensive Robotic Dataset for Learning Diverse Skills in One-Shot [56.130215236125224]
A key challenge in robotic manipulation in open domains is how to acquire diverse and generalizable skills for robots.
Recent research in one-shot imitation learning has shown promise in transferring trained policies to new tasks based on demonstrations.
This paper aims to unlock the potential for an agent to generalize to hundreds of real-world skills with multi-modal perception.
arXiv Detail & Related papers (2023-07-02T15:33:31Z)
- MILD: Multimodal Interactive Latent Dynamics for Learning Human-Robot Interaction [34.978017200500005]
We propose Multimodal Interactive Latent Dynamics (MILD) to address the
problem of two-party physical Human-Robot Interactions (HRIs).
We learn the interaction dynamics from demonstrations, using Hidden
Semi-Markov Models (HSMMs) to model the joint distribution of the interacting
agents in the latent space of a Variational Autoencoder (VAE).
MILD generates more accurate trajectories for the controlled agent (robot) when conditioned on the observed agent's (human) trajectory.
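In models of this family, conditioning on the observed agent typically reduces to Gaussian conditioning in the latent space. The following sketch shows that computation for a joint Gaussian over [z_human, z_robot]; the shapes are toy values, not MILD's code.

```python
# E[z_robot | z_human] for a joint Gaussian N(mu, Sigma) over both agents'
# latents: mu_r|h = mu_r + S_rh S_hh^{-1} (z_h - mu_h).
import numpy as np

def condition_robot_on_human(mu, Sigma, z_h, d_h):
    """mu: (d,) joint mean; Sigma: (d, d) joint covariance;
    z_h: (d_h,) observed human latent; d_h: human latent dimension.
    """
    mu_h, mu_r = mu[:d_h], mu[d_h:]
    S_hh = Sigma[:d_h, :d_h]
    S_rh = Sigma[d_h:, :d_h]
    # Solve instead of inverting S_hh explicitly, for stability
    return mu_r + S_rh @ np.linalg.solve(S_hh, z_h - mu_h)
```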
arXiv Detail & Related papers (2022-10-22T11:25:11Z)
- Few-Shot Visual Grounding for Natural Human-Robot Interaction [0.0]
We propose a software architecture that segments a target object, indicated verbally by a human user, from a crowded scene.
At the core of our system, we employ a multi-modal deep neural network for visual grounding.
We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets.
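One common way to realize such verbally indicated grounding, shown here as a baseline sketch rather than the paper's network, is to score candidate object crops against the user's phrase with a pretrained vision-language model such as CLIP.

```python
# Score candidate object crops against a spoken phrase and pick the best
# match; a generic grounding baseline, not the paper's architecture.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def ground_phrase(phrase: str, crops: list[Image.Image]) -> int:
    """Return the index of the crop best matching the phrase."""
    inputs = processor(text=[phrase], images=crops,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text: (1, num_crops) similarity scores
    return out.logits_per_text.argmax(dim=-1).item()
```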
arXiv Detail & Related papers (2021-03-17T15:24:02Z)
- Where is my hand? Deep hand segmentation for visual self-recognition in humanoid robots [129.46920552019247]
We propose the use of a Convolutional Neural Network (CNN) to segment the robot hand from an image in an egocentric view.
We fine-tuned the Mask-RCNN network for the specific task of segmenting the hand of the humanoid robot Vizzy.
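The usual torchvision recipe for this kind of adaptation looks like the following: swap the box and mask heads of a pretrained Mask R-CNN for the custom class count (background plus hand). The paper's exact training setup may differ.

```python
# Standard torchvision fine-tuning setup for Mask R-CNN on a custom class.
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

num_classes = 2  # background + robot hand
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")

# Replace the box head for our class count
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

# Replace the mask head likewise
in_channels = model.roi_heads.mask_predictor.conv5_mask.in_channels
model.roi_heads.mask_predictor = MaskRCNNPredictor(in_channels, 256, num_classes)
```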
arXiv Detail & Related papers (2021-02-09T10:34:32Z)
- Self-supervised reinforcement learning for speaker localisation with the iCub humanoid robot [58.2026611111328]
Looking at a person's face is one of the mechanisms humans rely on to filter speech in noisy environments.
Having a robot that can look toward a speaker could benefit ASR performance in challenging environments.
We propose a self-supervised reinforcement learning-based framework inspired by the early development of humans.
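As a toy illustration of the idea, not the paper's system, an agent can learn to orient toward a speaker with tabular Q-learning, using a self-generated reward: perceived loudness improves as the head faces the source.

```python
# Toy self-supervised head-orienting agent: the reward is derived from the
# agent's own audio signal (here a stand-in loudness cue), not from labels.
import random

ANGLES = list(range(-60, 61, 30))   # discrete head orientations (degrees)
ACTIONS = [-30, 0, 30]              # turn left, stay, turn right

def loudness(head, speaker):
    return -abs(head - speaker)     # stand-in for a binaural intensity cue

Q = {(a, u): 0.0 for a in ANGLES for u in ACTIONS}
for episode in range(2000):
    speaker = random.choice(ANGLES)
    head = random.choice(ANGLES)
    for _ in range(10):
        # epsilon-greedy action selection
        u = random.choice(ACTIONS) if random.random() < 0.1 else \
            max(ACTIONS, key=lambda x: Q[(head, x)])
        new = min(max(head + u, -60), 60)
        r = loudness(new, speaker)  # self-supervised reward signal
        Q[(head, u)] += 0.1 * (r + 0.9 * max(Q[(new, x)] for x in ACTIONS)
                               - Q[(head, u)])
        head = new
```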
arXiv Detail & Related papers (2020-11-12T18:02:15Z)
- Task-relevant Representation Learning for Networked Robotic Perception [74.0215744125845]
This paper presents an algorithm to learn task-relevant representations of sensory data that are co-designed with a pre-trained robotic perception model's ultimate objective.
Our algorithm aggressively compresses robotic sensory data by up to 11x more than competing methods.
arXiv Detail & Related papers (2020-11-06T07:39:08Z)
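The co-design idea can be sketched as follows: train the compressor so a frozen, pretrained task model still produces the same outputs, instead of optimizing pixel reconstruction. The architecture and sizes here are toy assumptions, not the paper's algorithm.

```python
# Task-relevant compression sketch: the loss matches the task model's
# outputs on reconstructed data, not the raw pixels themselves.
import torch
import torch.nn as nn

task_model = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 10))  # pretrained
for p in task_model.parameters():
    p.requires_grad = False  # frozen during compressor training

encoder = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64, 32))  # 128x smaller
decoder = nn.Sequential(nn.Linear(32, 64 * 64), nn.Unflatten(1, (64, 64)))
opt = torch.optim.Adam([*encoder.parameters(), *decoder.parameters()], lr=1e-3)

for _ in range(100):
    x = torch.rand(16, 64, 64)        # stand-in batch of sensor frames
    x_hat = decoder(encoder(x))
    # Task-relevant loss: preserve what the task model needs, not the pixels
    loss = nn.functional.mse_loss(task_model(x_hat), task_model(x))
    opt.zero_grad(); loss.backward(); opt.step()
```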