MILD: Multimodal Interactive Latent Dynamics for Learning Human-Robot
Interaction
- URL: http://arxiv.org/abs/2210.12418v1
- Date: Sat, 22 Oct 2022 11:25:11 GMT
- Title: MILD: Multimodal Interactive Latent Dynamics for Learning Human-Robot
Interaction
- Authors: Vignesh Prasad, Dorothea Koert, Ruth Stock-Homburg, Jan Peters,
Georgia Chalvatzaki
- Abstract summary: We propose Multimodal Interactive Latent Dynamics (MILD) to address the problem of two-party physical Human-Robot Interactions (HRIs).
We learn the interaction dynamics from demonstrations, using Hidden Semi-Markov Models (HSMMs) to model the joint distribution of the interacting agents in the latent space of a Variational Autoencoder (VAE).
MILD generates more accurate trajectories for the controlled agent (robot) when conditioned on the observed agent's (human) trajectory.
- Score: 34.978017200500005
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Modeling interaction dynamics to generate robot trajectories that enable a
robot to adapt and react to a human's actions and intentions is critical for
efficient and effective collaborative Human-Robot Interactions (HRI). Learning
from Demonstration (LfD) methods from Human-Human Interactions (HHI) have shown
promising results, especially when coupled with representation learning
techniques. However, such methods for learning HRI either do not scale well to
high dimensional data or cannot accurately adapt to changing via-poses of the
interacting partner. We propose Multimodal Interactive Latent Dynamics (MILD),
a method that couples deep representation learning and probabilistic machine
learning to address the problem of two-party physical HRIs. We learn the
interaction dynamics from demonstrations, using Hidden Semi-Markov Models
(HSMMs) to model the joint distribution of the interacting agents in the latent
space of a Variational Autoencoder (VAE). Our experimental evaluations for
learning HRI from HHI demonstrations show that MILD effectively captures the
multimodality in the latent representations of HRI tasks, allowing us to decode
the varying dynamics occurring in such tasks. Compared to related work, MILD
generates more accurate trajectories for the controlled agent (robot) when
conditioned on the observed agent's (human) trajectory. Notably, MILD can learn
directly from camera-based pose estimations to generate trajectories, which we
then map to a humanoid robot without the need for any additional training.
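The conditioning step described in the abstract (predicting the robot's latent state from the observed human's latent state under a joint distribution) can be illustrated with standard Gaussian conditioning. The following is a minimal sketch, not the authors' implementation: it assumes each hidden state holds a joint Gaussian over the stacked latent vector [human, robot] and that per-state weights are already available (for example from an HSMM forward pass, which is not shown here). The function names `condition_gaussian` and `predict_robot_latent` and all parameter names are hypothetical.

```python
import numpy as np


def condition_gaussian(mu, sigma, z_h, dim_h):
    """Condition a joint Gaussian over [human, robot] latents on the human part.

    mu, sigma: mean and covariance of the stacked latent vector (assumed layout:
               the first dim_h entries belong to the human, the rest to the robot).
    z_h:       observed human latent vector of length dim_h.
    Returns the conditional mean and covariance of the robot latent.
    """
    mu_h, mu_r = mu[:dim_h], mu[dim_h:]
    s_hh = sigma[:dim_h, :dim_h]
    s_rh = sigma[dim_h:, :dim_h]
    s_rr = sigma[dim_h:, dim_h:]
    # gain = s_rh @ inv(s_hh), computed with a linear solve for stability
    gain = np.linalg.solve(s_hh, s_rh.T).T
    cond_mu = mu_r + gain @ (z_h - mu_h)
    cond_sigma = s_rr - gain @ s_rh.T
    return cond_mu, cond_sigma


def predict_robot_latent(mus, sigmas, weights, z_h, dim_h):
    """Blend the per-state conditional means using normalized state weights.

    weights: one non-negative weight per hidden state (assumed given).
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    cond_means = [
        condition_gaussian(m, s, z_h, dim_h)[0] for m, s in zip(mus, sigmas)
    ]
    return sum(w * m for w, m in zip(weights, cond_means))


# Toy usage: two hidden states, 1-D human latent and 1-D robot latent
mus = [np.array([0.0, 0.0]), np.array([0.0, 3.0])]
sigmas = [np.array([[1.0, 0.5], [0.5, 1.0]]), np.eye(2)]
z_r = predict_robot_latent(mus, sigmas, [1.0, 3.0], np.array([2.0]), dim_h=1)
```

In a full pipeline of the kind the abstract outlines, the resulting robot latent would be passed through the VAE decoder to obtain a trajectory; the decoder and the HSMM duration model are outside the scope of this sketch.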
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms the representative models regarding objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- MoVEInt: Mixture of Variational Experts for Learning Human-Robot Interactions from Demonstrations [19.184155232662995]
We propose a novel approach for learning a shared latent space representation for Human-Robot Interaction (HRI).
We train a Variational Autoencoder (VAE) to learn robot motions regularized using an informative latent space prior.
We find that our approach of using an informative MDN prior from human observations for a VAE generates more accurate robot motions.
arXiv Detail & Related papers (2024-07-10T13:16:12Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
The experimental results demonstrate that MPI exhibits remarkable improvement by 10% to 64% compared with previous state-of-the-art in real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- MATRIX: Multi-Agent Trajectory Generation with Diverse Contexts [47.12378253630105]
We study trajectory-level data generation for multi-human or human-robot interaction scenarios.
We propose a learning-based automatic trajectory generation model, which we call Multi-Agent TRajectory generation with dIverse conteXts (MATRIX).
arXiv Detail & Related papers (2024-03-09T23:28:54Z)
- NatSGD: A Dataset with Speech, Gestures, and Demonstrations for Robot Learning in Natural Human-Robot Interaction [19.65778558341053]
Speech-gesture HRI datasets often focus on elementary tasks, like object pointing and pushing.
We introduce NatSGD, a multimodal HRI dataset encompassing human commands through speech and gestures.
We demonstrate its effectiveness in training robots to understand tasks through multimodal human commands.
arXiv Detail & Related papers (2024-03-04T18:02:41Z)
- Learning Multimodal Latent Dynamics for Human-Robot Interaction [19.803547418450236]
This article presents a method for learning well-coordinated Human-Robot Interaction (HRI) from Human-Human Interactions (HHI).
We devise a hybrid approach using Hidden Markov Models (HMMs) as the latent space priors for a Variational Autoencoder to model a joint distribution over the interacting agents.
We find that users perceive our method as more human-like, timely, and accurate and rank our method with a higher degree of preference over other baselines.
arXiv Detail & Related papers (2023-11-27T23:56:59Z)
- InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint [67.6297384588837]
We introduce a novel controllable motion generation method, InterControl, to encourage the synthesized motions to maintain the desired distance between joint pairs.
We demonstrate that the distance between joint pairs for human-wise interactions can be generated using an off-the-shelf Large Language Model.
arXiv Detail & Related papers (2023-11-27T14:32:33Z)
- Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception [0.0]
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks.
In this paper, we demonstrate a method of aligning the embedding spaces of different modalities to the vision embedding space.
We show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks.
arXiv Detail & Related papers (2023-08-31T06:53:55Z)
- Visual Affordance Prediction for Guiding Robot Exploration [56.17795036091848]
We develop an approach for learning visual affordances for guiding robot exploration.
We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE.
We show how the trained affordance model can be used for guiding exploration by acting as a goal-sampling distribution during visual goal-conditioned policy learning in robotic manipulation.
arXiv Detail & Related papers (2023-05-28T17:53:09Z)
- DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models.
Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)
- Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works.
However, learning a model that captures the dynamics of complex skills represents a major challenge.
We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)