MoEmo Vision Transformer: Integrating Cross-Attention and Movement
Vectors in 3D Pose Estimation for HRI Emotion Detection
- URL: http://arxiv.org/abs/2310.09757v1
- Date: Sun, 15 Oct 2023 06:52:15 GMT
- Title: MoEmo Vision Transformer: Integrating Cross-Attention and Movement
Vectors in 3D Pose Estimation for HRI Emotion Detection
- Authors: David C. Jeong, Tianma Shen, Hongji Liu, Raghav Kapoor, Casey Nguyen,
Song Liu, Christopher A. Kitts
- Abstract summary: We introduce MoEmo (Motion to Emotion), a cross-attention vision transformer (ViT) for human emotion detection within robotics systems.
We implement a cross-attention fusion model to combine movement vectors and environment contexts into a joint representation to derive emotion estimation.
We train the MoEmo system to jointly analyze motion and context, yielding emotion detection that outperforms the current state-of-the-art.
- Score: 4.757210144179483
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Emotion detection presents challenges to intelligent human-robot interaction
(HRI). Foundational deep learning techniques used in emotion detection are
limited by information-constrained datasets or models that lack the necessary
complexity to learn interactions between input data elements, such as the
variance of human emotions across different contexts. In the current effort, we
introduce 1) MoEmo (Motion to Emotion), a cross-attention vision transformer
(ViT) for human emotion detection within robotics systems based on 3D human
pose estimations across various contexts, and 2) a data set that offers
full-body videos of human movement and corresponding emotion labels based on
human gestures and environmental contexts. Compared to existing approaches, our
method effectively leverages the subtle connections between movement vectors of
gestures and environmental contexts through the use of cross-attention on the
extracted movement vectors of full-body human gestures/poses and feature maps
of environmental contexts. We implement a cross-attention fusion model to
combine movement vectors and environment contexts into a joint representation
to derive emotion estimation. Leveraging our Naturalistic Motion Database, we
train the MoEmo system to jointly analyze motion and context, yielding emotion
detection that outperforms the current state-of-the-art.
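The abstract describes the fusion step only at a high level. As a rough illustration (not the authors' released implementation), the sketch below uses a standard PyTorch `nn.MultiheadAttention` layer so that movement vectors extracted from 3D poses act as queries over environment-context feature tokens, and the fused representation feeds an emotion classifier; all dimensions, module names, and the pooling choice are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionEmotionHead(nn.Module):
    """Illustrative cross-attention fusion of pose movement vectors and
    environment context features (hypothetical layer sizes)."""

    def __init__(self, motion_dim=51, context_dim=768, embed_dim=256,
                 num_heads=4, num_emotions=7):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, embed_dim)    # per-frame movement vectors
        self.context_proj = nn.Linear(context_dim, embed_dim)  # flattened context feature map
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.LayerNorm(embed_dim),
            nn.Linear(embed_dim, num_emotions),
        )

    def forward(self, movement_vectors, context_features):
        # movement_vectors: (B, T, motion_dim)  frame-to-frame joint displacements
        # context_features: (B, N, context_dim) spatial tokens from a scene backbone
        q = self.motion_proj(movement_vectors)      # queries come from motion
        kv = self.context_proj(context_features)    # keys/values come from the scene
        fused, _ = self.cross_attn(q, kv, kv)       # motion attends over context
        pooled = fused.mean(dim=1)                  # temporal average pooling
        return self.classifier(pooled)              # emotion logits

# Usage with random tensors standing in for real pose/context features
model = CrossAttentionEmotionHead()
logits = model(torch.randn(2, 30, 51), torch.randn(2, 49, 768))
print(logits.shape)  # torch.Size([2, 7])
```

The key design point mirrored here is that motion supplies the queries while the scene supplies keys and values, so the attention weights express which parts of the environment are relevant to a given gesture.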
Related papers
- EMOTION: Expressive Motion Sequence Generation for Humanoid Robots with In-Context Learning [10.266351600604612]
This paper introduces a framework, called EMOTION, for generating expressive motion sequences in humanoid robots.
We conduct online user studies comparing the naturalness and understandability of the motions generated by EMOTION and its human-feedback version, EMOTION++.
arXiv Detail & Related papers (2024-10-30T17:22:45Z)
- Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- Exploring Emotions in Multi-componential Space using Interactive VR Games [1.1510009152620668]
We operationalised a data-driven approach using interactive Virtual Reality (VR) games.
We used Machine Learning (ML) methods to identify the unique contributions of each component to emotion differentiation.
These findings also have implications for using VR environments in emotion research.
arXiv Detail & Related papers (2024-04-04T06:54:44Z)
- Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation [43.04371187071256]
We present a novel method to generate vivid and emotional 3D co-speech gestures in 3D avatars.
We use ChatGPT-4 and an audio inpainting approach to construct high-fidelity emotion-transition human speech.
Our method outperforms the state-of-the-art models constructed by adapting single emotion-conditioned counterparts.
arXiv Detail & Related papers (2023-11-29T11:10:40Z)
- Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations [61.659439423703155]
TOHO: Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations.
Our method generates continuous motions that are parameterized only by the temporal coordinate.
This work takes a step further toward general human-scene interaction simulation.
arXiv Detail & Related papers (2023-03-23T09:31:56Z)
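The TOHO summary above notes only that motions are parameterized by the temporal coordinate. A minimal sketch of that general idea (not TOHO's actual architecture) is an implicit MLP mapping a continuous, normalized time value to a pose vector, so the clip can be queried at arbitrary frame rates; the pose dimension, network size, and training loop below are placeholders.

```python
import torch
import torch.nn as nn

class ImplicitMotion(nn.Module):
    """Toy implicit motion representation: pose = f(t), with t a continuous scalar."""

    def __init__(self, pose_dim=66, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, pose_dim),
        )

    def forward(self, t):
        # t: (N, 1) normalized temporal coordinates in [0, 1]
        return self.net(t)

# Fit the network to a recorded clip, then query it at any continuous time.
clip = torch.randn(120, 66)                       # 120 frames of a 22-joint pose (placeholder data)
t_train = torch.linspace(0, 1, 120).unsqueeze(1)
model = ImplicitMotion()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(t_train), clip)
    loss.backward()
    opt.step()

dense_poses = model(torch.linspace(0, 1, 480).unsqueeze(1))  # 4x temporal upsampling
print(dense_poses.shape)  # torch.Size([480, 66])
```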
- Multi-Cue Adaptive Emotion Recognition Network [4.570705738465714]
We propose a new deep learning approach for emotion recognition based on adaptive multi-cues.
We compare the proposed approach with state-of-the-art approaches on the CAER-S dataset.
arXiv Detail & Related papers (2021-11-03T15:08:55Z)
- SOLVER: Scene-Object Interrelated Visual Emotion Reasoning Network [83.27291945217424]
We propose a novel Scene-Object interreLated Visual Emotion Reasoning network (SOLVER) to predict emotions from images.
To mine the emotional relationships between distinct objects, we first build up an Emotion Graph based on semantic concepts and visual features.
We also design a Scene-Object Fusion Module to integrate scenes and objects, which exploits scene features to guide the fusion process of object features with the proposed scene-based attention mechanism.
arXiv Detail & Related papers (2021-10-24T02:41:41Z)
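SOLVER's scene-based attention, as summarized above, can be pictured as the scene feature scoring each detected object's feature and guiding a weighted fusion. The sketch below is an assumption-laden simplification: the scoring function, feature dimensions, and classifier head are illustrative, and the Emotion Graph component is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SceneObjectFusion(nn.Module):
    """Sketch of scene-guided attention over object features (hypothetical dims)."""

    def __init__(self, feat_dim=512, num_emotions=8):
        super().__init__()
        self.score = nn.Linear(2 * feat_dim, 1)          # scores each (scene, object) pair
        self.classifier = nn.Linear(2 * feat_dim, num_emotions)

    def forward(self, scene_feat, object_feats):
        # scene_feat:   (B, D)     global scene representation
        # object_feats: (B, K, D)  features of K detected objects
        B, K, D = object_feats.shape
        scene_exp = scene_feat.unsqueeze(1).expand(B, K, D)
        attn = F.softmax(self.score(torch.cat([scene_exp, object_feats], dim=-1)), dim=1)
        object_summary = (attn * object_feats).sum(dim=1)         # scene-weighted objects
        fused = torch.cat([scene_feat, object_summary], dim=-1)   # scene + object evidence
        return self.classifier(fused)

model = SceneObjectFusion()
logits = model(torch.randn(4, 512), torch.randn(4, 10, 512))
print(logits.shape)  # torch.Size([4, 8])
```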
- Scene-aware Generative Network for Human Motion Synthesis [125.21079898942347]
We propose a new framework, with the interaction between the scene and the human motion taken into account.
Considering the uncertainty of human motion, we formulate this task as a generative task.
We derive a GAN-based learning approach, with discriminators to enforce the compatibility between the human motion and the contextual scene.
arXiv Detail & Related papers (2021-05-31T09:05:50Z)
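The compatibility discriminator described above can be pictured as a network that scores a (motion sequence, scene encoding) pair as real or fake. The sketch below is illustrative only; the GRU motion encoder, feature sizes, and scoring head are assumptions rather than the paper's architecture.

```python
import torch
import torch.nn as nn

class MotionSceneDiscriminator(nn.Module):
    """Sketch of a discriminator that judges whether a motion fits a scene."""

    def __init__(self, pose_dim=66, scene_dim=512, hidden=256):
        super().__init__()
        self.motion_enc = nn.GRU(pose_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(hidden + scene_dim, hidden), nn.LeakyReLU(0.2),
            nn.Linear(hidden, 1),                 # real/fake compatibility score
        )

    def forward(self, motion, scene_feat):
        # motion:     (B, T, pose_dim) pose sequence (real or generated)
        # scene_feat: (B, scene_dim)   encoding of the surrounding scene
        _, h = self.motion_enc(motion)            # h: (1, B, hidden)
        return self.head(torch.cat([h[-1], scene_feat], dim=-1))

disc = MotionSceneDiscriminator()
score = disc(torch.randn(4, 60, 66), torch.randn(4, 512))
print(score.shape)  # torch.Size([4, 1])
```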
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art methods specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
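One way to picture the visibility indicator mentioned above is as per-joint visibility logits predicted alongside the pose forecast, with joints marked invisible excluded from the pose loss. The loss form and tensor shapes below are illustrative assumptions, not TRiPOD's exact formulation.

```python
import torch
import torch.nn.functional as F

def masked_pose_loss(pred_pose, gt_pose, vis_logits, gt_vis):
    """Illustrative loss: supervise visibility, and only penalize pose error
    on joints labeled visible.
    pred_pose, gt_pose: (B, T, J, 3); vis_logits, gt_vis: (B, T, J)."""
    vis_loss = F.binary_cross_entropy_with_logits(vis_logits, gt_vis)
    mask = gt_vis.unsqueeze(-1)                                  # (B, T, J, 1)
    pose_err = ((pred_pose - gt_pose) ** 2 * mask).sum()
    pose_loss = pose_err / (3 * mask.sum()).clamp(min=1.0)       # mean over visible coords
    return pose_loss + vis_loss

# Placeholder tensors standing in for forecasts and ground truth
B, T, J = 2, 15, 21
loss = masked_pose_loss(torch.randn(B, T, J, 3), torch.randn(B, T, J, 3),
                        torch.randn(B, T, J), torch.randint(0, 2, (B, T, J)).float())
print(loss.item())
```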
- iGibson, a Simulation Environment for Interactive Tasks in Large Realistic Scenes [54.04456391489063]
iGibson is a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes.
Our environment contains fifteen fully interactive home-sized scenes populated with rigid and articulated objects.
iGibson's features enable the generalization of navigation agents, and the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of simple human-demonstrated behaviors.
arXiv Detail & Related papers (2020-12-05T02:14:17Z)
- Affective Movement Generation using Laban Effort and Shape and Hidden Markov Models [6.181642248900806]
This paper presents an approach for automatic affective movement generation that makes use of two movement abstractions: 1) Laban movement analysis (LMA), and 2) hidden Markov modeling.
The LMA provides a systematic tool for an abstract representation of the kinematic and expressive characteristics of movements.
An HMM abstraction of the identified movements is obtained and used with the desired motion path to generate a novel movement that conveys the target emotion.
The efficacy of the proposed approach in generating movements with recognizable target emotions is assessed using a validated automatic recognition model and a user study.
arXiv Detail & Related papers (2020-06-10T21:24:26Z)
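The HMM abstraction step described above can be illustrated with the hmmlearn library: fit a Gaussian HMM on pose-feature sequences drawn from movements labeled with the target emotion, then sample a novel sequence from the fitted model. This sketch omits the Laban-based feature extraction and the constraint to a desired motion path; the feature dimension and hyperparameters are placeholders.

```python
import numpy as np
from hmmlearn import hmm

# Placeholder pose-feature sequences (e.g., joint angles per frame) for one target emotion.
rng = np.random.default_rng(0)
sequences = [rng.normal(size=(80, 12)) for _ in range(5)]   # 5 clips, 80 frames, 12 features
X = np.concatenate(sequences)
lengths = [len(s) for s in sequences]

# Abstract the labeled movements with a Gaussian HMM, then sample a novel movement.
model = hmm.GaussianHMM(n_components=6, covariance_type="diag", n_iter=50, random_state=0)
model.fit(X, lengths)
new_motion, hidden_states = model.sample(100)               # 100 generated frames
print(new_motion.shape)  # (100, 12)
```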
This list is automatically generated from the titles and abstracts of the papers in this site.