Learning to Generate Pointing Gestures in Situated Embodied Conversational Agents
- URL: http://arxiv.org/abs/2509.12507v1
- Date: Mon, 15 Sep 2025 23:15:15 GMT
- Title: Learning to Generate Pointing Gestures in Situated Embodied Conversational Agents
- Authors: Anna Deichler, Siyang Wang, Simon Alexanderson, Jonas Beskow
- Abstract summary: We present a framework for generating pointing gestures in embodied agents by combining imitation and reinforcement learning. We evaluate the approach against supervised learning and retrieval baselines in both objective metrics and a virtual reality referential game with human users.
- Score: 19.868403110796105
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the main goals of robotics and intelligent agent research is to enable natural communication with humans in physically situated settings. While recent work has focused on verbal modes such as language and speech, non-verbal communication is crucial for flexible interaction. We present a framework for generating pointing gestures in embodied agents by combining imitation and reinforcement learning. Using a small motion capture dataset, our method learns a motor control policy that produces physically valid, naturalistic gestures with high referential accuracy. We evaluate the approach against supervised learning and retrieval baselines in both objective metrics and a virtual reality referential game with human users. Results show that our system achieves higher naturalness and accuracy than state-of-the-art supervised models, highlighting the promise of imitation-RL for communicative gesture generation and its potential application to robots.
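The abstract pairs imitation learning (naturalness, learned from a small motion capture set) with reinforcement learning (referential accuracy). Below is a minimal sketch of one common way to combine the two, in the style of DeepMimic-like motion imitation; every function name, weight, and scale is an assumption for illustration, not the paper's implementation.

```python
import numpy as np

def imitation_reward(agent_pose: np.ndarray, ref_pose: np.ndarray,
                     scale: float = 2.0) -> float:
    """Reward for staying close to a motion-capture reference pose (assumed form)."""
    return float(np.exp(-scale * np.sum((agent_pose - ref_pose) ** 2)))

def pointing_reward(wrist_pos: np.ndarray, forearm_dir: np.ndarray,
                    target_pos: np.ndarray, scale: float = 5.0) -> float:
    """Reward for aligning the pointing ray with the referent (assumed form)."""
    to_target = target_pos - wrist_pos
    to_target /= np.linalg.norm(to_target)
    ray = forearm_dir / np.linalg.norm(forearm_dir)
    angle = np.arccos(np.clip(ray @ to_target, -1.0, 1.0))  # angular error, radians
    return float(np.exp(-scale * angle))

def combined_reward(agent_pose, ref_pose, wrist_pos, forearm_dir, target_pos,
                    w_imitate: float = 0.5, w_point: float = 0.5) -> float:
    """Weighted mix: imitation term for naturalness, task term for accuracy."""
    return (w_imitate * imitation_reward(agent_pose, ref_pose)
            + w_point * pointing_reward(wrist_pos, forearm_dir, target_pos))

# Toy call: a 3-joint pose vector and a target one meter in front of the wrist.
r = combined_reward(np.zeros(9), np.full(9, 0.1),
                    wrist_pos=np.array([0.3, 1.2, 0.4]),
                    forearm_dir=np.array([0.0, 0.0, 1.0]),
                    target_pos=np.array([0.3, 1.2, 1.4]))
print(round(r, 3))  # ~0.92: near-perfect pointing, slight pose deviation
```

Shifting weight from w_imitate to w_point trades naturalness against pointing precision, mirroring the two evaluation axes in the abstract.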
Related papers
- An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction [0.0]
This work presents a novel HRI framework that combines advanced vision-language models, speech processing, and fuzzy logic. The proposed system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition. Experimental evaluations conducted on consumer-grade hardware demonstrate a command execution accuracy of 75%.
arXiv Detail & Related papers (2026-02-23T09:05:15Z)
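As a rough illustration of the pipeline this entry describes (speech recognition feeding a language model, gated by a fuzzy confidence score), here is a hedged sketch with the heavy model calls stubbed out; the function names, the Command structure, and the 0.5 threshold are assumptions, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class Command:
    action: str
    target: str
    confidence: float  # fuzzy membership score in [0, 1]

def transcribe(audio_path: str) -> str:
    """Stub for speech recognition (the paper uses Whisper)."""
    return "pick up the red cup"

def parse_command(utterance: str, detected_objects: list[str]) -> Command:
    """Stub for language understanding (the paper uses Llama 3.1).
    A fuzzy confidence score gates execution, per the abstract."""
    target = next((o for o in detected_objects if o in utterance), "unknown")
    return Command(action="pick_up", target=target,
                   confidence=0.9 if target != "unknown" else 0.2)

utterance = transcribe("user_audio.wav")
cmd = parse_command(utterance, detected_objects=["red cup", "book"])
if cmd.confidence > 0.5:  # execute only when the fuzzy score is high enough
    print(f"executing {cmd.action} on {cmd.target}")
```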
- Towards Context-Aware Human-like Pointing Gestures with RL Motion Imitation [19.868403110796105]
We present a motion capture dataset of human pointing gestures covering diverse styles, handedness, and spatial targets. Using reinforcement learning with motion imitation, we train policies that reproduce human-like pointing while maximizing precision.
arXiv Detail & Related papers (2025-09-16T09:30:42Z)
- Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [113.25650486482762]
We introduce the Seamless Interaction dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage. This dataset enables the development of AI technologies that understand dyadic embodied dynamics. We develop a suite of models that utilize the dataset to generate dyadic motion gestures and facial expressions aligned with human speech.
arXiv Detail & Related papers (2025-06-27T18:09:49Z)
- Towards Developmentally Plausible Rewards: Communicative Success as a Learning Signal for Interactive Language Models [49.22720751953838]
We propose a method for training language models in an interactive setting inspired by child language acquisition. In our setting, a speaker attempts to communicate some information to a listener in a single-turn dialogue and receives a reward if communicative success is achieved.
arXiv Detail & Related papers (2025-05-09T11:48:36Z)
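The single-turn speaker-listener setup with a reward for communicative success can be sketched as a minimal referential game; the agents below are trivial stubs and every name is illustrative, not taken from the paper.

```python
import random

def play_round(speaker, listener, candidates: list[str]) -> float:
    """One round: speaker describes a target, listener guesses among candidates."""
    target = random.choice(candidates)
    message = speaker(target)              # speaker describes the target
    guess = listener(message, candidates)  # listener picks a candidate
    return 1.0 if guess == target else 0.0  # reward = communicative success

# Trivial stub agents for demonstration:
speaker = lambda target: f"the {target}"
listener = lambda msg, cands: next(c for c in cands if c in msg)
print(play_round(speaker, listener, ["red block", "blue block"]))  # prints 1.0
```

In an actual training loop this binary reward would serve as the learning signal for the speaker model, in the spirit the abstract describes.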
- Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication [4.49451692966442]
This paper proposes an Inter-Diffusion Generation Model of Speakers and Listeners for Effective Communication. For the first time, we integrate the full-body gestures of listeners into the generation framework.
arXiv Detail & Related papers (2025-05-08T07:00:58Z)
- Signaling and Social Learning in Swarms of Robots [0.0]
This paper investigates the role of communication in improving coordination within robot swarms.
We highlight the role communication can play in addressing the credit assignment problem.
arXiv Detail & Related papers (2024-11-18T14:42:15Z)
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- Multimodal Fusion with LLMs for Engagement Prediction in Natural Conversation [70.52558242336988]
We focus on predicting engagement in dyadic interactions by scrutinizing verbal and non-verbal cues, aiming to detect signs of disinterest or confusion.
In this work, we collect a dataset featuring 34 participants engaged in casual dyadic conversations, each providing self-reported engagement ratings at the end of each conversation.
We introduce a novel fusion strategy using Large Language Models (LLMs) to integrate multiple behavior modalities into a "multimodal transcript".
arXiv Detail & Related papers (2024-09-13T18:28:12Z)
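One plausible reading of the "multimodal transcript" fusion strategy is serializing time-stamped verbal and non-verbal cues into text that an LLM can score for engagement; the sketch below illustrates that idea with invented cue names and prompt wording, not the paper's actual format.

```python
# Toy time-stamped events from several behavior modalities (illustrative).
events = [
    (0.0, "speech", "So how was your weekend?"),
    (1.2, "gaze",   "listener looks away"),
    (2.5, "speech", "Uh, it was fine, I guess."),
    (3.0, "facial", "listener frowns briefly"),
]

# Serialize the modalities into a single text transcript.
transcript = "\n".join(
    f"[{t:.1f}s] ({modality}) {content}" for t, modality, content in events
)
prompt = (
    "Rate the listener's engagement from 1 (disengaged) to 5 (engaged) "
    "given this annotated conversation:\n" + transcript
)
print(prompt)  # this prompt would then be sent to an LLM for prediction
```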
- Real-time Addressee Estimation: Deployment of a Deep-Learning Model on the iCub Robot [52.277579221741746]
Addressee Estimation is a skill essential for social robots to interact smoothly with humans.
Inspired by human perceptual skills, a deep-learning model for Addressee Estimation is designed, trained, and deployed on an iCub robot.
The study presents the procedure of such implementation and the performance of the model deployed in real-time human-robot interaction.
arXiv Detail & Related papers (2023-11-09T13:01:21Z)
- Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents [5.244401764969407]
Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread.
We propose a novel framework that can generate sequences of joint angles from speech text and audio utterances.
arXiv Detail & Related papers (2023-09-17T18:46:25Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.