Communicative Learning with Natural Gestures for Embodied Navigation Agents with Human-in-the-Scene
- URL: http://arxiv.org/abs/2108.02846v1
- Date: Thu, 5 Aug 2021 20:56:47 GMT
- Title: Communicative Learning with Natural Gestures for Embodied Navigation Agents with Human-in-the-Scene
- Authors: Qi Wu, Cheng-Ju Wu, Yixin Zhu, Jungseock Joo
- Abstract summary: We develop a VR-based 3D simulation environment, named Ges-THOR, based on the AI2-THOR platform.
In this virtual environment, a human player is placed in the same virtual scene and shepherds the artificial agent using only gestures.
We argue that learning the semantics of natural gestures is mutually beneficial to learning the navigation task: learn to communicate and communicate to learn.
- Score: 34.1812210095966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-robot collaboration is an essential research topic in artificial
intelligence (AI): it enables researchers to devise cognitive AI systems and
affords an intuitive means for users to interact with a robot. Communication, in
particular, plays a central role. To date, prior studies in embodied agent
navigation have only demonstrated that communication can be facilitated through
instructions in natural language, leaving a plethora of other forms of
communication unexplored. In fact, human communication originated in gestures
and is often delivered through multimodal cues, e.g., "go there" accompanied by
a pointing gesture. To bridge this gap and fill in the missing dimension of
communication in embodied agent navigation, we propose investigating the
effects of using gestures as the communicative interface instead of verbal
cues. Specifically, we develop a VR-based 3D simulation environment, named
Ges-THOR, based on the AI2-THOR platform. In this virtual environment, a human
player is placed in the same virtual scene and shepherds the artificial agent
using only gestures. The agent is tasked with solving the navigation problem
guided by natural gestures with unknown semantics; we do not use any predefined
gestures because of the diverse and versatile nature of human gestures. We argue
that learning the semantics of natural gestures is mutually beneficial to
learning the navigation task: learn to communicate and communicate to learn. In
a series of experiments, we demonstrate that human gesture cues, even without
predefined semantics, improve object-goal navigation for an embodied agent,
outperforming various state-of-the-art methods.
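For readers who want a concrete picture of the setup, the following is a minimal sketch, assuming PyTorch, of how a gesture-conditioned object-goal navigation policy with an auxiliary gesture-semantics head could be wired up. The class name GestureNavPolicy, the pose-keypoint gesture encoding, and the head sizes are illustrative assumptions, not the paper's actual Ges-THOR architecture.

```python
# Minimal sketch (not the authors' exact model): a navigation policy that fuses
# an egocentric RGB view with a short sequence of human pose keypoints (the
# gesture cue) and adds an auxiliary head that predicts the referred object,
# loosely mirroring "learn to communicate and communicate to learn".
import torch
import torch.nn as nn


class GestureNavPolicy(nn.Module):
    def __init__(self, num_actions=6, num_target_classes=20,
                 keypoint_dim=34, gesture_hidden=128):
        super().__init__()
        # Visual encoder for the agent's 84x84 egocentric view.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Gesture encoder: a GRU over per-frame 2D pose keypoints of the human.
        self.gesture_encoder = nn.GRU(keypoint_dim, gesture_hidden, batch_first=True)
        self.fuse = nn.Linear(64 * 7 * 7 + gesture_hidden, 512)
        # Actor-critic heads for navigation actions (move, rotate, stop, ...).
        self.actor = nn.Linear(512, num_actions)
        self.critic = nn.Linear(512, 1)
        # Auxiliary head: guess which object the gesture refers to, so the
        # gesture encoder receives a "communication" learning signal.
        self.semantics_head = nn.Linear(gesture_hidden, num_target_classes)

    def forward(self, rgb, gesture_seq):
        # rgb: (B, 3, 84, 84); gesture_seq: (B, T, keypoint_dim)
        vis = self.visual_encoder(rgb)
        _, h = self.gesture_encoder(gesture_seq)
        g = h.squeeze(0)
        joint = torch.relu(self.fuse(torch.cat([vis, g], dim=-1)))
        return self.actor(joint), self.critic(joint), self.semantics_head(g)


# Usage example with random tensors standing in for Ges-THOR observations.
policy = GestureNavPolicy()
rgb = torch.randn(4, 3, 84, 84)
gestures = torch.randn(4, 30, 34)        # 30 frames of 17 x/y keypoints
logits, value, sem_logits = policy(rgb, gestures)
target_obj = torch.randint(0, 20, (4,))  # illustrative pseudo-labels only
aux_loss = nn.functional.cross_entropy(sem_logits, target_obj)
```

In such a sketch, the auxiliary cross-entropy term would simply be added to the usual actor-critic loss, so the gesture encoder is shaped jointly by the communication and navigation objectives.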
Related papers
- CANVAS: Commonsense-Aware Navigation System for Intuitive Human-Robot Interaction [19.997935470257794]
We present CANVAS, a framework that combines visual and linguistic instructions for commonsense-aware navigation.
Its success is driven by imitation learning, enabling the robot to learn from human navigation behavior.
Our experiments show that CANVAS outperforms the strong rule-based system ROS NavStack across all environments.
arXiv Detail & Related papers (2024-10-02T06:34:45Z)
- SIFToM: Robust Spoken Instruction Following through Theory of Mind [51.326266354164716]
We present a cognitively inspired model, Speech Instruction Following through Theory of Mind (SIFToM), to enable robots to pragmatically follow human instructions under diverse speech conditions.
Results show that the SIFToM model outperforms state-of-the-art speech and language models, approaching human-level accuracy on challenging speech instruction following tasks.
arXiv Detail & Related papers (2024-09-17T02:36:10Z)
- Bridging the Communication Gap: Artificial Agents Learning Sign Language through Imitation [6.1400257928108575]
This research explores acquiring non-verbal communication skills through learning from demonstrations.
In particular, we focus on imitation learning for artificial agents, exemplified by teaching American Sign Language to a simulated humanoid.
We use computer vision and deep learning to extract information from videos, and reinforcement learning to enable the agent to replicate observed actions.
arXiv Detail & Related papers (2024-06-14T13:50:29Z)
- CoNav: A Benchmark for Human-Centered Collaborative Navigation [66.6268966718022]
We propose a collaborative navigation (CoNav) benchmark.
Our CoNav tackles the critical challenge of constructing a 3D navigation environment with realistic and diverse human activities.
We propose an intention-aware agent that reasons about both long-term and short-term human intentions.
arXiv Detail & Related papers (2024-06-04T15:44:25Z)
- GestureGPT: Toward Zero-Shot Free-Form Hand Gesture Understanding with Large Language Model Agents [35.48323584634582]
We introduce GestureGPT, a free-form hand gesture understanding framework that mimics human gesture understanding procedures.
Our framework leverages multiple Large Language Model agents to manage and synthesize gesture and context information.
We validated our framework offline under two real-world scenarios: smart home control and online video streaming.
arXiv Detail & Related papers (2023-10-19T15:17:34Z)
- HandMeThat: Human-Robot Communication in Physical and Social Environments [73.91355172754717]
HandMeThat is a benchmark for a holistic evaluation of instruction understanding and following in physical and social environments.
HandMeThat contains 10,000 episodes of human-robot interactions.
We show that both offline and online reinforcement learning algorithms perform poorly on HandMeThat.
arXiv Detail & Related papers (2023-10-05T16:14:46Z)
- Towards More Human-like AI Communication: A Review of Emergent Communication Research [0.0]
Emergent communication (Emecom) is a field of research aiming to develop artificial agents capable of using natural language.
In this review, we delineate all the common properties we find across the literature and how they relate to human interactions.
We identify two subcategories and highlight their characteristics and open challenges.
arXiv Detail & Related papers (2023-08-01T14:43:10Z)
- Gesture2Path: Imitation Learning for Gesture-aware Navigation [54.570943577423094]
We present Gesture2Path, a novel social navigation approach that combines image-based imitation learning with model-predictive control.
We deploy our method on real robots and showcase the effectiveness of our approach in the four gesture-navigation scenarios.
arXiv Detail & Related papers (2022-09-19T23:05:36Z)
- Socially Compliant Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation [92.66286342108934]
Social navigation is the capability of an autonomous agent, such as a robot, to navigate in a 'socially compliant' manner in the presence of other intelligent agents such as humans.
Our dataset contains 8.7 hours, 138 trajectories, and 25 miles of socially compliant, human-teleoperated driving demonstrations.
arXiv Detail & Related papers (2022-03-28T19:09:11Z)
- Visual Navigation Among Humans with Optimal Control as a Supervisor [72.5188978268463]
We propose an approach that combines learning-based perception with model-based optimal control to navigate among humans.
Our approach is enabled by our novel data-generation tool, HumANav.
We demonstrate that the learned navigation policies can anticipate and react to humans without explicitly predicting future human motion.
arXiv Detail & Related papers (2020-03-20T16:13:47Z)