Pragmatic Embodied Spoken Instruction Following in Human-Robot Collaboration with Theory of Mind
- URL: http://arxiv.org/abs/2409.10849v2
- Date: Mon, 06 Oct 2025 16:05:39 GMT
- Title: Pragmatic Embodied Spoken Instruction Following in Human-Robot Collaboration with Theory of Mind
- Authors: Lance Ying, Xinyi Li, Shivam Aarya, Yizirui Fang, Yifan Yin, Jason Xinyu Liu, Stefanie Tellex, Joshua B. Tenenbaum, Tianmin Shu
- Abstract summary: We present a cognitively inspired neurosymbolic model, Spoken Instruction Following through Theory of Mind (SIFToM). SIFToM uses a Vision-Language Model with model-based mental inference to enable robots to pragmatically follow human instructions under diverse speech conditions. Results show that SIFToM can significantly improve the performance of a lightweight base VLM (Gemini 2.5 Flash), outperforming state-of-the-art VLMs (Gemini 2.5 Pro) and approaching human-level accuracy on challenging spoken instruction following tasks.
- Score: 51.45478233267092
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spoken language instructions are ubiquitous in agent collaboration. However, in real-world human-robot collaboration, following human spoken instructions can be challenging due to various speaker and environmental factors, such as background noise or mispronunciation. When faced with noisy auditory inputs, humans can leverage the collaborative context in the embodied environment to interpret noisy spoken instructions and take pragmatic assistive actions. In this paper, we present a cognitively inspired neurosymbolic model, Spoken Instruction Following through Theory of Mind (SIFToM), which leverages a Vision-Language Model with model-based mental inference to enable robots to pragmatically follow human instructions under diverse speech conditions. We test SIFToM in both simulated environments (VirtualHome) and real-world human-robot collaborative settings with human evaluations. Results show that SIFToM can significantly improve the performance of a lightweight base VLM (Gemini 2.5 Flash), outperforming state-of-the-art VLMs (Gemini 2.5 Pro) and approaching human-level accuracy on challenging spoken instruction following tasks.
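The abstract describes the architecture only at a high level. As a rough illustration of the underlying idea (weighing noisy speech hypotheses against what would make sense in the shared task context), here is a minimal, hypothetical Python sketch. The `Hypothesis` class, `goal_likelihood`, and `interpret_instruction` are invented for illustration and are not taken from the SIFToM paper; in the actual system the contextual term would come from VLM perception and model-based mental inference rather than keyword overlap.

```python
# Hypothetical sketch (not the authors' code): rerank noisy speech hypotheses
# by how well they fit the shared task context, in the spirit of pragmatic
# instruction following. All names below are invented for illustration.
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str         # candidate transcription from the speech recognizer
    asr_score: float  # acoustic/language-model confidence in [0, 1]

def goal_likelihood(text: str, scene_objects: list[str]) -> float:
    """Toy stand-in for model-based mental inference: an instruction is more
    plausible if it refers to objects actually present in the shared scene."""
    words = set(text.lower().split())
    mentioned = sum(1 for obj in scene_objects if obj in words)
    return (1 + mentioned) / (1 + len(scene_objects))

def interpret_instruction(hypotheses: list[Hypothesis],
                          scene_objects: list[str]) -> Hypothesis:
    """Pick the hypothesis that best trades off acoustic evidence against
    contextual plausibility (a crude proxy for Theory-of-Mind inference)."""
    return max(hypotheses,
               key=lambda h: h.asr_score * goal_likelihood(h.text, scene_objects))

if __name__ == "__main__":
    noisy = [
        Hypothesis("put the cap in the sink", 0.55),  # mis-heard, higher ASR score
        Hypothesis("put the cup in the sink", 0.45),  # correct but lower ASR score
    ]
    best = interpret_instruction(noisy, scene_objects=["cup", "plate", "sink"])
    print(best.text)  # the contextually plausible reading wins
```

The sketch only shows where a contextual plausibility term enters the decision; in the paper's system that term is supplied by a Vision-Language Model and model-based mental inference rather than a keyword heuristic.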
Related papers
- Inner Speech as Behavior Guides: Steerable Imitation of Diverse Behaviors for Human-AI coordination [21.33828337975933]
Imitation learning has emerged as one of the prominent approaches for building such agents. We propose MIMIC, a framework that uses language as an internal representation of behavioral intent. MIMIC enables fine-grained steering of behavior at inference time by conditioning the agent on behavior-specific speech.
arXiv Detail & Related papers (2026-02-24T03:37:42Z) - An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction [0.0]
This work presents a novel HRI framework that combines advanced vision-language models, speech processing, and fuzzy logic. The proposed system integrates Florence-2 for object detection, Llama 3.1 for natural language understanding, and Whisper for speech recognition (a minimal, hypothetical sketch of this kind of speech-to-action stack appears after this list). Experimental evaluations conducted on consumer-grade hardware demonstrate a command execution accuracy of 75%.
arXiv Detail & Related papers (2026-02-23T09:05:15Z) - BoSS: Beyond-Semantic Speech [43.96461266560891]
Beyond-Semantic Speech (BoSS) refers to the set of information in speech communication that encompasses but transcends explicit semantics. We present a formalized framework for BoSS, leveraging cognitive relevance theories and machine learning models to analyze temporal and contextual speech dynamics. These findings highlight the need for advancing BoSS research to enable richer, more context-aware human-machine communication.
arXiv Detail & Related papers (2025-07-23T14:53:50Z) - Situated Instruction Following [87.37244711380411]
We propose situated instruction following, which embraces the inherent underspecification and ambiguity of real-world communication.
The meaning of situated instructions naturally unfolds through the past actions and the expected future behaviors of the human involved.
Our experiments indicate that state-of-the-art Embodied Instruction Following (EIF) models lack holistic understanding of situated human intention.
arXiv Detail & Related papers (2024-07-15T19:32:30Z) - Human-Object Interaction from Human-Level Instructions [17.10279738828331]
We propose the first complete system for synthesizing human-object interactions for object manipulation in contextual environments. We leverage large language models (LLMs) to interpret the input instructions into detailed execution plans. Unlike prior work, our system is capable of generating detailed finger-object interactions, in seamless coordination with full-body movements.
arXiv Detail & Related papers (2024-06-25T17:46:28Z) - Real-time Addressee Estimation: Deployment of a Deep-Learning Model on the iCub Robot [52.277579221741746]
Addressee Estimation is a skill essential for social robots to interact smoothly with humans.
Inspired by human perceptual skills, a deep-learning model for Addressee Estimation is designed, trained, and deployed on an iCub robot.
The study presents the procedure of such implementation and the performance of the model deployed in real-time human-robot interaction.
arXiv Detail & Related papers (2023-11-09T13:01:21Z) - Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents [5.244401764969407]
Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread.
We propose a novel framework that can generate sequences of joint angles from the speech text and speech audio utterances.
arXiv Detail & Related papers (2023-09-17T18:46:25Z) - Speaking the Language of Your Listener: Audience-Aware Adaptation via Plug-and-Play Theory of Mind [4.052000839878213]
We model a visually grounded referential game between a knowledgeable speaker and a listener with more limited visual and linguistic experience.
We endow our speaker with the ability to adapt its referring expressions via a simulation module that monitors the effectiveness of planned utterances from the listener's perspective.
arXiv Detail & Related papers (2023-05-31T15:17:28Z) - Learning Human-to-Robot Handovers from Point Clouds [63.18127198174958]
We propose the first framework to learn control policies for vision-based human-to-robot handovers.
We show significant performance gains over baselines on a simulation benchmark, sim-to-sim transfer and sim-to-real transfer.
arXiv Detail & Related papers (2023-03-30T17:58:36Z) - Computational Language Acquisition with Theory of Mind [84.2267302901888]
We build language-learning agents equipped with Theory of Mind (ToM) and measure its effects on the learning process.
We find that training speakers with a highly weighted ToM listener component leads to performance gains in our image referential game setting.
arXiv Detail & Related papers (2023-03-02T18:59:46Z) - Perception Point: Identifying Critical Learning Periods in Speech for Bilingual Networks [58.24134321728942]
We compare and identify cognitive aspects of deep neural-network-based visual lip-reading models.
We observe a strong correlation between these theories in cognitive psychology and our unique modeling.
arXiv Detail & Related papers (2021-10-13T05:30:50Z) - Co-GAIL: Learning Diverse Strategies for Human-Robot Collaboration [51.268988527778276]
We present a method for learning a human-robot collaboration policy from human-human collaboration demonstrations.
Our method co-optimizes a human policy and a robot policy in an interactive learning process.
arXiv Detail & Related papers (2021-08-13T03:14:43Z) - Few-shot Language Coordination by Modeling Theory of Mind [95.54446989205117]
We study the task of few-shot language coordination.
We require the lead agent to coordinate with a population of agents with different linguistic abilities.
This requires the ability to model the partner's beliefs, a vital component of human communication.
arXiv Detail & Related papers (2021-07-12T19:26:11Z) - Language-Conditioned Imitation Learning for Robot Manipulation Tasks [39.40937105264774]
We introduce a method for incorporating unstructured natural language into imitation learning.
At training time, the expert can provide demonstrations along with verbal descriptions in order to describe the underlying intent.
The training process then interrelates these two modalities to encode the correlations between language, perception, and motion.
The resulting language-conditioned visuomotor policies can be conditioned at runtime on new human commands and instructions.
arXiv Detail & Related papers (2020-10-22T21:49:08Z) - Joint Mind Modeling for Explanation Generation in Complex Human-Robot Collaborative Tasks [83.37025218216888]
We propose a novel explainable AI (XAI) framework for achieving human-like communication in human-robot collaborations.
The robot builds a hierarchical mind model of the human user and generates explanations of its own mind as a form of communication.
Results show that the generated explanations of our approach significantly improve collaboration performance and the user's perception of the robot.
arXiv Detail & Related papers (2020-07-24T23:35:03Z)
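As referenced above, here is a minimal, hypothetical sketch of the kind of speech-to-action stack described in "An Approach to Combining Video and Speech with Large Language Models in Human-Robot Interaction". Only the Whisper transcription call uses a real library API (openai-whisper); `parse_command`, `detect_objects`, and the confidence gate are illustrative stubs standing in for Llama 3.1, Florence-2, and the fuzzy-logic component, and none of this is the authors' implementation.

```python
# Hypothetical speech -> language -> vision pipeline in the spirit of the
# video+speech HRI entry above. Only the Whisper call uses a real library API;
# parse_command, detect_objects, and the confidence gate are illustrative stubs.
import whisper

def transcribe(audio_path: str) -> str:
    """Speech recognition with openai-whisper (real API)."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    return result["text"].strip()

def parse_command(utterance: str) -> dict:
    """Stub for an LLM (e.g. Llama 3.1) mapping an utterance to an action frame."""
    words = utterance.lower().split()
    return {"action": words[0] if words else None,
            "target": words[-1] if words else None}

def detect_objects(image_path: str) -> dict[str, float]:
    """Stub for a vision-language detector (e.g. Florence-2): label -> confidence."""
    return {"cup": 0.82, "plate": 0.41}

def execute_if_confident(command: dict, detections: dict[str, float],
                         threshold: float = 0.6) -> str:
    """Fuzzy-style gate: only act when the target is detected confidently enough."""
    conf = detections.get(command["target"], 0.0)
    if conf >= threshold:
        return f"executing '{command['action']}' on '{command['target']}' (conf={conf:.2f})"
    return f"asking for clarification about '{command['target']}' (conf={conf:.2f})"

if __name__ == "__main__":
    utterance = transcribe("command.wav")            # e.g. "grab the cup"
    command = parse_command(utterance)
    detections = detect_objects("camera_frame.png")
    print(execute_if_confident(command, detections))
```

The 75% command-execution accuracy quoted in that entry refers to the authors' full system; this skeleton only shows how such components would chain together.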