Situated Instruction Following
- URL: http://arxiv.org/abs/2407.12061v1
- Date: Mon, 15 Jul 2024 19:32:30 GMT
- Title: Situated Instruction Following
- Authors: So Yeon Min, Xavi Puig, Devendra Singh Chaplot, Tsung-Yen Yang, Akshara Rai, Priyam Parashar, Ruslan Salakhutdinov, Yonatan Bisk, Roozbeh Mottaghi
- Abstract summary: We propose situated instruction following, which embraces the inherent underspecification and ambiguity of real-world communication.
The meaning of situated instructions naturally unfolds through the past actions and the expected future behaviors of the human involved.
Our experiments indicate that state-of-the-art Embodied Instruction Following (EIF) models lack a holistic understanding of situated human intention.
- Score: 87.37244711380411
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language is never spoken in a vacuum. It is expressed, comprehended, and contextualized within the holistic backdrop of the speaker's history, actions, and environment. Since humans are used to communicating efficiently with situated language, the practicality of robotic assistants hinges on their ability to understand and act upon implicit and situated instructions. In traditional instruction-following paradigms, the agent acts alone in an empty house, leading to language use that is both simplified and artificially "complete." In contrast, we propose situated instruction following, which embraces the inherent underspecification and ambiguity of real-world communication with the physical presence of a human speaker. The meaning of situated instructions naturally unfolds through the past actions and the expected future behaviors of the human involved. Specifically, our setting features instructions that (1) are ambiguously specified, (2) have temporally evolving intent, and (3) can be interpreted more precisely through the agent's own dynamic actions. Our experiments indicate that state-of-the-art Embodied Instruction Following (EIF) models lack a holistic understanding of situated human intention.
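To make the three properties concrete, here is a minimal sketch of how a situated-instruction episode might be represented in code. The class names, fields, and resolution logic are illustrative assumptions for exposition, not the paper's actual benchmark format.

```python
# A minimal, hypothetical sketch of a situated-instruction episode.
# Field names and structure are illustrative assumptions, not the
# paper's actual benchmark format.
from dataclasses import dataclass, field


@dataclass
class HumanObservation:
    """A timestamped snapshot of the human speaker that the agent can observe."""
    time: float
    position: tuple            # human's location in the scene, e.g. (x, y)
    held_object: str | None    # object the human is carrying, if any


@dataclass
class SituatedInstruction:
    """An instruction whose meaning depends on the human's past and future behavior."""
    utterance: str                           # e.g. "Bring that to the kitchen" -- ambiguous
    human_trajectory: list[HumanObservation] = field(default_factory=list)

    def candidate_referents(self, scene_objects: list[str]) -> list[str]:
        # Property (1): the utterance alone underdetermines the referent;
        # every visible object is initially a candidate.
        return list(scene_objects)

    def refine_with_human(self, obs: HumanObservation) -> str | None:
        # Property (2): intent evolves over time -- a new human observation
        # (e.g. the human picks something up) can resolve the ambiguity.
        self.human_trajectory.append(obs)
        return obs.held_object

# Property (3) -- refinement through the agent's own actions -- would be driven
# by the agent moving to observe the human, which this sketch leaves abstract.


if __name__ == "__main__":
    instr = SituatedInstruction("Bring that to the kitchen")
    print(instr.candidate_referents(["mug", "book", "plant"]))            # all ambiguous
    print(instr.refine_with_human(HumanObservation(3.0, (1, 2), "mug")))  # "mug"
```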
Related papers
- Infer Human's Intentions Before Following Natural Language Instructions [24.197496779892383]
We propose a new framework, Follow Instructions with Social and Embodied Reasoning (FISER), aiming for better natural language instruction following in collaborative tasks.
Our framework makes explicit inferences about human goals and intentions as intermediate reasoning steps.
We empirically demonstrate that using social reasoning to explicitly infer human intentions before making action plans surpasses purely end-to-end approaches.
arXiv Detail & Related papers (2024-09-26T17:19:49Z)
- SIFToM: Robust Spoken Instruction Following through Theory of Mind [51.326266354164716]
We present a cognitively inspired model, Speech Instruction Following through Theory of Mind (SIFToM), to enable robots to pragmatically follow human instructions under diverse speech conditions.
Results show that the SIFToM model outperforms state-of-the-art speech and language models, approaching human-level accuracy on challenging speech instruction following tasks.
arXiv Detail & Related papers (2024-09-17T02:36:10Z)
- ThinkBot: Embodied Instruction Following with Thought Chain Reasoning [66.09880459084901]
Embodied Instruction Following (EIF) requires agents to complete human instructions by interacting with objects in complex surrounding environments.
We propose ThinkBot, which reasons over the thought chain in human instructions to recover missing action descriptions.
Our ThinkBot outperforms the state-of-the-art EIF methods by a sizable margin in both success rate and execution efficiency.
arXiv Detail & Related papers (2023-12-12T08:30:09Z)
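The ThinkBot entry above centers on recovering action descriptions that an instruction leaves implicit. As a hedged sketch of that general idea (not ThinkBot's actual pipeline), one can prompt a language model to expand a terse instruction into explicit intermediate steps; the prompt wording and the stubbed model call below are assumptions so the example runs without external dependencies.

```python
# Hedged sketch of thought-chain recovery: ask a language model to spell out
# the intermediate actions a terse instruction leaves implicit. NOT ThinkBot's
# actual pipeline; prompt wording and the stubbed model are assumptions.

RECOVERY_PROMPT = """Instruction: {instruction}
The instruction skips steps the agent must still perform.
List every intermediate action, one per line, in execution order."""


def stub_llm(prompt: str) -> str:
    # Placeholder for a real LLM call; returns a canned completion.
    return ("1. Walk to the kitchen counter\n"
            "2. Pick up the mug\n"
            "3. Carry the mug to the table\n"
            "4. Put the mug down")


def recover_action_chain(instruction: str) -> list[str]:
    """Expand an under-specified instruction into explicit action steps."""
    completion = stub_llm(RECOVERY_PROMPT.format(instruction=instruction))
    return [line.split(". ", 1)[-1] for line in completion.splitlines()]


if __name__ == "__main__":
    for step in recover_action_chain("Put the mug on the table"):
        print(step)
```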
- Real-time Addressee Estimation: Deployment of a Deep-Learning Model on the iCub Robot [52.277579221741746]
Addressee Estimation is a skill essential for social robots to interact smoothly with humans.
Inspired by human perceptual skills, a deep-learning model for Addressee Estimation is designed, trained, and deployed on an iCub robot.
The study presents the procedure of such implementation and the performance of the model deployed in real-time human-robot interaction.
arXiv Detail & Related papers (2023-11-09T13:01:21Z)
- "No, to the Right" -- Online Language Corrections for Robotic Manipulation via Shared Autonomy [70.45420918526926]
We present LILAC, a framework for incorporating and adapting to natural language corrections online during execution.
Instead of discrete turn-taking between a human and robot, LILAC splits agency between the human and robot.
We show that our corrections-aware approach obtains higher task completion rates, and is subjectively preferred by users.
arXiv Detail & Related papers (2023-01-06T15:03:27Z)
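The LILAC entry above replaces discrete turn-taking with continuously shared agency. A minimal sketch of that control pattern, assuming a hypothetical phrase-to-offset mapping and blending weight (not LILAC's actual method), might look like this:

```python
# Minimal sketch of shared autonomy with online language corrections, in the
# spirit of (but not identical to) LILAC: the robot keeps executing its nominal
# plan while utterances like "no, to the right" are mapped to low-level offsets
# and blended into the commanded action. The phrase-to-offset table and the
# blending weight alpha are illustrative assumptions.

CORRECTION_OFFSETS = {
    "to the right": (0.05, 0.0),
    "to the left": (-0.05, 0.0),
    "forward": (0.0, 0.05),
    "back": (0.0, -0.05),
}


def blend(nominal: tuple, correction: str | None, alpha: float = 0.7) -> tuple:
    """Split agency: mix the robot's nominal action with the human's correction."""
    if correction is None:
        return nominal
    dx, dy = CORRECTION_OFFSETS.get(correction, (0.0, 0.0))
    # The robot keeps (1 - alpha) of its own motion; the correction contributes alpha.
    return ((1 - alpha) * nominal[0] + alpha * dx,
            (1 - alpha) * nominal[1] + alpha * dy)


if __name__ == "__main__":
    plan = [(0.02, 0.0), (0.02, 0.0), (0.02, 0.0)]   # robot's nominal actions
    utterances = [None, "to the right", None]        # corrections arrive mid-execution
    for action, said in zip(plan, utterances):
        print(blend(action, said))
```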
- How to talk so your robot will learn: Instructions, descriptions, and pragmatics [14.289220844201695]
We study how a human might communicate preferences over behaviors.
We show that in traditional reinforcement learning settings, pragmatic social learning can integrate with and accelerate individual learning.
Our findings suggest that social learning from a wider range of language is a promising approach for value alignment and reinforcement learning more broadly.
arXiv Detail & Related papers (2022-06-16T01:33:38Z)
- Scene-Intuitive Agent for Remote Embodied Visual Grounding [89.73786309180139]
Humans learn from life events to form intuitions for understanding visual environments and language.
We present an agent that mimics such human behavior.
arXiv Detail & Related papers (2021-03-24T02:37:48Z)
- Language-Conditioned Imitation Learning for Robot Manipulation Tasks [39.40937105264774]
We introduce a method for incorporating unstructured natural language into imitation learning.
At training time, the expert can provide demonstrations along with verbal descriptions in order to describe the underlying intent.
The training process then interrelates these two modalities to encode the correlations between language, perception, and motion.
The resulting language-conditioned visuomotor policies can be conditioned at runtime on new human commands and instructions.
arXiv Detail & Related papers (2020-10-22T21:49:08Z)
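The last entry above describes policies that fuse language and perception to produce motion. A compact sketch of such a language-conditioned visuomotor policy, with architecture sizes and fusion scheme chosen purely for illustration (not the paper's actual model), could look like this in PyTorch:

```python
# Compact sketch of a language-conditioned visuomotor policy: language and
# perception are fused into one embedding that conditions the predicted motion.
# Sizes and fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn


class LanguageConditionedPolicy(nn.Module):
    def __init__(self, vocab_size=1000, lang_dim=64, action_dim=7):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, lang_dim)    # word embeddings
        self.vision = nn.Sequential(                       # small image encoder
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),         # -> (B, 32)
        )
        self.head = nn.Sequential(                         # fused features -> action
            nn.Linear(lang_dim + 32, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, tokens, image):
        lang = self.embed(tokens).mean(dim=1)              # mean-pool the command
        vis = self.vision(image)
        return self.head(torch.cat([lang, vis], dim=-1))   # next robot action


if __name__ == "__main__":
    policy = LanguageConditionedPolicy()
    tokens = torch.randint(0, 1000, (1, 6))                # tokenized command
    image = torch.randn(1, 3, 64, 64)                      # current camera frame
    print(policy(tokens, image).shape)                     # torch.Size([1, 7])
```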
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.