Talk-to-Resolve: Combining scene understanding and spatial dialogue to
resolve granular task ambiguity for a collocated robot
- URL: http://arxiv.org/abs/2111.11099v1
- Date: Mon, 22 Nov 2021 10:42:59 GMT
- Title: Talk-to-Resolve: Combining scene understanding and spatial dialogue to
resolve granular task ambiguity for a collocated robot
- Authors: Pradip Pramanick, Chayan Sarkar, Snehasis Banerjee, Brojeshwar
Bhowmick
- Abstract summary: The utility of collocating robots largely depends on an easy and intuitive interaction mechanism with the human.
We present a system called Talk-to-Resolve (TTR) that enables a robot to initiate a coherent dialogue exchange with the instructor.
Our system can identify such stalemates and resolve them with appropriate dialogue exchange with 82% accuracy.
- Score: 15.408128612723882
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The utility of collocating robots largely depends on an easy and intuitive
interaction mechanism with the human. If a robot accepts task instructions in
natural language, it first has to understand the user's intention by decoding
the instruction. However, while executing the task, the robot may face
unforeseeable circumstances due to variations in the observed scene and
therefore may require further user intervention. In this article, we present a
system called Talk-to-Resolve (TTR) that enables a robot to initiate a coherent
dialogue exchange with the instructor by observing the scene visually to
resolve the impasse. Through dialogue, it either finds a cue to move forward in
the original plan, an acceptable alternative to the original plan, or
affirmation to abort the task altogether. To recognize a possible stalemate, we
utilize the dense captions of the observed scene and the given instruction
jointly to compute the robot's next action. We evaluate our system on a
data set of initial instruction and situational scene pairs. Our system can
identify such stalemates and resolve them with appropriate dialogue exchange with
82% accuracy. Additionally, a user study reveals that the questions from our
system are more natural (4.02 on average on a scale of 1 to 5) than those of
a state-of-the-art baseline (3.08 on average).
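As an illustration of the decision step the abstract describes (dense captions of the observed scene and the instruction are combined to choose the robot's next action), the following minimal Python sketch shows one way such a control flow could look. All names (NextAction, SceneObservation, resolve_stalemate) and the keyword-overlap heuristic are illustrative assumptions and are not taken from the TTR paper.

# Minimal sketch of the decision loop described in the abstract: dense scene
# captions and the instruction are compared to decide whether to proceed,
# ask about an alternative, or seek affirmation to abort. The names and the
# keyword-overlap heuristic are illustrative assumptions, not the paper's model.

from dataclasses import dataclass
from enum import Enum, auto


class NextAction(Enum):
    EXECUTE = auto()          # a cue in the scene lets the original plan continue
    ASK_ALTERNATIVE = auto()  # negotiate an acceptable alternative via dialogue
    ABORT = auto()            # seek affirmation to abort the task


@dataclass
class SceneObservation:
    dense_captions: list[str]  # e.g. ["a red cup on the table", ...]


def resolve_stalemate(instruction: str, scene: SceneObservation) -> NextAction:
    """Toy stand-in for joint reasoning over dense captions and the instruction."""
    instruction_words = set(instruction.lower().split())
    # Keep the captions that mention any word from the instruction.
    matches = [
        caption for caption in scene.dense_captions
        if instruction_words & set(caption.lower().split())
    ]
    if len(matches) == 1:
        return NextAction.EXECUTE          # unambiguous grounding: carry on
    if len(matches) > 1:
        return NextAction.ASK_ALTERNATIVE  # ambiguity: start a dialogue
    return NextAction.ABORT                # nothing relevant found in the scene


if __name__ == "__main__":
    scene = SceneObservation(dense_captions=[
        "a blue mug on the counter",
        "a blue mug on the shelf",
    ])
    print(resolve_stalemate("bring me the blue mug", scene))
    # -> NextAction.ASK_ALTERNATIVE

The actual system computes the next action from the dense captions and the instruction jointly rather than by simple word overlap; the sketch only mirrors the three possible outcomes the abstract names: continue with the plan, negotiate an alternative, or seek affirmation to abort.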
Related papers
- Vocal Sandbox: Continual Learning and Adaptation for Situated Human-Robot Collaboration [64.6107798750142]
Vocal Sandbox is a framework for enabling seamless human-robot collaboration in situated environments.
We design lightweight and interpretable learning algorithms that allow users to build an understanding and co-adapt to a robot's capabilities in real-time.
We evaluate Vocal Sandbox in two settings: collaborative gift bag assembly and LEGO stop-motion animation.
arXiv Detail & Related papers (2024-11-04T20:44:40Z)
- Context-Aware Command Understanding for Tabletop Scenarios [1.7082212774297747]
This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios.
By integrating multiple sources of information, including speech, gestures, and scene context, the system extracts actionable instructions for a robot.
We discuss the strengths and limitations of the system, with particular focus on how it handles multimodal command interpretation.
arXiv Detail & Related papers (2024-10-08T20:46:39Z)
- SECURE: Semantics-aware Embodied Conversation under Unawareness for Lifelong Robot Learning [17.125080112897102]
This paper addresses a challenging interactive task learning scenario where the robot is unaware of a concept that's key to solving the instructed task.
We propose SECURE, an interactive task learning framework designed to solve such problems by fixing a deficient domain model using embodied conversation.
Using SECURE, the robot not only learns from the user's corrective feedback when it makes a mistake, but it also learns to make strategic dialogue decisions for revealing useful evidence about novel concepts for solving the instructed task.
arXiv Detail & Related papers (2024-09-26T11:40:07Z)
- Self-Explainable Affordance Learning with Embodied Caption [63.88435741872204]
We introduce Self-Explainable Affordance learning (SEA) with embodied caption.
SEA enables robots to articulate their intentions and bridge the gap between explainable vision-language caption and visual affordance learning.
We propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner.
arXiv Detail & Related papers (2024-04-08T15:22:38Z)
- Real-time Addressee Estimation: Deployment of a Deep-Learning Model on the iCub Robot [52.277579221741746]
Addressee Estimation is a skill essential for social robots to interact smoothly with humans.
Inspired by human perceptual skills, a deep-learning model for Addressee Estimation is designed, trained, and deployed on an iCub robot.
The study presents the procedure of such implementation and the performance of the model deployed in real-time human-robot interaction.
arXiv Detail & Related papers (2023-11-09T13:01:21Z)
- Proactive Human-Robot Interaction using Visuo-Lingual Transformers [0.0]
Humans possess the innate ability to extract latent visuo-lingual cues to infer context through human interaction.
We propose a learning-based method that uses visual cues from the scene, lingual commands from a user and knowledge of prior object-object interaction to identify and proactively predict the underlying goal the user intends to achieve.
arXiv Detail & Related papers (2023-10-04T00:50:21Z)
- "No, to the Right" -- Online Language Corrections for Robotic Manipulation via Shared Autonomy [70.45420918526926]
We present LILAC, a framework for incorporating and adapting to natural language corrections online during execution.
Instead of discrete turn-taking between a human and robot, LILAC splits agency between the human and robot.
We show that our corrections-aware approach obtains higher task completion rates, and is subjectively preferred by users.
arXiv Detail & Related papers (2023-01-06T15:03:27Z)
- Instruction-driven history-aware policies for robotic manipulations [82.25511767738224]
We propose a unified transformer-based approach that takes into account multiple inputs.
In particular, our transformer architecture integrates (i) natural language instructions and (ii) multi-view scene observations.
We evaluate our method on the challenging RLBench benchmark and on a real-world robot.
arXiv Detail & Related papers (2022-09-11T16:28:25Z)
- Correcting Robot Plans with Natural Language Feedback [88.92824527743105]
We explore natural language as an expressive and flexible tool for robot correction.
We show that these transformations enable users to correct goals, update robot motions, and recover from planning errors.
Our method makes it possible to compose multiple constraints and generalizes to unseen scenes, objects, and sentences in simulated environments and real-world environments.
arXiv Detail & Related papers (2022-04-11T15:22:43Z)
- Scene Editing as Teleoperation: A Case Study in 6DoF Kit Assembly [18.563562557565483]
We propose the framework "Scene Editing as Teleoperation" (SEaT)
Instead of controlling the robot, users focus on specifying the task's goal.
A user can perform teleoperation without any expert knowledge of the robot hardware.
arXiv Detail & Related papers (2021-10-09T04:22:21Z)
- Composing Pick-and-Place Tasks By Grounding Language [41.075844857146805]
We present a robot system that follows unconstrained language instructions to pick and place arbitrary objects.
Our approach infers objects and their relationships from input images and language expressions.
Results obtained using a real-world PR2 robot demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2021-02-16T11:29:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.