Enhancing Interpretability and Interactivity in Robot Manipulation: A
Neurosymbolic Approach
- URL: http://arxiv.org/abs/2210.00858v3
- Date: Sun, 7 May 2023 17:06:49 GMT
- Title: Enhancing Interpretability and Interactivity in Robot Manipulation: A
Neurosymbolic Approach
- Authors: Georgios Tziafas, Hamidreza Kasaei
- Abstract summary: We present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation.
A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA), or a grasp action instruction.
We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper we present a neurosymbolic architecture for coupling
language-guided visual reasoning with robot manipulation. A non-expert human
user can prompt the robot using unconstrained natural language, providing a
referring expression (REF), a question (VQA), or a grasp action instruction.
The system tackles all cases in a task-agnostic fashion through the utilization
of a shared library of primitive skills. Each primitive handles an independent
sub-task, such as reasoning about visual attributes, spatial relation
comprehension, logic and enumeration, as well as arm control. A language parser
maps the input query to an executable program composed of such primitives,
depending on the context. While some primitives are purely symbolic operations
(e.g. counting), others are trainable neural functions (e.g. visual grounding),
therefore marrying the interpretability and systematic generalization benefits
of discrete symbolic approaches with the scalability and representational power
of deep networks. We generate a 3D vision-and-language synthetic dataset of
tabletop scenes in a simulation environment to train our approach and perform
extensive evaluations in both synthetic and real-world scenes. Results showcase
the benefits of our approach in terms of accuracy, sample-efficiency, and
robustness to the user's vocabulary, while being transferable to real-world
scenes with few-shot visual fine-tuning. Finally, we integrate our method with
a robot framework and demonstrate how it can serve as an interpretable solution
for an interactive object-picking task, both in simulation and with a real
robot. We make our datasets available at
https://gtziafas.github.io/neurosymbolic-manipulation.
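To make the abstract's "program of primitives" idea concrete, below is a minimal, hypothetical Python sketch of how a parsed query might execute as a chain of primitive skills, some purely symbolic (filtering, counting) and one standing in for a trainable neural module (visual grounding). All class, function, and query names are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch: a parsed query becomes a program of primitive skills.
# Symbolic primitives are plain functions; the "neural" primitive is stubbed
# with a keyword match purely to keep the example runnable.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class SceneObject:
    name: str
    color: str

# --- purely symbolic primitives ---------------------------------------------
def filter_color(objs: List[SceneObject], color: str) -> List[SceneObject]:
    """Symbolic attribute filter."""
    return [o for o in objs if o.color == color]

def count(objs: List[SceneObject]) -> int:
    """Symbolic enumeration."""
    return len(objs)

# --- placeholder for a trainable neural primitive ----------------------------
def ground_referring_expression(objs: List[SceneObject], expr: str) -> List[SceneObject]:
    """In the paper this would be a trained visual-grounding module;
    here it is a trivial keyword match (assumption, for illustration only)."""
    return [o for o in objs if expr in f"{o.color} {o.name}"]

# --- executing a parsed program ----------------------------------------------
def run_program(program: List[Callable], scene: List[SceneObject]):
    """Chain primitives left to right, feeding each output to the next."""
    state = scene
    for step in program:
        state = step(state)
    return state

if __name__ == "__main__":
    scene = [
        SceneObject("mug", "red"),
        SceneObject("box", "blue"),
        SceneObject("mug", "blue"),
    ]
    # Hypothetical parse of "how many blue objects are there?" (VQA case)
    program = [lambda s: filter_color(s, "blue"), count]
    print(run_program(program, scene))  # -> 2
    # Hypothetical parse of "the red mug" (REF case), using the stubbed neural primitive
    print([o.name for o in ground_referring_expression(scene, "red mug")])  # -> ['mug']
```

In the paper's setup the language parser would emit such a program from the user's query, so the same primitive library serves REF, VQA, and grasping queries in a task-agnostic way.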
Related papers
- Context-Aware Command Understanding for Tabletop Scenarios [1.7082212774297747]
This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios.
By integrating multiple sources of information, including speech, gestures, and scene context, the system extracts actionable instructions for a robot.
We discuss the strengths and limitations of the system, with particular focus on how it handles multimodal command interpretation.
arXiv Detail & Related papers (2024-10-08T20:46:39Z)
- Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models [53.22792173053473]
We introduce an interactive robotic manipulation framework called Polaris.
Polaris integrates perception and interaction by utilizing GPT-4 alongside grounded vision models.
We propose a novel Synthetic-to-Real (Syn2Real) pose estimation pipeline.
arXiv Detail & Related papers (2024-08-15T06:40:38Z)
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
The experimental results demonstrate that MPI improves by 10% to 64% over the previous state-of-the-art on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- Language-Driven Representation Learning for Robotics [115.93273609767145]
Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks.
We introduce a framework for language-driven representation learning from human videos and captions.
We find that Voltron's language-driven learning outperforms the prior state-of-the-art, especially on targeted problems requiring higher-level control.
arXiv Detail & Related papers (2023-02-24T17:29:31Z)
- Emergence of Machine Language: Towards Symbolic Intelligence with Neural Networks [73.94290462239061]
We propose to combine symbolism and connectionism principles by using neural networks to derive a discrete representation.
By designing an interactive environment and task, we demonstrated that machines could generate a spontaneous, flexible, and semantic language.
arXiv Detail & Related papers (2022-01-14T14:54:58Z)
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
- VSGM -- Enhance robot task understanding ability through visual semantic graph [0.0]
We consider that giving robots an understanding of visual semantics and language semantics will improve inference ability.
In this paper, we propose a novel method, VSGM, which uses the semantic graph to obtain better visual image features.
arXiv Detail & Related papers (2021-05-19T07:22:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.