VSGM -- Enhance robot task understanding ability through visual semantic graph
- URL: http://arxiv.org/abs/2105.08959v1
- Date: Wed, 19 May 2021 07:22:31 GMT
- Title: VSGM -- Enhance robot task understanding ability through visual semantic graph
- Authors: Cheng Yu Tsai and Mu-Chun Su
- Abstract summary: We consider that giving robots an understanding of both visual and language semantics will improve their inference ability.
In this paper, we propose a novel method, VSGM, which uses a semantic graph to obtain better visual image features.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, developing AI for robotics has attracted much attention. The interaction of vision and language is particularly difficult for robots. We consider that giving robots an understanding of both visual and language semantics will improve their inference ability. In this paper, we propose a novel method, VSGM (Visual Semantic Graph Memory), which uses a semantic graph to obtain better visual image features and improve the robot's visual understanding ability. Given the robot's prior knowledge and the objects detected in the image, VSGM predicts the correlations between object attributes and between objects, converts them into a graph-based representation, and maps the objects in the image onto a top-down egocentric map. Finally, the object features important to the current task are extracted by Graph Neural Networks. The proposed method is evaluated on the ALFRED (Action Learning From Realistic Environments and Directives) dataset, in which the robot must perform daily indoor household tasks following language instructions. Adding VSGM to the model improves the task success rate by 6-10%.
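To make the pipeline above concrete (object detection, correlation prediction, graph construction, GNN feature extraction), here is a minimal sketch in plain PyTorch. It is not the authors' implementation: the class name, feature dimensions, and correlation matrix are illustrative assumptions. Detected objects become graph nodes, a predicted object/attribute correlation matrix supplies weighted edges, and one GCN-style message-passing step aggregates task-relevant object features.

```python
# Minimal, illustrative sketch of graph-based object feature extraction.
# NOT the authors' VSGM implementation; all names and shapes are assumptions.
import torch
import torch.nn as nn


class SimpleSemanticGraphLayer(nn.Module):
    """One GCN-style message-passing step over a dense adjacency matrix."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # node_feats: (N, in_dim) features of N detected objects
        # adj:        (N, N) predicted correlation weights between objects
        adj = adj + torch.eye(adj.size(0))             # add self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1e-6)
        msgs = (adj / deg) @ node_feats                # mean-aggregate neighbours
        return torch.relu(self.linear(msgs))


# Toy usage: 4 detected objects with 16-dim visual/attribute features.
objects = torch.randn(4, 16)
correlations = torch.tensor([[0., 1., 0., 0.],
                             [1., 0., 1., 0.],
                             [0., 1., 0., 1.],
                             [0., 0., 1., 0.]])
layer = SimpleSemanticGraphLayer(16, 32)
task_features = layer(objects, correlations)           # (4, 32) object features
```

In the described pipeline, such graph-aggregated features would additionally be combined with the top-down egocentric map before driving the task policy; the sketch covers only the graph-aggregation step.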
Related papers
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- WALL-E: Embodied Robotic WAiter Load Lifting with Large Language Model [92.90127398282209]
This paper investigates the potential of integrating the most recent Large Language Models (LLMs) with an existing visual grounding and robotic grasping system.
We introduce the WALL-E (Embodied Robotic WAiter load lifting with Large Language model) as an example of this integration.
We deploy this LLM-empowered system on the physical robot to provide a more user-friendly interface for the instruction-guided grasping task.
arXiv Detail & Related papers (2023-08-30T11:35:21Z) - RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic
Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- Language-Driven Representation Learning for Robotics [115.93273609767145]
Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks.
We introduce a framework for language-driven representation learning from human videos and captions.
We find that Voltron's language-driven learning outperforms the prior state-of-the-art, especially on targeted problems requiring higher-level control.
arXiv Detail & Related papers (2023-02-24T17:29:31Z)
- Enhancing Interpretability and Interactivity in Robot Manipulation: A Neurosymbolic Approach [0.0]
We present a neurosymbolic architecture for coupling language-guided visual reasoning with robot manipulation.
A non-expert human user can prompt the robot using unconstrained natural language, providing a referring expression (REF), a question (VQA) or a grasp action instruction.
We generate a 3D vision-and-language synthetic dataset of tabletop scenes in a simulation environment to train our approach and perform extensive evaluations in both synthetic and real-world scenes.
arXiv Detail & Related papers (2022-10-03T12:21:45Z)
- Graph Neural Networks for Relational Inductive Bias in Vision-based Deep Reinforcement Learning of Robot Control [0.0]
This work introduces a neural network architecture that combines relational inductive bias and visual feedback to learn an efficient position control policy.
We derive a graph representation that models the robot's internal state with a low-dimensional description of the visual scene generated by an image encoding network.
We show the ability of the model to improve sample efficiency for a 6-DoF robot arm in a visually realistic 3D environment.
arXiv Detail & Related papers (2022-03-11T15:11:54Z)
- Reasoning with Scene Graphs for Robot Planning under Partial Observability [7.121002367542985]
We develop an algorithm called scene analysis for robot planning (SARP) that enables robots to reason with visual contextual information.
Experiments have been conducted using multiple 3D environments in simulation, and a dataset collected by a real robot.
arXiv Detail & Related papers (2022-02-21T18:45:56Z)
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)