Following Instructions by Imagining and Reaching Visual Goals
- URL: http://arxiv.org/abs/2001.09373v1
- Date: Sat, 25 Jan 2020 23:26:56 GMT
- Title: Following Instructions by Imagining and Reaching Visual Goals
- Authors: John Kanu, Eadom Dessalene, Xiaomin Lin, Cornelia Fermuller, Yiannis Aloimonos
- Abstract summary: We present a novel framework for learning to perform temporally extended tasks using spatial reasoning.
Our framework operates on raw pixel images, assumes no prior linguistic or perceptual knowledge, and learns via intrinsic motivation.
We validate our method in two simulated interactive 3D environments involving a robot arm.
- Score: 8.19944635961041
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While traditional methods for instruction-following typically assume prior
linguistic and perceptual knowledge, many recent works in reinforcement
learning (RL) have proposed learning policies end-to-end, typically by training
neural networks to map joint representations of observations and instructions
directly to actions. In this work, we present a novel framework for learning to
perform temporally extended tasks using spatial reasoning in the RL framework,
by sequentially imagining visual goals and choosing appropriate actions to
fulfill imagined goals. Our framework operates on raw pixel images, assumes no
prior linguistic or perceptual knowledge, and learns via intrinsic motivation
and a single extrinsic reward signal measuring task completion. We validate our
method in two simulated interactive 3D environments involving a robot arm. On
object arrangement tasks, our method outperforms two flat architectures, using
raw-pixel and ground-truth states respectively, as well as a hierarchical
architecture using ground-truth states.
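The abstract describes a hierarchy in which a high-level module imagines a visual goal from the instruction and the current observation, and a low-level policy acts to reach that imagined goal. The sketch below illustrates only that control pattern; the module shapes, names (`GoalImaginer`, `GoalReachingPolicy`), and the pre-encoded instruction vector are assumptions for illustration, not the authors' released implementation.
```python
# Minimal sketch of the imagine-then-reach control pattern described in the
# abstract. All module designs and names here are illustrative assumptions.
import torch
import torch.nn as nn

class GoalImaginer(nn.Module):
    """High-level module: (instruction, observation) -> imagined goal image."""
    def __init__(self, instr_dim=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
        )
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(32 + instr_dim, 16, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),            # 32 -> 64
        )

    def forward(self, obs, instr):
        h = self.enc(obs)                                            # (B, 32, 16, 16)
        instr_map = instr[:, :, None, None].expand(-1, -1, h.shape[2], h.shape[3])
        return self.dec(torch.cat([h, instr_map], dim=1))           # imagined goal image

class GoalReachingPolicy(nn.Module):
    """Low-level policy: (observation, goal image) -> continuous action."""
    def __init__(self, action_dim=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
            nn.Linear(128, action_dim), nn.Tanh(),
        )

    def forward(self, obs, goal):
        return self.net(torch.cat([obs, goal], dim=1))

# One step of the hierarchical loop on dummy data (64x64 RGB observation and a
# pre-encoded instruction vector); in the paper both modules are trained with RL,
# intrinsic motivation, and a sparse task-completion reward.
obs, instr = torch.rand(1, 3, 64, 64), torch.rand(1, 32)
imaginer, policy = GoalImaginer(), GoalReachingPolicy()
goal = imaginer(obs, instr)
action = policy(obs, goal)
print(goal.shape, action.shape)  # torch.Size([1, 3, 64, 64]) torch.Size([1, 4])
```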
Related papers
- TIPS: Text-Image Pretraining with Spatial Awareness [13.38247732379754]
Self-supervised image-only pretraining is still the go-to method for many vision applications.
We propose a novel general-purpose image-text model, which can be effectively used off-the-shelf for dense and global vision tasks.
arXiv Detail & Related papers (2024-10-21T21:05:04Z)
- DecisionNCE: Embodied Multimodal Representations via Implicit Preference Learning [27.705230758809094]
Multimodal pretraining is an effective strategy for the trinity of goals of representation learning in autonomous robots.
We propose a universal unified objective that can simultaneously extract meaningful task progression information from image sequences.
DecisionNCE provides an embodied representation learning framework that elegantly extracts both local and global task progression features.
arXiv Detail & Related papers (2024-02-28T07:58:24Z)
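The objective mentioned above is contrastive in flavor; as a rough illustration, the snippet below sketches a generic InfoNCE-style loss that aligns the change between two frames of a trajectory with its instruction embedding. The function, its arguments, and the frame-difference readout are assumptions, not the exact DecisionNCE formulation.
```python
# Generic InfoNCE-style sketch: align the change between an early and a late
# frame of a trajectory with the embedding of its language instruction.
import torch
import torch.nn.functional as F

def sequence_language_infonce(frame_start, frame_end, lang_emb, temperature=0.1):
    """frame_start, frame_end: (B, D) visual embeddings of early/late frames.
    lang_emb: (B, D) embeddings of the matching instructions (assumed given)."""
    progress = F.normalize(frame_end - frame_start, dim=-1)   # task-progression direction
    lang = F.normalize(lang_emb, dim=-1)
    logits = progress @ lang.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(logits.shape[0])                    # matched pairs on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage with random embeddings.
b, d = 8, 128
loss = sequence_language_infonce(torch.randn(b, d), torch.randn(b, d), torch.randn(b, d))
print(float(loss))
```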
- Graphical Object-Centric Actor-Critic [55.2480439325792]
We propose a novel object-centric reinforcement learning algorithm combining actor-critic and model-based approaches.
We use a transformer encoder to extract object representations and graph neural networks to approximate the dynamics of an environment.
Our algorithm outperforms the state-of-the-art model-free actor-critic algorithm in a visually complex 3D robotic environment and in a 2D environment with compositional structure.
arXiv Detail & Related papers (2023-10-26T06:05:12Z)
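The entry above pairs a transformer encoder over object representations with graph neural networks that model environment dynamics. A minimal sketch of that combination follows; the slot dimensions, the fully connected message passing, and the residual update are assumptions rather than the authors' architecture.
```python
# Illustrative pairing of a transformer encoder over per-object slots with a
# simple message-passing dynamics model, as the entry describes in outline.
import torch
import torch.nn as nn

class ObjectEncoder(nn.Module):
    """Encodes a set of per-object feature vectors with self-attention."""
    def __init__(self, obj_dim=32, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=obj_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, objects):          # (B, N, obj_dim)
        return self.encoder(objects)

class GraphDynamics(nn.Module):
    """Predicts next object states from current states and an action,
    using fully connected pairwise message passing."""
    def __init__(self, obj_dim=32, action_dim=4, hidden=64):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(2 * obj_dim, hidden), nn.ReLU(), nn.Linear(hidden, obj_dim))
        self.node_mlp = nn.Sequential(nn.Linear(2 * obj_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, obj_dim))

    def forward(self, objects, action):  # objects: (B, N, D), action: (B, A)
        B, N, D = objects.shape
        senders = objects.unsqueeze(2).expand(B, N, N, D)
        receivers = objects.unsqueeze(1).expand(B, N, N, D)
        messages = self.edge_mlp(torch.cat([senders, receivers], dim=-1)).sum(dim=1)  # aggregate per receiver
        act = action.unsqueeze(1).expand(B, N, -1)
        return objects + self.node_mlp(torch.cat([objects, messages, act], dim=-1))

# Toy forward pass: 5 object slots, 4-dimensional action.
objects = torch.randn(2, 5, 32)
enc = ObjectEncoder()(objects)
next_objects = GraphDynamics()(enc, torch.randn(2, 4))
print(next_objects.shape)  # torch.Size([2, 5, 32])
```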
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase-grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- InstructDiffusion: A Generalist Modeling Interface for Vision Tasks [52.981128371910266]
We present InstructDiffusion, a framework for aligning computer vision tasks with human instructions.
InstructDiffusion could handle a variety of vision tasks, including understanding tasks and generative tasks.
It even exhibits the ability to handle unseen tasks and outperforms prior methods on novel datasets.
arXiv Detail & Related papers (2023-09-07T17:56:57Z)
- Semi Supervised Meta Learning for Spatiotemporal Learning [0.0]
We seek to understand the impact of applying meta-learning to existing representation learning architectures.
We utilize a Memory-Augmented Neural Network (MANN) architecture to apply meta-learning to our framework.
arXiv Detail & Related papers (2023-07-09T04:09:58Z)
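The entry above builds on a Memory-Augmented Neural Network (MANN). Below is a minimal content-addressed read/write loop in that spirit; the controller size, slot count, and naive write rule are purely illustrative assumptions, not the paper's design.
```python
# Minimal MANN-flavored sketch: an LSTM controller reads from an external
# memory with cosine-similarity addressing and overwrites the oldest slot.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMANN(nn.Module):
    def __init__(self, in_dim=16, ctrl_dim=32, mem_slots=8, mem_dim=32, out_dim=10):
        super().__init__()
        self.controller = nn.LSTMCell(in_dim + mem_dim, ctrl_dim)
        self.key = nn.Linear(ctrl_dim, mem_dim)
        self.out = nn.Linear(ctrl_dim + mem_dim, out_dim)
        self.ctrl_dim, self.mem_slots, self.mem_dim = ctrl_dim, mem_slots, mem_dim

    def forward(self, x_seq):                            # (B, T, in_dim)
        B, T, _ = x_seq.shape
        memory = torch.zeros(B, self.mem_slots, self.mem_dim)
        h = torch.zeros(B, self.ctrl_dim)
        c = torch.zeros(B, self.ctrl_dim)
        read = torch.zeros(B, self.mem_dim)
        outputs = []
        for t in range(T):
            h, c = self.controller(torch.cat([x_seq[:, t], read], dim=-1), (h, c))
            k = self.key(h)                                                   # query key
            sim = F.cosine_similarity(memory, k.unsqueeze(1).expand_as(memory), dim=-1)
            attn = F.softmax(sim, dim=-1)                                     # (B, mem_slots)
            read = (attn.unsqueeze(-1) * memory).sum(dim=1)                   # content-based read
            memory = torch.cat([k.unsqueeze(1), memory[:, :-1]], dim=1)       # naive write: push, drop oldest
            outputs.append(self.out(torch.cat([h, read], dim=-1)))
        return torch.stack(outputs, dim=1)                # (B, T, out_dim)

print(TinyMANN()(torch.randn(4, 5, 16)).shape)            # torch.Size([4, 5, 10])
```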
- Pretraining on Interactions for Learning Grounded Affordance Representations [22.290431852705662]
We train a neural network to predict objects' trajectories in a simulated interaction.
We show that our network's latent representations differentiate between both observed and unobserved affordances.
Our results suggest a way in which modern deep learning approaches to grounded language learning can be integrated with traditional formal semantic notions of lexical representations.
arXiv Detail & Related papers (2022-07-05T19:19:53Z)
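The affordance-pretraining entry above trains a network to predict object trajectories during simulated interactions. A small GRU-based trajectory predictor in that spirit is sketched below; the state dimensionality, rollout horizon, and MSE loss are assumptions chosen for illustration.
```python
# Sketch of trajectory prediction as a pretraining task: given an object's
# state over the first K steps of an interaction, predict the next steps.
import torch
import torch.nn as nn

class TrajectoryPredictor(nn.Module):
    def __init__(self, state_dim=6, hidden=64, horizon=10):
        super().__init__()
        self.encoder = nn.GRU(state_dim, hidden, batch_first=True)
        self.decoder = nn.GRUCell(state_dim, hidden)
        self.head = nn.Linear(hidden, state_dim)
        self.horizon = horizon

    def forward(self, observed):               # (B, K, state_dim)
        _, h = self.encoder(observed)          # summarize the observed prefix
        h = h.squeeze(0)
        step = observed[:, -1]                 # start rollout from the last observed state
        preds = []
        for _ in range(self.horizon):
            h = self.decoder(step, h)
            step = self.head(h)                # predicted next object state (e.g. pose)
            preds.append(step)
        return torch.stack(preds, dim=1)       # (B, horizon, state_dim)

model = TrajectoryPredictor()
observed = torch.randn(4, 5, 6)                # 5 observed steps of a 6-D object state
future = torch.randn(4, 10, 6)                 # ground-truth continuation from the simulator
loss = nn.functional.mse_loss(model(observed), future)
print(float(loss))
```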
- Silver-Bullet-3D at ManiSkill 2021: Learning-from-Demonstrations and Heuristic Rule-based Methods for Object Manipulation [118.27432851053335]
This paper presents an overview and comparative analysis of our systems designed for two tracks in the SAPIEN ManiSkill Challenge 2021, including the No Interaction track.
The No Interaction track targets learning policies from pre-collected demonstration trajectories.
In this track, we design a Heuristic Rule-based Method (HRM) to trigger high-quality object manipulation by decomposing the task into a series of sub-tasks.
For each sub-task, simple rule-based control strategies are adopted to predict actions that can be applied to the robotic arm.
arXiv Detail & Related papers (2022-06-13T16:20:42Z)
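The HRM entry above decomposes manipulation into sub-tasks, each handled by a simple rule-based controller. The sketch below shows that pattern for a generic pick-and-place decomposition; the state keys, thresholds, and proportional gain are assumptions, not the challenge system.
```python
# Illustrative decomposition of a pick-and-place style task into sub-tasks,
# each driven by a simple rule-based controller.
import numpy as np

def move_toward(current, target, gain=0.5):
    """Proportional rule: step the end-effector toward a target position."""
    return gain * (np.asarray(target) - np.asarray(current))

def heuristic_rule_based_policy(state):
    """state: dict with 'ee_pos', 'obj_pos', 'goal_pos', 'grasped' (hypothetical keys)."""
    ee, obj, goal = state["ee_pos"], state["obj_pos"], state["goal_pos"]
    if not state["grasped"]:
        if np.linalg.norm(ee - obj) > 0.02:               # sub-task 1: reach the object
            return {"delta": move_toward(ee, obj), "gripper": "open"}
        return {"delta": np.zeros(3), "gripper": "close"}  # sub-task 2: grasp
    if np.linalg.norm(obj - goal) > 0.02:                  # sub-task 3: transport to the goal
        return {"delta": move_toward(ee, goal), "gripper": "close"}
    return {"delta": np.zeros(3), "gripper": "open"}       # sub-task 4: release

state = {"ee_pos": np.array([0.0, 0.0, 0.2]), "obj_pos": np.array([0.1, 0.1, 0.0]),
         "goal_pos": np.array([0.3, 0.0, 0.0]), "grasped": False}
print(heuristic_rule_based_policy(state))
```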
- Context-Aware Sequence Alignment using 4D Skeletal Augmentation [67.05537307224525]
Temporal alignment of fine-grained human actions in videos is important for numerous applications in computer vision, robotics, and mixed reality.
We propose CASA, a novel context-aware self-supervised learning architecture to align sequences of actions.
Specifically, CASA employs self-attention and cross-attention mechanisms to incorporate the spatial and temporal context of human actions.
arXiv Detail & Related papers (2022-04-26T10:59:29Z)
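The CASA entry above combines self-attention within each sequence with cross-attention between the two sequences being aligned. The snippet below illustrates that pattern with PyTorch's MultiheadAttention; feature sizes and the use of the attention matrix as a soft alignment are assumptions, not the published model.
```python
# Illustrative self-attention + cross-attention block for aligning two action
# sequences, in the spirit of the CASA entry.
import torch
import torch.nn as nn

class AlignBlock(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, seq_a, seq_b):
        # Self-attention: contextualize each sequence over its own timeline.
        a, _ = self.self_attn(seq_a, seq_a, seq_a)
        b, _ = self.self_attn(seq_b, seq_b, seq_b)
        # Cross-attention: let frames of A attend to frames of B.
        a2b, attn = self.cross_attn(a, b, b)
        return a2b, attn     # attn (B, T_a, T_b) acts as a soft temporal alignment

seq_a, seq_b = torch.randn(1, 20, 64), torch.randn(1, 26, 64)
aligned, soft_alignment = AlignBlock()(seq_a, seq_b)
print(soft_alignment.shape)  # torch.Size([1, 20, 26])
```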
- CLIPort: What and Where Pathways for Robotic Manipulation [35.505615833638124]
We present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding of CLIP with the spatial precision of Transporter.
Our framework is capable of solving a variety of language-specified tabletop tasks without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures.
arXiv Detail & Related papers (2021-09-24T17:44:28Z)
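CLIPort's two-pathway idea, broad "what" semantics from a vision-language model fused with a spatially precise "where" pathway, can be caricatured as below; the gating-based fusion and all module sizes are illustrative assumptions, not the released CLIPort code.
```python
# Caricature of a two-stream "what" (semantic) and "where" (spatial) design:
# a language-conditioned semantic pathway modulates a fully convolutional
# spatial pathway that outputs a dense per-pixel action score.
import torch
import torch.nn as nn

class TwoPathwayAgent(nn.Module):
    def __init__(self, lang_dim=64):
        super().__init__()
        self.spatial = nn.Sequential(               # "where": keeps full spatial resolution
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.semantic = nn.Sequential(              # "what": language-conditioned gating
            nn.Linear(lang_dim, 16), nn.Sigmoid(),
        )
        self.head = nn.Conv2d(16, 1, 1)             # per-pixel score for where to act

    def forward(self, image, lang_emb):
        feat = self.spatial(image)                          # (B, 16, H, W)
        gate = self.semantic(lang_emb)[:, :, None, None]    # (B, 16, 1, 1)
        return self.head(feat * gate).squeeze(1)            # (B, H, W) pick heatmap

agent = TwoPathwayAgent()
heatmap = agent(torch.rand(1, 3, 64, 64), torch.rand(1, 64))
print(heatmap.shape)  # torch.Size([1, 64, 64])
```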
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this automatically generated content (including all information) and is not responsible for any consequences of its use.