Related papers: Visual Affordance Prediction for Guiding Robot Exploration

URL: http://arxiv.org/abs/2305.17783v1
Date: Sun, 28 May 2023 17:53:09 GMT
Title: Visual Affordance Prediction for Guiding Robot Exploration
Authors: Homanga Bharadhwaj, Abhinav Gupta, Shubham Tulsiani
Abstract summary: We develop an approach for learning visual affordances for guiding robot exploration. We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE. We show how the trained affordance model can be used for guiding exploration by acting as a goal-sampling distribution, during visual goal-conditioned policy learning in robotic manipulation.
Score: 56.17795036091848
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Motivated by the intuitive understanding humans have about the space of possible interactions, and the ease with which they can generalize this understanding to previously unseen scenes, we develop an approach for learning visual affordances for guiding robot exploration. Given an input image of a scene, we infer a distribution over plausible future states that can be achieved via interactions with it. We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE and show that these models can be trained using large-scale and diverse passive data, and that the learned models exhibit compositional generalization to diverse objects beyond the training distribution. We show how the trained affordance model can be used for guiding exploration by acting as a goal-sampling distribution, during visual goal-conditioned policy learning in robotic manipulation.
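The goal-sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the codebook, the latent vectors, and the `toy_logits` function (a hypothetical stand-in for the trained Transformer) are all assumptions made for illustration. The sketch quantizes scene latents to discrete VQ-VAE-style tokens, then samples goal tokens one at a time from a conditional categorical distribution. Decoding the sampled tokens back to a goal image is omitted.

```python
import math
import random


def quantize(vectors, codebook):
    """Map each continuous latent vector to the index of its nearest codebook entry."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
            for v in vectors]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(context_tokens, prefix, vocab_size):
    """Hypothetical stand-in for the Transformer (an assumption, not the paper's model).

    It favors copying the scene token at the current position, mimicking the
    intuition that most of a scene is unchanged by a single interaction.
    """
    pos = len(prefix)
    return [2.0 if k == context_tokens[pos] else 0.0 for k in range(vocab_size)]


def sample_goal_tokens(context_tokens, vocab_size, rng, logits_fn=toy_logits):
    """Autoregressively sample one goal token sequence conditioned on the scene tokens."""
    goal = []
    for _ in range(len(context_tokens)):
        probs = softmax(logits_fn(context_tokens, goal, vocab_size))
        goal.append(rng.choices(range(vocab_size), weights=probs, k=1)[0])
    return goal


# Toy codebook with four 2-D entries and three scene latents (illustrative values).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scene_latents = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]
scene_tokens = quantize(scene_latents, codebook)  # [0, 1, 2]

# Draw several candidate goals; each could serve as a target for a
# goal-conditioned policy during exploration.
rng = random.Random(0)
goals = [sample_goal_tokens(scene_tokens, len(codebook), rng) for _ in range(3)]
```

Sampling several goal sequences from the same scene reflects the multimodal nature of the distribution: each draw is a different plausible future state that an exploring policy could be asked to reach.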

Related papers

Emergence of Human to Robot Transfer in Vision-Language-Action Models [88.76648919814771]
Vision-language-action (VLA) models can enable broad open world generalization, but require large and diverse datasets. We show that human-to-robot transfer emerges once the VLA is pre-trained on sufficient scenes, tasks, and embodiments.
arXiv Detail & Related papers (2025-12-27T00:13:11Z)
VideoVLA: Video Generators Can Be Generalizable Robot Manipulators [86.70243911696616]
Generalization in robot manipulation is essential for deploying robots in open-world environments. We present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators.
arXiv Detail & Related papers (2025-12-07T18:57:15Z)
UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning [22.84748754972181]
Building generalist robot policies that can handle diverse tasks in open-ended environments is a central challenge in robotics. To leverage knowledge from large-scale pretraining, prior work has typically built generalist policies either on top of vision-language understanding models (VLMs) or generative models. Recent unified models of generation and understanding have demonstrated strong capabilities in both comprehension and generation through large-scale pretraining. We introduce UniCoD, which acquires the ability to dynamically model high-dimensional visual features through pretraining on over 1M internet-scale instructional manipulation videos.
arXiv Detail & Related papers (2025-10-12T14:54:19Z)
Object-Centric Action-Enhanced Representations for Robot Visuo-Motor Policy Learning [21.142247150423863]
We propose an object-centric encoder that performs semantic segmentation and visual representation generation in a coupled manner. To achieve this, we leverage the Slot Attention mechanism and use the SOLV model, pretrained on large out-of-domain datasets. We show that exploiting models pretrained on out-of-domain datasets can benefit this process, and that fine-tuning on datasets depicting human actions can significantly improve performance.
arXiv Detail & Related papers (2025-05-27T09:56:52Z)
Reciprocal Learning of Intent Inferral with Augmented Visual Feedback for Stroke [2.303526979876375]
We propose a bidirectional paradigm that facilitates human adaptation to an intent inferral classifier. We demonstrate this paradigm in the context of controlling a robotic hand orthosis for stroke. Our experiments with stroke subjects show reciprocal learning improving performance in a subset of subjects without negatively impacting performance on the others.
arXiv Detail & Related papers (2024-12-10T22:49:36Z)
Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA). LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z)
Expanding Frozen Vision-Language Models without Retraining: Towards Improved Robot Perception [0.0]
Vision-language models (VLMs) have shown powerful capabilities in visual question answering and reasoning tasks. In this paper, we demonstrate a method of aligning the embedding spaces of different modalities to the vision embedding space. We show that using multiple modalities as input improves the VLM's scene understanding and enhances its overall performance in various tasks.
arXiv Detail & Related papers (2023-08-31T06:53:55Z)
Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies. The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
arXiv Detail & Related papers (2022-11-16T16:26:48Z)
Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning. We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
Few-Shot Visual Grounding for Natural Human-Robot Interaction [0.0]
We propose a software architecture that segments a target object from a crowded scene, indicated verbally by a human user. At the core of our system, we employ a multi-modal deep neural network for visual grounding. We evaluate the performance of the proposed model on real RGB-D data collected from public scene datasets.
arXiv Detail & Related papers (2021-03-17T15:24:02Z)
Model-Based Visual Planning with Self-Supervised Functional Distances [104.83979811803466]
We present a self-supervised method for model-based visual goal reaching. Our approach learns entirely using offline, unlabeled data. We find that this approach substantially outperforms both model-free and model-based prior methods.
arXiv Detail & Related papers (2020-12-30T23:59:09Z)
Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works. However, learning a model that captures the dynamics of complex skills represents a major challenge. We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)

This list is automatically generated from the titles and abstracts of the papers on this site.