Simultaneous Localization and Affordance Prediction for Tasks in Egocentric Video
- URL: http://arxiv.org/abs/2407.13856v1
- Date: Thu, 18 Jul 2024 18:55:56 GMT
- Title: Simultaneous Localization and Affordance Prediction for Tasks in Egocentric Video
- Authors: Zachary Chavis, Hyun Soo Park, Stephen J. Guy
- Abstract summary: We present a system that trains on spatially-localized egocentric videos to connect visual input and task descriptions and predict a task's spatial affordance, i.e., the location where a person would go to accomplish the task.
We show our approach outperforms a baseline that uses a VLM to map the similarity of a task's description over a set of location-tagged images.
The resulting system enables robots to use egocentric sensing to navigate to physical locations of novel tasks specified in natural language.
- Score: 18.14234312389889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) have shown great success as foundational models for downstream vision and natural language applications in a variety of domains. However, these models lack the spatial understanding necessary for robotics applications, where the agent must reason about the affordances provided by the 3D world around it. We present a system that trains on spatially-localized egocentric videos to connect visual input and task descriptions and predict a task's spatial affordance, that is, the location where a person would go to accomplish the task. We show our approach outperforms the baseline of using a VLM to map the similarity of a task's description over a set of location-tagged images. Our learning-based approach has lower error both in predicting where a task may take place and in predicting which tasks are likely to happen at the current location. The resulting system enables robots to use egocentric sensing to navigate to the physical locations of novel tasks specified in natural language.
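The baseline referenced in the abstract scores a set of location-tagged images by their similarity to the task description and picks the best-scoring location. Below is a minimal sketch of that style of baseline, assuming a CLIP-family model as a stand-in for the unspecified VLM; the names rank_locations and tagged_images are illustrative, not from the paper.

```python
# Sketch of the location-tagged-image similarity baseline.
# CLIP stands in for the VLM; the paper does not specify the exact model used.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def rank_locations(task_description, tagged_images):
    """tagged_images: list of (location, PIL.Image) pairs, where `location`
    is the spatial tag attached to each egocentric frame."""
    images = [img for _, img in tagged_images]
    inputs = processor(text=[task_description], images=images,
                       return_tensors="pt", padding=True)
    scores = model(**inputs).logits_per_text[0]   # (num_images,) similarities
    best = int(scores.argmax())
    return tagged_images[best][0], scores          # predicted location + raw scores
```

The paper's learning-based approach replaces this frame-by-frame similarity ranking with a model trained on spatially-localized egocentric video, which the abstract reports yields lower error both for predicting where a task may take place and for predicting which tasks fit the current location.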
Related papers
- Image-based Geo-localization for Robotics: Are Black-box Vision-Language Models there yet? [25.419763184667985]
Vision-Language models (VLMs) offer exciting opportunities for robotic applications involving image geo-localization.
Recent research has focused on using a VLM as an embedding extractor for geo-localization.
This paper investigates the potential of some of the state-of-the-art VLMs as stand-alone, zero-shot geo-localization systems.
arXiv Detail & Related papers (2025-01-28T13:46:01Z) - MLLM-Search: A Zero-Shot Approach to Finding People using Multimodal Large Language Models [5.28115111932163]
We present MLLM-Search, a novel zero-shot person search architecture for mobile robots.
Our approach introduces a novel visual prompting method to provide robots with spatial understanding of the environment.
Experiments with a mobile robot in a multi-room floor of a building showed that MLLM-Search was able to generalize to finding a person in a new unseen environment.
arXiv Detail & Related papers (2024-11-27T21:59:29Z) - Space-LLaVA: a Vision-Language Model Adapted to Extraterrestrial Applications [14.89043819048682]
We see three core challenges in the future of space robotics that motivate building a foundation model (FM) for space robotics.
As a first step towards a space foundation model, we augment three extraterrestrial databases with fine-grained annotations.
We fine-tune a Vision-Language Model to adapt to the semantic features of an extraterrestrial environment.
arXiv Detail & Related papers (2024-08-12T05:07:24Z) - VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that broadly evaluates the spatial planning capability of these models.
arXiv Detail & Related papers (2024-07-02T00:24:01Z) - Grounding Language Plans in Demonstrations Through Counterfactual Perturbations [25.19071357445557]
Grounding the common-sense reasoning of Large Language Models (LLMs) in physical domains remains a pivotal yet unsolved problem for embodied AI.
We show our approach improves the interpretability and reactivity of imitation learning on 2D navigation and on simulated and real robot manipulation tasks.
arXiv Detail & Related papers (2024-03-25T19:04:59Z) - PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [140.14239499047977]
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding.
We propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT).
We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
arXiv Detail & Related papers (2024-02-12T18:33:47Z) - Interactive Planning Using Large Language Models for Partially
Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - Visual Affordance Prediction for Guiding Robot Exploration [56.17795036091848]
We develop an approach for learning visual affordances for guiding robot exploration.
We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE.
We show how the trained affordance model can guide exploration by acting as a goal-sampling distribution during visual goal-conditioned policy learning in robotic manipulation; a sketch of this sampling step appears after this list.
arXiv Detail & Related papers (2023-05-28T17:53:09Z) - Learning Language-Conditioned Robot Behavior from Offline Data and
Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language-conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)