Simultaneous Localization and Affordance Prediction for Tasks in Egocentric Video
- URL: http://arxiv.org/abs/2407.13856v1
- Date: Thu, 18 Jul 2024 18:55:56 GMT
- Title: Simultaneous Localization and Affordance Prediction for Tasks in Egocentric Video
- Authors: Zachary Chavis, Hyun Soo Park, Stephen J. Guy
- Abstract summary: We present a system that trains on spatially-localized egocentric videos to connect visual input and task descriptions and predict a task's spatial affordance, i.e., the location where a person would go to accomplish the task.
We show our approach outperforms a baseline that uses a VLM to map the similarity of a task's description over a set of location-tagged images.
The resulting system enables robots to use egocentric sensing to navigate to physical locations of novel tasks specified in natural language.
- Score: 18.14234312389889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-Language Models (VLMs) have shown great success as foundational models for downstream vision and natural language applications in a variety of domains. However, these models lack the spatial understanding necessary for robotics applications, where the agent must reason about the affordances provided by the 3D world around it. We present a system that trains on spatially-localized egocentric videos to connect visual input and task descriptions and predict a task's spatial affordance, that is, the location where a person would go to accomplish the task. We show our approach outperforms the baseline of using a VLM to map the similarity of a task's description over a set of location-tagged images. Our learning-based approach has lower error both in predicting where a task may take place and in predicting which tasks are likely to happen at the current location. The resulting system enables robots to use egocentric sensing to navigate to the physical locations of novel tasks specified in natural language.
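The baseline referenced in the abstract scores a set of location-tagged images by their similarity to the task description and picks the best-scoring location. Below is a minimal sketch of that style of baseline, assuming a CLIP-family model as a stand-in for the unspecified VLM; the names rank_locations and tagged_images are illustrative, not from the paper.

```python
# Sketch of the location-tagged-image similarity baseline.
# CLIP stands in for the VLM; the paper does not specify the exact model used.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def rank_locations(task_description, tagged_images):
    """tagged_images: list of (location, PIL.Image) pairs, where `location`
    is the spatial tag attached to each egocentric frame."""
    images = [img for _, img in tagged_images]
    inputs = processor(text=[task_description], images=images,
                       return_tensors="pt", padding=True)
    scores = model(**inputs).logits_per_text[0]   # (num_images,) similarities
    best = int(scores.argmax())
    return tagged_images[best][0], scores          # predicted location + raw scores
```

The paper's learning-based approach replaces this frame-by-frame similarity ranking with a model trained on spatially-localized egocentric video, which the abstract reports yields lower error both for predicting where a task may take place and for predicting which tasks fit the current location.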
Related papers
- Image-based Geo-localization for Robotics: Are Black-box Vision-Language Models there yet? [25.419763184667985]
Vision-Language models (VLMs) offer exciting opportunities for robotic applications involving image geo-localization.
Recent research has focused on using a VLM as an embedding extractor for geo-localization.
This paper investigates the potential of some of the state-of-the-art VLMs as stand-alone, zero-shot geo-localization systems.
arXiv Detail & Related papers (2025-01-28T13:46:01Z) - MLLM-Search: A Zero-Shot Approach to Finding People using Multimodal Large Language Models [5.28115111932163]
We present MLLM-Search, a novel zero-shot person search architecture for mobile robots.
Our approach introduces a novel visual prompting method to provide robots with spatial understanding of the environment.
Experiments with a mobile robot in a multi-room floor of a building showed that MLLM-Search was able to generalize to finding a person in a new unseen environment.
arXiv Detail & Related papers (2024-11-27T21:59:29Z) - Space-LLaVA: a Vision-Language Model Adapted to Extraterrestrial Applications [14.89043819048682]
We see three core challenges in the future of space robotics that motivate building a foundation model (FM) for space robotics.
As a first step towards a space foundation model, we augment three extraterrestrial databases with fine-grained annotations.
We fine-tune a Vision-Language Model to adapt to the semantic features of an extraterrestrial environment.
arXiv Detail & Related papers (2024-08-12T05:07:24Z) - VSP: Assessing the dual challenges of perception and reasoning in spatial planning tasks for VLMs [102.36953558562436]
Vision language models (VLMs) are an exciting emerging class of language models (LMs).
One understudied capability in VLMs is visual spatial planning.
Our study introduces a benchmark that broadly evaluates the spatial planning capability of these models.
arXiv Detail & Related papers (2024-07-02T00:24:01Z) - Grounding Language Plans in Demonstrations Through Counterfactual Perturbations [25.19071357445557]
Grounding the common-sense reasoning of Large Language Models (LLMs) in physical domains remains a pivotal yet unsolved problem for embodied AI.
We show our approach improves the interpretability and reactivity of imitation learning on 2D navigation and on simulated and real robot manipulation tasks.
arXiv Detail & Related papers (2024-03-25T19:04:59Z) - PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [140.14239499047977]
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding.
We propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT).
We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
arXiv Detail & Related papers (2024-02-12T18:33:47Z) - Interactive Planning Using Large Language Models for Partially
Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z) - Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z) - Visual Affordance Prediction for Guiding Robot Exploration [56.17795036091848]
We develop an approach for learning visual affordances for guiding robot exploration.
We use a Transformer-based model to learn a conditional distribution in the latent embedding space of a VQ-VAE.
We show how the trained affordance model can guide exploration by acting as a goal-sampling distribution during visual goal-conditioned policy learning in robotic manipulation; a sketch of this sampling step appears after this list.
arXiv Detail & Related papers (2023-05-28T17:53:09Z) - Learning Language-Conditioned Robot Behavior from Offline Data and
Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language-conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)