Space-LLaVA: a Vision-Language Model Adapted to Extraterrestrial Applications
- URL: http://arxiv.org/abs/2408.05924v2
- Date: Sat, 18 Jan 2025 19:33:02 GMT
- Title: Space-LLaVA: a Vision-Language Model Adapted to Extraterrestrial Applications
- Authors: Matthew Foutter, Daniele Gammelli, Justin Kruger, Ethan Foss, Praneet Bhoj, Tommaso Guffanti, Simone D'Amico, Marco Pavone
- Abstract summary: We see three core challenges in the future of space robotics that motivate building a foundation model (FM) for space robotics. As a first step towards a space foundation model, we augment three extraterrestrial databases with fine-grained annotations. We fine-tune a Vision-Language Model to adapt to the semantic features in an extraterrestrial environment.
- Score: 14.89043819048682
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation Models (FMs), e.g., large language models, possess attributes of intelligence which offer promise to endow a robot with the contextual understanding necessary to navigate complex, unstructured tasks in the wild. We see three core challenges in the future of space robotics that motivate building an FM for the space robotics community: 1) Scalability of ground-in-the-loop operations; 2) Generalizing prior knowledge to novel environments; and 3) Multi-modality in tasks and sensor data. As a first step towards a space foundation model, we programmatically augment three extraterrestrial databases with fine-grained language annotations inspired by the sensory reasoning necessary to, e.g., identify a site of scientific interest on Mars, building a synthetic dataset of visual-question-answer and visual instruction-following tuples. We fine-tune a pre-trained LLaVA 13B checkpoint on our augmented dataset to adapt a Vision-Language Model (VLM) to the visual semantic features in an extraterrestrial environment, demonstrating FMs as a tool for specialization and enhancing a VLM's zero-shot performance on unseen task types in comparison to state-of-the-art VLMs. Ablation studies show that fine-tuning the language backbone and vision-language adapter in concert is key to facilitating adaptation, while a small percentage, e.g., 20%, of the pre-training data can be used to safeguard against catastrophic forgetting.
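The abstract's last point, mixing a small share of pre-training data into the fine-tuning set to guard against catastrophic forgetting, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `build_finetune_mix`, the `replay_fraction` parameter, and the uniform random sampling are all assumptions made for illustration, with 0.2 mirroring the "e.g., 20%" figure quoted above.

```python
import random


def build_finetune_mix(domain_examples, pretrain_examples,
                       replay_fraction=0.2, seed=0):
    """Combine domain data with a sampled share of pre-training data.

    Hypothetical helper: replay_fraction is the share of the
    pre-training set that is replayed during fine-tuning.
    """
    rng = random.Random(seed)
    pretrain_examples = list(pretrain_examples)
    # Number of pre-training examples to replay alongside the new data.
    n_replay = round(replay_fraction * len(pretrain_examples))
    replay = rng.sample(pretrain_examples, n_replay)
    # Shuffle so replayed and domain examples are interleaved in training.
    mix = list(domain_examples) + replay
    rng.shuffle(mix)
    return mix
```

With 10 domain examples and 50 pre-training examples, the default settings yield a mix of 20 examples, 10 of them replayed. Fixing the seed keeps the sampled subset reproducible across runs.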
Related papers
- Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation [67.31811007549489]
We propose a Rewriting-driven AugMentation (RAM) paradigm for Vision-Language Navigation (VLN).
Benefiting from our rewriting mechanism, new observation-instruction pairs can be obtained in both simulator-free and labor-saving manners to promote generalization.
Experiments on both the discrete environments (R2R, REVERIE, and R4R) and continuous environments (R2R-CE) show the superior performance and impressive generalization ability of our method.
arXiv Detail & Related papers (2025-03-23T13:18:17Z) - LIAM: Multimodal Transformer for Language Instructions, Images, Actions and Semantic Maps [18.602777449136738]
We propose LIAM - an end-to-end model that predicts action transcripts based on language, image, action, and map inputs.
We evaluate our method on the ALFRED dataset, a simulator-generated benchmark for domestic tasks.
arXiv Detail & Related papers (2025-03-15T18:54:06Z) - PAVLM: Advancing Point Cloud based Affordance Understanding Via Vision-Language Model [4.079327215055764]
Affordance understanding, the task of identifying actionable regions on 3D objects, plays a vital role in allowing robotic systems to engage with and operate within the physical world.
Visual Language Models (VLMs) have excelled in high-level reasoning but fall short in grasping the nuanced physical properties required for effective human-robot interaction.
We introduce PAVLM, an innovative framework that utilizes the extensive multimodal knowledge embedded in pre-trained language models to enhance 3D affordance understanding of point clouds.
arXiv Detail & Related papers (2024-10-15T12:53:42Z) - TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability.
To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT.
This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z) - Pushing the Limits of Vision-Language Models in Remote Sensing without Human Annotations [5.065947993017157]
This study introduces an approach to curate vision-language datasets by employing an image decoding machine learning model.
We amassed approximately 9.6 million vision-language paired datasets in VHR imagery.
The resultant model outperformed counterparts that did not leverage publicly available vision-language datasets.
arXiv Detail & Related papers (2024-09-11T06:36:08Z) - Space3D-Bench: Spatial 3D Question Answering Benchmark [49.259397521459114]
We present Space3D-Bench - a collection of 1000 general spatial questions and answers related to scenes of the Replica dataset.
We provide an assessment system that grades natural language responses based on predefined ground-truth answers.
Finally, we introduce a baseline called RAG3D-Chat integrating the world understanding of foundation models with rich context retrieval.
arXiv Detail & Related papers (2024-08-29T16:05:22Z) - Simultaneous Localization and Affordance Prediction for Tasks in Egocentric Video [18.14234312389889]
We present a system which trains on spatially-localized egocentric videos in order to connect visual input and task descriptions.
We show our approach outperforms the baseline of using a VLM to map similarity of a task's description over a set of location-tagged images.
The resulting system enables robots to use egocentric sensing to navigate to physical locations of novel tasks specified in natural language.
arXiv Detail & Related papers (2024-07-18T18:55:56Z) - SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models [68.13636352687257]
We introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.
During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances.
Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts.
arXiv Detail & Related papers (2024-06-03T17:59:06Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities [59.39858959066982]
Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics.
We develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images.
By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA.
arXiv Detail & Related papers (2024-01-22T18:01:01Z) - Towards A Foundation Model For Trajectory Intelligence [0.0]
We present the results of training a large trajectory model using real-world user check-in data.
Our approach follows a pre-train and fine-tune paradigm, where a base model is pre-trained via masked trajectory modeling.
Our empirical analysis utilizes a comprehensive dataset of over 2 billion check-ins generated by more than 6 million users.
arXiv Detail & Related papers (2023-11-30T00:34:09Z) - VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z) - Kosmos-2: Grounding Multimodal Large Language Models to the World [107.27280175398089]
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM).
It enables new capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the visual world.
Code and pretrained models are available at https://aka.ms/kosmos-2.
arXiv Detail & Related papers (2023-06-26T16:32:47Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models.
Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning.
Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents [111.15288256221764]
The Grounded Decoding project aims to solve complex, long-horizon tasks in a robotic setting by leveraging the knowledge of both a language model and grounded models.
We frame this as a problem similar to probabilistic filtering: decode a sequence that both has high probability under the language model and high probability under a set of grounded model objectives.
We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and that the proposed decoding strategy is able to solve complex, long-horizon tasks in a robotic setting by leveraging the knowledge of both models.
arXiv Detail & Related papers (2023-03-01T22:58:50Z) - Transfer Learning with Synthetic Corpora for Spatial Role Labeling and Reasoning [15.082041039434365]
We provide two new data resources on multiple spatial language processing tasks.
The first dataset is synthesized for transfer learning on spatial question answering (SQA) and spatial role labeling (SpRL).
The second dataset is a real-world SQA dataset with human-generated questions built on an existing corpus with SpRL annotations.
arXiv Detail & Related papers (2022-10-30T21:23:34Z) - Grounding Language with Visual Affordances over Unstructured Data [26.92329260907805]
We propose a novel approach to efficiently learn language-conditioned robot skills from unstructured, offline and reset-free data.
We exploit a self-supervised visuo-lingual affordance model, which requires as little as 1% of the total data with language.
We find that our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches.
arXiv Detail & Related papers (2022-10-04T21:16:48Z) - Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z) - LanguageRefer: Spatial-Language Model for 3D Visual Grounding [72.7618059299306]
We develop a spatial-language model for a 3D visual grounding problem.
We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D.
arXiv Detail & Related papers (2021-07-07T18:55:03Z)