Visual Language Models as Operator Agents in the Space Domain
- URL: http://arxiv.org/abs/2501.07802v1
- Date: Tue, 14 Jan 2025 03:03:37 GMT
- Title: Visual Language Models as Operator Agents in the Space Domain
- Authors: Alejandro Carrasco, Marco Nedungadi, Enrico M. Zucchelli, Amit Jain, Victor Rodriguez-Fernandez, Richard Linares
- Abstract summary: Vision-Language Models (VLMs) can enhance autonomous control and decision-making in space missions.
In the software context, we employ VLMs to interpret visual screenshots of the graphical user interface to perform complex orbital maneuvers.
In the hardware context, we integrate VLMs with robotic systems equipped with cameras to inspect and diagnose physical space objects, such as satellites.
- Score: 36.943670587532026
- License:
- Abstract: This paper explores the application of Vision-Language Models (VLMs) as operator agents in the space domain, focusing on both software and hardware operational paradigms. Building on advances in Large Language Models (LLMs) and their multimodal extensions, we investigate how VLMs can enhance autonomous control and decision-making in space missions. In the software context, we employ VLMs within the Kerbal Space Program Differential Games (KSPDG) simulation environment, enabling the agent to interpret visual screenshots of the graphical user interface to perform complex orbital maneuvers. In the hardware context, we integrate VLMs with robotic systems equipped with cameras to inspect and diagnose physical space objects, such as satellites. Our results demonstrate that VLMs can effectively process visual and textual data to generate contextually appropriate actions, competing with traditional methods and non-multimodal LLMs in simulation tasks, and showing promise in real-world applications.
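For concreteness, the software paradigm described above amounts to a perception-to-action loop: feed the VLM a GUI screenshot plus a text summary of the mission state, and parse its reply into a throttle command. The sketch below is a minimal illustration of that loop, not the authors' implementation: the environment hooks (`get_screenshot`, `get_mission_state`, `apply_throttle`) are hypothetical placeholders for a KSPDG-style interface, and an OpenAI-style chat-completions call stands in for whichever VLM the paper actually uses.
```python
# Minimal sketch of a VLM operator-agent loop for a KSPDG-style task.
# Assumptions: an OpenAI-style vision chat model and a hypothetical env
# object exposing get_screenshot(), get_mission_state(), apply_throttle().
import base64
import json

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an autonomous spacecraft operator. Given a screenshot of the "
    "simulator GUI and a text summary of the mission state, reply with a JSON "
    'object {"throttle": [forward, right, down]} with each component in [-1, 1].'
)


def choose_action(screenshot_png: bytes, mission_text: str) -> list[float]:
    """Ask the VLM for a throttle command from one GUI frame plus mission text."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable chat model
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": mission_text},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            },
        ],
    )
    # The model is prompted to answer with JSON only; parse the throttle vector.
    action = json.loads(response.choices[0].message.content)
    return action["throttle"]


# Hypothetical control loop (env is a stand-in for a KSPDG interface):
# for _ in range(num_steps):
#     throttle = choose_action(env.get_screenshot(), env.get_mission_state())
#     env.apply_throttle(throttle)
```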
Related papers
- Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired [0.2410625015892047]
We introduce the Space-Aware Instruction Tuning (SAIT) dataset and the Space-Aware Benchmark (SA-Bench).
Our data generation pipeline focuses on the virtual path to the destination in 3D space and the surroundings.
We propose an evaluation protocol to assess VLM effectiveness in delivering walking guidance.
arXiv Detail & Related papers (2025-02-11T02:14:49Z)
- Fine-tuning LLMs for Autonomous Spacecraft Control: A Case Study Using Kerbal Space Program [42.87968485876435]
This study explores the use of fine-tuned Large Language Models (LLMs) for autonomous spacecraft control.
We demonstrate how these models can effectively control spacecraft using language-based inputs and outputs.
arXiv Detail & Related papers (2024-08-16T11:43:31Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations.
First, we present an automated pipeline to generate conversation-style instruction tuning data for robots from existing behavior cloning datasets.
We show that a VLM fine-tuned on a limited amount of such data can produce meaningful action decisions for robotic control.
arXiv Detail & Related papers (2024-06-28T17:59:12Z) - Language Models are Spacecraft Operators [36.943670587532026]
Large Language Models (LLMs) can act as autonomous agents that take actions based on the content of user text prompts.
We have developed a pure LLM-based solution for the Kerbal Space Program Differential Games (KSPDG) challenge.
arXiv Detail & Related papers (2024-03-30T16:43:59Z) - HyperLLaVA: Dynamic Visual and Language Expert Tuning for Multimodal Large Language Models [70.25499865569353]
We introduce HyperLLaVA, which involves adaptive tuning of the projector and LLM parameters, in conjunction with a dynamic visual expert and language expert.
Our solution significantly surpasses LLaVA on existing MLLM benchmarks, including MME, MMBench, SEED-Bench, and LLaVA-Bench.
arXiv Detail & Related papers (2024-03-20T09:42:43Z) - PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [140.14239499047977]
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding.
We propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT).
We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
arXiv Detail & Related papers (2024-02-12T18:33:47Z) - MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments [82.67236400004826]
We introduce the Multimodal Embodied Interactive Agent (MEIA), capable of translating high-level tasks expressed in natural language into a sequence of executable actions.
Its Multimodal Environment Memory (MEM) module enables MEIA to generate executable action plans based on diverse requirements and the robot's capabilities.
arXiv Detail & Related papers (2024-02-01T02:43:20Z)
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks [92.03764152132315]
We design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters.
This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks.
It has powerful visual capabilities and can be a good alternative to ViT-22B.
arXiv Detail & Related papers (2023-12-21T18:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.