OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation
- URL: http://arxiv.org/abs/2509.19480v1
- Date: Tue, 23 Sep 2025 18:40:29 GMT
- Title: OmniVLA: An Omni-Modal Vision-Language-Action Model for Robot Navigation
- Authors: Noriaki Hirose, Catherine Glossop, Dhruv Shah, Sergey Levine
- Abstract summary: We present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action backbone and trains with three primary goal modalities. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks.
- Score: 49.66156306240961
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Humans can flexibly interpret and compose different goal specifications, such as language instructions, spatial coordinates, or visual references, when navigating to a destination. In contrast, most existing robotic navigation policies are trained on a single modality, limiting their adaptability to real-world scenarios where different forms of goal specification are natural and complementary. In this work, we present a training framework for robotic foundation models that enables omni-modal goal conditioning for vision-based navigation. Our approach leverages a high-capacity vision-language-action (VLA) backbone and trains with three primary goal modalities: 2D poses, egocentric images, and natural language, as well as their combinations, through a randomized modality fusion strategy. This design not only expands the pool of usable datasets but also encourages the policy to develop richer geometric, semantic, and visual representations. The resulting model, OmniVLA, achieves strong generalization to unseen environments, robustness to scarce modalities, and the ability to follow novel natural language instructions. We demonstrate that OmniVLA outperforms specialist baselines across modalities and offers a flexible foundation for fine-tuning to new modalities and tasks. We believe OmniVLA provides a step toward broadly generalizable and flexible navigation policies, and a scalable path for building omni-modal robotic foundation models. We present videos showcasing OmniVLA performance and will release its checkpoints and training code on our project page.
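To make the randomized modality fusion strategy concrete, the sketch below shows one plausible way to implement it in PyTorch. This is a hypothetical illustration, not the authors' released code: the encoder modules, parameter names, and 0.5 drop probability are assumptions. The idea is that each goal modality (2D pose, egocentric goal image, natural language) is encoded separately, and during training the available modalities are randomly masked so the policy learns to act from any single modality or any combination.

```python
# Hypothetical sketch of randomized goal-modality fusion; not the authors' code.
# The per-modality encoders and the drop probability are illustrative assumptions.
import random
import torch
import torch.nn as nn

class OmniModalGoalEncoder(nn.Module):
    """Encodes whichever goal modalities (2D pose, goal image, language) are available."""

    def __init__(self, pose_enc: nn.Module, image_enc: nn.Module, text_enc: nn.Module, dim: int = 512):
        super().__init__()
        self.encoders = nn.ModuleDict({"pose": pose_enc, "image": image_enc, "text": text_enc})
        # Learned placeholder token used when a modality is missing or masked out.
        self.missing = nn.ParameterDict({k: nn.Parameter(torch.zeros(dim)) for k in self.encoders})

    def forward(self, goals: dict, drop_prob: float = 0.5) -> torch.Tensor:
        tokens = []
        for name, enc in self.encoders.items():
            available = name in goals
            # Randomized fusion: during training, randomly drop available goal
            # modalities so the policy learns to rely on any subset of them.
            masked = self.training and random.random() < drop_prob
            if available and not masked:
                tokens.append(enc(goals[name]))   # assumed to return shape (dim,)
            else:
                tokens.append(self.missing[name])  # placeholder, shape (dim,)
        # Stack into a sequence of goal tokens for the VLA backbone to attend over.
        return torch.stack(tokens, dim=0)          # shape (3, dim)
```

At inference time `self.training` is False, so no masking is applied and the encoder simply consumes whichever goal modalities the user provides, falling back to the learned placeholders for the rest.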
Related papers
- OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [54.661157616245966]
Open-world navigation requires robots to make decisions in complex everyday environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language. We propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models.
arXiv Detail & Related papers (2026-03-05T17:02:22Z) - OmniGAIA: Towards Native Omni-Modal AI Agents [103.79729735478924]
We introduce a benchmark designed to evaluate omni-modal agents on tasks requiring deep reasoning and multi-turn tool execution. We propose OmniAtlas, a native omni-modal foundation agent under a tool-integrated reasoning paradigm with active omni-modal perception.
arXiv Detail & Related papers (2026-02-26T11:35:04Z) - GeneralVLA: Generalizable Vision-Language-Action Models with Knowledge-Guided Trajectory Planning [20.646039344274556]
GeneralVLA is a hierarchical vision-language-action (VLA) model that more effectively exploits the generalization ability of foundation models. GeneralVLA successfully generates trajectories for 14 tasks, significantly outperforming state-of-the-art methods such as VoxPoser.
arXiv Detail & Related papers (2026-02-04T08:30:27Z) - VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation [16.279434375658457]
VAMOS is a hierarchical VLA that decouples semantic planning from embodiment grounding. We show that VAMOS achieves higher success rates in both indoor and complex outdoor navigation. The model significantly improves single-robot reliability, achieving 3X higher success rates by rejecting physically infeasible plans.
arXiv Detail & Related papers (2025-10-23T17:59:45Z) - Grounded Vision-Language Navigation for UAVs with Open-Vocabulary Goal Understanding [1.280979348722635]
Vision-and-language navigation (VLN) is a long-standing challenge in autonomous robotics, aiming to empower agents with the ability to follow human instructions while navigating complex environments. We propose Vision-Language Fly (VLFly), a framework tailored for Unmanned Aerial Vehicles (UAVs) to execute language-guided flight.
arXiv Detail & Related papers (2025-06-12T14:40:50Z) - UAV-VLN: End-to-End Vision Language guided Navigation for UAVs [0.0]
A core challenge in AI-guided autonomy is enabling agents to navigate realistically and effectively in previously unseen environments. We propose UAV-VLN, a novel end-to-end Vision-Language Navigation framework for Unmanned Aerial Vehicles (UAVs). Our system interprets free-form natural language instructions, grounds them in visual observations, and plans feasible aerial trajectories in diverse environments.
arXiv Detail & Related papers (2025-04-30T08:40:47Z) - FlexVLN: Flexible Adaptation for Diverse Vision-and-Language Navigation Tasks [13.969116430006215]
We propose FlexVLN, an innovative hierarchical approach to Vision-and-Language Navigation (VLN). It integrates the navigation ability of a supervised-learning-based Instruction Follower with the robust generalization ability of the LLM Planner. We use REVERIE, SOON, and CVDN-target as out-of-domain datasets for assessing generalization ability.
arXiv Detail & Related papers (2025-03-18T06:58:41Z) - Ola: Pushing the Frontiers of Omni-Modal Language Model [88.72389428177942]
We present Ola, an omni-modal language model that achieves competitive performance across image, video, and audio understanding. Ola incorporates advanced visual understanding and audio recognition capabilities through several critical and effective improvements. We aim to make Ola a fully open omni-modal understanding solution to advance future research in this emerging field.
arXiv Detail & Related papers (2025-02-06T18:59:55Z) - Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes. (A rough sketch of this kind of frozen patch-wise feature extraction appears after this list.)
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - OmniBench: Towards The Future of Universal Omni-Language Models [63.16606414452612]
We introduce OmniBench, a novel benchmark designed to evaluate models' ability to recognize, interpret, and reason across visual, acoustic, and textual inputs simultaneously. Our evaluation reveals that open-source OLMs show significant limitations in instruction-following and reasoning in tri-modal contexts. We advocate for developing more robust tri-modal integration techniques and training strategies to enhance OLM performance.
arXiv Detail & Related papers (2024-09-23T17:59:05Z) - DeepSeek-VL: Towards Real-World Vision-Language Understanding [24.57011093316788]
We present DeepSeek-VL, an open-source Vision-Language (VL) Model for real-world vision and language understanding applications.
Our approach is structured around three key dimensions: We strive to ensure our data is diverse, scalable, and extensively covers real-world scenarios.
We create a use case taxonomy from real user scenarios and construct an instruction tuning dataset.
arXiv Detail & Related papers (2024-03-08T18:46:00Z)
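The Flex entry above describes using pre-trained VLMs as frozen patch-wise feature extractors. As a rough, hypothetical sketch of that general idea (the CLIP checkpoint and preprocessing below are assumptions, not the paper's exact setup):

```python
# Hypothetical illustration of frozen patch-wise VLM features, loosely following
# the Flex description above; checkpoint and preprocessing are assumptions.
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel

MODEL = "openai/clip-vit-base-patch32"   # assumed checkpoint, for illustration only
vision = CLIPVisionModel.from_pretrained(MODEL).eval()
for p in vision.parameters():
    p.requires_grad = False               # freeze the vision tower

processor = CLIPImageProcessor.from_pretrained(MODEL)

@torch.no_grad()
def patch_features(image) -> torch.Tensor:
    """Return per-patch features of shape (num_patches, dim) for one PIL image."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    hidden = vision(pixel_values=pixel_values).last_hidden_state  # (1, 1 + P, dim)
    return hidden[0, 1:]                  # drop the CLS token, keep patch tokens
```

A lightweight control head (for example, a behavior-cloned policy) could then be trained on top of these features while the VLM itself stays frozen.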