VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving
- URL: http://arxiv.org/abs/2511.12405v1
- Date: Sun, 16 Nov 2025 00:55:28 GMT
- Title: VLA-R: Vision-Language Action Retrieval toward Open-World End-to-End Autonomous Driving
- Authors: Hyunki Seong, Seongwoo Moon, Hojin Ahn, Jehun Kang, David Hyunchul Shim
- Abstract summary: We present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving framework. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features. To learn transferable driving behaviors, we introduce a vision-action contrastive learning scheme.
- Score: 6.785438664749581
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Exploring open-world situations in an end-to-end manner is a promising yet challenging task due to the need for strong generalization capabilities. In particular, end-to-end autonomous driving in unstructured outdoor environments often encounters conditions that were unfamiliar during training. In this work, we present Vision-Language Action Retrieval (VLA-R), an open-world end-to-end autonomous driving (OW-E2EAD) framework that integrates open-world perception with a novel vision-action retrieval paradigm. We leverage a frozen vision-language model for open-world detection and segmentation to obtain multi-scale, prompt-guided, and interpretable perception features without domain-specific tuning. A Q-Former bottleneck aggregates fine-grained visual representations with language-aligned visual features, bridging perception and action domains. To learn transferable driving behaviors, we introduce a vision-action contrastive learning scheme that aligns vision-language and action embeddings for effective open-world reasoning and action retrieval. Our experiments on a real-world robotic platform demonstrate strong generalization and exploratory performance in unstructured, unseen environments, even with limited data. Demo videos are provided in the supplementary material.
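The vision-action retrieval paradigm described in the abstract is amenable to a CLIP-style formulation: matched vision and action embeddings are pulled together during training, and a driving action is recovered at test time by nearest-neighbor lookup in the shared embedding space. Below is a minimal PyTorch sketch of such a scheme; the module name `VisionActionAligner`, the dimensions, and the symmetric InfoNCE objective are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of vision-action contrastive alignment and retrieval.
# Names, dimensions, and the symmetric InfoNCE loss are assumptions; the
# paper's exact Q-Former configuration and losses are not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VisionActionAligner(nn.Module):
    def __init__(self, vis_dim=768, act_dim=64, embed_dim=256):
        super().__init__()
        # Projection heads mapping perception features (e.g., Q-Former query
        # outputs) and action trajectories into a shared embedding space.
        self.vis_proj = nn.Linear(vis_dim, embed_dim)
        self.act_proj = nn.Sequential(
            nn.Linear(act_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim)
        )
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07), CLIP-style

    def forward(self, vis_feats, actions):
        # vis_feats: (B, vis_dim) pooled vision-language features
        # actions:   (B, act_dim) flattened action trajectories
        v = F.normalize(self.vis_proj(vis_feats), dim=-1)
        a = F.normalize(self.act_proj(actions), dim=-1)
        logits = self.logit_scale.exp() * v @ a.t()          # (B, B) similarities
        targets = torch.arange(len(v), device=v.device)
        # Symmetric InfoNCE: matched vision-action pairs are the positives.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def retrieve_action(model, vis_feats, action_bank, action_embeds):
    # At test time, pick the stored action whose (precomputed, normalized)
    # embedding is closest to the current scene embedding.
    v = F.normalize(model.vis_proj(vis_feats), dim=-1)
    idx = (v @ action_embeds.t()).argmax(dim=-1)
    return action_bank[idx]
```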
Related papers
- OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [54.661157616245966]
Open-world navigation requires robots to make decisions in complex everyday environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language. We propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models.
arXiv Detail & Related papers (2026-03-05T17:02:22Z)
- Look, Zoom, Understand: The Robotic Eyeball for Embodied Perception [8.542874528320004]
Existing vision models and fixed RGB-D camera systems fail to reconcile wide-area coverage with fine-grained detail acquisition. We propose EyeVLA, a robotic eyeball for active visual perception that can take proactive actions based on instructions.
arXiv Detail & Related papers (2025-11-19T09:42:08Z)
- LMAD: Integrated End-to-End Vision-Language Model for Explainable Autonomous Driving [58.535516533697425]
Large vision-language models (VLMs) have shown promising capabilities in scene understanding. We propose a novel vision-language framework tailored for autonomous driving, called LMAD. Our framework emulates modern end-to-end driving paradigms by incorporating comprehensive scene understanding and a task-specialized structure with VLMs.
arXiv Detail & Related papers (2025-08-17T15:42:54Z)
- Imagine, Verify, Execute: Memory-guided Agentic Exploration with Vision-Language Models [81.08295968057453]
We present IVE, an agentic exploration framework inspired by human curiosity. We evaluate IVE in both simulated and real-world tabletop environments.
arXiv Detail & Related papers (2025-05-12T17:59:11Z)
- OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model [24.90085777003393]
We present OpenDriveVLA, a Vision-Language Action (VLA) model designed for end-to-end autonomous driving. OpenDriveVLA builds upon open-source pre-trained large Vision-Language Models (VLMs) to generate reliable driving actions.
arXiv Detail & Related papers (2025-03-30T14:45:54Z)
- InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving [3.8737986316149775]
We propose a novel end-to-end autonomous driving method called InsightDrive. It organizes perception by language-guided scene representation. In experiments, InsightDrive achieves state-of-the-art performance in end-to-end autonomous driving.
arXiv Detail & Related papers (2025-03-17T10:52:32Z)
- Hawk: Learning to Understand Open-World Video Anomalies [76.9631436818573]
Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs.
We introduce Hawk, a novel framework that leverages interactive large Visual Language Models (VLM) to interpret video anomalies precisely.
We have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users' open-world questions.
arXiv Detail & Related papers (2024-05-27T07:08:58Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
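MOKA's point-based affordances suggest a small, VLM-friendly interface between image-space keypoint predictions and physical robot actions. Below is a hypothetical sketch under a pinhole-camera assumption; the field names and the depth-based lifting step are illustrative, not MOKA's actual code.

```python
# Illustrative point-based affordance representation in the spirit of MOKA.
# Field names and the pixel-to-world conversion are assumptions; MOKA's
# actual interface with the VLM is not reproduced here.
from dataclasses import dataclass, field

@dataclass
class PointAffordance:
    grasp_px: tuple[int, int]               # grasp keypoint in image coordinates
    target_px: tuple[int, int]              # where the grasped object should go
    waypoints_px: list[tuple[int, int]] = field(default_factory=list)

    def to_world(self, depth, intrinsics):
        """Lift each 2D keypoint to a 3D point using a depth map and pinhole
        camera intrinsics, yielding targets a motion planner can consume."""
        fx, fy, cx, cy = intrinsics
        def lift(u, v):
            z = float(depth[v, u])          # metric depth at the pixel
            return ((u - cx) * z / fx, (v - cy) * z / fy, z)
        return [lift(*p) for p in (self.grasp_px, *self.waypoints_px, self.target_px)]
```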
- Training-Free Action Recognition and Goal Inference with Dynamic Frame Selection [51.004020874336284]
VidTFS is a training-free, open-vocabulary video goal and action inference framework.
Our experiments demonstrate that the proposed frame selection module improves the performance of the framework significantly.
We validate the performance of the proposed VidTFS on four widely used video datasets.
arXiv Detail & Related papers (2024-01-23T03:45:05Z)
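Training-free frame selection of this kind can be approximated by scoring frames against the goal text with a frozen CLIP model and keeping the top-k. The sketch below is one plausible realization under that assumption, not VidTFS's actual selection module.

```python
# Hypothetical training-free frame selection: score frames by similarity to
# a text goal with a frozen CLIP model and keep the k most relevant ones.
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

def select_frames(frames: list[Image.Image], goal: str, k: int = 8):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        text = model.encode_text(clip.tokenize([goal]).to(device))
        imgs = torch.stack([preprocess(f) for f in frames]).to(device)
        feats = model.encode_image(imgs)
        sims = torch.nn.functional.cosine_similarity(feats, text)
    # Return the k most goal-relevant frames in their original temporal order.
    keep = sims.topk(min(k, len(frames))).indices.sort().values
    return [frames[i] for i in keep]
```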
- Drive Anywhere: Generalizable End-to-end Autonomous Driving with Multi-modal Foundation Models [114.69732301904419]
We present an end-to-end, open-set (any environment/scene) autonomous driving approach that produces driving decisions from representations queryable by image and text.
Our approach demonstrates unparalleled results in diverse tests while achieving significantly greater robustness in out-of-distribution situations.
arXiv Detail & Related papers (2023-10-26T17:56:35Z)
- Contrastive Learning for Enhancing Robust Scene Transfer in Vision-based Agile Flight [21.728935597793473]
This work proposes an adaptive multi-pair contrastive learning strategy for visual representation learning that enables zero-shot scene transfer and real-world deployment.
We demonstrate the performance of our approach on the task of agile, vision-based quadrotor flight.
arXiv Detail & Related papers (2023-09-18T15:25:59Z)
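A multi-pair contrastive objective can be written as a multi-positive InfoNCE in which every view sharing a state label counts as a positive, encouraging scene-invariant features. The adaptive pair weighting of the cited paper is omitted, so this sketch is only an illustrative baseline, not their method.

```python
# Minimal multi-positive InfoNCE sketch: all views with the same state label
# are positives; the cited paper's adaptive weighting is not reproduced.
import torch
import torch.nn.functional as F

def multi_positive_info_nce(embeds, labels, temperature=0.1):
    # embeds: (N, D) embeddings; labels: (N,) state identities.
    z = F.normalize(embeds, dim=-1)
    sim = z @ z.t() / temperature                       # (N, N) similarities
    mask_self = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos = (labels[:, None] == labels[None, :]) & ~mask_self
    # Log-softmax over all non-self candidates for each anchor.
    log_prob = sim - torch.logsumexp(sim.masked_fill(mask_self, float('-inf')),
                                     dim=1, keepdim=True)
    # Average log-likelihood over each anchor's positive set.
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```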