Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models
- URL: http://arxiv.org/abs/2505.23757v1
- Date: Thu, 29 May 2025 17:59:46 GMT
- Title: Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models
- Authors: Haohan Chi, Huan-ang Gao, Ziming Liu, Jianing Liu, Chenyu Liu, Jinwei Li, Kaisen Yang, Yangcheng Yu, Zeda Wang, Wenyi Li, Leichen Wang, Xingtao Hu, Hao Sun, Hang Zhao, Hao Zhao
- Abstract summary: Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios. We introduce Impromptu VLA: over 80,000 meticulously curated video clips, distilled from over 2M source clips sourced from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories.
- Score: 31.566051946153802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips sourced from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks--improving closed-loop NeuroNCAP scores and collision rates, and reaching near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data and models are available at https://github.com/ahydchh/Impromptu-VLA.
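The open-loop nuScenes result mentioned in the abstract is typically reported as the average L2 distance between predicted and ground-truth ego trajectories at 1s/2s/3s horizons. The snippet below is a minimal sketch of that metric, not code from the Impromptu VLA repository; the 2 Hz waypoint sampling and the "average error up to each horizon" convention are assumptions (some papers instead report the error at the horizon point only).

```python
import numpy as np

def open_loop_l2(pred: np.ndarray, gt: np.ndarray, hz: float = 2.0) -> dict:
    """Average L2 error (meters) between predicted and ground-truth ego
    trajectories at 1s/2s/3s horizons.

    pred, gt: arrays of shape (T, 2) holding future (x, y) waypoints in the
    ego frame, sampled at `hz` waypoints per second (2 Hz is a common
    nuScenes planning setup; adjust if your trajectories differ).
    """
    assert pred.shape == gt.shape and pred.shape[1] == 2
    per_step = np.linalg.norm(pred - gt, axis=1)  # per-waypoint L2, shape (T,)
    out = {}
    for horizon_s in (1, 2, 3):
        idx = int(horizon_s * hz)                 # waypoints within this horizon
        out[f"L2@{horizon_s}s"] = float(per_step[:idx].mean())
    out["L2@avg"] = float(np.mean([out["L2@1s"], out["L2@2s"], out["L2@3s"]]))
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    gt = np.cumsum(rng.normal(1.0, 0.1, size=(6, 2)), axis=0)  # 3s of motion at 2 Hz
    pred = gt + rng.normal(0.0, 0.2, size=gt.shape)            # noisy prediction
    print(open_loop_l2(pred, gt))
```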
Related papers
- EdgeVLA: Efficient Vision-Language-Action Models [0.4005096060512278]
This paper introduces Edge VLA, a novel approach designed to significantly enhance the inference speed of Vision-Language-Action (VLA) models. We achieve this through two key innovations: 1) eliminating the autoregressive requirement for end-effector position prediction, leading to a 7x speedup in inference, and 2) leveraging the efficiency of Small Language Models (SLMs). Our early results demonstrate that EVLA achieves comparable training characteristics to OpenVLA while offering substantial gains in inference speed and memory efficiency.
arXiv Detail & Related papers (2025-07-18T16:15:09Z)
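EdgeVLA's first claim, dropping autoregressive decoding of the end-effector position, amounts to predicting all pose dimensions in one forward pass instead of one at a time. The toy sketch below only illustrates that difference in decoding cost; the tiny GRU decoder, hidden size, and 6-DoF output are hypothetical and are not EdgeVLA's actual architecture.

```python
import torch
import torch.nn as nn

D_MODEL, N_DIMS = 64, 6   # toy hidden size; 6-DoF end-effector pose (illustrative)

class ToyDecoder(nn.Module):
    """Stand-in for a VLA action decoder (not EdgeVLA's real network)."""
    def __init__(self):
        super().__init__()
        self.step = nn.GRUCell(D_MODEL, D_MODEL)
        self.head = nn.Linear(D_MODEL, N_DIMS)

    def autoregressive(self, ctx: torch.Tensor) -> torch.Tensor:
        # One decoder step per pose dimension: N_DIMS sequential forward passes.
        h, outs = ctx, []
        for i in range(N_DIMS):
            h = self.step(h, h)
            outs.append(self.head(h)[:, i])
        return torch.stack(outs, dim=1)

    def parallel(self, ctx: torch.Tensor) -> torch.Tensor:
        # Single forward pass predicting all pose dimensions jointly.
        return self.head(self.step(ctx, ctx))

if __name__ == "__main__":
    dec, ctx = ToyDecoder(), torch.randn(1, D_MODEL)
    with torch.no_grad():
        print("autoregressive:", dec.autoregressive(ctx).shape)  # (1, 6), 6 passes
        print("parallel:      ", dec.parallel(ctx).shape)        # (1, 6), 1 pass
```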
- CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models [89.44024245194315]
We introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into vision-language-action models (VLAs). We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks.
arXiv Detail & Related papers (2025-03-27T22:23:04Z)
- Expand VSR Benchmark for VLLM to Expertize in Spatial Rules [11.320245739677826]
Visual spatial reasoning is a basic part of human cognition that requires fine-grained perception across instances. Evaluation and optimization datasets of sufficient quantity and quality that specifically target visual positional reasoning are still lacking for Vision Large Language Models (VLLMs). We find that current VLLMs exhibit a contradiction: over-sensitivity to language instructions and under-sensitivity to visual positional information. To our knowledge, we are the first to controllably expand spatially positioned image data using diffusion models, and we integrate it with the original visual encoding.
arXiv Detail & Related papers (2024-12-24T07:13:17Z)
- VidLPRO: A Video-Language Pre-training Framework for Robotic and Laparoscopic Surgery [4.12931136981508]
We introduce VidLPRO, a novel video-language (VL) pre-training framework designed specifically for robotic and laparoscopic surgery.
VidLPRO integrates video-text contrastive learning, video-text matching, and masked language modeling objectives to learn rich VL representations.
Our model demonstrates improvements of up to 21.5% in accuracy and 15.7% in F1 score, setting a new benchmark in the field.
arXiv Detail & Related papers (2024-09-07T06:33:12Z)
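VidLPRO's training objective combines video-text contrastive learning, video-text matching, and masked language modeling. The sketch below shows only the generic pattern of summing such losses with weights; the embeddings, loss weights, and shapes are placeholders, not VidLPRO's actual model or hyperparameters.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over matched (video, text) pairs in a batch."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    labels = torch.arange(v.size(0), device=v.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

def total_pretraining_loss(video_emb, text_emb, match_logits, match_labels,
                           mlm_logits, mlm_labels, w=(1.0, 1.0, 1.0)):
    """Weighted sum of contrastive, video-text matching, and MLM terms.

    match_logits: (B, 2) matched/unmatched pair scores; mlm_logits: (N, vocab)
    over masked token positions. The equal weights are illustrative defaults.
    """
    l_con = contrastive_loss(video_emb, text_emb)
    l_vtm = F.cross_entropy(match_logits, match_labels)
    l_mlm = F.cross_entropy(mlm_logits, mlm_labels)
    return w[0] * l_con + w[1] * l_vtm + w[2] * l_mlm

if __name__ == "__main__":
    B, D, V, N = 4, 32, 100, 7
    loss = total_pretraining_loss(
        torch.randn(B, D), torch.randn(B, D),
        torch.randn(B, 2), torch.randint(0, 2, (B,)),
        torch.randn(N, V), torch.randint(0, V, (N,)),
    )
    print(float(loss))
```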
- Concept-skill Transferability-based Data Selection for Large Vision-Language Models [56.0725292404808]
We introduce COINCIDE, an effective and scalable data selection technique for training vision-language models.
We cluster the training data using internal activations from a small model, which identifies concept-skill compositions needed by a target LVLM.
Experiments demonstrate that COINCIDE achieves superior performance and data selection efficiency against 8 strong baselines.
arXiv Detail & Related papers (2024-06-16T16:15:20Z)
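The COINCIDE entry describes clustering training examples by a small reference model's internal activations and then sampling across clusters. The sketch below shows that generic recipe with k-means over precomputed feature vectors; the feature source, cluster count, and per-cluster quota are placeholders, and the paper's actual method additionally scores concept-skill transferability between clusters.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_by_clustering(features: np.ndarray, budget: int, n_clusters: int = 8,
                         seed: int = 0) -> np.ndarray:
    """Pick `budget` example indices, spread across activation clusters.

    features: (N, D) activations of training examples from a small reference
    model (precomputed elsewhere). Returns indices of selected examples.
    """
    km = KMeans(n_clusters=n_clusters, random_state=seed, n_init="auto").fit(features)
    per_cluster = max(1, budget // n_clusters)
    rng = np.random.default_rng(seed)
    chosen = []
    for c in range(n_clusters):
        members = np.flatnonzero(km.labels_ == c)
        take = min(per_cluster, len(members))
        chosen.extend(rng.choice(members, size=take, replace=False))
    return np.array(chosen[:budget])

if __name__ == "__main__":
    feats = np.random.default_rng(1).normal(size=(500, 64)).astype(np.float32)
    idx = select_by_clustering(feats, budget=64)
    print(idx.shape, idx[:10])
```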
- OpenVLA: An Open-Source Vision-Language-Action Model [131.74098076670103]
We introduce OpenVLA, an open-source VLA trained on a diverse collection of 970k real-world robot demonstrations.
OpenVLA shows strong results for generalist manipulation, outperforming closed models such as RT-2-X (55B) by 16.5% in absolute task success rate.
We release model checkpoints, fine-tuning notebooks, and our PyTorch codebase with built-in support for training VLAs at scale on Open X-Embodiment datasets.
arXiv Detail & Related papers (2024-06-13T15:46:55Z)
- Enhancing End-to-End Autonomous Driving with Latent World Model [78.22157677787239]
We propose a novel self-supervised learning approach using the LAtent World model (LAW) for end-to-end driving. LAW predicts future scene features based on current features and ego trajectories. This self-supervised task can be seamlessly integrated into perception-free and perception-based frameworks.
arXiv Detail & Related papers (2024-06-12T17:59:21Z)
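The LAW entry describes predicting future scene features from current features and the ego trajectory as a self-supervised objective. The sketch below shows that objective in its simplest form, an MLP predictor trained with an MSE loss on next-frame features; the feature extractor is omitted and every dimension here is a hypothetical stand-in, not LAW's actual design.

```python
import torch
import torch.nn as nn

FEAT_DIM, TRAJ_DIM = 128, 12  # illustrative: scene feature size, flattened ego waypoints

class LatentWorldModel(nn.Module):
    """Toy latent world model: f(current features, ego trajectory) -> future features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(FEAT_DIM + TRAJ_DIM, 256), nn.ReLU(),
            nn.Linear(256, FEAT_DIM),
        )

    def forward(self, feat_t: torch.Tensor, ego_traj: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([feat_t, ego_traj], dim=-1))

if __name__ == "__main__":
    model = LatentWorldModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    # Dummy batch: features at time t, ego trajectory, and features at t+1
    feat_t = torch.randn(8, FEAT_DIM)
    ego = torch.randn(8, TRAJ_DIM)
    feat_t1 = torch.randn(8, FEAT_DIM)            # in practice: encoder output at t+1
    pred = model(feat_t, ego)
    loss = nn.functional.mse_loss(pred, feat_t1)  # self-supervised target: future features
    loss.backward()
    opt.step()
    print("world-model loss:", float(loss))
```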
- SearchLVLMs: A Plug-and-Play Framework for Augmenting Large Vision-Language Models by Searching Up-to-Date Internet Knowledge [56.772051051558215]
Large vision-language models (LVLMs), such as the LLaVA series, are unaware of up-to-date knowledge because they cannot be updated frequently.
We propose SearchLVLMs, a plug-and-play framework for augmenting existing LVLMs to handle visual question answering (VQA) about up-to-date knowledge.
arXiv Detail & Related papers (2024-05-23T13:32:07Z)
- CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario [19.730287885060633]
Traffic Safety Description and Analysis plays a pivotal role in applications ranging from insurance inspection to accident prevention.
This paper introduces CityLLaVA, a novel fine-tuning framework for Visual Language Models (VLMs) designed for urban scenarios.
arXiv Detail & Related papers (2024-05-06T06:38:49Z)
- Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles [83.41551911845157]
This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models.
We propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE).
For better commonsense evaluation, we propose the first retrieval-based commonsense diagnostic benchmark.
arXiv Detail & Related papers (2022-11-29T18:59:59Z)
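DANCE augments training data by linearizing knowledge-graph content into text. The sketch below shows the basic linearization step, turning (head, relation, tail) triples into declarative sentences that can be paired with images as augmented samples; the triples and relation templates are made up for illustration and are not the paper's actual pipeline.

```python
# Hypothetical relation -> sentence templates (not DANCE's actual templates).
TEMPLATES = {
    "is_a": "{head} is a kind of {tail}.",
    "used_for": "{head} is used for {tail}.",
    "capable_of": "{head} is capable of {tail}.",
}

def linearize(triples):
    """Turn (head, relation, tail) knowledge-graph triples into sentences."""
    sentences = []
    for head, rel, tail in triples:
        template = TEMPLATES.get(rel, "{head} {rel} {tail}.")
        sentences.append(template.format(head=head, rel=rel.replace("_", " "), tail=tail))
    return sentences

if __name__ == "__main__":
    kg = [
        ("umbrella", "used_for", "staying dry in the rain"),
        ("golden retriever", "is_a", "dog"),
        ("stove", "capable_of", "heating food"),
    ]
    for s in linearize(kg):
        print(s)  # e.g. pair each sentence with a matching image as augmented data
```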
- ConStruct-VL: Data-Free Continual Structured VL Concepts Learning [57.86651057895222]
We introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark.
We propose a data-free method built around a new approach, Adversarial Pseudo-Replay (APR), which generates adversarial reminders of past tasks from past task models.
We show this approach outperforms all data-free methods by as much as 7%, while even matching some levels of experience replay.
arXiv Detail & Related papers (2022-11-17T18:57:03Z)
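APR, as summarized above, generates adversarial "reminders" of past tasks from a frozen past-task model instead of storing old data. The sketch below shows one generic way to realize that idea: nudge current inputs with a few FGSM-style gradient steps so the past model becomes confident about them, then distill from the past model on those inputs. The models, step size, and loss choices are illustrative stand-ins, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def make_pseudo_replay(past_model, x, steps: int = 3, eps: float = 0.03):
    """Perturb current inputs so the frozen past model assigns them
    low-entropy (confident) predictions; use the result as replay data."""
    past_model.eval()
    x_adv = x.clone().detach().requires_grad_(True)
    for _ in range(steps):
        logits = past_model(x_adv)
        probs = F.softmax(logits, dim=-1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
        grad, = torch.autograd.grad(entropy, x_adv)
        x_adv = (x_adv - eps * grad.sign()).detach().requires_grad_(True)  # descend entropy
    with torch.no_grad():
        targets = F.softmax(past_model(x_adv), dim=-1)  # soft labels to distill from
    return x_adv.detach(), targets

def replay_distillation_loss(current_model, x_adv, targets):
    """KL divergence between the current model and the past model on pseudo-replay inputs."""
    log_p = F.log_softmax(current_model(x_adv), dim=-1)
    return F.kl_div(log_p, targets, reduction="batchmean")

if __name__ == "__main__":
    past = torch.nn.Linear(16, 4)     # stand-ins for past/current task models
    current = torch.nn.Linear(16, 4)
    x = torch.randn(8, 16)
    x_adv, soft = make_pseudo_replay(past, x)
    print("replay loss:", float(replay_distillation_loss(current, x_adv, soft)))
```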