Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI
- URL: http://arxiv.org/abs/2505.05895v3
- Date: Tue, 05 Aug 2025 16:23:24 GMT
- Title: Leveraging Vision-Language Models for Visual Grounding and Analysis of Automotive UI
- Authors: Benjamin Raphael Ernhofer, Daniil Prokhorov, Jannica Langner, Dominik Bollmann,
- Abstract summary: This work introduces a vision-language framework to facilitate the understanding of and interaction with automotive UIs. To support research in this field, AutomotiveUI-Bench-4K, an open-source dataset comprising 998 images with 4,208 annotations, is also released. A Molmo-7B-based model is fine-tuned using Low-Rank Adaptation (LoRA), incorporating generated reasoning along with visual grounding and evaluation capabilities.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Modern automotive infotainment systems necessitate intelligent and adaptive solutions to manage frequent User Interface (UI) updates and diverse design variations. This work introduces a vision-language framework to facilitate the understanding of and interaction with automotive UIs, enabling seamless adaptation across different UI designs. To support research in this field, AutomotiveUI-Bench-4K, an open-source dataset comprising 998 images with 4,208 annotations, is also released. Additionally, a data pipeline for generating training data is presented. A Molmo-7B-based model is fine-tuned using Low-Rank Adaptation (LoRA), incorporating generated reasoning along with visual grounding and evaluation capabilities. The fine-tuned Evaluative Large Action Model (ELAM) achieves strong performance on AutomotiveUI-Bench-4K (model and dataset are available on Hugging Face). The approach demonstrates strong cross-domain generalization, including a +5.6% improvement on ScreenSpot over the baseline model. An average accuracy of 80.8% is achieved on ScreenSpot, closely matching or surpassing specialized models for desktop, mobile, and web, despite being trained primarily on the automotive domain. This research investigates how data collection and subsequent fine-tuning can lead to AI-driven advancements in automotive UI understanding and interaction. The applied method is cost-efficient, and fine-tuned models can be deployed on consumer-grade GPUs.
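The fine-tuning recipe described in the abstract (LoRA adapters on a Molmo-7B backbone, trained on generated reasoning, grounding, and evaluation data) follows a standard parameter-efficient pattern. The sketch below illustrates that pattern with the Hugging Face PEFT library; it is a minimal illustration, not the authors' released training code, and the checkpoint name, hyperparameters, and target module names are assumptions that would have to be matched to the actual Molmo implementation.

```python
# Minimal, illustrative LoRA fine-tuning setup (an assumption-laden sketch,
# not the authors' released code). Model id, hyperparameters, and
# target_modules are placeholders to be adapted to the actual Molmo layers.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
from peft import LoraConfig, get_peft_model

model_id = "allenai/Molmo-7B-D-0924"  # a public Molmo-7B checkpoint (assumed)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

lora_config = LoraConfig(
    r=16,               # adapter rank (assumed)
    lora_alpha=32,      # scaling factor (assumed)
    lora_dropout=0.05,
    bias="none",
    # Placeholder names: replace with the attention/MLP projection modules
    # actually present in the Molmo architecture.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# Training would then apply a standard causal language-modeling loss over
# (screenshot, instruction) inputs and (reasoning + grounding/evaluation)
# targets, e.g. with transformers.Trainer or a custom loop.
```

Because only the low-rank adapter weights are updated, a setup of this kind is consistent with the paper's stated goal of cost-efficient training and deployment on consumer-grade GPUs.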
Related papers
- Data Scaling Laws for End-to-End Autonomous Driving [83.85463296830743]
We evaluate the performance of a simple end-to-end driving architecture on internal driving datasets ranging in size from 16 to 8192 hours. Specifically, we investigate how much additional training data is needed to achieve a target performance gain.
arXiv Detail & Related papers (2025-04-06T03:23:48Z)
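As a toy illustration of the data-budget question raised in the Data Scaling Laws entry above (how much additional training data is needed for a target gain), one can fit a power-law scaling curve to observed scores and extrapolate. The numbers below are invented for demonstration and are not taken from the paper.

```python
# Hypothetical example: fit score ~ a * hours^b and solve for the training
# hours needed to reach a target score. All values are made up.
import numpy as np

hours = np.array([16, 64, 256, 1024, 4096, 8192], dtype=float)  # dataset sizes (hours)
score = np.array([0.31, 0.42, 0.51, 0.60, 0.68, 0.71])          # hypothetical driving scores

# Least-squares fit in log-log space: log(score) = log(a) + b * log(hours)
b, log_a = np.polyfit(np.log(hours), np.log(score), deg=1)
a = float(np.exp(log_a))

target = 0.75
required_hours = (target / a) ** (1.0 / b)
print(f"fit: score ~ {a:.3f} * hours^{b:.3f}")
print(f"estimated hours for score {target}: {required_hours:,.0f}")
```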
- EfficientLLaVA: Generalizable Auto-Pruning for Large Vision-language Models [64.18350535770357]
We propose an automatic pruning method for large vision-language models to enhance the efficiency of multimodal reasoning. Our approach only leverages a small number of samples to search for the desired pruning policy. We conduct extensive experiments on the ScienceQA, VizWiz, MM-Vet, and LLaVA-Bench datasets for the task of visual question answering.
arXiv Detail & Related papers (2025-03-19T16:07:04Z)
- UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.18100825673032]
This paper introduces UI-TARS, a native GUI agent model that solely perceives screenshots as input and performs human-like interactions. In the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
arXiv Detail & Related papers (2025-01-21T17:48:10Z)
- AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving [68.73885845181242]
We propose an Automatic Data Engine (AIDE) that automatically identifies issues, efficiently curates data, improves the model through auto-labeling, and verifies the model through generation of diverse scenarios.
We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms, demonstrating our method's superior performance at a reduced cost.
arXiv Detail & Related papers (2024-03-26T04:27:56Z)
- LingoQA: Visual Question Answering for Autonomous Driving [14.620546951115328]
We introduce LingoQA, a novel dataset and benchmark for visual question answering in autonomous driving.
The dataset contains 28K unique short video scenarios and 419K annotations.
On our benchmark, vision-language models respond truthfully to 59.6% of the questions compared to 96.6% for humans.
arXiv Detail & Related papers (2023-12-21T18:40:34Z)
- Trajeglish: Traffic Modeling as Next-Token Prediction [67.28197954427638]
A longstanding challenge for self-driving development is simulating dynamic driving scenarios seeded from recorded driving logs.
We apply tools from discrete sequence modeling to model how vehicles, pedestrians and cyclists interact in driving scenarios.
Our model tops the Sim Agents Benchmark, surpassing prior work along the realism meta metric by 3.3% and along the interaction metric by 9.9%.
arXiv Detail & Related papers (2023-12-07T18:53:27Z)
- Tryage: Real-time, intelligent Routing of User Prompts to Large Language Models [1.0878040851637998]
With over 200,000 models in the Hugging Face ecosystem, users grapple with selecting and optimizing models to suit multifaceted workflows and data domains.
Here, we propose a context-aware routing system, Tryage, that leverages a language model router for optimal selection of expert models from a model library.
arXiv Detail & Related papers (2023-08-22T17:48:24Z)
- Multi-modal Machine Learning for Vehicle Rating Predictions Using Image, Text, and Parametric Data [3.463438487417909]
We propose a multi-modal learning model for accurate vehicle rating predictions.
The model simultaneously learns features from the parametric specifications, text descriptions, and images of vehicles.
We find that the multi-modal model's explanatory power is 4% - 12% higher than that of the unimodal models.
arXiv Detail & Related papers (2023-05-24T14:58:49Z)
- Masked World Models for Visual Control [90.13638482124567]
We introduce a visual model-based RL framework that decouples visual representation learning and dynamics learning.
We demonstrate that our approach achieves state-of-the-art performance on a variety of visual robotic tasks.
arXiv Detail & Related papers (2022-06-28T18:42:27Z)
- AutoFIS: Automatic Feature Interaction Selection in Factorization Models for Click-Through Rate Prediction [75.16836697734995]
We propose a two-stage algorithm called Automatic Feature Interaction Selection (AutoFIS).
AutoFIS can automatically identify important feature interactions for factorization models with computational cost just equivalent to training the target model to convergence.
AutoFIS has been deployed onto the training platform of Huawei App Store recommendation service.
arXiv Detail & Related papers (2020-03-25T06:53:54Z)