Related papers: CARE Drive A Framework for Evaluating Reason-Responsiveness of Vision Language Models in Automated Driving

URL: http://arxiv.org/abs/2602.15645v1
Date: Tue, 17 Feb 2026 15:13:36 GMT
Title: CARE Drive A Framework for Evaluating Reason-Responsiveness of Vision Language Models in Automated Driving
Authors: Lucas Elbert Suryana, Farah Bierenga, Sanne van Buuren, Pepijn Kooij, Elsefien Tulleners, Federico Scari, Simeon Calvert, Bart van Arem, Arkady Zgonnikov
Abstract summary: CARE Drive is a framework for evaluating reason responsiveness in vision language models applied to automated driving. It compares baseline and reason augmented model decisions under controlled contextual variation to assess whether human reasons causally influence decision behavior. Results show that explicit human reasons significantly influence model decisions, improving alignment with expert recommended behavior.
Score: 3.5279672254773353
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Foundation models, including vision language models, are increasingly used in automated driving to interpret scenes, recommend actions, and generate natural language explanations. However, existing evaluation methods primarily assess outcome based performance, such as safety and trajectory accuracy, without determining whether model decisions reflect human relevant considerations. As a result, it remains unclear whether explanations produced by such models correspond to genuine reason responsive decision making or merely post hoc rationalizations. This limitation is especially significant in safety critical domains because it can create false confidence. To address this gap, we propose CARE Drive, Context Aware Reasons Evaluation for Driving, a model agnostic framework for evaluating reason responsiveness in vision language models applied to automated driving. CARE Drive compares baseline and reason augmented model decisions under controlled contextual variation to assess whether human reasons causally influence decision behavior. The framework employs a two stage evaluation process. Prompt calibration ensures stable outputs. Systematic contextual perturbation then measures decision sensitivity to human reasons such as safety margins, social pressure, and efficiency constraints. We demonstrate CARE Drive in a cyclist overtaking scenario involving competing normative considerations. Results show that explicit human reasons significantly influence model decisions, improving alignment with expert recommended behavior. However, responsiveness varies across contextual factors, indicating uneven sensitivity to different types of reasons. These findings provide empirical evidence that reason responsiveness in foundation models can be systematically evaluated without modifying model parameters.
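The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the context factors, the reason text, the expert rule, and `stub_model` (a deterministic stand-in for a real vision language model call) are all hypothetical, chosen only to show how baseline and reason augmented decisions might be compared across a grid of controlled contexts.

```python
from itertools import product

# Hypothetical contextual factors for a cyclist overtaking scene (illustrative values).
CONTEXT_FACTORS = {
    "oncoming_gap_s": [4, 8],           # time gap to oncoming traffic, in seconds
    "following_vehicle": [False, True], # social pressure from a car behind
}

# Hypothetical human reasons that would be appended to the prompt.
REASONS = "Keep a safe margin to the cyclist; do not let pressure from behind override safety."


def expert_decision(ctx):
    """Assumed expert rule: overtake only when the oncoming gap is large."""
    return "overtake" if ctx["oncoming_gap_s"] >= 8 else "stay_behind"


def stub_model(ctx, reasons=None):
    """Stand-in for a VLM query; a real study would send the scene and prompt to a model."""
    large_gap = ctx["oncoming_gap_s"] >= 8
    if reasons is None:
        # Baseline stub: gives in to social pressure even when the gap is small.
        return "overtake" if large_gap or ctx["following_vehicle"] else "stay_behind"
    # Reason augmented stub: follows the safety reason and ignores pressure.
    return "overtake" if large_gap else "stay_behind"


def evaluate(model, reasons):
    """Compare baseline and reason augmented decisions over every context combination."""
    names = list(CONTEXT_FACTORS)
    contexts = [dict(zip(names, values)) for values in product(*CONTEXT_FACTORS.values())]
    baseline = [model(c) for c in contexts]
    augmented = [model(c, reasons) for c in contexts]
    expert = [expert_decision(c) for c in contexts]
    n = len(contexts)
    return {
        # Share of contexts where adding reasons changed the decision.
        "flip_rate": sum(b != a for b, a in zip(baseline, augmented)) / n,
        # Agreement with the assumed expert rule, without and with reasons.
        "baseline_alignment": sum(b == e for b, e in zip(baseline, expert)) / n,
        "augmented_alignment": sum(a == e for a, e in zip(augmented, expert)) / n,
    }


result = evaluate(stub_model, REASONS)
print(result)
```

With a real model, each context would typically be queried several times so that decision changes can be separated from sampling noise, which is the role the abstract assigns to prompt calibration.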

Related papers

On the Assessment of Sensitivity of Autonomous Vehicle Perception [0.13858851827255522]
The viability of automated driving is heavily dependent on the performance of perception systems. We evaluate perception performance using predictive sensitivity quantification based on an ensemble of models. A perception assessment criterion is developed based on an AV's stopping distance at a stop sign on varying road surfaces.
arXiv Detail & Related papers (2026-01-30T21:06:05Z)
AutoDriDM: An Explainable Benchmark for Decision-Making of Vision-Language Models in Autonomous Driving [26.866150191410032]
We present AutoDriDM, a decision-centric, progressive benchmark with 6,650 questions across three dimensions: Object, Scene, and Decision. We evaluate mainstream vision-language models to delineate the perception-to-decision capability boundary in autonomous driving. We conduct explainability analyses of models' reasoning processes, identifying key failure modes such as logical reasoning errors.
arXiv Detail & Related papers (2026-01-21T06:29:09Z)
ARISE: An Adaptive Resolution-Aware Metric for Test-Time Scaling Evaluation in Large Reasoning Models [102.4511331368587]
ARISE (Adaptive Resolution-aware Scaling Evaluation) is a novel metric designed to assess the test-time scaling effectiveness of large reasoning models. We conduct comprehensive experiments evaluating state-of-the-art reasoning models across diverse domains.
arXiv Detail & Related papers (2025-10-07T15:10:51Z)
Rationales Are Not Silver Bullets: Measuring the Impact of Rationales on Model Performance and Reliability [70.4107059502882]
Training language models with rationale augmentation has been shown to be beneficial in many existing works. We conduct comprehensive investigations to thoroughly inspect the impact of rationales on model performance.
arXiv Detail & Related papers (2025-05-30T02:39:37Z)
ReasonDrive: Efficient Visual Question Answering for Autonomous Vehicles with Reasoning-Enhanced Small Vision-Language Models [9.316712964093506]
Vision-language models (VLMs) show promise for autonomous driving but often lack transparent reasoning capabilities that are critical for safety. We investigate whether explicitly modeling reasoning during fine-tuning enhances VLM performance on driving decision tasks.
arXiv Detail & Related papers (2025-04-14T23:16:07Z)
VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an automated framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) are used to automate decision-making tasks. In this paper, we evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types. These benchmarks allow us to isolate the ability of LLMs to accurately predict changes resulting from their ability to memorize facts or find other shortcuts.
arXiv Detail & Related papers (2024-04-08T14:15:56Z)
DME-Driver: Integrating Human Decision Logic and 3D Scene Perception in Autonomous Driving [65.04871316921327]
This paper introduces a new autonomous driving system that enhances the performance and reliability of autonomous driving systems. DME-Driver utilizes a powerful vision language model as the decision-maker and a planning-oriented perception model as the control signal generator. By leveraging this dataset, our model achieves high-precision planning accuracy through a logical thinking process.
arXiv Detail & Related papers (2024-01-08T03:06:02Z)
Empirical Estimates on Hand Manipulation are Recoverable: A Step Towards Individualized and Explainable Robotic Support in Everyday Activities [80.37857025201036]
A key challenge for robotic systems is to figure out the behavior of another agent. Processing correct inferences is especially challenging when (confounding) factors are not controlled experimentally. We propose equipping robots with the necessary tools to conduct observational studies on people.
arXiv Detail & Related papers (2022-01-27T22:15:56Z)
Reason induced visual attention for explainable autonomous driving [2.090380922731455]
Deep learning (DL) based computer vision (CV) models are generally considered black boxes due to poor interpretability. This study is motivated by the need to enhance the interpretability of DL models in autonomous driving. The proposed framework imitates the learning process of human drivers by jointly modeling the visual input (images) and natural language.
arXiv Detail & Related papers (2021-10-11T18:50:41Z)
Modeling Perception Errors towards Robust Decision Making in Autonomous Vehicles [11.503090828741191]
We propose a simulation-based methodology towards answering the question: is a perception subsystem sufficient for the decision making subsystem to make robust, safe decisions? We show how to analyze the impact of different kinds of sensing and perception errors on the behavior of the autonomous system.
arXiv Detail & Related papers (2020-01-31T08:02:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.