E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
- URL: http://arxiv.org/abs/2512.04733v1
- Date: Thu, 04 Dec 2025 12:17:25 GMT
- Title: E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving
- Authors: Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu
- Abstract summary: We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions.
- Score: 56.50212124887739
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and human-centric feedback.
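The abstract names two components but gives no implementation details, so the following is only a minimal sketch of one plausible reading: a small MLP head regressing continuous VAD scores from a pooled command embedding, and a learned gate fusing egocentric and allocentric features. The hidden size (768), the tanh output range, and the gating fusion are all illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (assumptions, not the authors' code) of E3AD's two components.
import torch
import torch.nn as nn

class VADHead(nn.Module):
    """Regresses continuous Valence-Arousal-Dominance scores from a pooled
    language embedding; tanh keeps each dimension in [-1, 1] (assumed range)."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.GELU(),
            nn.Linear(hidden_dim // 2, 3),  # (valence, arousal, dominance)
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.mlp(text_emb))

class DualPathwayFusion(nn.Module):
    """Fuses egocentric (camera-view) and allocentric (map/BEV-view) features
    with a learned gate; a stand-in for the paper's spatial reasoning module."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim), nn.Sigmoid())

    def forward(self, ego: torch.Tensor, allo: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([ego, allo], dim=-1))
        return g * ego + (1.0 - g) * allo

# Toy usage: a downstream planner would condition on both outputs.
text_emb = torch.randn(1, 768)           # pooled command embedding (placeholder)
ego, allo = torch.randn(1, 768), torch.randn(1, 768)
vad = VADHead()(text_emb)                # e.g. high arousal -> more urgent plan
spatial = DualPathwayFusion()(ego, allo)
```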
Related papers
- AUHead: Realistic Emotional Talking Head Generation via Action Units Control [67.20660861826357]
Realistic talking-head video generation is critical for virtual avatars, film production, and interactive systems. Current methods struggle with nuanced emotional expressions due to the lack of fine-grained emotion control. We introduce a novel two-stage method to disentangle emotion control, i.e. Action Units (AUs), from audio and achieve controllable generation.
arXiv Detail & Related papers (2026-02-10T08:45:51Z) - A Unified Spoken Language Model with Injected Emotional-Attribution Thinking for Human-like Interaction [50.05919688888947]
This paper presents a unified spoken language model for emotional intelligence, enhanced by a novel data construction strategy termed Injected Emotional-Attribution Thinking (IEAT). IEAT incorporates user emotional states and their underlying causes into the model's internal reasoning process, enabling emotion-aware reasoning to be internalized rather than treated as explicit supervision. Experiments on the Human-like Spoken Dialogue Systems Challenge (HumDial) Emotional Intelligence benchmark demonstrate that the proposed approach achieves top-ranked performance across emotional trajectory modeling, emotional reasoning, and empathetic response generation.
arXiv Detail & Related papers (2026-01-08T14:07:30Z) - Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles [34.698147360764104]
ThinkDeeper is a framework that reasons about future spatial states before making grounding decisions. It ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought pipeline.
arXiv Detail & Related papers (2025-12-03T05:14:16Z) - Percept-WAM: Perception-Enhanced World-Awareness-Action Model for Robust End-to-End Autonomous Driving [48.512353531499286]
We introduce Percept-WAM, a perception-enhanced World-Awareness-Action Model that implicitly integrates 2D/3D scene understanding abilities within a single vision-language model (VLM). We propose a grid-conditioned prediction mechanism for dense object perception, incorporating IoU-aware scoring and parallel autoregressive decoding, improving stability in long-tail, far-range, and small-object scenarios. Experiments show that Percept-WAM matches or surpasses classical detectors and segmenters on downstream perception benchmarks, achieving 51.7/58.9 mAP on 2D detection and nuScenes BEV 3D detection
arXiv Detail & Related papers (2025-11-24T15:28:25Z) - StyleDrive: Towards Driving-Style Aware Benchmarking of End-To-End Autonomous Driving [7.525510086747996]
Personalization has been largely overlooked in the context of end-to-end autonomous driving (E2EAD). We introduce the first large-scale real-world dataset explicitly curated for personalized E2EAD, along with the first standardized benchmark for systematically evaluating personalized E2EAD models.
arXiv Detail & Related papers (2025-06-30T15:48:38Z) - ReCogDrive: A Reinforced Cognitive Framework for End-to-End Autonomous Driving [49.07731497951963]
ReCogDrive is a novel Reinforced Cognitive framework for end-to-end autonomous driving. We introduce a hierarchical data pipeline that mimics the sequential cognitive process of human drivers. We then address the language-action mismatch by injecting the VLM's learned driving priors into a diffusion planner.
arXiv Detail & Related papers (2025-06-09T03:14:04Z) - InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving [3.8737986316149775]
We propose a novel end-to-end autonomous driving method called InsightDrive, which organizes perception around a language-guided scene representation. In experiments, InsightDrive achieves state-of-the-art performance in end-to-end autonomous driving.
arXiv Detail & Related papers (2025-03-17T10:52:32Z) - Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning [24.511628941825116]
We introduce Sce2DriveX, a human-like driving chain-of-thought (CoT) reasoning framework. It reconstructs the implicit cognitive chain inherent in human driving, covering scene understanding, meta-action reasoning, behavior interpretation analysis, motion planning, and control. It achieves state-of-the-art performance from scene understanding to end-to-end driving, as well as robust generalization on the CARLA Bench2Drive benchmark.
arXiv Detail & Related papers (2025-02-19T09:50:44Z) - From Rational Answers to Emotional Resonance: The Role of Controllable Emotion Generation in Language Models [16.350658746140788]
Large language models (LLMs) struggle to express emotions in a consistent, controllable, and contextually appropriate manner. We propose a controllable emotion generation framework based on Emotion Vectors (EVs). Our method enables fine-grained, continuous modulation of emotional tone without any additional training or architectural modification (a minimal sketch of this steering idea appears after this list).
arXiv Detail & Related papers (2025-02-06T13:38:57Z) - DiFSD: Ego-Centric Fully Sparse Paradigm with Uncertainty Denoising and Iterative Refinement for Efficient End-to-End Self-Driving [55.53171248839489]
We propose an ego-centric fully sparse paradigm, named DiFSD, for end-to-end self-driving. Specifically, DiFSD mainly consists of sparse perception, hierarchical interaction, and an iterative motion planner. Experiments conducted on the nuScenes and Bench2Drive datasets demonstrate the superior planning performance and great efficiency of DiFSD.
arXiv Detail & Related papers (2024-09-15T15:55:24Z) - Commonsense Visual Sensemaking for Autonomous Driving: On Generalised Neurosymbolic Online Abduction Integrating Vision and Semantics [9.359018642178917]
We demonstrate the need and potential of systematically integrated vision and semantics solutions for visual sensemaking in the backdrop of autonomous driving.
A general neurosymbolic method for online visual sensemaking using answer set programming (ASP) is systematically formalised and fully implemented.
arXiv Detail & Related papers (2020-12-28T16:55:19Z)
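As promised above, here is a minimal sketch of the Emotion Vector idea from the controllable emotion generation paper. The abstract does not describe the mechanism, so this assumes the common activation-steering recipe: take the mean hidden-state difference between emotional and neutral prompts and add it back at inference, scaled by a strength alpha. The function names, the hidden size (768), and the contrastive construction are all illustrative assumptions.

```python
# Minimal sketch (assumption, not the paper's code) of emotion-vector steering.
import torch

def emotion_vector(h_emotional: torch.Tensor, h_neutral: torch.Tensor) -> torch.Tensor:
    """Contrastive direction: mean activation gap between the two prompt sets."""
    return h_emotional.mean(dim=0) - h_neutral.mean(dim=0)

def steer(hidden: torch.Tensor, ev: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Shift a layer's hidden states along the emotion direction; alpha gives
    continuous control over tone with no retraining."""
    return hidden + alpha * ev

# Toy usage with random stand-in activations (hidden size 768 assumed):
h_joy, h_neutral = torch.randn(32, 768), torch.randn(32, 768)
ev = emotion_vector(h_joy, h_neutral)
steered = steer(torch.randn(1, 10, 768), ev, alpha=0.8)  # broadcasts over tokens
```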