FSDAM: Few-Shot Driving Attention Modeling via Vision-Language Coupling
- URL: http://arxiv.org/abs/2511.12708v1
- Date: Sun, 16 Nov 2025 17:50:30 GMT
- Title: FSDAM: Few-Shot Driving Attention Modeling via Vision-Language Coupling
- Authors: Kaiser Hamid, Can Cui, Khandakar Ashrafi Akbar, Ziran Wang, Nade Liang,
- Abstract summary: We present FSDAM, a framework that achieves joint attention prediction and caption generation with approximately 100 annotated examples. FSDAM achieves competitive performance on attention prediction and generates coherent, context-aware explanations. This work shows that effective attention-conditioned generation is achievable with limited supervision, opening new possibilities for practical deployment of explainable driver attention systems.
- Score: 5.609178055761294
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Understanding where drivers look and why they shift their attention is essential for autonomous systems that read human intent and justify their actions. Most existing models rely on large-scale gaze datasets to learn these patterns; however, such datasets are labor-intensive to collect and time-consuming to curate. We present FSDAM (Few-Shot Driver Attention Modeling), a framework that achieves joint attention prediction and caption generation with approximately 100 annotated examples, two orders of magnitude fewer than existing approaches. Our approach introduces a dual-pathway architecture where separate modules handle spatial prediction and caption generation while maintaining semantic consistency through cross-modal alignment. Despite minimal supervision, FSDAM achieves competitive performance on attention prediction and generates coherent, context-aware explanations. The model demonstrates robust zero-shot generalization across multiple driving benchmarks. This work shows that effective attention-conditioned generation is achievable with limited supervision, opening new possibilities for practical deployment of explainable driver attention systems in data-constrained scenarios.
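As a rough illustration of the dual-pathway idea, the sketch below pairs a spatial saliency head with a caption head conditioned on attention-pooled features, tied together by a cosine alignment loss. All module names, dimensions, and the loss form are assumptions for illustration, not FSDAM's published implementation.

```python
# Hypothetical sketch of a dual-pathway model with cross-modal alignment.
# Module names, dimensions, and the alignment loss are illustrative
# assumptions, not FSDAM's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualPathwayModel(nn.Module):
    def __init__(self, feat_dim=768, vocab_size=32000, align_dim=256):
        super().__init__()
        # A frozen vision backbone is assumed upstream; inputs here are
        # precomputed patch features of shape (B, N, feat_dim).
        self.spatial_head = nn.Linear(feat_dim, 1)      # per-patch saliency logit
        self.caption_proj = nn.Linear(feat_dim, feat_dim)
        self.lm_head = nn.Linear(feat_dim, vocab_size)  # stand-in for a decoder
        self.align_img = nn.Linear(feat_dim, align_dim)
        self.align_txt = nn.Linear(feat_dim, align_dim)

    def forward(self, patch_feats, caption_feats):
        # Pathway 1: spatial attention prediction over image patches.
        sal = self.spatial_head(patch_feats).squeeze(-1).softmax(dim=-1)  # (B, N)
        # Attention-pooled features condition the caption pathway.
        pooled = torch.einsum("bn,bnd->bd", sal, patch_feats)
        token_logits = self.lm_head(self.caption_proj(pooled))
        # Cross-modal alignment keeps the two pathways semantically consistent.
        z_img = F.normalize(self.align_img(pooled), dim=-1)
        z_txt = F.normalize(self.align_txt(caption_feats), dim=-1)
        align_loss = 1.0 - (z_img * z_txt).sum(dim=-1).mean()
        return sal, token_logits, align_loss
```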
Related papers
- Unifying Language-Action Understanding and Generation for Autonomous Driving [25.23561391638388]
Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving. Existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. We introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency.
arXiv Detail & Related papers (2026-03-02T04:41:10Z)
- Efficient and Explainable End-to-End Autonomous Driving via Masked Vision-Language-Action Diffusion [23.834662472392694]
Masked Vision-Language-Action Diffusion for Autonomous Driving (MVLAD-AD) is a novel framework designed to bridge the gap between efficient planning and semantic explainability. We introduce a discrete action tokenization strategy that constructs a compact codebook of kinematically feasible waypoints from real-world driving distributions. Experiments on nuScenes and derived benchmarks demonstrate that MVLAD-AD achieves superior efficiency and outperforms state-of-the-art autoregressive and diffusion baselines in planning precision.
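The discrete action tokenization can be pictured with a short sketch: cluster waypoints from driving logs into a codebook and snap trajectories to their nearest entries. The choice of k-means and every name and size below are assumptions, not the paper's method.

```python
# Hypothetical sketch of discrete action tokenization: cluster trajectory
# waypoints from driving logs into a compact codebook, then map any
# trajectory to token ids. k-means and all names/sizes are assumptions.
import numpy as np
from sklearn.cluster import KMeans

def build_waypoint_codebook(trajectories: np.ndarray, codebook_size: int = 256):
    """trajectories: (num_trajs, horizon, 2) array of (x, y) waypoints."""
    points = trajectories.reshape(-1, 2)
    km = KMeans(n_clusters=codebook_size, n_init=10).fit(points)
    return km.cluster_centers_                       # (codebook_size, 2)

def tokenize_trajectory(traj: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each waypoint in traj (horizon, 2) to its nearest codebook index."""
    dists = np.linalg.norm(traj[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=-1)                     # (horizon,) token ids
```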
arXiv Detail & Related papers (2026-02-24T05:59:10Z)
- Kelix Technical Report [86.64551727600104]
We present Kelix, a fully discrete autoregressive unified model that closes the understanding gap between discrete and continuous visual representations. Recent work has explored discrete visual tokenization to enable fully autoregressive multimodal modeling.
arXiv Detail & Related papers (2026-02-10T14:48:26Z)
- SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving [52.02379432801349]
We propose SGDrive, a novel framework that structures the VLM's representation learning around driving-specific knowledge hierarchies. Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition.
arXiv Detail & Related papers (2026-01-09T08:55:42Z)
- Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities. Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark. We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z)
- Vision-LLMs for Spatiotemporal Traffic Forecasting [14.700408329373998]
Large Language Models (LLMs) inherently struggle to model the complex spatial dependencies of grid-based traffic data. We propose ST-Vision-LLM, a novel framework that reframes spatiotemporal forecasting as a vision-language fusion problem. We show that ST-Vision-LLM outperforms existing methods by 15.6% in long-term prediction accuracy and exceeds the second-best baseline by over 30.04% in cross-domain scenarios.
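A minimal sketch of the reframing, assuming traffic volumes arrive as a (T, H, W) grid that gets rendered into pseudo-images for a vision-language model; the normalization and channel layout below are illustrative guesses, not the paper's pipeline.

```python
# Hypothetical illustration of recasting grid traffic data as images that a
# vision-language model could consume. Normalization and the grayscale-to-RGB
# channel layout are assumptions for illustration only.
import numpy as np

def traffic_grid_to_images(flow: np.ndarray) -> np.ndarray:
    """flow: (T, H, W) traffic volumes -> (T, H, W, 3) uint8 pseudo-images."""
    lo, hi = float(flow.min()), float(flow.max())
    norm = (flow - lo) / (hi - lo + 1e-8)            # scale to [0, 1]
    gray = (norm * 255).astype(np.uint8)
    return np.stack([gray, gray, gray], axis=-1)     # replicate to 3 channels
```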
arXiv Detail & Related papers (2025-10-13T11:15:56Z)
- Where, What, Why: Towards Explainable Driver Attention Prediction [28.677786362573638]
We introduce Explainable Driver Attention Prediction, a novel task paradigm that jointly predicts spatial attention regions (where), parses attended semantics (what), and provides cognitive reasoning for attention allocation (why). We propose LLada, a Large Language model-driven framework for driver attention prediction, which unifies pixel modeling, semantic parsing, and cognitive reasoning within an end-to-end architecture. This work serves as a key step toward a deeper understanding of driver attention mechanisms, with significant implications for autonomous driving, intelligent driver training, and human-computer interaction.
arXiv Detail & Related papers (2025-06-29T04:59:39Z)
- Enhancing End-to-End Autonomous Driving with Latent World Model [78.22157677787239]
We propose a novel self-supervised learning approach using the LAtent World model (LAW) for end-to-end driving. LAW predicts future scene features based on current features and ego trajectories. This self-supervised task can be seamlessly integrated into perception-free and perception-based frameworks.
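A minimal sketch of this idea, assuming pooled scene features and a flattened ego trajectory; the architecture and dimensions below are illustrative, not LAW's actual design.

```python
# Minimal sketch of a latent world model in the spirit of LAW: predict the
# next frame's scene features from current features plus the ego trajectory.
# Architecture and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentWorldModel(nn.Module):
    def __init__(self, feat_dim=256, traj_dim=20):
        super().__init__()
        self.traj_enc = nn.Linear(traj_dim, feat_dim)
        self.predictor = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, feats_t, ego_traj):
        # feats_t: (B, feat_dim) current scene features;
        # ego_traj: (B, traj_dim) flattened planned waypoints.
        h = torch.cat([feats_t, self.traj_enc(ego_traj)], dim=-1)
        return self.predictor(h)                     # predicted features at t+1

# Self-supervised objective: regress the observed next-frame features, e.g.
# loss = F.mse_loss(model(feats_t, ego_traj), feats_next.detach())
```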
arXiv Detail & Related papers (2024-06-12T17:59:21Z)
- SCOUT+: Towards Practical Task-Driven Drivers' Gaze Prediction [12.246649738388388]
SCOUT+ is a task- and context-aware model for drivers' gaze prediction.
We evaluate our model on two datasets, DR(eye)VE and BDD-A.
arXiv Detail & Related papers (2024-04-12T18:29:10Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
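One hedged way to realize such visual tokens is a similarity softmax against the vocabulary embedding table, as sketched below; the cosine formulation and temperature are assumptions, not the paper's exact mapping.

```python
# Hedged sketch of the "visual words" idea: project visual features onto a
# language model's vocabulary embeddings to get a per-patch distribution
# over words. The cosine-similarity softmax is an assumed formulation.
import torch
import torch.nn.functional as F

def visual_words(patch_feats: torch.Tensor, vocab_emb: torch.Tensor,
                 tau: float = 0.07) -> torch.Tensor:
    """patch_feats: (N, D) visual features; vocab_emb: (V, D) word embeddings.
    Returns an (N, V) probability distribution over the vocabulary per patch."""
    f = F.normalize(patch_feats, dim=-1)
    w = F.normalize(vocab_emb, dim=-1)
    return ((f @ w.t()) / tau).softmax(dim=-1)
```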
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Scaling Vision-based End-to-End Driving with Multi-View Attention Learning [7.14967754486195]
We present CIL++, which improves on CILRS by both processing higher-resolution images using a human-inspired HFOV as an inductive bias and incorporating a proper attention mechanism.
We propose to replace CILRS with CIL++ as a strong vision-based pure end-to-end driving baseline supervised by only vehicle signals and trained by conditional imitation learning.
arXiv Detail & Related papers (2023-02-07T02:14:45Z)
- Online Multiple Object Tracking with Cross-Task Synergy [120.70085565030628]
We propose a novel unified model with synergy between position prediction and embedding association.
The two tasks are linked by temporal-aware target attention and distractor attention, as well as an identity-aware memory aggregation model.
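A hedged sketch of how target and distractor attention might refine a track query against feature memories; the additive/subtractive combination below is an assumption, not the paper's exact formulation.

```python
# Illustrative sketch of target vs. distractor attention for tracking:
# refine a track query toward remembered target features and away from
# distractor features. The refinement scheme is an assumption.
import torch

def target_distractor_attention(query, target_mem, distractor_mem, tau=1.0):
    """query: (B, D); target_mem, distractor_mem: (B, M, D) feature memories."""
    def attend(q, mem):
        attn = (torch.einsum("bd,bmd->bm", q, mem) / tau).softmax(dim=-1)
        return torch.einsum("bm,bmd->bd", attn, mem)
    return query + attend(query, target_mem) - attend(query, distractor_mem)
```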
arXiv Detail & Related papers (2021-04-01T10:19:40Z)