VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments
- URL: http://arxiv.org/abs/2508.05852v1
- Date: Thu, 07 Aug 2025 21:01:43 GMT
- Title: VISTA: Vision-Language Imitation of Situational Thinking and Attention for Human-Like Driver Focus in Dynamic Environments
- Authors: Kaiser Hamid, Khandakar Ashrafi Akbar, Nade Liang,
- Abstract summary: We propose a vision-language framework that models the changing landscape of drivers' gaze through natural language. Our approach integrates both low-level cues and top-down context, enabling language-based descriptions of gaze behavior. Results show that our fine-tuned model outperforms general-purpose VLMs in attention shift detection and interpretability.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Driver visual attention prediction is a critical task in autonomous driving and human-computer interaction (HCI) research. Most prior studies focus on estimating attention allocation at a single moment in time, typically using static RGB images such as driving scene pictures. In this work, we propose a vision-language framework that models the changing landscape of drivers' gaze through natural language, using few-shot and zero-shot learning on single RGB images. We curate and refine high-quality captions from the BDD-A dataset using human-in-the-loop feedback, then fine-tune LLaVA to align visual perception with attention-centric scene understanding. Our approach integrates both low-level cues and top-down context (e.g., route semantics, risk anticipation), enabling language-based descriptions of gaze behavior. We evaluate performance across training regimes (few-shot and one-shot) and introduce domain-specific metrics for semantic alignment and response diversity. Results show that our fine-tuned model outperforms general-purpose VLMs in attention shift detection and interpretability. To our knowledge, this is among the first attempts to generate driver visual attention allocation and shifting predictions in natural language, offering a new direction for explainable AI in autonomous driving. Our approach provides a foundation for downstream tasks such as behavior forecasting, human-AI teaming, and multi-agent coordination.
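The abstract mentions domain-specific metrics for semantic alignment and response diversity but gives no formulas. Below is a minimal sketch of how such metrics are often implemented with sentence embeddings; the encoder choice and both function definitions are assumptions for illustration, not the authors' implementation.

```python
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

# Assumed encoder; the abstract does not name one.
_encoder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_alignment(prediction: str, reference: str) -> float:
    """Cosine similarity between a generated gaze description and its
    human-curated reference caption (one plausible alignment metric)."""
    emb = _encoder.encode([prediction, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def response_diversity(responses: list[str]) -> float:
    """One minus the mean pairwise similarity across generated
    descriptions; higher values indicate less templated output."""
    assert len(responses) >= 2, "diversity needs at least two responses"
    emb = _encoder.encode(responses, convert_to_tensor=True)
    sims = [util.cos_sim(emb[i], emb[j]).item()
            for i, j in combinations(range(len(responses)), 2)]
    return 1.0 - sum(sims) / len(sims)

# Example: score one model response against its curated caption.
score = semantic_alignment(
    "The driver's gaze shifts left toward a pedestrian entering the crosswalk.",
    "Attention moves to the pedestrian crossing from the left.",
)
```

Embedding-based similarity is a common stand-in for semantic alignment because it tolerates paraphrase, which n-gram metrics such as BLEU penalize; whether the paper uses this exact formulation is not stated in the abstract.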
Related papers
- Vision and language: Novel Representations and Artificial intelligence for Driving Scene Safety Assessment and Autonomous Vehicle Planning [2.1379801460200416]
Vision-language models (VLMs) have emerged as powerful representation learning systems that align visual observations with natural language concepts. This paper investigates how vision-language representations support driving scene safety assessment and decision-making when integrated into perception, prediction, and planning pipelines.
arXiv Detail & Related papers (2026-02-07T20:04:21Z) - InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving [3.8737986316149775]
We propose a novel end-to-end autonomous driving method called InsightDrive. It organizes perception through a language-guided scene representation. In experiments, InsightDrive achieves state-of-the-art performance in end-to-end autonomous driving.
arXiv Detail & Related papers (2025-03-17T10:52:32Z) - OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction [95.6266030753644]
Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs), as visual and language features are independently fed into downstream policies. We propose OTTER, a novel VLA architecture that leverages existing alignments through explicit, text-aware visual feature extraction.
arXiv Detail & Related papers (2025-03-05T18:44:48Z) - doScenes: An Autonomous Driving Dataset with Natural Language Instruction for Human Interaction and Vision-Language Navigation [0.0]
doScenes is a novel dataset designed to facilitate research on human-vehicle instruction interactions. doScenes bridges the gap between instruction and driving response, enabling context-aware and adaptive planning.
arXiv Detail & Related papers (2024-12-08T11:16:47Z) - Driver Activity Classification Using Generalizable Representations from Vision-Language Models [0.0]
We present a novel approach leveraging generalizable representations from vision-language models for driver activity classification.
Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems.
arXiv Detail & Related papers (2024-04-23T10:42:24Z) - LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z) - CAVL: Learning Contrastive and Adaptive Representations of Vision and Language [10.57079240576682]
Visual and linguistic pre-training aims to learn vision and language representations together.
Current pre-trained models tend to require substantial computational resources for fine-tuning when transferred to downstream tasks.
We present a simple but effective approach for learning Contrastive and Adaptive representations of Vision and Language, namely CAVL.
arXiv Detail & Related papers (2023-04-10T05:54:03Z) - VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision [13.268399018823903]
We propose a novel approach to context-aware pedestrian detection via Vision-Language semantic self-supervision.
First, we propose a self-supervised Vision-Language Semantic (VLS) segmentation method, which learns both fully-supervised pedestrian detection and contextual segmentation.
Second, a self-supervised Prototypical Semantic Contrastive (PSC) learning method is proposed to better discriminate pedestrians and other classes.
arXiv Detail & Related papers (2023-04-06T15:16:29Z) - Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z) - Connecting Language and Vision for Natural Language-Based Vehicle Retrieval [77.88818029640977]
In this paper, we apply a new modality, namely the natural-language description, to search for the vehicle of interest.
To connect language and vision, we propose to jointly train the state-of-the-art vision models with the transformer-based language model.
Our proposed method achieved 1st place in the 5th AI City Challenge, yielding a competitive 18.69% MRR accuracy.
arXiv Detail & Related papers (2021-05-31T11:42:03Z)