Vision-Based Natural Language Scene Understanding for Autonomous Driving: An Extended Dataset and a New Model for Traffic Scene Description Generation
- URL: http://arxiv.org/abs/2601.14438v1
- Date: Tue, 20 Jan 2026 19:50:42 GMT
- Title: Vision-Based Natural Language Scene Understanding for Autonomous Driving: An Extended Dataset and a New Model for Traffic Scene Description Generation
- Authors: Danial Sadrian Zadeh, Otman A. Basir, Behzad Moshiri
- Abstract summary: This paper presents a framework that transforms a single frontal-view camera image into a concise natural language description. It integrates spatial and semantic feature extraction to generate contextually rich and detailed scene descriptions. The proposed model achieves strong performance and effectively fulfills its intended objectives on a newly developed dataset.
- Score: 1.3682156035049033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traffic scene understanding is essential for enabling autonomous vehicles to accurately perceive and interpret their environment, thereby ensuring safe navigation. This paper presents a novel framework that transforms a single frontal-view camera image into a concise natural language description, effectively capturing spatial layouts, semantic relationships, and driving-relevant cues. The proposed model leverages a hybrid attention mechanism to enhance spatial and semantic feature extraction and integrates these features to generate contextually rich and detailed scene descriptions. To address the limited availability of specialized datasets in this domain, a new dataset derived from the BDD100K dataset has been developed, with comprehensive guidelines provided for its construction. Furthermore, the study offers an in-depth discussion of relevant evaluation metrics, identifying the most appropriate measures for this task. Extensive quantitative evaluations using metrics such as CIDEr and SPICE, complemented by human judgment assessments, demonstrate that the proposed model achieves strong performance and effectively fulfills its intended objectives on the newly developed dataset.
Related papers
- ObjectVisA-120: Object-based Visual Attention Prediction in Interactive Street-crossing Environments [15.487686125490812]
We present a novel dataset of spatial street-crossing navigation in virtual reality specifically geared to object-based attention evaluations.
The uniqueness of the presented dataset lies in the ethical and safety affiliated challenges that make collecting comparable data in real-world environments highly difficult.
We propose object-based similarity (oSIM) as a novel metric to evaluate the performance of object-based visual attention models.
arXiv Detail & Related papers (2026-01-19T16:48:45Z)
- Scaling Up Occupancy-centric Driving Scene Generation: Dataset and Method [54.461213497603154]
Occupancy-centric methods have recently achieved state-of-the-art results by offering consistent conditioning across frames and modalities.
Nuplan-Occ is the largest occupancy dataset to date, constructed from the widely used Nuplan benchmark.
We develop a unified framework that jointly synthesizes high-quality occupancy, multi-view videos, and LiDAR point clouds.
arXiv Detail & Related papers (2025-10-27T03:52:45Z)
- Point Cloud Based Scene Segmentation: A Survey [3.0846824529023387]
We provide an overview of the current state-of-the-art methods in the field of Point Cloud Semantics for autonomous driving.
We categorize the approaches into projection-based, 3D-based and hybrid methods.
We also emphasize the importance of synthetic data to support research when real-world data is limited.
arXiv Detail & Related papers (2025-03-16T18:02:41Z)
- Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- ANNA: A Deep Learning Based Dataset in Heterogeneous Traffic for Autonomous Vehicles [2.932123507260722]
This study discusses a custom-built dataset that includes some unidentified vehicles in the perspective of Bangladesh.
A dataset validity check was performed by evaluating models using the Intersection Over Union (IOU) metric.
The results demonstrated that the model trained on our custom dataset was more precise and efficient than the models trained on the KITTI or COCO dataset concerning Bangladeshi traffic.
arXiv Detail & Related papers (2024-01-21T01:14:04Z)
- Less is More: Toward Zero-Shot Local Scene Graph Generation via Foundation Models [16.08214739525615]
We present a new task called Local Scene Graph Generation.
It aims to abstract pertinent structural information with partial objects and their relationships in an image.
We introduce zEro-shot Local scEne GrAph geNeraTion (ELEGANT), a framework harnessing foundation models renowned for their powerful perception and commonsense reasoning.
arXiv Detail & Related papers (2023-10-02T17:19:04Z)
- A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented, Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic Segmentation in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z)
- Information-Theoretic Odometry Learning [83.36195426897768]
We propose a unified information theoretic framework for learning-motivated methods aimed at odometry estimation.
The proposed framework provides an elegant tool for performance evaluation and understanding in information-theoretic language.
arXiv Detail & Related papers (2022-03-11T02:37:35Z)
- Generative Counterfactuals for Neural Networks via Attribute-Informed Perturbation [51.29486247405601]
We design a framework to generate counterfactuals for raw data instances with the proposed Attribute-Informed Perturbation (AIP).
By utilizing generative models conditioned with different attributes, counterfactuals with desired labels can be obtained effectively and efficiently.
Experimental results on real-world texts and images demonstrate the effectiveness, sample quality as well as efficiency of our designed framework.
arXiv Detail & Related papers (2021-01-18T08:37:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences.