CARScenes: Semantic VLM Dataset for Safe Autonomous Driving
- URL: http://arxiv.org/abs/2511.10701v2
- Date: Tue, 18 Nov 2025 15:20:04 GMT
- Title: CARScenes: Semantic VLM Dataset for Safe Autonomous Driving
- Authors: Yuankai He, Weisong Shi
- Abstract summary: CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models. We annotate 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes.
- Score: 3.9876810376226057
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models (VLMs) for interpretable, scene-level understanding. We annotate 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes using a 28-key category/sub-category knowledge base covering environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete severity scale (1-10), totaling 350+ leaf attributes. Labels are produced by a GPT-4o-assisted vision-language pipeline with human-in-the-loop verification; we release the exact prompts, post-processing rules, and per-field baseline model performance. CAR-Scenes also provides attribute co-occurrence graphs and JSONL records that support semantic retrieval, dataset triage, and risk-aware scenario mining across sources. To calibrate task difficulty, we include reproducible, non-benchmark baselines, notably a LoRA-tuned Qwen2-VL-2B with deterministic decoding, evaluated via scalar accuracy, micro-averaged F1 for list attributes, and severity MAE/RMSE on a fixed validation split. We publicly release the annotation and analysis scripts, including graph construction and evaluation scripts, to enable explainable, data-centric workflows for future intelligent vehicles. Dataset: https://github.com/Croquembouche/CAR-Scenes
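The abstract names three kinds of metric: scalar accuracy, micro-averaged F1 for list attributes, and severity MAE/RMSE. The following is a minimal sketch of how such metrics are commonly computed, not the authors' released evaluation script. The field names `weather`, `vru`, and `severity` and the sample records are hypothetical and chosen only for illustration; the actual JSONL schema is defined in the dataset repository.

```python
import math

def scalar_accuracy(preds, golds):
    """Fraction of records whose single-valued prediction equals the label."""
    correct = sum(p == g for p, g in zip(preds, golds))
    return correct / len(golds)

def micro_f1(pred_lists, gold_lists):
    """Micro-averaged F1: pool TP/FP/FN over all records, then compute F1 once."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_lists, gold_lists):
        pred_set, gold_set = set(pred), set(gold)
        tp += len(pred_set & gold_set)   # predicted and present in the label
        fp += len(pred_set - gold_set)   # predicted but absent from the label
        fn += len(gold_set - pred_set)   # in the label but not predicted
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def severity_errors(preds, golds):
    """Return (MAE, RMSE) between predicted and labeled severity scores."""
    diffs = [p - g for p, g in zip(preds, golds)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse

# Hypothetical records; field names are assumptions, not the dataset schema.
golds = [
    {"weather": "rain", "vru": ["pedestrian", "cyclist"], "severity": 7},
    {"weather": "clear", "vru": [], "severity": 2},
]
preds = [
    {"weather": "rain", "vru": ["pedestrian"], "severity": 5},
    {"weather": "fog", "vru": ["cyclist"], "severity": 2},
]

acc = scalar_accuracy([p["weather"] for p in preds], [g["weather"] for g in golds])
f1 = micro_f1([p["vru"] for p in preds], [g["vru"] for g in golds])
mae, rmse = severity_errors([p["severity"] for p in preds], [g["severity"] for g in golds])
```

Micro-averaging pools counts across all records before computing F1, so records with many list entries weigh more than records with few; a macro average would instead weight each record or class equally.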
Related papers
- Segment Any Vehicle: Semantic and Visual Context Driven SAM and A Benchmark [12.231630639022335]
We propose SAV, a novel framework comprising three core components: a SAM-based encoder-decoder, a vehicle part knowledge graph, and a context sample retrieval encoding module. The knowledge graph explicitly models the spatial and geometric relationships among vehicle parts through a structured ontology, effectively encoding prior structural knowledge. We introduce a new large-scale benchmark dataset for vehicle part segmentation, named VehicleSeg10K, which contains 11,665 high-quality pixel-level annotations.
arXiv Detail & Related papers (2025-08-06T09:46:49Z)
- Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving [0.5249805590164902]
We propose a novel context-aware motion retrieval framework to support targeted evaluation of autonomous driving systems in diverse, human-centered scenarios. Our approach outperforms state-of-the-art models by up to 27.5% accuracy in motion-context retrieval when evaluated on the WayMoCo dataset.
arXiv Detail & Related papers (2025-08-01T12:41:52Z)
- EMT: A Visual Multi-Task Benchmark Dataset for Autonomous Driving [8.97091577113286]
The Emirates Multi-Task dataset is designed to support multi-task benchmarking within a unified framework. It comprises over 30,000 frames from a dash-camera perspective and 570,000 annotated bounding boxes, covering approximately 150 kilometers of driving routes.
arXiv Detail & Related papers (2025-02-26T16:06:35Z)
- GPD-1: Generative Pre-training for Driving [77.06803277735132]
We propose a unified Generative Pre-training for Driving (GPD-1) model to accomplish all these tasks. We represent each scene with ego, agent, and map tokens and formulate autonomous driving as a unified token generation problem. Our GPD-1 successfully generalizes to various tasks without finetuning, including scene generation, traffic simulation, closed-loop simulation, map prediction, and motion planning.
arXiv Detail & Related papers (2024-12-11T18:59:51Z)
- DriveLM: Driving with Graph Visual Question Answering [57.51930417790141]
We study how vision-language models (VLMs) trained on web-scale data can be integrated into end-to-end driving systems. We propose a VLM-based baseline approach (DriveLM-Agent) for jointly performing Graph VQA and end-to-end driving.
arXiv Detail & Related papers (2023-12-21T18:59:12Z)
- Exploiting Contextual Target Attributes for Target Sentiment Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z)
- Weakly Supervised Semantic Segmentation for Driving Scenes [27.0285166404621]
State-of-the-art techniques in weakly-supervised semantic segmentation (WSSS) exhibit severe performance degradation on driving scene datasets.
We develop a new WSSS framework tailored to driving scene datasets.
arXiv Detail & Related papers (2023-12-21T08:16:26Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language
Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs)
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - ST-KeyS: Self-Supervised Transformer for Keyword Spotting in Historical
Handwritten Documents [3.9688530261646653]
Keywords spotting (KWS) in historical documents is an important tool for the initial exploration of digitized collections.
We propose ST-KeyS, a masked auto-encoder model based on vision transformers where the pretraining stage is based on the mask-and-predict paradigm.
In the fine-tuning stage, the pre-trained encoder is integrated into a siamese neural network model that is fine-tuned to improve feature embedding from the input images.
arXiv Detail & Related papers (2023-03-06T13:39:41Z) - Towards Optimal Strategies for Training Self-Driving Perception Models
in Simulation [98.51313127382937]
We focus on the use of labels in the synthetic domain alone.
Our approach introduces both a way to learn neural-invariant representations and a theoretically inspired view on how to sample the data from the simulator.
We showcase our approach on the bird's-eye-view vehicle segmentation task with multi-sensor data.
arXiv Detail & Related papers (2021-11-15T18:37:43Z) - Detecting 32 Pedestrian Attributes for Autonomous Vehicles [103.87351701138554]
In this paper, we address the problem of jointly detecting pedestrians and recognizing 32 pedestrian attributes.
We introduce a Multi-Task Learning (MTL) model relying on a composite field framework, which achieves both goals in an efficient way.
We show competitive detection and attribute recognition results, as well as a more stable MTL training.
arXiv Detail & Related papers (2020-12-04T15:10:12Z) - Deep Representation Learning and Clustering of Traffic Scenarios [0.0]
We introduce two data driven autoencoding models that learn latent representation of traffic scenes.
We show how the latent scenario embeddings can be used for clustering traffic scenarios and similarity retrieval.
arXiv Detail & Related papers (2020-07-15T15:12:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.