Mapillary Vistas Validation for Fine-Grained Traffic Signs: A Benchmark Revealing Vision-Language Model Limitations
- URL: http://arxiv.org/abs/2508.02047v1
- Date: Mon, 04 Aug 2025 04:29:06 GMT
- Title: Mapillary Vistas Validation for Fine-Grained Traffic Signs: A Benchmark Revealing Vision-Language Model Limitations
- Authors: Sparsh Garg, Abhishek Aich
- Abstract summary: We present a new validation set for traffic signs derived from the Mapillary dataset called Mapillary Vistas Validation for Traffic Signs (MVV). The dataset includes pixel-level instance masks and has been manually annotated by expert annotators to ensure label fidelity. We benchmark several state-of-the-art VLMs against the self-supervised DINOv2 model on this dataset and show that DINOv2 consistently outperforms all VLM baselines.
- Score: 5.159407277301709
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Obtaining high-quality fine-grained annotations for traffic signs is critical for accurate and safe decision-making in autonomous driving. Widely used datasets, such as Mapillary, often provide only coarse-grained labels, without distinguishing semantically important types such as stop signs or speed limit signs. To this end, we present a new validation set for traffic signs derived from the Mapillary dataset called Mapillary Vistas Validation for Traffic Signs (MVV), where we decompose composite traffic signs into granular, semantically meaningful categories. The dataset includes pixel-level instance masks and has been manually annotated by expert annotators to ensure label fidelity. Further, we benchmark several state-of-the-art VLMs against the self-supervised DINOv2 model on this dataset and show that DINOv2 consistently outperforms all VLM baselines, not only on traffic sign recognition, but also on heavily represented categories like vehicles and humans. Our analysis reveals significant limitations in current vision-language models for fine-grained visual understanding and establishes DINOv2 as a strong baseline for dense semantic matching in autonomous driving scenarios. This dataset and evaluation framework pave the way for more reliable, interpretable, and scalable perception systems. Code and data are available at: https://github.com/nec-labs-ma/relabeling
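Since the abstract positions DINOv2 as a baseline for dense semantic matching, a minimal sketch of one plausible evaluation recipe is shown below: embed labeled reference crops and query crops with a DINOv2 backbone and classify by nearest cosine neighbor. The backbone size, file paths, and gallery labels are illustrative assumptions; the paper's actual MVV protocol may differ.

```python
# Hypothetical sketch: fine-grained sign matching with DINOv2 embeddings.
# Paths and the gallery label set are placeholders, not the MVV protocol.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import transforms

# Load a DINOv2 backbone from torch.hub (ViT-B/14 shown; other sizes exist).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),  # 224 is a multiple of the 14-pixel patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized global (CLS) embedding for one image crop."""
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return F.normalize(model(x), dim=-1)

# Gallery of labeled reference crops, one per fine-grained class (placeholder paths).
gallery = {"stop": embed("refs/stop.png"), "speed_limit_30": embed("refs/speed30.png")}

def classify(query_path: str) -> str:
    """Assign the query crop the label of its nearest gallery embedding (cosine similarity)."""
    q = embed(query_path)
    return max(gallery, key=lambda label: (q @ gallery[label].T).item())

print(classify("crops/unknown_sign.png"))
```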
Related papers
- SafeAuto: Knowledge-Enhanced Safe Autonomous Driving with Multimodal Foundation Models [63.71984266104757]
We propose SafeAuto, a framework that enhances MLLM-based autonomous driving by incorporating both unstructured and structured knowledge. To explicitly integrate safety knowledge, we develop a reasoning component that translates traffic rules into first-order logic. Our Multimodal Retrieval-Augmented Generation model leverages video, control signals, and environmental attributes to learn from past driving experiences.
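As a toy illustration of the "traffic rules as first-order logic" idea, the sketch below grounds one such rule (a red light and an approaching vehicle imply stopping) as an executable predicate. The predicate names and scene representation are invented for this example and are not SafeAuto's actual schema.

```python
# Hypothetical grounding of a first-order traffic rule:
#   forall v: RedLight(scene) and Approaching(v, intersection) -> MustStop(v)
# Field and function names are invented for this sketch.
from dataclasses import dataclass

@dataclass
class Scene:
    red_light: bool
    approaching_intersection: bool

def must_stop(scene: Scene) -> bool:
    """One grounded rule instance: red light + approaching intersection implies stopping."""
    return scene.red_light and scene.approaching_intersection

assert must_stop(Scene(red_light=True, approaching_intersection=True))
```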
arXiv Detail & Related papers (2025-02-28T21:53:47Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
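A hedged sketch of what LLM-driven label verification could look like in practice: wrap any chat backend as a string-to-string function and ask it to accept or reject a candidate entity label. The prompt wording and the query_llm callable are assumptions for illustration, not the paper's pipeline.

```python
# Hypothetical LLM-based label verification; query_llm is any str -> str chat backend.
from typing import Callable

def verify_label(entity: str, caption: str, query_llm: Callable[[str], str]) -> bool:
    """Ask an LLM whether a candidate entity label is consistent with image metadata."""
    prompt = (
        f"Candidate entity label: {entity}\n"
        f"Image caption/metadata: {caption}\n"
        "Answer 'yes' if the label correctly identifies the entity, otherwise 'no'."
    )
    return query_llm(prompt).strip().lower().startswith("yes")

# Usage with any chat-completion backend wrapped as a function:
# keep = verify_label("Eiffel Tower", "a tall iron lattice tower in Paris", my_llm)
```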
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition [49.20086587208214]
We propose a cross-domain few-shot in-context learning method based on a multimodal LLM (MLLM) to enhance traffic sign recognition.
By using description texts, our method reduces the cross-domain differences between template and real traffic signs.
Our approach requires only simple and uniform textual indications, without the need for large-scale traffic sign images and labels.
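One plausible way to realize "description texts bridging template and real signs" is to interleave template images with their textual descriptions as in-context examples ahead of the query photo. The message format below is invented for illustration; the paper's exact prompt structure may differ.

```python
# Hypothetical in-context prompt assembly for cross-domain sign recognition.
# The message schema is a placeholder, not a specific MLLM API.
def build_prompt(examples: list[tuple[str, str, str]], query_image: str) -> list[dict]:
    """examples: (template_image_path, description_text, class_name) triples."""
    messages = []
    for image, description, name in examples:
        messages.append({"image": image,
                         "text": f"This template shows '{name}': {description}"})
    messages.append({"image": query_image,
                     "text": "Which of the above signs does this real photo show?"})
    return messages

demo = [("templates/stop.png", "red octagon with white STOP text", "stop sign")]
prompt = build_prompt(demo, "photos/roadside_sign.jpg")
```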
arXiv Detail & Related papers (2024-07-08T10:51:03Z)
- MUSES: The Multi-Sensor Semantic Perception Dataset for Driving under Uncertainty [46.369657697892634]
We introduce MUSES, the MUlti-SEnsor Semantic perception dataset for driving in adverse conditions under increased uncertainty.
The dataset integrates a frame camera, a lidar, a radar, an event camera, and an IMU/GNSS sensor.
MUSES proves both effective for training and challenging for evaluating models under diverse visual conditions.
arXiv Detail & Related papers (2024-01-23T13:43:17Z)
- Fusing Pseudo Labels with Weak Supervision for Dynamic Traffic Scenarios [0.0]
We introduce a weakly-supervised label unification pipeline that merges pseudo labels from object detection models trained on heterogeneous datasets.
Our pipeline produces a unified label space by combining labels from disparate datasets, correcting bias and improving generalization.
We then retrain a single object detection model on the merged label space, yielding a robust model suited to dynamic traffic scenarios.
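A minimal sketch of the label-unification step, assuming each source dataset contributes its own class vocabulary: a lookup maps (dataset, label) pairs into one merged label space. The class names below are placeholders, not the paper's taxonomy.

```python
# Hypothetical merged label space for pseudo labels from two detectors.
UNIFIED = {
    # dataset A's names -> unified names
    ("A", "car"): "vehicle.car",
    ("A", "person"): "human.pedestrian",
    # dataset B uses different names for the same concepts
    ("B", "sedan"): "vehicle.car",
    ("B", "pedestrian"): "human.pedestrian",
}

def unify(dataset: str, label: str) -> str:
    """Map a detector's dataset-specific pseudo label into the merged label space."""
    return UNIFIED.get((dataset, label), f"unknown.{label}")

assert unify("A", "car") == unify("B", "sedan") == "vehicle.car"
```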
arXiv Detail & Related papers (2023-08-30T11:33:07Z)
- OpenLane-V2: A Topology Reasoning Benchmark for Unified 3D HD Mapping [84.65114565766596]
We present OpenLane-V2, the first dataset on topology reasoning for traffic scene structure.
OpenLane-V2 consists of 2,000 annotated road scenes that describe traffic elements and their correlation to the lanes.
We evaluate various state-of-the-art methods, and present their quantitative and qualitative results on OpenLane-V2 to indicate future avenues for investigating topology reasoning in traffic scenes.
arXiv Detail & Related papers (2023-04-20T16:31:22Z)
- Traffic Scene Parsing through the TSP6K Dataset [109.69836680564616]
We introduce a specialized traffic monitoring dataset, termed TSP6K, with high-quality pixel-level and instance-level annotations.
The dataset captures crowded traffic-monitoring scenes containing several times more traffic participants than existing driving datasets.
We propose a detail refining decoder for scene parsing, which recovers the details of different semantic regions in traffic scenes.
arXiv Detail & Related papers (2023-03-06T02:05:14Z)
- ConMAE: Contour Guided MAE for Unsupervised Vehicle Re-Identification [8.950873153831735]
Considering that Masked Autoencoder (MAE) has shown excellent performance in self-supervised learning, this work designs a Contour Guided Masked Autoencoder for Unsupervised Vehicle Re-Identification (ConMAE).
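As a rough illustration of contour guidance, the sketch below masks the patches with the lowest edge density, so the visible patches (and the reconstruction signal) concentrate on contours. ConMAE's actual masking strategy may differ; this only conveys the general idea.

```python
# Hypothetical contour-guided masking: mask the patches with the least edge content.
import numpy as np

def contour_guided_mask(edge_map: np.ndarray, patch: int = 16, mask_ratio: float = 0.75):
    """edge_map: (H, W) edge-strength array. Returns a boolean per-patch mask (True = masked)."""
    h, w = edge_map.shape[0] // patch, edge_map.shape[1] // patch
    # Mean edge strength per patch, as a proxy for contour content.
    scores = edge_map[: h * patch, : w * patch].reshape(h, patch, w, patch).mean(axis=(1, 3))
    flat = scores.ravel()
    n_masked = int(mask_ratio * flat.size)
    masked = np.zeros(flat.size, dtype=bool)
    masked[np.argsort(flat)[:n_masked]] = True  # mask the lowest-contour patches
    return masked.reshape(h, w)

mask = contour_guided_mask(np.random.rand(224, 224))
```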
arXiv Detail & Related papers (2023-02-11T12:10:25Z)
- TrafficCAM: A Versatile Dataset for Traffic Flow Segmentation [9.744937939618161]
Existing traffic flow datasets have two major limitations.
They feature a limited number of classes, often covering only a single vehicle type, and they provide little unlabelled data.
We introduce a new benchmark traffic flow image dataset called TrafficCAM.
arXiv Detail & Related papers (2022-11-17T16:14:38Z)
- Pluggable Weakly-Supervised Cross-View Learning for Accurate Vehicle Re-Identification [53.6218051770131]
Cross-view consistent feature representation is key for accurate vehicle ReID.
Existing approaches resort to supervised cross-view learning that requires extensive extra viewpoint annotations.
We present a pluggable Weakly-supervised Cross-View Learning (WCVL) module for vehicle ReID.
arXiv Detail & Related papers (2021-03-09T11:51:09Z)
- Towards Accurate Vehicle Behaviour Classification With Multi-Relational Graph Convolutional Networks [22.022759283770377]
We propose a pipeline for understanding vehicle behaviour from a monocular image sequence or video.
Vehicle interactions in each frame are encoded with a multi-relational graph convolutional network, and a temporal sequence of these encodings is fed to a recurrent network to label vehicle behaviours.
The proposed framework classifies a variety of vehicle behaviours with high fidelity across diverse datasets.
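A minimal sketch of the recurrent stage described above, assuming per-frame relation encodings of fixed dimension: an LSTM consumes the temporal sequence and a linear head emits behaviour logits. All dimensions and class counts are placeholders.

```python
# Hypothetical recurrent behaviour classifier over per-frame relation encodings.
import torch
import torch.nn as nn

class BehaviourClassifier(nn.Module):
    def __init__(self, enc_dim: int = 128, hidden: int = 64, n_behaviours: int = 5):
        super().__init__()
        self.rnn = nn.LSTM(enc_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_behaviours)

    def forward(self, frame_encodings: torch.Tensor) -> torch.Tensor:
        # frame_encodings: (batch, time, enc_dim)
        _, (h_n, _) = self.rnn(frame_encodings)
        return self.head(h_n[-1])  # logits over behaviour classes

logits = BehaviourClassifier()(torch.randn(2, 30, 128))  # 2 clips, 30 frames each
```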
arXiv Detail & Related papers (2020-02-03T14:34:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.