Contrastive Learning-Driven Traffic Sign Perception: Multi-Modal Fusion of Text and Vision
- URL: http://arxiv.org/abs/2507.23331v1
- Date: Thu, 31 Jul 2025 08:23:30 GMT
- Title: Contrastive Learning-Driven Traffic Sign Perception: Multi-Modal Fusion of Text and Vision
- Authors: Qiang Lu, Waikit Xiu, Xiying Li, Shenyu Hu, Shengbo Sun
- Abstract summary: We propose a novel framework combining open-vocabulary detection and cross-modal learning. For traffic sign detection, our NanoVerse YOLO model integrates a vision-language path aggregation network (RepVL-PAN) and an SPD-Conv module. For traffic sign classification, we designed a Traffic Sign Recognition Multimodal Contrastive Learning model (TSR-MCL). On the TT100K dataset, our method achieves a state-of-the-art 78.4% mAP in the long-tail detection task for all-class recognition.
- Score: 2.0720154517628417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traffic sign recognition, as a core component of autonomous driving perception systems, directly influences vehicle environmental awareness and driving safety. Current technologies face two significant challenges: first, the traffic sign dataset exhibits a pronounced long-tail distribution, resulting in a substantial decline in recognition performance of traditional convolutional networks when processing low-frequency and out-of-distribution classes; second, traffic signs in real-world scenarios are predominantly small targets with significant scale variations, making it difficult to extract multi-scale features. To overcome these issues, we propose a novel two-stage framework combining open-vocabulary detection and cross-modal learning. For traffic sign detection, our NanoVerse YOLO model integrates a reparameterizable vision-language path aggregation network (RepVL-PAN) and an SPD-Conv module to specifically enhance feature extraction for small, multi-scale targets. For traffic sign classification, we designed a Traffic Sign Recognition Multimodal Contrastive Learning model (TSR-MCL). By contrasting visual features from a Vision Transformer with semantic features from a rule-based BERT, TSR-MCL learns robust, frequency-independent representations, effectively mitigating class confusion caused by data imbalance. On the TT100K dataset, our method achieves a state-of-the-art 78.4% mAP in the long-tail detection task for all-class recognition. The model also obtains 91.8% accuracy and 88.9% recall, significantly outperforming mainstream algorithms and demonstrating superior accuracy and generalization in complex, open-world scenarios.
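The abstract does not spell out the contrastive objective used in TSR-MCL; the standard way to contrast image embeddings from a Vision Transformer with text embeddings from a BERT encoder is a symmetric InfoNCE (CLIP-style) loss. A minimal PyTorch sketch under that assumption, not the paper's actual implementation:

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss between image and text embeddings.

    img_emb: (B, D) visual features, e.g. from a Vision Transformer.
    txt_emb: (B, D) semantic features, e.g. from a rule-based BERT.
    Row i of each tensor is assumed to describe the same traffic sign.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Cosine-similarity logits between every image/text pair in the batch.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)

    # Matching pairs (the diagonal) are positives; all other pairs are negatives.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```

Because the text branch encodes sign semantics rather than how often a class appears in the training set, aligning the two spaces this way is one plausible reading of the "frequency-independent representations" claim above.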
Related papers
- Lane-Wise Highway Anomaly Detection [8.086502588472783]
This paper proposes a scalable and interpretable framework for lane-wise highway traffic anomaly detection. Unlike traditional sensor-dependent methods, our approach uses AI-powered vision models to extract lane-specific features. Our framework outperforms state-of-the-art methods in precision, recall, and F1-score.
arXiv Detail & Related papers (2025-05-05T12:32:23Z)
- Visual Dominance and Emerging Multimodal Approaches in Distracted Driving Detection: A Review of Machine Learning Techniques [3.378738346115004]
Distracted driving continues to be a significant cause of road traffic injuries and fatalities worldwide. Recent developments in machine learning (ML) and deep learning (DL) have primarily focused on visual data to detect distraction. This systematic review assesses 74 studies that utilize ML/DL techniques for distracted driving detection across visual, sensor-based, multimodal, and emerging modalities.
arXiv Detail & Related papers (2025-05-04T02:51:00Z)
- Cross-domain Multi-step Thinking: Zero-shot Fine-grained Traffic Sign Recognition in the Wild [49.20086587208214]
Cross-domain Multi-step Thinking (CdMT) is proposed to improve zero-shot fine-grained traffic sign recognition (TSR) performance in the wild. CdMT is independent of training data and requires only simple and uniform instructions, enabling cross-country TSR. The framework achieved superior performance compared with other state-of-the-art methods on five datasets.
arXiv Detail & Related papers (2024-09-03T02:08:47Z)
- Semantic Communication for Cooperative Perception using HARQ [51.148203799109304]
We leverage an importance map to distill critical semantic information, introducing a cooperative perception semantic communication framework.
To counter the challenges posed by time-varying multipath fading, our approach incorporates orthogonal frequency-division multiplexing (OFDM) along with channel estimation and equalization strategies.
We introduce a novel semantic error detection method that is integrated with our semantic communication framework in the spirit of hybrid automatic repeat request (HARQ).
arXiv Detail & Related papers (2024-08-29T08:53:26Z)
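The entry above only names the mechanism; a toy sketch of the HARQ-style control flow it describes, where every callable is a hypothetical stand-in rather than the paper's API:

```python
def transmit_with_semantic_harq(semantic_encoder, channel, semantic_error_detector,
                                features, max_retries=3):
    """Illustrative HARQ-style loop: retransmit until the received payload
    passes a semantic error check or the retry budget is exhausted.

    All callables are assumed stand-ins, not the paper's actual components.
    """
    payload = semantic_encoder(features)      # distill critical semantics via importance map
    received = None
    for attempt in range(max_retries + 1):
        received = channel(payload)           # OFDM transmission + fading + equalization
        if not semantic_error_detector(received):
            return received                   # semantics intact: accept
        # Semantic error detected: fall through and request retransmission.
    return received                           # best effort after all retries
```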
- Cross-domain Few-shot In-context Learning for Enhancing Traffic Sign Recognition [49.20086587208214]
We propose a cross-domain few-shot in-context learning method based on a multimodal large language model (MLLM) for enhancing traffic sign recognition.
By using description texts, our method reduces the cross-domain differences between template and real traffic signs.
Our approach requires only simple and uniform textual indications, without the need for large-scale traffic sign images and labels.
arXiv Detail & Related papers (2024-07-08T10:51:03Z)
- Revolutionizing Traffic Sign Recognition: Unveiling the Potential of Vision Transformers [0.0]
Traffic Sign Recognition (TSR) holds a vital role in advancing driver assistance systems and autonomous vehicles.
This study explores three variants of Vision Transformers (PVT, TNT, LNL) and six convolutional neural networks (AlexNet, ResNet, VGG16, MobileNet, EfficientNet, GoogleNet) as baseline models.
To address the shortcomings of traditional methods, a novel pyramid EATFormer backbone is proposed, amalgamating Evolutionary Algorithms (EAs) with the Transformer architecture.
arXiv Detail & Related papers (2024-04-29T19:18:52Z)
- Traffic Sign Recognition Using Local Vision Transformer [1.8416014644193066]
This paper proposes a novel model that blends the advantages of both convolutional and transformer-based networks for traffic sign recognition.
The proposed model includes convolutional blocks for capturing local correlations and transformer-based blocks for learning global dependencies.
The experimental evaluations demonstrate that the hybrid network with the locality module outperforms pure transformer-based models and some of the best convolutional networks in accuracy.
arXiv Detail & Related papers (2023-11-11T19:42:41Z)
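The hybrid design in the entry above pairs convolutional blocks (local correlations) with transformer blocks (global dependencies). A minimal sketch of one such hybrid stage in PyTorch, with layer sizes chosen for illustration rather than taken from the paper:

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Convolution for local correlations, then self-attention for global context."""

    def __init__(self, channels=64, heads=4):
        super().__init__()
        # Local branch: a small convolution over the spatial feature map.
        self.local = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.GELU(),
        )
        # Global branch: multi-head self-attention over flattened patch tokens.
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.local(x) + x                  # local correlations, residual
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)     # (B, H*W, C) token sequence
        attn_out, _ = self.attn(seq, seq, seq)
        seq = self.norm(seq + attn_out)        # global dependencies, residual
        return seq.transpose(1, 2).reshape(b, c, h, w)
```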
- Robust Semi-supervised Federated Learning for Images Automatic Recognition in Internet of Drones [57.468730437381076]
We present a Semi-supervised Federated Learning (SSFL) framework for privacy-preserving UAV image recognition.
There are significant differences in the number, features, and distribution of local data collected by UAVs using different camera modules.
We propose an aggregation rule based on the frequency of the client's participation in training, namely the FedFreq aggregation rule.
arXiv Detail & Related papers (2022-01-03T16:49:33Z)
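The FedFreq rule in the entry above is described only as weighting aggregation by each client's participation frequency. A minimal sketch under that reading; the paper's exact weighting may differ:

```python
def fedfreq_aggregate(client_updates, participation_counts):
    """Weight each client's model update by its training-participation frequency.

    client_updates: dict client_id -> state_dict of parameter tensors.
    participation_counts: dict client_id -> number of rounds the client has joined.
    The proportional weighting here is an assumption based on the abstract.
    """
    total = sum(participation_counts[cid] for cid in client_updates)
    weights = {cid: participation_counts[cid] / total for cid in client_updates}

    aggregated = {}
    for cid, update in client_updates.items():
        for name, param in update.items():
            # Accumulate the frequency-weighted contribution of each client.
            aggregated[name] = aggregated.get(name, 0) + weights[cid] * param
    return aggregated
```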
- Aerial Images Meet Crowdsourced Trajectories: A New Approach to Robust Road Extraction [110.61383502442598]
We introduce a novel neural network framework termed Cross-Modal Message Propagation Network (CMMPNet).
CMMPNet is composed of two deep Auto-Encoders for modality-specific representation learning and a tailor-designed Dual Enhancement Module for cross-modal representation refinement.
Experiments on three real-world benchmarks demonstrate the effectiveness of our CMMPNet for robust road extraction.
arXiv Detail & Related papers (2021-11-30T04:30:10Z)
- End-to-End Intersection Handling using Multi-Agent Deep Reinforcement Learning [63.56464608571663]
Navigating through intersections is one of the main challenging tasks for an autonomous vehicle.
In this work, we focus on the implementation of a system able to navigate through intersections where only traffic signs are provided.
We propose a multi-agent system that uses a continuous, model-free Deep Reinforcement Learning algorithm to train a neural network predicting both the acceleration and the steering angle at each time step.
arXiv Detail & Related papers (2021-04-28T07:54:40Z)
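The entry above trains a network that emits two continuous actions per time step. A minimal sketch of such a policy head; the architecture and dimensions are assumptions, not the paper's:

```python
import torch
import torch.nn as nn

class DrivingPolicy(nn.Module):
    """Maps an observation vector to continuous acceleration and steering commands."""

    def __init__(self, obs_dim=32, hidden=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.head = nn.Linear(hidden, 2)   # [acceleration, steering angle]

    def forward(self, obs):
        # tanh bounds both actions to [-1, 1]; rescale to physical units downstream.
        return torch.tanh(self.head(self.backbone(obs)))
```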
- A Driving Behavior Recognition Model with Bi-LSTM and Multi-Scale CNN [59.57221522897815]
We propose a neural network model based on trajectory information for driving behavior recognition.
We evaluate the proposed model on the public BLVD dataset, achieving satisfactory performance.
arXiv Detail & Related papers (2021-03-01T06:47:29Z)
- Traffic Signs Detection and Recognition System using Deep Learning [0.0]
This paper describes an approach for efficiently detecting and recognizing traffic signs in real-time.
We tackle the traffic sign detection problem using state-of-the-art multi-object detection systems.
The focus of this paper is F-RCNN Inception v2 and Tiny YOLO v2, as they achieved the best results.
arXiv Detail & Related papers (2020-03-06T14:54:40Z)