Vision-Language Feature Alignment for Road Anomaly Segmentation
- URL: http://arxiv.org/abs/2603.01029v1
- Date: Sun, 01 Mar 2026 10:17:00 GMT
- Title: Vision-Language Feature Alignment for Road Anomaly Segmentation
- Authors: Zhuolin He, Jiacheng Tang, Jian Pu, Xiangyang Xue
- Abstract summary: We propose a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories. At inference time, we introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity and detector confidence.
- Score: 38.2615882515309
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Safe autonomous systems in complex environments require robust road anomaly segmentation to identify unknown obstacles. However, existing approaches often rely on pixel-level statistics to determine whether a region appears anomalous. This reliance leads to high false-positive rates on semantically normal background regions such as sky or vegetation, and poor recall of true out-of-distribution (OOD) instances, thereby posing safety risks for robotic perception and decision-making. To address these challenges, we propose VL-Anomaly, a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories, effectively suppressing spurious anomaly responses in background regions. At inference time, we further introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity and detector confidence, enabling more reliable anomaly prediction by leveraging complementary information sources. Extensive experiments demonstrate that VL-Anomaly achieves state-of-the-art performance on benchmark datasets including RoadAnomaly, SMIYC and Fishyscapes. Code is released at https://github.com/NickHezhuolin/VL-aligner-Road-anomaly-segment.
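The abstract names three cues for the multi-source inference step (text-guided similarity, CLIP-based image-text similarity, detector confidence) but does not state how they are combined. The sketch below is one possible reading, not the authors' method: it assumes each cue can be expressed per region as a "known-ness" score in [0, 1] and fuses them with a simple weighted average. The function names, the equal default weights, and the rescaling of cosine similarity to [0, 1] are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def known_class_similarity(region_feats, text_embeds):
    """Hypothetical text-guided cue.

    region_feats: (R, D) array, one feature vector per region or mask.
    text_embeds:  (C, D) array, one text embedding per known category.
    Returns an (R,) array: the highest cosine similarity of each region to
    any known category, rescaled from [-1, 1] to [0, 1].
    """
    sims = l2_normalize(region_feats) @ l2_normalize(text_embeds).T  # (R, C)
    return (sims.max(axis=1) + 1.0) / 2.0


def fuse_anomaly_scores(text_sim, clip_sim, det_conf, weights=(1.0, 1.0, 1.0)):
    """Assumed fusion rule: weighted average of the three 'unknown-ness' cues.

    Each input is an (R,) array of known-ness scores in [0, 1]; a region that
    all three sources consider known gets an anomaly score near 0.
    """
    cues = np.stack([1.0 - text_sim, 1.0 - clip_sim, 1.0 - det_conf], axis=0)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # normalize so the result stays in [0, 1]
    return np.tensordot(w, cues, axes=1)  # (R,)


if __name__ == "__main__":
    # Toy data: three regions, one known category embedding.
    regions = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
    known_text = np.array([[1.0, 0.0]])
    text_sim = known_class_similarity(regions, known_text)
    # Reuse the same values for the other two cues purely for demonstration.
    print(fuse_anomaly_scores(text_sim, text_sim, text_sim))
```

In this toy setup, a region aligned with a known-category embedding scores near 0 and a region pointing away from it scores near 1. A learned or thresholded combination would be equally plausible; the released code is the authoritative reference for the actual rule.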
Related papers
- Semantically Aware UAV Landing Site Assessment from Remote Sensing Imagery via Multimodal Large Language Models [5.987458168544856]
Safe UAV emergency landing requires understanding complex semantic risks invisible to traditional geometric sensors.
We propose a novel framework leveraging Remote Sensing (RS) imagery and Multimodal Large Language Models (MLLMs) for context-aware landing site assessment.
arXiv Detail & Related papers (2026-02-01T11:30:03Z)
- Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection [65.29550320117526]
We propose a novel framework named FineGrainedAD to improve anomaly localization performance.
Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings.
arXiv Detail & Related papers (2025-10-30T13:09:00Z)
- Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively.
It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
- Segmenting Objectiveness and Task-awareness Unknown Region for Autonomous Driving [46.70405993442064]
We propose a novel framework termed Segmenting Objectiveness and Task-Awareness (SOTA) for autonomous driving scenes.
SOTA enhances the segmentation of objectiveness through a Semantic Fusion Block (SFB) and filters anomalies irrelevant to road navigation tasks.
arXiv Detail & Related papers (2025-04-27T10:08:54Z)
- Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety [0.0]
We propose a multimodal approach that integrates vision-language reasoning with zero-shot object detection.
We refine object detection by incorporating OpenAI's CLIP model to match predicted hazards with bounding box annotations.
Our findings highlight the strengths and limitations of current vision-language-based approaches.
arXiv Detail & Related papers (2025-04-18T01:25:02Z)
- Unbiased Region-Language Alignment for Open-Vocabulary Dense Prediction [80.67150791183126]
Pre-trained vision-language models (VLMs) have demonstrated impressive zero-shot recognition capability, but still underperform in dense prediction tasks.
We propose DenseVLM, a framework designed to learn unbiased region-language alignment from powerful pre-trained VLM representations.
We show that DenseVLM can directly replace the original VLM in open-vocabulary object detection and image segmentation methods.
arXiv Detail & Related papers (2024-12-09T06:34:23Z)
- VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection [5.66050466694651]
We propose incorporating Vision-Language (VL) encoders into existing anomaly detectors to leverage the semantically broad VL pre-training for improved outlier awareness.
We also propose a new scoring function that enables data- and training-free outlier supervision via textual prompts.
The resulting VL4AD model achieves competitive performance on widely used benchmark datasets.
arXiv Detail & Related papers (2024-09-25T20:12:10Z)
- Applying Unsupervised Semantic Segmentation to High-Resolution UAV Imagery for Enhanced Road Scene Parsing [12.558144256470827]
A novel unsupervised road parsing framework is presented.
The proposed method achieves a mean Intersection over Union (mIoU) of 89.96% on the development dataset without any manual annotation.
arXiv Detail & Related papers (2024-02-05T13:16:12Z)
- Semi-supervised Domain Adaptive Structure Learning [72.01544419893628]
Semi-supervised domain adaptation (SSDA) is a challenging problem requiring methods to overcome both 1) overfitting towards poorly annotated data and 2) distribution shift across domains.
We introduce an adaptive structure learning method to regularize the cooperation of SSL and DA.
arXiv Detail & Related papers (2021-12-12T06:11:16Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Inter-class Discrepancy Alignment for Face Recognition [55.578063356210144]
We propose a unified framework called Inter-class Discrepancy Alignment (IDA).
IDA-DAO is used to align the similarity scores considering the discrepancy between the images and their neighbors.
IDA-SSE can provide convincing inter-class neighbors by introducing virtual candidate images generated with GAN.
arXiv Detail & Related papers (2021-03-02T08:20:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.