AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception
- URL: http://arxiv.org/abs/2510.27047v1
- Date: Thu, 30 Oct 2025 23:30:33 GMT
- Title: AD-SAM: Fine-Tuning the Segment Anything Vision Foundation Model for Autonomous Driving Perception
- Authors: Mario Camarena, Het Patel, Fatemeh Nazari, Evangelos Papalexakis, Mohamadhossein Noruzoliaee, Jia Chen,
- Abstract summary: The Autonomous Driving Segment Anything Model (AD-SAM) is a fine-tuned vision foundation model for semantic segmentation in autonomous driving (AD). AD-SAM extends the Segment Anything Model (SAM) with a dual-encoder and deformable decoder tailored to the spatial and geometric complexity of road scenes. Experiments show that AD-SAM surpasses SAM, Generalized SAM (G-SAM), and a deep learning baseline (DeepLabV3) in segmentation accuracy.
- Score: 3.298091299319354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the Autonomous Driving Segment Anything Model (AD-SAM), a fine-tuned vision foundation model for semantic segmentation in autonomous driving (AD). AD-SAM extends the Segment Anything Model (SAM) with a dual-encoder and deformable decoder tailored to the spatial and geometric complexity of road scenes. The dual-encoder produces multi-scale fused representations by combining global semantic context from SAM's pretrained Vision Transformer (ViT-H) with local spatial detail from a trainable convolutional deep learning backbone (i.e., ResNet-50). A deformable fusion module aligns heterogeneous features across scales and object geometries. The decoder performs progressive multi-stage refinement using deformable attention. Training is guided by a hybrid loss that integrates Focal, Dice, Lovasz-Softmax, and Surface losses, improving semantic class balance, boundary precision, and optimization stability. Experiments on the Cityscapes and Berkeley DeepDrive 100K (BDD100K) benchmarks show that AD-SAM surpasses SAM, Generalized SAM (G-SAM), and a deep learning baseline (DeepLabV3) in segmentation accuracy. It achieves 68.1 mean Intersection over Union (mIoU) on Cityscapes and 59.5 mIoU on BDD100K, outperforming SAM, G-SAM, and DeepLabV3 by margins of up to +22.9 and +19.2 mIoU in structured and diverse road scenes, respectively. AD-SAM demonstrates strong cross-domain generalization with a 0.87 retention score (vs. 0.76 for SAM) and faster, more stable learning dynamics, converging within 30-40 epochs, roughly twice as fast as the benchmark models. It maintains 0.607 mIoU with only 1000 training samples, suggesting a data efficiency that is critical for reducing annotation costs. These results confirm that targeted architectural and optimization enhancements to foundation models enable reliable and scalable AD perception.
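The hybrid training objective is the most readily sketched piece of the method. The snippet below is a minimal PyTorch illustration of a weighted Focal + Dice combination; the paper's full loss additionally includes Lovasz-Softmax and Surface terms, which are omitted here, and all weights and shapes are illustrative assumptions rather than the authors' implementation.

```python
# Hedged sketch of a hybrid segmentation loss in the spirit of AD-SAM's objective.
# Only Focal and Dice terms are implemented; Lovasz-Softmax and Surface/boundary
# terms would be added analogously. Weights and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F


def focal_loss(logits, target, gamma=2.0):
    """Multi-class focal loss over per-pixel logits [B, C, H, W] and labels [B, H, W]."""
    log_prob = F.log_softmax(logits, dim=1)
    log_pt = log_prob.gather(1, target.unsqueeze(1)).squeeze(1)   # log-prob of the true class
    pt = log_pt.exp()
    return (-(1.0 - pt).pow(gamma) * log_pt).mean()


def dice_loss(logits, target, eps=1e-6):
    """Soft Dice loss averaged over classes."""
    num_classes = logits.shape[1]
    prob = F.softmax(logits, dim=1)
    one_hot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)
    intersection = (prob * one_hot).sum(dims)
    cardinality = prob.sum(dims) + one_hot.sum(dims)
    dice = (2.0 * intersection + eps) / (cardinality + eps)
    return 1.0 - dice.mean()


def hybrid_loss(logits, target, w_focal=1.0, w_dice=1.0):
    """Weighted sum of the terms; further terms (Lovasz-Softmax, Surface) slot in the same way."""
    return w_focal * focal_loss(logits, target) + w_dice * dice_loss(logits, target)


if __name__ == "__main__":
    logits = torch.randn(2, 19, 64, 64)           # e.g., 19 Cityscapes classes
    target = torch.randint(0, 19, (2, 64, 64))
    print(hybrid_loss(logits, target).item())
```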
Related papers
- Generalization vs. Specialization: Evaluating Segment Anything Model (SAM3) Zero-Shot Segmentation Against Fine-Tuned YOLO Detectors [3.5648679864643573]
This work presents a comparison between SAM3 (Segment Anything Model, also called SAMv3) operating in zero-shot mode and three variants of Ultralytics YOLO11 fine-tuned for instance segmentation. YOLO exhibits a steep degradation of 48-50 points across IoU thresholds, whereas SAM3 drops only 4 points, indicating roughly 12 times better boundary stability for SAM3.
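As a rough illustration of the kind of comparison described (not the paper's evaluation code), the sketch below measures how often predicted masks clear progressively stricter IoU thresholds, which is where boundary-stability gaps between models show up.

```python
# Toy sketch, assumed for exposition: success rate of predicted masks at
# increasingly strict IoU thresholds.
import numpy as np


def mask_iou(pred, gt):
    """IoU between two binary masks of the same shape."""
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return np.logical_and(pred, gt).sum() / union


def success_rate_by_threshold(pred_masks, gt_masks, thresholds=(0.5, 0.75, 0.9)):
    """Fraction of predictions whose IoU with ground truth clears each threshold."""
    ious = np.array([mask_iou(p, g) for p, g in zip(pred_masks, gt_masks)])
    return {t: float((ious >= t).mean()) for t in thresholds}


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    preds = [rng.random((64, 64)) > 0.5 for _ in range(8)]
    gts = [rng.random((64, 64)) > 0.5 for _ in range(8)]
    print(success_rate_by_threshold(preds, gts))
```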
arXiv Detail & Related papers (2025-12-09T01:54:04Z)
- Continual Alignment for SAM: Rethinking Foundation Models for Medical Image Segmentation in Continual Learning [42.22450978339022]
In medical image segmentation, heterogeneous privacy policies across institutions often make joint training on pooled datasets infeasible. While the Segment Anything Model (SAM) offers strong zero-shot priors, its large parameter count and computational overhead challenge practical deployment. This paper demonstrates that the SAM paradigm is highly promising once its computational efficiency and performance can be balanced.
arXiv Detail & Related papers (2025-11-21T12:27:49Z)
- VesSAM: Efficient Multi-Prompting for Segmenting Complex Vessel [68.24765319399286]
We present VesSAM, a powerful and efficient framework tailored for 2D vessel segmentation. VesSAM integrates (1) a convolutional adapter to enhance local texture features, (2) a multi-prompt encoder that fuses anatomical prompts, and (3) a lightweight mask decoder to reduce jagged artifacts. VesSAM consistently outperforms state-of-the-art PEFT-based SAM variants by over 10% Dice and 13% IoU.
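A hedged sketch of the general convolutional-adapter pattern used by parameter-efficient SAM variants (an assumption about the idea, not VesSAM's actual module): a small bottleneck convolutional branch is added residually to frozen backbone features to re-inject local texture detail.

```python
# Illustrative residual conv adapter; channel sizes are assumptions, not VesSAM's.
import torch
import torch.nn as nn


class ConvAdapter(nn.Module):
    def __init__(self, channels: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Conv2d(channels, bottleneck, kernel_size=1)                  # reduce channels
        self.conv = nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1)     # local texture
        self.up = nn.Conv2d(bottleneck, channels, kernel_size=1)                    # restore channels
        self.act = nn.GELU()

    def forward(self, x):  # x: [B, C, H, W] features from a frozen backbone
        return x + self.up(self.act(self.conv(self.act(self.down(x)))))


if __name__ == "__main__":
    feats = torch.randn(1, 256, 64, 64)
    print(ConvAdapter(256)(feats).shape)   # torch.Size([1, 256, 64, 64])
```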
arXiv Detail & Related papers (2025-11-02T15:47:05Z)
- Adapting Segment Anything Model for Unseen Object Instance Segmentation [70.60171342436092]
Unseen Object Instance Segmentation (UOIS) is crucial for autonomous robots operating in unstructured environments.
We propose UOIS-SAM, a data-efficient solution for the UOIS task.
UOIS-SAM integrates two key components: (i) a Heatmap-based Prompt Generator (HPG) to generate class-agnostic point prompts with precise foreground prediction, and (ii) a Hierarchical Discrimination Network (HDNet) that adapts SAM's mask decoder.
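The heatmap-to-prompt step can be sketched generically (an illustration of the general idea, not the UOIS-SAM implementation): keep local maxima of a foreground heatmap as class-agnostic point prompts.

```python
# Sketch, assuming a [H, W] foreground heatmap in [0, 1]; threshold and kernel
# size are illustrative choices, not values from the paper.
import torch
import torch.nn.functional as F


def heatmap_to_point_prompts(heatmap, threshold=0.5, kernel=5, max_points=32):
    """Return (x, y) coordinates of heatmap local maxima above a threshold."""
    h = heatmap.unsqueeze(0).unsqueeze(0)                       # [1, 1, H, W]
    pooled = F.max_pool2d(h, kernel, stride=1, padding=kernel // 2)
    peaks = (h == pooled) & (h > threshold)                     # local maxima above threshold
    ys, xs = torch.nonzero(peaks[0, 0], as_tuple=True)
    scores = heatmap[ys, xs]
    order = scores.argsort(descending=True)[:max_points]        # strongest peaks first
    return torch.stack([xs[order], ys[order]], dim=1)           # [N, 2] point prompts, (x, y)


if __name__ == "__main__":
    hm = torch.rand(128, 128)
    print(heatmap_to_point_prompts(hm).shape)
```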
arXiv Detail & Related papers (2024-09-23T19:05:50Z)
- PAM: A Propagation-Based Model for Segmenting Any 3D Objects across Multi-Modal Medical Images [11.373941923130305]
PAM (Propagating Anything Model) is a segmentation approach that uses a 2D prompt, like a bounding box or sketch, to create a complete 3D segmentation of medical image volumes.
It significantly outperformed existing models like MedSAM and SegVol, with an average improvement of over 18.1% in dice similarity coefficient (DSC) across 44 medical datasets and various object types.
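A schematic sketch of the propagation idea (an assumption about the general recipe, not PAM's code): carry a mask slice to slice through the volume, using each slice's mask to seed a box prompt for the next slice; `segment_slice` is a hypothetical per-slice segmenter supplied by the caller.

```python
# Sketch of slice-to-slice propagation through a 3D volume under stated assumptions.
import numpy as np


def mask_to_box(mask, pad=5):
    """Bounding box (x0, y0, x1, y1) of a binary mask, slightly padded."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return None
    return (max(xs.min() - pad, 0), max(ys.min() - pad, 0), xs.max() + pad, ys.max() + pad)


def propagate_through_volume(volume, start_idx, start_box, segment_slice):
    """volume: [D, H, W]; segment_slice(image_2d, box) -> binary mask (hypothetical callable)."""
    masks = {start_idx: segment_slice(volume[start_idx], start_box)}
    box = mask_to_box(masks[start_idx])
    for z in range(start_idx + 1, volume.shape[0]):     # forward pass; backward pass is symmetric
        if box is None:
            break
        masks[z] = segment_slice(volume[z], box)
        box = mask_to_box(masks[z])
    return masks


if __name__ == "__main__":
    vol = np.random.rand(16, 64, 64)
    dummy_segment = lambda img, box: img > 0.5           # stand-in segmenter for illustration
    print(len(propagate_through_volume(vol, 0, (0, 0, 63, 63), dummy_segment)))
```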
arXiv Detail & Related papers (2024-08-25T13:42:47Z)
- From SAM to SAM 2: Exploring Improvements in Meta's Segment Anything Model [0.5639904484784127]
The Segment Anything Model (SAM) was introduced to the computer vision community by Meta in April 2023.
SAM excels in zero-shot performance, segmenting unseen objects without additional training, enabled by a large dataset of over one billion image masks.
SAM 2 expands this functionality to video, leveraging memory from preceding and subsequent frames to generate accurate segmentation across entire videos.
arXiv Detail & Related papers (2024-08-12T17:17:35Z)
- HRSAM: Efficient Interactive Segmentation in High-Resolution Images [59.537068118473066]
Segment Anything Model (SAM) has advanced interactive segmentation but is limited by the high computational cost on high-resolution images.
We focus on visual length extrapolation and propose a lightweight model named HRSAM.
The extrapolation enables HRSAM trained on low resolutions to generalize to high resolutions.
arXiv Detail & Related papers (2024-07-02T09:51:56Z)
- Moving Object Segmentation: All You Need Is SAM (and Flow) [82.78026782967959]
We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects.
In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt.
These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks.
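The second variant (flow as a prompt) can be illustrated with a simple heuristic, assumed here for exposition rather than taken from the paper: derive a point prompt from the centroid of high-motion pixels and hand it to a SAM-style point-prompt interface.

```python
# Sketch of turning an optical-flow field into a point prompt; the quantile
# threshold is an illustrative assumption.
import numpy as np


def flow_to_point_prompt(flow, mag_quantile=0.9):
    """flow: [H, W, 2] array of (dx, dy). Returns one (x, y) prompt at the centroid
    of the fastest-moving pixels, or None if nothing moves."""
    mag = np.linalg.norm(flow, axis=-1)                  # per-pixel motion magnitude
    thresh = np.quantile(mag, mag_quantile)              # keep only the fastest pixels
    ys, xs = np.nonzero(mag >= thresh)
    if len(xs) == 0:
        return None
    return np.array([[xs.mean(), ys.mean()]])            # [1, 2] point prompt, (x, y)


if __name__ == "__main__":
    flow = np.random.randn(240, 320, 2).astype(np.float32)
    print(flow_to_point_prompt(flow))
```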
arXiv Detail & Related papers (2024-04-18T17:59:53Z)
- TinySAM: Pushing the Envelope for Efficient Segment Anything Model [73.06322749886483]
We propose a framework to obtain a tiny segment anything model (TinySAM) while maintaining the strong zero-shot performance. With all these proposed methods, our TinySAM achieves an order-of-magnitude computational reduction and pushes the envelope for the efficient segment anything task.
arXiv Detail & Related papers (2023-12-21T12:26:11Z)
- Stable Segment Anything Model [79.9005670886038]
The Segment Anything Model (SAM) achieves remarkable promptable segmentation given high-quality prompts.
This paper presents the first comprehensive analysis on SAM's segmentation stability across a diverse spectrum of prompt qualities.
Our solution, termed Stable-SAM, offers several advantages: 1) improving SAM's segmentation stability across a wide range of prompt qualities, while 2) retaining SAM's powerful promptable segmentation efficiency and generality.
arXiv Detail & Related papers (2023-11-27T12:51:42Z)
- SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding [40.40630116715132]
The landscape of publicly available vision foundation models (VFMs) is expanding rapidly.
We introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise.
By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer.
arXiv Detail & Related papers (2023-10-23T19:21:57Z)
- MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the differing label taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z)