How to Benchmark Vision Foundation Models for Semantic Segmentation?
- URL: http://arxiv.org/abs/2404.12172v2
- Date: Mon, 10 Jun 2024 10:05:01 GMT
- Title: How to Benchmark Vision Foundation Models for Semantic Segmentation?
- Authors: Tommie Kerssies, Daan de Geus, Gijs Dubbelman
- Abstract summary: This paper studies how vision foundation models (VFMs) should be benchmarked for semantic segmentation.
Various VFMs are fine-tuned under various settings, and the impact of individual settings on the performance ranking and training time is assessed.
Using multiple datasets for training and evaluation is also recommended, as the performance ranking across datasets and domain shifts varies.
- Score: 1.8570591025615457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent vision foundation models (VFMs) have demonstrated proficiency in various tasks but require supervised fine-tuning to perform the task of semantic segmentation effectively. Benchmarking their performance is essential for selecting current models and guiding future model developments for this task. The lack of a standardized benchmark complicates comparisons. Therefore, the primary objective of this paper is to study how VFMs should be benchmarked for semantic segmentation. To do so, various VFMs are fine-tuned under various settings, and the impact of individual settings on the performance ranking and training time is assessed. Based on the results, the recommendation is to fine-tune the ViT-B variants of VFMs with a 16x16 patch size and a linear decoder, as these settings are representative of using a larger model, more advanced decoder and smaller patch size, while reducing training time by more than 13 times. Using multiple datasets for training and evaluation is also recommended, as the performance ranking across datasets and domain shifts varies. Linear probing, a common practice for some VFMs, is not recommended, as it is not representative of end-to-end fine-tuning. The benchmarking setup recommended in this paper enables a performance analysis of VFMs for semantic segmentation. The findings of such an analysis reveal that pretraining with promptable segmentation is not beneficial, whereas masked image modeling (MIM) with abstract representations is crucial, even more important than the type of supervision used. The code for efficiently fine-tuning VFMs for semantic segmentation can be accessed through the project page at: https://tue-mps.github.io/benchmark-vfm-ss/.
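The recommended benchmarking setting (ViT-B backbone, 16x16 patch size, linear decoder, end-to-end fine-tuning) can be illustrated with a minimal sketch. The snippet below is an assumption of how such a segmenter is commonly wired up in PyTorch with timm, not the authors' released code; the model name, class count, and optimizer settings are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import timm


class LinearDecoderSegmenter(nn.Module):
    """ViT-B/16 backbone with a linear (1x1 conv) decoder over patch tokens."""

    def __init__(self, num_classes: int, backbone_name: str = "vit_base_patch16_224"):
        super().__init__()
        # The pretrained weights here stand in for whichever VFM checkpoint is benchmarked.
        self.backbone = timm.create_model(backbone_name, pretrained=True, num_classes=0)
        self.patch_size = 16
        self.head = nn.Conv2d(self.backbone.num_features, num_classes, kernel_size=1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b, _, h, w = images.shape
        tokens = self.backbone.forward_features(images)  # (B, 1 + N, C), incl. class token
        patch_tokens = tokens[:, 1:]                     # keep only the N patch tokens
        gh, gw = h // self.patch_size, w // self.patch_size
        feats = patch_tokens.transpose(1, 2).reshape(b, -1, gh, gw)
        logits = self.head(feats)                        # coarse per-patch class logits
        # Upsample back to the input resolution for per-pixel predictions.
        return F.interpolate(logits, size=(h, w), mode="bilinear", align_corners=False)


model = LinearDecoderSegmenter(num_classes=19)  # e.g. 19 classes as in Cityscapes
# End-to-end fine-tuning updates backbone and decoder; linear probing would instead
# freeze the backbone, which the paper finds unrepresentative of fine-tuning.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss(ignore_index=255)

images = torch.randn(2, 3, 224, 224)
targets = torch.randint(0, 19, (2, 224, 224))
loss = criterion(model(images), targets)
loss.backward()
optimizer.step()
```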
Related papers
- Rein++: Efficient Generalization and Adaptation for Semantic Segmentation with Vision Foundation Models [47.66611300605174]
Rein++ is an efficient VFM-based segmentation framework.
It demonstrates superior generalization from limited data.
It enables effective adaptation to diverse unlabeled scenarios.
arXiv Detail & Related papers (2025-08-03T08:53:30Z)
- Benchmarking Feature Upsampling Methods for Vision Foundation Models using Interactive Segmentation [24.531539125814877]
Vision Foundation Models (VFMs) are large-scale, pre-trained models that serve as general-purpose backbones for various computer vision tasks.
One way to tackle this limitation is by employing a task-agnostic feature upsampling module that refines the resolution of VFM features.
Our benchmarking experiments show that selecting appropriate upsampling strategies significantly improves the quality of VFM features.
arXiv Detail & Related papers (2025-05-04T11:59:26Z)
- Prior2Former -- Evidential Modeling of Mask Transformers for Assumption-Free Open-World Panoptic Segmentation [74.55677741919035]
We propose Prior2Former (P2F) as the first approach for segmentation vision transformers rooted in evidential learning.
P2F extends the mask vision transformer architecture by incorporating a Beta prior for computing model uncertainty in pixel-wise binary mask assignments.
It achieves the highest ranking in the OoDIS anomaly instance benchmark among methods not using OOD data in any way.
arXiv Detail & Related papers (2025-04-07T08:53:14Z)
- Exploring Few-Shot Defect Segmentation in General Industrial Scenarios with Metric Learning and Vision Foundation Models [8.96299670050608]
This paper aims to explore few-shot semantic segmentation (FSS) in broader industrial products with various defect types.
We thoroughly investigate metric learning-based FSS methods, including those based on meta-learning and those based on Vision Foundation Models (VFMs).
We propose a novel, efficient few-shot defect segmentation (FDS) method based on feature matching.
arXiv Detail & Related papers (2025-02-03T10:13:34Z)
- Foundation Model or Finetune? Evaluation of few-shot semantic segmentation for river pollution [16.272314073324626]
Foundation models (FMs) are a popular topic of research in AI.
In this work, we compare the performance of FMs to finetuned pre-trained supervised models in the task of semantic segmentation.
We see that finetuned models consistently outperform the FMs tested, even in cases where data is scarce.
arXiv Detail & Related papers (2024-09-05T17:59:32Z)
- Variational Autoencoder for Anomaly Detection: A Comparative Study [1.9131868049527914]
This paper aims to conduct a comparative analysis of contemporary Variational Autoencoder (VAE) architectures employed in anomaly detection.
The architectural configurations under consideration encompass the original VAE baseline, the VAE with a Gaussian Random Field prior (VAE-GRF), and the VAE incorporating a vision transformer (ViT-VAE).
arXiv Detail & Related papers (2024-08-24T12:07:57Z)
- Precision matters: Precision-aware ensemble for weakly supervised semantic segmentation [14.931551206723041]
Weakly Supervised Semantic Segmentation (WSSS) employs weak supervision, such as image-level labels, to train the segmentation model.
We propose ORANDNet, an advanced ensemble approach tailored for WSSS.
arXiv Detail & Related papers (2024-06-28T03:58:02Z)
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring the pretrained models to downstream tasks may encounter task discrepancy, because pretraining is formulated as image classification or object discrimination.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
- A Novel Benchmark for Few-Shot Semantic Segmentation in the Era of Foundation Models [7.428199805959228]
Few-shot semantic segmentation (FSS) is a crucial challenge in computer vision.
With the emergence of vision foundation models (VFM) as generalist feature extractors, we seek to explore the adaptation of these models for FSS.
We propose a novel realistic benchmark with a simple and straightforward adaptation process tailored for this task.
arXiv Detail & Related papers (2024-01-20T19:50:51Z)
- RAP-SAM: Towards Real-Time All-Purpose Segment Anything [120.17175256421622]
The Segment Anything Model (SAM) is a remarkable model that achieves generalized segmentation.
Current real-time segmentation mainly has one purpose, such as semantic segmentation on the driving scene.
This work explores a new real-time segmentation setting, named all-purpose segmentation in real-time, to transfer VFMs in real-time deployment.
arXiv Detail & Related papers (2024-01-18T18:59:30Z)
- Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation [62.275143240798236]
Video semantic segmentation datasets have limited categories per video.
Less than 10% of queries could be matched to receive meaningful gradient updates during VSS training.
Our method achieves state-of-the-art performance on the latest challenging VSS benchmark VSPW without bells and whistles.
arXiv Detail & Related papers (2023-09-14T20:31:06Z)
- What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation [2.7036595757881323]
We build a benchmark for Multi-domain Evaluation of Semantic Segmentation (MESS).
MESS allows a holistic analysis of performance across a wide range of domain-specific datasets.
We evaluate eight recently published models on the proposed MESS benchmark and analyze characteristics for the performance of zero-shot transfer models.
arXiv Detail & Related papers (2023-06-27T14:47:43Z)
- Task Residual for Tuning Vision-Language Models [69.22958802711017]
We propose a new efficient tuning approach for vision-language models (VLMs) named Task Residual Tuning (TaskRes).
TaskRes explicitly decouples the prior knowledge of the pre-trained models and new knowledge regarding a target task.
The proposed TaskRes is simple yet effective, which significantly outperforms previous methods on 11 benchmark datasets.
arXiv Detail & Related papers (2022-11-18T15:09:03Z)
- MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains.
We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images.
A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z)
- Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z)
- Reviving Iterative Training with Mask Guidance for Interactive Segmentation [8.271859911016719]
Recent works on click-based interactive segmentation have demonstrated state-of-the-art results by using various inference-time optimization schemes.
We propose a simple feedforward model for click-based interactive segmentation that employs the segmentation masks from previous steps.
We find that the models trained on a combination of COCO and LVIS with diverse and high-quality annotations show performance superior to all existing models.
arXiv Detail & Related papers (2021-02-12T15:44:31Z)
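The mask-guidance idea in the last entry above (Reviving Iterative Training with Mask Guidance for Interactive Segmentation) amounts to a plain feedforward model whose input concatenates the image, the mask predicted at the previous interaction step, and encoded user clicks. The tiny encoder below is an illustrative assumption, not the paper's architecture; channel layout and sizes are placeholders.

```python
import torch
import torch.nn as nn


class MaskGuidedSegmenter(nn.Module):
    def __init__(self):
        super().__init__()
        # 3 image channels + 1 previous-mask channel + 2 click-map channels (positive/negative).
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 1, kernel_size=1),  # binary mask logits
        )

    def forward(self, image, prev_mask, clicks):
        # A single feedforward pass; no inference-time optimization scheme is needed.
        return self.net(torch.cat([image, prev_mask, clicks], dim=1))


model = MaskGuidedSegmenter()
image = torch.randn(1, 3, 256, 256)
prev_mask = torch.zeros(1, 1, 256, 256)   # empty mask at the first interaction
clicks = torch.zeros(1, 2, 256, 256)      # encoded positive/negative clicks
logits = model(image, prev_mask, clicks)
# At the next interaction round, the thresholded prediction is fed back as prev_mask.
next_mask = (torch.sigmoid(logits) > 0.5).float()
```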