Quickly Tuning Foundation Models for Image Segmentation
- URL: http://arxiv.org/abs/2508.17283v1
- Date: Sun, 24 Aug 2025 10:06:02 GMT
- Title: Quickly Tuning Foundation Models for Image Segmentation
- Authors: Breenda Das, Lennart Purucker, Timur Carstensen, Frank Hutter
- Abstract summary: We introduce QTT-SEG, a meta-learning-driven approach for automating and accelerating the fine-tuning of SAM for image segmentation. QTT-SEG predicts high-performing configurations using meta-learned cost and performance models. Our results show that QTT-SEG consistently improves upon SAM's zero-shot performance and surpasses AutoGluon Multimodal.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundation models like SAM (Segment Anything Model) exhibit strong zero-shot image segmentation performance, but often fall short on domain-specific tasks. Fine-tuning these models typically requires significant manual effort and domain expertise. In this work, we introduce QTT-SEG, a meta-learning-driven approach for automating and accelerating the fine-tuning of SAM for image segmentation. Built on the Quick-Tune hyperparameter optimization framework, QTT-SEG predicts high-performing configurations using meta-learned cost and performance models, efficiently navigating a search space of over 200 million possibilities. We evaluate QTT-SEG on eight binary and five multiclass segmentation datasets under tight time constraints. Our results show that QTT-SEG consistently improves upon SAM's zero-shot performance and surpasses AutoGluon Multimodal, a strong AutoML baseline, on most binary tasks within three minutes. On multiclass datasets, QTT-SEG delivers consistent gains as well. These findings highlight the promise of meta-learning in automating model adaptation for specialized segmentation tasks. Code available at: https://github.com/ds-brx/QTT-SEG/
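The abstract describes the core Quick-Tune idea: rank candidate fine-tuning configurations with meta-learned performance and cost models, then spend a tight time budget (here, three minutes) on the most promising ones. The following is an illustrative sketch only, not the authors' implementation: the surrogate functions, config fields (`lr_score`, `head_only`), and the greedy performance-per-cost selection rule are all hypothetical stand-ins for the meta-learned models described in the paper.

```python
import random

def performance_surrogate(cfg):
    # Hypothetical stand-in for a meta-learned model that predicts
    # validation quality (e.g. IoU) of a fine-tuning configuration.
    return 0.5 + 0.3 * cfg["lr_score"] + 0.2 * cfg["head_only"]

def cost_surrogate(cfg):
    # Hypothetical stand-in for a meta-learned model that predicts
    # fine-tuning wall-clock time in seconds for a configuration.
    return 30.0 + 120.0 * (1 - cfg["head_only"])

def select_configs(candidates, budget_s):
    """Greedily pick configs by predicted performance per unit of predicted
    cost, keeping the total predicted cost within the time budget."""
    ranked = sorted(candidates,
                    key=lambda c: performance_surrogate(c) / cost_surrogate(c),
                    reverse=True)
    chosen, spent = [], 0.0
    for cfg in ranked:
        c = cost_surrogate(cfg)
        if spent + c <= budget_s:
            chosen.append(cfg)
            spent += c
    return chosen

random.seed(0)
# Toy search space: in the paper the space has over 200 million possibilities;
# here we sample 50 random configs for illustration.
candidates = [{"lr_score": random.random(), "head_only": random.choice([0, 1])}
              for _ in range(50)]
picked = select_configs(candidates, budget_s=180.0)  # three-minute budget
```

The design point the sketch illustrates is that the surrogates are queried instead of training each candidate, so ranking the whole candidate pool is nearly free and the budget is spent only on configurations predicted to be both fast and strong.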
Related papers
- SAM$^{*}$: Task-Adaptive SAM with Physics-Guided Rewards [0.5805874695844994]
Image segmentation is a critical task in microscopy, essential for accurately analyzing and interpreting complex visual data. Here, we introduce a reward function-based optimization to fine-tune foundational models. We demonstrate the effectiveness of this approach in microscopy imaging, where precise segmentation is crucial for analyzing cellular structures, material interfaces, and nanoscale features.
arXiv Detail & Related papers (2025-09-08T13:51:20Z) - SAMPO: Visual Preference Optimization for Intent-Aware Segmentation with Vision Foundation Models [5.3279948735247284]
We introduce SAMPO, a novel framework that teaches visual foundation models to infer high-level categorical intent from sparse visual interactions. Our work establishes a new paradigm for intent-aware alignment in visual foundation models, removing dependencies on auxiliary prompt generators or language-model-assisted preference learning.
arXiv Detail & Related papers (2025-08-04T14:31:11Z) - Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation [15.941958367737408]
Seg-TTO is a framework for zero-shot, open-vocabulary semantic segmentation. We focus on segmentation-specific test-time optimization to address this gap. Seg-TTO demonstrates clear performance improvements (up to 27% mIoU increase on some datasets), establishing a new state of the art.
arXiv Detail & Related papers (2025-01-08T18:58:24Z) - TAVP: Task-Adaptive Visual Prompt for Cross-domain Few-shot Segmentation [40.49924427388922]
We propose a task-adaptive auto-visual prompt framework for Cross-domain Few-shot Segmentation (CD-FSS). We incorporate a Class Domain Task-Adaptive Auto-Prompt (CDTAP) module to enable class-domain feature extraction and generate high-quality, learnable visual prompts. Our model outperforms the state-of-the-art CD-FSS approach, achieving an average accuracy improvement of 1.3% in the 1-shot setting and 11.76% in the 5-shot setting.
arXiv Detail & Related papers (2024-09-09T07:43:58Z) - OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs.
We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z) - RMP-SAM: Towards Real-Time Multi-Purpose Segment Anything [117.02741621686677]
This work explores a novel real-time segmentation setting called real-time multi-purpose segmentation. It contains three fundamental sub-tasks: interactive segmentation, panoptic segmentation, and video instance segmentation. We present a novel dynamic convolution-based method, Real-Time Multi-Purpose SAM (RMP-SAM). It contains an efficient encoder and an efficient decoupled adapter to perform prompt-driven decoding.
arXiv Detail & Related papers (2024-01-18T18:59:30Z) - Prompt Tuning for Parameter-efficient Medical Image Segmentation [79.09285179181225]
We propose and investigate several contributions to achieve a parameter-efficient but effective adaptation for semantic segmentation on two medical imaging datasets.
We pre-train this architecture with a dedicated dense self-supervision scheme based on assignments to online generated prototypes.
We demonstrate that the resulting neural network model is able to attenuate the gap between fully fine-tuned and parameter-efficiently adapted models.
arXiv Detail & Related papers (2022-11-16T21:55:05Z) - Integrative Few-Shot Learning for Classification and Segmentation [37.50821005917126]
We introduce the integrative task of few-shot classification and segmentation (FS-CS).
FS-CS aims to classify and segment target objects in a query image when the target classes are given with a few examples.
We propose the integrative few-shot learning framework for FS-CS, which trains a learner to construct class-wise foreground maps.
arXiv Detail & Related papers (2022-03-29T16:14:40Z) - Score-based Generative Modeling in Latent Space [93.8985523558869]
Score-based generative models (SGMs) have recently demonstrated impressive results in terms of both sample quality and distribution coverage.
Here, we propose the Latent Score-based Generative Model (LSGM), a novel approach that trains SGMs in a latent space.
Moving from data to latent space allows us to train more expressive generative models, apply SGMs to non-continuous data, and learn smoother SGMs in a smaller space.
arXiv Detail & Related papers (2021-06-10T17:26:35Z) - Fast Video Object Segmentation With Temporal Aggregation Network and Dynamic Template Matching [67.02962970820505]
We introduce "tracking-by-detection" into Video Object (VOS)
We propose a new temporal aggregation network and a novel dynamic time-evolving template matching mechanism to achieve significantly improved performance.
We achieve new state-of-the-art performance on the DAVIS benchmark in both speed and accuracy, without complicated bells and whistles: 0.14 seconds per frame at a J&F measure of 75.9%.
arXiv Detail & Related papers (2020-07-11T05:44:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.