Related papers: SAMPO: Visual Preference Optimization for Intent-Aware Segmentation with Vision Foundation Models

URL: http://arxiv.org/abs/2508.02464v1
Date: Mon, 04 Aug 2025 14:31:11 GMT
Title: SAMPO: Visual Preference Optimization for Intent-Aware Segmentation with Vision Foundation Models
Authors: Yonghuang Wu, Wenwen Zeng, Xuan Xie, Chengqian Zhao, Guoqing Wu, Jinhua Yu
Abstract summary: We introduce SAMPO, a novel framework that teaches visual foundation models to infer high-level categorical intent from sparse visual interactions. Our work establishes a new paradigm for intent-aware alignment in visual foundation models, removing dependencies on auxiliary prompt generators or language-model-assisted preference learning.
Score: 5.3279948735247284
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Foundation models like Segment Anything Model (SAM) excel in promptable segmentation but suffer from an intent gap: they segment only explicitly prompted objects, failing to generalize to semantically related instances implicitly desired by users. This limitation is critical in domains with dense homogeneous objects (e.g., biomedical nuclei segmentation), where sparse visual prompts typically yield incomplete results, rendering dense annotations impractical due to prohibitive cost. To bridge this gap, we introduce SAMPO (Segment Anything Model with Preference Optimization), a novel framework that teaches visual foundation models to infer high-level categorical intent from sparse visual interactions. Unlike conventional pixel-level fine-tuning, SAMPO optimizes models to implicitly capture target-class characteristics through preference optimization. This approach, which operates without dependency on language models, enables robust multi-object segmentation even under sparse prompting and demonstrates superior data efficiency during fine-tuning. Validated on three medical segmentation tasks, SAMPO achieves state-of-the-art performance: on challenging tasks like PanNuke-T2, our method, when fine-tuned with only 10% of the training data, significantly outperforms all existing methods trained on the full 100% dataset, achieving an improvement of over 9 percentage points compared to the best baseline. Our work establishes a new paradigm for intent-aware alignment in visual foundation models, removing dependencies on auxiliary prompt generators or language-model-assisted preference learning.

Related papers

Stable Diffusion Models are Secretly Good at Visual In-Context Learning [9.829303881652548]
We show that off-the-shelf Stable Diffusion models can be repurposed for visual in-context learning (V-ICL). We formulate an in-place attention re-computation within the self-attention layers of the Stable Diffusion architecture. We show that this repurposed Stable Diffusion model is able to adapt to six different tasks.
arXiv Detail & Related papers (2025-08-13T17:08:22Z)
Prompt Tuning Vision Language Models with Margin Regularizer for Few-Shot Learning under Distribution Shifts [13.21626568246313]
We analyze whether vision-language foundation models can be adapted to target datasets with very different distributions and classes. We propose a novel prompt-tuning method, PromptMargin, for adapting such large-scale VLMs directly on the few target samples. PromptMargin effectively tunes the text as well as visual prompts for this task, and has two main modules.
arXiv Detail & Related papers (2025-05-21T13:26:56Z)
Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent [72.10987117380584]
Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. We find existing methods discard task-specific information that, while causing conflicts, is crucial for performance. Our approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.
arXiv Detail & Related papers (2025-01-02T12:45:21Z)
Precision matters: Precision-aware ensemble for weakly supervised semantic segmentation [14.931551206723041]
Weakly Supervised Semantic Segmentation (WSSS) employs weak supervision, such as image-level labels, to train the segmentation model. We propose ORANDNet, an advanced ensemble approach tailored for WSSS.
arXiv Detail & Related papers (2024-06-28T03:58:02Z)
Labeling Indoor Scenes with Fusion of Out-of-the-Box Perception Models [4.157013247909771]
We propose to leverage the recent advancements in state-of-the-art models for bottom-up segmentation (SAM), object detection (Detic), and semantic segmentation (MaskFormer). We aim to develop a cost-effective labeling approach to obtain pseudo-labels for semantic segmentation and object instance detection in indoor environments. We demonstrate the effectiveness of the proposed approach on the Active Vision dataset and the ADE20K dataset.
arXiv Detail & Related papers (2023-11-17T21:58:26Z)
Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage. We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets. By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
MSeg: A Composite Dataset for Multi-domain Semantic Segmentation [100.17755160696939]
We present MSeg, a composite dataset that unifies semantic segmentation datasets from different domains. We reconcile the taxonomies and bring the pixel-level annotations into alignment by relabeling more than 220,000 object masks in more than 80,000 images. A model trained on MSeg ranks first on the WildDash-v1 leaderboard for robust semantic segmentation, with no exposure to WildDash data during training.
arXiv Detail & Related papers (2021-12-27T16:16:35Z)
Multi-dataset Pretraining: A Unified Model for Semantic Segmentation [97.61605021985062]
We propose a unified framework, termed as Multi-Dataset Pretraining, to take full advantage of the fragmented annotations of different datasets. This is achieved by first pretraining the network via the proposed pixel-to-prototype contrastive loss over multiple datasets. In order to better model the relationship among images and classes from different datasets, we extend the pixel level embeddings via cross dataset mixing.
arXiv Detail & Related papers (2021-06-08T06:13:11Z)
Reviving Iterative Training with Mask Guidance for Interactive Segmentation [8.271859911016719]
Recent works on click-based interactive segmentation have demonstrated state-of-the-art results by using various inference-time optimization schemes. We propose a simple feedforward model for click-based interactive segmentation that employs the segmentation masks from previous steps. We find that the models trained on a combination of COCO and LVIS with diverse and high-quality annotations show performance superior to all existing models.
arXiv Detail & Related papers (2021-02-12T15:44:31Z)
Improving Semantic Segmentation via Decoupled Body and Edge Supervision [89.57847958016981]
Existing semantic segmentation approaches either aim to improve the object's inner consistency by modeling the global context, or refine object details along their boundaries by multi-scale feature fusion. In this paper, a new paradigm for semantic segmentation is proposed. Our insight is that appealing performance of semantic segmentation requires explicitly modeling the object body and edge, which correspond to the high and low frequency of the image. We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries.
arXiv Detail & Related papers (2020-07-20T12:11:22Z)
Learning Fast and Robust Target Models for Video Object Segmentation [83.3382606349118]
Video object segmentation (VOS) is a highly challenging problem since the initial mask, defining the target object, is only given at test-time. Most previous approaches fine-tune segmentation networks on the first frame, resulting in impractical frame-rates and risk of overfitting. We propose a novel VOS architecture consisting of two network components.
arXiv Detail & Related papers (2020-02-27T21:58:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.