Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization
- URL: http://arxiv.org/abs/2505.04905v1
- Date: Thu, 08 May 2025 02:44:53 GMT
- Title: Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization
- Authors: Xi Yang, Songsong Duan, Nannan Wang, Xinbo Gao
- Abstract summary: We propose an innovative mask prompt to SAM (Pro2SAM) network with grid points for the WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt. Second, we deliver grid points as dense prompts into SAM to maximize the probability of recovering the foreground mask.
- Score: 54.91271106816616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weakly Supervised Object Localization (WSOL), which aims to localize objects using only image-level labels, has attracted much attention because of its low annotation cost in real applications. Current studies focus on the Class Activation Map (CAM) of CNNs and the self-attention map of transformers to identify object regions. However, neither CAMs nor self-attention maps capture pixel-level fine-grained information about the foreground objects, which hinders further advances in WSOL. To address this problem, we leverage the zero-shot generalization and fine-grained segmentation capabilities of the Segment Anything Model (SAM) to boost the activation of integral object regions. Further, to alleviate the semantic ambiguity that arises when SAM is driven by a single point prompt, we propose an innovative mask prompt to SAM (Pro2SAM) network with grid points for the WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt, where the GTFormer jointly embeds patch tokens and novel global tokens to learn foreground semantics. Second, we deliver grid points as dense prompts into SAM to maximize the probability of covering the foreground mask, avoiding the missed objects that a single point/box prompt can cause. Finally, we propose a pixel-level similarity metric to match the mask prompt against SAM's candidate masks, where the mask with the highest score is taken as the final localization map. Experiments show that the proposed Pro2SAM achieves state-of-the-art performance on both CUB-200-2011 and ILSVRC, with 84.03% and 66.85% Top-1 Loc, respectively.
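The pipeline above has three steps: GTFormer produces a coarse foreground map used as a mask prompt, SAM is queried with a dense grid of point prompts, and a pixel-level similarity metric picks the SAM mask that best matches the prompt. Below is a minimal sketch of the last two steps using the public `segment_anything` API; the precomputed `prompt_map` (standing in for GTFormer's output) and the soft-IoU similarity are assumptions, since the abstract does not specify the exact metric.

```python
# Minimal sketch: dense grid-point prompting of SAM plus mask matching against
# a coarse foreground map. `prompt_map` stands in for GTFormer's output here.
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator


def grid_prompt_masks(image: np.ndarray, checkpoint: str = "sam_vit_h_4b8939.pth"):
    """Run SAM over a dense 32x32 grid of point prompts (image: HxWx3 uint8 RGB)."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    generator = SamAutomaticMaskGenerator(sam, points_per_side=32)
    return generator.generate(image)  # list of dicts, each with a 'segmentation' mask


def soft_iou(candidate: np.ndarray, prompt_map: np.ndarray) -> float:
    """Pixel-level similarity between a binary SAM mask and a [0, 1] foreground map.
    A plausible stand-in; the paper's exact metric is not given in the abstract."""
    cand = candidate.astype(np.float32)
    inter = float((cand * prompt_map).sum())
    union = float(cand.sum() + prompt_map.sum() - inter)
    return inter / (union + 1e-6)


def localization_map(image: np.ndarray, prompt_map: np.ndarray) -> np.ndarray:
    """Return the SAM candidate mask that best matches the mask prompt."""
    masks = grid_prompt_masks(image)
    scores = [soft_iou(m["segmentation"], prompt_map) for m in masks]
    return masks[int(np.argmax(scores))]["segmentation"]
```

Because every cell of the grid receives a point prompt, each object in the image is covered by at least one candidate mask, which is how dense prompting sidesteps the single-point semantic ambiguity the abstract describes.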
Related papers
- Auto-Prompting SAM for Weakly Supervised Landslide Extraction [17.515220489213743]
We propose a simple yet effective method by auto-prompting the Segment Anything Model (SAM). Instead of depending on high-quality class activation maps (CAMs) for pseudo-labeling or fine-tuning SAM, our method directly yields fine-grained segmentation masks from SAM inference through prompt engineering. Experimental results on high-resolution aerial and satellite datasets demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2025-01-23T07:08:48Z)
- Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend SAM to Few-shot Semantic Segmentation (FSS).
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z)
- MaskInversion: Localized Embeddings via Optimization of Explainability Maps [49.50785637749757]
MaskInversion generates a context-aware embedding for a query image region specified by a mask at test time.
It can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation.
arXiv Detail & Related papers (2024-07-29T14:21:07Z)
- Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks [9.113769643415868]
We introduce Mask2Map, a novel end-to-end online HD map construction method designed for autonomous driving applications. Our approach focuses on predicting the class and ordered point set of map instances within a scene. Mask2Map achieves remarkable performance improvements over previous state-of-the-art methods.
arXiv Detail & Related papers (2024-07-18T13:48:52Z)
- MaskSAM: Towards Auto-prompt SAM with Mask Classification for Volumetric Medical Image Segmentation [17.25946659884426]
We propose MaskSAM, a mask classification prompt-free framework for medical image segmentation. Our method achieves state-of-the-art performance on AMOS2022 with 90.52% Dice, a 2.7% improvement over nnUNet.
arXiv Detail & Related papers (2024-03-21T03:28:24Z)
- PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
- Repurposing SAM for User-Defined Semantics Aware Segmentation [23.88643687043431]
We propose U-SAM, a novel framework that instills semantic awareness into SAM. U-SAM provides pixel-level semantic annotations for images without requiring any labeled/unlabeled samples from the test data distribution. We evaluate U-SAM on PASCAL VOC 2012 and MSCOCO-80, achieving significant mIoU improvements of +17.95% and +5.20%, respectively.
arXiv Detail & Related papers (2023-12-05T01:37:18Z)
- Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels.
A new paradigm has emerged that generates a foreground prediction map to achieve pixel-level localization.
This paper presents two astonishing experimental observations on the object localization learning process.
arXiv Detail & Related papers (2023-09-22T15:44:10Z)
- Segment Anything Meets Point Tracking [116.44931239508578]
This paper presents a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking.
We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark.
Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions.
arXiv Detail & Related papers (2023-07-03T17:58:01Z)
- Spatial-Aware Token for Weakly Supervised Object Localization [137.0570026552845]
We propose a task-specific spatial-aware token to condition localization in a weakly supervised manner.
Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc, respectively (see the metric sketch after this list).
arXiv Detail & Related papers (2023-03-18T15:38:17Z)
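For reference, the Top-1 Loc numbers in the Pro2SAM abstract and the GT-known Loc numbers in the SAT entry above follow the standard WSOL protocol: GT-known Loc counts an image as correct when the predicted box overlaps the ground-truth box with IoU >= 0.5, and Top-1 Loc additionally requires the top-1 class prediction to be correct. A minimal sketch (the 0.5 threshold is the conventional choice, not taken from either paper):

```python
# Minimal sketch of the standard WSOL evaluation metrics (GT-known Loc, Top-1 Loc).
import numpy as np


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)


def wsol_metrics(pred_boxes, gt_boxes, top1_correct, iou_thr=0.5):
    """GT-known Loc: IoU >= thr; Top-1 Loc: IoU >= thr and correct top-1 class."""
    hits = np.array([box_iou(p, g) >= iou_thr for p, g in zip(pred_boxes, gt_boxes)])
    gt_known = hits.mean()
    top1 = (hits & np.asarray(top1_correct, dtype=bool)).mean()
    return float(gt_known), float(top1)
```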
This list is automatically generated from the titles and abstracts of the papers on this site.