Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization
- URL: http://arxiv.org/abs/2505.04905v1
- Date: Thu, 08 May 2025 02:44:53 GMT
- Title: Pro2SAM: Mask Prompt to SAM with Grid Points for Weakly Supervised Object Localization
- Authors: Xi Yang, Songsong Duan, Nannan Wang, Xinbo Gao,
- Abstract summary: We propose an innovative mask prompt to SAM (Pro2SAM) network with grid points for WSOL task.<n>First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt.<n> Secondly, we deliver grid points as dense prompts into SAM to maximize the probability of foreground mask.
- Score: 54.91271106816616
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weakly Supervised Object Localization (WSOL), which aims to localize objects by only using image-level labels, has attracted much attention because of its low annotation cost in real applications. Current studies focus on the Class Activation Map (CAM) of CNN and the self-attention map of transformer to identify the region of objects. However, both CAM and self-attention maps can not learn pixel-level fine-grained information on the foreground objects, which hinders the further advance of WSOL. To address this problem, we initiatively leverage the capability of zero-shot generalization and fine-grained segmentation in Segment Anything Model (SAM) to boost the activation of integral object regions. Further, to alleviate the semantic ambiguity issue accrued in single point prompt-based SAM, we propose an innovative mask prompt to SAM (Pro2SAM) network with grid points for WSOL task. First, we devise a Global Token Transformer (GTFormer) to generate a coarse-grained foreground map as a flexible mask prompt, where the GTFormer jointly embeds patch tokens and novel global tokens to learn foreground semantics. Secondly, we deliver grid points as dense prompts into SAM to maximize the probability of foreground mask, which avoids the lack of objects caused by a single point/box prompt. Finally, we propose a pixel-level similarity metric to come true the mask matching from mask prompt to SAM, where the mask with the highest score is viewed as the final localization map. Experiments show that the proposed Pro2SAM achieves state-of-the-art performance on both CUB-200-2011 and ILSVRC, with 84.03\% and 66.85\% Top-1 Loc, respectively.
Related papers
- SAMTok: Representing Any Mask with Two Words [70.74140779649856]
We present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens.<n>By treating masks as new language tokens, SAMTok enables base MLLMs to learn pixel-wise capabilities.<n>QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation.
arXiv Detail & Related papers (2026-01-22T16:44:09Z) - Evaluating SAM2 for Video Semantic Segmentation [60.157605818225186]
The Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos.<n>This paper explores the extension of SAM2 to dense Video Semantic (VSS)<n>Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries.
arXiv Detail & Related papers (2025-12-01T15:15:16Z) - GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation [81.0871900167463]
We introduce GeoSAM2, a prompt-controllable framework for 3D part segmentation.<n>Given a textureless object, we render normal and point maps from predefined viewpoints.<n>We accept simple 2D prompts - clicks or boxes - to guide part selection.<n>The predicted masks are back-projected to the object and aggregated across views.
arXiv Detail & Related papers (2025-08-19T17:58:51Z) - Auto-Prompting SAM for Weakly Supervised Landslide Extraction [17.515220489213743]
We propose a simple yet effective method by auto-prompting the Segment Anything Model (SAM)<n>Instead of depending on high-quality class activation maps (CAMs) for pseudo-labeling or fine-tuning SAM, our method directly yields fine-grained segmentation masks from SAM inference through prompt engineering.<n> Experimental results on high-resolution aerial and satellite datasets demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2025-01-23T07:08:48Z) - Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend the SAM to Few-shot Semantic segmentation (FSS)
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z) - MaskInversion: Localized Embeddings via Optimization of Explainability Maps [49.50785637749757]
MaskInversion generates a context-aware embedding for a query image region specified by a mask at test time.
It can be used for a broad range of tasks, including open-vocabulary class retrieval, referring expression comprehension, as well as for localized captioning and image generation.
arXiv Detail & Related papers (2024-07-29T14:21:07Z) - Mask2Map: Vectorized HD Map Construction Using Bird's Eye View Segmentation Masks [9.113769643415868]
We introduce Mask2Map, a novel end-to-end online HD map construction method designed for autonomous driving applications.<n>Our approach focuses on predicting the class and ordered point set of map instances within a scene.<n>Mask2Map achieves remarkable performance improvements over previous state-of-the-art methods.
arXiv Detail & Related papers (2024-07-18T13:48:52Z) - MaskSAM: Towards Auto-prompt SAM with Mask Classification for Volumetric Medical Image Segmentation [17.25946659884426]
We propose MaskSAM, a mask classification prompt-free framework for medical image segmentation.<n>Our method achieves state-of-the-art performance on AMOS2022, 90.52% Dice, which improved by 2.7% compared to nnUNet.
arXiv Detail & Related papers (2024-03-21T03:28:24Z) - PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z) - Repurposing SAM for User-Defined Semantics Aware Segmentation [23.88643687043431]
We propose U-SAM, a novel framework that imbibes semantic awareness into SAM.<n>U-SAM provides pixel-level semantic annotations for images without requiring any labeled/unlabeled samples from the test data distribution.<n>We evaluate U-SAM on PASCAL VOC 2012 and MSCOCO-80, achieving significant mIoU improvements of +17.95% and +520%, respectively.
arXiv Detail & Related papers (2023-12-05T01:37:18Z) - Background Activation Suppression for Weakly Supervised Object
Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels.
New paradigm has emerged by generating a foreground prediction map to achieve pixel-level localization.
This paper presents two astonishing experimental observations on the object localization learning process.
arXiv Detail & Related papers (2023-09-22T15:44:10Z) - Segment Anything Meets Point Tracking [116.44931239508578]
This paper presents a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking.
We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark.
Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions.
arXiv Detail & Related papers (2023-07-03T17:58:01Z) - Spatial-Aware Token for Weakly Supervised Object Localization [137.0570026552845]
We propose a task-specific spatial-aware token to condition localization in a weakly supervised manner.
Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc.
arXiv Detail & Related papers (2023-03-18T15:38:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.