UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity
- URL: http://arxiv.org/abs/2511.13714v1
- Date: Mon, 17 Nov 2025 18:58:34 GMT
- Title: UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity
- Authors: Junwei Yu, Trevor Darrell, XuDong Wang
- Abstract summary: We introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs. We show that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
- Score: 54.41309926099154
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Segment Anything Model (SAM) family has become a widely adopted vision foundation model, but its ability to control segmentation granularity remains limited. Users often need to refine results manually - by adding more prompts or selecting from pre-generated masks - to achieve the desired level of detail. This process can be ambiguous, as the same prompt may correspond to several plausible masks, and collecting dense annotations across all granularities is prohibitively expensive, making supervised solutions infeasible. To address this limitation, we introduce UnSAMv2, which enables segment anything at any granularity without human annotations. UnSAMv2 extends the divide-and-conquer strategy of UnSAM by discovering abundant mask-granularity pairs and introducing a novel granularity control embedding that enables precise, continuous control over segmentation scale. Remarkably, with only $6$K unlabeled images and $0.02\%$ additional parameters, UnSAMv2 substantially enhances SAM-2, achieving segment anything at any granularity across interactive, whole-image, and video segmentation tasks. Evaluated on over $11$ benchmarks, UnSAMv2 improves $\text{NoC}_{90}$ (5.69 $\rightarrow$ 4.75), 1-IoU (58.0 $\rightarrow$ 73.1), and $\text{AR}_{1000}$ (49.6 $\rightarrow$ 68.3), showing that small amounts of unlabeled data with a granularity-aware self-supervised learning method can unlock the potential of vision foundation models.
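The granularity control embedding is only described at a high level in the abstract; the sketch below illustrates one way a continuous granularity scalar could be injected as an extra prompt token for a frozen SAM-2-style mask decoder. All names here (`GranularityEmbedding`, the sinusoidal encoding, the token shapes) are assumptions for illustration, not the authors' implementation.

```python
import math
import torch
import torch.nn as nn


class GranularityEmbedding(nn.Module):
    """Map a scalar granularity g in [0, 1] to a prompt-token embedding.

    Hypothetical sketch: a sinusoidal encoding of g followed by a tiny MLP,
    so the added parameter count stays negligible next to a frozen SAM-2
    backbone (the paper reports ~0.02% extra parameters).
    """

    def __init__(self, embed_dim: int = 256, num_freqs: int = 16):
        super().__init__()
        self.num_freqs = num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, g: torch.Tensor) -> torch.Tensor:
        # g: (B,) granularity per prompt, e.g. 0 = coarsest, 1 = finest.
        freqs = 2.0 ** torch.arange(
            self.num_freqs, device=g.device, dtype=torch.float32) * math.pi
        ang = g[:, None] * freqs[None, :]                # (B, num_freqs)
        enc = torch.cat([ang.sin(), ang.cos()], dim=-1)  # (B, 2*num_freqs)
        return self.mlp(enc)[:, None, :]                 # (B, 1, embed_dim)


# Usage sketch: append the granularity token to the sparse prompt tokens
# (e.g. click embeddings) before they enter the frozen mask decoder.
g_embed = GranularityEmbedding(embed_dim=256)
granularity = torch.tensor([0.2, 0.8])        # two prompts at two scales
sparse_tokens = torch.randn(2, 3, 256)        # stand-in click tokens
tokens = torch.cat([sparse_tokens, g_embed(granularity)], dim=1)
print(tokens.shape)  # torch.Size([2, 4, 256])
```

Keeping the extra module this small is what makes the "0.02% additional parameters" figure plausible: the backbone and decoder stay frozen, and only the granularity token is learned.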
Related papers
- Learning Accurate Segmentation Purely from Self-Supervision [87.78965637247107]
Selfment is a fully self-supervised framework that segments objects directly from raw images without human labels. Selfment sets new state-of-the-art (SoTA) results across multiple benchmarks.
arXiv Detail & Related papers (2026-02-27T07:36:32Z) - SAM2S: Segment Anything in Surgical Videos via Semantic Long-term Tracking [15.279735515011817]
Surgical segmentation is crucial for computer-assisted surgery, enabling precise localization and tracking of instruments and tissues. Interactive Video Object Segmentation (iVOS) models such as Segment Anything Model 2 (SAM2) provide prompt-based flexibility beyond methods with predefined categories, but face challenges in surgical scenarios due to the domain gap and limited long-term tracking. We construct SA-SV, the largest surgical iVOS benchmark, with instance-level temporal annotations (masklets) spanning eight procedure types (61k frames, 1.6k masklets). We propose SAM2S, a foundation model enhancing SAM2 for surgical iVOS.
arXiv Detail & Related papers (2025-11-20T18:18:49Z) - Segment Anything without Supervision [65.93211374889196]
We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation.
UnSAM utilizes a divide-and-conquer strategy to "discover" the hierarchical structure of visual scenes.
We show that supervised SAM can also benefit from our self-supervised labels.
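As a minimal, self-contained sketch of the "conquer" half of such a divide-and-conquer pipeline, the snippet below assumes an initial over-segmentation and per-region features are already available: regions are greedily merged by cosine similarity, and sweeping the merge threshold yields masks at several granularities. Function and variable names are illustrative, not UnSAM's code.

```python
import numpy as np


def conquer_merge(region_masks: list[np.ndarray],
                  region_feats: np.ndarray,
                  thresholds=(0.9, 0.7, 0.5)) -> dict[float, list[np.ndarray]]:
    """Toy 'conquer' step: agglomerate fine regions into coarser masks.

    region_masks: list of (H, W) boolean masks from an initial over-segmentation
                  (the 'divide' step, e.g. an unsupervised cut method).
    region_feats: (N, D) L2-normalized feature per region (e.g. pooled ViT features).
    Regions whose cosine similarity exceeds each threshold are merged, so lower
    thresholds yield coarser masks. Illustration of the idea only.
    """
    hierarchy = {}
    for tau in thresholds:
        masks = [m.copy() for m in region_masks]
        feats = region_feats.copy()
        keep_merging = True
        while keep_merging and len(masks) > 1:
            sims = feats @ feats.T
            np.fill_diagonal(sims, -1.0)
            i, j = np.unravel_index(np.argmax(sims), sims.shape)
            if sims[i, j] < tau:
                keep_merging = False
                continue
            # Merge region j into region i and average their features.
            masks[i] = masks[i] | masks[j]
            feats[i] = feats[i] + feats[j]
            feats[i] /= np.linalg.norm(feats[i]) + 1e-8
            del masks[j]
            feats = np.delete(feats, j, axis=0)
        hierarchy[tau] = masks
    return hierarchy
```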
arXiv Detail & Related papers (2024-06-28T17:47:32Z) - Moving Object Segmentation: All You Need Is SAM (and Flow) [82.78026782967959]
We investigate two models for combining SAM with optical flow that harness the segmentation power of SAM with the ability of flow to discover and group moving objects.
In the first model, we adapt SAM to take optical flow, rather than RGB, as an input. In the second, SAM takes RGB as an input, and flow is used as a segmentation prompt.
These surprisingly simple methods, without any further modifications, outperform all previous approaches by a considerable margin in both single and multi-object benchmarks.
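As an illustration of the second variant, the snippet below turns a dense flow field into a single foreground point prompt by clicking the pixel of strongest motion; the `predictor.predict` call in the usage comment refers to the public `segment-anything` interface and is shown only as an assumed example, not the paper's exact procedure.

```python
import numpy as np


def flow_to_point_prompt(flow: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Turn a dense optical-flow field into a single SAM point prompt.

    flow: (H, W, 2) array of per-pixel (dx, dy) displacements.
    Returns (point_coords, point_labels) in the (x, y) convention SAM expects.
    Illustrative heuristic: click wherever motion is strongest.
    """
    magnitude = np.linalg.norm(flow, axis=-1)              # (H, W) motion strength
    y, x = np.unravel_index(np.argmax(magnitude), magnitude.shape)
    point_coords = np.array([[x, y]], dtype=np.float32)
    point_labels = np.array([1], dtype=np.int64)           # 1 = foreground click
    return point_coords, point_labels


# Usage sketch with a SAM predictor (interface assumed from the
# `segment-anything` package): the moving region becomes the prompt.
# predictor.set_image(rgb_frame)
# coords, labels = flow_to_point_prompt(flow_field)
# masks, scores, _ = predictor.predict(point_coords=coords, point_labels=labels)
```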
arXiv Detail & Related papers (2024-04-18T17:59:53Z) - WSI-SAM: Multi-resolution Segment Anything Model (SAM) for histopathology whole-slide images [8.179859593451285]
We present WSI-SAM, enhancing Segment Anything Model (SAM) with precise object segmentation capabilities for histopathology images.
To fully exploit pretrained knowledge while minimizing training overhead, we keep SAM frozen, introducing only minimal extra parameters.
Our model outperforms SAM by 4.1 and 2.5 percentage points on a ductal carcinoma in situ (DCIS) segmentation task and a breast cancer metastasis segmentation task, respectively.
arXiv Detail & Related papers (2024-03-14T10:30:43Z) - TinySAM: Pushing the Envelope for Efficient Segment Anything Model [73.06322749886483]
We propose a framework to obtain a tiny segment anything model (TinySAM) while maintaining the strong zero-shot performance. With all these proposed methods, our TinySAM leads to orders of magnitude computational reduction and pushes the envelope for the efficient segment anything task.
arXiv Detail & Related papers (2023-12-21T12:26:11Z) - $\mathrm{SAM^{Med}}$: A medical image annotation framework based on large vision model [23.095778923771732]
The large vision model Segment Anything Model (SAM) has revolutionized the computer vision field.
In this study, we present $\mathrm{SAM^{Med}}$, an enhanced framework for medical image annotation.
Results show a significant improvement in segmentation accuracy with only approximately 5 input points.
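To make the "few input points" protocol concrete, here is a hedged sketch of a standard click-simulation loop (not necessarily $\mathrm{SAM^{Med}}$'s exact procedure): each round places a corrective click at the centroid of the current error pixels and re-queries a promptable model through an assumed `predict_fn` interface.

```python
import numpy as np


def simulate_clicks(predict_fn, gt_mask: np.ndarray, max_points: int = 5):
    """Simulate an annotator refining a mask with a handful of clicks.

    predict_fn(coords, labels) -> (H, W) boolean mask; stands in for any
    promptable model such as SAM (this interface is an assumption).
    gt_mask: (H, W) boolean ground-truth mask.
    Each round places a click at the centroid of the dominant error type:
    positive where the object was missed, negative on false positives.
    Returns the clicks used and the final prediction.
    """
    coords, labels = [], []
    pred = np.zeros_like(gt_mask, dtype=bool)
    for _ in range(max_points):
        false_neg = gt_mask & ~pred
        false_pos = pred & ~gt_mask
        # Click in whichever error type currently dominates.
        if false_neg.sum() >= false_pos.sum():
            err, label = false_neg, 1
        else:
            err, label = false_pos, 0
        if err.sum() == 0:
            break  # prediction already matches the ground truth
        ys, xs = np.nonzero(err)
        coords.append([int(xs.mean()), int(ys.mean())])  # (x, y) click
        labels.append(label)
        pred = predict_fn(np.array(coords), np.array(labels))
    return coords, labels, pred
```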
arXiv Detail & Related papers (2023-07-11T03:00:22Z) - Semantic-SAM: Segment and Recognize Anything at Any Granularity [83.64686655044765]
We introduce Semantic-SAM, a universal image segmentation model to enable segment and recognize anything at any desired granularity.
We consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts.
For the multi-granularity capability, we propose a multi-choice learning scheme during training, enabling each click to generate masks at multiple levels.
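Semantic-SAM's full recipe uses many-to-many matching between predicted and ground-truth masks; the sketch below shows only the simpler ambiguity-aware core of multi-choice learning, where each ground-truth mask supervises its best-fitting candidate. Tensor names and shapes are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F


def multi_choice_mask_loss(pred_masks: torch.Tensor,
                           gt_masks: torch.Tensor) -> torch.Tensor:
    """Ambiguity-aware loss for one click with several candidate masks.

    pred_masks: (K, H, W) logits, K candidate granularity levels per click.
    gt_masks:   (G, H, W) binary ground-truth masks valid for this click
                (e.g. part, object, and whole at different granularities).
    Every ground-truth mask back-propagates only through its best-fitting
    candidate, so different candidates can specialize to different levels.
    Illustrative only; the paper also decouples classification for parts.
    """
    K, H, W = pred_masks.shape
    G = gt_masks.shape[0]
    # Pairwise BCE between every candidate and every ground-truth mask.
    p = pred_masks[:, None].expand(K, G, H, W)
    t = gt_masks[None, :].expand(K, G, H, W).float()
    pairwise = F.binary_cross_entropy_with_logits(
        p, t, reduction="none").mean(dim=(-2, -1))   # (K, G)
    # Each ground-truth mask picks its best candidate (min over K).
    return pairwise.min(dim=0).values.mean()
```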
arXiv Detail & Related papers (2023-07-10T17:59:40Z)