The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation
- URL: http://arxiv.org/abs/2512.06032v1
- Date: Thu, 04 Dec 2025 16:27:18 GMT
- Title: The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation
- Authors: Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee
- Abstract summary: This paper investigates the fundamental discontinuity between the latest two Segment Anything Models: SAM2 and SAM3. We explain why the expertise in prompt-based segmentation of SAM2 does not transfer to the multimodal concept-driven paradigm of SAM3.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the fundamental discontinuity between the two latest Segment Anything Models: SAM2 and SAM3. We explain why expertise in the prompt-based segmentation of SAM2 does not transfer to the multimodal, concept-driven paradigm of SAM3. SAM2 operates through spatial prompts (points, boxes, and masks), yielding purely geometric and temporal segmentation. In contrast, SAM3 introduces a unified vision-language architecture capable of open-vocabulary reasoning, semantic grounding, contrastive alignment, and exemplar-based concept understanding. We structure this analysis through five core components: (1) the Conceptual Break Between Prompt-Based and Concept-Based Segmentation, contrasting the spatial prompt semantics of SAM2 with the multimodal fusion and text-conditioned mask generation of SAM3; (2) Architectural Divergence, detailing the pure vision-temporal design of SAM2 versus SAM3's integration of vision-language encoders, geometry and exemplar encoders, fusion modules, DETR-style decoders, object queries, and ambiguity handling via Mixture-of-Experts; (3) Dataset and Annotation Differences, contrasting the SA-V video masks of SAM2 with the multimodal concept-annotated corpora of SAM3; (4) Training and Hyperparameter Distinctions, showing why SAM2 optimization knowledge does not apply to SAM3; and (5) Evaluation, Metrics, and Failure Modes, outlining the transition from geometric IoU metrics to semantic, open-vocabulary evaluation. Together, these analyses establish SAM3 as a new class of segmentation foundation model and chart future directions for the emerging concept-driven segmentation era.
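To make the paradigm gap concrete, the sketch below contrasts the spatial-prompt interface of SAM2 (real calls from the open-source `sam2` package, assuming it and a checkpoint are installed) with a SAM3-style concept call. The concept-side function `segment_concept`, its name, and its signature are illustrative assumptions, not the released SAM3 API.

```python
# Contrast sketch: SAM2 spatial prompting vs. a SAM3-style concept prompt.
# The SAM2 calls are the real API of facebook's open-source `sam2` package
# (assumes it and a checkpoint are available); the SAM3 side is a
# hypothetical stand-in, NOT the released SAM3 API.
import numpy as np
from sam2.sam2_image_predictor import SAM2ImagePredictor

image = np.zeros((480, 640, 3), dtype=np.uint8)  # placeholder RGB frame

# --- SAM2: geometric, class-agnostic prompting ---
predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
predictor.set_image(image)
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),  # one foreground click (x, y)
    point_labels=np.array([1]),           # 1 = positive point
)
# SAM2 returns mask candidates for THIS click; it has no notion of a class.

# --- SAM3-style concept prompting (hypothetical interface, for contrast) ---
def segment_concept(image: np.ndarray, text: str):
    """Illustrative stub: a concept prompt ('yellow school bus') should yield
    masks and identities for ALL matching instances, not one click's object."""
    # A real SAM3 call would run text/exemplar encoders, fusion modules, and
    # a DETR-style decoder; this stub just returns an empty instance set.
    return np.zeros((0, *image.shape[:2]), dtype=bool), np.zeros(0, dtype=int)

concept_masks, instance_ids = segment_concept(image, text="yellow school bus")
```

The structural point is that SAM2's output is conditioned on one geometric click and is class-agnostic, whereas a concept prompt must return a variable-sized set of instances; this is why SAM3 needs object queries and a DETR-style decoder rather than only a prompt encoder.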
Related papers
- SAM3-I: Segment Anything with Instructions [86.92593395772029]
We present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. We design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs.
arXiv Detail & Related papers (2025-12-04T09:00:25Z)
- Evaluating SAM2 for Video Semantic Segmentation [60.157605818225186]
The Segment Anything Model 2 (SAM2) has proven to be a powerful foundation model for promptable visual object segmentation in both images and videos. This paper explores the extension of SAM2 to dense Video Semantic Segmentation (VSS). Our experiments suggest that leveraging SAM2 enhances overall performance in VSS, primarily due to its precise predictions of object boundaries; a toy sketch of this boundary effect follows this entry.
arXiv Detail & Related papers (2025-12-01T15:15:16Z)
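As a rough illustration of why boundary-precise masks can lift dense semantic predictions, here is a minimal sketch, assuming one refines a coarse label map by majority vote inside each SAM2-style mask; this is a common post-hoc trick, not the evaluation protocol of the paper above.

```python
# Toy illustration (not the paper's method): a coarse per-pixel class map is
# "snapped" to boundary-accurate masks (e.g., SAM2 outputs) by majority vote.
import numpy as np

def snap_to_masks(class_map: np.ndarray, masks: np.ndarray) -> np.ndarray:
    """class_map: (H, W) int labels from a coarse semantic model.
    masks: (N, H, W) bool masks with crisp boundaries.
    Returns a refined (H, W) label map aligned to the mask boundaries."""
    refined = class_map.copy()
    for m in masks:
        if m.any():
            labels, counts = np.unique(class_map[m], return_counts=True)
            refined[m] = labels[np.argmax(counts)]  # majority class wins
    return refined

# Tiny smoke test: a blurry 2-class map refined by one crisp mask.
coarse = np.array([[0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1]])
mask = np.zeros((1, 3, 4), dtype=bool)
mask[0, :, 2:] = True  # the "true" object occupies the right half
print(snap_to_masks(coarse, mask))
```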
- SAM 3: Segment Anything with Concepts [93.97262932669081]
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts. Promptable Concept Segmentation (PCS) takes such prompts and returns segmentation masks and identities for all matching object instances. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone.
arXiv Detail & Related papers (2025-11-20T18:59:56Z)
- MemorySAM: Memorize Modalities and Semantics with Segment Anything Model 2 for Multi-modal Semantic Segmentation [22.482211353379927]
The large vision model Segment Anything Model 2 (SAM2) has shown strong zero-shot segmentation performance on both images and videos. Inspired by cross-frame correlation in videos, we propose to treat multi-modal data as a sequence of frames representing the same scene. Our key idea is to "memorize" the modality-agnostic information and "memorize" the semantics related to the targeted scene.
arXiv Detail & Related papers (2025-03-09T17:33:15Z)
- Inspiring the Next Generation of Segment Anything Models: Comprehensively Evaluate SAM and SAM 2 with Diverse Prompts Towards Context-Dependent Concepts under Different Scenes [97.8612925017964]
Large-scale foundation models trained on billions of image-mask pairs cover a vast diversity of scenes, objects, and contexts. SAM and its upgraded version, SAM2, have significantly influenced multiple fields within computer vision. We conduct a thorough evaluation of SAMs on 11 context-dependent (CD) concepts across 2D and 3D images and videos in various visual modalities within natural, medical, and industrial scenes.
arXiv Detail & Related papers (2024-12-02T08:03:56Z)
- SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation [88.80792308991867]
The Segment Anything Model (SAM) has shown the ability to group image pixels into patches, but applying it to semantic-aware segmentation still faces major challenges. This paper presents SAM-CP, a simple approach that establishes two types of composable prompts beyond SAM and composes them for versatile segmentation. Experiments show that SAM-CP achieves semantic, instance, and panoptic segmentation in both open and closed domains.
arXiv Detail & Related papers (2024-07-23T17:47:25Z)
- EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model [39.967352995143855]
We introduce the Early Vision-Language Fusion-based SAM (EVF-SAM), a simple yet effective referring segmentation method that exploits multimodal prompts (i.e., image and text). Experiments show that the proposed EVF-SAM, based on BEIT-3, obtains state-of-the-art performance on RefCOCO/+/g for referring expression segmentation; a minimal early-fusion sketch follows this entry.
arXiv Detail & Related papers (2024-06-28T17:38:18Z)
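As a rough illustration of "early fusion" in the EVF-SAM sense, the PyTorch sketch below concatenates text and image tokens before a shared transformer encoder, so the modalities interact at every layer rather than only at the decoder. All module names and dimensions here are assumptions for illustration, not EVF-SAM's actual implementation (which builds on BEIT-3).

```python
# Minimal early-fusion sketch (illustrative; not EVF-SAM's actual code):
# text and image tokens are concatenated BEFORE a shared encoder, so
# attention mixes the modalities at every layer instead of only at the end.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, dim: int = 256, layers: int = 4, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.type_embed = nn.Embedding(2, dim)  # 0 = image token, 1 = text token

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # img_tokens: (B, N_img, dim), txt_tokens: (B, N_txt, dim)
        img = img_tokens + self.type_embed.weight[0]
        txt = txt_tokens + self.type_embed.weight[1]
        fused = self.encoder(torch.cat([img, txt], dim=1))
        # The fused text summary could then condition a mask decoder
        # (e.g., as a prompt embedding for a SAM-style decoder).
        return fused[:, : img_tokens.shape[1]], fused[:, img_tokens.shape[1]:]

enc = EarlyFusionEncoder()
img_feats, txt_feats = enc(torch.randn(1, 196, 256), torch.randn(1, 12, 256))
```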
- Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively [69.97238935096094]
The Open-Vocabulary SAM is a SAM-inspired model designed for simultaneous interactive segmentation and recognition.
Our method can segment and recognize approximately 22,000 classes.
arXiv Detail & Related papers (2024-01-05T18:59:22Z)