SAM3-I: Segment Anything with Instructions
- URL: http://arxiv.org/abs/2512.04585v1
- Date: Thu, 04 Dec 2025 09:00:25 GMT
- Title: SAM3-I: Segment Anything with Instructions
- Authors: Jingjing Li, Yue Feng, Yuchen Guo, Jincai Huang, Yongri Piao, Qi Bi, Miao Zhang, Xiaoqi Zhao, Qiang Chen, Shihao Zou, Wei Ji, Huchuan Lu, Li Cheng
- Abstract summary: We present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. We design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs.
- Score: 86.92593395772029
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Segment Anything Model 3 (SAM3) has advanced open-vocabulary segmentation through promptable concept segmentation, allowing users to segment all instances corresponding to a given concept, typically specified with short noun-phrase (NP) prompts. While this marks the first integration of language-level concepts within the SAM family, real-world usage typically requires far richer expressions that include attributes, spatial relations, functionalities, actions, states, and even implicit reasoning over instances. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and then conduct iterative mask filtering. However, these NP-level concepts remain overly coarse, often failing to precisely represent a specific instance. In this work, we present SAM3-I, an enhanced framework that unifies concept-level understanding and instruction-level reasoning within the SAM family. SAM3-I introduces an instruction-aware cascaded adaptation mechanism that progressively aligns expressive instruction semantics with SAM3's existing vision-language representations, enabling direct instruction-following segmentation without sacrificing its original concept-driven capabilities. Furthermore, we design a structured instruction taxonomy spanning concept, simple, and complex levels, and develop a scalable data engine to construct a dataset with diverse instruction-mask pairs. Experiments show that SAM3-I delivers appealing performance, demonstrating that SAM3 can be effectively extended to follow natural-language instructions while preserving its strong concept grounding. We open-source SAM3-I and provide practical fine-tuning workflows, enabling researchers to adapt it to domain-specific applications. The source code is available here.
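The three-level taxonomy lends itself to a concrete illustration. Below is a minimal Python sketch of what instruction-mask pairs at the concept, simple, and complex levels might look like; the `InstructionLevel` enum, the example instructions, and the file paths are hypothetical, not drawn from the released SAM3-I dataset or API.

```python
from dataclasses import dataclass
from enum import Enum

class InstructionLevel(Enum):
    """Three-level taxonomy from the paper: concept, simple, complex."""
    CONCEPT = "concept"   # short noun phrase, as in SAM3's original NP prompts
    SIMPLE = "simple"     # adds attributes, spatial relations, states
    COMPLEX = "complex"   # requires implicit reasoning over instances

@dataclass
class InstructionSample:
    """One instruction-mask training pair (mask path is a placeholder)."""
    level: InstructionLevel
    instruction: str
    mask_path: str

# Hypothetical examples of each taxonomy level (not from the paper's dataset).
samples = [
    InstructionSample(InstructionLevel.CONCEPT, "dog", "masks/0001.png"),
    InstructionSample(InstructionLevel.SIMPLE,
                      "the brown dog lying to the left of the sofa", "masks/0002.png"),
    InstructionSample(InstructionLevel.COMPLEX,
                      "the animal most likely to chase the ball", "masks/0003.png"),
]

for s in samples:
    print(f"[{s.level.value}] {s.instruction!r} -> {s.mask_path}")
```

Under this reading, SAM3's original NP prompts live at the concept level, while the instruction-aware cascaded adaptation is what lets the simple and complex levels be consumed directly, rather than first being rewritten into NPs by an external multi-modal agent.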
Related papers
- The SAM2-to-SAM3 Gap in the Segment Anything Model Family: Why Prompt-Based Expertise Fails in Concept-Driven Image Segmentation [3.7414278978078204]
This paper investigates the fundamental discontinuity between the two latest Segment Anything Models: SAM2 and SAM3. We explain why SAM2's prompt-based segmentation expertise does not transfer to SAM3's multimodal, concept-driven paradigm.
arXiv Detail & Related papers (2025-12-04T16:27:18Z)
- SAM 3: Segment Anything with Concepts [93.97262932669081]
We present Segment Anything Model (SAM) 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts. Promptable concept segmentation (PCS) takes such prompts and returns segmentation masks and identities for all matching object instances. Our model consists of an image-level detector and a memory-based video tracker that share a single backbone.
arXiv Detail & Related papers (2025-11-20T18:59:56Z)
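The PCS interface described above reduces to a simple contract: one concept prompt in, one (mask, identity) pair out per matching instance. The stub below sketches that contract; `PCSResult` and `segment_concept` are illustrative names, not the released SAM 3 API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class PCSResult:
    """Hypothetical container for one matched instance: a binary mask plus a
    persistent identity, so the same object keeps its id across video frames."""
    instance_id: int
    mask: np.ndarray  # (H, W) boolean mask

def segment_concept(frame: np.ndarray, concept: str) -> list[PCSResult]:
    """Stub with the shape of a PCS call: a concept prompt in, one
    (mask, identity) pair out per matching instance."""
    h, w = frame.shape[:2]
    return [PCSResult(instance_id=0, mask=np.zeros((h, w), dtype=bool))]

results = segment_concept(np.zeros((480, 640, 3), dtype=np.uint8), "dog")
print(len(results), results[0].instance_id, results[0].mask.shape)
```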
- SAM-PTx: Text-Guided Fine-Tuning of SAM with Parameter-Efficient, Parallel-Text Adapters [0.5755004576310334]
This paper introduces SAM-PTx, a parameter-efficient approach for adapting SAM using frozen CLIP-derived text embeddings as class-level semantic guidance. Specifically, we propose a lightweight adapter called Parallel-Text that injects text embeddings into SAM's image encoder, enabling semantics-guided segmentation. We show that incorporating fixed text embeddings as input improves segmentation performance over purely spatial prompt baselines.
arXiv Detail & Related papers (2025-07-31T23:26:39Z)
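A minimal sketch of the Parallel-Text idea from SAM-PTx above, assuming the adapter is a low-rank residual branch that adds a projected frozen text embedding to SAM's image tokens; the dimensions, the fusion point, and the module name are assumptions rather than the paper's released code.

```python
import torch
import torch.nn as nn

class ParallelTextAdapter(nn.Module):
    """Sketch of a lightweight adapter that injects a frozen class-level text
    embedding into SAM's image tokens via a low-rank residual branch."""
    def __init__(self, img_dim=768, text_dim=512, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(img_dim, bottleneck)        # compress image tokens
        self.text_proj = nn.Linear(text_dim, bottleneck)  # project CLIP text embedding
        self.up = nn.Linear(bottleneck, img_dim)          # expand back to token dim
        self.act = nn.GELU()

    def forward(self, img_tokens: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # img_tokens: (B, N, img_dim); text_emb: (B, text_dim), kept frozen upstream
        fused = self.down(img_tokens) + self.text_proj(text_emb).unsqueeze(1)
        return img_tokens + self.up(self.act(fused))  # residual, so SAM stays intact

adapter = ParallelTextAdapter()
tokens, text = torch.randn(2, 196, 768), torch.randn(2, 512)
print(adapter(tokens, text).shape)  # torch.Size([2, 196, 768])
```

The residual form means that with the adapter zero-initialized, the frozen SAM encoder's behavior is unchanged, which is the usual motivation for parameter-efficient designs of this kind.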
- OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts [20.327695503392274]
We present OpenWorldSAM, a framework that extends the prompt-driven Segment Anything Model v2 (SAM2) to open-vocabulary scenarios. OpenWorldSAM supports a diverse range of prompts, including category-level and sentence-level language descriptions. We enhance the model's spatial understanding through novel positional tie-breaker embeddings and cross-attention layers.
arXiv Detail & Related papers (2025-07-07T19:16:22Z)
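The positional tie-breaker idea can be pictured as replicating one language embedding into several object queries and adding a distinct learned offset to each copy, so that multiple instances of the same concept remain separable. The module below is an illustrative guess at that mechanism, not OpenWorldSAM's actual implementation.

```python
import torch
import torch.nn as nn

class TieBreakerQueries(nn.Module):
    """Sketch: expand one language embedding into K object queries, each
    offset by a learned tie-breaker so identical instances can be told apart."""
    def __init__(self, embed_dim=256, num_queries=8):
        super().__init__()
        self.tie_breakers = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)

    def forward(self, lang_emb: torch.Tensor) -> torch.Tensor:
        # lang_emb: (B, embed_dim) -> queries: (B, K, embed_dim)
        return lang_emb.unsqueeze(1) + self.tie_breakers.unsqueeze(0)

queries = TieBreakerQueries()(torch.randn(2, 256))
print(queries.shape)  # torch.Size([2, 8, 256])
```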
- Talk2SAM: Text-Guided Semantic Enhancement for Complex-Shaped Object Segmentation [0.0]
We propose Talk2SAM, a novel approach that integrates textual guidance to improve object segmentation. The method uses CLIP-based embeddings derived from user-provided text prompts to identify relevant semantic regions. Talk2SAM consistently outperforms SAM-HQ, achieving up to +5.9% IoU and +8.3% boundary IoU improvements.
arXiv Detail & Related papers (2025-06-03T19:53:10Z)
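A plausible reading of Talk2SAM's CLIP-guidance step is a cosine-similarity map between CLIP-space pixel features and the text embedding. The function below sketches that computation; how the resulting map is handed to SAM-HQ (e.g., as an extra input channel) is an assumption, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def text_relevance_map(pix_feats: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Sketch: cosine similarity between CLIP-space pixel features and a text
    embedding, yielding a map of regions relevant to the user's prompt."""
    # pix_feats: (B, C, H, W); text_emb: (B, C)
    pix = F.normalize(pix_feats, dim=1)
    txt = F.normalize(text_emb, dim=1)[:, :, None, None]
    return (pix * txt).sum(dim=1)  # (B, H, W), higher = more relevant

sim = text_relevance_map(torch.randn(1, 512, 64, 64), torch.randn(1, 512))
print(sim.shape, float(sim.max()))
```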
- SAM-CP: Marrying SAM with Composable Prompts for Versatile Segmentation [88.80792308991867]
The Segment Anything Model (SAM) has shown the ability to group image pixels into patches, but applying it to semantic-aware segmentation still faces major challenges. This paper presents SAM-CP, a simple approach that establishes two types of composable prompts beyond SAM and composes them for versatile segmentation. Experiments show that SAM-CP achieves semantic, instance, and panoptic segmentation in both open and closed domains.
arXiv Detail & Related papers (2024-07-23T17:47:25Z)
- AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning [61.666973416903005]
Segment Anything Model (SAM) has demonstrated its impressive generalization capabilities in open-world scenarios with the guidance of prompts.
We propose AlignSAM, a novel framework that performs automatic prompting to align SAM to an open context.
arXiv Detail & Related papers (2024-06-01T16:21:39Z)
- VRP-SAM: SAM with Visual Reference Prompt [76.71829864364283]
We propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM): VRP-SAM can utilize annotated reference images to comprehend specific objects and segment those objects in a target image. To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy.
arXiv Detail & Related papers (2024-02-27T17:58:09Z)
- Boosting Segment Anything Model Towards Open-Vocabulary Learning [69.24734826209367]
The Segment Anything Model (SAM) has emerged as a new paradigmatic vision foundation model. Despite SAM finding applications and adaptations in various domains, its primary limitation lies in its inability to grasp object semantics. We present Sambor, which seamlessly integrates SAM with an open-vocabulary object detector in an end-to-end framework.
arXiv Detail & Related papers (2023-12-06T17:19:00Z)