Fugu-MT Paper Translation (Abstract): CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation

Paper abstract: CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation

arxiv url: http://arxiv.org/abs/2605.24807v1
Date: Sun, 24 May 2026 01:40:30 GMT
Status: Translation complete
System update date: 2026-05-26 19:50:18.456598
Title: CLIP-Guided SAM: Parameter-Efficient Semantic Conditioning for Promptable Segmentation
Authors: Shayan Jalilian, Abdul Bais
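Abstract summary: We propose CLIP-Guided SAM, a parameter-efficient segmentation framework built on internal semantic conditioning. Instead of using semantic signals only to generate prompts, we inject CLIP-derived text, vision, and similarity features directly into SAM's image encoder. Our framework is designed for low labeled-data settings and applies to both general-domain benchmarks and specialized downstream tasks.
Reference score (independently computed attention metric): 6.517222960194991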
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Promptable foundation models such as the Segment Anything Model (SAM) produce high-quality masks but remain semantically blind, relying on external prompts to specify categories. Existing vision-language approaches address this limitation by using external prompt coupling, where a vision-language model generates spatial prompts for SAM as a separate stage. We propose CLIP-Guided SAM, a parameter-efficient segmentation framework built on internal semantic conditioning. Instead of using semantic signals only to generate prompts, we inject CLIP-derived text, vision, and similarity features directly into SAM's image encoder through lightweight multi-modal semantic adapters. These adapters condition SAM's internal feature representations, allowing semantic information to influence mask prediction while preserving SAM's original promptable interface. Our framework is designed for low labeled-data settings and applies to both general-domain benchmarks and specialized downstream tasks. It supports two operating modes: Manual mode, for interactive segmentation with both text and spatial prompts, and Semi-Automatic text-only mode, for applications that require concept-specific segmentation using only textual input. We show that robustness depends on aligning training with the type of prompts used at inference, making train-test prompt consistency an important design principle. Through extensive experiments and ablations, we evaluate our method against SAM+PEFT baselines without semantic conditioning, vision-language + SAM pipelines, SAM 3, and strong semi-supervised segmentation methods that rely on large amounts of unlabeled data. Across these settings, CLIP-Guided SAM consistently achieves superior or competitive performance while remaining parameter-efficient in both training and deployment.
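The abstract describes injecting CLIP-derived text, vision, and similarity features into SAM's image encoder through lightweight adapters, but it does not give the adapter design. The following is only a minimal NumPy sketch of one way such conditioning could look, assuming a bottleneck adapter with a residual connection. The class name `SemanticAdapter`, the concatenation-based fusion, the zero-initialized up-projection, and all dimensions are illustrative assumptions, not the authors' method.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


class SemanticAdapter:
    """Hypothetical bottleneck adapter (an assumption, not the paper's design).

    Fuses SAM encoder tokens with CLIP patch features, a CLIP text embedding,
    and their per-patch cosine similarity, then adds the result residually.
    """

    def __init__(self, sam_dim, clip_dim, bottleneck, rng):
        # Input is [SAM token | CLIP patch feature | CLIP text embedding | similarity].
        in_dim = sam_dim + 2 * clip_dim + 1
        self.w_down = rng.normal(0.0, 0.02, size=(in_dim, bottleneck))
        # Zero-initialized up-projection: the adapter starts as an identity map,
        # so the frozen encoder's features are unchanged before training.
        self.w_up = np.zeros((bottleneck, sam_dim))

    def __call__(self, sam_tokens, clip_patches, clip_text):
        # sam_tokens: (N, sam_dim), clip_patches: (N, clip_dim), clip_text: (clip_dim,)
        sim = l2_normalize(clip_patches) @ l2_normalize(clip_text)  # (N,)
        text = np.broadcast_to(clip_text, clip_patches.shape)       # (N, clip_dim)
        fused = np.concatenate(
            [sam_tokens, clip_patches, text, sim[:, None]], axis=1
        )
        hidden = np.maximum(fused @ self.w_down, 0.0)  # ReLU bottleneck
        return sam_tokens + hidden @ self.w_up         # residual conditioning


# Toy usage with random stand-ins for real SAM and CLIP features.
rng = np.random.default_rng(0)
adapter = SemanticAdapter(sam_dim=8, clip_dim=4, bottleneck=2, rng=rng)
tokens = rng.normal(size=(16, 8))   # 16 image tokens from a SAM-like encoder
patches = rng.normal(size=(16, 4))  # matching CLIP patch features
text = rng.normal(size=4)           # CLIP text embedding for one concept
out = adapter(tokens, patches, text)
```

In this sketch only `w_down` and `w_up` would be trained, which is one common way to keep a method parameter-efficient. The conditioned tokens keep the shape of the input tokens, so a downstream promptable mask decoder could consume them unchanged.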

Related papers