Fugu-MT paper translation (abstract): InstructSAM: Segment Any Instance with Any Instructions

Paper abstract: InstructSAM: Segment Any Instance with Any Instructions

arxiv url: http://arxiv.org/abs/2605.26102v2
Date: Sun, 31 May 2026 07:20:32 GMT
Status: translation complete
Last updated in system: 2026-06-02 18:24:16.436022
Title: InstructSAM: Segment Any Instance with Any Instructions
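Authors: Yuqian Yuan, Wentong Li, Zhaocheng Li, Yutong Lin, Juncheng Li, Siliang Tang, Jun Xiao, Yueting Zhuang, Wenqiao Zhang
Abstract summary: InstructSAM is a framework designed for multi-instance segmentation under arbitrary instructions. A bank of learnable instance queries is injected into a vision-language model (VLM), and the resulting queries are projected into SAM3's detector query space. A hybrid-attention mechanism promotes interaction among these queries, visual tokens, and instruction tokens.
Reference score (independently computed attention score): 70.32433456722613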
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this paper, we introduce InstructSAM, a unified and streamlined framework designed for multi-instance segmentation under arbitrary instructions. We formulate instruction-driven instance segmentation as a set-structured query prediction problem and propose an explicit reasoning-to-instance query interface that bridges a vision-language model (VLM) and SAM3. Specifically, a bank of learnable instance queries is injected into the VLM and contextualized with instruction and visual information, enabling each query to serve as an instance-aware slot. A hybrid-attention mechanism further promotes interaction among these queries, visual tokens, and instruction tokens, improving instance enumeration and reducing duplicate predictions. The resulting LLM-conditioned queries are projected into SAM3's detector query space to drive accurate multi-instance segmentation in a single forward pass. This design equips SAM3 with high-level instruction understanding, compositional reasoning, and instance-level set prediction without modifying its core architecture. To support training and evaluation, we further construct Inst2Seg, a high-quality and large-scale instruction-based instance segmentation dataset and benchmark that couples free-form instructions with instance-level masks. Extensive experiments show that InstructSAM, at only 2B scale, achieves strong results across complex instruction-driven and phrase-level referring segmentation benchmarks, outperforming prior end-to-end methods and SAM3's agentic pipeline while enabling efficient single-pass multi-instance prediction.
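The abstract does not spell out the form of the hybrid-attention mechanism. As a rough illustration only, the sketch below shows one plausible reading: visual and instruction tokens keep the causal attention of a decoder-only LLM, while the instance queries, appended at the end, see all context tokens and attend to each other bidirectionally. The token layout, the function name `hybrid_attention_mask`, and the masking rule itself are assumptions made for this example, not details taken from the paper.

```python
def hybrid_attention_mask(n_visual, n_instruction, n_queries):
    """Build a boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumed (hypothetical) layout: [visual tokens | instruction tokens | instance queries].
    """
    n_context = n_visual + n_instruction
    total = n_context + n_queries

    # Start from a standard causal mask: each token sees itself and earlier tokens.
    mask = [[j <= i for j in range(total)] for i in range(total)]

    # Assumption: instance queries attend to one another in both directions,
    # so each query can take the other slots into account.
    for i in range(n_context, total):
        for j in range(n_context, total):
            mask[i][j] = True
    return mask


if __name__ == "__main__":
    # Print the mask for a tiny sequence: 2 visual, 1 instruction, 2 query tokens.
    for row in hybrid_attention_mask(n_visual=2, n_instruction=1, n_queries=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under this assumed rule, context tokens never attend to the queries, so the VLM's handling of the image and instruction is unchanged, while every query row is fully populated. Letting the slots see each other is one way such a design could help reduce duplicate predictions.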


Related papers