Fugu-MT Paper Translation (Abstract): Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

Paper overview: Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation

arxiv url: http://arxiv.org/abs/2604.07916v1
Date: Thu, 09 Apr 2026 07:37:09 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:05.769923
Title: Tarot-SAM3: Training-free SAM3 for Any Referring Expression Segmentation
Title (reference translation): Tarot-SAM3: Training-Free SAM3 for Arbitrary Referring Expression Segmentation
Authors: Weiming Zhang, Dingwen Xiao, Songyue Guo, Guangyu Xiang, Shiqi Wen, Minwei Zhao, Lei Chen, Lin Wang
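Abstract summary: Tarot-SAM3 is a training-free framework that can accurately segment from any referring expression. Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement.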
Reference score (independently computed attention measure): 18.568343383992072
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Referring Expression Segmentation (RES) aims to segment image regions described by natural-language expressions, serving as a bridge between vision and language understanding. Existing RES methods, however, rely heavily on large annotated datasets and are limited to either explicit or implicit expressions, hindering their ability to generalize to any referring expression. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Nonetheless, applying it to RES remains challenging: (1) SAM3 struggles with longer or implicit expressions; (2) naive coupling of SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without enabling refinement of SAM3's segmentation outputs. To this end, we present Tarot-SAM3, a novel training-free framework that can accurately segment from any referring expression. Specifically, Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options to support structured expression parsing and evaluation-aware rephrasing. This transforms arbitrary queries into robust heterogeneous prompts for generating reliable masks with SAM3. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement by leveraging rich feature relationships from DINOv3 to compare discriminative regions among ERI outputs. It then infers region affiliation to the target, thereby correcting over- and under-segmentation. Extensive experiments demonstrate that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks, as well as open-world scenarios. Ablation studies further validate the effectiveness of each phase.
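Abstract (reference translation): Referring Expression Segmentation (RES) aims to segment image regions described in natural language and serves as a bridge between vision and language understanding. However, existing RES methods depend heavily on large annotated datasets and are limited to either explicit or implicit expressions, which hinders their ability to generalize to arbitrary referring expressions. Recently, the Segment Anything Model 3 (SAM3) has shown impressive robustness in Promptable Concept Segmentation. Applying it to RES nevertheless remains difficult: (1) SAM3 struggles with longer or implicit expressions; (2) naively coupling SAM3 with a multimodal large language model (MLLM) makes the final results overly dependent on the MLLM's reasoning capability, without allowing SAM3's segmentation outputs to be refined. To address this, the authors propose Tarot-SAM3, a new training-free framework that can accurately segment from any referring expression. Specifically, Tarot-SAM3 consists of two key phases. First, the Expression Reasoning Interpreter (ERI) phase introduces reasoning-assisted prompt options that support structured expression parsing and evaluation-aware rephrasing. This converts arbitrary queries into robust heterogeneous prompts that yield reliable masks with SAM3. Second, the Mask Self-Refining (MSR) phase selects the best mask across prompt types and performs self-refinement, using rich feature relationships from DINOv3 to compare discriminative regions among the ERI outputs. It then infers whether each region belongs to the target, correcting over-segmentation and under-segmentation. Extensive experiments show that Tarot-SAM3 achieves strong performance on both explicit and implicit RES benchmarks as well as in open-world scenarios. Ablation studies further validate the effectiveness of each phase.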
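
Note: the select-then-refine idea of the MSR phase can be illustrated with a toy sketch. This is not the authors' implementation. The function names, the "agreed core" prototype rule, and the 0.9 threshold are all assumptions made for illustration. Masks are modelled as sets of patch indices, and short float lists stand in for per-patch DINOv3 features.

```python
# Toy sketch only: every name, rule, and threshold here is hypothetical
# and is NOT taken from the Tarot-SAM3 paper or code.
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def select_best(candidates):
    """Pick the (prompt_type, mask, score) candidate with the highest score."""
    return max(candidates, key=lambda c: c[2])


def refine(best_mask, other_masks, features, threshold=0.9):
    """Resolve patches on which the candidate masks disagree.

    Patches present in every candidate form a trusted core. A disputed
    patch is kept only if its feature is close to the core prototype,
    which trims over-segmentation and fills in under-segmentation.
    """
    all_masks = [set(best_mask)] + [set(m) for m in other_masks]
    core = set.intersection(*all_masks)
    if not core:
        return set(best_mask)  # nothing agreed on, so leave the mask as-is
    prototype = mean_vector([features[i] for i in sorted(core)])
    disputed = set.union(*all_masks) - core
    refined = set(core)
    for i in disputed:
        if cosine(features[i], prototype) >= threshold:
            refined.add(i)
    return refined


# Stand-in features for five patches; patch 3 looks unlike the others.
features = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05], [0.0, 1.0], [0.95, 0.0]]
# One candidate mask per (hypothetical) prompt type, with a made-up score.
candidates = [
    ("parsed", {0, 1, 3}, 0.8),
    ("rephrased", {0, 1, 2}, 0.6),
    ("original", {0, 1, 4}, 0.5),
]
best = select_best(candidates)
others = [mask for name, mask, _ in candidates if name != best[0]]
refined_mask = refine(best[1], others, features)
```

In this toy run the selected mask wrongly includes patch 3 and misses patches 2 and 4. Comparing each disputed patch against the prototype of the agreed core drops the former and adds the latter.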

Related papers