Fugu-MT Paper Translation (Abstract): From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

Paper overview: From Pixels to Concepts: Do Segmentation Models Understand What They Segment?

arxiv url: http://arxiv.org/abs/2605.09591v1
Date: Sun, 10 May 2026 15:07:17 GMT
Status: Translation complete
Last updated in system: 2026-05-13 02:24:05.532711
Title: From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
Title (reference translation): From Pixels to Concepts: Do Segmentation Models Understand What They Segment?
Authors: Shuang Liang, Zeqing Wang, Yuxian Li, Xihui Liu, Han Wang,
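Abstract summary: We introduce CAFE: Counterfactual Attribute Factuality Evaluation. The benchmark comprises 2,146 paired samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt.
Reference score (attention score computed by this site): 26.89265297426196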
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Segmentation is a fundamental vision task underlying numerous downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE: Counterfactual Attribute Factuality Evaluation, a novel benchmark for evaluating concept-faithful segmentation in promptable segmentation models. Our CAFE is built on attribute-level counterfactual manipulation: the target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate various model types and sizes on our CAFE. Experiments reveal a systematic gap between localization quality and concept discrimination: models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. Our CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.
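Abstract (reference translation): Segmentation is a fundamental vision task underlying many downstream applications. Recent promptable segmentation models, such as Segment Anything Model 3 (SAM3), extend segmentation from category-agnostic mask prediction to concept-guided localization conditioned on high-level textual prompts. However, existing benchmarks primarily evaluate mask accuracy or object presence, leaving it unclear whether these models faithfully ground the queried concept or instead rely on visually salient but semantically misleading cues. We introduce CAFE: Counterfactual Attribute Factuality Evaluation. The target region and ground-truth mask are preserved, while attributes such as surface appearance, context, or material composition are modified to introduce misleading semantic cues. The benchmark contains 2,146 paired test samples, each consisting of a target image, a ground-truth mask, a positive prompt, and a misleading negative prompt. These samples cover three counterfactual categories: Superficial Mimicry (SM), Context Conflict (CC), and Ontological Conflict (OC). We evaluate various model types and sizes on CAFE. Models often generate accurate masks even for misleading prompts, suggesting that strong mask prediction does not necessarily imply faithful semantic grounding. CAFE provides a controlled benchmark for diagnosing whether promptable segmentation models perform concept-faithful grounding rather than shortcut-driven mask retrieval.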
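The sample structure described in the abstract (target image, ground-truth mask, positive prompt, misleading negative prompt, one of three categories) can be sketched roughly as follows. This is only an illustration: the field names, the list-of-rows mask representation, the `is_fooled` check, its 0.5 threshold, and the example prompts are all assumptions made here, not the benchmark's actual schema or metric.

```python
from dataclasses import dataclass
from typing import List

# Binary mask as rows of 0/1 values (hypothetical representation for this sketch).
Mask = List[List[int]]

# The three counterfactual categories named in the abstract.
CATEGORIES = {"SM", "CC", "OC"}


@dataclass(frozen=True)
class CafeSample:
    """One paired test sample; field names are illustrative assumptions."""
    image_path: str
    gt_mask: Mask
    positive_prompt: str
    negative_prompt: str
    category: str

    def __post_init__(self) -> None:
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")


def iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Two empty masks agree perfectly; define their IoU as 1.0.
    return inter / union if union else 1.0


def is_fooled(neg_pred: Mask, gt: Mask, threshold: float = 0.5) -> bool:
    """True if the mask predicted for the misleading prompt still covers the target.

    The threshold is an arbitrary choice for this sketch.
    """
    return iou(neg_pred, gt) >= threshold


# Made-up example: the prompts and masks below are invented for illustration.
sample = CafeSample(
    image_path="img_0001.png",
    gt_mask=[[0, 1], [0, 1]],
    positive_prompt="a wooden chair",
    negative_prompt="a metal chair",
    category="OC",
)
pos_pred = [[0, 1], [0, 1]]  # mask returned for the positive prompt
neg_pred = [[0, 1], [0, 0]]  # mask returned for the misleading prompt
print(iou(pos_pred, sample.gt_mask))        # 1.0
print(is_fooled(neg_pred, sample.gt_mask))  # True (IoU is 0.5)
```

Under this reading, a concept-faithful model would score a high IoU on the positive prompt and return an empty or near-empty mask for the negative one. A model that returns an accurate mask for both is the failure mode the abstract describes.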

Related papers

S$^3$POT: Contrast-Driven Face Occlusion Segmentation via Self-Supervised Prompt Learning [46.05577414378133]
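S$^3$POT is a contrast-driven framework that synergizes face generation with self-supervised spatial prompting. Specifically, S$^3$POT consists of three modules: Reference Generation, Feature Enhancement, and Prompt Selection. Experiments on a dedicated dataset demonstrate the superior performance of S$^3$POT and the effectiveness of each module.
Paper reference translation (metadata) (2026-01-31T10:05:13Z)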
ResAgent: Entropy-based Prior Point Discovery and Visual Reasoning for Referring Expression Segmentation [21.87321809019825]
Referring Expression Segmentation (RES) is a core vision-language segmentation task that enables pixel-level understanding of targets through free-form linguistic expressions. ResAgent is a novel RES framework that integrates Entropy-Based Point Discovery (EBD) and Vision-Based Reasoning (VBR). ResAgent implements coarse-to
Paper reference translation (metadata) (2026-01-23T01:56:04Z)
SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space [11.534994345027362]
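Multimodal large language models (MLLMs) have shown remarkable capabilities in vision-language tasks such as reasoning segmentation. We propose a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the meaning of the original query while degrading segmentation performance. We introduce SPARTA, a black-box method that operates in the low-dimensional semantic latent space of a text autoencoder.
Paper reference translation (metadata) (2025-10-28T14:09:05Z)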
LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
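Referring 3D segmentation is a vision-language task that segments all points of a specified object from a 3D point cloud, as described by a query sentence. We propose a referring 3D segmentation pipeline called LESS. It achieves state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by 3.7% mIoU while using only binary labels.
Paper reference translation (metadata) (2024-10-17T07:47:41Z)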
Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
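Recent advances in pre-training techniques have enhanced the capabilities of vision foundation models. Recent studies extend SAM to few-shot semantic segmentation (FSS). We propose a simple yet effective approach based on graph analysis.
Paper reference translation (metadata) (2024-10-09T15:02:28Z)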
Boosting Unsupervised Semantic Segmentation with Principal Mask Proposals [15.258631373740686]
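Unsupervised semantic segmentation aims to automatically partition images into semantically meaningful regions by identifying global semantic categories within an image corpus without any annotation. We present PriMaPs (Principal Mask Proposals), which decompose images into semantically meaningful masks based on their feature representation. This makes unsupervised semantic segmentation possible by fitting class prototypes to PriMaPs with an expectation-maximization algorithm, PriMaPs-EM.
Paper reference translation (metadata) (2024-04-25T17:58:09Z)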
CPCM: Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation [60.0893353960514]
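We study the task of weakly supervised point cloud semantic segmentation with sparse annotations. We propose a Contextual Point Cloud Modeling (CPCM) method that consists of two parts: a region-wise masking (RegionMask) strategy and a contextual masked training (CMT) method.
Paper reference translation (metadata) (2023-07-19T04:41:18Z)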
Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation [75.00151934315967]
MaskDistill is a novel framework for unsupervised semantic segmentation. Our framework does not latch onto low-level image cues and is not limited to object-centric datasets.
Paper reference translation (metadata) (2022-06-13T17:59:43Z)

The related papers list is generated automatically from the titles and abstracts of papers on this site.
