Fugu-MT paper translation (abstract): Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation

Paper overview: Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation

arxiv url: http://arxiv.org/abs/2603.27993v1
Date: Mon, 30 Mar 2026 03:33:10 GMT
Status: translation complete
Last updated in system: 2026-03-31 23:18:45.211118
Title: Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation
Title (reference translation): Progressive Prompt-Guided Cross-Modal Reasoning for Referring Image Segmentation
Authors: Jiachen Li, Hongyun Wang, Jinyu Xu, Wenbo Jiang, Yanchun Ma, Yongjian Liu, Qing Xie, Bolong Zheng
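Abstract summary: Referring image segmentation aims to localize and segment a target object in an image based on a free-form referring expression. The authors propose PPCR, a Progressive Prompt-guided Cross-modal Reasoning framework for referring image segmentation. PPCR explicitly structures the reasoning process as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline.
Reference score (attention score computed independently by this site): 11.276795416626385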
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Referring image segmentation aims to localize and segment a target object in an image based on a free-form referring expression. The core challenge lies in effectively bridging linguistic descriptions with object-level visual representations, especially when referring expressions involve detailed attributes and complex inter-object relationships. Existing methods either rely on cross-modal alignment or employ Semantic Segmentation Prompts, but they often lack explicit reasoning mechanisms for grounding language descriptions to target regions in the image. To address these limitations, we propose PPCR, a Progressive Prompt-guided Cross-modal Reasoning framework for referring image segmentation. PPCR explicitly structures the reasoning process as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline. Specifically, PPCR first employs multimodal large language models (MLLMs) to generate Semantic Segmentation Prompts that capture key semantic cues of the target object. Based on this semantic context, Spatial Segmentation Prompts are further generated to reason about object location and spatial extent, enabling a progressive transition from semantic understanding to spatial grounding. The Semantic and Spatial Segmentation Prompts are then jointly integrated into the segmentation module to guide accurate target localization and segmentation. Extensive experiments on standard referring image segmentation benchmarks demonstrate that PPCR consistently outperforms existing methods. The code will be publicly released to facilitate reproducibility.
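Abstract (reference translation): Referring image segmentation aims to localize and segment a target object in an image on the basis of a free-form referring expression. The central challenge is to bridge language descriptions and object-level visual representations effectively, particularly when the referring expression involves detailed attributes and complex relationships between objects. Existing methods use either cross-modal alignment or Semantic Segmentation Prompts, but they often lack an explicit reasoning mechanism for grounding the language description to the target region in the image. To address these limitations, the authors propose PPCR, a Progressive Prompt-guided Cross-modal Reasoning framework for referring image segmentation. PPCR explicitly structures the reasoning process as a Semantic Understanding-Spatial Grounding-Instance Segmentation pipeline. Specifically, it first uses multimodal large language models (MLLMs) to generate Semantic Segmentation Prompts that capture the key semantic cues of the target object. Based on this semantic context, Spatial Segmentation Prompts are then generated to reason about the location and spatial extent of the target, enabling a progressive transition from semantic understanding to spatial grounding. The Semantic and Spatial Segmentation Prompts are integrated into the segmentation module, where they guide accurate target localization and segmentation. Extensive experiments on standard referring image segmentation benchmarks show that PPCR consistently outperforms existing methods. The code will be released publicly to facilitate reproducibility.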
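
The three-stage flow described in the abstract (semantic prompt, then spatial prompt conditioned on it, then segmentation guided by both) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which had not been released at the time of this listing. Every name here (`progressive_segment`, `SemanticPrompt`, `SpatialPrompt`, and the `toy_*` functions) is hypothetical, and the toy stages are trivial stand-ins for the MLLM and the segmentation module so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image, row-major
Mask = List[List[int]]           # 1 = target pixel, 0 = background
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive


@dataclass
class SemanticPrompt:
    cues: str                    # key semantic cues of the target (assumed form)


@dataclass
class SpatialPrompt:
    box: Box                     # coarse location and extent (assumed form)


def progressive_segment(
    image: Image,
    expression: str,
    semantic_stage: Callable[[Image, str], SemanticPrompt],
    spatial_stage: Callable[[Image, str, SemanticPrompt], SpatialPrompt],
    segmenter: Callable[[Image, SemanticPrompt, SpatialPrompt], Mask],
) -> Mask:
    """Run the stages in order; each later stage sees the earlier outputs."""
    semantic = semantic_stage(image, expression)          # semantic understanding
    spatial = spatial_stage(image, expression, semantic)  # spatial grounding
    return segmenter(image, semantic, spatial)            # instance segmentation


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_semantic(image: Image, expression: str) -> SemanticPrompt:
    return SemanticPrompt(cues=expression.lower())


def toy_spatial(image: Image, expression: str, semantic: SemanticPrompt) -> SpatialPrompt:
    rows, cols = len(image), len(image[0])
    half = cols // 2
    if "left" in semantic.cues:
        return SpatialPrompt(box=(0, 0, rows, half))
    if "right" in semantic.cues:
        return SpatialPrompt(box=(0, half, rows, cols))
    return SpatialPrompt(box=(0, 0, rows, cols))


def toy_segmenter(image: Image, semantic: SemanticPrompt, spatial: SpatialPrompt) -> Mask:
    r0, c0, r1, c1 = spatial.box
    # Keep non-zero pixels that fall inside the spatial prompt's box.
    return [
        [1 if (r0 <= r < r1 and c0 <= c < c1 and v > 0) else 0
         for c, v in enumerate(row)]
        for r, row in enumerate(image)
    ]


image = [
    [9, 0, 0, 7],
    [9, 0, 0, 7],
]
mask = progressive_segment(
    image, "The object on the LEFT", toy_semantic, toy_spatial, toy_segmenter
)
print(mask)  # [[1, 0, 0, 0], [1, 0, 0, 0]]
```

Passing the stages in as callables keeps the ordering explicit: the spatial stage cannot run without the semantic prompt, and the segmenter receives both, which mirrors the progressive transition the abstract describes.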


Related papers