Fugu-MT Paper Translation (Abstract): Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

Paper overview: Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning

arxiv url: http://arxiv.org/abs/2606.09303v1
Date: Mon, 08 Jun 2026 10:10:55 GMT
Status: Translation complete
System update date: 2026-06-09 14:42:06.918899
Title: Reason Twice: Segmentation via Candidate Discovery and Comparative Reasoning
Authors: Xinyan Gao, Haoran Hao, Xiangyu Yue
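Abstract summary: The paper proposes Rea2Seg, a two-stage framework for mask generation and selection. The framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then uses an MLLM to reason over the question and the candidate masks and assigns a score to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask.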
Reference score (independently computed attention level): 10.180485222685492
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid development of pretrained foundation models has enabled more general image segmentation. Multimodal large language models (MLLMs) have been widely explored for image segmentation with complex queries that require high-level reasoning. Despite promising progress, existing methods are often constrained by limited training data and the gap between MLLMs and mask generation modules. To better transfer MLLMs' perception and reasoning ability to complex reasoning-based segmentation tasks, we propose a two-stage framework Rea2Seg for mask generation and selection. Specifically, the framework first identifies potential regions as candidate masks based on the attention maps of a segmentation MLLM. It then employs an MLLM to reason over the question and candidate masks and assign scores to each mask. The final segmentation result is obtained by reranking the candidates and selecting the highest-scoring mask, reformulating image segmentation as candidate discovery followed by discriminative mask selection. We also notice that a large portion of questions in existing benchmarks focus on commonsense reasoning, and these questions usually do not fully require joint visual observation and reasoning. To address this issue, we introduce a new benchmark called ReasonSeg-SGDR that comprehensively evaluates a model's perception, grounding, and reasoning abilities across multiple dimensions, including discriminative recognition, spatial reasoning, geometric reasoning, and multi-step reasoning, with fine-grained mask generation. In addition, we collect training data to enhance MLLMs' ability to jointly understand multimodal queries and candidate masks, and to assign scores through reasoning. Experimental results on the proposed benchmark and ReasonSeg demonstrate the effectiveness of the unified mask generation and selection framework.
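The two-stage idea in the abstract (discover candidate masks, then score and rerank them) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the abstract does not say how candidates are derived from attention maps, so thresholding followed by 4-connected component grouping is a swapped-in stand-in, and `toy_score` is a hypothetical placeholder for the MLLM that reasons over the question and each candidate. The function names `discover_candidates` and `select_mask` are likewise invented and do not come from the paper.

```python
from collections import deque
from typing import Callable, List, Optional, Set, Tuple

Pixel = Tuple[int, int]
Mask = Set[Pixel]


def discover_candidates(attention: List[List[float]], threshold: float = 0.5) -> List[Mask]:
    """Stage 1 (assumed): threshold the attention map and group the
    surviving pixels into 4-connected regions, one candidate mask each."""
    if not attention:
        return []
    rows, cols = len(attention), len(attention[0])
    seen: Set[Pixel] = set()
    candidates: List[Mask] = []
    for r in range(rows):
        for c in range(cols):
            if attention[r][c] < threshold or (r, c) in seen:
                continue
            # Breadth-first flood fill from this unvisited high-attention pixel.
            region: Mask = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                region.add((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and (ny, nx) not in seen and attention[ny][nx] >= threshold:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            candidates.append(region)
    return candidates


def select_mask(
    question: str,
    candidates: List[Mask],
    score_fn: Callable[[str, Mask], float],
) -> Tuple[Optional[Mask], List[Tuple[float, Mask]]]:
    """Stage 2: score every candidate, rerank by score, return the best."""
    if not candidates:
        return None, []
    ranked = sorted(
        ((score_fn(question, mask), mask) for mask in candidates),
        key=lambda pair: pair[0],
        reverse=True,
    )
    return ranked[0][1], ranked


def toy_score(question: str, mask: Mask) -> float:
    """Hypothetical scorer standing in for the MLLM: prefers large regions
    when the question mentions 'largest', small regions otherwise."""
    size = float(len(mask))
    return size if "largest" in question else -size


# Toy attention map containing two separate high-attention regions.
attention = [
    [0.9, 0.8, 0.1, 0.0],
    [0.7, 0.1, 0.0, 0.6],
    [0.0, 0.0, 0.0, 0.7],
]
candidates = discover_candidates(attention, threshold=0.5)
best, ranked = select_mask("largest region", candidates, toy_score)
```

In a real system the scorer would be a learned model conditioned on the image, the question, and the candidate masks; the sketch only shows the control flow of discovery followed by discriminative selection.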

Related papers