Fugu-MT Paper Translation (Abstract): Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

Paper overview: Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation

arxiv url: http://arxiv.org/abs/2605.12953v1
Date: Wed, 13 May 2026 03:36:44 GMT
Status: Translation complete
System update date: 2026-05-14 23:30:27.790894
Title: Seg-Agent: Test-Time Multimodal Reasoning for Training-Free Language-Guided Segmentation
Authors: Chao Hao, Jun Xu, Ji Du, Shuo Ye, Ziyue Qiao, Xiaodong Cun, Guangcong Wang, Xubin Zheng, Zitong Yu
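Abstract summary: Seg-Agent is a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. The proposed method constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Various-LangSeg is a new benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks.
Reference score (independently computed attention score): 52.8308168727975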
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language-guided segmentation transcends the scope limitations of traditional semantic segmentation, enabling models to segment arbitrary target regions based on natural language instructions. Existing approaches typically adopt a two-stage framework: employing Multimodal Large Language Models (MLLMs) to interpret instructions and generate visual prompts, followed by foundational segmentation models (e.g., SAM) to produce masks. However, due to the limited spatial grounding capabilities of off-the-shelf MLLMs, these methods often rely on extensive training on large-scale datasets to achieve satisfactory accuracy. While recent advances have introduced reasoning mechanisms to improve performance, they predominantly operate within the textual domain, performing chain-of-thought reasoning solely based on abstract text representations without direct visual feedback. In this paper, we propose Seg-Agent, a completely training-free framework that pioneers Explicit Multimodal Chain-of-Reasoning. Unlike prior text-only reasoning, our approach constructs an interactive visual reasoning loop comprising three stages: generation, selection, and refinement. Specifically, we leverage Set-of-Mark (SoM) visual prompting to render candidate regions directly onto the image, allowing the MLLM to "see" and iteratively reason about spatial relationships in the visual domain rather than just the textual one. This explicit multimodal interaction enables Seg-Agent to achieve performance comparable to state-of-the-art training-based methods without any parameter updates. Furthermore, to comprehensively evaluate generalization across diverse scenarios, we introduce Various-LangSeg, a novel benchmark covering explicit semantic, generic object, and reasoning-guided segmentation tasks. Extensive experiments demonstrate the effectiveness and robustness of our method.
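The abstract names three stages (generation, selection, refinement) and Set-of-Mark prompting but gives no implementation details. The following is only a minimal control-flow sketch under stated assumptions, not the authors' method: `Candidate`, `render_marks`, `segment`, and the three `toy_*` callables are hypothetical names, and the toy callables stand in for a segmentation model and an MLLM that the sketch does not call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Set, Tuple

Pixel = Tuple[int, int]          # (row, col)
Mask = Set[Pixel]                # a mask is the set of pixels it covers
Image = List[List[int]]          # toy image: a grid of intensities


@dataclass
class Candidate:
    mark: int                    # numeric label that would be drawn on the image
    mask: Mask


def render_marks(candidates: List[Candidate]) -> Dict[int, Pixel]:
    """Stand-in for Set-of-Mark rendering (hypothetical helper).

    A real system would draw numbered regions onto the image. Here we only
    compute the integer centroid where each numeric label would be placed.
    """
    positions: Dict[int, Pixel] = {}
    for cand in candidates:
        rows = [r for r, _ in cand.mask]
        cols = [c for _, c in cand.mask]
        positions[cand.mark] = (sum(rows) // len(rows), sum(cols) // len(cols))
    return positions


def segment(
    image: Image,
    instruction: str,
    propose: Callable[[Image, str], List[Mask]],
    choose: Callable[[Dict[int, Pixel], str], Optional[int]],
    refine: Callable[[Image, Mask, str], Mask],
    max_rounds: int = 3,
) -> Optional[Mask]:
    # Stage 1, generation: collect non-empty candidate regions and number them.
    masks = [m for m in propose(image, instruction) if m]
    candidates = [Candidate(mark=i, mask=m) for i, m in enumerate(masks, start=1)]
    if not candidates:
        return None

    # Stage 2, selection: show the marked view to the chooser and read back a mark.
    picked = choose(render_marks(candidates), instruction)
    by_mark = {c.mark: c.mask for c in candidates}
    if picked not in by_mark:
        return None
    mask = by_mark[picked]

    # Stage 3, refinement: iterate until the mask stops changing or rounds run out.
    for _ in range(max_rounds):
        updated = refine(image, mask, instruction)
        if updated == mask:
            break
        mask = updated
    return mask


# Toy stand-ins (assumptions for illustration only).
def toy_propose(image: Image, instruction: str) -> List[Mask]:
    return [{(0, 0), (0, 1), (1, 0), (1, 1)}, {(2, 2), (2, 3), (3, 2), (3, 3)}]


def toy_choose(marked: Dict[int, Pixel], instruction: str) -> Optional[int]:
    # Pretend the instruction asks for the right-most object.
    return max(marked, key=lambda mark: marked[mark][1])


def toy_refine(image: Image, mask: Mask, instruction: str) -> Mask:
    # Pretend the bottom row (row 3) is background and should be trimmed.
    return {(r, c) for r, c in mask if r != 3}


image = [[0] * 4 for _ in range(4)]
result = segment(image, "the object on the right", toy_propose, toy_choose, toy_refine)
```

In this toy run the second candidate is selected and the refinement step trims its bottom row, after which the mask is stable and the loop exits early. Passing the three stages in as callables keeps the loop itself free of any trained component, which mirrors the training-free framing of the abstract without claiming anything about how the paper implements each stage.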
