Fugu-MT Paper Translation (Abstract): FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

Paper overview: FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning

arxiv url: http://arxiv.org/abs/2510.21311v1
Date: Fri, 24 Oct 2025 10:14:17 GMT
Status: Translation complete
System update date: 2025-10-28 09:00:15.434374
Title: FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning
Title (reference translation): FineRS: Fine-grained Segmentation of Tiny Objects with Reinforcement Learning
Authors: Lu Zhang, Jiazuo Yu, Haomiao Xiong, Ping Hu, Yunzhi Zhuge, Huchuan Lu, You He
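Abstract summary: FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. FineRS-4k is a new dataset for evaluating MLLMs on attribute-level reasoning and on pixel-level segmentation of subtle, small-scale targets.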
Reference score (independently computed attention level): 62.11389260206383
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities across a wide range of vision-language tasks. However, due to their restricted input resolutions, MLLMs face significant challenges in precisely understanding and localizing visual details in high-resolution images, particularly when dealing with extra-small objects embedded in cluttered contexts. To address this issue, we propose FineRS, a two-stage MLLM-based reinforcement learning framework for jointly reasoning about and segmenting extremely small objects within high-resolution scenes. FineRS adopts a coarse-to-fine pipeline comprising Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR). Specifically, GSE performs instruction-guided reasoning to generate a textual response and a coarse target region, while LPR refines this region to produce an accurate bounding box and segmentation mask. To couple the two stages, we introduce a locate-informed retrospective reward, where LPR's outputs are used to optimize GSE for more robust coarse region exploration. Additionally, we present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation of subtle, small-scale targets in complex high-resolution scenes. Experimental results on FineRS-4k and public datasets demonstrate that our method consistently outperforms state-of-the-art MLLM-based approaches on both instruction-guided segmentation and visual reasoning tasks.
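Abstract (reference translation): Multi-modal Large Language Models (MLLMs) show remarkable capabilities across a variety of vision-language tasks. However, because their input resolution is restricted, MLLMs face a significant challenge in accurately understanding and localizing visual details in high-resolution images. To address this issue, we propose FineRS, a two-stage MLLM-based reinforcement learning framework. FineRS adopts a coarse-to-fine pipeline that combines Global Semantic Exploration (GSE) and Localized Perceptual Refinement (LPR). Specifically, GSE performs instruction-guided reasoning to generate a textual response and a coarse target region, and LPR refines this region to produce an accurate bounding box and segmentation mask. To couple the two stages, we introduce a locate-informed retrospective reward that uses LPR's outputs to optimize GSE for more robust coarse region exploration. In addition, we present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and on pixel-level segmentation of subtle, small targets in complex high-resolution scenes. The proposed method consistently outperforms state-of-the-art MLLM-based approaches on both instruction-guided segmentation and visual reasoning tasks.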
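
The coarse-to-fine idea in the abstract (a coarse region first, then a refined bounding box inside it) implies some coordinate bookkeeping between the crop and the full image. The following is a minimal sketch of that bookkeeping only, not the authors' implementation: the `Box` type, the function names, and the half-open pixel convention are all assumptions made for illustration, and no model or reward is involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box in pixels: (x0, y0) inclusive, (x1, y1) exclusive."""
    x0: int
    y0: int
    x1: int
    y1: int


def clamp_to_image(box: Box, width: int, height: int) -> Box:
    """Clip a coarse region so that it lies inside a width x height image."""
    return Box(
        max(0, box.x0),
        max(0, box.y0),
        min(width, box.x1),
        min(height, box.y1),
    )


def local_to_global(coarse: Box, local: Box) -> Box:
    """Map a box expressed in crop coordinates back to full-image coordinates.

    The crop's top-left corner is (coarse.x0, coarse.y0), so the mapping is a
    pure translation by that offset.
    """
    return Box(
        coarse.x0 + local.x0,
        coarse.y0 + local.y0,
        coarse.x0 + local.x1,
        coarse.y0 + local.y1,
    )


# Illustrative values only: a 3840x2160 image, a coarse region that overshoots
# the bottom edge, and a small refined box found inside the crop.
coarse_region = clamp_to_image(Box(3000, 1800, 3600, 2300), width=3840, height=2160)
refined_box = local_to_global(coarse_region, Box(120, 80, 150, 110))
```

Here the coarse region is clipped to the image before cropping, so a refined box predicted in crop coordinates can be shifted back by the crop's top-left offset without ever leaving the image.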

Related papers