Fugu-MT Paper Translation (Abstract): Spatial Preference Rewarding for MLLMs Spatial Understanding

Paper summary: Spatial Preference Rewarding for MLLMs Spatial Understanding

arXiv URL: http://arxiv.org/abs/2510.14374v1
Date: Thu, 16 Oct 2025 07:16:18 GMT
Status: Translation complete
Last updated in system: 2025-10-17 21:15:14.75575
Title: Spatial Preference Rewarding for MLLMs Spatial Understanding
Title (reference translation): Spatial Preference Rewarding for the Spatial Understanding of MLLMs
Authors: Han Qiu, Peng Gao, Lewei Lu, Xiaoqin Zhang, Ling Shao, Shijian Lu
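Abstract summary: Multimodal large language models (MLLMs) have shown promising spatial understanding capabilities. Despite these successes, MLLMs still fall short in fine-grained spatial perception. This paper proposes Spatial Preference Rewarding (SPR), an approach that enhances the spatial capabilities of MLLMs.
Reference score (independently computed attention score): 92.25703021388142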
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) have demonstrated promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still fall short in fine-grained spatial perception abilities, such as generating detailed region descriptions or accurately localizing objects. Additionally, they often fail to respond to the user's requirements for desired fine-grained spatial understanding. This issue might arise because existing approaches primarily focus on tuning MLLMs to model pre-annotated instruction data to inject spatial knowledge, without direct supervision of MLLMs' actual responses. We address this issue with Spatial Preference Rewarding (SPR), an approach that enhances MLLMs' spatial capabilities by rewarding detailed responses with precise object localization over vague or inaccurate responses. With randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores to comprehensively evaluate the text quality and localization quality in MLLM-generated descriptions. We also refine the MLLM descriptions for better localization accuracy and pair the best-scored refinement with the lowest-scored initial description for direct preference optimization, thereby enhancing fine-grained alignment with visual input. Extensive experiments over standard referring and grounding benchmarks show that SPR improves MLLM spatial understanding capabilities effectively with minimal overhead in training. Data and code will be released at https://github.com/hanqiu-hq/SPR
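Abstract (reference translation): Multimodal large language models (MLLMs) have shown promising spatial understanding capabilities, such as referencing and grounding object descriptions. Despite their successes, MLLMs still lack fine-grained spatial perception abilities, for example generating detailed region descriptions or localizing objects accurately. In addition, they often fail to meet the user's requirements for the desired fine-grained spatial understanding. This issue might arise because existing approaches mainly focus on tuning MLLMs to model pre-annotated instruction data in order to inject spatial knowledge, without directly supervising the MLLMs' actual responses. The paper addresses this issue with the Spatial Preference Rewarding (SPR) approach, which enhances the spatial capabilities of MLLMs by rewarding detailed responses with precise object localization over vague or inaccurate responses. Using randomly selected image regions and region descriptions from MLLMs, SPR introduces semantic and localization scores that comprehensively evaluate the text quality and localization quality of MLLM-generated descriptions. The MLLM descriptions are also refined for better localization accuracy, and the best-scored refinement is paired with the lowest-scored initial description for direct preference optimization, improving fine-grained alignment with the visual input. Extensive experiments on standard referring and grounding benchmarks show that SPR effectively improves MLLM spatial understanding capabilities with minimal training overhead. Data and code will be released at https://github.com/hanqiu-hq/SPR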
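
Illustrative sketch (not from the paper): the abstract describes scoring descriptions with semantic and localization scores, then pairing the best-scored refinement with the lowest-scored initial description for direct preference optimization. The Python below shows one way such a pairing step could look. The abstract does not give the scoring formulas, so everything here is an assumption made for illustration: the `Candidate` structure, the use of mean IoU as the localization score, the weighted sum with `alpha`, and the function names are all hypothetical. Only `dpo_loss` follows a known definition, the standard DPO objective for a single preference pair.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), hypothetical format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Candidate:
    text: str
    semantic_score: float        # assumed to lie in [0, 1], from an external text scorer
    predicted_boxes: List[Box]   # boxes mentioned in the description
    reference_boxes: List[Box]   # boxes they are compared against, same order
    refined: bool                # True for a refinement, False for an initial description


def localization_score(c: Candidate) -> float:
    """Assumption: mean IoU over paired boxes stands in for the localization score."""
    pairs = list(zip(c.predicted_boxes, c.reference_boxes))
    return sum(iou(p, r) for p, r in pairs) / len(pairs) if pairs else 0.0


def combined_score(c: Candidate, alpha: float = 0.5) -> float:
    """Assumption: a weighted sum of the two scores; the paper may combine them differently."""
    return alpha * c.semantic_score + (1.0 - alpha) * localization_score(c)


def build_preference_pair(cands: List[Candidate]) -> Tuple[Candidate, Candidate]:
    """Return (chosen, rejected): best-scored refinement, lowest-scored initial description."""
    refined = [c for c in cands if c.refined]
    initial = [c for c in cands if not c.refined]
    if not refined or not initial:
        raise ValueError("need at least one refined and one initial description")
    return max(refined, key=combined_score), min(initial, key=combined_score)


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin))."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Numerically stable form of -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In this sketch, the chosen and rejected texts returned by `build_preference_pair` would serve as the preferred and dispreferred responses for the same image and prompt, and `dpo_loss` would be evaluated on their log-probabilities under the policy and a frozen reference model.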

Related papers