Fugu-MT Paper Translation (Summary): DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes

Paper summary: DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes

arxiv url: http://arxiv.org/abs/2505.23179v1
Date: Thu, 29 May 2025 07:16:16 GMT
Status: Translation complete
Last updated in system: 2025-05-30 18:14:07.730635
Title: DIP-R1: Deep Inspection and Perception with RL Looking Through and Understanding Complex Scenes
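Title (reference translation): DIP-R1: Deep inspection and perception with RL for observing and understanding complex scenes
Authors: Sungjune Park, Hyunjun Kim, Junho Kim, Seongho Kim, Yong Man Ro
Abstract summary: Deep Inspection and Perception with RL (DIP-R1) is designed to enhance the visual perception capabilities of MLLMs. DIP-R1 guides MLLMs through detailed inspection of visual scenes via three simple rule-based reward models. It achieves consistent and significant improvements across a variety of in-domain and out-of-domain scenarios.
Reference score (independently computed attention score): 51.895756593200296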
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated significant visual understanding capabilities, yet their fine-grained visual perception in complex real-world scenarios, such as densely crowded public areas, remains limited. Inspired by the recent success of reinforcement learning (RL) in both LLMs and MLLMs, in this paper we explore how RL can enhance the visual perception ability of MLLMs. We then develop a novel RL-based framework, Deep Inspection and Perception with RL (DIP-R1), designed to enhance the visual perception capabilities of MLLMs by comprehending complex scenes and looking through visual instances closely. DIP-R1 guides MLLMs through detailed inspection of visual scenes via three simply designed rule-based reward models. First, we adopt a standard reasoning reward that encourages the model to include three step-by-step processes: 1) reasoning, to understand the visual scene, 2) observing, to look through regions that are of interest but ambiguous, and 3) decision-making, to predict the answer. Second, a variance-guided looking reward is designed to examine uncertain regions during the second (observing) process. It explicitly enables the model to inspect ambiguous areas, improving its ability to mitigate perceptual uncertainties. Third, we model a weighted precision-recall accuracy reward that enhances accurate decision-making. We explore its effectiveness across diverse fine-grained object detection data drawn from challenging real-world environments, such as densely crowded scenes. Built upon existing MLLMs, DIP-R1 achieves consistent and significant improvement across various in-domain and out-of-domain scenarios. It also outperforms various existing baseline models and supervised fine-tuning methods. Our findings highlight the substantial potential of integrating RL into MLLMs for enhancing capabilities in complex real-world perception tasks.
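Abstract (reference translation): Multimodal Large Language Models (MLLMs) show visual understanding capabilities, but their fine-grained visual perception remains limited in complex real-world scenarios such as densely crowded public areas. Inspired by the success of reinforcement learning (RL) in both LLMs and MLLMs, this paper examines how RL can enhance the visual perception ability of MLLMs. We then develop Deep Inspection and Perception with RL (DIP-R1), a new RL-based framework that improves the visual perception capabilities of MLLMs by interpreting complex scenes and looking closely at visual instances. DIP-R1 guides MLLMs through detailed inspection of visual scenes via three simple rule-based reward models. First, we adopt a standard reasoning reward that encourages the model to include three step-by-step processes: 1) reasoning to understand the visual scene, 2) observing regions that are of interest but ambiguous, and 3) decision-making to predict the answer. Second, a variance-guided looking reward is designed to examine uncertain regions in the second, observing process. This enables the model to inspect ambiguous regions and improves its ability to mitigate perceptual uncertainty. Third, we model a weighted precision-recall accuracy reward to support accurate decision-making. We examine its effectiveness across a variety of fine-grained object detection data drawn from challenging real-world environments, such as densely crowded scenes. Built on existing MLLMs, DIP-R1 achieves consistent and significant improvements across various in-domain and out-of-domain scenarios. It also outperforms existing baseline models and supervised fine-tuning methods. This study shows the potential of incorporating RL into MLLMs to improve capabilities in complex real-world perception tasks.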
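
Illustrative sketch (not from the paper): the abstract names the rule-based rewards but gives no formulas, so the Python below is only one plausible reading of two of them. The tag names `reason`, `observe`, and `answer`, the greedy IoU matching, the IoU threshold of 0.5, and the `w_precision` weight are all assumptions made for illustration. The variance-guided looking reward is left out because the abstract does not say how the variance is computed.

```python
import re
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

# Hypothetical tag names: the abstract names the three stages, not their markup.
STAGES = ("reason", "observe", "answer")


def format_reward(response: str) -> float:
    """Return 1.0 if the response is made of the three stage blocks in order."""
    pattern = r"\s*".join(rf"<{t}>.*?</{t}>" for t in STAGES)
    matched = re.fullmatch(rf"\s*{pattern}\s*", response, flags=re.DOTALL)
    return 1.0 if matched else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def weighted_pr_reward(pred: List[Box], gt: List[Box],
                       w_precision: float = 0.5,
                       iou_thr: float = 0.5) -> float:
    """Weighted sum of precision and recall after greedy one-to-one matching.

    Assumed form: w * precision + (1 - w) * recall. The paper may weight
    or combine the two terms differently.
    """
    if not pred or not gt:
        # Both empty counts as a perfect prediction; otherwise nothing matches.
        return 1.0 if not pred and not gt else 0.0
    unmatched = list(gt)
    true_pos = 0
    for p in pred:
        best = max(unmatched, key=lambda g: iou(p, g), default=None)
        if best is not None and iou(p, best) >= iou_thr:
            unmatched.remove(best)  # each ground-truth box is matched once
            true_pos += 1
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return w_precision * precision + (1.0 - w_precision) * recall
```

Under these assumptions, raising `w_precision` penalizes spurious boxes more heavily, while lowering it penalizes missed objects more heavily, which is the kind of trade-off a weighted precision-recall reward would expose in crowded scenes.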

Related papers