Fugu-MT paper translation (abstract): KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge

Paper overview: KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge

arxiv url: http://arxiv.org/abs/2508.14080v1
Date: Tue, 12 Aug 2025 19:43:44 GMT
Status: translation complete
Last updated in system: 2025-08-21 16:52:41.165087
Title: KnowDR-REC: A Benchmark for Referring Expression Comprehension with Real-World Knowledge
Title (reference translation): KnowDR-REC: A benchmark for referring expression comprehension using real-world knowledge
Authors: Guanghao Jin, Jingpei Wu, Tianpei Guo, Yiyi Niu, Weidong Zhou, Guoyang Liu,
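Abstract summary: This work proposes KnowDR-REC, a benchmark built on real-world knowledge. The authors evaluate 16 state-of-the-art multimodal models on KnowDR-REC, and the experimental results show that existing MLLMs struggle with knowledge-driven visual grounding tasks.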
Reference score (independently computed attention score): 1.5833270109954136
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Referring Expression Comprehension (REC) is a popular multimodal task that aims to accurately detect target objects within a single image based on a given textual expression. However, due to the limitations of earlier models, traditional REC benchmarks either rely solely on intra-image cues or lack sufficiently fine-grained instance annotations, making them inadequate for evaluating the reasoning capabilities of Multi-modal Large Language Models (MLLMs). To address this gap, we propose a new benchmark, KnowDR-REC, characterized by three key features: Firstly, it is built upon real-world knowledge, requiring fine-grained multimodal reasoning across text and image. Secondly, the dataset includes elaborately constructed negative samples via fine-grained expression editing, designed to evaluate a model's robustness and anti-hallucination ability. Lastly, we introduce three novel evaluation metrics to systematically explore the model's internal reasoning process. We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, with experimental results showing that existing MLLMs still struggle with knowledge-driven visual grounding tasks. Furthermore, we observe a decoupling between textual understanding and visual grounding in MLLMs, where many models are significantly influenced by memorized shortcut correlations, which severely affect their behavior on our benchmark and hinder genuine multimodal reasoning. We anticipate that the proposed benchmark will inspire future research towards developing more robust, interpretable, and knowledge-intensive visual grounding frameworks, driving the development of more reliable and robust multimodal systems for complex real-world scenarios.
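Abstract (reference translation): Referring Expression Comprehension (REC) is a common multimodal task that aims to accurately detect target objects within a single image based on a given textual expression. However, because of the limitations of earlier models, traditional REC benchmarks either rely solely on intra-image cues or lack sufficiently fine-grained instance annotations, which makes them inadequate for evaluating the reasoning capabilities of Multi-modal Large Language Models (MLLMs). To address this gap, we propose a new benchmark, KnowDR-REC, with three key features. First, it is built on real-world knowledge and requires fine-grained multimodal reasoning across text and image. Second, the dataset includes carefully constructed negative samples, produced by fine-grained expression editing, that are designed to evaluate a model's robustness and resistance to hallucination. Finally, we introduce three new evaluation metrics to systematically explore the model's internal reasoning process. We evaluate 16 state-of-the-art multimodal models on KnowDR-REC, and the experimental results show that existing MLLMs still struggle with knowledge-driven visual grounding tasks. We also observe a decoupling between textual understanding and visual grounding in MLLMs: many models are significantly influenced by memorized shortcut correlations, which severely affect their behavior on the benchmark and hinder genuine multimodal reasoning. We expect the proposed benchmark to inspire future research on more robust, interpretable, and knowledge-intensive visual grounding frameworks, and to drive the development of more reliable multimodal systems for complex real-world scenarios.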

Related papers