Fugu-MT Paper Translation (Abstract): VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes

Paper overview: VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes

arxiv url: http://arxiv.org/abs/2605.26380v1
Date: Mon, 25 May 2026 23:01:05 GMT
Status: Translation complete
System update date: 2026-05-27 17:51:41.499782
Title: VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes
Title (reference translation): VisualNeedle: A Benchmark for Active Visual Search in Information-Dense Scenes
Authors: Jingru Chen, Yiming Liu, Mingtao Chen, Sijie Chen, Richeng Xuan, Liang Yang, Zhichao Hu, Fanyang Lu,
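Abstract summary: Multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. Prior studies have identified three shortcuts that inflate benchmark performance. These findings suggest that higher input resolution or larger question pools alone do not elicit genuine active visual search.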
Reference score (independently computed attention score): 10.793484358165934
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. However, such scores do not necessarily imply faithful use of visual evidence. Prior studies have identified three shortcuts that inflate benchmark performance. First, linguistic priors and lexical cues in questions often enable models to infer plausible answers without seeing the image. Second, coarse global semantics from the visual encoder can bypass fine-grained local details. Third, in some "think-with-images" benchmarks, corrupting the intermediate images returned by visual tools barely affects the final answer. These findings suggest that higher input resolution or larger question pools alone do not elicit genuine active visual search. To address this, we introduce VisualNeedle, a challenging, information-dense, and fine-grained benchmark for scenes where critical evidence is spatially constrained to minute regions and not discernible at a glance. We further propose a counterfactual crop-black setting, which replaces crops returned by tools with black images of the same size, to test whether tool-enabled performance truly relies on intermediate visual evidence. We evaluate 9 prominent MLLMs across three settings: no-tool, standard tool-enabled, and crop-black. No-tool accuracy stays below 20%, and the best tool-enabled model reaches only 56.01%, still trailing the 63.00% human majority-vote accuracy. These results reveal persistent limitations in fine-grained visual search, while the crop-black ablation confirms that success on VisualNeedle hinges on genuine intermediate visual evidence.
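The crop-black setting described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `crop`, `black_like`, and `make_crop_tool`, the nested-list image representation, and the `(left, top, right, bottom)` box convention are all assumptions made for illustration. The only behavior taken from the abstract is that the crop returned by the tool is replaced with a black image of the same size.

```python
from typing import Callable, List, Tuple

# Hypothetical types for illustration: an image is a list of rows of RGB pixels.
Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Return the sub-image covered by box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def black_like(image: Image) -> Image:
    """Return an all-black image with the same height and width as image."""
    return [[(0, 0, 0) for _ in row] for row in image]


def make_crop_tool(image: Image, crop_black: bool = False) -> Callable[[Box], Image]:
    """Build a crop tool; with crop_black=True it returns a black image of the crop's size."""

    def tool(box: Box) -> Image:
        result = crop(image, box)
        # Counterfactual setting: keep the crop's shape, discard its visual content.
        return black_like(result) if crop_black else result

    return tool
```

Under these assumptions, a model is run twice with the same tool interface, once with `make_crop_tool(image)` and once with `make_crop_tool(image, crop_black=True)`; if its accuracy barely changes, its answers did not depend on the returned crops.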
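Abstract (reference translation): Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. However, such scores do not necessarily mean that visual evidence is used faithfully. Prior studies have identified three shortcuts that inflate benchmark performance. First, linguistic priors and lexical cues in questions often allow models to infer plausible answers without seeing the image. Second, coarse global semantics from the visual encoder can bypass fine-grained local details. Third, in some "think-with-images" benchmarks, corrupting the intermediate images returned by visual tools has almost no effect on the final answer. These findings suggest that higher input resolution or larger question pools alone do not elicit genuine active visual search. To address this, we introduce VisualNeedle, a challenging, information-dense, fine-grained benchmark for scenes where critical evidence is spatially confined to minute regions and cannot be discerned at a glance. We further propose a counterfactual crop-black setting, which replaces the crops returned by tools with black images of the same size, to test whether tool-enabled performance truly relies on intermediate visual evidence. We evaluate 9 prominent MLLMs across three settings: no-tool, standard tool-enabled, and crop-black. No-tool accuracy stays below 20%, and the best tool-enabled model reaches only 56.01%, still trailing the 63.00% human majority-vote accuracy. These results reveal persistent limitations in fine-grained visual search, while the crop-black ablation confirms that success on VisualNeedle depends on genuine intermediate visual evidence.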

Related papers