Fugu-MT Paper Translation (Abstract): UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs

Paper summary: UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs

arxiv url: http://arxiv.org/abs/2605.12237v1
Date: Tue, 12 May 2026 15:07:30 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.950578
Title: UHR-Micro: Diagnosing and Mitigating the Resolution Illusion in Earth Observation VLMs
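Title (reference translation): UHR-Micro: Diagnosis and Mitigation of the Resolution Illusion in Earth Observation VLMs
Authors: Shuo Ni, Tong Wang, Jing Zhang, He Chen, Haonan Guo, Ning Zhang, Bo Du
Abstract summary: Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. The paper benchmarks this challenge with UHR-Micro, a benchmark of 11,253 instructions grounded in 1,212 UHR images.
Reference score (independently computed attention score): 40.3198846405438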
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. We refer to this empirical gap as a "resolution illusion": higher input resolution provides the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, we introduce UHR-Micro, a benchmark comprising 11,253 instructions grounded in 1,212 UHR images, designed to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing, despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, we propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code were released at https://github.com/MiliLab/UHR-Micro.
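Abstract (reference translation): Vision-Language Models (VLMs) increasingly operate on ultra-high-resolution (UHR) Earth observation imagery, yet they remain vulnerable to a severe scale mismatch between large-scale scene context and micro-scale targets. The authors call this gap a "resolution illusion": higher input resolution gives the appearance of richer visual detail, but does not necessarily yield reliable perception of spatially small, task-relevant evidence. To benchmark this challenge, they introduce UHR-Micro, comprising 11,253 instructions grounded in 1,212 UHR images, to evaluate VLMs at the spatial limits of native Earth observation imagery. UHR-Micro spans diverse micro-target scales, context requirements, task families, and visual conditions, and provides diagnostic annotations that support controlled evaluation and fine-grained error attribution. Experiments with representative high-resolution VLMs show substantial failures in spatial grounding and evidence parsing despite access to high-resolution inputs. Further analysis suggests that these failures are not fully resolved by increasing model capacity, but are closely tied to insufficient guidance in locating and using task-relevant micro-evidence. Motivated by this finding, the authors propose Micro-evidence Active Perception (MAP), a reference agent that decomposes queries into evidence-seeking steps, actively inspects candidate regions, and grounds its answers in localized observations. MAP-Agent improves micro-level perception by making high-resolution reasoning evidence-centered rather than image-centered. Together, UHR-Micro and MAP-Agent provide a diagnostic platform for evaluating, understanding, and advancing high-resolution reasoning in Earth observation VLMs. Datasets and source code are released at https://github.com/MiliLab/UHR-Micro.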
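
The abstract describes MAP only at a high level: inspect candidate regions, then ground the answer in localized observations. The following is a minimal sketch of that general idea, not the authors' MAP-Agent implementation. The names `tile_boxes`, `inspect_regions`, and the `score` callback are hypothetical and introduced only for illustration. In practice the relevance score would presumably come from a VLM or detector, which this sketch replaces with a plain function.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def tile_boxes(width: int, height: int, tile: int, stride: int) -> List[Box]:
    """Enumerate candidate windows that together cover the full image."""
    def starts(size: int) -> List[int]:
        if size <= tile:
            return [0]
        s = list(range(0, size - tile + 1, stride))
        if s[-1] != size - tile:
            s.append(size - tile)  # add a final window flush with the far edge
        return s

    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height) for x in starts(width)]


def inspect_regions(boxes: List[Box],
                    score: Callable[[Box], float],
                    top_k: int) -> List[Box]:
    """Keep the top_k windows by a caller-supplied relevance score.

    `score` is a stand-in (assumption) for whatever judges how likely a
    window is to contain task-relevant micro-evidence.
    """
    return sorted(boxes, key=score, reverse=True)[:top_k]


# Toy usage: pretend the micro-target sits at pixel (65, 45) of a 100x50 image.
def contains_target(box: Box) -> float:
    left, top, right, bottom = box
    return 1.0 if left <= 65 < right and top <= 45 < bottom else 0.0


candidates = tile_boxes(width=100, height=50, tile=40, stride=30)
selected = inspect_regions(candidates, contains_target, top_k=2)
```

Each selected window could then be cropped at native resolution and passed to the model on its own, so that the answer rests on the localized crop rather than on the whole scene.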

Related papers