Fugu-MT Paper Translation (Abstract): Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

Paper overview: Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs

arxiv url: http://arxiv.org/abs/2603.16932v1
Date: Sat, 14 Mar 2026 10:11:32 GMT
Status: Translation complete
Updated in system: 2026-03-19 18:32:57.269809
Title: Look Where It Matters: High-Resolution Crops Retrieval for Efficient VLMs
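Title (reference translation): High-Resolution Crop Retrieval for Efficient VLMs
Authors: Nimrod Shabtay, Moshe Kimhi, Artem Spector, Sivan Haray, Ehud Rivlin, Chaim Baskin, Raja Giryes, Eli Schwartz
Abstract summary: Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency. AwaRes is a framework that operates on a low-resolution global view and uses tool calling to retrieve only the high-resolution segments needed for a given query.
Reference score (independently computed attention metric): 28.88727946733177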
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency: high-resolution inputs capture fine details but incur significant computational costs, while low-resolution inputs are efficient but can miss critical visual information, such as small text. We present AwaRes, a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool-calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: a judge compares low- vs. high-resolution answers to label whether cropping is needed, and an oracle grounding model localizes the evidence for the correct answer, which we map to a discrete crop set to form multi-turn tool-use trajectories. We train our framework with cold-start SFT followed by multi-turn GRPO with a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes
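Abstract (reference translation): Vision-language models (VLMs) typically process images at native high resolution, forcing a trade-off between accuracy and computational efficiency. AwaRes is a spatial-on-demand framework that resolves this accuracy-efficiency trade-off by operating on a low-resolution global view and using tool calling to retrieve only the high-resolution segments needed for a given query. We construct supervised data automatically: an oracle grounding model localizes the evidence for the correct answer, which is mapped to a discrete crop set to form multi-turn tool-use trajectories. We train the framework with cold-start SFT followed by multi-turn GRPO, using a composite reward that combines semantic answer correctness with explicit crop-cost penalties. Project page: https://nimrodshabtay.github.io/AwaRes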
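The abstract describes two mechanisms that a small example may make concrete: labeling whether a sample needs cropping, and a composite reward that trades answer correctness against crop cost. The following is a minimal sketch under stated assumptions, not the authors' implementation. The 2x2 quadrant crop set, the labeling rule, the linear penalty, the `crop_penalty` weight, and all names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass

# Hypothetical discrete crop set. The abstract only says the crop set is
# discrete; a 2x2 grid of quadrants is an assumption for this sketch.
CROP_SET = ("top_left", "top_right", "bottom_left", "bottom_right")


@dataclass
class Trajectory:
    requested_crops: tuple  # crop names requested through tool calls
    answer_score: float     # semantic correctness in [0, 1], e.g. from a judge


def needs_crop(low_res_correct: bool, high_res_correct: bool) -> bool:
    """Assumed labeling rule: cropping is needed only when the
    high-resolution answer is correct and the low-resolution one is not."""
    return high_res_correct and not low_res_correct


def composite_reward(traj: Trajectory, crop_penalty: float = 0.1) -> float:
    """Answer correctness minus a cost proportional to the fraction of
    distinct crops requested (an assumed linear form of the penalty)."""
    unknown = set(traj.requested_crops) - set(CROP_SET)
    if unknown:
        raise ValueError(f"unknown crops: {sorted(unknown)}")
    crop_fraction = len(set(traj.requested_crops)) / len(CROP_SET)
    return traj.answer_score - crop_penalty * crop_fraction
```

Under these assumptions, a correct answer that requests no crops scores higher than an equally correct answer that requests two, so the policy is rewarded for asking for high-resolution detail only when it changes the answer.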

Related papers