Fugu-MT Paper Translation (Abstract): Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage

Paper overview: Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage

arxiv url: http://arxiv.org/abs/2601.22483v1
Date: Fri, 30 Jan 2026 02:46:55 GMT
Status: Translation complete
Last updated in system: 2026-02-02 18:28:15.176924
Title: Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimage
Title (reference translation): Head-Aware Visual Cropping: Enhancing Fine-Grained VQA with Attention-Guided Subimages
Authors: Junfei Xie, Peng Pan, Xulong Zhang
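Abstract summary: We propose Head Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. Experiments on multiple fine-grained VQA benchmarks show that HAVC consistently outperforms state-of-the-art cropping strategies.
Reference score (attention score computed independently by this site): 4.771792258699647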
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) show strong performance in Visual Question Answering (VQA) but remain limited in fine-grained reasoning due to low-resolution inputs and noisy attention aggregation. We propose Head Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by leveraging a selectively refined subset of attention heads. HAVC first filters heads through an OCR-based diagnostic task, ensuring that only those with genuine grounding ability are retained. At inference, these heads are further refined using spatial entropy for stronger spatial concentration and gradient sensitivity for predictive contribution. The fused signals produce a reliable Visual Cropping Guidance Map, which highlights the most task-relevant region and guides the cropping of a subimage that is subsequently provided to the MLLM together with the image-question pair. Extensive experiments on multiple fine-grained VQA benchmarks demonstrate that HAVC consistently outperforms state-of-the-art cropping strategies, achieving more precise localization and stronger visual grounding, and providing a simple yet effective strategy for enhancing precision in MLLMs.
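Abstract (reference translation): Multimodal Large Language Models (MLLMs) perform strongly on Visual Question Answering (VQA), but their fine-grained reasoning is limited by low-resolution inputs and noisy attention aggregation. This paper proposes Head Aware Visual Cropping (HAVC), a training-free method that improves visual grounding by using a selectively refined subset of attention heads. HAVC first filters heads through an OCR-based diagnostic task so that only heads with genuine grounding ability are retained. At inference, these heads are refined further using spatial entropy, which favors stronger spatial concentration, and gradient sensitivity, which reflects predictive contribution. The fused signals produce a reliable Visual Cropping Guidance Map that highlights the most task-relevant region and guides the cropping of a subimage, which is then given to the MLLM together with the image-question pair. Extensive experiments on multiple fine-grained VQA benchmarks show that HAVC consistently outperforms state-of-the-art cropping strategies, achieving more precise localization and stronger visual grounding, and offering a simple yet effective strategy for improving precision in MLLMs.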
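
The inference-time steps described in the abstract (score heads, fuse the best ones into a guidance map, crop the highlighted region) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract gives no formulas, so everything specific here is an assumption made for illustration: spatial entropy is taken to be the Shannon entropy of a head's normalized attention map, the gradient-sensitivity scores are supplied as a plain array rather than computed from a model, the two criteria are fused by summing ranks, and the crop is the bounding box of cells above a threshold. The names `spatial_entropy`, `guidance_map`, and `crop_box` are hypothetical.

```python
import numpy as np

def spatial_entropy(attn_map):
    """Shannon entropy of one head's attention over patches (assumed definition).
    Lower values mean the attention is more spatially concentrated."""
    p = attn_map.ravel().astype(float)
    p = p / p.sum()
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum())

def guidance_map(head_maps, grad_scores, keep=2):
    """Fuse the `keep` best heads into one normalized map.
    head_maps:   (H, h, w) nonnegative attention per head
    grad_scores: (H,) hypothetical per-head gradient-sensitivity scores
    """
    head_maps = np.asarray(head_maps, dtype=float)
    entropies = np.array([spatial_entropy(m) for m in head_maps])
    # Rank-sum fusion (an assumption): low entropy and high gradient score are both good.
    ent_rank = np.argsort(np.argsort(entropies, kind="stable"), kind="stable")
    grad_rank = np.argsort(np.argsort(-np.asarray(grad_scores), kind="stable"), kind="stable")
    chosen = np.argsort(ent_rank + grad_rank, kind="stable")[:keep]
    fused = head_maps[chosen].mean(axis=0)
    return fused / fused.max(), chosen

def crop_box(gmap, threshold=0.5):
    """End-exclusive bounding box (row0, row1, col0, col1) of cells >= threshold."""
    rows, cols = np.where(gmap >= threshold)
    return int(rows.min()), int(rows.max()) + 1, int(cols.min()), int(cols.max()) + 1

# Toy input: heads 0 and 1 focus on the same 2x2 patch region, heads 2 and 3 are diffuse.
maps = np.ones((4, 8, 8))
maps[0, 2:4, 5:7] = 50.0
maps[1, 2:4, 5:7] = 40.0
grad = np.array([0.9, 0.8, 0.1, 0.2])

gmap, chosen = guidance_map(maps, grad, keep=2)
box = crop_box(gmap)
print(sorted(chosen.tolist()), box)  # [0, 1] (2, 4, 5, 7)
```

The box is expressed in patch-grid coordinates. To crop an actual image, it would be scaled by the patch size before slicing, and the resulting subimage would be passed to the model alongside the original image and question, as the abstract describes.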

Related papers