Fugu-MT Paper Translation (Summary): Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models

Paper summary: Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models

arxiv url: http://arxiv.org/abs/2509.12897v1
Date: Tue, 16 Sep 2025 09:54:01 GMT
Status: Translation complete
System update date: 2025-09-17 17:50:53.017029
Title: Cross-Layer Vision Smoothing: Enhancing Visual Understanding via Sustained Focus on Key Objects in Large Vision-Language Models
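Title (reference translation): Cross-Layer Vision Smoothing: Promoting Visual Understanding through Sustained Focus on Key Objects in Large Vision-Language Models
Authors: Jianfei Zhao, Feng Zhang, Xin Sun, Lingxing Kong, Zhixing Tan, Chong Feng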
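Abstract summary: Large Vision-Language Models (LVLMs) can accurately locate key objects in images, but their attention to these objects tends to be very brief. Motivated by the hypothesis that sustained focus on key objects improves LVLMs' visual capabilities, we propose Cross-Layer Vision Smoothing (CLVS). CLVS achieves state-of-the-art performance on a variety of visual understanding tasks.
Reference score (independently computed attention metric): 13.17978215666921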
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Vision-Language Models (LVLMs) can accurately locate key objects in images, yet their attention to these objects tends to be very brief. Motivated by the hypothesis that sustained focus on key objects can improve LVLMs' visual capabilities, we propose Cross-Layer Vision Smoothing (CLVS). The core idea of CLVS is to incorporate a vision memory that smooths the attention distribution across layers. Specifically, we initialize this vision memory with position-unbiased visual attention in the first layer. In subsequent layers, the model's visual attention jointly considers the vision memory from previous layers, while the memory is updated iteratively, thereby maintaining smooth attention on key objects. Given that visual understanding primarily occurs in the early and middle layers of the model, we use uncertainty as an indicator of completed visual understanding and terminate the smoothing process accordingly. Experiments on four benchmarks across three LVLMs confirm the effectiveness and generalizability of our method. CLVS achieves state-of-the-art performance on a variety of visual understanding tasks, with particularly significant improvements in relation and attribute understanding.
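Abstract (reference translation): Large Vision-Language Models (LVLMs) can accurately locate key objects in images, but their attention to these objects tends to be very brief. Motivated by the hypothesis that keeping the focus on key objects improves the visual capabilities of LVLMs, we propose Cross-Layer Vision Smoothing (CLVS). The core idea of CLVS is to incorporate a vision memory that smooths the attention distribution across layers. Specifically, this vision memory is initialized with position-unbiased visual attention in the first layer. In subsequent layers, the model's visual attention jointly considers the vision memory from previous layers, and the memory is updated iteratively, so that smooth attention on key objects is maintained. Given that visual understanding occurs mainly in the early and middle layers of the model, we use uncertainty as an indicator that visual understanding is complete and terminate the smoothing process accordingly. Experiments on four benchmarks across three LVLMs confirm the effectiveness and generalizability of the method. CLVS achieves state-of-the-art performance on a variety of visual understanding tasks, with particularly large improvements in relation and attribute understanding.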
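
The abstract describes the mechanism only in words, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a convex blend with a hypothetical weight `beta`, rescaling so that each layer's total attention on visual tokens is unchanged, and an entropy threshold `tau` as the uncertainty-based stopping rule. None of these specifics (names, formulas, default values) come from the abstract, and the position-unbiased initialization is taken as given in the first layer's input.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def smooth_visual_attention(layer_attn, layer_token_probs, beta=0.5, tau=0.5):
    """Hypothetical cross-layer smoothing of attention over visual tokens.

    layer_attn:        one list per layer of attention weights on visual tokens
                       (layer 0 is assumed to be already position-unbiased).
    layer_token_probs: one next-token distribution per layer, used here as an
                       assumed proxy for the model's uncertainty.
    beta:              assumed blend weight between current attention and memory.
    tau:               assumed entropy threshold; smoothing stops once the
                       uncertainty falls below it.
    """
    memory = list(layer_attn[0])        # vision memory starts from the first layer
    smoothed = [list(layer_attn[0])]
    active = True

    for attn, probs in zip(layer_attn[1:], layer_token_probs[1:]):
        # Low uncertainty is treated as "visual understanding completed".
        if active and entropy(probs) < tau:
            active = False
        if not active:
            smoothed.append(list(attn))  # leave later layers untouched
            continue

        mass = sum(attn)                 # attention mass on visual tokens
        blended = [(1.0 - beta) * a + beta * m for a, m in zip(attn, memory)]
        total = sum(blended)
        if total > 0.0:
            # Rescale so the layer keeps its original visual attention mass.
            blended = [b * mass / total for b in blended]
        else:
            blended = list(attn)

        smoothed.append(blended)
        memory = blended                 # iterative memory update (assumed rule)

    return smoothed
```

In this sketch, a visual token that received high attention in an early layer keeps part of that weight in later layers until the entropy criterion switches smoothing off, which is one simple way to read "sustained focus on key objects".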

Related papers