Fugu-MT Paper Translation (Abstract): Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection

Paper overview: Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection

arxiv url: http://arxiv.org/abs/2603.24181v1
Date: Wed, 25 Mar 2026 11:00:22 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.252519
Title: Unlocking Few-Shot Capabilities in LVLMs via Prompt Conditioning and Head Selection
Authors: Adhemar de Senneville, Xavier Bou, Jérémy Anger, Rafael Grompone, Gabriele Facciolo
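Abstract summary: This paper shows that the class separability of LVLM visual features can be improved at inference time using prompt conditioning. It introduces Head Ensemble Classifiers (HEC) to bridge the performance gap between CLIP-based and LVLM-based classification methods.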
Reference score (independently computed attention measure): 12.487816927241056
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Current Large Vision Language Models (LVLMs) excel at many zero-shot tasks like image captioning, visual question answering and OCR. However, these same models suffer from poor performance at image classification tasks, underperforming against CLIP-based methods. Notably, this gap is surprising because many LVLMs use CLIP-pretrained vision encoders. Yet LVLMs are not inherently limited by CLIP's architecture with independent vision and text encoders. In CLIP, this separation biases classification toward class-name matching rather than joint visual-text reasoning. In this paper we show that, despite their poor raw performance, LVLMs can improve visual feature class separability at inference using prompt conditioning, and LVLMs' internal representations, especially attention heads, can outperform the model itself at zero-shot and few-shot classification. We introduce Head Ensemble Classifiers (HEC) to bridge the performance gap between CLIP-based and LVLM-based classification methods. Inspired by Gaussian Discriminant Analysis, HEC ranks the most discriminative vision and text heads and combines them into a training-free classifier. We show that HEC achieves state-of-the-art performance in few-shot and zero-shot classification across 12 datasets.
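The abstract describes HEC only at a high level: rank discriminative heads, then combine them into a training-free classifier inspired by Gaussian Discriminant Analysis. The following is a minimal sketch of that general idea, not the authors' implementation. It assumes per-head feature arrays of shape `(n_heads, n_samples, dim)` have already been extracted, ranks heads with a Fisher-style separability ratio (an assumption, since the abstract does not state the ranking criterion), and sums shared-covariance Gaussian scores over the selected heads. All function names and the `shrink` and `top_k` parameters are hypothetical.

```python
import numpy as np


def class_means(feats, labels, n_classes):
    """Mean feature vector per class; assumes every class has at least one sample."""
    return np.stack([feats[labels == c].mean(axis=0) for c in range(n_classes)])


def fit_gda(feats, labels, n_classes, shrink=1e-2):
    """Fit class means and a shared covariance, returning (means, precision)."""
    means = class_means(feats, labels, n_classes)
    centered = feats - means[labels]
    cov = centered.T @ centered / len(feats)
    cov += shrink * np.eye(feats.shape[1])  # ridge term keeps the inverse stable
    return means, np.linalg.inv(cov)


def gda_scores(feats, means, precision):
    """Per-class score: -0.5 * squared Mahalanobis distance (equal class priors)."""
    diff = feats[:, None, :] - means[None, :, :]  # (n, n_classes, dim)
    return -0.5 * np.einsum("ncd,de,nce->nc", diff, precision, diff)


def separability(feats, labels, n_classes):
    """Fisher-style ratio of between-class to within-class scatter (assumed criterion)."""
    means = class_means(feats, labels, n_classes)
    counts = np.bincount(labels, minlength=n_classes)
    within = ((feats - means[labels]) ** 2).sum()
    between = (counts[:, None] * (means - feats.mean(axis=0)) ** 2).sum()
    return between / (within + 1e-12)


def head_ensemble_predict(support, labels, query, n_classes, top_k):
    """Rank heads on the labelled support set, then sum GDA scores of the top_k heads."""
    ranking = np.argsort([-separability(h, labels, n_classes) for h in support])
    chosen = ranking[:top_k]
    total = np.zeros((query.shape[1], n_classes))
    for h in chosen:
        means, precision = fit_gda(support[h], labels, n_classes)
        total += gda_scores(query[h], means, precision)
    return total.argmax(axis=1), chosen
```

Nothing here is trained by gradient descent: the only fitted quantities are class means and one covariance matrix per head, which is what makes a classifier of this kind training-free in the few-shot setting. How the paper handles the zero-shot case and text heads is not specified in the abstract and is not covered by this sketch.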

Related papers