Fugu-MT Paper Translation (Abstract): Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification

Paper overview: Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification

arxiv url: http://arxiv.org/abs/2603.24528v1
Date: Wed, 25 Mar 2026 17:04:43 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.400191
Title: Cross-Modal Prototype Alignment and Mixing for Training-Free Few-Shot Classification
Authors: Dipam Goswami, Simone Magistri, Gido M. van de Ven, Bartłomiej Twardowski, Andrew D. Bagdanov, Tinne Tuytelaars, Joost van de Weijer
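Abstract summary: This work investigates the impact of directly mixing image and text prototypes for few-shot classification. It shows that mixing prototypes acts like a shrinkage estimator. It also proposes projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace.
Reference score (attention score computed independently by this site): 52.48204114948899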
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language models (VLMs) like CLIP are trained with the objective of aligning text and image pairs. To improve CLIP-based few-shot image classification, recent works have observed that, along with text embeddings, image embeddings from the training set are an important source of information. In this work we investigate the impact of directly mixing image and text prototypes for few-shot classification and analyze this from a bias-variance perspective. We show that mixing prototypes acts like a shrinkage estimator. Although mixed prototypes improve classification performance, the image prototypes still add some noise in the form of instance-specific background or context information. In order to capture only information from the image space relevant to the given classification task, we propose projecting image prototypes onto the principal directions of the semantic text embedding space to obtain a text-aligned semantic image subspace. These text-aligned image prototypes, when mixed with text embeddings, further improve classification. However, for downstream datasets with poor cross-modal alignment in CLIP, semantic alignment might be suboptimal. We show that the image subspace can still be leveraged by modeling the anisotropy using class covariances. We demonstrate that combining a text-aligned mixed prototype classifier and an image-specific LDA classifier outperforms existing methods across few-shot classification benchmarks.
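The two core ideas in the abstract, mixing text and image prototypes and projecting image prototypes onto the principal directions of the text embedding space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the mixing weight `alpha`, the number of directions `k`, the centering of text embeddings before the SVD, and the synthetic data are all assumptions made for illustration; real use would substitute CLIP text and image embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length, as is usual for CLIP-style embeddings."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def image_prototypes(features, labels, num_classes):
    """Per-class mean of the support features, re-normalized to unit length."""
    protos = np.stack(
        [features[labels == c].mean(axis=0) for c in range(num_classes)]
    )
    return l2_normalize(protos)


def text_aligned_projector(text_emb, k):
    """Orthogonal projector onto the top-k principal directions of the
    text embeddings (assumption: embeddings are centered before the SVD)."""
    centered = text_emb - text_emb.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]            # (k, d), orthonormal rows
    return basis.T @ basis    # (d, d), symmetric and idempotent


def mixed_prototypes(text_emb, img_protos, alpha, projector=None):
    """Convex combination of text and image prototypes.

    The text embedding acts as a fixed target that the noisy few-shot
    image mean is shrunk toward (alpha = 1 means text only).
    """
    if projector is not None:
        # Keep only the text-aligned component of each image prototype.
        img_protos = l2_normalize(img_protos @ projector)
    return l2_normalize(alpha * text_emb + (1.0 - alpha) * img_protos)


def classify(queries, prototypes):
    """Assign each query to the prototype with the highest cosine similarity."""
    return np.argmax(l2_normalize(queries) @ prototypes.T, axis=1)


# Synthetic stand-ins for CLIP embeddings (hypothetical data).
rng = np.random.default_rng(0)
num_classes, dim, shots = 5, 32, 4
centers = l2_normalize(rng.normal(size=(num_classes, dim)))
text_emb = l2_normalize(centers + 0.05 * rng.normal(size=centers.shape))
labels = np.repeat(np.arange(num_classes), shots)
support = l2_normalize(centers[labels] + 0.3 * rng.normal(size=(labels.size, dim)))

img_protos = image_prototypes(support, labels, num_classes)
projector = text_aligned_projector(text_emb, k=num_classes - 1)
protos = mixed_prototypes(text_emb, img_protos, alpha=0.5, projector=projector)
preds = classify(support, protos)
```

In this sketch, `alpha` controls the bias-variance trade-off: a larger value leans on the text embedding (lower variance, possible bias when cross-modal alignment is poor), while a smaller value leans on the few-shot image mean. The image-specific LDA classifier mentioned in the abstract is not covered here.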

Related papers