Fugu-MT Paper Translation (Summary): Feature Projection Learning for Better Vision-Language Reasoning

Paper overview: Feature Projection Learning for Better Vision-Language Reasoning

arxiv url: http://arxiv.org/abs/2601.20224v1
Date: Wed, 28 Jan 2026 03:54:36 GMT
Status: Translation complete
System update date: 2026-01-29 15:46:06.756141
Title: Feature Projection Learning for Better Vision-Language Reasoning
Title (reference translation): Feature Projection Learning for Vision-Language Reasoning
Authors: Yi Zhang, Weicheng Lin, Liang-Jie Zhang
Abstract summary: We propose a method called Feature Projection Learning (FPL) to address these problems. FPL delivers superior accuracy, surpassing state-of-the-art methods by a substantial margin.
Reference score (independently computed attention score): 9.360663595659988
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Pre-Trained (VLP) models that utilize contrastive learning, notably CLIP, have proven highly adept at extracting generalizable visual features. To inherit the well-learned knowledge of VLP models for downstream tasks, several approaches aim to adapt them efficiently with limited supervision. However, these methods suffer from limited performance, excessive learnable parameters, or extended training times, all of which hinder their effectiveness in adapting the CLIP model to downstream tasks. In this work, we propose a simple yet efficient and effective method called Feature Projection Learning (FPL) to address these problems. Specifically, we develop a projection model that projects class prototype features into the query image feature space and reconstructs the query image feature map. The negative average squared reconstruction error is used as the class score. In this way, we transform the classification problem into a feature projection problem. The final output of this method is a combination of the prediction from the projection model and the prediction from the original pre-trained CLIP. Comprehensive empirical evaluations confirm that FPL delivers superior accuracy, surpassing the current state-of-the-art methods by a substantial margin.
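Abstract (reference translation): Vision-language pre-trained models that use contrastive learning, CLIP in particular, have proven very well suited to extracting generalizable visual features. To inherit the well-learned knowledge of VLP models for downstream tasks, several approaches aim to adapt VLP models efficiently with limited supervision. However, these methods suffer from limited performance, excessive learnable parameters, or long training times, all of which hinder the effectiveness of adapting the CLIP model to downstream tasks. In this work, we propose Feature Projection Learning (FPL), a simple, efficient, and effective method, to address these problems. Specifically, we develop a projection model that projects class prototype features into the query image feature space and reconstructs the query image feature map. The negative mean squared reconstruction error is used as the class score. In this way, the classification problem is transformed into a feature projection problem. The final output of the method is a combination of the predictions from the projection model and from the original pre-trained CLIP. Comprehensive empirical evaluations confirm that FPL achieves superior accuracy, surpassing state-of-the-art methods by a substantial margin.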
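
The abstract does not say what form the projection model takes. The following is a minimal sketch of the scoring idea only, assuming a closed-form ridge regression stands in for the projection model and that the two predictions are combined by a weighted sum. The function names (`reconstruct`, `class_scores`, `combined_logits`), the parameters `lam` and `alpha`, and the toy data are hypothetical illustrations, not the authors' implementation.

```python
import numpy as np


def reconstruct(query, prototypes, lam=0.1):
    """Reconstruct query features from one class's prototype features.

    query:      (n, d) array of n local features from the query image.
    prototypes: (k, d) array of k prototype features for one class.
    Assumption: ridge regression is used as the projection, i.e. we find
    W minimizing ||query - W @ prototypes||^2 + lam * ||W||^2.
    """
    gram = prototypes @ prototypes.T                     # (k, k)
    rhs = prototypes @ query.T                           # (k, n)
    ridge = gram + lam * np.eye(prototypes.shape[0])
    weights = np.linalg.solve(ridge, rhs).T              # (n, k)
    return weights @ prototypes                          # (n, d)


def class_scores(query, class_prototypes, lam=0.1):
    """Negative mean squared reconstruction error, one score per class."""
    return np.array([
        -np.mean((query - reconstruct(query, protos, lam)) ** 2)
        for protos in class_prototypes
    ])


def combined_logits(query, class_prototypes, clip_logits, alpha=1.0, lam=0.1):
    """Assumed combination rule: CLIP logits plus weighted projection scores."""
    scores = class_scores(query, class_prototypes, lam)
    return np.asarray(clip_logits, dtype=float) + alpha * scores


# Toy data: two classes, three prototypes each, 8-dimensional features.
rng = np.random.default_rng(0)
class_prototypes = [rng.normal(size=(3, 8)) for _ in range(2)]

# Build a query whose local features lie close to the span of class 1.
mixing = rng.normal(size=(5, 3))
query = mixing @ class_prototypes[1] + 0.01 * rng.normal(size=(5, 8))

scores = class_scores(query, class_prototypes)
logits = combined_logits(query, class_prototypes, clip_logits=[0.0, 0.0])
predicted_class = int(np.argmax(logits))
```

In this sketch a class whose prototypes span the query features well yields a small reconstruction error and therefore a score close to zero, while a poorly matching class yields a more negative score. The weight `alpha` controls how much the projection scores shift the CLIP prediction.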
