Fugu-MT Paper Translation (Summary): PERL: Parameter Efficient Reasoning in CLIP Latent Space

Paper summary: PERL: Parameter Efficient Reasoning in CLIP Latent Space

arxiv url: http://arxiv.org/abs/2605.18464v2
Date: Tue, 19 May 2026 09:44:59 GMT
Status: Translation complete
Last updated in system: 2026-05-20 15:03:08.578515
Title: PERL: Parameter Efficient Reasoning in CLIP Latent Space
Authors: Simone Carnemolla, Salvatore Calcagno, Daniela Giordano, Concetto Spampinato, Matteo Pennisi
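Abstract summary: PERL is a lightweight adaptation framework that augments a frozen CLIP model with a compact shared reasoning module applied recurrently. PERL achieves the best parameter-performance trade-off among the compared methods under a fast-adaptation few-shot setting. The results suggest that iterative latent reasoning may provide an adaptation mechanism complementary to parameter scaling in discriminative vision-language models.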
Reference score (independently calculated attention level): 11.607257085664727
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Contrastively trained vision-language models such as CLIP provide strong zero-shot transfer by aligning images and text in a shared embedding space. However, adapting these models to downstream tasks without degrading their open-vocabulary generalization remains challenging. Existing parameter-efficient adaptation methods typically improve task specialization through learned prompts, adapters, or multimodal transformations, where adaptation capacity is primarily expressed through additional trainable parameters. Inspired by recent latent reasoning methods in language models, we investigate a complementary perspective: can adaptation emerge from iterative reasoning on latent representations rather than from increasing parameter count alone? We introduce PERL (Parameter-Efficient Reasoning in CLIP Latent Space), a lightweight adaptation framework that augments a frozen CLIP model with a compact shared reasoning module applied recurrently across refinement steps. At each step, PERL generates a latent reasoning token conditioned on the current representation and injects it into an intermediate encoder layer, progressively refining higher-level semantic representations while preserving CLIP's pretrained multimodal structure. Across 15 benchmarks spanning base-to-novel generalization, cross-dataset transfer, and out-of-distribution ImageNet variants, PERL achieves the best parameter-performance trade-off among the compared methods under a fast-adaptation few-shot setting, combining strong novel-class accuracy and competitive transfer performance with only about 6K trainable parameters, up to 817x fewer than the largest compared approach. Overall, our results suggest that iterative latent reasoning provides a complementary adaptation mechanism to parameter scaling in discriminative vision-language models.
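The recurrent refinement loop described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`frozen_upper_layers`, `reasoning_module`, `refine`), the mean-pooling stand-in for the frozen encoder, the single linear-plus-tanh module, and the choice to append the reasoning token to the intermediate tokens are all assumptions made for illustration. The only points taken from the abstract are that one shared module is reused at every step, that its output token is conditioned on the current representation, and that the token is injected at an intermediate layer while the encoder stays frozen.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def frozen_upper_layers(tokens):
    """Stand-in for the frozen upper encoder layers (assumption):
    mean-pool the tokens, then L2-normalize the result."""
    dim = len(tokens[0])
    pooled = [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]
    norm = math.sqrt(sum(p * p for p in pooled)) or 1.0
    return [p / norm for p in pooled]


def reasoning_module(weights, representation):
    """Hypothetical shared module: one linear map plus tanh.
    The same weights are reused at every refinement step."""
    return [math.tanh(z) for z in matvec(weights, representation)]


def refine(intermediate_tokens, weights, num_steps):
    """Run num_steps refinement steps and return every representation."""
    representation = frozen_upper_layers(intermediate_tokens)
    trajectory = [representation]
    for _ in range(num_steps):
        # Generate a latent token conditioned on the current representation.
        token = reasoning_module(weights, representation)
        # Inject it at the intermediate layer (here: append) and re-encode.
        representation = frozen_upper_layers(intermediate_tokens + [token])
        trajectory.append(representation)
    return trajectory


# Toy usage with random values; dimensions are arbitrary.
rng = random.Random(0)
dim = 4
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(3)]
weights = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
trajectory = refine(tokens, weights, num_steps=3)
trainable_parameters = sum(len(row) for row in weights)
```

In this sketch the number of trainable values (`trainable_parameters`) depends only on the module size and not on `num_steps`, which is the sense in which additional refinement steps add computation without adding parameters.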


Related papers