Fugu-MT Paper Translation (Abstract): A Retrospect to Multi-prompt Learning across Vision and Language

Paper abstract: A Retrospect to Multi-prompt Learning across Vision and Language

arxiv url: http://arxiv.org/abs/2511.00191v1
Date: Fri, 31 Oct 2025 18:50:35 GMT
Status: Translation complete
Last updated in system: 2025-11-05 16:37:26.662172
Title: A Retrospect to Multi-prompt Learning across Vision and Language
Title (reference translation): A Retrospective on Multi-Prompt Learning across Vision and Language
Authors: Ziliang Chen, Xin Huang, Quanlong Guan, Liang Lin, Weiqi Luo
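Abstract summary: This paper proposes Energy-based Multi-prompt Learning (EMPL). EMPL is not only parameter-efficient but also rigorously leads to a balance between in-domain and out-of-domain open-vocabulary generalization.
Reference score (independently computed attention score): 57.957750464643226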
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The vision community is making unprecedented progress with the emergence of Vision-Language Pretraining Models (VLMs). Prompt learning serves as the holy grail for accessing VLMs, since it enables their fast adaptation to downstream tasks with limited resources. However, existing research focuses on single-prompt paradigms and rarely investigates the technical potential of their multi-prompt learning counterparts. This paper aims to provide a principled retrospect for vision-language multi-prompt learning. We extend the recent constant modality gap phenomenon to learnable prompts and then justify, empirically and theoretically, the superiority of vision-language transfer with multi-prompt augmentation. Based on this observation, we propose Energy-based Multi-prompt Learning (EMPL), which generates multiple prompt embeddings by drawing instances from an energy-based distribution that is implicitly defined by VLMs. As a result, EMPL is not only parameter-efficient but also rigorously leads to a balance between in-domain and out-of-domain open-vocabulary generalization. Comprehensive experiments have been conducted to justify our claims and the excellence of EMPL.
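Abstract (reference translation): With the emergence of Vision-Language Pretraining Models (VLMs), the vision community is making unprecedented progress. Prompt learning serves as the holy grail of access to VLMs because it enables fast adaptation to downstream tasks with limited resources. In contrast to the existing research on single-prompt paradigms, the technical potential behind multi-prompt learning is rarely investigated. This paper presents a principled retrospective on vision-language multi-prompt learning. We extend the recent constant modality gap phenomenon to learnable prompts and justify, empirically and theoretically, the superiority of vision-language transfer with multi-prompt augmentation. We propose Energy-based Multi-prompt Learning (EMPL), which generates multiple prompt embeddings by drawing instances from an energy-based distribution implicitly defined by VLMs. EMPL is therefore not only parameter-efficient but also rigorously leads to a balance between in-domain and out-of-domain open-vocabulary generalization. Comprehensive experiments have been conducted to justify our claims and the excellence of EMPL.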
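
Illustrative sketch (not from the paper): the abstract describes generating multiple prompt embeddings by drawing instances from an energy-based distribution, but it does not specify the energy function or the sampler. The Python snippet below only illustrates the general idea of sampling from p(x) proportional to exp(-E(x)) with unadjusted Langevin dynamics. The quadratic toy energy, the function names `energy_grad` and `langevin_samples`, the step size, and the 3-dimensional "embedding" are all assumptions made for illustration and should not be read as EMPL's actual procedure.

```python
import math
import random


def energy_grad(x, center):
    """Gradient of the toy energy E(x) = 0.5 * ||x - center||^2 (an assumption)."""
    return [xi - ci for xi, ci in zip(x, center)]


def langevin_samples(center, n_samples=4, n_steps=200, step_size=0.05, seed=0):
    """Draw samples from p(x) proportional to exp(-E(x)) via unadjusted Langevin dynamics."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step_size)
    samples = []
    for _ in range(n_samples):
        # Start each chain from standard Gaussian noise.
        x = [rng.gauss(0.0, 1.0) for _ in center]
        for _ in range(n_steps):
            grad = energy_grad(x, center)
            # Langevin update: x <- x - step * grad E(x) + sqrt(2 * step) * noise
            x = [
                xi - step_size * gi + noise_scale * rng.gauss(0.0, 1.0)
                for xi, gi in zip(x, grad)
            ]
        samples.append(x)
    return samples


if __name__ == "__main__":
    # Hypothetical 3-dimensional "prompt embedding" mode, chosen arbitrarily.
    for sample in langevin_samples([1.0, -2.0, 0.5]):
        print([round(v, 3) for v in sample])
```

Each chain yields one sample, so several chains give several distinct "embeddings" scattered around the low-energy region. In a real vision-language setting the energy would presumably be defined through the VLM rather than a fixed quadratic, which this sketch does not attempt to model.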


Related papers