Fugu-MT Paper Translation (Abstract): Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

Paper abstract: Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

arxiv url: http://arxiv.org/abs/2604.09532v1
Date: Fri, 10 Apr 2026 17:48:56 GMT
Status: Translation complete
Last updated in system: 2026-04-13 17:57:53.988689
Title: Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise
Title (reference translation): Robust Vision-Guided Cross-Modal Prompt Learning
Authors: Zibin Geng, Xuefeng Jiang, Jia Li, Zheng Li, Tian Wen, Lvhua Wu, Sheng Sun, Yuwei Wang, Min Liu
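Abstract summary: We propose VisPrompt, a vision-guided prompt learning framework for noisy-label settings. We exploit a cross-modal attention mechanism to reversely inject visual semantics into prompt representations. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small number of trainable parameters.
Reference score (independently calculated attention score): 19.372722047131862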
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Prompt learning is a parameter-efficient approach for vision-language models, yet its robustness under label noise remains underexplored. Visual content contains richer and more reliable semantic information, which remains more robust under label noise. However, the prompt itself is highly susceptible to label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we exploit a cross-modal attention mechanism to reversely inject visual semantics into prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. Injecting visual information in the same way for all samples, despite differences in the quality of their visual cues, causes instability. To address this, we further introduce a lightweight conditional modulation mechanism that adaptively controls the strength of visual information injection, striking a more robust balance between text-side semantic priors and image-side instance evidence. The proposed framework effectively suppresses noise-induced disturbances, reduces instability in prompt updates, and alleviates memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small number of additional trainable parameters. Extensive experiments under synthetic and real-world label noise demonstrate that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.
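Abstract (reference translation): Prompt learning is a parameter-efficient technique for vision-language models, but its robustness under label noise has received little study. Visual content carries richer and more reliable semantic information. The prompt itself, however, is easily affected by label noise. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we use a cross-modal attention mechanism to reversely inject visual semantics into the prompt representations. This lets the prompt tokens selectively aggregate the visual information relevant to the current sample, which improves robustness by anchoring prompt learning to instance-level visual evidence and reducing the influence of noisy supervision. To address the instability that arises from injecting visual information in the same way for every sample even though the quality of visual cues differs, we introduce a lightweight conditional modulation mechanism that adaptively controls the strength of visual information injection, giving a more robust balance between text-side semantic priors and image-side instance evidence. The proposed method effectively suppresses noise-induced disturbances, reduces instability in prompt updates, and alleviates memorization of mislabeled samples. VisPrompt significantly improves robustness while keeping the pretrained VLM backbone frozen and introducing only a small number of trainable parameters. Extensive experiments under synthetic and real-world label noise show that VisPrompt generally outperforms existing baselines on seven benchmark datasets and achieves stronger robustness. Our code is publicly available at https://github.com/gezbww/Vis_Prompt.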
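
The abstract describes two components: cross-modal attention that lets prompt tokens aggregate visual tokens, and a conditional modulation that scales how strongly the visual information is injected. The following is a minimal pure-Python sketch of that idea, not the authors' implementation. It assumes no learned query/key/value projections, a single scalar gate computed from the mean visual token, and a residual update; the names `cross_attend`, `gate_strength`, `inject`, `gate_w`, and `gate_b` are hypothetical and do not come from the paper or its repository.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(prompts, visual):
    # Prompt tokens act as queries; visual tokens act as keys and values.
    # Assumption: identity projections (a real model would learn W_q, W_k, W_v).
    d = len(prompts[0])
    out = []
    for p in prompts:
        scores = [sum(pi * vi for pi, vi in zip(p, v)) / math.sqrt(d)
                  for v in visual]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, visual))
                    for j in range(d)])
    return out

def gate_strength(visual, gate_w, gate_b):
    # Hypothetical conditional modulation: a sigmoid gate on the mean visual
    # token, giving one injection strength in (0, 1) per sample.
    d = len(visual[0])
    mean_v = [sum(v[j] for v in visual) / len(visual) for j in range(d)]
    z = sum(w * m for w, m in zip(gate_w, mean_v)) + gate_b
    return 1.0 / (1.0 + math.exp(-z))

def inject(prompts, visual, gate_w, gate_b):
    # Residual update: prompt + gate * attended visual context.
    g = gate_strength(visual, gate_w, gate_b)
    ctx = cross_attend(prompts, visual)
    return [[pj + g * cj for pj, cj in zip(p, c)]
            for p, c in zip(prompts, ctx)]

# Toy example: 2 prompt tokens and 3 visual tokens of dimension 2.
prompts = [[1.0, 0.0], [0.0, 1.0]]
visual = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
updated = inject(prompts, visual, gate_w=[0.0, 0.0], gate_b=0.0)
print(updated)
```

With a strongly negative `gate_b` the gate closes and the prompts pass through almost unchanged, which is one way to read "adaptively control the strength of visual information injection" for samples whose visual cues are judged unreliable.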

Related papers