Fugu-MT Paper Translation (Abstract): Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models

Paper summary: Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models

arxiv url: http://arxiv.org/abs/2604.11576v1
Date: Mon, 13 Apr 2026 14:54:25 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:16.63066
Title: Finetune Like You Pretrain: Boosting Zero-shot Adversarial Robustness in Vision-language Models
Title (reference translation): Finetuning that boosts zero-shot adversarial robustness in vision-language models
Authors: Songlong Xing, Weijie Wang, Zhengyu Zhao, Jindong Gu, Philip Torr, Nicu Sebe
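Abstract summary: This paper proposes AdvFLYP, which follows the training recipe of CLIP's pretraining process. Specifically, AdvFLYP finetunes CLIP with adversarial images generated from image-text pairs collected from the web and matches them with their corresponding texts via a contrastive loss. The paper also shows that logit-level and feature-level regularisation terms benefit robustness and clean accuracy, respectively.
Reference score (independently computed attention score): 89.0460992131069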
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be susceptible to adversarial attacks. To enhance adversarial robustness, recent studies finetune the pretrained vision encoder of CLIP with adversarial examples on a proxy dataset such as ImageNet by aligning adversarial images with correct class labels. However, these methods overlook the important roles of training data distributions and learning objectives, resulting in reduced zero-shot capabilities and limited transferability of robustness across domains and datasets. In this work, we propose a simple yet effective paradigm, AdvFLYP, which follows the training recipe of CLIP's pretraining process when performing adversarial finetuning of the model. Specifically, AdvFLYP finetunes CLIP with adversarial images created based on image-text pairs collected from the web, and matches them with their corresponding texts via a contrastive loss. To alleviate distortion of adversarial image embeddings of noisy web images, we further propose to regularise AdvFLYP by penalising deviation of adversarial image features. We show that logit- and feature-level regularisation terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show the superiority of our paradigm over mainstream practices. Our code and model weights are released at https://github.com/Sxing2/AdvFLYP.
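Abstract (reference translation): Despite their impressive zero-shot abilities, vision-language models such as CLIP have been shown to be vulnerable to adversarial attacks. Recent studies finetune CLIP's pretrained vision encoder on a proxy dataset such as ImageNet by aligning adversarial images with the correct class labels. However, these methods overlook the important roles of training data distributions and learning objectives, which leads to reduced zero-shot capabilities and limited transferability of robustness across domains and datasets. This paper proposes AdvFLYP, a simple and effective paradigm that performs adversarial finetuning of the model by following the training recipe of CLIP's pretraining process. Specifically, AdvFLYP finetunes CLIP with adversarial images generated from image-text pairs collected from the web and matches them with their corresponding texts via a contrastive loss. To alleviate distortion of the adversarial image embeddings of noisy web images, we further propose regularising AdvFLYP by penalising deviation of adversarial image features. We also show that logit-level and feature-level regularisation terms benefit robustness and clean accuracy, respectively. Extensive experiments on 14 downstream datasets spanning various domains show that the paradigm outperforms mainstream practices. The code and model weights are released at https://github.com/Sxing2/AdvFLYP.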
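
The abstract describes a training objective with three ingredients: a contrastive loss that matches adversarial images with their paired texts, a logit-level regulariser, and a feature-level regulariser. The following is a minimal sketch of how such an objective might be assembled, not the authors' implementation. The abstract does not give the exact form of either regulariser, so the KL term, the squared-L2 term, the use of clean features as the reference point, the temperature, the weights `lam_logit` and `lam_feat`, and every function name here are illustrative assumptions. Generation of the adversarial images themselves is omitted; the embeddings are toy values.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def logits(image_feats, text_feats, temperature=0.07):
    """Temperature-scaled cosine similarity between every image and every text."""
    imgs = [normalize(v) for v in image_feats]
    txts = [normalize(v) for v in text_feats]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
            for i in imgs]

def log_softmax(row):
    """Numerically stable log-softmax of one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(lg):
    """Symmetric contrastive loss: image i should match text i and vice versa."""
    n = len(lg)
    img_to_txt = -sum(log_softmax(lg[i])[i] for i in range(n)) / n
    cols = [[lg[i][j] for i in range(n)] for j in range(n)]
    txt_to_img = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)

def feature_penalty(adv_feats, clean_feats):
    """ASSUMED form: mean squared L2 distance between adversarial and clean features."""
    total = sum(sum((a - c) ** 2 for a, c in zip(av, cv))
                for av, cv in zip(adv_feats, clean_feats))
    return total / len(adv_feats)

def logit_penalty(adv_logits, clean_logits):
    """ASSUMED form: mean KL(clean || adversarial) over image-to-text distributions."""
    total = 0.0
    for a_row, c_row in zip(adv_logits, clean_logits):
        la, lc = log_softmax(a_row), log_softmax(c_row)
        total += sum(math.exp(c) * (c - a) for a, c in zip(la, lc))
    return total / len(adv_logits)

def advflyp_style_loss(adv_feats, clean_feats, text_feats,
                       lam_logit=1.0, lam_feat=1.0):
    """Contrastive loss on adversarial images plus the two assumed regularisers."""
    adv_lg = logits(adv_feats, text_feats)
    clean_lg = logits(clean_feats, text_feats)
    return (contrastive_loss(adv_lg)
            + lam_logit * logit_penalty(adv_lg, clean_lg)
            + lam_feat * feature_penalty(adv_feats, clean_feats))

# Toy batch of two image-text pairs with 2-dimensional embeddings.
clean = [[1.0, 0.0], [0.0, 1.0]]
adv = [[0.8, 0.3], [0.2, 0.9]]     # stand-in for perturbed image embeddings
text = [[1.0, 0.1], [0.1, 1.0]]
print(advflyp_style_loss(adv, clean, text))
```

In this sketch both regularisers vanish when the adversarial embeddings equal the clean ones, so the objective reduces to the plain contrastive loss in that case.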


Related papers