Fugu-MT Paper Translation (Abstract): Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment

Paper overview: Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment

arxiv url: http://arxiv.org/abs/2509.18717v1
Date: Tue, 23 Sep 2025 07:05:43 GMT
Status: Translation complete
Last updated in system: 2025-09-24 20:41:27.741335
Title: Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment
Authors: Tong Zhang, Kuofeng Gao, Jiawang Bai, Leo Yu Zhang, Xin Yin, Zonghui Wang, Shouling Ji, Wenzhi Chen
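Abstract summary: Contrastive Language-Image Pre-training models are threatened by targeted data poisoning and backdoor attacks. Previous defense methods correct poisoned image-caption pairs by matching a new caption to each image. We propose OTCCLIP, an optimal transport-based framework that reconstructs image-caption pairs.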
Reference score (independently computed attention score): 65.51957843888061
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies have shown that Contrastive Language-Image Pre-training (CLIP) models are threatened by targeted data poisoning and backdoor attacks because their training sets consist of massive numbers of image-caption pairs crawled from the Internet. Previous defense methods correct poisoned image-caption pairs by matching a new caption to each image. However, the matching process relies solely on the global representations of images and captions, overlooking fine-grained visual and textual features. It may therefore introduce incorrect image-caption pairs and harm CLIP pre-training. To address these limitations, we propose OTCCLIP, an optimal transport-based framework for reconstructing image-caption pairs. We introduce a new optimal transport-based distance measure between fine-grained visual and textual feature sets and re-assign captions based on this distance. Additionally, to further reduce the negative impact of mismatched pairs, we encourage inter- and intra-modality fine-grained alignment by employing optimal transport-based objective functions. Our experiments demonstrate that OTCCLIP successfully decreases the attack success rates of poisoning attacks. Also, compared to previous methods, OTCCLIP significantly improves the zero-shot and linear probing performance of CLIP models trained on poisoned datasets.
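The abstract does not say how the optimal transport distance between fine-grained feature sets is computed. The sketch below is only an illustration of the general idea and is not the authors' implementation. It assumes each image is a set of patch embeddings and each caption a set of token embeddings, gives every element uniform mass, uses cosine distance as the ground cost, and approximates the transport plan with entropic-regularized Sinkhorn iterations. The names `sinkhorn_distance` and `reassign_captions` and the parameters `eps` and `n_iters` are hypothetical.

```python
import numpy as np


def sinkhorn_distance(x, y, eps=0.1, n_iters=200):
    """Entropic OT cost between two feature sets (assumption: uniform masses)."""
    # L2-normalize rows so the ground cost is cosine distance in [0, 2].
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    cost = 1.0 - x @ y.T                       # shape (n_patches, n_tokens)
    a = np.full(x.shape[0], 1.0 / x.shape[0])  # mass on each visual feature
    b = np.full(y.shape[0], 1.0 / y.shape[0])  # mass on each textual feature
    kernel = np.exp(-cost / eps)               # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):                   # alternate marginal scalings
        v = b / (kernel.T @ u)
        u = a / (kernel @ v)
    plan = u[:, None] * kernel * v[None, :]    # approximate transport plan
    return float((plan * cost).sum())


def reassign_captions(image_feats, caption_feats):
    """For each image, return the index of the caption with the smallest OT cost."""
    dist = np.array([[sinkhorn_distance(img, cap) for cap in caption_feats]
                     for img in image_feats])
    return dist.argmin(axis=1), dist
```

Calling `reassign_captions(image_feats, caption_feats)` with two lists of 2-D arrays returns, for every image, the index of its nearest caption together with the full distance matrix. A global-representation baseline of the kind the abstract criticizes would instead compare a single pooled vector per image and per caption.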


Related papers