Fugu-MT 論文翻訳(概要): A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models

論文の概要: A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models

arxiv url: http://arxiv.org/abs/2407.17797v1
Date: Thu, 25 Jul 2024 06:10:33 GMT
ステータス: 翻訳完了
システム内更新日: 2024-07-26 15:08:06.868842
Title: A Unified Understanding of Adversarial Vulnerability Regarding Unimodal Models and Vision-Language Pre-training Models
Title（参考訳）: 単モーダルモデルとビジョンランゲージ事前学習モデルに関する敵対的脆弱性の統一的理解
Authors: Haonan Zheng, Xinyang Deng, Wen Jiang, Wenrui Li,
Abstract要約: FGA(Feature Guidance Attack)は、テキスト表現を用いてクリーンな画像の摂動を誘導する新しい手法である。提案手法は, 各種データセット, 下流タスク, ブラックボックスとホワイトボックスの両方で, 安定かつ効果的な攻撃能力を示す。
参考スコア（独自算出の注目度）: 7.350203999073509
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With Vision-Language Pre-training (VLP) models demonstrating powerful multimodal interaction capabilities, the application scenarios of neural networks are no longer confined to unimodal domains but have expanded to more complex multimodal V+L downstream tasks. The security vulnerabilities of unimodal models have been extensively examined, whereas those of VLP models remain challenging. We note that in CV models, the understanding of images comes from annotated information, while VLP models are designed to learn image representations directly from raw text. Motivated by this discrepancy, we developed the Feature Guidance Attack (FGA), a novel method that uses text representations to direct the perturbation of clean images, resulting in the generation of adversarial images. FGA is orthogonal to many advanced attack strategies in the unimodal domain, facilitating the direct application of rich research findings from the unimodal to the multimodal scenario. By appropriately introducing text attack into FGA, we construct Feature Guidance with Text Attack (FGA-T). Through the interaction of attacking two modalities, FGA-T achieves superior attack effects against VLP models. Moreover, incorporating data augmentation and momentum mechanisms significantly improves the black-box transferability of FGA-T. Our method demonstrates stable and effective attack capabilities across various datasets, downstream tasks, and both black-box and white-box settings, offering a unified baseline for exploring the robustness of VLP models.
Abstract（参考訳）: 強力なマルチモーダルインタラクション能力を示すVision-Language Pre-training (VLP)モデルにより、ニューラルネットワークの応用シナリオは、もはや単調なドメインに限定されるのではなく、より複雑なマルチモーダルV+L下流タスクに拡張されている。ユニモーダルモデルのセキュリティ脆弱性は広く検討されているが、VLPモデルの脆弱性はいまだに困難なままである。 CVモデルでは、画像の理解は注釈付き情報に由来するが、VLPモデルは生のテキストから直接画像表現を学習するように設計されている。そこで我々は,クリーンな画像の摂動を指示するテキスト表現を用いた特徴誘導攻撃(FGA)を開発した。 FGAは、ユニモーダル領域における多くの先進的な攻撃戦略と直交しており、ユニモーダルからマルチモーダルシナリオへのリッチな研究成果の直接的な適用を促進する。テキストアタックをFGAに適切に導入することにより、テキストアタックによる特徴ガイダンス(FGA-T)を構築する。 2つのモードを攻撃することで、FGA-TはVLPモデルに対して優れた攻撃効果を達成する。さらに、データ拡張と運動量機構を取り入れることで、FGA-Tのブラックボックス転送性が大幅に向上する。提案手法は, 各種データセット, 下流タスク, ブラックボックス, ホワイトボックス設定にまたがる安定かつ効果的な攻撃能力を実証し, VLPモデルのロバスト性を探るための統一ベースラインを提供する。

関連論文リスト

MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCaは、トレーニング済みのVLMバックボーンを効果的な双方向埋め込みモデルに変換するためのフレームワークである。 MoCaは、MMEBとViDoRe-v2ベンチマークのパフォーマンスを継続的に改善し、新しい最先端の結果を達成する。
論文参考訳（メタデータ） (2025-06-29T06:41:00Z)
Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for Enhanced Image-Text Matching [0.8611782340880084]
本研究は,MH-CVSE (Multi-Headed Consensus-Aware Visual-Semantic Embedding) を用いた視覚的セマンティック埋め込みモデルを提案する。本モデルでは,コンセンサスを意識した視覚的セマンティック埋め込みモデル(CVSE)に基づくマルチヘッド自己認識機構を導入し,複数のサブ空間の情報を並列に取得する。損失関数設計においては、MH-CVSEモデルは、損失値自体に応じて動的に重量を調整するために動的重量調整戦略を採用する。
論文参考訳（メタデータ） (2024-12-26T11:46:22Z)
Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining [48.98105914356609]
ルミナ-mGPT (Lumina-mGPT) は、様々な視覚と言語を扱える多モード自動回帰モデルのファミリーである。我々は,Ominiponent Supervised Finetuningを導入し,Lumina-mGPTを全能タスク統一をシームレスに達成する基礎モデルに変換する。
論文参考訳（メタデータ） (2024-08-05T17:46:53Z)
Improving Adversarial Transferability of Vision-Language Pre-training Models through Collaborative Multimodal Interaction [22.393624206051925]
既存の研究は、ビジョンランゲージ事前訓練モデルに対する攻撃の伝達可能性を研究することはめったにない。我々はCMI-Attack(Collaborative Multimodal Interaction Attack)と呼ばれる新しい攻撃を提案する。 CMI-AttackはALBEFからTCL、textCLIP_textViT$と$textCLIP_textCNN$の転送成功率を8.11%-16.75%向上させる。
論文参考訳（メタデータ） (2024-03-16T10:32:24Z)
SA-Attack: Improving Adversarial Transferability of Vision-Language Pre-training Models via Self-Augmentation [56.622250514119294]
ホワイトボックスの敵攻撃とは対照的に、転送攻撃は現実世界のシナリオをより反映している。本稿では,SA-Attackと呼ばれる自己拡張型転送攻撃手法を提案する。
論文参考訳（メタデータ） (2023-12-08T09:08:50Z)
OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization [65.57380193070574]
視覚言語事前学習モデルは、マルチモーダル対逆例に対して脆弱である。近年の研究では、データ拡張と画像-テキストのモーダル相互作用を活用することで、対向的な例の転送可能性を高めることが示されている。本稿では,OT-Attack と呼ばれる最適輸送方式の敵攻撃を提案する。
論文参考訳（メタデータ） (2023-12-07T16:16:50Z)
Adversarial Prompt Tuning for Vision-Language Models [86.5543597406173]
AdvPT(Adversarial Prompt Tuning)は、視覚言語モデル(VLM)における画像エンコーダの対向ロバスト性を高める技術である。我々は,AdvPTが白箱攻撃や黒箱攻撃に対する抵抗性を向上し,既存の画像処理による防御技術と組み合わせることで相乗効果を示すことを示した。
論文参考訳（メタデータ） (2023-11-19T07:47:43Z)
Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models [52.530286579915284]
本稿では,視覚言語事前学習モデルの対角移動可能性について検討する。伝達性劣化は、部分的にはクロスモーダル相互作用のアンダーユース化によって引き起こされる。本稿では,高度に伝達可能なSGA(Set-level Guidance Attack)を提案する。
論文参考訳（メタデータ） (2023-07-26T09:19:21Z)
UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
本稿では、画像テキストコントラスト学習(ITC)、テキスト条件付き画像合成学習(IS)、相互意味整合性モデリング(RSC)を統合した統合マルチモーダルモデルUniDiffを提案する。 UniDiffはマルチモーダル理解と生成タスクの両方において汎用性を示す。
論文参考訳（メタデータ） (2023-06-01T15:39:38Z)
Towards Adversarial Attack on Vision-Language Pre-training Models [15.882687207499373]
本稿では,V+LモデルとV+Lタスクに対する敵対的攻撃について検討した。異なる対象や攻撃対象の影響を調べた結果,強力なマルチモーダル攻撃を設計するための指針として,いくつかの重要な観測結果が得られた。
論文参考訳（メタデータ） (2022-06-19T12:55:45Z)
mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUGは、クロスモーダルな理解と生成のための新しいビジョン言語基盤モデルである。画像キャプション、画像テキスト検索、視覚的グラウンドリング、視覚的質問応答など、幅広い視覚言語下流タスクの最先端結果を達成する。
論文参考訳（メタデータ） (2022-05-24T11:52:06Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。