Fugu-MT Paper Translation (Abstract): XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs

Paper overview: XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs

arxiv url: http://arxiv.org/abs/2603.28568v1
Date: Mon, 30 Mar 2026 15:24:34 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.475831
Title: XSPA: Crafting Imperceptible X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs
Title (reference translation): XSPA: Crafting X-Shaped Sparse Adversarial Perturbations for Transferable Attacks on VLMs
Authors: Chengyin Hu, Jiaju Han, Xuemeng Sun, Qike Zhang, Yiwei Wei, Ang Li, Chunlei Meng, Xiang Chen, Jiahuan Long,
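Abstract summary: Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks. Small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures. X-shaped Sparse Pixel Attack (XSPA) is an imperceptible structured attack that restricts perturbations to two diagonal lines.
Reference score (independently computed attention score): 12.841884476022889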
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). While this shared space enables strong cross-task generalization, it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. This risk is particularly important in interactive and decision-support settings, yet it remains unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.
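Abstract (reference translation): Vision-language models (VLMs) rely on a shared visual-textual representation space to perform tasks such as zero-shot classification, image captioning, and visual question answering (VQA). This shared space enables strong cross-task generalization, but it may also introduce a common vulnerability: small visual perturbations can propagate through the shared embedding space and cause correlated semantic failures across tasks. The risk is particularly important in interactive and decision-support settings, yet it is unclear whether VLMs are robust to highly constrained, sparse, and geometrically fixed perturbations. To address this question, we propose X-shaped Sparse Pixel Attack (XSPA), an imperceptible structured attack that restricts perturbations to two intersecting diagonal lines. Compared with dense perturbations or flexible localized patches, XSPA operates under a much stricter attack budget and thus provides a more stringent test of VLM robustness. Within this sparse support, XSPA jointly optimizes a classification objective, cross-task semantic guidance, and regularization on perturbation magnitude and along-line smoothness, inducing transferable misclassification as well as semantic drift in captioning and VQA while preserving visual subtlety. Under the default setting, XSPA modifies only about 1.76% of image pixels. Experiments on the COCO dataset show that XSPA consistently degrades performance across all three tasks. Zero-shot accuracy drops by 52.33 points on OpenAI CLIP ViT-L/14 and 67.00 points on OpenCLIP ViT-B/16, while GPT-4-evaluated caption consistency decreases by up to 58.60 points and VQA correctness by up to 44.38 points. These results suggest that even highly sparse and visually subtle perturbations with fixed geometric priors can substantially disrupt cross-task semantics in VLMs, revealing a notable robustness gap in current multimodal systems.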
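
The abstract describes a perturbation whose support is limited to two intersecting diagonal lines. The sketch below illustrates that constraint only: it builds an X-shaped mask, measures what fraction of pixels it covers, and applies a perturbation that is zeroed outside the mask. It is not the authors' implementation. The function names, the square grayscale image, the 224-pixel side, the `thickness` parameter, and the clipping bound `eps` are all assumptions made for illustration; the abstract does not state the line width, the resolution, or how the lines are parametrized, and the optimization objective is not sketched here.

```python
def x_shaped_mask(size, thickness=1):
    """Build a size x size 0/1 mask covering both diagonals.

    `thickness` is a hypothetical knob: how many horizontally adjacent
    pixels each diagonal occupies in every row.
    """
    mask = [[0] * size for _ in range(size)]
    for i in range(size):
        for d in range(thickness):
            main_j = i + d                # main diagonal, widened to the right
            anti_j = size - 1 - i + d     # anti-diagonal, widened to the right
            for j in (main_j, anti_j):
                if 0 <= j < size:
                    mask[i][j] = 1
    return mask


def support_fraction(mask):
    """Fraction of pixels that the mask allows to change."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


def apply_on_support(image, delta, mask, eps):
    """Add `delta` to a grayscale image in [0, 1], only where mask == 1.

    The perturbation is clipped to [-eps, eps] (an assumed magnitude bound)
    and the result is clipped back to the valid pixel range.
    """
    out = []
    for img_row, d_row, m_row in zip(image, delta, mask):
        out.append([
            min(1.0, max(0.0, p + m * max(-eps, min(eps, d))))
            for p, d, m in zip(img_row, d_row, m_row)
        ])
    return out


if __name__ == "__main__":
    mask = x_shaped_mask(224, thickness=2)
    print(f"pixels on support: {support_fraction(mask):.4%}")
```

With the assumed 224 x 224 image and a two-pixel line, the mask covers 892 of 50,176 pixels, roughly 1.78%. That is in the same range as the approximately 1.76% reported in the abstract but does not reproduce it, which is expected given that the actual line parametrization is not specified there.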

Related papers