Fugu-MT Paper Translation (Abstract): Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

Paper overview: Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks

arxiv url: http://arxiv.org/abs/2604.12833v1
Date: Tue, 14 Apr 2026 14:52:15 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.517318
Title: Challenging Vision-Language Models with Physically Deployable Multimodal Semantic Lighting Attacks
Authors: Yingying Zhao, Chengyin Hu, Qike Zhang, Xin Li, Xin Wang, Yiwei Wei, Jiujiang Guo, Jiahuan Long, Tingsong Jiang, Wen Yao
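Abstract summary: Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. The authors propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs.
Reference score (attention score computed independently by this site): 23.938024446316717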
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-Language Models (VLMs) have shown remarkable performance, yet their security remains insufficiently understood. Existing adversarial studies focus almost exclusively on the digital setting, leaving physical-world threats largely unexplored. As VLMs are increasingly deployed in real environments, this gap becomes critical, since adversarial perturbations must be physically realizable. Despite this practical relevance, physical attacks against VLMs have not been systematically studied. Such attacks may induce recognition failures and further disrupt multimodal reasoning, leading to severe semantic misinterpretation in downstream tasks. Therefore, investigating physical attacks on VLMs is essential for assessing their real-world security risks. To address this gap, we propose Multimodal Semantic Lighting Attacks (MSLA), the first physically deployable adversarial attack framework against VLMs. MSLA uses controllable adversarial lighting to disrupt multimodal semantic understanding in real scenes, attacking semantic alignment rather than only task-specific outputs. Consequently, it degrades zero-shot classification performance of mainstream CLIP variants while inducing severe semantic hallucinations in advanced VLMs such as LLaVA and BLIP across image captioning and visual question answering (VQA). Extensive experiments in both digital and physical domains demonstrate that MSLA is effective, transferable, and practically realizable. Our findings provide the first evidence that VLMs are highly vulnerable to physically deployable semantic attacks, exposing a previously overlooked robustness gap and underscoring the urgent need for physical-world robustness evaluation of VLMs.
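The abstract describes lighting that lowers image-text semantic alignment, which in turn degrades zero-shot classification. The sketch below is only a toy illustration of that idea and is not the authors' MSLA method: the abstract does not specify the light model, the optimizer, or the encoders. Every name here (`add_light_spot`, `toy_encoder`, `search_lighting`), the Gaussian spot parameterization, the quadrant-brightness "encoder", and the random search are assumptions made for illustration.

```python
import math
import random


def add_light_spot(image, cx, cy, radius, intensity):
    """Return a copy of a grayscale image (rows of floats in [0, 1]) with a
    Gaussian light spot added and the result clipped to 1.0.
    The Gaussian spot is an assumed light model, not taken from the paper."""
    lit = []
    for y, row in enumerate(image):
        new_row = []
        for x, value in enumerate(row):
            dist_sq = (x - cx) ** 2 + (y - cy) ** 2
            falloff = math.exp(-dist_sq / (2.0 * radius ** 2))
            new_row.append(min(1.0, value + intensity * falloff))
        lit.append(new_row)
    return lit


def toy_encoder(image):
    """Hypothetical stand-in for an image encoder: the mean brightness of the
    four quadrants (top-left, top-right, bottom-left, bottom-right)."""
    h, w = len(image), len(image[0])
    features = []
    for y0, y1 in ((0, h // 2), (h // 2, h)):
        for x0, x1 in ((0, w // 2), (w // 2, w)):
            cells = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cells) / len(cells))
    return features


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(p * q for p, q in zip(a, b))
    norm_a = math.sqrt(sum(p * p for p in a))
    norm_b = math.sqrt(sum(q * q for q in b))
    return dot / (norm_a * norm_b + 1e-12)


def zero_shot_label(image, text_embeddings):
    """Zero-shot classification: pick the label whose (assumed) text
    embedding is most similar to the image embedding."""
    embedding = toy_encoder(image)
    return max(text_embeddings, key=lambda name: cosine(embedding, text_embeddings[name]))


def search_lighting(image, text_embeddings, true_label, trials=200, seed=0):
    """Random search over spot parameters (cx, cy, radius, intensity) that
    lowers the similarity between the image and its true label's text
    embedding. Returns (best_params, best_similarity); best_params is None
    if no sampled spot lowers the similarity."""
    rng = random.Random(seed)
    h, w = len(image), len(image[0])
    target = text_embeddings[true_label]
    best_params = None
    best_score = cosine(toy_encoder(image), target)
    for _ in range(trials):
        params = (
            rng.uniform(0, w - 1),    # spot centre, x
            rng.uniform(0, h - 1),    # spot centre, y
            rng.uniform(1.0, w / 2),  # spot radius
            rng.uniform(0.2, 1.0),    # spot intensity
        )
        score = cosine(toy_encoder(add_light_spot(image, *params)), target)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    # An 8x8 scene whose top-left quadrant is bright.
    scene = [[0.9 if (x < 4 and y < 4) else 0.1 for x in range(8)] for y in range(8)]
    prompts = {
        "bright top-left": [1.0, 0.0, 0.0, 0.0],
        "bright bottom-right": [0.0, 0.0, 0.0, 1.0],
    }
    print("clean label:", zero_shot_label(scene, prompts))
    params, score = search_lighting(scene, prompts, "bright top-left")
    print("best spot parameters:", params)
    print("similarity to true label under that spot:", score)
```

The search objective is the image-text similarity itself rather than a task-specific loss, which mirrors the abstract's framing of attacking semantic alignment. A real evaluation would replace `toy_encoder` and the hand-written text embeddings with an actual VLM's encoders and would model physically realizable light sources.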

Related papers