Fugu-MT Paper Translation (Abstract): Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks

Paper overview: Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks

arxiv url: http://arxiv.org/abs/2604.04473v1
Date: Mon, 06 Apr 2026 06:48:32 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:19.123698
Title: Beyond Standard Benchmarks: A Systematic Audit of Vision-Language Model's Robustness to Natural Semantic Variation Across Diverse Tasks
Authors: Jia Chengyu, AprilPyone MaungMaung, Huy H. Nguyen, Jinyin Chen, Isao Echizen
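Abstract summary: This paper presents a systematic evaluation framework for vision-language models (VLMs) under natural adversarial scenarios. It measures the natural adversarial performance of selected VLMs on zero-shot image classification, semantic segmentation, and visual question answering. The analysis reveals that robust CLIP models can amplify natural adversarial vulnerabilities, and that the performance of CLIP models drops significantly on natural language-induced adversarial examples.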
Reference score (independently computed attention score): 11.064940886724257
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in vision-language models (VLMs) trained on web-scale image-text pairs have enabled impressive zero-shot transfer across a diverse range of visual tasks. However, comprehensive and independent evaluation beyond standard benchmarks is essential to understand their robustness, limitations, and real-world applicability. This paper presents a systematic evaluation framework for VLMs under natural adversarial scenarios for diverse downstream tasks, which has been overlooked in previous evaluation works. We evaluate a wide range of VLMs (CLIP, robust CLIP, BLIP2, and SigLIP2) on curated adversarial datasets (typographic attacks, ImageNet-A, and natural language-induced adversarial examples). We measure the natural adversarial performance of selected VLMs for zero-shot image classification, semantic segmentation, and visual question answering. Our analysis reveals that robust CLIP models can amplify natural adversarial vulnerabilities, and CLIP models significantly reduce performance for natural language-induced adversarial examples. Additionally, we provide interpretable analyses to identify failure modes. We hope our findings inspire future research in robust and fair multimodal pattern recognition.

Related papers