Fugu-MT Paper Translation (Abstract): VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models

Paper overview: VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models

arxiv url: http://arxiv.org/abs/2509.14571v1
Date: Thu, 18 Sep 2025 03:15:00 GMT
Status: Translation complete
Updated in system: 2025-09-19 17:26:53.041921
Title: VisMoDAl: Visual Analytics for Evaluating and Improving Corruption Robustness of Vision-Language Models
Title (reference translation): VisMoDAl: Visual Analytics for Evaluating and Improving the Corruption Robustness of Vision-Language Models
Authors: Huanchen Wang, Wencheng Zhang, Zhiqiang Wang, Zhicong Lu, Yuxin Ma
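Abstract summary: We introduce VisMoDAl, a visual analytics framework for evaluating vision-language (VL) models against various corruption types. VisMoDAl supports multi-level analysis, ranging from examining performance under specific corruptions to task-driven inspection of model behavior.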
Reference score (independently computed attention score): 38.03390941101576
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language (VL) models have shown transformative potential across various critical domains due to their capability to comprehend multi-modal information. However, their performance frequently degrades under distribution shifts, making it crucial to assess and improve robustness against real-world data corruption encountered in practical applications. While advancements in VL benchmark datasets and data augmentation (DA) have contributed to robustness evaluation and improvement, there remain challenges due to a lack of in-depth comprehension of model behavior as well as the need for expertise and iterative efforts to explore data patterns. Given the achievement of visualization in explaining complex models and exploring large-scale data, understanding the impact of various data corruption on VL models aligns naturally with a visual analytics approach. To address these challenges, we introduce VisMoDAl, a visual analytics framework designed to evaluate VL model robustness against various corruption types and identify underperformed samples to guide the development of effective DA strategies. Grounded in the literature review and expert discussions, VisMoDAl supports multi-level analysis, ranging from examining performance under specific corruptions to task-driven inspection of model behavior and corresponding data slice. Unlike conventional works, VisMoDAl enables users to reason about the effects of corruption on VL models, facilitating both model behavior understanding and DA strategy formulation. The utility of our system is demonstrated through case studies and quantitative evaluations focused on corruption robustness in the image captioning task.
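Abstract (reference translation): Vision-language (VL) models show transformative potential across a variety of critical domains because of their ability to understand multi-modal information. However, their performance often degrades under distribution shift, so it is important to assess and improve their robustness against the real-world data corruption encountered in practical applications. Advances in VL benchmark datasets and data augmentation (DA) have contributed to robustness evaluation and improvement, but challenges remain, including the lack of an in-depth understanding of model behavior and the need for expertise and iterative effort to explore data patterns. Given the track record of visualization in explaining complex models and exploring large-scale data, understanding the impact of various data corruptions on VL models aligns naturally with a visual analytics approach. To address these challenges, we introduce VisMoDAl, a visual analytics framework that evaluates the robustness of VL models against various corruption types. Grounded in a literature review and expert discussions, VisMoDAl supports multi-level analysis, ranging from examining performance under specific corruptions to task-driven inspection of model behavior and the corresponding data slices. Unlike conventional work, VisMoDAl lets users reason about the effects of corruption on VL models, facilitating both the understanding of model behavior and the formulation of DA strategies. The utility of the system is demonstrated through case studies and quantitative evaluations focused on corruption robustness in the image captioning task.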

Related papers