Fugu-MT Paper Translation (Summary): Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models

Paper summary: Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models

arxiv url: http://arxiv.org/abs/2508.11317v1
Date: Fri, 15 Aug 2025 08:40:13 GMT
Status: translation complete
Last updated in system: 2025-08-18 14:51:23.804024
Title: Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models
Authors: Yuchen Zhou, Jiayu Tang, Shuo Yang, Xiaoyan Xiao, Yuqin Dai, Wenhao Yang, Chao Gou, Xiaobo Xia, Tat-Seng Chua
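Abstract summary: Vision-Language Models (VLMs) have emerged as a foundation for multimodal intelligence, but their capacity for logical understanding remains significantly underexplored. LogicBench is a benchmark with over 50,000 vision-language pairs spanning 9 logical categories and 4 diverse scenarios. LogicCLIP, a training framework for increasing the logical sensitivity of VLMs, is proposed.
Reference score (attention score computed independently by this site): 58.456656119178064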
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs), exemplified by CLIP, have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored, resulting in critical "logical blindspots" that limit their reliability in practical applications. To systematically diagnose this, we introduce LogicBench, a comprehensive benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios: images, videos, anomaly detection, and medical diagnostics. Our evaluation reveals that existing VLMs, even the state-of-the-art ones, fall more than 40 accuracy points below human performance, particularly in challenging tasks like Causality and Conditionality, highlighting their reliance on surface semantics over critical logical structures. To bridge this gap, we propose LogicCLIP, a novel training framework designed to boost VLMs' logical sensitivity through advancements in both data generation and optimization objectives. LogicCLIP utilizes logic-aware data generation and a contrastive learning strategy that combines coarse-grained alignment, a fine-grained multiple-choice objective, and a novel logical structure-aware objective. Extensive experiments demonstrate LogicCLIP's substantial improvements in logical comprehension across all LogicBench domains, significantly outperforming baselines. Moreover, LogicCLIP retains, and often surpasses, competitive performance on general vision-language benchmarks, demonstrating that the enhanced logical understanding does not come at the expense of general alignment. We believe that LogicBench and LogicCLIP will be important resources for advancing VLM logical capabilities.


Related papers