Related papers: Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models

URL: http://arxiv.org/abs/2508.11317v1
Date: Fri, 15 Aug 2025 08:40:13 GMT
Title: Logic Unseen: Revealing the Logical Blindspots of Vision-Language Models
Authors: Yuchen Zhou, Jiayu Tang, Shuo Yang, Xiaoyan Xiao, Yuqin Dai, Wenhao Yang, Chao Gou, Xiaobo Xia, Tat-Seng Chua
Abstract summary: Vision-Language Models (VLMs) have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored. We introduce LogicBench, a benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios. We propose LogicCLIP, a training framework designed to boost VLMs' logical sensitivity.
Score: 58.456656119178064
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs), exemplified by CLIP, have emerged as foundational for multimodal intelligence. However, their capacity for logical understanding remains significantly underexplored, resulting in critical "logical blindspots" that limit their reliability in practical applications. To systematically diagnose this, we introduce LogicBench, a comprehensive benchmark with over 50,000 vision-language pairs across 9 logical categories and 4 diverse scenarios: images, videos, anomaly detection, and medical diagnostics. Our evaluation reveals that existing VLMs, even state-of-the-art ones, fall more than 40 accuracy points below human performance, particularly on challenging tasks like Causality and Conditionality, highlighting their reliance on surface semantics over critical logical structures. To bridge this gap, we propose LogicCLIP, a novel training framework designed to boost VLMs' logical sensitivity through advancements in both data generation and optimization objectives. LogicCLIP utilizes logic-aware data generation and a contrastive learning strategy that combines coarse-grained alignment, a fine-grained multiple-choice objective, and a novel logical structure-aware objective. Extensive experiments demonstrate LogicCLIP's substantial improvements in logical comprehension across all LogicBench domains, significantly outperforming baselines. Moreover, LogicCLIP retains, and often surpasses, competitive performance on general vision-language benchmarks, demonstrating that the enhanced logical understanding does not come at the expense of general alignment. We believe that LogicBench and LogicCLIP will be important resources for advancing VLM logical capabilities.
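
Illustrative sketch (not from the paper): the abstract says LogicCLIP combines a coarse-grained alignment term with a fine-grained multiple-choice objective, but gives no formulas. The Python below shows one common way such terms are written: a CLIP-style symmetric contrastive loss over a batch, plus a cross-entropy over candidate captions for a single image. All function names, the temperature value, and the weights `w_coarse` and `w_choice` are assumptions made for illustration. The logical structure-aware objective is omitted because the abstract does not describe it.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def coarse_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; image i is assumed to match text i."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    sim = [[dot(i, t) / temperature for t in txts] for i in imgs]
    # Image-to-text direction: softmax over each row.
    i2t = -sum(log_softmax(sim[k])[k] for k in range(n)) / n
    # Text-to-image direction: softmax over each column.
    t2i = -sum(log_softmax([sim[r][k] for r in range(n)])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def multiple_choice_loss(image_emb, caption_embs, correct_index, temperature=0.07):
    """Cross-entropy over candidate captions for one image."""
    img = normalize(image_emb)
    scores = [dot(img, normalize(c)) / temperature for c in caption_embs]
    return -log_softmax(scores)[correct_index]

def combined_loss(image_embs, text_embs, choice_sets, w_coarse=1.0, w_choice=1.0):
    """Weighted sum of both terms; the weights are hypothetical."""
    coarse = coarse_contrastive_loss(image_embs, text_embs)
    choice = sum(
        multiple_choice_loss(image_embs[i], caps, idx)
        for i, (caps, idx) in enumerate(choice_sets)
    ) / len(choice_sets)
    return w_coarse * coarse + w_choice * choice
```

In this sketch each entry of `choice_sets` pairs a list of candidate caption embeddings (for example, captions differing only in a logical connective) with the index of the correct one, so the second term penalizes the model when a logically wrong caption scores higher than the right one.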

Related papers

SpatiaLQA: A Benchmark for Evaluating Spatial Logical Reasoning in Vision-Language Models [60.088066516175026]
We introduce a benchmark designed to evaluate the spatial logical reasoning capabilities of Vision-Language Models (VLMs). We conduct extensive experiments on 41 mainstream VLMs, and the results show that even the most advanced models still struggle with spatial logical reasoning. We propose a method called recursive scene graph assisted reasoning, which leverages visual foundation models to progressively decompose complex scenes into task-relevant scene graphs.
arXiv Detail & Related papers (2026-02-24T13:38:37Z)
Logical Phase Transitions: Understanding Collapse in LLM Logical Reasoning [17.5066777599458]
Symbolic logical reasoning is a critical yet underexplored capability of large language models (LLMs). We show that logical reasoning performance remains stable within a regime but collapses abruptly beyond a critical logical depth. We propose Neuro-Symbolic Curriculum Tuning, a principled framework that adaptively aligns natural language with logical symbols to establish a shared representation.
arXiv Detail & Related papers (2026-01-06T10:38:25Z)
LogicLens: Visual-Logical Co-Reasoning for Text-Centric Forgery Analysis [10.305807834419765]
Text-centric forgeries pose a significant threat to societal security and information authenticity. Current methods for text-centric forgery analysis are often limited to coarse-grained visual analysis. We introduce LogicLens, a unified framework for Visual-Textual Co-reasoning.
arXiv Detail & Related papers (2025-12-25T03:02:27Z)
Last Layer Logits to Logic: Empowering LLMs with Logic-Consistent Structured Knowledge Reasoning [55.55968342644846]
Large Language Models (LLMs) achieve excellent performance in natural language reasoning tasks through pre-training on vast unstructured text. We propose the Logits-to-Logic framework, which incorporates logits strengthening and logits filtering as core modules to correct logical defects in LLM outputs.
arXiv Detail & Related papers (2025-11-11T07:08:27Z)
From Ambiguity to Verdict: A Semiotic-Grounded Multi-Perspective Agent for LLM Logical Reasoning [16.381034926435074]
LogicAgent is a semiotic-square-guided framework designed to jointly address logical complexity and semantic complexity. To overcome the semantic simplicity and low logical complexity of existing datasets, we introduce RepublicQA, a benchmark that reaches college-level difficulty. Experiments demonstrate that LogicAgent achieves state-of-the-art performance on RepublicQA, with a 6.25% average gain over strong baselines.
arXiv Detail & Related papers (2025-09-29T13:31:22Z)
LOGICSEG: Parsing Visual Semantics with Neural Logic Learning and Reasoning [73.98142349171552]
LOGICSEG is a holistic visual semantic parser that integrates neural inductive learning and logic reasoning with both rich data and symbolic knowledge. During fuzzy logic-based continuous relaxation, logical formulae are grounded onto data and neural computational graphs, hence enabling logic-induced network training. These designs together make LOGICSEG a general and compact neural-logic machine that is readily integrated into existing segmentation models.
arXiv Detail & Related papers (2023-09-24T05:43:19Z)
Exploring Self-supervised Logic-enhanced Training for Large Language Models [59.227222647741094]
In this paper, we make the first attempt to investigate the feasibility of incorporating logical knowledge through self-supervised post-training. We devise an auto-regressive objective variant of MERIt and integrate it with two LLM series, i.e., FLAN-T5 and LLaMA, with parameter size ranging from 3 billion to 13 billion. The results on two challenging logical reasoning benchmarks demonstrate the effectiveness of LogicLLM.
arXiv Detail & Related papers (2023-05-23T06:13:10Z)
Discourse-Aware Graph Networks for Textual Logical Reasoning [142.0097357999134]
Passage-level logical relations represent entailment or contradiction between propositional units (e.g., a concluding sentence). We propose logic structural-constraint modeling to solve the logical reasoning QA and introduce discourse-aware graph networks (DAGNs). The networks first construct logic graphs leveraging in-line discourse connectives and generic logic theories, then learn logic representations by end-to-end evolving the logic relations with an edge-reasoning mechanism and updating the graph features.
arXiv Detail & Related papers (2022-07-04T14:38:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.
