When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?
- URL: http://arxiv.org/abs/2503.23137v1
- Date: Sat, 29 Mar 2025 16:08:51 GMT
- Title: When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?
- Authors: Tuo Liang, Zhe Hu, Jing Li, Hao Zhang, Yiren Lu, Yunlai Zhou, Yiran Qiao, Disheng Liu, Jeirui Peng, Jing Ma, Yu Yin,
- Abstract summary: We introduce a novel benchmark with 1,262 comic images from diverse multilingual and multicultural contexts.<n>We systematically evaluate a wide range of vision-language models through four complementary tasks.<n>Our experiments reveal that even the most advanced models significantly underperform compared to humans.
- Score: 17.647896474008597
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Understanding humor-particularly when it involves complex, contradictory narratives that require comparative reasoning-remains a significant challenge for large vision-language models (VLMs). This limitation hinders AI's ability to engage in human-like reasoning and cultural expression. In this paper, we investigate this challenge through an in-depth analysis of comics that juxtapose panels to create humor through contradictions. We introduce the YesBut (V2), a novel benchmark with 1,262 comic images from diverse multilingual and multicultural contexts, featuring comprehensive annotations that capture various aspects of narrative understanding. Using this benchmark, we systematically evaluate a wide range of VLMs through four complementary tasks spanning from surface content comprehension to deep narrative reasoning, with particular emphasis on comparative reasoning between contradictory elements. Our extensive experiments reveal that even the most advanced models significantly underperform compared to humans, with common failures in visual perception, key element identification, comparative analysis and hallucinations. We further investigate text-based training strategies and social knowledge augmentation methods to enhance model performance. Our findings not only highlight critical weaknesses in VLMs' understanding of cultural and creative expressions but also provide pathways toward developing context-aware models capable of deeper narrative understanding though comparative reasoning.
Related papers
- Improving Multilingual Retrieval-Augmented Language Models through Dialectic Reasoning Argumentations [65.11348389219887]
We introduce Dialectic-RAG (DRAG), a modular approach that evaluates retrieved information by comparing, contrasting, and resolving conflicting perspectives.
We show the impact of our framework both as an in-context learning strategy and for constructing demonstrations to instruct smaller models.
arXiv Detail & Related papers (2025-04-07T06:55:15Z) - Text Speaks Louder than Vision: ASCII Art Reveals Textual Biases in Vision-Language Models [93.46875303598577]
Vision-language models (VLMs) have advanced rapidly in processing multimodal information, but their ability to reconcile conflicting signals remains underexplored.
This work investigates how VLMs process ASCII art, a unique medium where textual elements collectively form visual patterns, potentially creating semantic-visual conflicts.
arXiv Detail & Related papers (2025-04-02T10:47:07Z) - Imagine while Reasoning in Space: Multimodal Visualization-of-Thought [70.74453180101365]
Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs)<n>We propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT)<n>It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces.
arXiv Detail & Related papers (2025-01-13T18:23:57Z) - DRIVINGVQA: Analyzing Visual Chain-of-Thought Reasoning of Vision Language Models in Real-World Scenarios with Driving Theory Tests [69.00444996464662]
We present DrivingVQA, a new benchmark derived from driving theory tests to evaluate visual chain-of-thought reasoning in complex real-world scenarios.<n>Our experiments reveal that open-source and proprietary LVLMs struggle with visual chain-of-thought reasoning under zero-shot settings.<n>We investigate training strategies that leverage relevant entities to improve visual reasoning.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach [33.20992355312175]
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding.<n>This paper aims to address this question with systematic investigation across 4 model families and 4 model scales.<n>Our analysis reveals a strong correlation between the behavior of these attention heads, the distribution of attention weights, and their concentration on visual tokens within the input.
arXiv Detail & Related papers (2024-12-24T02:31:24Z) - Visual-O1: Understanding Ambiguous Instructions via Multi-modal Multi-turn Chain-of-thoughts Reasoning [53.45295657891099]
This paper proposes Visual-O1, a multi-modal multi-turn chain-of-thought reasoning framework.
It simulates human multi-modal multi-turn reasoning, providing instantial experience for highly intelligent models.
Our work highlights the potential of artificial intelligence to work like humans in real-world scenarios with uncertainty and ambiguity.
arXiv Detail & Related papers (2024-10-04T11:18:41Z) - Enhancing Advanced Visual Reasoning Ability of Large Language Models [20.32900494896848]
Recent advancements in Vision-Language (VL) research have sparked new benchmarks for complex visual reasoning.
We propose Complex Visual Reasoning Large Language Models (CVR-LLM)
Our approach transforms images into detailed, context-aware descriptions using an iterative self-refinement loop.
We also introduce a novel multi-modal in-context learning (ICL) methodology to enhance LLMs' contextual understanding and reasoning.
arXiv Detail & Related papers (2024-09-21T02:10:19Z) - Multimodal Causal Reasoning Benchmark: Challenging Vision Large Language Models to Infer Causal Links Between Siamese Images [19.923665989164387]
We propose a novel Multimodal Causal Reasoning benchmark, namely MuCR, to challenge Large Language Models.
Specifically, we introduce a prompt-driven image synthesis approach to create siamese images with embedded semantic causality and visual cues.
Our extensive experiments reveal that the current state-of-the-art VLLMs are not as skilled at multimodal causal reasoning as we might have hoped.
arXiv Detail & Related papers (2024-08-15T12:04:32Z) - Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models [37.44286562901589]
We propose SpatialEval, a novel benchmark that covers diverse aspects of spatial reasoning.
We conduct a comprehensive evaluation of competitive language and vision-language models.
Our findings reveal several counter-intuitive insights that have been overlooked in the literature.
arXiv Detail & Related papers (2024-06-21T03:53:37Z) - Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions [16.23585043442914]
This paper focuses on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction.
We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics.
Our results show that even state-of-the-art models still lag behind human performance on this task.
arXiv Detail & Related papers (2024-05-29T13:51:43Z) - ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z) - DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning [89.92601337474954]
Pragmatic reasoning plays a pivotal role in deciphering implicit meanings that frequently arise in real-life conversations.
We introduce a novel challenge, DiPlomat, aiming at benchmarking machines' capabilities on pragmatic reasoning and situated conversational understanding.
arXiv Detail & Related papers (2023-06-15T10:41:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.