Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
- URL: http://arxiv.org/abs/2505.23759v2
- Date: Tue, 16 Sep 2025 18:13:16 GMT
- Title: Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint
- Authors: Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan
- Abstract summary: Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.
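The paper's evaluation pipeline is not reproduced here, but scoring a VLM's free-form answer against an annotated gold phrase is typically done with a normalized exact match. A minimal sketch (the normalization rule and function names are assumptions for illustration, not the authors' code):

```python
import re

def normalize(answer: str) -> str:
    """Lowercase and strip punctuation/extra whitespace so that
    'Head Over Heels!' and 'head over heels' compare equal."""
    return re.sub(r"[^a-z0-9 ]", "", answer.lower()).strip()

def exact_match(prediction: str, gold: str) -> bool:
    """Normalized exact match: a common scoring rule for short
    free-form puzzle answers."""
    return normalize(prediction) == normalize(gold)

def score(predictions: list[str], golds: list[str]) -> float:
    """Fraction of puzzles solved under normalized exact match."""
    assert len(predictions) == len(golds)
    hits = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return hits / len(golds)
```

For example, `score(["Head over heels!", "hot dog"], ["head over heels", "top dog"])` counts only the first answer as correct and returns 0.5.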
Related papers
- Chatting with Images for Introspective Visual Thinking [50.7747647794877]
"Chatting with images" is a new framework that reframes visual manipulation as language-guided feature modulation. Under the guidance of expressive language prompts, the model dynamically performs joint re-encoding over multiple image regions. ViLaVT achieves strong and consistent improvements on complex multi-image and video-based spatial reasoning tasks.
arXiv Detail & Related papers (2026-02-11T17:42:37Z)
- Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning [1.6234264741872295]
Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping evidence to non-literal concepts. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding.
arXiv Detail & Related papers (2026-01-06T20:27:29Z)
- Seq2Seq Models Reconstruct Visual Jigsaw Puzzles without Seeing Them [2.8834483859625952]
We introduce a specialized tokenizer that converts each puzzle piece into a discrete sequence of tokens. Treated as "blind" solvers, encoder-decoder transformers accurately reconstruct the original layout. Despite being deliberately restricted from accessing visual input, our models achieve state-of-the-art results.
arXiv Detail & Related papers (2025-11-09T10:43:16Z)
- Reasoning Riddles: How Explainability Reveals Cognitive Limits in Vision-Language Models [0.0]
Vision-Language Models (VLMs) excel at many multimodal tasks, yet their cognitive processes remain opaque on complex lateral thinking challenges like rebus puzzles. Our study contributes a systematically annotated dataset of 221 rebus puzzles across six cognitive categories, paired with an evaluation framework that separates reasoning quality from answer correctness. Our findings demonstrate that reasoning quality varies dramatically across puzzle categories, with models showing systematic strengths in visual composition while exhibiting fundamental limitations in absence interpretation and cultural symbolism.
arXiv Detail & Related papers (2025-10-03T07:27:47Z)
- VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps [3.6380495892295173]
We propose a vision-language framework that leverages textual context to enhance puzzle assembly performance. Our approach centers on the Vision-Language Hierarchical Semantic Alignment (VLHSA) module. Our work establishes a new paradigm for jigsaw puzzle solving by incorporating multimodal semantic insights.
arXiv Detail & Related papers (2025-09-17T20:40:52Z)
- ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs [98.27348724529257]
We introduce ViCrit (Visual Caption Hallucination Critic), an RL proxy task that trains VLMs to localize a subtle, synthetic visual hallucination injected into paragraphs of human-written image captions. Models trained with the ViCrit task exhibit substantial gains across a variety of vision-language benchmarks.
arXiv Detail & Related papers (2025-06-11T19:16:54Z)
- CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation [53.452699232071495]
CrossWordBench is a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) through the medium of crossword puzzles. Our evaluation reveals that reasoning LLMs substantially outperform non-reasoning models by effectively leveraging crossing-letter constraints. Our findings offer insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
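CrossWordBench's puzzle generator is not shown here, but the crossing-letter constraint its evaluation leverages can be sketched as a simple grid-consistency check (the grid representation and function name are assumptions for illustration):

```python
def check_crossings(grid: dict[tuple[int, int], str],
                    word: str, row: int, col: int, across: bool) -> bool:
    """Place `word` in `grid` if it does not contradict any letter
    already fixed by an intersecting entry; return whether it fit.
    `grid` maps (row, col) cells to single letters."""
    cells = [(row, col + i) if across else (row + i, col)
             for i in range(len(word))]
    # A crossing-letter constraint is violated when a cell already
    # holds a different letter from a crossing word.
    if any(grid.get(cell, letter) != letter
           for cell, letter in zip(cells, word)):
        return False
    for cell, letter in zip(cells, word):
        grid[cell] = letter
    return True
```

For example, after placing "CAT" across at (0, 0), "TOP" fits going down at (0, 2) because the shared cell agrees on "T", while "DOG" across at (0, 2) is rejected.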
arXiv Detail & Related papers (2025-03-30T20:03:36Z)
- Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z)
- Enhancing Visual Reasoning with Autonomous Imagination in Multimodal Large Language Models [27.78471707423076]
We propose a new visual reasoning paradigm that enables MLLMs to autonomously modify the input scene into new ones based on their reasoning status. We introduce a novel plug-and-play imagination space, where MLLMs conduct visual modifications through operations like focus, ignore, and transform. We validate our approach through a benchmark spanning dense counting, simple jigsaw puzzle solving, and object placement.
arXiv Detail & Related papers (2024-11-27T08:44:25Z)
- Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! [14.84123301554462]
We present UNPIE, a novel benchmark designed to assess the impact of multimodal inputs in resolving lexical ambiguities.
Our dataset includes 1,000 puns, each accompanied by an image that explains both meanings.
The results indicate that various Socratic Models and Visual-Language Models improve over the text-only models when given visual context.
arXiv Detail & Related papers (2024-10-01T19:32:57Z)
- RISCORE: Enhancing In-Context Riddle Solving in Language Models through Context-Reconstructed Example Augmentation [1.9939549451457024]
This paper explores how different prompting techniques impact performance on riddles that demand diverse reasoning skills. We introduce RISCORE, a fully automated prompting method that generates and utilizes contextually reconstructed sentence-based puzzles. Our experiments demonstrate that RISCORE significantly improves the performance of language models in both vertical and lateral thinking tasks.
arXiv Detail & Related papers (2024-09-24T18:35:09Z)
- Auto-Encoding Morph-Tokens for Multimodal LLM [151.2618346912529]
We propose encoding images into morph-tokens to serve a dual purpose: for comprehension, they act as visual prompts instructing the MLLM to generate texts.
Experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously.
arXiv Detail & Related papers (2024-05-03T08:43:06Z)
- Beyond Text: Frozen Large Language Models in Visual Signal Comprehension [34.398976855955404]
The Vision-to-Language Tokenizer, abbreviated as V2T Tokenizer, transforms an image into a "foreign language" with the combined aid of an encoder-decoder, the LLM vocabulary, and a CLIP model.
We undertake rigorous experiments to validate our method, encompassing understanding tasks like image recognition, image captioning, and visual question answering, as well as image denoising tasks like inpainting, outpainting, deblurring, and shift restoration.
arXiv Detail & Related papers (2024-03-12T17:59:51Z)
- Good Questions Help Zero-Shot Image Reasoning [110.1671684828904]
Question-Driven Visual Exploration (QVix) is a novel prompting strategy that enhances the exploratory capabilities of large vision-language models (LVLMs).
QVix enables a wider exploration of visual scenes, improving the LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment.
Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods.
arXiv Detail & Related papers (2023-12-04T03:18:51Z)
- Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [52.935150075484074]
We introduce a well-designed visual tokenizer to translate the non-linguistic image into a sequence of discrete tokens like a foreign language.
The resulting visual tokens encompass high-level semantics worthy of a word and also support dynamic sequence length varying from the image.
This unification empowers LaVIT to serve as an impressive generalist interface to understand and generate multi-modal content simultaneously.
arXiv Detail & Related papers (2023-09-09T03:01:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.