Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning
- URL: http://arxiv.org/abs/2601.03400v1
- Date: Tue, 06 Jan 2026 20:27:29 GMT
- Title: Eye-Q: A Multilingual Benchmark for Visual Word Puzzle Solving and Image-to-Phrase Reasoning
- Authors: Ali Najar, Alireza Mirrokni, Arshia Izadyari, Sadegh Mohammadian, Amir Homayoon Sharifizade, Asal Meskin, Mobin Bagherian, Ehsaneddin Asgari
- Abstract summary: Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping evidence to non-literal concepts. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding.
- Score: 1.6234264741872295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) have achieved strong performance on standard vision-language benchmarks, yet often rely on surface-level recognition rather than deeper reasoning. We propose visual word puzzles as a challenging alternative, as they require discovering implicit visual cues, generating and revising hypotheses, and mapping perceptual evidence to non-literal concepts in ways that are difficult to solve via literal grounding, OCR-heavy shortcuts, or simple retrieval-style matching. We introduce Eye-Q, a multilingual benchmark designed to assess this form of complex visual understanding. Eye-Q contains 1,343 puzzles in which a model observes a conceptually dense scene with a brief description and must infer a specific target word or phrase. The puzzles are intentionally unstructured and cue-implicit, with distractors and contextual relationships that demand selective attention, abstraction, and associative inference. The benchmark spans English, Persian, Arabic, and cross-lingual puzzles. We evaluate state-of-the-art VLMs using an open-ended, human-aligned protocol that probes hypothesis formation and revision under lightweight assistance. Results reveal substantial performance gaps, especially on abstract and cross-lingual puzzles, highlighting limitations in current models' ability to construct and search over appropriate conceptual representations for flexible image-to-phrase inference; maximum accuracy reaches only 60.27%.
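As a rough illustration of how an image-to-phrase benchmark like this might be scored automatically, the sketch below loads puzzles, queries a model, and checks the free-form answer against the target word or phrase. It is a minimal sketch only: the `eyeq.jsonl` file layout, the `query_vlm` callable, and the string-similarity threshold are assumptions for illustration, not the paper's actual protocol, which uses an open-ended, human-aligned judge that allows hypothesis revision under lightweight assistance.

```python
import json
import re
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and strip punctuation so matching is lenient across scripts."""
    return re.sub(r"[^\w\s]", "", text).strip().lower()


def is_correct(prediction: str, target: str, threshold: float = 0.9) -> bool:
    """Accept a free-form prediction if it closely matches the target phrase."""
    return SequenceMatcher(None, normalize(prediction), normalize(target)).ratio() >= threshold


def evaluate(puzzles_path: str, query_vlm) -> float:
    """Score a model on Eye-Q-style puzzles (image + brief description -> phrase).

    query_vlm(image_path, description) is a placeholder for whatever model
    interface is under test; it should return the model's free-form answer.
    """
    with open(puzzles_path, encoding="utf-8") as f:
        puzzles = [json.loads(line) for line in f]
    correct = sum(
        is_correct(query_vlm(p["image"], p["description"]), p["target"])
        for p in puzzles
    )
    return correct / len(puzzles)
```

A fuzzy string match is only a crude proxy for the human-aligned judging described in the abstract; it does not capture the hint-based hypothesis-revision aspect of the evaluation.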
Related papers
- Seeing Through Words: Controlling Visual Retrieval Quality with Language Models [68.49490036960559]
We propose a new paradigm of quality-controllable retrieval, which enriches short queries with contextual details while incorporating explicit notions of image quality. Our key idea is to leverage a generative language model as a query completion function, extending underspecified queries into descriptive forms. Our proposed approach significantly improves retrieval results and provides effective quality control, bridging the gap between the expressive capacity of modern VLMs and the underspecified nature of short user queries.
arXiv Detail & Related papers (2026-02-24T18:20:57Z) - Toward Cognitive Supersensing in Multimodal Large Language Model [67.15559571626747]
We introduce Cognitive Supersensing, a training paradigm that endows MLLMs with human-like visual imagery capabilities. In experiments, MLLMs trained with Cognitive Supersensing significantly outperform state-of-the-art baselines on CogSense-Bench. We will open-source the CogSense-Bench and our model weights.
arXiv Detail & Related papers (2026-02-02T02:19:50Z) - Reasoning or Pattern Matching? Probing Large Vision-Language Models with Visual Puzzles [13.059313134998192]
This survey provides a unified perspective on visual puzzle reasoning in Large Vision-Language Models (LVLMs). We frame visual puzzles through a common abstraction and organize existing benchmarks by the reasoning mechanisms they target. We identify consistent limitations in current models, including brittle generalization, tight entanglement between perception and reasoning, and a persistent gap between fluent explanations and faithful execution.
arXiv Detail & Related papers (2026-01-20T08:02:04Z) - Context Matters: Learning Global Semantics via Object-Centric Representation [8.195437248815802]
Vision models have yet to exhibit comparable progress in in-context learning. We argue that this gap could stem from the lack of semantic and contextual guidance in current vision transformer (ViT) training schemes. We propose to directly model "object" as the visual equivalent of "word," pushing the model to learn the global context and semantics among visual elements.
arXiv Detail & Related papers (2025-10-07T08:33:36Z) - VLHSA: Vision-Language Hierarchical Semantic Alignment for Jigsaw Puzzle Solving with Eroded Gaps [3.6380495892295173]
We propose a vision-language framework that leverages textual context to enhance puzzle assembly performance. Our approach centers on the Vision-Language Hierarchical Semantic Alignment (VLHSA) module. Our work establishes a new paradigm for jigsaw puzzle solving by incorporating multimodal semantic insights.
arXiv Detail & Related papers (2025-09-17T20:40:52Z) - Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint [57.73346054360675]
Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles.
arXiv Detail & Related papers (2025-05-29T17:59:47Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - COLUMBUS: Evaluating COgnitive Lateral Understanding through Multiple-choice reBUSes [14.603382370403]
We formulate visual lateral thinking as a multiple-choice question-answering task. We describe a three-step taxonomy-driven methodology for instantiating task examples. We develop COLUMBUS, a synthetic benchmark that applies the task pipeline to create QA sets with text and icon rebus puzzles.
arXiv Detail & Related papers (2024-09-06T06:49:55Z) - UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding [88.24517460894634]
We propose a unified framework to take advantage of fine-grained information for zero-shot vision-language learning. Our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR.
arXiv Detail & Related papers (2023-07-03T09:03:12Z) - Contextual Object Detection with Multimodal Large Language Models [66.15566719178327]
We introduce a novel research problem of contextual object detection.
Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering.
We present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts.
arXiv Detail & Related papers (2023-05-29T17:50:33Z) - Probing Contextual Language Models for Common Ground with Visual Representations [76.05769268286038]
We design a probing model that evaluates how effective text-only representations are at distinguishing between matching and non-matching visual representations.
Our findings show that language representations alone provide a strong signal for retrieving image patches from the correct object categories.
Visually grounded language models slightly outperform text-only language models in instance retrieval, but greatly underperform humans.
arXiv Detail & Related papers (2020-05-01T21:28:28Z)
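The probing setup described in the last entry above (scoring text-only representations against matching versus non-matching visual representations) can be illustrated with a small bilinear probe. The sketch below is a hedged illustration under assumed embedding dimensions and random stand-in features; it is not the authors' actual architecture or training setup.

```python
import torch
import torch.nn as nn


class MatchProbe(nn.Module):
    """Bilinear probe scoring whether a text embedding matches a visual one.

    Both encoders are placeholders: any frozen language model and any frozen
    vision encoder producing fixed-size embeddings would fit this interface.
    """

    def __init__(self, text_dim: int, image_dim: int):
        super().__init__()
        self.bilinear = nn.Bilinear(text_dim, image_dim, 1)

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        # One logit per (text, image-patch) pair; trained with binary
        # cross-entropy on matching vs. non-matching pairs.
        return self.bilinear(text_emb, image_emb).squeeze(-1)


# Example training step on random stand-in embeddings (real features would
# come from frozen text and vision encoders).
probe = MatchProbe(text_dim=768, image_dim=512)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

text_emb = torch.randn(32, 768)               # token/span representations
image_emb = torch.randn(32, 512)              # image-patch representations
labels = torch.randint(0, 2, (32,)).float()   # 1 = matching pair, 0 = mismatched

logits = probe(text_emb, image_emb)
loss = loss_fn(logits, labels)
loss.backward()
optimizer.step()
```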