Think Visually, Reason Textually: Vision-Language Synergy in ARC
- URL: http://arxiv.org/abs/2511.15703v2
- Date: Wed, 26 Nov 2025 12:23:11 GMT
- Title: Think Visually, Reason Textually: Vision-Language Synergy in ARC
- Authors: Beichen Zhang, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, Jiaqi Wang,
- Abstract summary: ARC-AGI is a rigorous testbed for conceptual rule induction and transfer to novel tasks.<n>Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction.<n>We introduce two synergistic strategies: Vision-Language Synergy Reasoning (VLSR) and Modality-Switch Self-Correction (MSSC)<n>Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence.
- Score: 94.15522924153264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Abstract reasoning from minimal examples remains a core unsolved problem for frontier foundation models such as GPT-5 and Grok 4. These models still fail to infer structured transformation rules from a handful of examples, which is a key hallmark of human intelligence. The Abstraction and Reasoning Corpus for Artificial General Intelligence (ARC-AGI) provides a rigorous testbed for this capability, demanding conceptual rule induction and transfer to novel tasks. Most existing methods treat ARC-AGI as a purely textual reasoning task, overlooking the fact that humans rely heavily on visual abstraction when solving such puzzles. However, our pilot experiments reveal a paradox: naively rendering ARC-AGI grids as images degrades performance due to imprecise rule execution. This leads to our central hypothesis that vision and language possess complementary strengths across distinct reasoning stages: vision supports global pattern abstraction and verification, whereas language specializes in symbolic rule formulation and precise execution. Building on this insight, we introduce two synergistic strategies: (1) Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks; and (2) Modality-Switch Self-Correction (MSSC), which leverages vision to verify text-based reasoning for intrinsic error correction. Extensive experiments demonstrate that our approach yields up to a 4.33\% improvement over text-only baselines across diverse flagship models and multiple ARC-AGI tasks. Our findings suggest that unifying visual abstraction with linguistic reasoning is a crucial step toward achieving generalizable, human-like intelligence in future foundation models. Source code is released at https://github.com/InternLM/ARC-VL.
Related papers
- Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities? [61.533560295383786]
Unified Multimodal Large Language Models (U-MLLMs) integrate understanding and generation within a single architecture.<n>We observe that U-MLLMs fail to maintain semantic equivalence when required to render the same results in the image modality.<n>We introduce VGUBench, a framework to decouple reasoning logic from generation fidelity.
arXiv Detail & Related papers (2026-02-27T06:23:56Z) - Stable Language Guidance for Vision-Language-Action Models [62.80963701282789]
Residual Semantic Steering is a probabilistic framework that disentangles physical affordance from semantic execution.<n> RSS achieves state-of-the-art robustness, maintaining performance even under adversarial linguistic perturbations.
arXiv Detail & Related papers (2026-01-07T16:16:10Z) - ARC Is a Vision Problem! [50.59206008530851]
We formulate ARC within a vision paradigm, framing it as an image-to-image translation problem.<n>Our framework, termed Vision ARC, achieves 60.4% accuracy on the ARC-1 benchmark.
arXiv Detail & Related papers (2025-11-18T18:59:49Z) - BLINK-Twice: You see, but do you observe? A Reasoning Benchmark on Visual Perception [67.89135437537179]
We introduce BLINK-Twice, a vision-centric reasoning benchmark grounded in challenging perceptual tasks.<n>Instead of relying on external knowledge, our tasks require models to reason from visual content alone.<n>Compared to prior perception benchmarks, it moves beyond shallow perception and requires fine-grained observation and analytical reasoning.
arXiv Detail & Related papers (2025-10-10T13:14:13Z) - EasyARC: Evaluating Vision Language Models on True Visual Reasoning [0.0]
We introduce EasyARC, a vision-language benchmark requiring multi-image, multi-step reasoning, and self-correction.<n>EasyARC is procedurally generated, fully verifiable, and scalable, making it ideal for reinforcement learning pipelines.<n>We benchmark state-of-the-art vision-language models and analyze their failure modes.
arXiv Detail & Related papers (2025-06-13T09:03:33Z) - Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning [95.44766931218896]
Multi-modal large language models (MLLMs) still lag behind text-based reasoning.<n>We introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable.<n>We propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO) to align the MLLM's perceptual output with the final reasoning task.
arXiv Detail & Related papers (2025-06-05T02:28:07Z) - Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities.<n>Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z) - Tackling the Abstraction and Reasoning Corpus with Vision Transformers: the Importance of 2D Representation, Positions, and Objects [31.926206783846144]
We show that a Vision Transformer (ViT) fails dramatically on most ARC tasks even when trained on one million examples per task.<n>We propose ViTARC, a ViT-style architecture that unlocks some of the visual reasoning capabilities required by the ARC.<n>Our task-specific ViTARC models achieve a test solve rate close to 100% on more than half of the 400 public ARC tasks.
arXiv Detail & Related papers (2024-10-08T22:25:34Z) - LLMs and the Abstraction and Reasoning Corpus: Successes, Failures, and
the Importance of Object-based Representations [50.431003245201644]
We show that GPT-4 is unable to "reason" perfectly within non-language domains such as the 1D-ARC or a simple ARC subset.
We propose an object-based representation that is obtained through an external tool, resulting in nearly doubling the performance on solved ARC tasks and near-perfect scores on the easier 1D-ARC.
arXiv Detail & Related papers (2023-05-26T16:32:17Z) - Abstract Visual Reasoning Enabled by Language [8.627180519837657]
We propose a general learning-based framework for solving ARC.
It is centered on transforming tasks from the vision to the language domain.
This composition of language and vision allows for pre-trained models to be leveraged at each stage.
arXiv Detail & Related papers (2023-03-07T17:52:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.