How Modality Shapes Perception and Reasoning: A Study of Error Propagation in ARC-AGI
- URL: http://arxiv.org/abs/2511.15717v1
- Date: Tue, 11 Nov 2025 19:06:41 GMT
- Title: How Modality Shapes Perception and Reasoning: A Study of Error Propagation in ARC-AGI
- Authors: Bo Wen, Chen Wang, Erhan Bilal
- Abstract summary: ARC-AGI and ARC-AGI-2 measure generalization-through-composition on small color-quantized grids. Recent instruction-first systems translate grids into concise natural-language or DSL rules executed in generate-execute-select loops.
- Score: 7.226300346775942
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: ARC-AGI and ARC-AGI-2 measure generalization-through-composition on small color-quantized grids, and their prize competitions make progress on these harder held-out tasks a meaningful proxy for systematic generalization. Recent instruction-first systems translate grids into concise natural-language or DSL rules executed in generate-execute-select loops, yet we lack a principled account of how encodings shape model perception and how to separate instruction errors from execution errors. We hypothesize that modality imposes perceptual bottlenecks -- text flattens 2D structure into 1D tokens while images preserve layout but can introduce patch-size aliasing -- thereby shaping which grid features are reliably perceived. To test this, we isolate perception from reasoning across nine text and image modalities using a weighted set-disagreement metric and a two-stage reasoning pipeline, finding that structured text yields precise coordinates on sparse features, images capture 2D shapes yet are resolution-sensitive, and combining them improves execution (about 8 perception points; about 0.20 median similarity). Overall, aligning representations with transformer inductive biases and enabling cross-validation between text and image yields more accurate instructions and more reliable execution without changing the underlying model.
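The abstract separates perception errors from reasoning errors with a weighted set-disagreement metric, but does not spell out its formula. As a rough illustration of the idea only, the sketch below scores perception by comparing the set of (row, col, color) cells a model reports against the ground-truth grid, weighting missed and spurious cells; the function names, weights, and normalization here are assumptions, not the paper's definition.

```python
def grid_to_cells(grid):
    """Flatten a 2D color grid into a set of (row, col, color) triples."""
    return {(r, c, v) for r, row in enumerate(grid) for c, v in enumerate(row)}

def weighted_set_disagreement(pred, truth, miss_w=1.0, extra_w=1.0):
    """Illustrative disagreement score in [0, 1]; 0 means identical sets.

    Weights are hypothetical: miss_w penalizes ground-truth cells the model
    failed to perceive, extra_w penalizes cells it reported that do not exist.
    """
    misses = truth - pred   # cells present in the grid but not perceived
    extras = pred - truth   # cells perceived but not present in the grid
    total = miss_w * len(misses) + extra_w * len(extras)
    denom = miss_w * len(truth) + extra_w * len(pred)
    return total / denom if denom else 0.0

truth = grid_to_cells([[0, 1], [1, 0]])
pred = grid_to_cells([[0, 1], [1, 1]])   # one cell perceived with the wrong color
score = weighted_set_disagreement(pred, truth)   # one miss + one extra -> 0.25
```

A set-based score like this is indifferent to how the grid was serialized (row-major text, structured coordinates, or an image transcription), which is what lets perception quality be compared across the nine modalities the abstract mentions.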
Related papers
- DECOR:Decomposition and Projection of Text Embeddings for Text-to-Image Customization [15.920735314050296]
This study decomposes the text embedding matrix and conducts a component analysis to understand the embedding space geometry. We propose DECOR, which projects text embeddings onto a vector space orthogonal to undesired token vectors. Experimental results demonstrate that DECOR outperforms state-of-the-art customization models.
arXiv Detail & Related papers (2024-12-12T10:59:44Z)
- TPIE: Topology-Preserved Image Editing With Text Instructions [14.399084325078878]
Topology-Preserved Image Editing with text instructions (TPIE)
TPIE treats newly generated samples as deformable variations of a given input template, allowing for controllable and structure-preserving edits.
We validate TPIE on a diverse set of 2D and 3D images and compare them with state-of-the-art image editing approaches.
arXiv Detail & Related papers (2024-11-22T22:08:27Z)
- Object-Attribute Binding in Text-to-Image Generation: Evaluation and Control [58.37323932401379]
Current diffusion models create images given a text prompt as input but struggle to correctly bind attributes mentioned in the text to the right objects in the image.
We propose focused cross-attention (FCA) that controls the visual attention maps by syntactic constraints found in the input sentence.
We show substantial improvements in T2I generation and especially its attribute-object binding on several datasets.
arXiv Detail & Related papers (2024-04-21T20:26:46Z)
- Improving Joint Speech-Text Representations Without Alignment [92.60384956736536]
We show that joint speech-text encoders naturally achieve consistent representations across modalities by disregarding sequence length.
We argue that consistency losses could forgive length differences and simply assume the best alignment.
arXiv Detail & Related papers (2023-08-11T13:28:48Z)
- LRANet: Towards Accurate and Efficient Scene Text Detection with Low-Rank Approximation Network [63.554061288184165]
We propose a novel parameterized text shape method based on low-rank approximation.
By exploring the shape correlation among different text contours, our method achieves consistency, compactness, simplicity, and robustness in shape representation.
We implement an accurate and efficient arbitrary-shaped text detector named LRANet.
arXiv Detail & Related papers (2023-06-27T02:03:46Z)
- Text Reading Order in Uncontrolled Conditions by Sparse Graph Segmentation [71.40119152422295]
We propose a lightweight, scalable and generalizable approach to identify text reading order.
The model is language-agnostic and runs effectively across multi-language datasets.
It is small enough to be deployed on virtually any platform including mobile devices.
arXiv Detail & Related papers (2023-05-04T06:21:00Z)
- Primitive Representation Learning for Scene Text Recognition [7.818765015637802]
We propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images.
A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding.
We also propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods.
arXiv Detail & Related papers (2021-05-10T11:54:49Z)
- ABCNet v2: Adaptive Bezier-Curve Network for Real-time End-to-end Text Spotting [108.93803186429017]
End-to-end text-spotting aims to integrate detection and recognition in a unified framework.
Here, we tackle end-to-end text spotting by presenting Adaptive Bezier Curve Network v2 (ABCNet v2).
Our main contributions are four-fold: 1) For the first time, we adaptively fit arbitrarily-shaped text with a parameterized Bezier curve, which, compared with segmentation-based methods, provides not only a structured output but also a controllable representation.
Comprehensive experiments conducted on various bilingual (English and Chinese) benchmark datasets demonstrate that ABCNet v2 achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-05-08T07:46:55Z)
- ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene Text Detection [147.10751375922035]
We propose the ContourNet, which effectively handles false positives and large scale variance of scene texts.
Our method effectively suppresses these false positives by only outputting predictions with high response value in both directions.
arXiv Detail & Related papers (2020-04-10T08:15:23Z)
- Scene Text Recognition With Finer Grid Rectification [6.598317412802175]
This paper proposes an end-to-end trainable model consisting of a finer rectification module and a bidirectional attentional recognition network (Firbarn).
The results of extensive evaluation on the standard benchmarks show Firbarn outperforms previous works, especially on irregular datasets.
arXiv Detail & Related papers (2020-01-26T02:40:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.