Related papers: LoopViT: Scaling Visual ARC with Looped Transformers

LoopViT: Scaling Visual ARC with Looped Transformers

URL: http://arxiv.org/abs/2602.02156v1
Date: Mon, 02 Feb 2026 14:32:57 GMT
Title: LoopViT: Scaling Visual ARC with Looped Transformers
Authors: Wen-Jie Shu, Xuerui Qiu, Rui-Jie Zhu, Harold Haodong Chen, Yexin Liu, Harry Yang,
Abstract summary: We propose Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence.<n>Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions and global attention, to form a latent chain of thought.<n> Empirical results on the ARC-AGI-1 benchmark validate this perspective.
Score: 14.9105267508928
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in visual reasoning have leveraged vision transformers to tackle the ARC-AGI benchmark. However, we argue that the feed-forward architecture, where computational depth is strictly bound to parameter size, falls short of capturing the iterative, algorithmic nature of human induction. In this work, we propose a recursive architecture called Loop-ViT, which decouples reasoning depth from model capacity through weight-tied recurrence. Loop-ViT iterates a weight-tied Hybrid Block, combining local convolutions and global attention, to form a latent chain of thought. Crucially, we introduce a parameter-free Dynamic Exit mechanism based on predictive entropy: the model halts inference when its internal state ``crystallizes" into a low-uncertainty attractor. Empirical results on the ARC-AGI-1 benchmark validate this perspective: our 18M model achieves 65.8% accuracy, outperforming massive 73M-parameter ensembles. These findings demonstrate that adaptive iterative computation offers a far more efficient scaling axis for visual reasoning than simply increasing network width. The code is available at https://github.com/WenjieShu/LoopViT.

Related papers

Merging Beyond: Streaming LLM Updates via Activation-Guided Rotations [55.047454145941366]
Streaming Merging is an innovative model updating paradigm that conceptualizes merging as an iterative optimization process.<n> ARM is a strategy designed to approximate gradient descent dynamics.<n> ARM requires only early SFT checkpoints and, through iterative merging, surpasses the fully converged SFT model.
arXiv Detail & Related papers (2026-02-03T08:15:57Z)
DeepPrune: Parallel Scaling without Inter-trace Redundancy [53.62015294143274]
Over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation.<n>We propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning.<n>Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient.
arXiv Detail & Related papers (2025-10-09T17:24:54Z)
Inner Thinking Transformer: Leveraging Dynamic Depth Scaling to Foster Adaptive Internal Thinking [51.154226183713405]
We propose Inner Thinking Transformer, which reimagines layer computations as implicit thinking steps.<n>ITT achieves 96.5% performance of a 466M Transformer using only 162M parameters, reduces training data by 43.2%, and outperforms Transformer/Loop variants in 11 benchmarks.
arXiv Detail & Related papers (2025-02-19T16:02:23Z)
ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts.<n>Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
Pruning By Explaining Revisited: Optimizing Attribution Methods to Prune CNNs and Transformers [14.756988176469365]
An effective approach to reduce computational requirements and increase efficiency is to prune unnecessary components of Deep Neural Networks. Previous work has shown that attribution methods from the field of eXplainable AI serve as effective means to extract and prune the least relevant network components in a few-shot fashion.
arXiv Detail & Related papers (2024-08-22T17:35:18Z)
LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre Memory Units [5.830814457423021]
Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability. We show how architectural modifications to a recurrent model can help push its performance toward Transformer models. We present a spiking version of this architecture, which introduces the benefit of states within the patch embedding and channel mixer modules.
arXiv Detail & Related papers (2024-01-20T01:10:18Z)
Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose A graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges. Considering the redundancy in existing architectures, we first utilize the mode approximation to generate 0.1M trainable parameters to implement the multimodal prompt tuning. A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state-of-the-art but even outperforms the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z)
ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
DCT-Former: Efficient Self-Attention with Discrete Cosine Transform [4.622165486890318]
An intrinsic limitation of the Trasformer architectures arises from the computation of the dot-product attention. Our idea takes inspiration from the world of lossy data compression (such as the JPEG algorithm) to derive an approximation of the attention module. An extensive section of experiments shows that our method takes up less memory for the same performance, while also drastically reducing inference time.
arXiv Detail & Related papers (2022-03-02T15:25:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.