Related papers: On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning

URL: http://arxiv.org/abs/2601.03048v1
Date: Tue, 06 Jan 2026 14:32:40 GMT
Title: On the Intrinsic Limits of Transformer Image Embeddings in Non-Solvable Spatial Reasoning
Authors: Siyi Lyu, Quan Liu, Feng Yan
Abstract summary: We show that Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation. We formalize a complexity boundary: constant-depth ViTs fundamentally lack the logical depth to efficiently capture non-solvable spatial structures. We validate this complexity gap via latent-space probing, demonstrating that ViT representations suffer a structural collapse on non-solvable tasks as compositional depth increases.
Score: 4.907226678338655
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision Transformers (ViTs) excel in semantic recognition but exhibit systematic failures in spatial reasoning tasks such as mental rotation. While often attributed to data scale, we propose that this limitation arises from the intrinsic circuit complexity of the architecture. We formalize spatial understanding as learning a Group Homomorphism: mapping image sequences to a latent space that preserves the algebraic structure of the underlying transformation group. We demonstrate that for non-solvable groups (e.g., the 3D rotation group $\mathrm{SO}(3)$), maintaining such a structure-preserving embedding is computationally lower-bounded by the Word Problem, which is $\mathsf{NC^1}$-complete. In contrast, we prove that constant-depth ViTs with polynomial precision are strictly bounded by $\mathsf{TC^0}$. Under the conjecture $\mathsf{TC^0} \subsetneq \mathsf{NC^1}$, we establish a complexity boundary: constant-depth ViTs fundamentally lack the logical depth to efficiently capture non-solvable spatial structures. We validate this complexity gap via latent-space probing, demonstrating that ViT representations suffer a structural collapse on non-solvable tasks as compositional depth increases.
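To make the Word Problem mentioned in the abstract concrete, the following is a minimal sketch, not the authors' construction. It uses the alternating group A5 (the smallest non-solvable group, which also sits inside SO(3) as the rotation group of the icosahedron) as a finite stand-in for SO(3). The function names `compose`, `word_product`, and `is_identity_word` are hypothetical and chosen for illustration only. The task is to decide whether a sequence of group elements multiplies out to the identity; the loop below does this sequentially, one composition per input symbol.

```python
from itertools import permutations

IDENTITY = (0, 1, 2, 3, 4)


def compose(p, q):
    """Return the permutation that applies q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))


def inverse(p):
    """Return the inverse permutation of p."""
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)


def is_even(p):
    """A permutation is even if it has an even number of inversions."""
    n = len(p)
    inversions = sum(1 for i in range(n) for j in range(i + 1, n) if p[i] > p[j])
    return inversions % 2 == 0


# A5: the even permutations of five points (assumed stand-in for SO(3)).
A5 = [p for p in permutations(range(5)) if is_even(p)]


def word_product(word):
    """Multiply out a word, applying its elements left to right."""
    acc = IDENTITY
    for g in word:
        acc = compose(g, acc)
    return acc


def is_identity_word(word):
    """Word Problem: does the word evaluate to the identity element?"""
    return word_product(word) == IDENTITY


# Two 3-cycles that share one point, so they do not commute.
a = (1, 2, 0, 3, 4)  # cycle 0 -> 1 -> 2 -> 0
b = (0, 1, 3, 4, 2)  # cycle 2 -> 3 -> 4 -> 2

trivial_word = [a, inverse(a)]
commutator_word = [a, b, inverse(a), inverse(b)]
```

Here `is_identity_word(trivial_word)` holds, while the commutator word does not reduce to the identity because `a` and `b` do not commute. In a solvable group, repeated commutators eventually collapse to the identity; in A5 they do not, which is the structural property the abstract's complexity argument relies on.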

Related papers

The Geometry of Abstraction: Continual Learning via Recursive Quotienting [6.0044467881527614]
Continual learning systems face a fundamental geometric barrier: the flat manifold problem. We propose a geometric resolution to this paradox based on Recursive Metric Contraction. We show that tokens in neural architectures are physically realizable as singularities or wormholes.
arXiv Detail & Related papers (2025-12-20T19:10:38Z)
Memory-Amortized Inference: A Topological Unification of Search, Closure, and Structure [6.0044467881527614]
We propose Memory-Amortized Inference (MAI), a formal framework that unifies learning and memory as phase transitions of a single geometric substrate. We show that cognition operates by converting high-complexity search into low-complexity lookup. This framework offers a rigorous explanation for the emergence of fast-thinking (intuition) from slow-thinking (reasoning).
arXiv Detail & Related papers (2025-11-28T16:28:24Z)
Expressive Power of Deep Networks on Manifolds: Simultaneous Approximation [2.815765641180636]
We show that a constant-depth $\mathrm{ReLU}^{k-1}$ network with bounded weights can approximate any function in the Sobolev space. We also prove that our construction is nearly optimal by showing the required number of parameters matches up to a logarithmic factor.
arXiv Detail & Related papers (2025-09-11T11:28:20Z)
Why and When Deep is Better than Shallow: An Implementation-Agnostic State-Transition View of Depth Supremacy [15.310099705870114]
We formulate a deep model as an abstract state-transition semigroup acting on a general metric space. We separate the implementation (e.g., ReLU nets, transformers, and chain-of-thought) from the abstract state transition. We prove a bias-variance decomposition in which the dependence depends only on the abstract depth-$k$ network and not on the implementation.
arXiv Detail & Related papers (2025-05-21T03:32:30Z)
Recursive Self-Similarity in Deep Weight Spaces of Neural Architectures: A Fractal and Coarse Geometry Perspective [2.9130383514140292]
This paper conceptualizes the Deep Weight Spaces as hierarchical, fractal-like, coarse geometric structures observable at discrete integer scales. We introduce a coarse group action termed the fractal transformation, $T_{r_k}$, acting under the symmetry group $G = (\mathbb{Z}, +)$. This perspective adopts a box count technique, commonly used to assess the hierarchical and scale-related geometry of physical structures.
arXiv Detail & Related papers (2025-03-18T14:41:23Z)
Large Spatial Model: End-to-end Unposed Images to Semantic 3D [79.94479633598102]
Large Spatial Model (LSM) processes unposed RGB images directly into semantic radiance fields. LSM simultaneously estimates geometry, appearance, and semantics in a single feed-forward operation. It can generate versatile label maps by interacting with language at novel viewpoints.
arXiv Detail & Related papers (2024-10-24T17:54:42Z)
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetics and symbolic reasoning tasks. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z)
Rethinking SO(3)-equivariance with Bilinear Tensor Networks [0.0]
We show that by judicious symmetry breaking, we can efficiently increase the expressiveness of a network operating only on vector and order-2 tensor representations of $\mathrm{SO}(2)$. We demonstrate the method on an important problem from High Energy Physics known as b-tagging, where particle jets originating from b-meson decays must be discriminated from an overwhelming QCD background.
arXiv Detail & Related papers (2023-03-20T17:23:15Z)
How Do Transformers Learn Topic Structure: Towards a Mechanistic Understanding [56.222097640468306]
We provide a mechanistic understanding of how transformers learn "semantic structure". We show, through a combination of mathematical analysis and experiments on Wikipedia data, that the embedding layer and the self-attention layer encode the topical structure.
arXiv Detail & Related papers (2023-03-07T21:42:17Z)
A Scalable Combinatorial Solver for Elastic Geometrically Consistent 3D Shape Matching [69.14632473279651]
We present a scalable algorithm for globally optimizing over the space of geometrically consistent mappings between 3D shapes. We propose a novel primal coupled with a Lagrange dual problem that is several orders of magnitude faster than previous solvers.
arXiv Detail & Related papers (2022-04-27T09:47:47Z)
Bounds on quantum evolution complexity via lattice cryptography [0.0]
We address the difference between integrable and chaotic motion in quantum theory as manifested by the complexity of the corresponding evolution operators. Complexity is understood here as the shortest geodesic distance between the time-dependent evolution operator and the origin within the group of unitaries.
arXiv Detail & Related papers (2022-02-28T16:20:10Z)
Deep Implicit Templates for 3D Shape Representation [70.9789507686618]
We propose a new 3D shape representation that supports explicit correspondence reasoning in deep implicit representations. Our key idea is to formulate DIFs as conditional deformations of a template implicit function. We show that our method can not only learn a common implicit template for a collection of shapes, but also establish dense correspondences across all the shapes simultaneously without any supervision.
arXiv Detail & Related papers (2020-11-30T06:01:49Z)
Dense Non-Rigid Structure from Motion: A Manifold Viewpoint [162.88686222340962]
The Non-Rigid Structure-from-Motion (NRSfM) problem aims to recover the 3D geometry of a deforming object from its 2D feature correspondences across multiple frames. We show that our approach significantly improves accuracy, scalability, and robustness against noise.
arXiv Detail & Related papers (2020-06-15T09:15:54Z)

This list is automatically generated from the titles and abstracts of the papers on this site.