Related papers: Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers

URL: http://arxiv.org/abs/2512.17351v1
Date: Fri, 19 Dec 2025 08:47:28 GMT
Title: Physics of Language Models: Part 4.1, Architecture Design and the Magic of Canon Layers
Authors: Zeyuan Allen-Zhu
Abstract summary: We introduce controlled synthetic pretraining tasks that isolate and evaluate core model capabilities. Within this framework, we discover CANON LAYERS that promote horizontal information flow across neighboring tokens. This includes how Canon layers enhance reasoning depth (e.g., by $2\times$), reasoning breadth, knowledge manipulation, etc.
Score: 21.6340059114965
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding architectural differences in language models is challenging, especially at academic-scale pretraining (e.g., 1.3B parameters, 100B tokens), where results are often dominated by noise and randomness. To overcome this, we introduce controlled synthetic pretraining tasks that isolate and evaluate core model capabilities. Within this framework, we discover CANON LAYERS: lightweight architectural components -- named after the musical term "canon" -- that promote horizontal information flow across neighboring tokens. Canon layers compute weighted sums of nearby token representations and integrate seamlessly into Transformers, linear attention, state-space models, or any sequence architecture. We present 12 key results. This includes how Canon layers enhance reasoning depth (e.g., by $2\times$), reasoning breadth, knowledge manipulation, etc. They lift weak architectures like NoPE to match RoPE, and linear attention to rival SOTA linear models like Mamba2/GDN -- validated both through synthetic tasks and real-world academic-scale pretraining. This synthetic playground offers an economical, principled path to isolate core model capabilities often obscured at academic scales. Equipped with infinite high-quality data, it may even PREDICT how future architectures will behave as training pipelines improve -- e.g., through better data curation or RL-based post-training -- unlocking deeper reasoning and hierarchical inference.
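Illustration: the abstract describes Canon layers as computing "weighted sums of nearby token representations". The snippet below is a minimal sketch of that idea only, not the paper's implementation. The function name `canon_like_mix`, the causal (look-back only) window, the residual connection, and the fixed scalar weights are all assumptions made for illustration; the abstract does not specify any of them.

```python
from typing import List


def canon_like_mix(hidden: List[List[float]], weights: List[float]) -> List[List[float]]:
    """Add a weighted sum of each token's nearby predecessors to its own vector.

    hidden:  one feature vector per token, in sequence order.
    weights: weights[k] scales the token k positions back (k = 0 is the token itself).
    Assumption: causal window plus residual connection (hypothetical design).
    """
    out = []
    for t, vec in enumerate(hidden):
        mixed = list(vec)  # residual: start from the token's own representation
        for k, w in enumerate(weights):
            if t - k < 0:
                break  # no tokens exist before the start of the sequence
            src = hidden[t - k]
            for d in range(len(mixed)):
                mixed[d] += w * src[d]
        out.append(mixed)
    return out


# Three tokens with two features each; window covers the token and one predecessor.
tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
mixed = canon_like_mix(tokens, weights=[0.5, 0.25])
```

Each output vector depends only on a few adjacent positions, which is one way to read the "horizontal information flow across neighboring tokens" mentioned in the abstract.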

Related papers

Looking beyond the next token [75.00751370502168]
We argue that rearranging and processing the training data sequences can allow models to more accurately imitate the true data-generating process. Our method naturally enables the generation of long-term goals at no additional cost.
arXiv Detail & Related papers (2025-04-15T16:09:06Z)
Stack Transformer Based Spatial-Temporal Attention Model for Dynamic Sign Language and Fingerspelling Recognition [1.949837893170278]
Hand gesture-based Sign Language Recognition serves as a crucial bridge between deaf and non-deaf individuals. We propose the Sequential Spatio-Temporal Attention Network (SSTAN), a novel Transformer-based architecture. We validated our model through extensive experiments on diverse, large-scale datasets.
arXiv Detail & Related papers (2025-03-21T04:57:18Z)
Large Concept Models: Language Modeling in a Sentence Representation Space [62.73366944266477]
We present an attempt at an architecture which operates on an explicit higher-level semantic representation, which we name a concept. Concepts are language- and modality-agnostic and represent a higher level idea or action in a flow. We show that our model exhibits impressive zero-shot generalization performance to many languages.
arXiv Detail & Related papers (2024-12-11T23:36:20Z)
Adaptive Large Language Models By Layerwise Attention Shortcuts [46.76681147411957]
LLM-like setups allow the final layer to attend to all of the intermediate layers as it deems fit through the attention mechanism. We showcase four different datasets, namely acoustic tokens, natural language, and symbolic music, and we achieve superior performance for GPT-like architecture.
arXiv Detail & Related papers (2024-09-17T03:46:01Z)
Wavelet GPT: Wavelet Inspired Large Language Models [1.2328446298523066]
Large Language Models (LLMs) have ushered in a new wave of artificial intelligence advancements. This paper infuses LLMs with a traditional signal processing idea, namely wavelets, during pre-training to take advantage of the structure. We achieve the same pre-training performance almost twice as fast in text, audio, and images.
arXiv Detail & Related papers (2024-09-04T03:17:19Z)
Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups. We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective. Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization. We train models with over 5 billion parameters for more than 170 billion tokens. We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z)
Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules. Inputs to the model are routed through a sequence of functions in a way that is end-to-end learned. We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample-efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z)
An Efficient Image-to-Image Translation HourGlass-based Architecture for Object Pushing Policy Learning [20.77172985076276]
Humans effortlessly solve pushing tasks in everyday life but unlocking these capabilities remains a challenge in robotics. We present an architecture combining a predictor of which pushes lead to changes in the environment with a state-action value predictor dedicated to the pushing task. We demonstrate in simulation experiments with a UR5 robot arm that our overall architecture helps the DQN learn faster and achieve higher performance.
arXiv Detail & Related papers (2021-08-02T16:46:08Z)
The Nonlinearity Coefficient -- A Practical Guide to Neural Architecture Design [3.04585143845864]
We develop methods that can predict, without any training, whether an architecture will achieve a relatively high test or training error on a task after training. We then go on to explain the error in terms of the architecture definition itself and develop tools for modifying the architecture. Our first major contribution is to show that the 'degree of nonlinearity' of a neural architecture is a key causal driver behind its performance.
arXiv Detail & Related papers (2021-05-25T20:47:43Z)
Deep Imitation Learning for Bimanual Robotic Manipulation [70.56142804957187]
We present a deep imitation learning framework for robotic bimanual manipulation. A core challenge is to generalize the manipulation skills to objects in different locations. We propose to (i) decompose the multi-modal dynamics into elemental movement primitives, (ii) parameterize each primitive using a recurrent graph neural network to capture interactions, and (iii) integrate a high-level planner that composes primitives sequentially and a low-level controller to combine primitive dynamics and inverse kinematics control.
arXiv Detail & Related papers (2020-10-11T01:40:03Z)

This list is automatically generated from the titles and abstracts of the papers on this site.