PRISM: Deriving the Transformer as a Signal-Denoising Operator via Maximum Coding Rate Reduction
- URL: http://arxiv.org/abs/2601.15540v1
- Date: Wed, 21 Jan 2026 23:52:36 GMT
- Title: PRISM: Deriving the Transformer as a Signal-Denoising Operator via Maximum Coding Rate Reduction
- Authors: Dongchen Huang
- Abstract summary: We propose Prism, a white-box attention-based architecture for deep learning.
We show that Prism spontaneously specializes its attention heads into spectrally distinct regimes.
Our results suggest that interpretability and performance are not a trade-off, but can be unified through principled construction.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning models, particularly Transformers, are often criticized as "black boxes" that lack interpretability. We propose Prism, a white-box attention-based architecture derived from the principle of Maximum Coding Rate Reduction ($\text{MCR}^2$). By modeling the attention mechanism as a gradient ascent process on a distinct signal-noise manifold, we introduce two physical constraints: an overcomplete dictionary to expand the representational phase space, and an irrational frequency separation ($\pi$-RoPE) to enforce incoherence between the signal and noise subspaces. We demonstrate that these geometric inductive biases act as physical constraints and are alone sufficient to induce unsupervised functional disentanglement. Using TinyStories as a controlled testbed for verifying spectral dynamics, we observe that Prism spontaneously specializes its attention heads into spectrally distinct regimes: low-frequency heads capture long-range causal dependencies (signal), while high-frequency heads handle local syntactic constraints (noise). Our results suggest that interpretability and performance are not a trade-off, but can be unified through principled geometric construction.
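The abstract gives no implementation details; as a rough, hypothetical sketch of the $\pi$-RoPE idea (the function names, the half-and-half head split, and the exact scaling are assumptions, not the authors' code), one group of heads can keep the standard RoPE frequencies while the other group has its frequencies rescaled by the irrational factor $\pi$, so the two groups never share a rotation frequency:

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i = 0 .. d/2 - 1."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def pi_rope_frequencies(head_dim: int, n_heads: int) -> np.ndarray:
    """Hypothetical pi-RoPE: the first half of the heads ("signal") keep the
    standard frequencies; the second half ("noise") have theirs rescaled by
    the irrational factor pi, so the two groups share no rotation frequency."""
    base_freqs = rope_frequencies(head_dim)
    return np.stack([
        base_freqs if h < n_heads // 2 else base_freqs * np.pi
        for h in range(n_heads)
    ])  # shape: (n_heads, head_dim // 2)

def apply_rope(x: np.ndarray, freqs: np.ndarray) -> np.ndarray:
    """Rotate consecutive feature pairs of x (seq_len, head_dim) by
    position-dependent angles, as in standard rotary embeddings."""
    angles = np.outer(np.arange(x.shape[0]), freqs)  # (seq_len, head_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```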
Related papers
- The Malignant Tail: Spectral Segregation of Label Noise in Over-Parameterized Networks [0.0]
We experimentally isolate the Malignant Tail, a failure mode where networks functionally segregate signal and noise.
We show that untrained networks actively segregate noise, allowing post-hoc Explicit Spectral Truncation to surgically prune the noise-dominated subspace.
Our findings suggest that under label noise, excess spectral capacity is not harmless redundancy but a latent structural liability.
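The summary does not say how the pruning is performed; a minimal sketch of one plausible post-hoc spectral truncation (the function name and rank-selection rule are assumptions, not necessarily the paper's Explicit Spectral Truncation) is:

```python
import numpy as np

def spectral_truncate(W: np.ndarray, keep: int) -> np.ndarray:
    """Keep only the top-`keep` singular directions of a weight matrix,
    zeroing the tail subspace where, per the summary above, label noise
    tends to concentrate."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S[keep:] = 0.0
    return (U * S) @ Vt
```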
arXiv Detail & Related papers (2026-03-02T16:39:42Z)
- FUTON: Fourier Tensor Network for Implicit Neural Representations [56.48739018255443]
Implicit neural representations (INRs) have emerged as powerful tools for encoding signals, yet dominant designs often suffer from slow convergence, overfitting to noise, and poor extrapolation.
We introduce FUTON, which models signals as generalized Fourier series whose coefficients are parameterized by a low-rank tensor decomposition.
arXiv Detail & Related papers (2026-02-13T19:31:44Z) - Parallel Complex Diffusion for Scalable Time Series Generation [50.01609741902786]
PaCoDi is a spectral-native architecture that decouples generative modeling in the frequency domain.
We show that PaCoDi outperforms existing baselines in both generation quality and inference speed.
arXiv Detail & Related papers (2026-02-10T14:31:53Z) - Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization [78.94590726578014]
Multimodal reasoning models (MLRMs) remain prone to hallucinations, and effective solutions are still underexplored.
We propose C3PO, a training-based mitigation framework comprising CoT Compression and Contrastive Preference Optimization.
arXiv Detail & Related papers (2026-02-03T11:00:55Z) - TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors [53.891337639229285]
We introduce TensorLens, a novel formulation that captures the entire transformer as a single, input-dependent linear operator expressed through a high-order attention-interaction connection.
Our experiments demonstrate that the attention tensor can serve as a powerful foundation for developing tools aimed at interpretability and model understanding.
arXiv Detail & Related papers (2026-01-25T19:21:25Z) - The Homogeneity Trap: Spectral Collapse in Doubly-Stochastic Deep Networks [1.7523718031184992]
We identify a critical spectral degradation phenomenon inherent to structure-preserving deep architectures.
We show that a maximum-entropy bias drives the mixing operator towards the uniform barycenter, suppressing the subdominant singular value.
We derive a spectral bound linking this suppression to the network's effective depth, showing that high-entropy constraints restrict feature transformation to a shallow receptive field.
arXiv Detail & Related papers (2026-01-05T13:09:42Z) - Out-of-Time-Order Correlator Spectroscopy [3.9083778058145864]
We show that higher-order OTOCs fit within the framework of quantum signal processing.
We further generalize higher-order OTOCs by transformation of the singular values of the spatially resolved truncated propagator.
This extends conventional OTOCs into a mode-resolved tool for probing scrambling and spectral structure of quantum many-body dynamics.
arXiv Detail & Related papers (2025-11-27T17:42:51Z) - Avoided-crossings, degeneracies and Berry phases in the spectrum of quantum noise through analytic Bloch-Messiah decomposition [49.1574468325115]
"analytic Bloch-Messiah decomposition" provides approach for characterizing dynamics of quantum optical systems.<n>We show that avoided crossings arise naturally when a single parameter is varied, leading to hypersensitivity of the singular vectors.<n>We highlight the possibility of programming the spectral response of photonic systems through the deliberate design of avoided crossings.
arXiv Detail & Related papers (2025-04-29T13:14:15Z) - Modelling 1/f Noise in TRNGs via Fractional Brownian Motion [1.3053649021965603]
The security of random number generators is not fully understood due to complex $1/f^\alpha$ phase noise.
We introduce fractional Brownian motion as a comprehensive theoretical framework, capturing power-law spectral densities from white to flicker frequency noise.
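For illustration only, a standard spectral-synthesis recipe (an assumption here; the paper's framework is fractional Brownian motion, not this shortcut) produces the same family of $1/f^\alpha$ power-law spectra by shaping white Gaussian noise in the frequency domain:

```python
import numpy as np

def power_law_noise(n: int, alpha: float, rng=None) -> np.ndarray:
    """Generate a length-n sample whose power spectral density scales as
    1/f^alpha (alpha=0: white, alpha=1: flicker, alpha=2: Brownian) by
    shaping white Gaussian noise in the frequency domain."""
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n)
    f[0] = f[1]                      # avoid division by zero at DC
    spectrum *= f ** (-alpha / 2.0)  # amplitude ~ f^(-alpha/2)  =>  PSD ~ f^(-alpha)
    return np.fft.irfft(spectrum, n)
```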
arXiv Detail & Related papers (2024-10-18T06:38:34Z) - Transformer Normalisation Layers and the Independence of Semantic Subspaces [17.957364289876548]
We consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution.
We show that Pre-Norm, the normalisation-layer placement used by state-of-the-art transformers, violates this ability.
We observe a 1% rate of circuit collapse when the norms are artificially perturbed by $\lesssim 10\%$.
arXiv Detail & Related papers (2024-06-25T16:16:38Z)
- Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch-normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
arXiv Detail & Related papers (2023-10-03T12:35:02Z)
- Reminiscence of classical chaos in driven transmons [117.851325578242]
We show that even off-resonant drives can cause strong modifications to the structure of the transmon spectrum rendering a large part of it chaotic.
These results lead to a photon-number threshold characterizing the appearance of chaos-induced quantum demolition effects.
arXiv Detail & Related papers (2022-07-19T16:04:46Z)
- Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z)
- Robust, Nonparametric, Efficient Decomposition of Spectral Peaks under Distortion and Interference [0.0]
We propose a decomposition method for the spectral peaks in an observed frequency spectrum, which is efficiently acquired by utilizing the Fast Fourier Transform.
We model the peaks in the spectrum as pseudo-symmetric functions, where the only constraint is nonincreasing behavior around a central frequency as the distance from it increases.
Our approach is more robust against arbitrary distortion, interference and noise on the spectrum that may be caused by an observation system.
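As a toy illustration of that nonincreasing-peak constraint (the helper below and its conventions are assumptions, not the paper's estimator), one can walk outward from a spectral maximum for as long as the magnitude keeps decreasing:

```python
import numpy as np

def extract_peak(magnitude: np.ndarray, center: int) -> tuple[int, int]:
    """Walk outward from a spectral peak at bin `center` while the magnitude
    is nonincreasing, returning the (start, end) bin range of the peak.
    This enforces only the pseudo-symmetry constraint described above."""
    lo = center
    while lo > 0 and magnitude[lo - 1] <= magnitude[lo]:
        lo -= 1
    hi = center
    while hi < len(magnitude) - 1 and magnitude[hi + 1] <= magnitude[hi]:
        hi += 1
    return lo, hi

# Example usage on a magnitude spectrum obtained via the FFT:
# mag = np.abs(np.fft.rfft(signal))
# lo, hi = extract_peak(mag, int(np.argmax(mag)))
```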
arXiv Detail & Related papers (2022-04-18T17:08:37Z)
- Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth [48.16156149749371]
This work proposes a new way to understand self-attention networks.
We show that their output can be decomposed into a sum of smaller terms.
We prove that self-attention possesses a strong inductive bias towards "token uniformity".
arXiv Detail & Related papers (2021-03-05T00:39:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.