PRISM: Deriving the Transformer as a Signal-Denoising Operator via Maximum Coding Rate Reduction
- URL: http://arxiv.org/abs/2601.15540v1
- Date: Wed, 21 Jan 2026 23:52:36 GMT
- Title: PRISM: Deriving the Transformer as a Signal-Denoising Operator via Maximum Coding Rate Reduction
- Authors: Dongchen Huang
- Abstract summary: We propose Prism, a white-box attention-based architecture for deep learning.
We show that Prism spontaneously specializes its attention heads into spectrally distinct regimes.
Our results suggest that interpretability and performance are not a trade-off, but can be unified through principled construction.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning models, particularly Transformers, are often criticized as "black boxes" that lack interpretability. We propose Prism, a white-box attention-based architecture derived from the principle of Maximum Coding Rate Reduction ($\text{MCR}^2$). By modeling the attention mechanism as a gradient ascent process on a distinct signal-noise manifold, we introduce two physical constraints: an overcomplete dictionary to expand the representational phase space, and an irrational frequency separation ($\pi$-RoPE) to enforce incoherence between the signal and noise subspaces. We demonstrate that these geometric inductive biases act as physical constraints and are alone sufficient to induce unsupervised functional disentanglement. Using TinyStories as a controlled testbed for verifying spectral dynamics, we observe that Prism spontaneously specializes its attention heads into spectrally distinct regimes: low-frequency heads capture long-range causal dependencies (signal), while high-frequency heads handle local syntactic constraints (noise). Our results suggest that interpretability and performance are not a trade-off, but can be unified through principled geometric construction.
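The abstract gives no implementation details; as a rough, hypothetical sketch of the $\pi$-RoPE idea (the function names, the half-and-half head split, and the exact scaling are assumptions, not the authors' code), one group of heads can keep the standard RoPE frequencies while the other group has its frequencies rescaled by the irrational factor $\pi$, so the two groups never share a rotation frequency:

```python
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i = 0 .. d/2 - 1."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def pi_rope_frequencies(head_dim: int, n_heads: int) -> np.ndarray:
    """Hypothetical pi-RoPE: the first half of the heads ("signal") keep the
    standard frequencies; the second half ("noise") have theirs rescaled by
    the irrational factor pi, so the two groups share no rotation frequency."""
    base_freqs = rope_frequencies(head_dim)
    return np.stack([
        base_freqs if h < n_heads // 2 else base_freqs * np.pi
        for h in range(n_heads)
    ])  # shape: (n_heads, head_dim // 2)

def apply_rope(x: np.ndarray, freqs: np.ndarray) -> np.ndarray:
    """Rotate consecutive feature pairs of x (seq_len, head_dim) by
    position-dependent angles, as in standard rotary embeddings."""
    angles = np.outer(np.arange(x.shape[0]), freqs)  # (seq_len, head_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```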
Related papers
- The Malignant Tail: Spectral Segregation of Label Noise in Over-Parameterized Networks [0.0]
We experimentally isolate the Malignant Tail, a failure mode where networks functionally segregate signal and noise.
We show that untrained networks actively segregate noise, allowing post-hoc Explicit Spectral Truncation to surgically prune the noise-dominated subspace.
Our findings suggest that under label noise, excess spectral capacity is not harmless redundancy but a latent structural liability.
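The summary does not say how the pruning is performed; a minimal sketch of one plausible post-hoc spectral truncation (the function name and rank-selection rule are assumptions, not necessarily the paper's Explicit Spectral Truncation) is:

```python
import numpy as np

def spectral_truncate(W: np.ndarray, keep: int) -> np.ndarray:
    """Keep only the top-`keep` singular directions of a weight matrix,
    zeroing the tail subspace where, per the summary above, label noise
    tends to concentrate."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    S[keep:] = 0.0
    return (U * S) @ Vt
```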
arXiv Detail & Related papers (2026-03-02T16:39:42Z)
- FUTON: Fourier Tensor Network for Implicit Neural Representations [56.48739018255443]
Implicit neural representations (INRs) have emerged as powerful tools for encoding signals, yet dominant designs often suffer from slow convergence, overfitting to noise, and poor extrapolation.
We introduce FUTON, which models signals as generalized Fourier series whose coefficients are parameterized by a low-rank tensor decomposition.
arXiv Detail & Related papers (2026-02-13T19:31:44Z) - Parallel Complex Diffusion for Scalable Time Series Generation [50.01609741902786]
PaCoDi is a spectral-native architecture that decouples generative modeling in the frequency domain.
We show that PaCoDi outperforms existing baselines in both generation quality and inference speed.
arXiv Detail & Related papers (2026-02-10T14:31:53Z) - Seeing Through the Chain: Mitigate Hallucination in Multimodal Reasoning Models via CoT Compression and Contrastive Preference Optimization [78.94590726578014]
Multimodal reasoning models (MLRMs) remain prone to hallucinations, and effective solutions are still underexplored.
We propose C3PO, a training-based mitigation framework comprising CoT Compression and Contrastive Preference Optimization.
arXiv Detail & Related papers (2026-02-03T11:00:55Z) - TensorLens: End-to-End Transformer Analysis via High-Order Attention Tensors [53.891337639229285]
We introduce TensorLens, a novel formulation that captures the entire transformer as a single, input-dependent linear operator expressed through a high-order attention-interaction connection.
Our experiments demonstrate that the attention tensor can serve as a powerful foundation for developing tools aimed at interpretability and model understanding.
arXiv Detail & Related papers (2026-01-25T19:21:25Z) - The Homogeneity Trap: Spectral Collapse in Doubly-Stochastic Deep Networks [1.7523718031184992]
We identify a critical spectral degradation phenomenon inherent to structure-preserving deep architectures.
We show that a maximum-entropy bias drives the mixing operator towards the uniform barycenter, suppressing the subdominant singular value.
We derive a spectral bound linking this suppression to the network's effective depth, showing that high-entropy constraints restrict feature transformation to a shallow receptive field.
arXiv Detail & Related papers (2026-01-05T13:09:42Z) - Out-of-Time-Order Correlator Spectroscopy [3.9083778058145864]
We show that higher-order OTOCs fit within the framework of quantum signal processing.
We further generalize higher-order OTOCs by transformation of the singular values of the spatially resolved truncated propagator.
This extends conventional OTOCs into a mode-resolved tool for probing scrambling and spectral structure of quantum many-body dynamics.
arXiv Detail & Related papers (2025-11-27T17:42:51Z) - Avoided-crossings, degeneracies and Berry phases in the spectrum of quantum noise through analytic Bloch-Messiah decomposition [49.1574468325115]
"analytic Bloch-Messiah decomposition" provides approach for characterizing dynamics of quantum optical systems.<n>We show that avoided crossings arise naturally when a single parameter is varied, leading to hypersensitivity of the singular vectors.<n>We highlight the possibility of programming the spectral response of photonic systems through the deliberate design of avoided crossings.
arXiv Detail & Related papers (2025-04-29T13:14:15Z) - Modelling 1/f Noise in TRNGs via Fractional Brownian Motion [1.3053649021965603]
The security of random number generators is not fully understood due to complex $1/f^\alpha$ phase noise.
We introduce fractional Brownian motion as a comprehensive theoretical framework, capturing power-law spectral densities from white to flicker frequency noise.
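For illustration only, a standard spectral-synthesis recipe (an assumption here; the paper's framework is fractional Brownian motion, not this shortcut) produces the same family of $1/f^\alpha$ power-law spectra by shaping white Gaussian noise in the frequency domain:

```python
import numpy as np

def power_law_noise(n: int, alpha: float, rng=None) -> np.ndarray:
    """Generate a length-n sample whose power spectral density scales as
    1/f^alpha (alpha=0: white, alpha=1: flicker, alpha=2: Brownian) by
    shaping white Gaussian noise in the frequency domain."""
    rng = np.random.default_rng() if rng is None else rng
    spectrum = np.fft.rfft(rng.standard_normal(n))
    f = np.fft.rfftfreq(n)
    f[0] = f[1]                      # avoid division by zero at DC
    spectrum *= f ** (-alpha / 2.0)  # amplitude ~ f^(-alpha/2)  =>  PSD ~ f^(-alpha)
    return np.fft.irfft(spectrum, n)
```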
arXiv Detail & Related papers (2024-10-18T06:38:34Z) - Transformer Normalisation Layers and the Independence of Semantic Subspaces [17.957364289876548]
We consider a semantic subspace to be any independent subspace of the latent representation that can fully determine an attention distribution.
We show that Pre-Norm, the normalisation-layer placement used by state-of-the-art transformers, violates this ability.
We observe a 1% rate of circuit collapse when the norms are artificially perturbed by $\lesssim 10\%$.
arXiv Detail & Related papers (2024-06-25T16:16:38Z)
- Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion [83.90492831583997]
We show that a batch-normalized network can keep the optimal signal propagation properties, but avoid exploding gradients in depth.
We use a Multi-Layer Perceptron (MLP) with linear activations and batch-normalization that provably has bounded gradients at any depth.
We also design an activation shaping scheme that empirically achieves the same properties for certain non-linear activations.
arXiv Detail & Related papers (2023-10-03T12:35:02Z)
- Reminiscence of classical chaos in driven transmons [117.851325578242]
We show that even off-resonant drives can cause strong modifications to the structure of the transmon spectrum rendering a large part of it chaotic.
These results lead to a photon-number threshold characterizing the appearance of chaos-induced quantum demolition effects.
arXiv Detail & Related papers (2022-07-19T16:04:46Z)
- Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers [52.468311268601056]
This paper analyzes attention through the lens of convex duality.
We derive equivalent finite-dimensional convex problems that are interpretable and solvable to global optimality.
We show how self-attention networks implicitly cluster the tokens based on their latent similarity.
arXiv Detail & Related papers (2022-05-17T04:01:15Z)
- Robust, Nonparametric, Efficient Decomposition of Spectral Peaks under Distortion and Interference [0.0]
We propose a decomposition method for the spectral peaks in an observed frequency spectrum, which is efficiently acquired by utilizing the Fast Fourier Transform.
We model the peaks in the spectrum as pseudo-symmetric functions, where the only constraint is nonincreasing behavior around a central frequency as the distance from it increases.
Our approach is more robust against arbitrary distortion, interference and noise on the spectrum that may be caused by an observation system.
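As a toy illustration of that nonincreasing-peak constraint (the helper below and its conventions are assumptions, not the paper's estimator), one can walk outward from a spectral maximum for as long as the magnitude keeps decreasing:

```python
import numpy as np

def extract_peak(magnitude: np.ndarray, center: int) -> tuple[int, int]:
    """Walk outward from a spectral peak at bin `center` while the magnitude
    is nonincreasing, returning the (start, end) bin range of the peak.
    This enforces only the pseudo-symmetry constraint described above."""
    lo = center
    while lo > 0 and magnitude[lo - 1] <= magnitude[lo]:
        lo -= 1
    hi = center
    while hi < len(magnitude) - 1 and magnitude[hi + 1] <= magnitude[hi]:
        hi += 1
    return lo, hi

# Example usage on a magnitude spectrum obtained via the FFT:
# mag = np.abs(np.fft.rfft(signal))
# lo, hi = extract_peak(mag, int(np.argmax(mag)))
```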
arXiv Detail & Related papers (2022-04-18T17:08:37Z)
- Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth [48.16156149749371]
This work proposes a new way to understand self-attention networks.
We show that their output can be decomposed into a sum of smaller terms.
We prove that self-attention possesses a strong inductive bias towards "token uniformity".
arXiv Detail & Related papers (2021-03-05T00:39:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.