SIGMA: Scalable Spectral Insights for LLM Collapse
- URL: http://arxiv.org/abs/2601.03385v1
- Date: Tue, 06 Jan 2026 19:47:11 GMT
- Title: SIGMA: Scalable Spectral Insights for LLM Collapse
- Authors: Yi Gu, Lingyou Pang, Xiangkun Ye, Tianyu Wang, Jianyu Lin, Carey E. Priebe, Alexander Aue
- Abstract summary: We introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework for benchmarking model collapse. By deriving and utilizing deterministic and stochastic bounds on the embedding Gram matrix's spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. We demonstrate that SIGMA effectively captures the transition towards degenerate states, offering both theoretical insights into the mechanics of collapse and a practical monitoring tool.
- Score: 51.863164847253366
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The rapid adoption of synthetic data for training Large Language Models (LLMs) has introduced the technical challenge of "model collapse"-a degenerative process where recursive training on model-generated content leads to a contraction of distributional variance and representational quality. While the phenomenology of collapse is increasingly evident, rigorous methods to quantify and predict its onset in high-dimensional spaces remain elusive. In this paper, we introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework that benchmarks model collapse through the spectral lens of the embedding Gram matrix. By deriving and utilizing deterministic and stochastic bounds on the matrix's spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. Crucially, our stochastic formulation enables scalable estimation of these bounds, making the framework applicable to large-scale foundation models where full eigendecomposition is intractable. We demonstrate that SIGMA effectively captures the transition towards degenerate states, offering both theoretical insights into the mechanics of collapse and a practical, scalable tool for monitoring the health of recursive training pipelines.
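The abstract's key scalability claim is that spectral quantities of the embedding Gram matrix can be estimated without a full eigendecomposition. SIGMA's exact bounds are not reproduced in this listing, so the sketch below only illustrates the general matrix-free approach the abstract describes: a Hutchinson trace estimator plus power iteration, combined into a simple effective-rank proxy that shrinks as representations collapse. The function names and the proxy itself are illustrative assumptions, not the paper's metric.

```python
import numpy as np

def stochastic_trace(matvec, dim, n_probes=64, rng=None):
    # Hutchinson estimator: E[z^T A z] = tr(A) for Rademacher probes z.
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=dim)
        total += z @ matvec(z)
    return total / n_probes

def top_eigenvalue(matvec, dim, n_iter=200, rng=None):
    # Power iteration for the largest eigenvalue of a PSD operator.
    rng = np.random.default_rng(rng)
    v = rng.standard_normal(dim)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = matvec(v)
        v = w / np.linalg.norm(w)
    return float(v @ matvec(v))

def effective_rank_proxy(X, n_probes=64, seed=0):
    # tr(G) / lambda_max(G) for the Gram matrix G = X X^T / n,
    # computed matrix-free; the ratio contracts toward 1 as the
    # representation space collapses onto a few directions.
    n = X.shape[0]
    matvec = lambda v: X @ (X.T @ v) / n  # apply G without forming it
    tr = stochastic_trace(matvec, n, n_probes, rng=seed)
    lam = top_eigenvalue(matvec, n, rng=seed)
    return tr / lam
```

On a batch of healthy, high-variance embeddings this proxy is large (many comparable eigenvalues); on near-rank-deficient embeddings it drops toward 1, which is the qualitative signature of collapse the abstract describes tracking.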
Related papers
- Generate Aligned Anomaly: Region-Guided Few-Shot Anomaly Image-Mask Pair Synthesis for Industrial Inspection [53.137651284042434]
Anomaly inspection plays a vital role in industrial manufacturing, but the scarcity of anomaly samples limits the effectiveness of existing methods. We propose Generate Aligned Anomaly (GAA), a region-guided, few-shot anomaly image-mask pair generation framework. GAA generates realistic, diverse, and semantically aligned anomalies using only a small number of samples.
arXiv Detail & Related papers (2025-07-13T12:56:59Z) - Symplectic Generative Networks (SGNs): A Hamiltonian Framework for Invertible Deep Generative Modeling [0.0]
We introduce the Symplectic Generative Network (SGN), a deep generative model that leverages Hamiltonian mechanics to construct an invertible, volume-preserving mapping between a latent space and the data space. By endowing the latent space with a symplectic structure and modeling data generation as the time evolution of a Hamiltonian system, SGN achieves exact likelihood evaluation without incurring the computational overhead of Jacobian calculations.
arXiv Detail & Related papers (2025-05-28T16:13:36Z) - Hallucination Detection in LLMs with Topological Divergence on Attention Graphs [60.83579255387347]
Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models. We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting.
arXiv Detail & Related papers (2025-04-14T10:06:27Z) - Beyond Progress Measures: Theoretical Insights into the Mechanism of Grokking [50.465604300990904]
Grokking refers to the abrupt improvement in test accuracy after extended overfitting. In this work, we investigate the grokking mechanism underlying the Transformer in the task of prime number operations.
arXiv Detail & Related papers (2025-04-04T04:42:38Z) - Linear Noise Approximation Assisted Bayesian Inference on Mechanistic Model of Partially Observed Stochastic Reaction Network [2.325005809983534]
This paper develops an efficient Bayesian inference approach for partially observed stochastic reaction networks (SRNs) arising in enzymatic reactions.
An interpretable linear noise approximation (LNA) metamodel is proposed to approximate the likelihood of observations.
An efficient posterior sampling approach is developed by utilizing the gradients of the derived likelihood to speed up the convergence of Markov Chain Monte Carlo.
arXiv Detail & Related papers (2024-05-05T01:54:21Z) - Spectral Clustering for Directed Graphs via Likelihood Estimation on Stochastic Block Models [22.421702511126373]
We leverage statistical inference on block models to guide the development of a spectral clustering algorithm for directed graphs. We establish a theoretical upper bound on the misclustering error of its spectral relaxation, and based on this relaxation, introduce a novel, self-adaptive spectral clustering method for directed graphs.
arXiv Detail & Related papers (2024-03-28T15:47:13Z) - Matrix Completion-Informed Deep Unfolded Equilibrium Models for Self-Supervised k-Space Interpolation in MRI [8.33626757808923]
Regularization model-driven deep learning (DL) has gained significant attention due to its ability to leverage the potent representational capabilities of DL.
We propose a self-supervised DL approach for accelerated MRI that is theoretically guaranteed and does not rely on fully sampled labels.
arXiv Detail & Related papers (2023-09-24T07:25:06Z) - Convex Latent-Optimized Adversarial Regularizers for Imaging Inverse Problems [8.33626757808923]
We introduce Convex Latent-Optimized Adversarial Regularizers (CLEAR), a novel and interpretable data-driven paradigm.
CLEAR represents a fusion of deep learning (DL) and variational regularization.
Our method consistently outperforms conventional data-driven techniques and traditional regularization approaches.
arXiv Detail & Related papers (2023-09-17T12:06:04Z) - Regularized Vector Quantization for Tokenized Image Synthesis [126.96880843754066]
Quantizing images into discrete representations has been a fundamental problem in unified generative modeling.
Deterministic quantization suffers from severe codebook collapse and misalignment with the inference stage, while stochastic quantization suffers from low codebook utilization and a perturbed reconstruction objective.
This paper presents a regularized vector quantization framework that mitigates the above issues effectively by applying regularization from two perspectives.
arXiv Detail & Related papers (2023-03-11T15:20:54Z) - Learning Graphical Factor Models with Riemannian Optimization [70.13748170371889]
This paper proposes a flexible algorithmic framework for graph learning under low-rank structural constraints.
The problem is expressed as penalized maximum likelihood estimation of an elliptical distribution.
We leverage geometries of positive definite matrices and positive semi-definite matrices of fixed rank that are well suited to elliptical models.
arXiv Detail & Related papers (2022-10-21T13:19:45Z) - Statistical control for spatio-temporal MEG/EEG source imaging with desparsified multi-task Lasso [102.84915019938413]
Magnetoencephalography (MEG) and electroencephalography (EEG) offer a non-invasive window into brain activity.
The problem of source localization, or source imaging, poses however a high-dimensional statistical inference challenge.
We propose an ensemble of desparsified multi-task Lasso (ecd-MTLasso) to deal with this problem.
arXiv Detail & Related papers (2020-09-29T21:17:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.