A Random Matrix Theory of Masked Self-Supervised Regression
- URL: http://arxiv.org/abs/2601.23208v1
- Date: Fri, 30 Jan 2026 17:32:33 GMT
- Title: A Random Matrix Theory of Masked Self-Supervised Regression
- Authors: Arie Wortsman Zurich, Federica Gerace, Bruno Loureiro, Yue M. Lu
- Abstract summary: We show how training aggregates predictions across many masking patterns, giving rise to a joint, matrix-valued predictor. This object encodes how coordinates condition on one another and poses new analytical challenges. We identify structured regimes in which masked self-supervised learning provably outperforms PCA.
- Score: 16.836043197411378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the era of transformer models, masked self-supervised learning (SSL) has become a foundational training paradigm. A defining feature of masked SSL is that training aggregates predictions across many masking patterns, giving rise to a joint, matrix-valued predictor rather than a single vector-valued estimator. This object encodes how coordinates condition on one another and poses new analytical challenges. We develop a precise high-dimensional analysis of masked modeling objectives in the proportional regime where the number of samples scales with the ambient dimension. Our results provide explicit expressions for the generalization error and characterize the spectral structure of the learned predictor, revealing how masked modeling extracts structure from data. For spiked covariance models, we show that the joint predictor undergoes a Baik–Ben Arous–Péché (BBP)-type phase transition, identifying when masked SSL begins to recover latent signals. Finally, we identify structured regimes in which masked self-supervised learning provably outperforms PCA, highlighting potential advantages of SSL objectives over classical unsupervised methods.
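The setup in the abstract can be illustrated with a minimal numerical sketch (not the paper's exact estimator; all parameter values and the leave-one-out masking scheme are illustrative assumptions): draw data from a spiked covariance model, form a matrix-valued predictor by ridge-regressing each coordinate on the remaining ones, and compare the overlap of its leading eigenvector with the planted spike against a PCA baseline.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, beta, lam = 60, 600, 4.0, 0.1  # illustrative dimensions, spike strength, ridge

# Spiked covariance data: Sigma = I + beta * u u^T for a random unit spike u
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
X = rng.standard_normal((n, d)) + np.sqrt(beta) * rng.standard_normal((n, 1)) * u

# Matrix-valued predictor: for each coordinate j (the "masked" position),
# ridge-regress x_j on the visible coordinates and store the weights in row j.
W = np.zeros((d, d))
for j in range(d):
    vis = np.delete(np.arange(d), j)
    A = X[:, vis]
    w = np.linalg.solve(A.T @ A / n + lam * np.eye(d - 1), A.T @ X[:, j] / n)
    W[j, vis] = w

# Spectral structure of the learned predictor: overlap of the leading
# eigenvector of the symmetrized W with the planted spike u.
_, vecs = np.linalg.eigh((W + W.T) / 2)
overlap_ssl = abs(vecs[:, -1] @ u)

# PCA baseline: leading eigenvector of the sample covariance.
_, vecs_c = np.linalg.eigh(X.T @ X / n)
overlap_pca = abs(vecs_c[:, -1] @ u)
print(f"spike overlap  masked-SSL: {overlap_ssl:.2f}  PCA: {overlap_pca:.2f}")
```

Since beta = 4 is well above the BBP threshold sqrt(d/n) ≈ 0.32 for this aspect ratio, both overlaps come out large; shrinking beta towards the threshold makes the recovered directions decorrelate from u, which is the phase transition the abstract refers to.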
Related papers
- A Semi-supervised Molecular Learning Framework for Activity Cliff Estimation [10.640733919289643]
We propose a novel semi-supervised learning (SSL) method dubbed SemiMol. SemiMol employs predictions on numerous unannotated data as pseudo-signals for subsequent training. We show that SemiMol significantly enhances graph-based ML architectures and surpasses state-of-the-art pretraining and SSL baselines.
arXiv Detail & Related papers (2026-01-08T02:20:25Z) - SIGMA: Scalable Spectral Insights for LLM Collapse [51.863164847253366]
We introduce SIGMA (Spectral Inequalities for Gram Matrix Analysis), a unified framework for model collapse. By deriving deterministic bounds on the Gram matrix's spectrum, SIGMA provides a mathematically grounded metric to track the contraction of the representation space. We demonstrate that SIGMA effectively captures the transition towards collapsed states, offering theoretical insights into the mechanics of collapse.
arXiv Detail & Related papers (2026-01-06T19:47:11Z) - Self-Speculative Masked Diffusions [46.04054227238148]
We present self-speculative masked diffusions, a new class of masked diffusion generative models for discrete data. We reduce the computational burden by generating non-factorized predictions over masked positions. We apply our method to GPT2-scale text modelling and protein sequence generation, finding that we can achieve a 2x reduction in the required number of network forward passes.
arXiv Detail & Related papers (2025-10-04T20:16:38Z) - Investigating Mask-aware Prototype Learning for Tabular Anomaly Detection [10.59950164851305]
Tabular anomaly detection has been crucial in a variety of real-world applications, such as medical disease identification, financial fraud detection, intrusion monitoring, etc. Recent deep learning-based methods suffer from representation entanglement and the lack of global correlation modeling, which hinders anomaly detection performance. This paper introduces mask modeling and prototype learning to tackle the problem.
arXiv Detail & Related papers (2025-06-03T11:22:44Z) - Steering Masked Discrete Diffusion Models via Discrete Denoising Posterior Prediction [88.65168366064061]
We introduce Discrete Denoising Posterior Prediction (DDPP), a novel framework that casts the task of steering pre-trained MDMs as a problem of probabilistic inference.
Our framework leads to a family of three novel objectives that are all simulation-free, and thus scalable.
We substantiate our designs via wet-lab validation, where we observe transient expression of reward-optimized protein sequences.
arXiv Detail & Related papers (2024-10-10T17:18:30Z) - ReAugment: Model Zoo-Guided RL for Few-Shot Time Series Augmentation and Forecasting [74.00765474305288]
We present a pilot study on using reinforcement learning (RL) for time series data augmentation. Our method, ReAugment, tackles three critical questions: which parts of the training set should be augmented, how the augmentation should be performed, and what advantages RL brings to the process.
arXiv Detail & Related papers (2024-09-10T07:34:19Z) - HuRef: HUman-REadable Fingerprint for Large Language Models [44.9820558213721]
HuRef is a human-readable fingerprint for large language models. It uniquely identifies the base model without interfering with training or exposing model parameters to the public.
arXiv Detail & Related papers (2023-12-08T05:01:47Z) - Sharp-SSL: Selective high-dimensional axis-aligned random projections for semi-supervised learning [16.673022545571566]
We propose a new method for high-dimensional semi-supervised learning problems.
It is based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data.
arXiv Detail & Related papers (2023-04-18T17:49:02Z) - Regularized Vector Quantization for Tokenized Image Synthesis [126.96880843754066]
Quantizing images into discrete representations has been a fundamental problem in unified generative modeling.
Deterministic quantization suffers from severe codebook collapse and misalignment with the inference stage, while stochastic quantization suffers from low codebook utilization and a perturbed reconstruction objective.
This paper presents a regularized vector quantization framework that mitigates the above issues effectively by applying regularization from two perspectives.
arXiv Detail & Related papers (2023-03-11T15:20:54Z) - Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z) - Probabilistic fine-tuning of pruning masks and PAC-Bayes self-bounded
learning [16.526326919313924]
We study an approach to learning pruning masks by optimizing the expected loss of pruning masks.
We analyze the training dynamics of the induced adaptive predictor in the setting of linear regression.
We show that a PAC-Bayes generalization error bound is controlled by the magnitude of the change in feature alignment between the 'prior' and 'posterior' data.
arXiv Detail & Related papers (2021-10-22T14:25:22Z) - Improving Self-supervised Pre-training via a Fully-Explored Masked Language Model [57.77981008219654]
Masked Language Model (MLM) framework has been widely adopted for self-supervised language pre-training.
We propose a fully-explored masking strategy, where a text sequence is divided into a certain number of non-overlapping segments.
arXiv Detail & Related papers (2020-10-12T21:28:14Z) - Prototypical Contrastive Learning of Unsupervised Representations [171.3046900127166]
Prototypical Contrastive Learning (PCL) is an unsupervised representation learning method.
PCL implicitly encodes semantic structures of the data into the learned embedding space.
PCL outperforms state-of-the-art instance-wise contrastive learning methods on multiple benchmarks.
arXiv Detail & Related papers (2020-05-11T09:53:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.