Stable Anisotropic Regularization
- URL: http://arxiv.org/abs/2305.19358v3
- Date: Thu, 4 Apr 2024 03:04:12 GMT
- Title: Stable Anisotropic Regularization
- Authors: William Rudman, Carsten Eickhoff
- Abstract summary: We propose I-STAR: IsoScore*-based STable Anisotropic Regularization, a novel regularization method that can be used to increase or decrease levels of isotropy in embedding space during training.
I-STAR uses IsoScore*, the first accurate measure of isotropy that is both differentiable and stable on mini-batch computations.
- Score: 18.52015282224059
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Given the success of Large Language Models (LLMs), there has been considerable interest in studying the properties of model activations. The literature overwhelmingly agrees that LLM representations are dominated by a few "outlier dimensions" with exceedingly high variance and magnitude. Several studies in Natural Language Processing (NLP) have sought to mitigate the impact of such outlier dimensions and force LLMs to be isotropic (i.e., have uniform variance across all dimensions in embedding space). Isotropy is thought to be a desirable property for LLMs that improves model performance and more closely aligns textual representations with human intuition. However, many of the claims regarding isotropy in NLP have been based on the average cosine similarity of embeddings, which has recently been shown to be a flawed measure of isotropy. In this paper, we propose I-STAR: IsoScore*-based STable Anisotropic Regularization, a novel regularization method that can be used to increase or decrease levels of isotropy in embedding space during training. I-STAR uses IsoScore*, the first accurate measure of isotropy that is both differentiable and stable on mini-batch computations. In contrast to several previous works, we find that decreasing isotropy in contextualized embeddings improves performance on the majority of tasks and models considered in this paper.
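To make the covariance-based notion of isotropy concrete, here is a minimal sketch of a differentiable isotropy proxy on a mini-batch of embeddings. It is not the authors' IsoScore*; it uses a simplified, assumed measure (a normalized participation ratio of the covariance spectrum), and the `task_loss` / `lam` names in the final comment are hypothetical.

```python
# Minimal sketch of a differentiable isotropy proxy for a mini-batch of
# embeddings. NOT the authors' IsoScore*: a simplified covariance-spectrum
# measure assumed here for illustration only.
import torch

def isotropy_proxy(embeddings: torch.Tensor) -> torch.Tensor:
    """embeddings: (batch, dim) tensor. Returns a scalar in (0, 1]."""
    centered = embeddings - embeddings.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (embeddings.shape[0] - 1)  # (dim, dim) covariance
    eigvals = torch.linalg.eigvalsh(cov)                     # differentiable spectrum
    # Normalized participation ratio: 1 when all eigenvalues are equal
    # (perfectly isotropic), approaching 0 when a few dimensions dominate.
    return eigvals.sum() ** 2 / (eigvals.pow(2).sum() * embeddings.shape[1])

# Hypothetical I-STAR-style objective: the sign and size of `lam` select
# whether isotropy is increased or decreased during training.
# loss = task_loss - lam * isotropy_proxy(hidden_states)
```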
Related papers
- Entropy-Based Dimension-Free Convergence and Loss-Adaptive Schedules for Diffusion Models [3.2091923314854416]
Diffusion generative models synthesize samples by discretizing reverse-time dynamics driven by a learned score (or denoiser). We develop an information-theoretic approach to dimension-free convergence that avoids geometric assumptions. We also propose a Loss-Adaptive Schedule (LAS) for efficient discretization of the reverse SDE, which is lightweight and relies only on the training loss.
arXiv Detail & Related papers (2026-01-29T16:28:21Z) - Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback [50.89125374999765]
We provide the first convergence guarantee for Optimistic Multiplicative Weights Update ($\mathtt{OMWU}$) in NLHF. Our analysis identifies a novel marginal convergence behavior, where the probability of rarely played actions grows exponentially from exponentially small values.
arXiv Detail & Related papers (2025-12-31T12:08:29Z) - The Sign Estimator: LLM Alignment in the Face of Choice Heterogeneity [4.957619545367733]
Traditional alignment methods are vulnerable to heterogeneity in human preferences. We propose a new method, dubbed the sign estimator, which is simple, provably consistent, and efficient.
arXiv Detail & Related papers (2025-10-28T00:42:38Z) - Estimating Semantic Alphabet Size for LLM Uncertainty Quantification [12.029394705620724]
We propose a modified semantic alphabet size estimator for semantic entropy estimation. Using it to adjust discrete semantic entropy for sample coverage results in more accurate semantic entropy estimation. Our proposed alphabet size estimator flags incorrect LLM responses as well as or better than recent top-performing approaches.
arXiv Detail & Related papers (2025-09-17T23:16:39Z) - Adapt in the Wild: Test-Time Entropy Minimization with Sharpness and Feature Regularization [85.50560211492898]
Test-time adaptation (TTA) may fail to improve or even harm model performance when test data have mixed distribution shifts. This is often a key obstacle preventing existing TTA methods from being deployed in the real world. We propose a sharpness-aware and reliable entropy minimization method, called SAR, for stabilizing TTA from two aspects.
arXiv Detail & Related papers (2025-09-05T10:03:00Z) - On Entropy Control in LLM-RL Algorithms [10.71946318944523]
We study the issues of the entropy bonus in the LLM-RL setting. We propose AEnt, an entropy control method that utilizes a new clamped entropy bonus with an automatically adjusted coefficient. AEnt is tested on math-reasoning tasks under different base models and datasets, and it consistently outperforms the baselines.
arXiv Detail & Related papers (2025-09-03T17:23:19Z) - Beyond Semantic Entropy: Boosting LLM Uncertainty Quantification with Pairwise Semantic Similarity [15.16188621701658]
Hallucination in large language models can be detected by assessing the uncertainty of model outputs, typically measured using entropy. We propose a simple black-box uncertainty quantification method inspired by nearest neighbor estimates of entropy. Our approach can also be easily extended to white-box settings by incorporating token probabilities.
arXiv Detail & Related papers (2025-05-30T21:21:05Z) - LESA: Learnable LLM Layer Scaling-Up [57.0510934286449]
Training Large Language Models (LLMs) from scratch requires immense computational resources, making it prohibitively expensive.
Model scaling-up offers a promising solution by leveraging the parameters of smaller models to create larger ones.
We propose LESA, a novel learnable method for depth scaling-up.
arXiv Detail & Related papers (2025-02-19T14:58:48Z) - Shrink the longest: improving latent space isotropy with simplicial geometry [0.0]
We propose a novel regularization technique based on simplicial geometry to improve the isotropy of latent representations.
We demonstrate that the method leads to an increase in downstream performance while significantly lowering the anisotropy during fine-tuning.
arXiv Detail & Related papers (2025-01-09T18:44:10Z) - Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [66.27334633749734]
As language models grow in size, memory demands for backpropagation increase.
Zeroth-order (ZO) optimization methods offer a memory-efficient alternative.
We show that SubZero enhances fine-tuning and achieves faster convergence than standard ZO approaches.
arXiv Detail & Related papers (2024-10-11T17:01:43Z) - Total Uncertainty Quantification in Inverse PDE Solutions Obtained with Reduced-Order Deep Learning Surrogate Models [50.90868087591973]
We propose an approximate Bayesian method for quantifying the total uncertainty in inverse PDE solutions obtained with machine learning surrogate models.
We test the proposed framework by comparing it with the iterative ensemble smoother and deep ensembling methods for a non-linear diffusion equation.
arXiv Detail & Related papers (2024-08-20T19:06:02Z) - REAL Sampling: Boosting Factuality and Diversity of Open-Ended Generation via Asymptotic Entropy [93.8400683020273]
Decoding methods for large language models (LLMs) usually struggle with the tradeoff between ensuring factuality and maintaining diversity.
We propose REAL sampling, a decoding method that improves factuality and diversity over nucleus sampling.
arXiv Detail & Related papers (2024-06-11T21:44:49Z) - Quantifying Emergence in Large Language Models [31.608080868988825]
We propose a quantifiable solution for estimating emergence of LLMs.
Inspired by emergentism in dynamics, we quantify the strength of emergence by comparing the entropy reduction of the macroscopic (semantic) level with that of the microscopic (token) level.
Our method demonstrates consistent behaviors across a suite of LMs under both in-context learning (ICL) and natural sentences.
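As a rough illustration of the entropy-comparison idea described above, the following toy sketch contrasts entropy reduction at the semantic (macro) level with the token (micro) level; the distributions are invented for the example and the paper's exact estimator may differ.

```python
# Toy sketch: compare entropy reduction at the semantic (macro) level with
# the token (micro) level. Distributions are made up for illustration.
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p_token_before, p_token_after = [0.25, 0.25, 0.25, 0.25], [0.40, 0.30, 0.20, 0.10]
p_sem_before, p_sem_after = [0.50, 0.50], [0.90, 0.10]

delta_micro = entropy(p_token_before) - entropy(p_token_after)  # token-level reduction
delta_macro = entropy(p_sem_before) - entropy(p_sem_after)      # semantic-level reduction

# One plausible reading: emergence is strong when the macroscopic entropy
# reduction exceeds the microscopic one.
print(delta_macro - delta_micro)
```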
arXiv Detail & Related papers (2024-05-21T09:12:20Z) - Policy Gradient with Active Importance Sampling [55.112959067035916]
Policy gradient (PG) methods significantly benefit from importance sampling (IS), enabling the effective reuse of previously collected samples.
However, IS is employed in RL as a passive tool for re-weighting historical samples.
We look for the best behavioral policy from which to collect samples to reduce the policy gradient variance.
arXiv Detail & Related papers (2024-05-09T09:08:09Z) - The Curse of Memory in Stochastic Approximation: Extended Version [1.534667887016089]
Theory and application of stochastic approximation (SA) have grown within the control systems community since the earliest days of adaptive control.
Recent results establish remarkable performance of SA with (sufficiently small) constant step-size $\alpha > 0$.
arXiv Detail & Related papers (2023-09-06T12:22:32Z) - Combating Mode Collapse in GANs via Manifold Entropy Estimation [70.06639443446545]
Generative Adversarial Networks (GANs) have shown compelling results in various tasks and applications.
We propose a novel training pipeline to address the mode collapse issue of GANs.
arXiv Detail & Related papers (2022-08-25T12:33:31Z) - Building Robust Machine Learning Models for Small Chemical Science Data: The Case of Shear Viscosity [3.4761212729163313]
We train several Machine Learning models to predict the shear viscosity of a Lennard-Jones (LJ) fluid.
Specifically, we investigate issues related to model selection, performance estimation, and uncertainty quantification.
arXiv Detail & Related papers (2022-08-23T07:33:14Z) - Concentration of Non-Isotropic Random Tensors with Applications to Learning and Empirical Risk Minimization [0.0]
Dimensionality is an inherent bottleneck in some modern learning tasks, where optimization methods suffer from the size of the data.
We develop tools that aim at reducing these dimensional costs by a dependency on an effective dimension rather than the ambient one.
We show the importance of taking advantage of non-isotropic properties in learning problems with the following applications.
arXiv Detail & Related papers (2021-02-04T17:13:03Z) - IsoBN: Fine-Tuning BERT with Isotropic Batch Normalization [41.267328947683936]
Fine-tuning pre-trained language models (PTLMs) has been a common practice for advancing performance in natural language understanding (NLU) tasks.
Recent advance in representation learning shows that isotropic embeddings can significantly improve performance on downstream tasks with faster convergence and better generalization.
We analyze the isotropy of the pre-trained embeddings in PTLMs with straightforward visualization, and point out two major issues: high variance among the per-dimension standard deviations, and high correlation between different dimensions.
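A quick diagnostic in the spirit of that analysis (assumed here, not taken from the IsoBN paper) can check both issues directly on a matrix of pooled embeddings:

```python
# Assumed diagnostic sketch (not from the IsoBN codebase): measures how
# unevenly the per-dimension standard deviations are spread and how
# correlated the embedding dimensions are.
import numpy as np

def isotropy_diagnostics(embeddings: np.ndarray):
    """embeddings: (num_examples, dim) array of pooled sentence embeddings."""
    stds = embeddings.std(axis=0)                 # per-dimension standard deviation
    std_cv = stds.std() / stds.mean()             # large value -> very unequal variances
    corr = np.corrcoef(embeddings, rowvar=False)  # (dim, dim) correlation matrix
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    mean_abs_corr = np.abs(off_diag).mean()       # large value -> strongly correlated dims
    return std_cv, mean_abs_corr

# Random data stands in for real embeddings here.
rng = np.random.default_rng(0)
print(isotropy_diagnostics(rng.normal(size=(512, 64))))
```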
arXiv Detail & Related papers (2020-05-02T11:49:09Z) - Fast approximations in the homogeneous Ising model for use in scene analysis [61.0951285821105]
We provide accurate approximations that make it possible to numerically calculate quantities needed in inference.
We show that our approximation formulae are scalable and unfazed by the size of the Markov Random Field.
The practical import of our approximation formulae is illustrated in performing Bayesian inference in a functional Magnetic Resonance Imaging activation detection experiment, and also in likelihood ratio testing for anisotropy in the spatial patterns of yearly increases in pistachio tree yields.
arXiv Detail & Related papers (2017-12-06T14:24:34Z)