The Blessing of Dimensionality in LLM Fine-tuning: A Variance-Curvature Perspective
- URL: http://arxiv.org/abs/2602.00170v1
- Date: Fri, 30 Jan 2026 00:26:35 GMT
- Title: The Blessing of Dimensionality in LLM Fine-tuning: A Variance-Curvature Perspective
- Authors: Qiyao Liang, Jinyeop Song, Yizhou Liu, Jeff Gore, Ila Fiete, Risto Miikkulainen, Xin Qiu
- Abstract summary: We show that weight-perturbation evolution strategies can fine-tune language models with surprisingly small populations. We also observe that fine-tuning reward often rises, peaks, and then degrades in both ES and GRPO.
- Score: 19.4447760660162
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Weight-perturbation evolution strategies (ES) can fine-tune billion-parameter language models with surprisingly small populations (e.g., $N\!\approx\!30$), contradicting classical zeroth-order curse-of-dimensionality intuition. We also observe a second seemingly separate phenomenon: under fixed hyperparameters, the stochastic fine-tuning reward often rises, peaks, and then degrades in both ES and GRPO. We argue that both effects reflect a shared geometric property of fine-tuning landscapes: they are low-dimensional in curvature. A small set of high-curvature dimensions dominates improvement, producing (i) heterogeneous time scales that yield rise-then-decay under fixed stochasticity, as captured by a minimal quadratic stochastic-ascent model, and (ii) degenerate improving updates, where many random perturbations share similar components along these directions. Using ES as a geometric probe on fine-tuning reward landscapes of GSM8K, ARC-C, and WinoGrande across Qwen2.5-Instruct models (0.5B--7B), we show that reward-improving perturbations remain empirically accessible with small populations across scales. Together, these results reconcile ES scalability with non-monotonic training dynamics and suggest that high-dimensional fine-tuning may admit a broader class of viable optimization methods than worst-case theory implies.
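As a rough illustration of the abstract's two claims (not the authors' implementation), the sketch below runs a small-population antithetic ES on a quadratic reward whose curvature is concentrated in a few directions; the dimensionality, curvatures, population size, step size, and noise scale are all assumed values chosen only for exposition.

```python
# Toy sketch (illustrative assumptions throughout, not the paper's setup):
# antithetic evolution strategies on a quadratic reward whose curvature is
# concentrated in a few "stiff" directions among many nearly flat ones.
import numpy as np

rng = np.random.default_rng(0)
D, K = 10_000, 5                   # ambient dimensions vs. high-curvature dimensions (assumed)
curv = np.full(D, 3e-4)            # many nearly flat directions
curv[:K] = 1.0                     # a few stiff directions dominate improvement
theta_star = rng.normal(size=D)    # reward maximizer
theta_star[:K] = 3.0               # sizeable initial error along the stiff directions

def reward(thetas):
    """Quadratic reward, maximized at theta_star (rows are population members)."""
    return -0.5 * (((thetas - theta_star) ** 2) * curv).sum(axis=-1)

theta = np.zeros(D)
N, sigma, lr = 30, 0.05, 0.05      # small population, fixed noise scale and step size (assumed)

for step in range(401):
    if step % 40 == 0:
        print(f"step {step:3d}   reward {reward(theta[None])[0]:8.3f}")
    half = rng.normal(size=(N // 2, D))
    eps = np.concatenate([half, -half])             # antithetic perturbation pairs
    fit = reward(theta + sigma * eps)
    fit = (fit - fit.mean()) / (fit.std() + 1e-8)   # standardized fitness
    theta = theta + lr * (fit[:, None] * eps).mean(axis=0) / sigma
```

Run as-is, the printed reward typically climbs quickly while the handful of stiff directions are corrected and then drifts back down as the fixed perturbation noise accumulates in the many flat directions, caricaturing the rise-then-decay dynamics and the small-population viability described above.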
Related papers
- GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler [54.10960908347221]
We model latent thought exploration as conditional sampling from learnable densities and instantiate this idea as a Gaussian Thought Sampler (GTS). GTS predicts context-dependent perturbation distributions over continuous reasoning states and is trained with GRPO-style policy optimization while keeping the backbone frozen.
arXiv Detail & Related papers (2026-02-15T09:57:47Z) - Unregularized Linear Convergence in Zero-Sum Game from Preference Feedback [50.89125374999765]
We provide the first convergence guarantee for Optimistic Multiplicative Weights Update ($\mathtt{OMWU}$) in NLHF. Our analysis identifies a novel marginal convergence behavior, where the probability of rarely played actions grows exponentially from exponentially small values.
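For readers unfamiliar with the update referenced above, here is a generic optimistic multiplicative weights sketch on a small zero-sum matrix game; the payoff matrix, step size, initial strategies, and iteration count are assumptions for illustration, and this is full-information matrix-game OMWU rather than the preference-feedback (NLHF) setting the paper analyzes.

```python
# Generic OMWU sketch for a two-player zero-sum matrix game (illustrative only).
# Each player takes a multiplicative-weights step with an "optimistic" payoff
# estimate: twice the current payoff vector minus the previous one.
import numpy as np

A = np.array([[ 0.0,  1.0, -1.0],   # assumed rock-paper-scissors-style payoffs
              [-1.0,  0.0,  1.0],   # (row player maximizes x^T A y)
              [ 1.0, -1.0,  0.0]])
eta = 0.1                            # assumed step size
x = np.array([0.6, 0.3, 0.1])        # row player's mixed strategy (off equilibrium)
y = np.array([0.2, 0.5, 0.3])        # column player's mixed strategy
gx_prev, gy_prev = A @ y, -A.T @ x   # previous payoff vectors

for _ in range(1000):
    gx, gy = A @ y, -A.T @ x                      # current payoff vectors
    x = x * np.exp(eta * (2 * gx - gx_prev))      # optimistic MWU step (row player)
    y = y * np.exp(eta * (2 * gy - gy_prev))      # optimistic MWU step (column player)
    x, y = x / x.sum(), y / y.sum()               # project back onto the simplex
    gx_prev, gy_prev = gx, gy

# The iterates should spiral in toward the uniform Nash equilibrium (1/3, 1/3, 1/3).
print("row strategy:", np.round(x, 3), "  column strategy:", np.round(y, 3))
```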
arXiv Detail & Related papers (2025-12-31T12:08:29Z) - When does Gaussian equivalence fail and how to fix it: Non-universal behavior of random features with quadratic scaling [15.148577493784051]
Gaussian equivalence theory (GET) states that the behavior of high-dimensional, complex features can be captured by Gaussian surrogates. But numerical experiments show that this equivalence can fail even for simple embeddings under general scaling regimes. We introduce a conditional Gaussian equivalent (CGE) model, which can be viewed as appending a low-dimensional non-Gaussian component to an otherwise high-dimensional Gaussian model.
arXiv Detail & Related papers (2025-12-03T00:23:12Z) - PointNSP: Autoregressive 3D Point Cloud Generation with Next-Scale Level-of-Detail Prediction [87.33016661440202]
Autoregressive point cloud generation has long lagged behind diffusion-based approaches in quality. We propose PointNSP, a coarse-to-fine generative framework that preserves global shape structure at low resolutions. Experiments on ShapeNet show that PointNSP establishes state-of-the-art (SOTA) generation quality for the first time within the autoregressive paradigm.
arXiv Detail & Related papers (2025-10-07T06:31:02Z) - Latent Iterative Refinement Flow: A Geometric-Constrained Approach for Few-Shot Generation [5.062604189239418]
We introduce Latent Iterative Refinement Flow (LIRF), a novel approach to few-shot generation. LIRF establishes a stable latent space using an autoencoder trained with our novel manifold-preservation loss. Within this cycle, candidate samples are refined by a geometric correction operator, a provably contractive mapping.
arXiv Detail & Related papers (2025-09-24T08:57:21Z) - Rao-Blackwell Gradient Estimators for Equivariant Denoising Diffusion [55.95767828747407]
In domains such as molecular and protein generation, physical systems exhibit inherent symmetries that are critical to model. We present a framework that reduces training variance and provides a provably lower-variance gradient estimator. We also present a practical implementation of this estimator incorporating the loss and sampling procedure through a method we call Orbit Diffusion.
arXiv Detail & Related papers (2025-02-14T03:26:57Z) - Hierarchic Flows to Estimate and Sample High-dimensional Probabilities [8.548100130679614]
We introduce low-dimensional models with robust multiscale approximations across energies and densities.
We estimate and sample these wavelet models to generate 2D vorticity fields of turbulence, and images of dark matter.
arXiv Detail & Related papers (2024-05-06T13:44:51Z) - A U-turn on Double Descent: Rethinking Parameter Counting in Statistical Learning [68.76846801719095]
We show precisely when and where double descent occurs, and that its location is not inherently tied to the interpolation threshold p=n.
This provides a resolution to tensions between double descent and statistical intuition.
arXiv Detail & Related papers (2023-10-29T12:05:39Z) - Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes [52.92110730286403]
It is commonly believed that the marginal likelihood should be reminiscent of cross-validation metrics and that both should deteriorate with larger input dimensions.
We prove that by tuning hyperparameters, the performance, as measured by the marginal likelihood, improves monotonically with the input dimension.
We also prove that cross-validation metrics exhibit qualitatively different behavior that is characteristic of double descent.
arXiv Detail & Related papers (2022-10-14T08:09:33Z) - Beyond the Edge of Stability via Two-step Gradient Updates [49.03389279816152]
Gradient Descent (GD) is a powerful workhorse of modern machine learning.
GD's ability to find local minimisers is only guaranteed for losses with Lipschitz gradients.
This work focuses on simple, yet representative, learning problems via analysis of two-step gradient updates.
arXiv Detail & Related papers (2022-06-08T21:32:50Z) - High-dimensional limit theorems for SGD: Effective dynamics and critical scaling [6.950316788263433]
We prove limit theorems for the trajectories of summary statistics of stochastic gradient descent (SGD).
We identify a critical scaling regime for the step-size, below which the effective ballistic dynamics matches gradient flow for the population loss.
Around the fixed points of this effective dynamics, the corresponding diffusive limits can be quite complex and even degenerate.
arXiv Detail & Related papers (2022-06-08T17:42:18Z) - On the Role of Optimization in Double Descent: A Least Squares Study [30.44215064390409]
We prove an excess risk bound for the gradient descent solution of the least squares objective.
We find that in the case of noiseless regression, double descent is explained solely by optimization-related quantities.
We empirically explore if our predictions hold for neural networks.
arXiv Detail & Related papers (2021-07-27T09:13:11Z)