Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning
- URL: http://arxiv.org/abs/2509.24372v1
- Date: Mon, 29 Sep 2025 07:19:34 GMT
- Title: Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning
- Authors: Xin Qiu, Yulu Gan, Conor F. Hayes, Qiyao Liang, Elliot Meyerson, Babak Hodjat, Risto Miikkulainen
- Abstract summary: Reinforcement learning is arguably the most prominent fine-tuning method. Evolution strategies (ES) once showed comparable performance to RL on models with a few million parameters. ES can search efficiently over billions of parameters and outperform existing RL fine-tuning methods in multiple respects.
- Score: 16.095629872564874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning pre-trained large language models (LLMs) for downstream tasks is a critical step in the AI deployment pipeline. Reinforcement learning (RL) is arguably the most prominent fine-tuning method, contributing to the birth of many state-of-the-art LLMs. In contrast, evolution strategies (ES), which once showed performance comparable to RL on models with a few million parameters, were neglected due to pessimistic perceptions of their scalability to larger models. In this work, we report the first successful attempt to scale up ES for fine-tuning the full parameters of LLMs, showing the surprising fact that ES can search efficiently over billions of parameters and outperform existing RL fine-tuning methods in multiple respects: sample efficiency, tolerance to long-horizon rewards, robustness across different base LLMs, less tendency toward reward hacking, and more stable performance across runs. It therefore serves as a basis for unlocking a new direction in LLM fine-tuning beyond what current RL techniques provide. The source code is provided at: https://github.com/VsonicV/es-fine-tuning-paper.
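The core loop behind the method described above is a population-based zeroth-order procedure: perturb the parameters with Gaussian noise, score each perturbed model with a scalar reward, and move along the reward-weighted average of the noise. Below is a minimal sketch of vanilla ES over a flat parameter vector; the function names, hyperparameters, and toy reward are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import numpy as np

def es_finetune(theta, reward_fn, iterations=100, pop_size=30, sigma=0.01, lr=0.05):
    """Vanilla evolution strategies over a flat parameter vector.

    Each iteration: sample Gaussian perturbations, score every perturbed
    model with a scalar (possibly non-differentiable) reward, then move
    the parameters along the reward-weighted average of the noise.
    """
    for _ in range(iterations):
        noise = np.random.randn(pop_size, theta.size)       # one perturbation per population member
        rewards = np.array([reward_fn(theta + sigma * eps) for eps in noise])
        adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # normalize for variance reduction
        theta = theta + lr / (pop_size * sigma) * (noise.T @ adv)  # zeroth-order gradient estimate
    return theta

# Toy usage at small scale: pull a 100k-parameter vector toward a target.
theta0 = np.zeros(100_000)
target = np.ones_like(theta0)
toy_reward = lambda p: -np.mean((p - target) ** 2)   # stand-in for a task reward
theta_star = es_finetune(theta0, toy_reward, iterations=10)
```

Note that the update never requires gradients through the model, which is what makes ES tolerant of long-horizon, non-differentiable rewards; large-scale variants also commonly regenerate perturbations from random seeds rather than materializing the noise matrix, keeping per-worker memory low.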
Related papers
- Tool Zero: Training Tool-Augmented LLMs via Pure RL from Scratch [63.40752011615843]
Training tool-augmented language models has emerged as a promising approach to enhancing their capabilities for complex tasks. We propose a dynamic generalization-guided reward design for rule-based reinforcement learning. We show that our models achieve a performance improvement of over 7% compared to both SFT and RL-with-SFT models.
arXiv Detail & Related papers (2025-11-02T16:33:45Z)
- The Road Less Traveled: Enhancing Exploration in LLMs via Sequential Sampling [39.65138471548881]
Reinforcement learning (RL) has been pivotal in enhancing the reasoning capabilities of large language models (LLMs). We propose SESA, a novel SEquential SAmpling framework that generates diverse solution sketches sequentially before expanding them into full reasoning paths. Our experiments on a synthetic task show that sequential sampling consistently outperforms traditional RL methods in terms of path diversity and recovery from collapse.
arXiv Detail & Related papers (2025-10-17T10:15:11Z)
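SESA's two-phase idea, as summarized above, is to commit to several distinct high-level sketches before spending tokens on full reasoning paths. A minimal rendering of that loop, assuming a `generate(prompt, max_tokens)` callable that samples from an LLM; every name here is a hypothetical stand-in, not the paper's API.

```python
def sequential_sampling(problem, generate, n_sketches=4):
    """Phase 1: propose solution sketches sequentially, each conditioned on
    the sketches so far (diversity by construction rather than temperature).
    Phase 2: expand every sketch independently into a full reasoning path."""
    sketches = []
    for _ in range(n_sketches):
        prompt = (f"Problem: {problem}\n"
                  f"Sketches so far: {sketches}\n"
                  "Propose a different high-level solution sketch:")
        sketches.append(generate(prompt, max_tokens=64))
    return [generate(f"Problem: {problem}\nSketch: {s}\nFull solution:", max_tokens=512)
            for s in sketches]
```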
- On Predictability of Reinforcement Learning Dynamics for Large Language Models [20.320268628019047]
This work identifies two fundamental properties of RL-induced parameter updates in large language models. We propose AlphaRL, a plug-in acceleration framework that extrapolates the final parameter update using a short early training window.
arXiv Detail & Related papers (2025-10-01T06:13:50Z)
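The AlphaRL abstract implies that the direction of the cumulative parameter update settles early in RL training, so the final update can be predicted from a short warm-up window. A toy linear-extrapolation sketch of that idea; the actual AlphaRL rule may differ.

```python
import numpy as np

def extrapolate_update(theta_init, theta_early, scale):
    """Predict final parameters by scaling the early cumulative update.

    Assumes the update direction is roughly fixed after the warm-up
    window and only its magnitude grows; `scale` would be estimated
    from training dynamics (illustrative assumption).
    """
    delta_early = theta_early - theta_init        # update accumulated in the short window
    return theta_init + scale * delta_early       # extrapolated endpoint of training
```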
- Reinforcement Learning Fine-Tuning Enhances Activation Intensity and Diversity in the Internal Circuitry of LLMs [13.036236161537147]
Large language models (LLMs) acquire extensive prior knowledge through large-scale pretraining and can be further enhanced via supervised fine-tuning (SFT) or reinforcement learning (RL)-based post-training. A growing body of evidence has shown that RL fine-tuning improves the capability of LLMs beyond what SFT alone achieves. However, the underlying mechanisms by which RL fine-tuning enhances the capabilities of LLMs with distinct intrinsic characteristics remain underexplored.
arXiv Detail & Related papers (2025-09-25T11:51:05Z)
- SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning [81.7764584515496]
Vision-Language-Action (VLA) models have emerged as a powerful paradigm for robotic manipulation. These models face two fundamental challenges: the scarcity and the high cost of large-scale human-operated robotic trajectories. We introduce SimpleVLA-RL, an efficient reinforcement learning framework tailored for VLA models.
arXiv Detail & Related papers (2025-09-11T17:59:17Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems (those yielding no RL signals and mixed-quality reasoning traces) can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- LENSLLM: Unveiling Fine-Tuning Dynamics for LLM Selection [11.353302879735862]
Open-sourced Large Language Models (LLMs) and diverse downstream tasks require efficient model selection. We propose a novel theoretical framework that provides a proper lens for assessing the generalization capabilities of LLMs. In particular, we first derive a PAC-Bayesian Generalization Bound that unveils the fine-tuning dynamics of LLMs. We then introduce LENSLLM, a Neural Tangent Kernel (NTK)-based Rectified Scaling Model that enables accurate performance predictions.
arXiv Detail & Related papers (2025-05-01T15:07:32Z)
- Think or Not Think: A Study of Explicit Thinking in Rule-Based Visual Reinforcement Fine-Tuning [26.835266813794316]
We first propose CLS-RL for MLLM image classification, using verifiable rewards for fine-tuning. We then rethink and question whether explicit thinking in RFT is always necessary. No-Thinking-RL explores RFT without thinking by introducing a simple equality accuracy reward.
arXiv Detail & Related papers (2025-03-20T14:37:45Z)
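The "simple equality accuracy reward" named above is presumably the most basic verifiable reward: full credit only when the predicted class matches the label. A guess at its shape, with the normalization details as assumptions:

```python
def equality_accuracy_reward(prediction: str, label: str) -> float:
    """Return 1.0 iff the model's final answer exactly matches the
    ground-truth class label (after trivial whitespace/case cleanup);
    no credit is given for intermediate 'thinking' tokens."""
    return float(prediction.strip().lower() == label.strip().lower())
```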
- On the Emergence of Thinking in LLMs I: Searching for the Right Intuition [34.32871896067864]
We propose a post-training framework called Reinforcement Learning via Self-Play (RLSP). RLSP involves three steps: supervised fine-tuning with human or synthetic demonstrations of the reasoning process, using an exploration reward signal to encourage diverse and efficient reasoning behaviors, and RL training with an outcome verifier to ensure correctness while preventing reward hacking. Empirical studies in the math domain show that RLSP improves reasoning.
arXiv Detail & Related papers (2025-02-10T18:52:04Z)
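The three RLSP steps enumerated above map naturally onto a pipeline: imitation first, then RL with an exploration bonus plus a verified-outcome reward. A schematic outline with hypothetical function names, not the authors' code:

```python
def exploration_bonus(response: str) -> float:
    # Placeholder diversity signal: fraction of distinct tokens; the
    # paper's actual exploration reward is more sophisticated.
    tokens = response.split()
    return len(set(tokens)) / max(len(tokens), 1)

def rlsp_pipeline(base_model, demos, prompts, verifier, sft, rl_step):
    """Schematic RLSP recipe: SFT on reasoning demonstrations, then RL
    where the reward combines exploration with verified correctness."""
    model = sft(base_model, demos)                      # step 1: imitate reasoning traces
    for prompt in prompts:                              # steps 2-3: RL phase
        response = model.sample(prompt)
        reward = (exploration_bonus(response)           # step 2: encourage diverse reasoning
                  + float(verifier(prompt, response)))  # step 3: verifier guards against reward hacking
        model = rl_step(model, prompt, response, reward)
    return model
```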
- ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL [80.10358123795946]
We develop a framework for building multi-turn RL algorithms for fine-tuning large language models.
Our framework adopts a hierarchical RL approach and runs two RL algorithms in parallel.
Empirically, we find that ArCHer significantly improves efficiency and performance on agent tasks.
arXiv Detail & Related papers (2024-02-29T18:45:56Z)
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
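The abstract above describes LINVIT as folding LLM guidance into value-based RL as a regularization factor. One plausible toy rendering is a tabular Q-learning update whose target is biased toward actions the LLM considers likely; the exact regularizer and weighting below are illustrative assumptions, not the published algorithm.

```python
import numpy as np

def guided_q_update(q, s, a, reward, s_next, llm_policy,
                    alpha=0.1, gamma=0.99, lam=0.5):
    """One tabular Q-learning step with an LLM-guidance regularizer.

    q          : (n_states, n_actions) value table
    llm_policy : (n_states, n_actions) action probabilities from the LLM
    lam        : strength of the guidance term; a more informative prior
                 should reduce the environment data needed, per the
                 paper's claim
    """
    bootstrap = reward + gamma * np.max(q[s_next])     # standard TD target
    guidance = np.log(llm_policy[s, a] + 1e-8)         # bonus for LLM-likely actions
    q[s, a] += alpha * (bootstrap + lam * guidance - q[s, a])
    return q
```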
- Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs [67.38165028487242]
We introduce Dynamic Sparse No Training (DSnoT), a training-free approach to fine-tuning sparse large language models (LLMs). Inspired by Dynamic Sparse Training, DSnoT minimizes the reconstruction error between the dense and sparse LLMs. Our paper offers fresh insights into how to fine-tune sparse LLMs in an efficient, training-free manner and opens new avenues for scaling the great potential of sparsity to LLMs.
arXiv Detail & Related papers (2023-10-13T07:38:52Z)
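DSnoT, per the entry above, adjusts the pruning mask so the sparse layer's output tracks the dense layer's, with no gradient training. A simplified per-layer grow-and-prune step in that spirit; the scoring rules are stand-ins for the paper's exact criteria.

```python
import numpy as np

def dsnot_step(w_dense, mask, x):
    """Swap one pruned weight for one kept weight to shrink the
    reconstruction error between x @ w_dense and x @ (w_dense * mask),
    keeping sparsity constant. Criteria here are simplified stand-ins."""
    err = (x @ w_dense - x @ (w_dense * mask)).mean(axis=0)   # per-output-channel error
    x_mean = x.mean(axis=0)
    # Reviving weight (i, j) shifts output j by ~x_mean[i] * w_dense[i, j];
    # grow the pruned weight whose revival best cancels the current error.
    gain = x_mean[:, None] * w_dense * err[None, :]
    grow = np.unravel_index(np.argmax(np.where(mask == 0, gain, -np.inf)), mask.shape)
    # Prune the kept weight with the smallest output contribution.
    contrib = np.abs(x_mean[:, None] * w_dense)
    prune = np.unravel_index(np.argmin(np.where(mask == 1, contrib, np.inf)), mask.shape)
    mask[grow], mask[prune] = 1.0, 0.0
    return mask
```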
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.