GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
- URL: http://arxiv.org/abs/2507.19457v1
- Date: Fri, 25 Jul 2025 17:42:32 GMT
- Title: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
- Authors: Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab
- Abstract summary: We introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems. It can often turn even just a few rollouts into a large quality gain.
- Score: 106.98018881499362
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA's design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization.
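To make the loop described in the abstract concrete, below is a minimal illustrative Python sketch of a GEPA-style reflective prompt-evolution loop. All names here (`llm`, `run_system`, `score`, `Candidate`, `pareto_frontier`) are assumptions introduced for illustration and are not the authors' actual API; the real system operates over multi-prompt AI programs and richer trajectories.

```python
"""Minimal sketch of a GEPA-style reflective prompt-evolution loop.
All names (llm, run_system, score, Candidate) are illustrative assumptions,
not the authors' implementation."""
import random
from dataclasses import dataclass, field


@dataclass
class Candidate:
    prompt: str                                   # the instruction being evolved
    scores: list = field(default_factory=list)    # per-task scores on a dev set


def pareto_frontier(candidates):
    """Keep candidates that are not dominated on every task by some other candidate."""
    frontier = []
    for c in candidates:
        dominated = any(
            all(o >= s for o, s in zip(other.scores, c.scores))
            and any(o > s for o, s in zip(other.scores, c.scores))
            for other in candidates if other is not c
        )
        if not dominated:
            frontier.append(c)
    return frontier


def gepa_optimize(llm, run_system, score, seed_prompt, tasks, budget=30):
    """
    llm(text) -> text        : reflection / mutation model (assumed interface)
    run_system(prompt, task) : executes the AI system, returns a trajectory string
    score(trajectory, task)  : scalar reward in [0, 1]
    """
    pool = [Candidate(seed_prompt, [score(run_system(seed_prompt, t), t) for t in tasks])]

    for _ in range(budget):
        # 1. Sample a parent from the Pareto frontier of the pool's own attempts.
        parent = random.choice(pareto_frontier(pool))

        # 2. Roll out the system on a small minibatch, keeping full trajectories.
        batch = random.sample(tasks, k=min(3, len(tasks)))
        rollouts = [(t, run_system(parent.prompt, t)) for t in batch]

        # 3. Reflect in natural language: diagnose failures and propose a prompt edit.
        reflection = llm(
            "Here is the current instruction:\n" + parent.prompt +
            "\n\nHere are execution traces (reasoning, tool calls, outputs) and rewards:\n" +
            "\n".join(f"TASK: {t}\nTRACE: {r}\nREWARD: {score(r, t)}" for t, r in rollouts) +
            "\n\nDiagnose what went wrong and write an improved instruction."
        )
        child_prompt = reflection.strip()

        # 4. Test the proposed prompt; keep it if it improves on the parent anywhere.
        child = Candidate(child_prompt, [score(run_system(child_prompt, t), t) for t in tasks])
        if any(cs > ps for cs, ps in zip(child.scores, parent.scores)):
            pool.append(child)

    # Return the candidate with the best average score.
    return max(pool, key=lambda c: sum(c.scores) / len(c.scores))
```

The design choices mirrored in this sketch are that new candidates are proposed by natural-language reflection over full execution traces rather than by gradient updates, and that parents are drawn from the Pareto frontier across tasks, so complementary lessons are preserved instead of collapsing to a single best-on-average prompt.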
Related papers
- Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs [77.22973302887435]
Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). We present mmGRPO, a simple multi-module extension of GRPO that groups LM calls by module across rollouts and handles variable-length and interrupted trajectories. We find that mmGRPO, composed with automatic prompt optimization, improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks.
arXiv Detail & Related papers (2025-08-06T17:28:31Z) - GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers [52.17222304851524]
We introduce GReaTer, a novel prompt optimization technique that directly incorporates gradient information over task-specific reasoning. By utilizing task loss gradients, GReaTer enables self-optimization of prompts for open-source, lightweight language models. GReaTer consistently outperforms previous state-of-the-art prompt optimization methods.
arXiv Detail & Related papers (2024-12-12T20:59:43Z) - GRL-Prompt: Towards Knowledge Graph based Prompt Optimization via Reinforcement Learning [8.307785339429863]
We propose a novel framework for prompt optimization for large language models (LLMs).
GRL-Prompt aims to automatically construct optimal prompts via reinforcement learning (RL) in an end-to-end manner.
Experiments show that GRL-Prompt outperforms recent state-of-the-art methods.
arXiv Detail & Related papers (2024-11-19T10:52:25Z) - Improving the Language Understanding Capabilities of Large Language Models Using Reinforcement Learning [20.13007387453759]
Proximal Policy Optimization (PPO) is a framework to improve the capabilities of large language models (LLMs). PPO consistently outperforms supervised fine-tuning, yielding an average improvement of 6.3 points on GLUE. This work highlights a promising direction for adapting LLMs to new tasks by reframing them as reinforcement learning problems.
arXiv Detail & Related papers (2024-10-14T19:16:56Z) - GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation [108.2008975785364]
Graph Inspired Veracity Extrapolation (GIVE) is a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input. GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak).
arXiv Detail & Related papers (2024-10-11T03:05:06Z) - Large Language Models as Code Executors: An Exploratory Study [29.545321608864295]
This paper pioneers the exploration of Large Language Models (LLMs) as code executors.
We are the first to examine this feasibility across various LLMs, including OpenAI's o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder.
We introduce an Iterative Instruction Prompting (IIP) technique that processes code snippets line by line, enhancing the accuracy of weaker models by an average of 7.22%.
arXiv Detail & Related papers (2024-10-09T08:23:22Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models [9.688626139309013]
Retrieval-Augmented Generation is considered a means to improve the trustworthiness of text generation from large language models.
In this work, we find that the insertion of even a short prefix to the prompt leads to the generation of outputs far away from factually correct answers.
We introduce a novel optimization technique called Gradient Guided Prompt Perturbation.
arXiv Detail & Related papers (2024-02-11T12:25:41Z) - EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers [67.64162164254809]
EvoPrompt is a framework for discrete prompt optimization. It borrows the idea of evolutionary algorithms (EAs) as they exhibit good performance and fast convergence. It significantly outperforms human-engineered prompts and existing methods for automatic prompt generation.
arXiv Detail & Related papers (2023-09-15T16:50:09Z) - Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs.
Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
arXiv Detail & Related papers (2023-02-22T17:44:15Z)