GRPOformer: Advancing Hyperparameter Optimization via Group Relative Policy Optimization
- URL: http://arxiv.org/abs/2509.17105v1
- Date: Sun, 21 Sep 2025 14:54:51 GMT
- Title: GRPOformer: Advancing Hyperparameter Optimization via Group Relative Policy Optimization
- Authors: Haoxin Guo, Jiawen Pan, Weixin Zhai
- Abstract summary: We propose a novel framework that integrates reinforcement learning (RL) with Transformers. In GRPOformer, Transformers are employed to generate new hyperparameter configurations from historical optimization trajectories. We also introduce Policy Churn Regularization (PCR) to enhance the stability of GRPO training.
- Score: 1.6759048077528458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Hyperparameter optimization (HPO) plays a critical role in improving model performance. Transformer-based HPO methods have shown great potential; however, existing approaches rely heavily on large-scale historical optimization trajectories and lack effective reinforcement learning (RL) techniques, thereby limiting their efficiency and performance improvements. Inspired by the success of Group Relative Policy Optimization (GRPO) in large language models (LLMs), we propose GRPOformer -- a novel hyperparameter optimization framework that integrates reinforcement learning (RL) with Transformers. In GRPOformer, Transformers are employed to generate new hyperparameter configurations from historical optimization trajectories, while GRPO enables rapid trajectory construction and optimization strategy learning from scratch. Moreover, we introduce Policy Churn Regularization (PCR) to enhance the stability of GRPO training. Experimental results on OpenML demonstrate that GRPOformer consistently outperforms baseline methods across diverse tasks, offering new insights into the application of RL for HPO.
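The abstract names GRPO as the RL component but does not spell out how it scores candidates. As a rough illustration only, the sketch below shows the group-relative advantage commonly associated with GRPO: each sampled candidate's reward is normalized against the mean and standard deviation of its own group, so no learned value function is needed. The function name, the reward values, and the use of the population standard deviation are assumptions made for this example; this is not the GRPOformer implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    Assumption: population standard deviation; implementations differ on this.
    eps guards against division by zero when all rewards in a group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical validation accuracies for four hyperparameter
# configurations sampled as one group.
rewards = [0.81, 0.79, 0.90, 0.70]
advantages = group_relative_advantages(rewards)

# Index of the configuration with the highest advantage in the group.
best = max(range(len(rewards)), key=lambda i: advantages[i])
print(best, advantages)
```

Configurations that beat the group average receive a positive advantage and are reinforced; those below it are discouraged. In an HPO setting the reward would plausibly be a validation metric of the model trained with each configuration, though the abstract does not state which reward GRPOformer uses.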
Related papers
- iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts. Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
(2026-02-09T18:45:11Z)
- TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization [97.18886232580131]
Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration. We propose Turn-Level GRPO, a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization.
(2026-01-23T06:21:33Z)
- Adaptive-Boundary-Clipping GRPO: Ensuring Bounded Ratios for Stable and Generalizable Training [7.404779700134294]
Adaptive-Boundary-Clipping GRPO (ABC-GRPO) is an asymmetric and adaptive refinement of the original GRPO framework. ABC-GRPO achieves superior performance over standard GRPO on mathematical reasoning tasks.
(2026-01-07T13:04:52Z)
- GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning [52.16150076582931]
We propose Group Relative Policy Optimization for Representation Model (GRPO-RM). Our method establishes a predefined output set to functionally replace token sequence sampling in large language models (LLMs). A specialized reward function is designed to accommodate the properties of representation models.
(2025-11-19T09:19:39Z)
- Group Sequence Policy Optimization [55.40088895148603]
Group Sequence Policy Optimization (GSPO) is a stable, efficient, and performant reinforcement learning algorithm. GSPO defines the importance ratio based on sequence likelihood and performs sequence-level clipping, rewarding, and optimization.
(2025-07-24T03:50:32Z)
- VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Gudied Iterative Policy Optimization [59.39976343879587]
VerIPO aims to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The training loop benefits from GRPO's expansive search and DPO's targeted optimization. Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs.
(2025-05-25T06:41:28Z)
- GAAPO: Genetic Algorithmic Applied to Prompt Optimization [0.0]
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, with their performance heavily dependent on the quality of input prompts. While prompt engineering has proven effective, it typically relies on manual adjustments, making it time-consuming and potentially suboptimal. This paper introduces Genetic Algorithm Applied to Prompt Optimization, a novel hybrid optimization framework that leverages genetic principles to evolve prompts through successive generations.
(2025-04-09T11:19:42Z)
- Hybrid Group Relative Policy Optimization: A Multi-Sample Approach to Enhancing Policy Optimization [0.0]
Hybrid Group Relative Policy Optimization (Hybrid GRPO) is a reinforcement learning framework. It incorporates empirical multi-sample action evaluation while preserving the stability of value function-based learning. By integrating structured empirical sampling with reinforcement learning stability mechanisms, Hybrid GRPO has potential applications in autonomous robotics, financial modeling, and AI-driven control systems.
(2025-01-30T21:04:01Z)
- Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers [108.72225067368592]
We propose a novel perspective to investigate the design of large language models (LLMs)-based prompts. We identify two pivotal factors in model parameter learning: update direction and update method. We develop a capable Gradient-inspired Prompt-based GPO.
(2024-02-27T15:05:32Z)
- Towards Learning Universal Hyperparameter Optimizers with Transformers [57.35920571605559]
We introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction.
Our experiments demonstrate that the OptFormer can imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates.
(2022-05-26T12:51:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.