TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization
- URL: http://arxiv.org/abs/2601.16480v1
- Date: Fri, 23 Jan 2026 06:21:33 GMT
- Title: TL-GRPO: Turn-Level RL for Reasoning-Guided Iterative Optimization
- Authors: Peiji Li, Linyang Li, Handa Sun, Wenjin Mai, Yongkang Chen, Xiaozhe Li, Yue Shen, Yichuan Ma, Yiliu Sun, Jiaxi Cao, Zhishu He, Bo Wang, Xiaoqing Zheng, Zhaori Bi, Xipeng Qiu, Qipeng Guo, Kai Chen, Dahua Lin,
- Abstract summary: Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration.<n>We propose Turn-Level GRPO, a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization.
- Score: 97.18886232580131
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models have demonstrated strong reasoning capabilities in complex tasks through tool integration, which is typically framed as a Markov Decision Process and optimized with trajectory-level RL algorithms such as GRPO. However, a common class of reasoning tasks, iterative optimization, presents distinct challenges: the agent interacts with the same underlying environment state across turns, and the value of a trajectory is determined by the best turn-level reward rather than cumulative returns. Existing GRPO-based methods cannot perform fine-grained, turn-level optimization in such settings, while black-box optimization methods discard prior knowledge and reasoning capabilities. To address this gap, we propose Turn-Level GRPO (TL-GRPO), a lightweight RL algorithm that performs turn-level group sampling for fine-grained optimization. We evaluate TL-GRPO on analog circuit sizing (ACS), a challenging scientific optimization task requiring multiple simulations and domain expertise. Results show that TL-GRPO outperforms standard GRPO and Bayesian optimization methods across various specifications. Furthermore, our 30B model trained with TL-GRPO achieves state-of-the-art performance on ACS tasks under same simulation budget, demonstrating both strong generalization and practical utility.
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs.<n>We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation.<n>To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z) - iGRPO: Self-Feedback-Driven LLM Reasoning [88.83313431248473]
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions.<n>We introduce Iterative Group Relative Policy Optimization (iGRPO), a two-stage extension of GRPO that adds dynamic self-conditioning through model-generated drafts.<n>Under matched rollout budgets, iGRPO consistently outperforms GRPO across base models.
arXiv Detail & Related papers (2026-02-09T18:45:11Z) - GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization [133.27496265096445]
We show how to apply Group Relative Policy Optimization under multi-reward setting without examining its suitability.<n>We then introduce Group reward-Decoupled Normalization Policy Optimization (GDPO), a new policy optimization method to resolve these issues.<n>GDPO consistently outperforms GRPO, demonstrating its effectiveness and generalizability for multi-reward reinforcement learning optimization.
arXiv Detail & Related papers (2026-01-08T18:59:24Z) - Expand and Prune: Maximizing Trajectory Diversity for Effective GRPO in Generative Models [37.48289959306949]
Group Relative Policy Optimization is a powerful technique for aligning generative models.<n>But its effectiveness is bottlenecked by the conflict between large group sizes and prohibitive computational costs.<n>We propose Pro-GRPO, a novel dynamic framework that integrates latent feature-based trajectory pruning into the sampling process.
arXiv Detail & Related papers (2025-12-17T11:44:34Z) - GRPO-RM: Fine-Tuning Representation Models via GRPO-Driven Reinforcement Learning [52.16150076582931]
We propose Group Relative Policy Optimization for Representation Model (GRPO-RM)<n>Our method establishes a predefined output set to functionally replace token sequence sampling in large language models (LLMs)<n>A specialized reward function is designed to accommodate the properties of representation models.
arXiv Detail & Related papers (2025-11-19T09:19:39Z) - Sample By Step, Optimize By Chunk: Chunk-Level GRPO For Text-to-Image Generation [29.015994347609936]
Group Relative Policy Optimization (GRPO) has shown strong potential for flow-matching-based text-to-image (T2I) generation.<n>We argue that shifting the optimization paradigm from the step level to the chunk level can effectively alleviate these issues.<n>Chunk-GRPO is the first chunk-level GRPO-based approach for T2I generation.
arXiv Detail & Related papers (2025-10-24T15:50:36Z) - VAGPO: Vision-augmented Asymmetric Group Preference Optimization for Graph Routing Problems [27.70647397895125]
Graph routing problems play a vital role in web-related networks, where finding optimal paths across graphs is essential.<n>Recent data-driven optimization methods have made significant progress, yet they often face limitations in training efficiency and generalization to large-scale instances.<n>We propose a novel Vision-augmented Asymmetric Group Preference Optimization (VAGPO) approach, which captures both spatial structure and temporal dependencies.
arXiv Detail & Related papers (2025-08-03T14:19:12Z) - TGRPO :Fine-tuning Vision-Language-Action Model via Trajectory-wise Group Relative Policy Optimization [12.061547251822326]
Trajectory-based Group Relative Policy Optimization (TGRPO) is an online RL-based training framework for Visual-Language-Action (VLA) models.<n>We show that TGRPO achieves an average success rate of 80.7%, which is 4.2% higher than that of Supervised Fine-Tuning (SFT) and outperforms other representative RL-based post-training methods.
arXiv Detail & Related papers (2025-06-10T04:27:49Z) - Unleashing the Potential of Large Language Models as Prompt Optimizers: Analogical Analysis with Gradient-based Model Optimizers [108.72225067368592]
We propose a novel perspective to investigate the design of large language models (LLMs)-based prompts.<n>We identify two pivotal factors in model parameter learning: update direction and update method.<n>We develop a capable Gradient-inspired Prompt-based GPO.
arXiv Detail & Related papers (2024-02-27T15:05:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.