Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
- URL: http://arxiv.org/abs/2509.20616v1
- Date: Wed, 24 Sep 2025 23:47:36 GMT
- Title: Training Task Reasoning LLM Agents for Multi-turn Task Planning via Single-turn Reinforcement Learning
- Authors: Hanjiang Hu, Changliu Liu, Na Li, Yebin Wang
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use. This paper introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning problems.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in knowledge acquisition, reasoning, and tool use, making them promising candidates for autonomous agent applications. However, training LLM agents for complex multi-turn task planning faces significant challenges, including sparse episode-wise rewards, credit assignment across long horizons, and the computational overhead of reinforcement learning in multi-turn interaction settings. To this end, this paper introduces a novel approach that transforms multi-turn task planning into single-turn task reasoning problems, enabling efficient policy optimization through Group Relative Policy Optimization (GRPO) with dense, verifiable rewards derived from expert trajectories. Our theoretical analysis shows that GRPO improvement on single-turn task reasoning yields a higher multi-turn success probability within the minimal number of turns, as well as generalization to subtasks with shorter horizons. Experimental evaluation on a complex task-planning benchmark demonstrates that our 1.5B-parameter model trained with single-turn GRPO outperforms baseline models of up to 14B parameters, with success rates of 70% on long-horizon planning tasks with over 30 steps. We also theoretically and empirically validate strong cross-task generalizability: models trained on complex tasks can successfully complete all simpler subtasks.
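The GRPO objective described here scores a group of sampled completions against a verifiable reward and normalizes each reward within its group. A minimal sketch of that group-relative advantage computation (the function name and the binary 0/1 reward are illustrative assumptions, not the paper's implementation):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled completion's
    reward by the mean and standard deviation of its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # Identical rewards carry no learning signal for the group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Four plans sampled for one task-reasoning prompt, scored 1.0 if the
# plan matches the expert trajectory, else 0.0 (a dense, verifiable reward).
advantages = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the advantages are computed relative to the group, no separate learned value function is needed, which is what makes the single-turn formulation cheap to optimize.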
Related papers
- Operationalising the Superficial Alignment Hypothesis via Task Complexity
We propose a new metric called task complexity: the length of the shortest program that achieves a target performance on a task. Our results highlight that task adaptation often requires surprisingly little information -- often just a few kilobytes.
arXiv Detail & Related papers (2026-02-17T18:59:39Z)
- A Goal Without a Plan Is Just a Wish: Efficient and Effective Global Planner Training for Long-Horizon Agent Tasks
Agents based on large language models (LLMs) struggle with aimless trial-and-error and hallucinated actions due to a lack of global planning in long-horizon tasks. We introduce a plan-and-execute framework and propose a planner training method that enhances the executor agent's planning abilities without human effort. Experiments show that executor agents equipped with our planner outperform existing methods, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2025-10-07T06:10:53Z)
- MEJO: MLLM-Engaged Surgical Triplet Recognition via Inter- and Intra-Task Joint Optimization
We propose a framework that empowers both inter- and intra-task optimization for surgical triplet recognition. For inter-task optimization, we introduce the Shared-Specific-Disentangled (S$^2$D) learning scheme, which decomposes representations into task-shared and task-specific components. For intra-task optimization conflicts, we develop a Coordinated Gradient Learning (CGL) strategy that dissects and rebalances positive-negative ambiguities.
arXiv Detail & Related papers (2025-09-16T09:48:52Z)
- PilotRL: Training Language Model Agents via Global Planning-Guided Progressive Reinforcement Learning
Large Language Models (LLMs) have shown remarkable advancements in tackling agent-oriented tasks. Current approaches predominantly rely on supervised fine-tuning, which often leads models to memorize established task-completion trajectories. We introduce AdaPlan, an adaptive global plan-based agent paradigm that aims to synergize high-level explicit guidance with execution.
arXiv Detail & Related papers (2025-08-01T06:17:11Z)
- Omni-Thinker: Scaling Multi-Task RL in LLMs with Hybrid Reward and Task Scheduling
We present Omni-Thinker, a unified reinforcement learning framework that scales large language models across diverse tasks. Our scheduler orders tasks according to accuracy backward transfer (BWT), reducing forgetting and improving multi-task performance.
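A hedged sketch of one plausible reading of such a BWT-driven schedule (the task names, BWT values, and the "least-forgetting tasks first" ordering rule are assumptions for illustration, not Omni-Thinker's actual scheduler):

```python
def schedule_by_bwt(tasks, bwt):
    """Order tasks so that those with the highest accuracy backward
    transfer (i.e. causing the least forgetting of earlier tasks)
    are trained first."""
    return sorted(tasks, key=lambda task: bwt[task], reverse=True)

# Hypothetical per-task BWT estimates from a pilot run.
order = schedule_by_bwt(["math", "code", "qa"],
                        {"math": -0.02, "code": 0.01, "qa": -0.05})
```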
arXiv Detail & Related papers (2025-07-20T01:50:16Z)
- PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving
We introduce PLAN-TUNING, a framework that distills synthetic task decompositions from large-scale language models. PLAN-TUNING fine-tunes smaller models via supervised and reinforcement-learning objectives to improve complex reasoning. Our analysis demonstrates how planning trajectories improve complex reasoning capabilities.
arXiv Detail & Related papers (2025-07-10T07:30:44Z)
- Bigger, Regularized, Categorical: High-Capacity Value Functions are Efficient Multi-Task Learners
We show that the use of high-capacity value models trained via cross-entropy and conditioned on learnable task embeddings addresses the problem of task interference in online reinforcement learning. We test our approach on 7 multi-task benchmarks with over 280 unique tasks, spanning high-degree-of-freedom humanoid control and discrete vision-based RL.
arXiv Detail & Related papers (2025-05-29T06:41:45Z)
- MALT: Improving Reasoning with Multi-Agent LLM Training
MALT (Multi-Agent LLM Training) is a novel post-training strategy that divides the reasoning process into generation, verification, and refinement steps. On MATH, GSM8K, and CSQA, MALT surpasses the same baseline LLM with relative improvements of 15.66%, 7.42%, and 9.40%, respectively.
arXiv Detail & Related papers (2024-12-02T19:30:36Z)
- Planning with Multi-Constraints via Collaborative Language Agents
This paper introduces Planning with Multi-Constraints (PMC), a zero-shot methodology for collaborative multi-agent systems. PMC simplifies constrained complex task planning by decomposing it into a hierarchy of subordinate tasks. PMC achieved an average 42.68% success rate on TravelPlanner, significantly higher than GPT-4 (2.92%), and outperformed GPT-4 with ReAct on API-Bank by 13.64%.
arXiv Detail & Related papers (2024-05-26T10:33:17Z)
- Proximal Curriculum with Task Correlations for Deep Reinforcement Learning
We consider curriculum design in contextual multi-task settings where the agent's final performance is measured w.r.t. a target distribution over complex tasks.
We propose a novel curriculum, ProCuRL-Target, that effectively balances selecting tasks that are not too difficult for the agent against progressing the agent's learning toward the target distribution by leveraging task correlations.
arXiv Detail & Related papers (2024-05-03T21:07:54Z)
- Sample Efficient Myopic Exploration Through Multitask Reinforcement Learning with Diverse Tasks
This paper shows that when an agent is trained on a sufficiently diverse set of tasks, a generic policy-sharing algorithm with myopic exploration design can be sample-efficient.
To the best of our knowledge, this is the first theoretical demonstration of the "exploration benefits" of MTRL.
arXiv Detail & Related papers (2024-03-03T22:57:44Z)
- In Defense of the Unitary Scalarization for Deep Multi-Task Learning
We present a theoretical analysis suggesting that many specialized multi-task optimizers can be interpreted as forms of regularization.
We show that, when coupled with standard regularization and stabilization techniques, unitary scalarization matches or improves upon the performance of complex multi-task optimizers.
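"Unitary scalarization" here simply means summing the per-task losses with unit weights. A minimal sketch (the loss values are made up for illustration):

```python
def unitary_scalarization(task_losses):
    """The multi-task objective is the unweighted sum of per-task
    losses; regularization is applied exactly as in single-task
    training, with no task-specific reweighting."""
    return sum(task_losses)

# Three hypothetical per-task losses combined into one training objective.
total_loss = unitary_scalarization([0.7, 1.2, 0.4])
```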
arXiv Detail & Related papers (2022-01-11T18:44:17Z)
- A Simple General Approach to Balance Task Difficulty in Multi-Task Learning
In multi-task learning, the difficulty levels of different tasks vary.
We propose a Balanced Multi-Task Learning (BMTL) framework.
The proposed BMTL framework is very simple and can be combined with most multi-task learning models.
arXiv Detail & Related papers (2020-02-12T04:31:34Z)