Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space
- URL: http://arxiv.org/abs/2512.04601v1
- Date: Thu, 04 Dec 2025 09:21:44 GMT
- Title: Natural Language Actor-Critic: Scalable Off-Policy Learning in Language Space
- Authors: Joey Hong, Kang Liu, Zhan Ling, Jiecao Chen, Sergey Levine
- Abstract summary: Natural Language Actor-Critic (NLAC) is a novel actor-critic algorithm that trains LLM policies using a generative critic that produces natural language rather than scalar values. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches.
- Score: 57.868527884634894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model (LLM) agents -- LLMs that dynamically interact with an environment over long horizons -- have become an increasingly important area of research, enabling automation in complex tasks involving tool-use, web browsing, and dialogue with people. In the absence of expert demonstrations, training LLM agents has relied on policy gradient methods that optimize LLM policies with respect to an (often sparse) reward function. However, in long-horizon tasks with sparse rewards, learning from trajectory-level rewards can be noisy, leading to training that is unstable and has high sample complexity. Furthermore, policy improvement hinges on discovering better actions through exploration, which can be difficult when actions lie in natural language space. In this paper, we propose Natural Language Actor-Critic (NLAC), a novel actor-critic algorithm that trains LLM policies using a generative LLM critic that produces natural language rather than scalar values. This approach leverages the inherent strengths of LLMs to provide a richer and more actionable training signal; particularly, in tasks with large, open-ended action spaces, natural language explanations for why an action is suboptimal can be immensely useful for LLM policies to reason how to improve their actions, without relying on random exploration. Furthermore, our approach can be trained off-policy without policy gradients, offering a more data-efficient and stable alternative to existing on-policy methods. We present results on a mixture of reasoning, web browsing, and tool-use with dialogue tasks, demonstrating that NLAC shows promise in outperforming existing training approaches and offers a more scalable and stable training paradigm for LLM agents.
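The abstract describes the mechanism only in prose; as a rough illustration, here is a minimal, hypothetical Python sketch of what one natural-language actor-critic step could look like. The `policy` and `critic` callables, the prompt formats, and the `Transition` record are all illustrative assumptions, not the paper's implementation; NLAC's actual objectives and training details are in the paper itself.

```python
# Minimal, hypothetical sketch of one natural-language actor-critic step.
# The `policy`/`critic` callables and prompt formats are illustrative
# assumptions, not NLAC's actual implementation.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Transition:
    state: str     # environment observation rendered as text
    action: str    # the policy's natural-language action
    reward: float  # (possibly sparse) scalar reward

def nl_actor_critic_step(
    policy: Callable[[str], str],  # LLM policy: prompt -> action text
    critic: Callable[[str], str],  # generative LLM critic: prompt -> critique text
    batch: List[Transition],       # off-policy data, e.g., from a replay buffer
) -> List[dict]:
    """Critique each stored action in natural language, then have the
    policy revise it; the revised pairs become fine-tuning targets."""
    examples = []
    for t in batch:
        # The critic explains *why* an action is (sub)optimal instead of
        # emitting a scalar value -- the richer signal the abstract describes.
        critique = critic(
            f"State: {t.state}\nAction: {t.action}\nReward: {t.reward}\n"
            "Explain whether this action was good and how it could improve."
        )
        # The policy improves its action by reasoning over the critique,
        # rather than relying on random exploration.
        improved = policy(
            f"State: {t.state}\nPrevious action: {t.action}\n"
            f"Critique: {critique}\nGive an improved action."
        )
        # These pairs can be used for supervised fine-tuning of the policy,
        # avoiding high-variance policy-gradient updates.
        examples.append({"prompt": t.state, "target": improved})
    return examples
```

The point of the sketch is only the data flow: language critiques, not scalar values, drive the improvement step, and everything operates on stored transitions, which is what makes the approach off-policy.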
Related papers
- LLM-Guided Reinforcement Learning: Addressing Training Bottlenecks through Policy Modulation [7.054214377609925]
Reinforcement learning (RL) has achieved notable success in various domains.
Training effective policies for complex tasks remains challenging.
Existing approaches to mitigate training bottlenecks fall into two categories.
arXiv Detail & Related papers (2025-05-27T03:40:02Z)
- Planning without Search: Refining Frontier LLMs with Offline Goal-Conditioned RL [62.984693936073974]
Large language models (LLMs) excel in tasks like question answering and dialogue.
Complex tasks involving interaction, such as negotiation and persuasion, additionally require long-horizon reasoning and planning.
We propose a novel approach that uses goal-conditioned value functions to guide the reasoning of LLM agents.
arXiv Detail & Related papers (2025-05-23T16:51:54Z)
- Option Discovery Using LLM-guided Semantic Hierarchical Reinforcement Learning [16.654435148168172]
Large Language Models (LLMs) have shown remarkable promise in reasoning and decision-making.
We propose an LLM-guided hierarchical RL framework, termed LDSC, to enhance sample efficiency, generalization, and multi-task adaptability.
arXiv Detail & Related papers (2025-03-24T15:49:56Z)
- LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications.
Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z)
- Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks.
However, they still struggle with problems requiring multi-step decision-making and environmental feedback.
We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
- CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning [3.602902292270654]
Reinforcement learning (RL) in continuous action spaces encounters persistent challenges, such as inefficient exploration and convergence to suboptimal solutions.
We propose CAMEL, a novel framework integrating LLM-generated suboptimal policies into the RL training pipeline.
arXiv Detail & Related papers (2025-02-17T15:22:19Z)
- Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model Rollouts [10.929547354171723]
This paper introduces Knowledgeable Agents from Language Model Rollouts (KALM).
It extracts knowledge from large language models (LLMs) in the form of imaginary rollouts that can be easily learned by the agent through offline reinforcement learning methods.
It achieves a success rate of 46% in executing tasks with unseen goals, substantially surpassing the 26% success rate achieved by baseline methods.
arXiv Detail & Related papers (2024-04-14T13:19:40Z)
- Agent-Pro: Learning to Evolve via Policy-Level Reflection and Optimization [53.510942601223626]
Large Language Models (LLMs) exhibit robust problem-solving capabilities for diverse tasks.
These task solvers necessitate manually crafted prompts to inform task rules and regulate behaviors.
We propose Agent-Pro: an LLM-based Agent with Policy-level Reflection and Optimization.
arXiv Detail & Related papers (2024-02-27T15:09:20Z)
- LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z)
- LgTS: Dynamic Task Sampling using LLM-generated sub-goals for Reinforcement Learning Agents [10.936460061405157]
We propose LgTS (LLM-guided Teacher-Student learning), a novel approach that explores the planning abilities of LLMs.
Our approach does not assume access to a proprietary or a fine-tuned LLM, nor does it require pre-trained policies that achieve the sub-goals proposed by the LLM.
arXiv Detail & Related papers (2023-10-14T00:07:03Z)
- Enabling Intelligent Interactions between an Agent and an LLM: A Reinforcement Learning Approach [31.6589518077397]
Large language models (LLMs) encode a vast amount of world knowledge acquired from massive text datasets.
LLMs can assist an embodied agent in solving complex sequential decision making tasks by providing high-level instructions.
We propose When2Ask, a reinforcement learning based approach that learns when it is necessary to query LLMs for high-level instructions.
arXiv Detail & Related papers (2023-06-06T11:49:09Z)
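The When2Ask summary above is compact; as a rough illustration of the idea of learning when to query an LLM planner, here is a hypothetical Python sketch. The `gate`, `llm_plan`, and `execute` callables and the thresholded interface are assumptions for illustration, not the paper's actual architecture.

```python
# Hypothetical sketch of a When2Ask-style mediator: a small learned policy
# decides whether the agent should query the LLM planner for a new
# high-level instruction or keep executing the current one. All callables
# here are placeholders, not the paper's actual components.
from typing import Callable, Tuple

def act_with_mediator(
    observation: str,
    current_instruction: str,
    gate: Callable[[str, str], float],   # learned with RL: (obs, instr) -> P(ask)
    llm_plan: Callable[[str], str],      # LLM planner: observation -> new instruction
    execute: Callable[[str, str], str],  # low-level policy: (obs, instr) -> action
    ask_threshold: float = 0.5,
) -> Tuple[str, str]:
    """Return (action, possibly-updated instruction) for one timestep."""
    # Querying the LLM has a cost, so the mediator should learn to ask
    # only when the current instruction has gone stale.
    if gate(observation, current_instruction) > ask_threshold:
        current_instruction = llm_plan(observation)
    action = execute(observation, current_instruction)
    return action, current_instruction
```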
- Guiding Pretraining in Reinforcement Learning with Large Language Models [133.32146904055233]
We describe a method that uses background knowledge from text corpora to shape exploration.
This method, called ELLM, rewards an agent for achieving goals suggested by a language model.
By leveraging large-scale language model pretraining, ELLM guides agents toward human-meaningful and plausibly useful behaviors without requiring a human in the loop.
arXiv Detail & Related papers (2023-02-13T21:16:03Z)
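To make the ELLM entry concrete: the method rewards the agent when its behavior matches language-model-suggested goals. Below is a hypothetical sketch of such a shaped reward; `suggest_goals`, `similarity`, and the 0/1 thresholding are illustrative placeholders rather than the paper's exact formulation.

```python
# Hypothetical sketch of an ELLM-style shaped reward: the agent is rewarded
# when a text caption of its behavior matches a goal suggested by a language
# model. `suggest_goals` and `similarity` are placeholders, not the paper's
# actual implementation.
from typing import Callable, List

def ellm_reward(
    state_caption: str,                         # text description of what the agent just did
    context: str,                               # prompt context for goal suggestion
    suggest_goals: Callable[[str], List[str]],  # LM: context -> candidate goals
    similarity: Callable[[str, str], float],    # e.g., embedding cosine similarity
    threshold: float = 0.8,
) -> float:
    """Return 1.0 if the captioned behavior matches any LM-suggested goal."""
    goals = suggest_goals(context)
    best = max((similarity(state_caption, g) for g in goals), default=0.0)
    # Reward only sufficiently close matches, steering the agent toward
    # human-meaningful behaviors without a human in the loop.
    return 1.0 if best >= threshold else 0.0
```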