Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning
- URL: http://arxiv.org/abs/2505.19761v1
- Date: Mon, 26 May 2025 09:43:40 GMT
- Title: Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning
- Authors: Zican Hu, Wei Liu, Xiaoye Qu, Xiangyu Yue, Chunlin Chen, Zhi Wang, Yu Cheng
- Abstract summary: Large language models (LLMs) struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment. We propose an innovative framework that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy.
- Score: 32.260964481673085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework **GLIDER** (**G**rounding **L**anguage Models as Eff**I**cient **D**ecision-Making Agents via Offline Hi**E**rarchical **R**einforcement Learning) that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.
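To make the divide-and-conquer hierarchy concrete, the following is a minimal Python sketch of the control flow the abstract describes: a high-level policy turns the task into an abstract step-by-step plan, and a low-level controller executes one sub-task at a time. The callables `high_level_llm`, `low_level_llm`, `observe`, and `act` are hypothetical stand-ins for illustration, not GLIDER's actual interfaces or training code.

```python
from typing import Callable, List

def run_episode(task: str,
                observe: Callable[[], str],
                act: Callable[[str], bool],
                high_level_llm: Callable[[str, str], List[str]],
                low_level_llm: Callable[[str, str], str],
                max_steps_per_subtask: int = 20) -> None:
    """Roll out the hierarchy: plan sub-tasks once, then execute them one by one."""
    plan = high_level_llm(task, observe())        # abstract step-by-step plan, e.g. ["find the thermometer", ...]
    for subtask in plan:                          # divide the long-horizon task ...
        for _ in range(max_steps_per_subtask):    # ... and conquer it with short, focused rollouts
            action = low_level_llm(subtask, observe())  # primitive action conditioned on the current sub-task
            subtask_done = act(action)                  # environment step; True when the sub-task is judged complete
            if subtask_done:
                break
```

The inner loop is where the temporal abstraction mentioned in the abstract comes from: exploration and credit assignment operate over a short horizon per sub-task rather than over the full episode.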
Related papers
- Strict Subgoal Execution: Reliable Long-Horizon Planning in Hierarchical Reinforcement Learning [5.274804664403783]
Strict Subgoal Execution (SSE) is a graph-based hierarchical RL framework that enforces single-step subgoal reachability. We show that SSE consistently outperforms existing goal-conditioned RL and hierarchical RL approaches in both efficiency and success rate.
arXiv Detail & Related papers (2025-06-26T06:35:42Z)
- ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [53.817538122688944]
We introduce Reinforced Meta-thinking Agents (ReMA) to elicit meta-thinking behaviors from the reasoning of large language models (LLMs). ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed execution. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
arXiv Detail & Related papers (2025-03-12T16:05:31Z)
- Exploiting Hybrid Policy in Reinforcement Learning for Interpretable Temporal Logic Manipulation [12.243491328213217]
Reinforcement Learning (RL) based methods have been increasingly explored for robot learning. We propose a Temporal-Logic-guided Hybrid policy framework (HyTL) that leverages three-level decision layers to improve the agent's performance. We evaluate HyTL on four challenging manipulation tasks, demonstrating its effectiveness and interpretability.
arXiv Detail & Related papers (2024-12-29T03:34:53Z)
- SEAL: SEmantic-Augmented Imitation Learning via Language Model [16.345410506033915]
SEAL is a novel framework for specifying the sub-goal space and pre-labeling states with semantically meaningful sub-goal representations.
SEAL employs a dual-encoder structure, combining supervised LLM-guided sub-goal learning with unsupervised vector quantization (a generic sketch of the quantization step appears after this list).
Our experiments demonstrate that SEAL outperforms state-of-the-art HIL methods and LLM-based planning approaches.
arXiv Detail & Related papers (2024-10-03T05:53:10Z)
- Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs [12.572869123617783]
Reinforcement learning (RL) faces challenges in evaluating policy trajectories within intricate game tasks.
Preference-based RL (PbRL) presents a pioneering framework that capitalizes on human preferences as pivotal reward signals.
We propose an LLM-enabled automatic preference generation framework named LLM4PG.
arXiv Detail & Related papers (2024-06-28T04:21:24Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context of up to millions of tokens, designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM)-empowered agents are able to solve decision-making problems in the physical world.
Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting.
We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z)
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level (a generic token-level objective in this spirit is sketched after this list).
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
- Hierarchical Continual Reinforcement Learning via Large Language Model [15.837883929274758]
Hi-Core is designed to facilitate the transfer of high-level knowledge.
It orchestrates a two-layer structure: high-level policy formulation by a large language model (LLM).
Hi-Core has demonstrated its effectiveness in handling diverse continual RL (CRL) tasks, outperforming popular baselines.
arXiv Detail & Related papers (2024-01-25T03:06:51Z)
- Semantically Aligned Task Decomposition in Multi-Agent Reinforcement Learning [56.26889258704261]
We propose a novel "disentangled" decision-making method, Semantically Aligned task decomposition in MARL (SAMA).
SAMA prompts pretrained language models with chain-of-thought to suggest potential goals, provide suitable goal decomposition and subgoal allocation, and perform self-reflection-based replanning.
SAMA demonstrates considerable advantages in sample efficiency compared to state-of-the-art ASG methods.
arXiv Detail & Related papers (2023-05-18T10:37:54Z)
- Efficient Learning of High Level Plans from Play [57.29562823883257]
We present Efficient Learning of High-Level Plans from Play (ELF-P), a framework for robotic learning that bridges motion planning and deep RL.
We demonstrate that ELF-P has significantly better sample efficiency than relevant baselines over multiple realistic manipulation tasks.
arXiv Detail & Related papers (2023-03-16T20:09:47Z)
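The SEAL entry above combines LLM-guided sub-goal learning with unsupervised vector quantization. The snippet below is a generic sketch of the quantization step only, assuming a learned codebook of sub-goal prototypes; the names, shapes, and sizes are illustrative and not SEAL's implementation.

```python
import torch

def quantize_state(state_embedding: torch.Tensor, codebook: torch.Tensor):
    """Snap a state embedding [D] to the nearest sub-goal prototype in a codebook [K, D]."""
    distances = torch.cdist(state_embedding.unsqueeze(0), codebook)  # [1, K] L2 distances to all prototypes
    code_id = distances.argmin(dim=-1)                               # index of the closest prototype
    return code_id.item(), codebook[code_id].squeeze(0)              # discrete sub-goal id and its vector

codebook = torch.randn(16, 128)  # 16 hypothetical sub-goal codes of dimension 128
sub_goal_id, sub_goal_vec = quantize_state(torch.randn(128), codebook)
```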
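The ETPO entry above describes entropy-augmented RL applied at the token level. Below is a generic sketch of an entropy-regularized, token-level policy-gradient loss in that spirit; the tensor names, shapes, and the per-token advantage estimate are assumptions for illustration, not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def token_level_loss(logits: torch.Tensor,
                     taken_tokens: torch.Tensor,
                     token_advantages: torch.Tensor,
                     alpha: float = 0.01) -> torch.Tensor:
    """logits: [T, V] scores for T generated tokens over a vocabulary of size V;
    taken_tokens: [T] ids of the tokens actually sampled;
    token_advantages: [T] per-token credit from some value or return estimate."""
    log_probs = F.log_softmax(logits, dim=-1)                               # [T, V]
    taken_logp = log_probs.gather(1, taken_tokens.unsqueeze(1)).squeeze(1)  # [T] log-prob of chosen tokens
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)                    # [T] per-token policy entropy
    # Maximize advantage-weighted log-likelihood plus an entropy bonus (minimize the negative).
    return -(token_advantages * taken_logp + alpha * entropy).mean()

# Toy usage with random tensors standing in for a policy model's outputs:
T, V = 8, 100
logits = torch.randn(T, V, requires_grad=True)
loss = token_level_loss(logits, torch.randint(0, V, (T,)), torch.randn(T))
loss.backward()
```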