Language-based Trial and Error Falls Behind in the Era of Experience
- URL: http://arxiv.org/abs/2601.21754v2
- Date: Sat, 31 Jan 2026 11:31:41 GMT
- Title: Language-based Trial and Error Falls Behind in the Era of Experience
- Authors: Haoyu Wang, Guozheng Ma, Shugang Cui, Yilun Kong, Haotian Luo, Li Shen, Mengya Gao, Yichao Wu, Xiaogang Wang, Dacheng Tao
- Abstract summary: Large Language Models (LLMs) excel in language-based agentic tasks, but their applicability to unseen, nonlinguistic environments remains limited. In this work, we demonstrate that the primary bottleneck is the prohibitive cost of exploration. We propose SCOUT, a novel framework that decouples exploration from exploitation.
- Score: 50.503828360874536
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Large Language Models (LLMs) excel in language-based agentic tasks, their applicability to unseen, nonlinguistic environments (e.g., symbolic or spatial tasks) remains limited. Previous work attributes this performance gap to the mismatch between the pretraining distribution and the testing distribution. In this work, we demonstrate that the primary bottleneck is the prohibitive cost of exploration: mastering these tasks requires extensive trial and error, which is computationally unsustainable for parameter-heavy LLMs operating in a high-dimensional semantic space. To address this, we propose SCOUT (Sub-Scale Collaboration On Unseen Tasks), a novel framework that decouples exploration from exploitation. We employ lightweight "scouts" (e.g., small MLPs) to probe environmental dynamics at a speed and scale far exceeding those of LLMs. The collected trajectories are used to bootstrap the LLM via Supervised Fine-Tuning (SFT), followed by multi-turn Reinforcement Learning (RL) to activate its latent world knowledge. Empirically, SCOUT enables a Qwen2.5-3B-Instruct model to achieve an average score of 0.86, significantly outperforming proprietary models, including Gemini-2.5-Pro (0.60), while reducing GPU-hour consumption by about 60%.
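The exploration-then-bootstrap pipeline described in the abstract can be sketched in miniature. Everything below is illustrative: the 1-D chain environment, the random scout policy (standing in for the paper's small MLPs), and the success-only filtering are assumptions, not the paper's actual environments, architectures, or data format. The point is only the division of labor: a cheap scout racks up trial-and-error episodes, and the surviving trajectories become (state, action) demonstrations for SFT.

```python
import random

GOAL = 5  # hypothetical goal position on a 1-D chain environment


def rollout(policy, max_steps=20):
    """Run one episode; return the trajectory and whether the goal was reached."""
    state, traj = 0, []
    for _ in range(max_steps):
        action = policy(state)       # each action moves the agent by -1 or +1
        traj.append((state, action))
        state += action
        if state == GOAL:
            return traj, True
    return traj, False


def scout_policy(state):
    """Cheap random explorer standing in for a small learned MLP scout."""
    return random.choice([-1, 1])


def collect_sft_data(n_episodes=2000, seed=0):
    """Keep only trajectories that solved the task, as SFT demonstrations.

    Detour steps inside successful episodes are kept too; the paper's
    exact filtering criterion is not specified here.
    """
    random.seed(seed)
    data = []
    for _ in range(n_episodes):
        traj, solved = rollout(scout_policy)
        if solved:
            data.extend(traj)
    return data


sft_data = collect_sft_data()
print(f"collected {len(sft_data)} (state, action) pairs for SFT")
```

In the paper's setting, the resulting pairs would be rendered into the LLM's input format for supervised fine-tuning, with multi-turn RL applied afterwards; this sketch stops at data collection.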
Related papers
- How to Train Your LLM Web Agent: A Statistical Diagnosis [96.86317871461834]
We present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT) and on-policy reinforcement learning. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++.
arXiv Detail & Related papers (2025-07-05T17:12:33Z)
- Catastrophic Forgetting in LLMs: A Comparative Analysis Across Language Tasks [0.0]
Large Language Models (LLMs) have significantly advanced Natural Language Processing (NLP). This study evaluates the continual fine-tuning of various open-source LLMs on key NLU tasks. Our results indicate that models such as Phi-3.5-mini exhibit minimal forgetting while maintaining strong learning capabilities.
arXiv Detail & Related papers (2025-04-01T23:06:55Z)
- Controlling Large Language Model with Latent Actions [27.0292050543406]
Adapting Large Language Models to downstream tasks using Reinforcement Learning has proven to be an effective approach. This paper studies learning a compact latent action space to enhance the controllability and exploration of RL for LLMs. We propose Controlling Large Language Models with Latent Actions (CoLA), a framework that integrates a latent action space into pre-trained LLMs.
arXiv Detail & Related papers (2025-03-27T11:25:22Z)
- Efficient Multitask Learning in Small Language Models Through Upside-Down Reinforcement Learning [8.995427413172148]
Small language models (SLMs) can achieve competitive performance in multitask prompt generation tasks. We train an SLM that achieves relevance scores within 5% of state-of-the-art models, including Llama-3, Qwen2, and Mistral, despite being up to 80 times smaller.
arXiv Detail & Related papers (2025-02-14T01:39:45Z)
- Reward Guidance for Reinforcement Learning Tasks Based on Large Language Models: The LMGT Framework [1.5802986215292307]
Language Model Guided reward Tuning (LMGT) is a novel, sample-efficient framework for Reinforcement Learning. We show that LMGT adeptly balances exploration and exploitation, thereby guiding the agent's exploratory behavior. Our results suggest that LMGT can substantially reduce the computational resources required during the RL training phase.
arXiv Detail & Related papers (2024-09-07T07:40:43Z)
- SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
- From Words to Actions: Unveiling the Theoretical Underpinnings of LLM-Driven Autonomous Systems [59.40480894948944]
Large language model (LLM) empowered agents are able to solve decision-making problems in the physical world.
Under this model, the LLM Planner navigates a partially observable Markov decision process (POMDP) by iteratively generating language-based subgoals via prompting.
We prove that the pretrained LLM Planner effectively performs Bayesian aggregated imitation learning (BAIL) through in-context learning.
arXiv Detail & Related papers (2024-05-30T09:42:54Z)
- Can large language models explore in-context? [87.49311128190143]
We deploy Large Language Models as agents in simple multi-armed bandit environments.
We find that the models do not robustly engage in exploration without substantial interventions.
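The kind of multi-armed bandit probe described above can be sketched as follows. The arm means, horizon, and epsilon-greedy baseline are illustrative assumptions, not the paper's protocol; the sketch only contrasts an agent that keeps exploring with a purely greedy one that can lock onto a suboptimal arm.

```python
import random


def run_bandit(epsilon, arm_means=(0.3, 0.5, 0.8), steps=5000, seed=1):
    """Epsilon-greedy on a Bernoulli bandit; returns average reward per step."""
    random.seed(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms     # pulls per arm
    values = [0.0] * n_arms   # running mean reward estimate per arm
    total = 0.0
    for _ in range(steps):
        if random.random() < epsilon:
            arm = random.randrange(n_arms)                       # explore
        else:
            arm = max(range(n_arms), key=lambda a: values[a])    # exploit
        reward = 1.0 if random.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]      # update mean
        total += reward
    return total / steps


print(run_bandit(epsilon=0.1))  # keeps exploring, finds the best arm
print(run_bandit(epsilon=0.0))  # purely greedy, can get stuck on a worse arm
```

An LLM agent slotted into the `arm`-selection step in place of the epsilon-greedy rule makes the same comparison possible for in-context exploration.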
arXiv Detail & Related papers (2024-03-22T17:50:43Z)
- TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.