Don't Just Fine-tune the Agent, Tune the Environment
- URL: http://arxiv.org/abs/2510.10197v1
- Date: Sat, 11 Oct 2025 12:35:15 GMT
- Title: Don't Just Fine-tune the Agent, Tune the Environment
- Authors: Siyuan Lu, Zechuan Wang, Hongxuan Zhang, Qintong Wu, Leilei Gan, Chenyi Zhuang, Jinjie Gu, Tao Lin
- Abstract summary: Supervised fine-tuning on synthetic data leads to overfitting. Standard reinforcement learning struggles with a critical cold-start problem and training instability. Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration.
- Score: 25.7349297100143
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Model (LLM) agents show great promise for complex, multi-turn tool-use tasks, but their development is often hampered by the extreme scarcity of high-quality training data. Supervised fine-tuning (SFT) on synthetic data leads to overfitting, whereas standard reinforcement learning (RL) struggles with a critical cold-start problem and training instability. To address these challenges, we introduce $\textbf{Environment Tuning}$, a novel training paradigm that enables agents to learn complex behaviors directly from problem instances without relying on pre-collected expert trajectories. $\textbf{Environment Tuning}$ orchestrates this learning process through a structured curriculum, actionable environment augmentation that provides corrective feedback, and fine-grained progress rewards to ensure stable and efficient exploration. Using only 400 problem instances from the Berkeley Function-Calling Leaderboard (BFCL) benchmark, our method not only achieves competitive in-distribution performance against strong baselines but also demonstrates superior out-of-distribution generalization, overcoming the performance collapse common to SFT-based approaches. Our work presents a paradigm shift from supervised fine-tuning on static trajectories to dynamic, environment-based exploration, paving the way for training more robust and data-efficient agents.
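The abstract names three concrete mechanisms: a structured curriculum, environment augmentation that returns corrective feedback, and fine-grained progress rewards. As a rough illustration only, here is a minimal Python sketch of how such an environment-centric loop could fit together; the class, the `required_calls` sub-goal list, and the loop structure are hypothetical stand-ins, not the paper's code.

```python
# Hypothetical sketch of an Environment-Tuning-style loop: the environment,
# not a dataset of expert trajectories, supplies the learning signal.

class AugmentedToolEnv:
    """Toy multi-turn tool-use environment with corrective feedback."""

    def __init__(self, problem: str, required_calls: list[str]):
        self.problem = problem
        self.required_calls = required_calls  # ordered sub-goals
        self.progress = 0

    def step(self, tool_call: str):
        if tool_call == self.required_calls[self.progress]:
            self.progress += 1
            reward = 1.0 / len(self.required_calls)  # fine-grained progress reward
            feedback = f"ok: sub-goal {self.progress} completed"
        else:
            reward = 0.0
            # actionable augmentation: explain *why* the call failed
            feedback = f"error: expected a call like {self.required_calls[self.progress]!r}"
        done = self.progress == len(self.required_calls)
        return feedback, reward, done


def train(policy, curriculum):
    """`curriculum` is a list of stages ordered from easy to hard instances."""
    for stage in curriculum:  # structured curriculum
        for problem, calls in stage:
            env = AugmentedToolEnv(problem, calls)
            done, transcript = False, []
            while not done:
                action = policy(problem, transcript)
                feedback, reward, done = env.step(action)
                transcript.append((action, feedback, reward))
            # hand `transcript` to any policy-gradient update (e.g., GRPO) here
```

The point of the sketch is the shape of the signal: reward arrives per sub-goal rather than only at episode end, and failures return a corrective message the policy can condition on in the next turn.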
Related papers
- ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas [13.919124676472022]
ASTRA is an end-to-end framework for training tool-augmented language model agents. ASTRA integrates scalable data synthesis and verifiable reinforcement learning. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance.
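The summary's "verifiable reinforcement learning" suggests rewards computed by programmatic checks rather than a learned reward model. A minimal sketch under that assumption (the function name and dict layout are illustrative, not ASTRA's API):

```python
def verifiable_reward(tool_log: list[dict], expected_state: dict) -> float:
    """Binary reward from a programmatic check: did the agent's tool calls
    leave the environment in the expected final state?"""
    final_state: dict = {}
    for call in tool_log:  # replay the recorded effects of each tool call
        final_state.update(call.get("effects", {}))
    return 1.0 if final_state == expected_state else 0.0
```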
arXiv Detail & Related papers (2026-01-29T11:22:23Z)
- SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning [24.80806018678682]
Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability. We propose a framework that sustains effective learning signals through adaptive environment design.
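"Adaptive environment design" here means keeping task difficulty matched to model capability. One standard way to realize that (my illustration; not necessarily SCALER's rule) is a success-rate controller:

```python
def adapt_difficulty(difficulty: float, success_rate: float,
                     target: float = 0.5, step: float = 0.1) -> float:
    """Nudge difficulty so the policy succeeds about `target` of the time,
    keeping the RL signal from saturating at 0% or 100% success."""
    if success_rate > target + 0.1:    # too easy -> generate harder tasks
        difficulty += step
    elif success_rate < target - 0.1:  # too hard -> generate easier tasks
        difficulty -= step
    return max(0.0, min(1.0, difficulty))
```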
arXiv Detail & Related papers (2026-01-08T10:42:04Z)
- Human-in-the-loop Online Rejection Sampling for Robotic Manipulation [55.99788088622936]
Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning. Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training.
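The filtering step itself is simple enough to state in code; this is a generic sketch of online rejection sampling, not the paper's implementation:

```python
def rejection_filter(batch: list[dict], threshold: float = 0.0) -> list[dict]:
    """Keep only non-negatively rewarded samples so value estimation is
    fit on behavior worth reinforcing (online rejection sampling)."""
    return [sample for sample in batch if sample["reward"] >= threshold]
```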
arXiv Detail & Related papers (2025-10-30T11:53:08Z)
- BOTS: A Unified Framework for Bayesian Online Task Selection in LLM Reinforcement Finetuning [82.925106913459]
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning.
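Bayesian online task selection can be illustrated with a Beta-Bernoulli Thompson sampler over candidate tasks; this is a generic stand-in for whatever posterior BOTS actually maintains:

```python
import random

class ThompsonTaskSelector:
    """Beta(wins, losses) posterior per task; a generic stand-in for BOTS."""

    def __init__(self, n_tasks: int):
        self.wins = [1.0] * n_tasks    # Beta(1, 1) uniform prior
        self.losses = [1.0] * n_tasks

    def select(self) -> int:
        """Thompson sampling: draw a success rate per task, then pick the
        task whose draw is nearest 0.5 (most informative to train on)."""
        draws = [random.betavariate(w, l)
                 for w, l in zip(self.wins, self.losses)]
        return min(range(len(draws)), key=lambda i: abs(draws[i] - 0.5))

    def update(self, task: int, solved: bool) -> None:
        if solved:
            self.wins[task] += 1.0
        else:
            self.losses[task] += 1.0
```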
arXiv Detail & Related papers (2025-10-30T11:15:23Z)
- Training-Free Group Relative Policy Optimization [34.73950078782136]
We argue that Large Language Model (LLM) agents can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior. We propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance.
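"Experiential knowledge as a token prior" implies updating the prompt rather than the weights. A rough, heavily simplified sketch of that idea (the lesson-distillation prompt and all names are mine; `llm` is any text-in/text-out callable):

```python
def training_free_update(llm, experience_library: list[str], task: str,
                         reward_fn, group_size: int = 4) -> None:
    """Group-relative learning without gradients: roll out a group of
    answers, compare rewards within the group, and store a textual lesson
    in the context library instead of taking a weight update."""
    prior = "\n".join(experience_library)  # experiences act as a token prior
    rollouts = [llm(prior + "\n" + task) for _ in range(group_size)]
    rewards = [reward_fn(task, r) for r in rollouts]
    best, worst = max(rewards), min(rewards)
    if best > worst:  # only distill from groups with a reward contrast
        lesson = llm(
            f"Task: {task}\nBetter answer: {rollouts[rewards.index(best)]}\n"
            f"Worse answer: {rollouts[rewards.index(worst)]}\n"
            "State one reusable lesson in one sentence:"
        )
        experience_library.append(lesson)
```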
arXiv Detail & Related papers (2025-10-09T13:18:17Z)
- On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting [91.38734024438357]
Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) are two prominent post-training paradigms for refining the capabilities and aligning the behavior of Large Language Models (LLMs). Existing approaches that integrate SFT and RL often face the risk of disrupting established response patterns and inducing overfitting to expert data. We propose CHORD, a framework for Controllable Harmonization of On- and Off-Policy Reinforcement Learning via Dynamic Weighting.
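At its simplest, harmonizing an off-policy SFT loss with an on-policy RL loss is a weighted sum whose mixing coefficient changes over training; the annealing schedule below is my simplification, not CHORD's actual controller:

```python
def blended_loss(rl_loss: float, sft_loss: float,
                 step: int, total_steps: int) -> float:
    """Blend expert imitation into on-policy RL, decaying the SFT weight so
    expert data guides early training without inducing late overfitting."""
    mu = max(0.0, 1.0 - step / total_steps)  # SFT weight: 1 -> 0 over training
    return (1.0 - mu) * rl_loss + mu * sft_loss
```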
arXiv Detail & Related papers (2025-08-15T11:20:03Z)
- Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization [15.212942734663514]
CrossQ has demonstrated state-of-the-art sample efficiency with a low update-to-data (UTD) ratio of 1. We identify challenges in the training dynamics, which are emphasized by higher UTD ratios. Our proposed approach reliably scales with increasing UTD ratios, achieving competitive performance across 25 challenging continuous control tasks.
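The update-to-data (UTD) ratio is the number of gradient updates taken per environment transition; a generic off-policy loop makes the knob explicit (the agent/buffer interfaces are illustrative):

```python
def off_policy_training(env, agent, buffer, total_steps: int,
                        utd_ratio: int = 1) -> None:
    """Generic off-policy RL loop: `utd_ratio` gradient updates per env step.
    Higher ratios squeeze more learning out of each sample but stress
    training stability, which is where normalization (as in CrossQ) helps."""
    obs = env.reset()
    for _ in range(total_steps):
        action = agent.act(obs)
        next_obs, reward, done = env.step(action)
        buffer.add(obs, action, reward, next_obs, done)
        for _ in range(utd_ratio):  # the UTD knob
            agent.update(buffer.sample())
        obs = env.reset() if done else next_obs
```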
arXiv Detail & Related papers (2025-02-11T12:55:32Z)
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for lightweight continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
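Instance-reweighted DRO objectives typically upweight high-loss (hard) samples, and with a KL-regularized inner problem the optimal weights take a closed softmax form over per-sample losses. The sketch below shows that standard form, which may differ in detail from the paper's:

```python
import math

def dro_instance_weights(losses: list[float],
                         temperature: float = 1.0) -> list[float]:
    """Closed-form DRO weights: softmax over per-sample losses, so harder
    samples count more in the continual-training objective."""
    m = max(losses)  # subtract max for numerical stability
    exps = [math.exp((l - m) / temperature) for l in losses]
    z = sum(exps)
    return [e / z for e in exps]

def reweighted_loss(losses: list[float], temperature: float = 1.0) -> float:
    weights = dro_instance_weights(losses, temperature)
    return sum(w * l for w, l in zip(weights, losses))
```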
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
- End-to-End Meta-Bayesian Optimisation with Transformer Neural Processes [52.818579746354665]
This paper proposes the first end-to-end differentiable meta-BO framework that generalises neural processes to learn acquisition functions via transformer architectures.
We enable this end-to-end framework with reinforcement learning (RL) to tackle the lack of labelled acquisition data.
arXiv Detail & Related papers (2023-05-25T10:58:46Z)
- Stabilizing and Improving Federated Learning with Non-IID Data and Client Dropout [15.569507252445144]
Data heterogeneity induced by label distribution skew has been shown to be a significant obstacle that limits model performance in federated learning.
We propose a simple yet effective framework by introducing a prior-calibrated softmax function for computing the cross-entropy loss.
Improved model performance over existing baselines is demonstrated in the presence of non-IID data and client dropout.
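A prior-calibrated softmax usually means shifting the logits by the log of the (client-local) label prior before taking cross-entropy, i.e., logit adjustment; a sketch under that assumption:

```python
import math

def prior_calibrated_ce(logits: list[float], label: int,
                        class_prior: list[float]) -> float:
    """Cross-entropy on logits shifted by log class priors, so a client with
    a skewed local label distribution does not drag the model toward its
    over-represented classes."""
    adjusted = [z + math.log(max(p, 1e-12))
                for z, p in zip(logits, class_prior)]
    m = max(adjusted)  # stable log-sum-exp
    log_z = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_z - adjusted[label]  # = -log softmax(adjusted)[label]
```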
arXiv Detail & Related papers (2023-03-11T05:17:59Z)
- Understanding the Difficulty of Training Transformers [120.99980924577787]
We show that unbalanced gradients are not the root cause of the instability of training.
We propose Admin to stabilize early-stage training and unleash the model's full potential in the late stage.
arXiv Detail & Related papers (2020-04-17T13:59:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.