Related papers: FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

URL: http://arxiv.org/abs/2603.01712v1
Date: Mon, 02 Mar 2026 10:37:11 GMT
Title: FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents
Authors: Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, Jiang Bian
Abstract summary: FT-Dojo is an interactive environment comprising 13 tasks across 5 domains. We develop FT-Agent, an autonomous system that mirrors human experts by leveraging evaluation-driven feedback.
Score: 25.60249598832918
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Fine-tuning large language models for vertical domains remains a labor-intensive and expensive process, requiring domain experts to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning, no prior work has tackled end-to-end LLM fine-tuning with agents. Can LLM-based agents automate this complete process? We frame this as a substantially open problem: agents must navigate an open-ended search space spanning data curation from diverse data sources, processing with complex tools, building a training pipeline, and iteratively refining their approach based on evaluation outcomes in rapidly growing logs--an overall scenario far more intricate than existing benchmarks. To study this question, we introduce FT-Dojo, an interactive environment comprising 13 tasks across 5 domains. We further develop FT-Agent, an autonomous system that mirrors human experts by leveraging evaluation-driven feedback to iteratively diagnose failures and refine fine-tuning strategies. Experiments on FT-Dojo demonstrate that purpose-built fine-tuning agents significantly outperform general-purpose alternatives, with FT-Agent achieving the best performance on 10 out of 13 tasks across all five domains. Ablations show that the approach generalizes effectively to 3B models, with additional insights on data scaling trade-offs and backbone sensitivity. Case analyses reveal that agents can recover from failures through cumulative learning from historical experience, while also exposing fundamental limitations in causal reasoning--highlighting both the promise and current boundaries of autonomous LLM fine-tuning.

Related papers

Experience-Driven Multi-Agent Systems Are Training-free Context-aware Earth Observers [27.817039954088315]
We introduce GeoEvolver, a self-evolving multi-agent system for learning tool-level expertise. We show that GeoEvolver consistently improves end-to-end task success, with an average gain of 12% across multiple backbones.
arXiv Detail & Related papers (2026-01-30T15:11:07Z)
ReX-MLE: The Autonomous Agent Benchmark for Medical Imaging Challenges [5.886200278450183]
We introduce ReX-MLE, a benchmark of 20 challenges derived from high-impact medical imaging competitions. Unlike prior benchmarks, ReX-MLE evaluates full end-to-end workflows, requiring agents to independently manage data preprocessing, model training, and submission. We observe a severe performance gap: most submissions rank in the 0th percentile compared to human experts.
arXiv Detail & Related papers (2025-12-19T17:44:40Z)
A Survey of Data Agents: Emerging Paradigm or Overstated Hype? [66.1526688475023]
"Data agent" currently suffers from terminological ambiguity and inconsistent adoption. This survey introduces the first systematic hierarchical taxonomy for data agents. We conclude with a forward-looking roadmap, envisioning the advent of proactive, generative data agents.
arXiv Detail & Related papers (2025-10-27T17:54:07Z)
Agent Fine-tuning through Distillation for Domain-specific LLMs in Microdomains [6.323778761045108]
Agentic large language models (LLMs) have become prominent for autonomously interacting with external environments. This paper explores agent fine-tuning for domain adaptation within Hitachi's JP1 microdomain for specialized IT operations.
arXiv Detail & Related papers (2025-10-01T04:04:53Z)
Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks. Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions. We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z)
LaMDAgent: An Autonomous Framework for Post-Training Pipeline Optimization via LLM Agents [3.6117068575553595]
We introduce LaMDAgent, a framework that autonomously constructs and optimizes full post-training pipelines. LaMDAgent improves tool-use accuracy by 9.0 points while preserving instruction-following capabilities. It uncovers effective post-training strategies that are often overlooked by conventional human-driven exploration.
arXiv Detail & Related papers (2025-05-28T04:30:51Z)
IDA-Bench: Evaluating LLMs on Interactive Guided Data Analysis [60.32962597618861]
IDA-Bench is a novel benchmark evaluating large language models in multi-round interactive scenarios. Agent performance is judged by comparing its final numerical output to the human-derived baseline. Even state-of-the-art coding agents (like Claude-3.7-thinking) succeed on 50% of the tasks, highlighting limitations not evident in single-turn tests.
arXiv Detail & Related papers (2025-05-23T09:37:52Z)
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
MLE-Dojo is a Gym-style framework for systematically training, evaluating, and improving autonomous large language model (LLM) agents. MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z)
TAMO: Fine-Grained Root Cause Analysis via Tool-Assisted LLM Agent with Multi-Modality Observation Data in Cloud-Native Systems [42.50432360919637]
Large language model (LLM)-driven root cause analysis (RCA) in cloud-native systems has become a key topic of modern software operations and maintenance. Existing LLM-based approaches face three key challenges: multi-modality input constraint, context window limitation, and dynamic dependence graph. We propose a tool-assisted LLM agent with multi-modality observation data for fine-grained RCA, namely TAMO.
arXiv Detail & Related papers (2025-04-29T06:50:48Z)
Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [44.34340798542]
Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning. Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities. We propose a framework that combines guided Monte Carlo Tree Search (MCTS) with a self-critique mechanism and iterative fine-tuning on agent interactions.
arXiv Detail & Related papers (2024-08-13T20:52:13Z)
DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning [56.887047551101574]
We present DS-Agent, a novel framework that harnesses large language model (LLM) agents and case-based reasoning (CBR). In the development stage, DS-Agent follows the CBR framework to structure an automatic iteration pipeline, which can flexibly capitalize on the expert knowledge from Kaggle. In the deployment stage, DS-Agent implements a low-resource deployment stage with a simplified CBR paradigm, significantly reducing the demand on foundational capabilities of LLMs.
arXiv Detail & Related papers (2024-02-27T12:26:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.