Related papers: Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training

URL: http://arxiv.org/abs/2502.06589v1
Date: Mon, 10 Feb 2025 15:54:34 GMT
Title: Hephaestus: Improving Fundamental Agent Capabilities of Large Language Models through Continual Pre-Training
Authors: Yuchen Zhuang, Jingfeng Yang, Haoming Jiang, Xin Liu, Kewei Cheng, Sanket Lokegaonkar, Yifan Gao, Qing Ping, Tianyi Liu, Binxuan Huang, Zheng Li, Zhengyang Wang, Pei Chen, Ruijie Wang, Rongzhi Zhang, Nasser Zalmout, Priyanka Nigam, Bing Yin, Chao Zhang
Abstract summary: Hephaestus-Forge is a large-scale pre-training corpus designed to enhance the capabilities of LLM agents in API function calling, intrinsic reasoning and planning. Hephaestus-Forge comprises 103B agent-specific data encompassing 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function calling trajectories. By continual pre-training on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks.
Score: 69.13064064991552
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Due to the scarcity of agent-oriented pre-training data, LLM-based autonomous agents typically rely on complex prompting or extensive fine-tuning, which often fails to introduce new capabilities while preserving strong generalizability. We introduce Hephaestus-Forge, the first large-scale pre-training corpus designed to enhance the fundamental capabilities of LLM agents in API function calling, intrinsic reasoning and planning, and adapting to environmental feedback. Hephaestus-Forge comprises 103B agent-specific data encompassing 76,537 APIs, including both tool documentation to introduce knowledge of API functions and function calling trajectories to strengthen intrinsic reasoning. To explore effective training protocols, we investigate scaling laws to identify the optimal recipe in data mixing ratios. By continual pre-training on Hephaestus-Forge, Hephaestus outperforms small- to medium-scale open-source LLMs and rivals commercial LLMs on three agent benchmarks, demonstrating the effectiveness of our pre-training corpus in enhancing fundamental agentic capabilities and generalization of LLMs to new tasks or environments.

Related papers

Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation [27.59197535041953]
Large Language Models (LLMs) represent a promising frontier for recommender systems. This paper introduces a novel, layered framework for generating high-quality synthetic data. We empirically demonstrate, for the first time, robust power-law scaling for an LLM that is continually pre-trained on our high-quality, recommendation-specific data.
arXiv Detail & Related papers (2026-02-07T01:15:15Z)
Data-Centric Interpretability for LLM-based Multi-Agent Reinforcement Learning [39.84288631342219]
We analyze large-scale reinforcement learning training runs from the sophisticated environment of Full-Press Diplomacy. We introduce Meta-Autointerp, a method for grouping SAE features into interpretable hypotheses about training dynamics. We find that even subjectively interesting and seemingly helpful SAE features may be worse than useless to humans.
arXiv Detail & Related papers (2026-02-05T01:21:22Z)
From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory [48.22750809620306]
Large Language Models (LLMs) based agents have demonstrated remarkable potential in autonomous task-solving. In this paper, we introduce a novel agent-centric, trainable, multi-layered graph memory framework. We show how context memory enhances the ability of LLMs to utilize information.
arXiv Detail & Related papers (2025-11-11T03:36:33Z)
APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training [48.20667772172573]
APTBench is a framework that converts real-world agent tasks and successful trajectories into multiple-choice or text completion questions. It focuses on core agentic abilities, e.g., planning and action, and covers key agent scenarios, software engineering and deep research. Compared to existing general-purpose benchmarks, APTBench offers a more predictive signal of a model's downstream performance as an agent.
arXiv Detail & Related papers (2025-10-28T13:11:22Z)
Exploring Superior Function Calls via Reinforcement Learning [9.278264697070306]
We present a novel reinforcement learning framework designed to enhance group relative policy optimization. We address three critical challenges in function calling: insufficient exploration during policy learning, lack of structured reasoning in chain-of-thought generation, and inadequate verification of parameter extraction. Our framework achieves state-of-the-art performance among open-source models with 86.02% overall accuracy, outperforming standard GRPO by up to 6% on complex multi-function scenarios.
arXiv Detail & Related papers (2025-08-07T07:51:38Z)
InfoSteer: Steering Information Utility in Language Model Post-Training [7.756342860929851]
We present a lightweight method that encourages parametric information utilization in language models (LMs) during post-training. We find this simple guidance delivers consistent performance improvements across diverse model families, including Qwen, Gemma and Llama. Our work underscores that vanilla post-training does not fully leverage pre-training potential, and steering LMs in latent representation space offers a promising approach.
arXiv Detail & Related papers (2025-07-07T16:13:21Z)
LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.10969986056]
Large Language Models (LLMs) have transformed the natural language processing landscape and brought to life diverse applications. Post-training methods enable LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
arXiv Detail & Related papers (2025-02-28T18:59:54Z)
MAGELLAN: Metacognitive predictions of learning progress guide autotelic LLM agents in large goal spaces [30.231701007708146]
Open-ended learning agents must efficiently prioritize goals in vast possibility spaces. Traditional approaches either require extensive sampling or rely on brittle expert-defined goal groupings. We introduce MAGELLAN, a metacognitive framework that lets LLM agents learn to predict their competence and learning progress (LP) online.
arXiv Detail & Related papers (2025-02-11T17:08:00Z)
Beyond Scaling: Measuring and Predicting the Upper Bound of Knowledge Retention in Language Model Pre-Training [51.41246396610475]
This paper aims to predict performance in closed-book question answering (QA) without the help of external tools. We conduct large-scale retrieval and semantic analysis across the pre-training corpora of 21 publicly available and 3 custom-trained large language models. Building on these foundations, we propose Size-dependent Mutual Information (SMI), an information-theoretic metric that linearly correlates pre-training data characteristics.
arXiv Detail & Related papers (2025-02-06T13:23:53Z)
Training Agents with Weakly Supervised Feedback from Large Language Models [19.216542820742607]
This paper introduces a novel training method for LLM-based agents using weakly supervised signals from a critic LLM. Our agents are trained in an iterative manner, where they initially generate trajectories through environmental interaction. Tests on the API-bank dataset show consistent improvement in our agents' capabilities and comparable performance to GPT-4.
arXiv Detail & Related papers (2024-11-29T08:47:04Z)
Generalization v.s. Memorization: Tracing Language Models' Capabilities Back to Pretraining Data [76.90128359866462]
We introduce an extended concept of memorization, distributional memorization, which measures the correlation between the output probabilities and the pretraining data frequency. We show that memorization plays a larger role in simpler, knowledge-intensive tasks, while generalization is the key for harder, reasoning-based tasks.
arXiv Detail & Related papers (2024-07-20T21:24:40Z)
Efficient Continual Pre-training by Mitigating the Stability Gap [68.49269649759005]
We study the behavior of Large Language Models (LLMs) during continual pre-training. We propose three effective strategies to enhance LLM performance within a fixed compute budget. Our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget.
arXiv Detail & Related papers (2024-06-21T02:28:37Z)
Extracting Heuristics from Large Language Models for Reward Shaping in Reinforcement Learning [28.077228879886402]
Reinforcement Learning (RL) suffers from sample inefficiency in sparse reward domains, and the problem is further pronounced in case of stochastic transitions. To improve the sample efficiency, reward shaping is a well-studied approach to introduce intrinsic rewards that can help the RL agent converge to an optimal policy faster.
arXiv Detail & Related papers (2024-05-24T03:53:57Z)
Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models [56.00992369295851]
Open-sourced Large Language Models (LLMs) have achieved great success in various NLP tasks; however, they are still far inferior to API-based models when acting as agents. This paper delivers three key observations: (1) the current agent training corpus is entangled with both formats following and agent reasoning, which significantly shifts from the distribution of its pre-training data; (2) LLMs exhibit different learning speeds on the capabilities required by agent tasks; and (3) current approaches have side-effects when improving agent abilities by introducing hallucinations. We propose Agent-FLAN to effectively Fine-tune LANguage models for Agents.
arXiv Detail & Related papers (2024-03-19T16:26:10Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
ExpeL: LLM Agents Are Experiential Learners [57.13685954854463]
We introduce the Experiential Learning (ExpeL) agent to allow learning from agent experiences without requiring parametric updates. Our agent autonomously gathers experiences and extracts knowledge using natural language from a collection of training tasks. At inference, the agent recalls its extracted insights and past experiences to make informed decisions.
arXiv Detail & Related papers (2023-08-20T03:03:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.