Scaling Agents via Continual Pre-training
- URL: http://arxiv.org/abs/2509.13310v1
- Date: Tue, 16 Sep 2025 17:57:19 GMT
- Title: Scaling Agents via Continual Pre-training
- Authors: Liangcai Su, Zhen Zhang, Guangyu Li, Zhuo Chen, Chenxi Wang, Maojia Song, Xinyu Wang, Kuan Li, Jialong Wu, Xuanzhong Chen, Zile Qiao, Zhongwang Zhang, Huifeng Yin, Shihao Cai, Runnan Fang, Zhengwei Tao, Wenbiao Yin, Chenxiong Qian, Yong Jiang, Pengjun Xie, Fei Huang, Jingren Zhou,
- Abstract summary: We propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models.<n>We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retains strong tool-use ability.
- Score: 80.97989245493326
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Large language models (LLMs) have evolved into agentic systems capable of autonomous tool use and multi-step reasoning for complex problem-solving. However, post-training approaches building upon general-purpose foundation models consistently underperform in agentic tasks, particularly in open-source implementations. We identify the root cause: the absence of robust agentic foundation models forces models during post-training to simultaneously learn diverse agentic behaviors while aligning them to expert demonstrations, thereby creating fundamental optimization tensions. To this end, we are the first to propose incorporating Agentic Continual Pre-training (Agentic CPT) into the deep research agents training pipeline to build powerful agentic foundational models. Based on this approach, we develop a deep research agent model named AgentFounder. We evaluate our AgentFounder-30B on 10 benchmarks and achieve state-of-the-art performance while retains strong tool-use ability, notably 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.
Related papers
- AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts [35.52607495764441]
Large Language Models (LLMs) based autonomous agents demonstrate multifaceted capabilities to contribute substantially to economic production.<n>We introduce AgencyBench, a benchmark derived from daily AI usage, evaluating 6 core agentic capabilities across 32 real-world scenarios.<n>These scenarios require an average of 90 tool calls, 1 million tokens, and hours of execution time to resolve.
arXiv Detail & Related papers (2026-01-16T07:22:20Z) - Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning [84.70211451226835]
Large Language Model (LLM) Agents are constrained by a dependency on human-curated data.<n>We introduce Agent0, a fully autonomous framework that evolves high-performing agents without external data.<n>Agent0 substantially boosts reasoning capabilities, improving the Qwen3-8B-Base model by 18% on mathematical reasoning and 24% on general reasoning benchmarks.
arXiv Detail & Related papers (2025-11-20T05:01:57Z) - Klear-AgentForge: Forging Agentic Intelligence through Posttraining Scaling [46.593200463657645]
We present a comprehensive and fully open-source pipeline for training a high-performance agentic model, named Klear-Qwen3-AgentForge.<n>We design effective supervised fine-tuning (SFT) with synthetic data followed by multi-turn reinforcement learning (RL) to unlock the potential for multiple diverse agentic tasks.
arXiv Detail & Related papers (2025-11-08T09:47:27Z) - APTBench: Benchmarking Agentic Potential of Base LLMs During Pre-Training [48.20667772172573]
APTBench is a framework that converts real-world agent tasks and successful trajectories into multiple-choice or text completion questions.<n>It focuses on core agentic abilities, e.g., planning and action, and covers key agent scenarios, software engineering and deep research.<n>Compared to existing general-purpose benchmarks, APTBench offers a more predictive signal of a model's downstream performance as an agent.
arXiv Detail & Related papers (2025-10-28T13:11:22Z) - SFR-DeepResearch: Towards Effective Reinforcement Learning for Autonomously Reasoning Single Agents [93.26456498576181]
This paper focuses on the development of native Autonomous Single-Agent models for Deep Research.<n>Our best variant SFR-DR-20B achieves up to 28.7% on Humanity's Last Exam benchmark.
arXiv Detail & Related papers (2025-09-08T02:07:09Z) - AWorld: Orchestrating the Training Recipe for Agentic AI [35.94278765364194]
We introduce AWorld, an open-source system engineered for large-scale agent-environment interaction.<n>By distributing tasks across a cluster, AWorld accelerates experience collection by 14.6x compared to standard single-node, sequential execution.<n>We trained a Qwen3-32B-based agent that achieves pass@1 accuracy of 32.23% on the GAIA test set.
arXiv Detail & Related papers (2025-08-28T04:04:30Z) - Chain-of-Agents: End-to-End Agent Foundation Models via Multi-Agent Distillation and Agentic RL [41.847359443133776]
Chain-of-Agents (CoA) is a novel paradigm of large language models (LLMs) reasoning that enables native end-to-end complex problem-solving.<n>We introduce a multi-agent distillation framework to distill state-of-the-art multi-agent systems into chain-of-agents trajectories for agentic supervised fine-tuning.<n>We then use agentic reinforcement learning on verifiable agentic tasks to further improve the models' capabilities on chain-of-agents problem solving.
arXiv Detail & Related papers (2025-08-06T17:01:02Z) - Cognitive Kernel-Pro: A Framework for Deep Research Agents and Agent Foundation Models Training [67.895981259683]
General AI Agents are increasingly recognized as foundational frameworks for the next generation of artificial intelligence.<n>Current agent systems are either closed-source or heavily reliant on a variety of paid APIs and proprietary tools.<n>We present Cognitive Kernel-Pro, a fully open-source and (to the maximum extent) free multi-module agent framework.
arXiv Detail & Related papers (2025-08-01T08:11:31Z) - Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems.
This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process.
We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
arXiv Detail & Related papers (2024-10-14T17:57:02Z) - Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents [44.34340798542]
Large Language Models (LLMs) have shown remarkable capabilities in natural language tasks requiring complex reasoning.
Traditional supervised pre-training on static datasets falls short in enabling autonomous agent capabilities.
We propose a framework that combines guided Monte Carlo Tree Search (MCTS) search with a self-critique mechanism and iterative fine-tuning on agent interactions.
arXiv Detail & Related papers (2024-08-13T20:52:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.