TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
- URL: http://arxiv.org/abs/2510.19286v1
- Date: Wed, 22 Oct 2025 06:42:01 GMT
- Title: TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
- Authors: Reza Esfandiarpoor, Vishwas Suryanarayanan, Stephen H. Bach, Vishal Chowdhary, Anthony Aue,
- Abstract summary: We introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services.<n>We also provide manually annotated ground-truth tools for each task.<n>Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments.
- Score: 12.249551019598442
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground truth tools to show the potential of tool-calling agents for both improving performance and reducing costs assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5's performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.
Related papers
- Learning to Rewrite Tool Descriptions for Reliable LLM-Agent Tool Use [21.666294374943178]
We propose a curriculum learning framework that transfers supervision from trace-rich settings to trace-free deployment.<n> Experiments show consistent gains on unseen tools, strong cross-domain generalization, and robustness as the number of candidate tools scales to over 100.
arXiv Detail & Related papers (2026-02-23T23:50:24Z) - AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning [66.24374176797075]
We introduce textbfAdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior.<n>AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that prioritizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage.
arXiv Detail & Related papers (2026-01-26T16:04:43Z) - MetaToolAgent: Towards Generalizable Tool Usage in LLMs through Meta-Learning [16.060518943785514]
We introduce a dataset spanning 7 domains, containing 155 tools and 9,377 question-answer pairs.<n>We also propose MetaToolAgent (MTA), a meta-learning approach designed to improve cross-tool generalization.
arXiv Detail & Related papers (2026-01-19T02:53:31Z) - Tool Graph Retriever: Exploring Dependency Graph-based Tool Retrieval for Large Language Models [43.50789219459378]
We propose Tool Graph Retriever (TGR), which exploits the dependencies among tools to learn better tool representations for retrieval.<n>First, we construct a dataset termed TDI300K to train a discriminator for identifying tool dependencies.<n>Then, we represent all candidate tools as a tool dependency graph and use graph convolution to integrate the dependencies into their representations.
arXiv Detail & Related papers (2025-08-07T08:36:26Z) - Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models [47.145844910856134]
Tool learning aims to augment large language models with diverse tools, enabling them to act as agents for solving practical tasks.<n>Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step.<n>Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from the real-world scenarios.<n>We propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks, and a corpus of 43k tools, collected from
arXiv Detail & Related papers (2025-03-03T17:37:16Z) - Re-Invoke: Tool Invocation Rewriting for Zero-Shot Tool Retrieval [47.81307125613145]
Re-Invoke is an unsupervised tool retrieval method designed to scale effectively to large toolsets without training.
We employ a novel multi-view similarity ranking strategy based on intents to pinpoint the most relevant tools for each query.
Our evaluation demonstrates that Re-Invoke significantly outperforms state-of-the-art alternatives in both single-tool and multi-tool scenarios.
arXiv Detail & Related papers (2024-08-03T22:49:27Z) - MetaTool: Facilitating Large Language Models to Master Tools with Meta-task Augmentation [25.360660222418183]
We present MetaTool, a novel tool learning methodology designed to generalize across any reusable toolset.
By incorporating meta-task data into task-oriented training, our method significantly enhances the performance of open-source Large Language Models.
arXiv Detail & Related papers (2024-07-15T10:15:41Z) - Towards Completeness-Oriented Tool Retrieval for Large Language Models [60.733557487886635]
Real-world systems often incorporate a wide array of tools, making it impractical to input all tools into Large Language Models.
Existing tool retrieval methods primarily focus on semantic matching between user queries and tool descriptions.
We propose a novel modelagnostic COllaborative Learning-based Tool Retrieval approach, COLT, which captures not only the semantic similarities between user queries and tool descriptions but also takes into account the collaborative information of tools.
arXiv Detail & Related papers (2024-05-25T06:41:23Z) - EASYTOOL: Enhancing LLM-based Agents with Concise Tool Instruction [56.02100384015907]
EasyTool is a framework transforming diverse and lengthy tool documentation into a unified and concise tool instruction.
It can significantly reduce token consumption and improve the performance of tool utilization in real-world scenarios.
arXiv Detail & Related papers (2024-01-11T15:45:11Z) - Large Language Models as Tool Makers [85.00361145117293]
We introduce a closed-loop framework, referred to as LLMs A s Tool Makers (LATM), where LLMs create their own reusable tools for problem-solving.
Our approach consists of two phases: 1) tool making: an LLM acts as the tool maker that crafts tools for a set of tasks. 2) tool using: another LLM acts as the tool user, which applies the tool built by the tool maker for problem-solving.
arXiv Detail & Related papers (2023-05-26T17:50:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.