In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents
- URL: http://arxiv.org/abs/2509.01560v1
- Date: Mon, 01 Sep 2025 15:42:21 GMT
- Title: In-N-Out: A Parameter-Level API Graph Dataset for Tool Agents
- Authors: Seungkyu Lee, Nalim Kim, Yohan Jo
- Abstract summary: We introduce In-N-Out, the first expert-annotated dataset of API graphs built from two real-world API benchmarks and their documentation. Using In-N-Out significantly improves performance on both tool retrieval and multi-tool query generation. Our findings highlight the promise of using explicit API graphs for tool agents and the utility of In-N-Out as a valuable resource.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tool agents -- LLM-based systems that interact with external APIs -- offer a way to execute real-world tasks. However, as tasks become increasingly complex, these agents struggle to identify and call the correct APIs in the proper order. To tackle this problem, we investigate converting API documentation into a structured API graph that captures API dependencies and leveraging it for multi-tool queries that require compositional API calls. To support this, we introduce In-N-Out, the first expert-annotated dataset of API graphs built from two real-world API benchmarks and their documentation. Using In-N-Out significantly improves performance on both tool retrieval and multi-tool query generation, nearly doubling that of LLMs using documentation alone. Moreover, graphs generated by models fine-tuned on In-N-Out close 90% of this gap, showing that our dataset helps models learn to comprehend API documentation and parameter relationships. Our findings highlight the promise of using explicit API graphs for tool agents and the utility of In-N-Out as a valuable resource. We will release the dataset and code publicly.
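The abstract describes a parameter-level API graph in which an edge records that one API's output can supply another API's input parameter. As a minimal illustrative sketch (not the authors' released code; all API and field names below are invented), such a graph can be represented as a mapping from output fields to the input parameters they can fill:

```python
# Hedged sketch of a parameter-level API graph: nodes are APIs, and a
# directed edge (src_api.output_field -> dst_api.input_param) records
# that src_api's output can fill one of dst_api's input parameters.
from collections import defaultdict


class APIGraph:
    def __init__(self):
        # (source_api, output_field) -> set of (target_api, input_param)
        self.edges = defaultdict(set)

    def add_dependency(self, src_api, output_field, dst_api, input_param):
        self.edges[(src_api, output_field)].add((dst_api, input_param))

    def callable_after(self, api, outputs):
        """Return APIs whose inputs can be filled by `api`'s given outputs."""
        targets = set()
        for field in outputs:
            for dst_api, _ in self.edges.get((api, field), ()):
                targets.add(dst_api)
        return targets


g = APIGraph()
g.add_dependency("search_flights", "flight_id", "book_flight", "flight_id")
g.add_dependency("book_flight", "booking_id", "cancel_booking", "booking_id")
print(g.callable_after("search_flights", ["flight_id"]))  # {'book_flight'}
```

A tool agent can walk such edges to order compositional calls: here, `book_flight` is only reachable once `search_flights` has produced a `flight_id`.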
Related papers
- Doc2Agent: Scalable Generation of Tool-Using Agents from API Documentation [2.4117201298131232]
Doc2Agent is a scalable pipeline to build tool agents that can call Python-based tools generated from API documentation. We evaluate our approach on real-world APIs, WebArena APIs, and research APIs, producing validated tools.
arXiv Detail & Related papers (2025-06-24T20:30:44Z) - Invocable APIs derived from NL2SQL datasets for LLM Tool-Calling Evaluation [7.260113022127256]
Large language models (LLMs) are routinely deployed as agentic systems with access to tools that interact with live environments to accomplish tasks. In order to create datasets with such characteristics, we explore how existing NL2SQL datasets can be used to automatically create NL2API datasets. We apply this pipeline to one of the largest NL2SQL datasets, BIRD, to create a collection of over 2500 APIs that can serve as invocable tools or REST endpoints.
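The NL2SQL-to-API conversion sketched above can be illustrated in miniature: a parameterized SQL query becomes a callable tool with a declared parameter schema. This is a hedged toy example, not the paper's pipeline; the tool name, schema shape, and query are all assumptions for illustration.

```python
# Illustrative only: wrapping a parameterized SQL query (as found in
# NL2SQL data) as an invocable tool with a simple parameter spec.
import sqlite3

SQL = "SELECT name FROM users WHERE age > :min_age"

# A minimal tool spec an LLM agent could be given for function calling.
TOOL_SPEC = {
    "name": "get_users_older_than",
    "parameters": {"min_age": {"type": "integer"}},
}


def get_users_older_than(conn, min_age):
    # Bind the named parameter and execute the templated query.
    return [row[0] for row in conn.execute(SQL, {"min_age": min_age})]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("ana", 30), ("bo", 12)])
print(get_users_older_than(conn, 18))  # ['ana']
```

The same wrapper could equally be exposed as a REST endpoint; the key point is that the SQL query's bind parameters become the tool's declared input parameters.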
arXiv Detail & Related papers (2025-06-12T20:17:52Z) - ExploraCoder: Advancing code generation for multiple unseen APIs via planning and chained exploration [70.26807758443675]
ExploraCoder is a training-free framework that empowers large language models to invoke unseen APIs in code solutions. Experimental results demonstrate that ExploraCoder significantly improves performance for models lacking prior API knowledge.
arXiv Detail & Related papers (2024-12-06T19:00:15Z) - A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs [46.65963514391019]
We present AutoRestTest, the first black-box tool to adopt a dependency-embedded multi-agent approach for REST API testing. Our approach treats REST API testing as a separable problem, where four agents collaborate to optimize API exploration. Our evaluation of AutoRestTest on 12 real-world REST services shows that it outperforms the four leading black-box REST API testing tools.
arXiv Detail & Related papers (2024-11-11T16:20:27Z) - AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction [24.67142048995415]
Large Language Models (LLMs) can interact with the real world by connecting with versatile external APIs.
We introduce AppBench, the first benchmark to evaluate LLMs' ability to plan and execute multiple APIs from various sources.
arXiv Detail & Related papers (2024-10-10T04:03:13Z) - ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z) - APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets [99.8988504388011]
APIGen is an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications.
We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets.
We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agent domains.
arXiv Detail & Related papers (2024-06-26T17:49:11Z) - SoAy: A Solution-based LLM API-using Methodology for Academic Information Seeking [59.59923482238048]
SoAy is a solution-based LLM API-using methodology for academic information seeking. It uses code with a solution as the reasoning method, where a solution is a pre-constructed API calling sequence. Results show a 34.58-75.99% performance improvement compared to state-of-the-art LLM API-based baselines.
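The "solution" idea above, a pre-constructed API calling sequence executed as code, can be sketched as a simple pipeline where each call's output feeds the next call's input. This is a hedged toy illustration, not SoAy's implementation; the API names and payload shapes are invented.

```python
# Hedged sketch of executing a pre-constructed API calling sequence:
# the "solution" fixes which APIs to call and in what order, and each
# step's output is piped into the next step. Names are illustrative.
def run_solution(solution, apis, query):
    value = query
    for api_name in solution:          # fixed, pre-constructed order
        value = apis[api_name](value)  # feed output into the next call
    return value


# Two stand-in "APIs" for an academic-search scenario.
apis = {
    "search_author": lambda q: {"author_id": abs(hash(q)) % 100},
    "list_papers": lambda a: [f"paper-{a['author_id']}-{i}" for i in range(2)],
}

papers = run_solution(["search_author", "list_papers"], apis, "some author")
print(len(papers))  # 2
```

Executing such a sequence as code, rather than asking the model to re-plan each step, is what lets a single verified solution be reused across queries of the same shape.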
arXiv Detail & Related papers (2024-05-24T02:44:14Z) - API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs [28.840207102132286]
We focus on the task of identifying, curating, and transforming existing datasets.
We introduce API-BLEND, a large corpora for training and systematic testing of tool-augmented LLMs.
We demonstrate the utility of the API-BLEND dataset for both training and benchmarking purposes.
arXiv Detail & Related papers (2024-02-23T18:30:49Z) - You Can REST Now: Automated REST API Documentation and Testing via LLM-Assisted Request Mutations [8.158964648211002]
We present RESTSpecIT, the first automated approach that infers documentation and performs black-box testing of REST APIs. Our approach requires minimal user input compared to state-of-the-art tools. We evaluate the quality of our tool with three state-of-the-art LLMs: DeepSeek V3, GPT-4.1, and GPT-3.5.
arXiv Detail & Related papers (2024-02-07T18:55:41Z) - ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world
APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities.
We introduce ToolLLM, a general tool-use framework encompassing data construction, model training, and evaluation.
We first present ToolBench, an instruction-tuning framework for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z) - Binding Language Models in Symbolic Languages [146.3027328556881]
Binder is a training-free neural-symbolic framework that maps the task input to a program.
In the parsing stage, Codex is able to identify the part of the task input that cannot be answered by the original programming language.
In the execution stage, Codex can perform versatile functionalities given proper prompts in the API calls.
arXiv Detail & Related papers (2022-10-06T12:55:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.