Related papers: A Framework for Testing and Adapting REST APIs as LLM Tools

A Framework for Testing and Adapting REST APIs as LLM Tools

URL: http://arxiv.org/abs/2504.15546v3
Date: Fri, 12 Sep 2025 11:24:08 GMT
Title: A Framework for Testing and Adapting REST APIs as LLM Tools
Authors: Jayachandu Bandlamudi, Ritwik Chaudhuri, Neelamadhav Gantayat, Sambit Ghosh, Kushal Mukherjee, Prerna Agarwal, Renuka Sindhgatta, Sameep Mehta,
Abstract summary: Large Language Models (LLMs) are increasingly used to build autonomous agents that perform complex tasks with external tools.<n>Current benchmarks overlook these challenges, leaving a gap in assessing API readiness for agent-driven automation.<n>We present a testing framework that systematically evaluates enterprise APIs when wrapped as Python tools for LLM-based agents.
Score: 11.757827071584737
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are increasingly used to build autonomous agents that perform complex tasks with external tools, often exposed through APIs in enterprise systems. Direct use of these APIs is difficult due to the complex input schema and verbose responses. Current benchmarks overlook these challenges, leaving a gap in assessing API readiness for agent-driven automation. We present a testing framework that systematically evaluates enterprise APIs when wrapped as Python tools for LLM-based agents. The framework generates data-aware test cases, translates them into natural language instructions, and evaluates whether agents can correctly invoke the tool, handle their inputs, and process its responses. We apply the framework to generate over 2400 test cases across different domains and develop a taxonomy of common errors, including input misinterpretation, output failures, and schema mismatches. We further classify errors to support debugging and tool refinement. Our framework provides a systematic approach to enabling enterprise APIs as reliable tools for agent-based applications.

Related papers

Automated structural testing of LLM-based agents: methods, framework, and case studies [0.05254956925594667]
LLM-based agents are rapidly being adopted across diverse domains.<n>Current testing approaches focus on acceptance-level evaluation from the user's perspective.<n>We present methods to enable structural testing of LLM-based agents.
arXiv Detail & Related papers (2026-01-25T11:52:30Z)
ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow.<n>We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories.<n>Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z)
DeepAgent: A General Reasoning Agent with Scalable Toolsets [111.6384541877723]
DeepAgent is an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution.<n>To address the challenges of long-horizon interactions, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories.<n>We develop an end-to-end reinforcement learning strategy, namely ToolPO, that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to the tool invocation tokens.
arXiv Detail & Related papers (2025-10-24T16:24:01Z)
Automated Creation and Enrichment Framework for Improved Invocation of Enterprise APIs as Tools [10.440520289311332]
ACE is an automated tool creation and enrichment framework for Large Language Models.<n>It generates enriched tool specifications with parameter descriptions and examples to improve selection and invocation accuracy.<n>We validate our framework on both proprietary and open-source APIs and demonstrate its integration with agentic frameworks.
arXiv Detail & Related papers (2025-09-15T06:41:54Z)
Evaluating LLMs on Sequential API Call Through Automated Test Generation [10.621357661774244]
StateGen is an automated framework designed to generate diverse coding tasks involving sequential API interactions.<n>We construct StateEval, a benchmark encompassing 120 verified test cases spanning across three representative scenarios.<n> Experimental results confirm that StateGen can effectively generate challenging and realistic API-oriented tasks.
arXiv Detail & Related papers (2025-07-13T03:52:51Z)
Learning API Functionality from Demonstrations for Tool-based Agents [1.3332982107151432]
We propose learning API functionality directly from demonstrations as a new paradigm applicable in scenarios without documentation.<n>We study how the number of demonstrations and the use of LLM-generated summaries and evaluations affect the task success rate of the API-based agent.<n>We find that providing explicit function calls and natural language critiques significantly improves the agent's task success rate due to more accurate parameter filling.
arXiv Detail & Related papers (2025-05-30T04:17:09Z)
Test Amplification for REST APIs via Single and Multi-Agent LLM Systems [1.6499388997661122]
We show how single-agent and multi-agent LLM systems can amplify a REST API test suite.<n>Our evaluation demonstrates increased API coverage, identification of numerous bugs in the API under test, and insights into the computational cost and energy consumption of both approaches.
arXiv Detail & Related papers (2025-04-10T20:19:50Z)
SOPBench: Evaluating Language Agents at Following Standard Operating Procedures and Constraints [59.645885492637845]
SOPBench is an evaluation pipeline that transforms each service-specific SOP code program into a directed graph of executable functions.<n>Our approach transforms each service-specific SOP code program into a directed graph of executable functions and requires agents to call these functions based on natural language SOP descriptions.<n>We evaluate 18 leading models, and results show the task is challenging even for top-tier models.
arXiv Detail & Related papers (2025-03-11T17:53:02Z)
ToolFactory: Automating Tool Generation by Leveraging LLM to Understand REST API Documentations [4.934192277899036]
API documentation often suffers from a lack of standardization, inconsistent schemas, and incomplete information.<n>We developed textbfToolFactory, an open-source pipeline for automating tool generation from unstructured API documents.<n>We also demonstrated ToolFactory by creating a domain-specific AI agent for glycomaterials research.
arXiv Detail & Related papers (2025-01-28T13:42:33Z)
AutoRestTest: A Tool for Automated REST API Testing Using LLMs and MARL [46.65963514391019]
AutoRestTest is a novel tool that integrates the Semantic Property Dependency Graph (SPDG) with Multi-Agent Reinforcement Learning (MARL) and large language models (LLMs) for effective REST API testing.
arXiv Detail & Related papers (2025-01-15T05:54:33Z)
LlamaRestTest: Effective REST API Testing with Small Language Models [50.058600784556816]
We present LlamaRestTest, a novel approach that employs two custom Large Language Models (LLMs) to generate realistic test inputs.<n>We evaluate it against several state-of-the-art REST API testing tools, including RESTGPT, a GPT-powered specification-enhancement tool.<n>Our study shows that small language models can perform as well as, or better than, large language models in REST API testing.
arXiv Detail & Related papers (2025-01-15T05:51:20Z)
A Multi-Agent Approach for REST API Testing with Semantic Graphs and LLM-Driven Inputs [46.65963514391019]
We present AutoRestTest, the first black-box tool to adopt a dependency-embedded multi-agent approach for REST API testing.<n>Our approach treats REST API testing as a separable problem, where four agents collaborate to optimize API exploration.<n>Our evaluation of AutoRestTest on 12 real-world REST services shows that it outperforms the four leading black-box REST API testing tools.
arXiv Detail & Related papers (2024-11-11T16:20:27Z)
AutoPT: How Far Are We from the End2End Automated Web Penetration Testing? [54.65079443902714]
We introduce AutoPT, an automated penetration testing agent based on the principle of PSM driven by LLMs. Our results show that AutoPT outperforms the baseline framework ReAct on the GPT-4o mini model.
arXiv Detail & Related papers (2024-11-02T13:24:30Z)
KAT: Dependency-aware Automated API Testing with Large Language Models [1.7264233311359707]
KAT (Katalon API Testing) is a novel AI-driven approach that autonomously generates test cases to validate APIs. Our evaluation of KAT using 12 real-world services shows that it can improve validation coverage, detect more undocumented status codes, and reduce false positives in these services.
arXiv Detail & Related papers (2024-07-14T14:48:18Z)
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios [93.68764280953624]
UltraTool is a novel benchmark designed to improve and evaluate Large Language Models' ability in tool utilization. It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage.
arXiv Detail & Related papers (2024-01-30T16:52:56Z)
Leveraging Large Language Models to Improve REST API Testing [51.284096009803406]
RESTGPT takes as input an API specification, extracts machine-interpretable rules, and generates example parameter values from natural-language descriptions in the specification. Our evaluations indicate that RESTGPT outperforms existing techniques in both rule extraction and value generation.
arXiv Detail & Related papers (2023-12-01T19:53:23Z)
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs) It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
arXiv Detail & Related papers (2023-09-29T17:40:26Z)
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities. We introduce ToolLLM, a general tool-usetuning encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning framework for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.