Related papers: NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls

NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls

URL: http://arxiv.org/abs/2409.03797v1
Date: Wed, 4 Sep 2024 17:53:24 GMT
Title: NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls
Authors: Kinjal Basu, Ibrahim Abdelaziz, Kelsey Bradford, Maxwell Crouse, Kiran Kate, Sadhana Kumaravel, Saurabh Goyal, Asim Munawar, Yara Rizk, Xin Wang, Luis Lastras, Pavan Kapanipathi,
Abstract summary: We present NESTFUL, a benchmark to evaluate large language models (LLMs) on nested sequences of API calls. Our results show that most models do not perform well on nested APIs in NESTFUL as compared to their performance on the simpler problem settings available in existing benchmarks.
Score: 18.831512738668792
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous agent applications powered by large language models (LLMs) have recently risen to prominence as effective tools for addressing complex real-world tasks. At their core, agentic workflows rely on LLMs to plan and execute the use of tools and external Application Programming Interfaces (APIs) in sequence to arrive at the answer to a user's request. Various benchmarks and leaderboards have emerged to evaluate an LLM's capabilities for tool and API use; however, most of these evaluations only track single or multiple isolated API calling capabilities. In this paper, we present NESTFUL, a benchmark to evaluate LLMs on nested sequences of API calls, i.e., sequences where the output of one API call is passed as input to a subsequent call. NESTFUL has a total of 300 human annotated samples divided into two types - executable and non-executable. The executable samples are curated manually by crawling Rapid-APIs whereas the non-executable samples are hand picked by human annotators from data synthetically generated using an LLM. We evaluate state-of-the-art LLMs with function calling abilities on NESTFUL. Our results show that most models do not perform well on nested APIs in NESTFUL as compared to their performance on the simpler problem settings available in existing benchmarks.

Related papers

ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario [17.494787282066866]
We introduce ComplexFuncBench, a benchmark for complex function calling across five real-world scenarios. Compared to existing benchmarks, ComplexFuncBench encompasses multi-step and constrained function calling. We propose an automatic framework, ComplexEval, for quantitatively evaluating complex function calling tasks.
arXiv Detail & Related papers (2025-01-17T11:41:53Z)
SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks. We show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SPECTOOL to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
SEAL: Suite for Evaluating API-use of LLMs [1.2528321519119252]
SEAL is an end-to-end testbed designed to evaluate large language models in real-world API usage. It standardizes existing benchmarks, integrates an agent system for testing API retrieval and planning, and addresses the instability of real-time APIs.
arXiv Detail & Related papers (2024-09-23T20:16:49Z)
Harnessing LLMs for API Interactions: A Framework for Classification and Synthetic Data Generation [0.0]
We propose a novel system that integrates Large Language Models (LLMs) for both classifying natural language inputs into corresponding API calls. Our system allows users to invoke complex software functionalities through simple inputs, improving interaction efficiency and lowering the barrier to software utilization.
arXiv Detail & Related papers (2024-09-18T04:56:52Z)
SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts. We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM. We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z)
Granite-Function Calling Model: Introducing Function Calling Abilities via Multi-task Learning of Granular Tasks [35.97890508648945]
We introduce the-20B-FUNCTIONCALLING model under an Apache 2.0 license. The model is trained using a multi-task training approach on seven fundamental tasks. We show that-20B-FUNCTIONCALLING has better generalizability on multiple tasks in seven different evaluation datasets.
arXiv Detail & Related papers (2024-06-27T17:47:26Z)
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions [72.56339136017759]
We introduce BigCodeBench, a benchmark that challenges Large Language Models (LLMs) to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. Our evaluation shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. We propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information.
arXiv Detail & Related papers (2024-06-22T15:52:04Z)
LLM+Reasoning+Planning for supporting incomplete user queries in presence of APIs [0.09374652839580183]
In practice, natural language task requests (user queries) are often incomplete, i.e., they may not contain all the information required by the APIs. We leverage logical reasoning and classical AI planning along with an LLM for accurately answering user queries. Our approach achieves over 95% success rate in most cases on a dataset containing complete and incomplete single goal and multi-goal queries.
arXiv Detail & Related papers (2024-05-21T01:16:34Z)
Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks [76.43527940649939]
We introduce Ada-LEval, a benchmark for evaluating the long-context understanding of large language models (LLMs) Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long context capabilities. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval.
arXiv Detail & Related papers (2024-04-09T17:30:48Z)
API-BLEND: A Comprehensive Corpora for Training and Benchmarking API LLMs [28.840207102132286]
We focus on the task of identifying, curating, and transforming existing datasets. We introduce API-BLEND, a large corpora for training and systematic testing of tool-augmented LLMs. We demonstrate the utility of the API-BLEND dataset for both training and benchmarking purposes.
arXiv Detail & Related papers (2024-02-23T18:30:49Z)
PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion [96.47420221442397]
We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to finish multi-turn, multi-modal instructions. We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence. The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
arXiv Detail & Related papers (2023-11-03T08:06:35Z)
Reverse Chain: A Generic-Rule for LLMs to Master Multi-API Planning [8.96245399645571]
This paper introduces Reverse Chain'', a controllable, target-driven approach to empower Large Language Models with the capability to operate external APIs only via prompts. To manage a controllable multi-function calling, Reverse Chain adopts a generic rule based on a backward reasoning process.
arXiv Detail & Related papers (2023-10-06T05:20:18Z)
MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation. Within each loop, the MLLM-DataEngine first analyze the weakness of the model based on the evaluation results. For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data. For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities. We introduce ToolLLM, a general tool-usetuning encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning framework for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z)
Allies: Prompting Large Language Model with Beam Search [107.38790111856761]
In this work, we propose a novel method called ALLIES. Given an input query, ALLIES leverages LLMs to iteratively generate new queries related to the original query. By iteratively refining and expanding the scope of the original query, ALLIES captures and utilizes hidden knowledge that may not be directly through retrieval.
arXiv Detail & Related papers (2023-05-24T06:16:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.