API Pack: A Massive Multi-Programming Language Dataset for API Call Generation
- URL: http://arxiv.org/abs/2402.09615v4
- Date: Mon, 3 Jun 2024 22:38:04 GMT
- Title: API Pack: A Massive Multi-Programming Language Dataset for API Call Generation
- Authors: Zhen Guo, Adriana Meza Soria, Wei Sun, Yikang Shen, Rameswar Panda,
- Abstract summary: API Pack is a massive multi-programming language dataset containing more than 1 million instruction-API call pairs.
By fine-tuning CodeLlama-13B on 20,000 Python instances from API Pack, we enable it to outperform GPT-3.5 and GPT-4 in generating unseen API calls.
- Score: 30.466726273695144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce API Pack, a massive multi-programming language dataset containing more than 1 million instruction-API call pairs to improve the API call generation capabilities of large language models. By fine-tuning CodeLlama-13B on 20,000 Python instances from API Pack, we enable it to outperform GPT-3.5 and GPT-4 in generating unseen API calls. Fine-tuning on API Pack also facilitates cross-programming language generalization by leveraging a large amount of data in one language and small amounts of data from other languages. Scaling the training data to 1 million instances further improves the model's ability to generalize to new APIs not used in training. To facilitate further research, we open-source the API Pack dataset, trained model, and associated source code at https://github.com/zguo0525/API-Pack.
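The abstract does not show the dataset's record schema, so the following is only a minimal sketch: hypothetical field names (`instruction`, `api_call`, `lang`) and a made-up prompt template illustrate how instruction-to-API-call pairs could be filtered to one language (e.g., the 20,000 Python instances mentioned above) and formatted for instruction fine-tuning. The released dataset and scripts at the linked repository define the actual schema and format.

```python
import json
import random

# Hypothetical API Pack-style records; the real field names and schema
# may differ from what the released dataset uses.
records = [
    {
        "instruction": "Retrieve the current weather for Boston using the weather API.",
        "api_call": "requests.get('https://api.example.com/v1/weather', params={'city': 'Boston'})",
        "lang": "python",
    },
    {
        "instruction": "List all repositories for a GitHub user.",
        "api_call": "curl https://api.github.com/users/octocat/repos",
        "lang": "curl",
    },
]

def build_finetune_split(records, lang="python", n=20_000, seed=0):
    """Select up to n instances in one language and format them as
    prompt/completion pairs for instruction fine-tuning."""
    pool = [r for r in records if r["lang"] == lang]
    random.Random(seed).shuffle(pool)
    return [
        {
            "prompt": f"### Instruction:\n{r['instruction']}\n\n### API call:\n",
            "completion": r["api_call"],
        }
        for r in pool[:n]
    ]

if __name__ == "__main__":
    split = build_finetune_split(records, lang="python", n=20_000)
    print(json.dumps(split, indent=2))
```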
Related papers
- A Systematic Evaluation of Large Code Models in API Suggestion: When, Which, and How [53.65636914757381]
API suggestion is a critical task in modern software development.
Recent advancements in large code models (LCMs) have shown promise in the API suggestion task.
arXiv Detail & Related papers (2024-09-20T03:12:35Z)
- WorldAPIs: The World Is Worth How Many APIs? A Thought Experiment [49.00213183302225]
We propose a framework to induce new APIs by grounding wikiHow instruction to situated agent policies.
Inspired by recent successes of large language models (LLMs) in embodied planning, we propose a few-shot prompting approach to steer GPT-4.
arXiv Detail & Related papers (2024-07-10T15:52:44Z)
- APIGen: Generative API Method Recommendation [16.541442856821]
APIGen is a generative API recommendation approach based on enhanced in-context learning (ICL).
APIGen searches for similar posts to the programming queries from the lexical, syntactical, and semantic perspectives.
Through this reasoning process, APIGen makes the recommended APIs better meet the programming requirements of the queries.
arXiv Detail & Related papers (2024-01-29T02:35:42Z)
- Pop Quiz! Do Pre-trained Code Models Possess Knowledge of Correct API Names? [28.86399157983769]
Recent breakthroughs in pre-trained code models, such as CodeBERT and Codex, have shown their superior performance in various downstream tasks.
Recent studies reveal that even state-of-the-art pre-trained code models struggle with suggesting the correct APIs during code generation.
arXiv Detail & Related papers (2023-09-14T15:46:41Z)
- Private-Library-Oriented Code Generation with Large Language Models [52.73999698194344]
This paper focuses on utilizing large language models (LLMs) for code generation in private libraries.
We propose a novel framework that emulates the process of programmers writing private code.
We create four private-library benchmarks: TorchDataEval, TorchDataComplexEval, MonkeyEval, and BeatNumEval.
arXiv Detail & Related papers (2023-07-28T07:43:13Z)
- Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models [68.29126169579132]
API vendors charge their users based on usage, more specifically on the number of "tokens" processed or generated by the underlying language models.
What constitutes a token, however, is training-data and model dependent, with a large variance in the number of tokens required to convey the same information in different languages.
We conduct a systematic analysis of the cost and utility of OpenAI's language model API on multilingual benchmarks in 22 typologically diverse languages (a minimal token-count comparison of this kind is sketched after this list).
arXiv Detail & Related papers (2023-05-23T05:46:45Z)
- When Language Model Meets Private Library [25.610036042971043]
In practice, it is common for programmers to write code using private libraries.
This is a challenge for language models since they have never seen private APIs during training.
We propose a novel framework with two modules: the APIRetriever finds useful APIs, and then the APICoder generates code using these APIs (a generic retrieve-then-generate sketch appears after this list).
arXiv Detail & Related papers (2022-10-31T11:42:06Z)
- Binding Language Models in Symbolic Languages [146.3027328556881]
Binder is a training-free neural-symbolic framework that maps the task input to a program.
In the parsing stage, Codex identifies the parts of the task input that cannot be answered by the original programming language.
In the execution stage, Codex can perform versatile functionalities given proper prompts in the API calls.
arXiv Detail & Related papers (2022-10-06T12:55:17Z)
- On the Effectiveness of Pretrained Models for API Learning [8.788509467038743]
Developers frequently use APIs to implement certain functionalities, such as parsing Excel files or reading and writing text files line by line.
Developers can greatly benefit from automatic generation of API usage sequences from natural language queries, enabling them to build applications faster and more cleanly.
Existing approaches utilize information retrieval models to search for matching API sequences given a query or use RNN-based encoder-decoder to generate API sequences.
arXiv Detail & Related papers (2022-04-05T20:33:24Z)
- Compositional Generalization for Natural Language Interfaces to Web APIs [26.851998759793453]
This paper presents Okapi, a new dataset for Natural Language to executable web Application Programming Interfaces (NL2API).
This dataset is in English and contains 22,508 questions and 9,019 unique API calls, covering three domains.
We define new compositional generalization tasks for NL2API that probe the models' ability to extrapolate from simple API calls in the training set to new and more complex API calls at inference time (a toy complexity-based split is sketched after this list).
arXiv Detail & Related papers (2021-12-09T20:49:01Z)
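The "Do All Languages Cost the Same?" entry above studies how token counts for the same content vary across languages under usage-priced APIs. Below is a minimal sketch of that kind of measurement, using the open-source `tiktoken` tokenizer as a stand-in for a vendor's tokenizer; the translations are illustrative and the paper's own methodology and benchmarks are far more extensive.

```python
import tiktoken  # pip install tiktoken

# The same sentence expressed in several languages; translations are illustrative.
samples = {
    "English": "The weather is nice today.",
    "German": "Das Wetter ist heute schön.",
    "Japanese": "今日は天気がいいです。",
    "Hindi": "आज मौसम अच्छा है।",
}

# cl100k_base is the encoding used by several OpenAI chat models.
enc = tiktoken.get_encoding("cl100k_base")

for lang, text in samples.items():
    n_tokens = len(enc.encode(text))
    # With per-token pricing, more tokens for the same content means higher cost.
    print(f"{lang:10s} {n_tokens:3d} tokens")
```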
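The "When Language Model Meets Private Library" entry above describes a two-stage pipeline (APIRetriever, then APICoder). The sketch below is a generic retrieve-then-generate stand-in rather than the paper's implementation: it ranks private API docstrings against a query with TF-IDF and assembles a prompt for a code model, whose invocation is left as a stub. All API names and docstrings are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A toy "private library" documentation index; real systems would index many APIs.
private_apis = {
    "monkey.load_table(path)": "Load a table from a CSV or Parquet file into a DataFrame-like object.",
    "monkey.groupby_mean(df, key)": "Group rows by a key column and compute the mean of numeric columns.",
    "monkey.to_excel(df, path)": "Write a DataFrame-like object to an Excel file.",
}

def retrieve_apis(query: str, k: int = 2) -> list[str]:
    """Stage 1 (retriever stand-in): rank API docs by TF-IDF similarity to the query."""
    names, docs = list(private_apis), list(private_apis.values())
    vec = TfidfVectorizer().fit(docs + [query])
    sims = cosine_similarity(vec.transform([query]), vec.transform(docs))[0]
    ranked = sorted(zip(sims, names), reverse=True)
    return [name for _, name in ranked[:k]]

def build_codegen_prompt(query: str) -> str:
    """Stage 2 (generator stand-in): hand the retrieved API signatures to a code model."""
    api_block = "\n".join(f"- {name}: {private_apis[name]}" for name in retrieve_apis(query))
    return f"You may use these private APIs:\n{api_block}\n\nTask: {query}\nCode:\n"

if __name__ == "__main__":
    print(build_codegen_prompt("compute the average value per group from a CSV file"))
    # The prompt would then be passed to a code LLM; that call is omitted here.
```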
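The Okapi entry above defines compositional generalization splits where training covers simple API calls and evaluation requires more complex ones. The summary does not give the exact split criteria, so the sketch below only illustrates the general idea on invented NL2API pairs, using the number of query parameters as a crude proxy for call complexity.

```python
from urllib.parse import urlparse, parse_qs

# Toy NL2API pairs; the real Okapi data covers 22,508 questions across three domains.
examples = [
    ("forecast for Boston", "GET /weather?city=Boston"),
    ("forecast for Boston tomorrow", "GET /weather?city=Boston&day=tomorrow"),
    ("hourly forecast for Boston tomorrow in Celsius",
     "GET /weather?city=Boston&day=tomorrow&granularity=hour&unit=C"),
]

def call_complexity(api_call: str) -> int:
    """Count query parameters as a crude proxy for API-call complexity."""
    query = urlparse(api_call.split(" ", 1)[1]).query
    return len(parse_qs(query))

# Train on simple calls, evaluate on strictly more complex ones.
train = [(q, c) for q, c in examples if call_complexity(c) <= 2]
test = [(q, c) for q, c in examples if call_complexity(c) > 2]
print("train:", train)
print("test:", test)
```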