When Language Model Meets Private Library
- URL: http://arxiv.org/abs/2210.17236v1
- Date: Mon, 31 Oct 2022 11:42:06 GMT
- Title: When Language Model Meets Private Library
- Authors: Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, Jian-Guang
Lou
- Abstract summary: In practice, it is common for programmers to write code using private libraries.
This is a challenge for language models since they have never seen private APIs during training.
We propose a novel framework with two modules: the APIRetriever finds useful APIs, and then the APICoder generates code using these APIs.
- Score: 25.610036042971043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid development of pre-training techniques, a number of language
models have been pre-trained on large-scale code corpora and perform well in
code generation. In this paper, we investigate how to equip pre-trained
language models with the ability of code generation for private libraries. In
practice, it is common for programmers to write code using private libraries.
However, this is a challenge for language models since they have never seen
private APIs during training. Motivated by the fact that private libraries
usually come with elaborate API documentation, we propose a novel framework
with two modules: the APIRetriever finds useful APIs, and then the APICoder
generates code using these APIs. For APIRetriever, we present a dense retrieval
system and also design a friendly interaction to involve users. For APICoder, we
can directly use off-the-shelf language models, or continually pre-train the
base model on a code corpus containing API information. Both modules are
trained with data from public libraries and can be generalized to private ones.
Furthermore, we craft three benchmarks for private libraries, named
TorchDataEval, MonkeyEval, and BeatNumEval. Experimental results demonstrate
the impressive performance of our framework.
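The framework described in the abstract follows a retrieve-then-generate pattern: a dense APIRetriever scores private-library API documentation against the programmer's intent, and an APICoder then generates code conditioned on the retrieved APIs. Below is a minimal sketch of that pattern only, assuming a sentence-transformers bi-encoder as a stand-in retriever and an off-the-shelf code LM as the coder; the model choices and all names (`api_docs`, `retrieve_apis`, `generate_code`) are illustrative and not taken from the paper's released code.

```python
# Minimal sketch of a retrieve-then-generate flow in the spirit of
# APIRetriever + APICoder. Model choices and all names here are
# illustrative stand-ins, not the authors' implementation.
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical private-library API documentation: {API name: one-line description}.
api_docs = {
    "monkey.read_csv": "Read a comma-separated values file into a DataFrame.",
    "monkey.DataFrame.sort_values": "Sort a DataFrame by the values of a column.",
    "monkey.DataFrame.apply_map": "Apply a function element-wise to a DataFrame.",
}
api_names = list(api_docs)

# APIRetriever stand-in: a dense bi-encoder over the API descriptions.
retriever = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = retriever.encode(list(api_docs.values()), convert_to_tensor=True)

def retrieve_apis(intent: str, top_k: int = 2) -> list[str]:
    """Return the top-k API names whose documentation best matches the intent."""
    intent_emb = retriever.encode(intent, convert_to_tensor=True)
    hits = util.semantic_search(intent_emb, doc_embeddings, top_k=top_k)[0]
    return [api_names[hit["corpus_id"]] for hit in hits]

def generate_code(intent: str, model_name: str = "Salesforce/codegen-350M-mono") -> str:
    """APICoder stand-in: prompt an off-the-shelf code LM with retrieved API docs."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = "\n".join(f"# {name}: {api_docs[name]}" for name in retrieve_apis(intent))
    prompt += f"\n# Task: {intent}\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_code("read sales.csv and sort the rows by the price column"))
```

Because only documentation is embedded at retrieval time, the same two modules can be pointed at a private library's docs without seeing any private code during training, which is the generalization the abstract emphasizes.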
Related papers
- A Systematic Evaluation of Large Code Models in API Suggestion: When, Which, and How [53.65636914757381]
API suggestion is a critical task in modern software development.
Recent advancements in large code models (LCMs) have shown promise in the API suggestion task.
arXiv Detail & Related papers (2024-09-20T03:12:35Z) - Lightweight Syntactic API Usage Analysis with UCov [0.0]
We present a novel conceptual framework designed to assist library maintainers in understanding the interactions allowed by their APIs.
These customizable models enable library maintainers to improve their design ahead of release, reducing friction during evolution.
We implement these models for Java libraries in a new tool UCov and demonstrate its capabilities on three libraries exhibiting diverse styles of interaction.
arXiv Detail & Related papers (2024-02-19T10:33:41Z) - Pop Quiz! Do Pre-trained Code Models Possess Knowledge of Correct API
Names? [28.86399157983769]
Recent breakthroughs in pre-trained code models, such as CodeBERT and Codex, have shown their superior performance in various downstream tasks.
Recent studies reveal that even state-of-the-art pre-trained code models struggle with suggesting the correct APIs during code generation.
arXiv Detail & Related papers (2023-09-14T15:46:41Z) - Private-Library-Oriented Code Generation with Large Language Models [52.73999698194344]
This paper focuses on utilizing large language models (LLMs) for code generation in private libraries.
We propose a novel framework that emulates the process of programmers writing private code.
We create four private library benchmarks, including TorchDataEval, TorchDataComplexEval, MonkeyEval, and BeatNumEval.
arXiv Detail & Related papers (2023-07-28T07:43:13Z) - Evaluating Embedding APIs for Information Retrieval [51.24236853841468]
We evaluate the capabilities of existing semantic embedding APIs on domain generalization and multilingual retrieval.
We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective in English (a minimal sketch of this re-ranking setup appears after this list).
For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost.
arXiv Detail & Related papers (2023-05-10T16:40:52Z) - DocCoder: Generating Code by Retrieving and Reading Docs [87.88474546826913]
We introduce DocCoder, an approach that explicitly leverages code manuals and documentation.
Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model.
arXiv Detail & Related papers (2022-07-13T06:47:51Z) - CERT: Continual Pre-Training on Sketches for Library-Oriented Code
Generation [46.45445767488915]
We show how to leverage an unlabelled code corpus to train a model for library-oriented code generation.
We craft two benchmarks named PandasEval and NumpyEval to evaluate library-oriented code generation.
arXiv Detail & Related papers (2022-06-14T14:44:34Z) - A Systematic Evaluation of Large Language Models of Code [88.34057460577957]
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions.
Although Codex is not open-source, we find that existing open-source models do achieve close results in some programming languages.
We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine.
arXiv Detail & Related papers (2022-02-26T15:53:55Z) - Code Generation for Unknown Libraries via Reading API Documentations [10.122354606820416]
We consider the challenge of code generation for unknown libraries without additional training.
We implement a model that can extract relevant code signatures from API documentations based on a natural language intent.
arXiv Detail & Related papers (2022-02-16T00:36:33Z)