When Language Model Meets Private Library
- URL: http://arxiv.org/abs/2210.17236v1
- Date: Mon, 31 Oct 2022 11:42:06 GMT
- Title: When Language Model Meets Private Library
- Authors: Daoguang Zan, Bei Chen, Zeqi Lin, Bei Guan, Yongji Wang, Jian-Guang
Lou
- Abstract summary: In practice, it is common for programmers to write code using private libraries.
This is a challenge for language models since they have never seen private APIs during training.
We propose a novel framework with two modules: the APIRetriever finds useful APIs, and then the APICoder generates code using these APIs.
- Score: 25.610036042971043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the rapid development of pre-training techniques, a number of language
models have been pre-trained on large-scale code corpora and perform well in
code generation. In this paper, we investigate how to equip pre-trained
language models with the ability of code generation for private libraries. In
practice, it is common for programmers to write code using private libraries.
However, this is a challenge for language models since they have never seen
private APIs during training. Motivated by the fact that private libraries
usually come with elaborate API documentation, we propose a novel framework
with two modules: the APIRetriever finds useful APIs, and then the APICoder
generates code using these APIs. For APIRetriever, we present a dense retrieval
system and also design a friendly interaction to involve users. For APICoder, we
can directly use off-the-shelf language models, or continually pre-train the
base model on a code corpus containing API information. Both modules are
trained with data from public libraries and can be generalized to private ones.
Furthermore, we craft three benchmarks for private libraries, named
TorchDataEval, MonkeyEval, and BeatNumEval. Experimental results demonstrate
the impressive performance of our framework.
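The framework described in the abstract follows a retrieve-then-generate pattern: a dense APIRetriever scores private-library API documentation against the programmer's intent, and an APICoder then generates code conditioned on the retrieved APIs. Below is a minimal sketch of that pattern only, assuming a sentence-transformers bi-encoder as a stand-in retriever and an off-the-shelf code LM as the coder; the model choices and all names (`api_docs`, `retrieve_apis`, `generate_code`) are illustrative and not taken from the paper's released code.

```python
# Minimal sketch of a retrieve-then-generate flow in the spirit of
# APIRetriever + APICoder. Model choices and all names here are
# illustrative stand-ins, not the authors' implementation.
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical private-library API documentation: {API name: one-line description}.
api_docs = {
    "monkey.read_csv": "Read a comma-separated values file into a DataFrame.",
    "monkey.DataFrame.sort_values": "Sort a DataFrame by the values of a column.",
    "monkey.DataFrame.apply_map": "Apply a function element-wise to a DataFrame.",
}
api_names = list(api_docs)

# APIRetriever stand-in: a dense bi-encoder over the API descriptions.
retriever = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = retriever.encode(list(api_docs.values()), convert_to_tensor=True)

def retrieve_apis(intent: str, top_k: int = 2) -> list[str]:
    """Return the top-k API names whose documentation best matches the intent."""
    intent_emb = retriever.encode(intent, convert_to_tensor=True)
    hits = util.semantic_search(intent_emb, doc_embeddings, top_k=top_k)[0]
    return [api_names[hit["corpus_id"]] for hit in hits]

def generate_code(intent: str, model_name: str = "Salesforce/codegen-350M-mono") -> str:
    """APICoder stand-in: prompt an off-the-shelf code LM with retrieved API docs."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    prompt = "\n".join(f"# {name}: {api_docs[name]}" for name in retrieve_apis(intent))
    prompt += f"\n# Task: {intent}\n"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_code("read sales.csv and sort the rows by the price column"))
```

Because only documentation is embedded at retrieval time, the same two modules can be pointed at a private library's docs without seeing any private code during training, which is the generalization the abstract emphasizes.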
Related papers
- A Systematic Evaluation of Large Code Models in API Suggestion: When, Which, and How [53.65636914757381]
API suggestion is a critical task in modern software development.
Recent advancements in large code models (LCMs) have shown promise in the API suggestion task.
arXiv Detail & Related papers (2024-09-20T03:12:35Z) - Lightweight Syntactic API Usage Analysis with UCov [0.0]
We present a novel conceptual framework designed to assist library maintainers in understanding the interactions allowed by their APIs.
These customizable models enable library maintainers to improve their design ahead of release, reducing friction during evolution.
We implement these models for Java libraries in a new tool UCov and demonstrate its capabilities on three libraries exhibiting diverse styles of interaction.
arXiv Detail & Related papers (2024-02-19T10:33:41Z) - Pop Quiz! Do Pre-trained Code Models Possess Knowledge of Correct API
Names? [28.86399157983769]
Recent breakthroughs in pre-trained code models, such as CodeBERT and Codex, have shown their superior performance in various downstream tasks.
Recent studies reveal that even state-of-the-art pre-trained code models struggle with suggesting the correct APIs during code generation.
arXiv Detail & Related papers (2023-09-14T15:46:41Z) - Private-Library-Oriented Code Generation with Large Language Models [52.73999698194344]
This paper focuses on utilizing large language models (LLMs) for code generation in private libraries.
We propose a novel framework that emulates the process of programmers writing private code.
We create four private library benchmarks, including TorchDataEval, TorchDataComplexEval, MonkeyEval, and BeatNumEval.
arXiv Detail & Related papers (2023-07-28T07:43:13Z) - Evaluating Embedding APIs for Information Retrieval [51.24236853841468]
We evaluate the capabilities of existing semantic embedding APIs on domain generalization and multilingual retrieval.
We find that re-ranking BM25 results using the APIs is a budget-friendly approach and is most effective in English (a minimal sketch of this re-ranking setup appears after this list).
For non-English retrieval, re-ranking still improves the results, but a hybrid model with BM25 works best, albeit at a higher cost.
arXiv Detail & Related papers (2023-05-10T16:40:52Z) - DocCoder: Generating Code by Retrieving and Reading Docs [87.88474546826913]
We introduce DocCoder, an approach that explicitly leverages code manuals and documentation.
Our approach is general, can be applied to any programming language, and is agnostic to the underlying neural model.
arXiv Detail & Related papers (2022-07-13T06:47:51Z) - CERT: Continual Pre-Training on Sketches for Library-Oriented Code
Generation [46.45445767488915]
We show how to leverage an unlabelled code corpus to train a model for library-oriented code generation.
We craft two benchmarks named PandasEval and NumpyEval to evaluate library-oriented code generation.
arXiv Detail & Related papers (2022-06-14T14:44:34Z) - A Systematic Evaluation of Large Language Models of Code [88.34057460577957]
Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions.
The current state-of-the-art code LMs are not publicly available, leaving many questions about their model and data design decisions.
Although Codex is not open-source, we find that existing open-source models do achieve close results in some programming languages.
We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine.
arXiv Detail & Related papers (2022-02-26T15:53:55Z) - Code Generation for Unknown Libraries via Reading API Documentations [10.122354606820416]
We consider the challenge of code generation for unknown libraries without additional training.
We implement a model that can extract relevant code signatures from API documentations based on a natural language intent.
arXiv Detail & Related papers (2022-02-16T00:36:33Z)