Domain Adaptive Code Completion via Language Models and Decoupled Domain
Databases
- URL: http://arxiv.org/abs/2308.09313v2
- Date: Wed, 20 Sep 2023 04:33:09 GMT
- Title: Domain Adaptive Code Completion via Language Models and Decoupled Domain
Databases
- Authors: Ze Tang, Jidong Ge, Shangqing Liu, Tingwei Zhu, Tongtong Xu, Liguo
Huang, Bin Luo
- Abstract summary: $k$NM-LM is a retrieval-augmented language model that integrates domain knowledge into language models without fine-tuning.
Our approach is able to automatically adapt to different language models and domains.
- Score: 15.964849180459675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable performance in code
completion. However, due to the lack of domain-specific knowledge, they may be
suboptimal when completing code that requires intensive domain knowledge, for
example completing library names. Although several works have confirmed the
effectiveness of fine-tuning techniques for adapting language models to code
completion in specific domains, they are limited by the need to repeatedly
fine-tune the model as the project keeps iterating.
To address this limitation, in this paper we propose $k$NM-LM, a
retrieval-augmented language model (R-LM) that integrates domain knowledge
into language models without fine-tuning. Unlike previous techniques, our
approach automatically adapts to different language models and domains.
Specifically, it uses the in-domain code to build a retrieval-based database
decoupled from the LM, and then combines the two through Bayesian inference to
complete the code. Extensive experiments on intra-project and intra-scenario
completion confirm that $k$NM-LM brings appreciable improvements over CodeGPT
and UnixCoder. A deeper analysis of our tool, covering response speed, storage
usage, completion of specific token types, and API invocation completion,
confirms that $k$NM-LM performs satisfactorily, which makes it well suited for
domain adaptive code completion. Furthermore, our approach does not require
direct access to the language model's parameters, so it can seamlessly
integrate with black-box code completion models and serve as a plugin that
further enhances their performance.
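To make the retrieval-and-combination step concrete, below is a minimal, hypothetical sketch in the spirit of kNN-LM-style decoding: a datastore of (context vector, next token) pairs is built offline from in-domain code, the k nearest stored contexts are retrieved at each step, and the resulting retrieval distribution is merged with the LM's own next-token distribution. All names are illustrative, and the fixed interpolation weight is a simplification; $k$NM-LM itself combines the two distributions via Bayesian inference rather than a hand-tuned weight.

```python
# Illustrative sketch only (numpy): kNN-LM-style retrieval-augmented completion.
# $k$NM-LM combines the LM and retrieval distributions via Bayesian inference;
# a fixed interpolation weight is used here for simplicity.
import numpy as np

def build_datastore(context_vectors, next_tokens):
    """Offline step: store (context vector, next token) pairs mined from in-domain code."""
    return np.asarray(context_vectors, dtype=float), list(next_tokens)

def knn_distribution(query, keys, values, vocab, k=8, temperature=1.0):
    """Turn the k nearest stored contexts into a distribution over the vocabulary."""
    dists = np.linalg.norm(keys - query, axis=1)        # L2 distance to every stored context
    nearest = np.argsort(dists)[:k]                     # indices of the k closest contexts
    weights = np.exp(-dists[nearest] / temperature)     # closer neighbours get more mass
    weights /= weights.sum()
    p_knn = np.zeros(len(vocab))
    for idx, w in zip(nearest, weights):
        p_knn[vocab.index(values[idx])] += w            # accumulate mass on retrieved tokens
    return p_knn

def combine(p_lm, p_knn, lam=0.5):
    """Merge the two distributions (simple interpolation stands in for Bayesian inference)."""
    return (1.0 - lam) * p_lm + lam * p_knn

# Toy usage: retrieval pushes a generic LM toward an in-domain API name.
vocab = ["print", "load_config", "run"]
keys, values = build_datastore([[0.1, 0.9], [0.2, 0.8]], ["load_config", "load_config"])
p_lm = np.array([0.6, 0.1, 0.3])                        # the LM alone prefers "print"
query = np.array([0.15, 0.85])                          # hidden state of the current context
p = combine(p_lm, knn_distribution(query, keys, values, vocab, k=2))
print(vocab[int(np.argmax(p))])                         # -> "load_config"
```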
Related papers
- Retrieval-augmented code completion for local projects using large language models [0.0]
We focus on using large language models (LLMs) with around 160 million parameters that are suitable for local execution and augmentation with retrieval from local projects.
We train two models based on the transformer architecture, the generative model GPT-2 and the retrieval-adapted RETRO model, on open-source Python files.
We improve our models' performance with in-context retrieval-augmented generation, which retrieves code snippets based on the Jaccard similarity of tokens.
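As a rough, hypothetical illustration of that retrieval step, the sketch below ranks candidate snippets from the local project by the Jaccard similarity between their token sets and the tokens of the current completion context; the paper's actual tokenizer and retrieval pipeline may differ.

```python
# Hypothetical sketch: rank local-project snippets by Jaccard similarity of token sets.
def jaccard(a: set, b: set) -> float:
    """|A intersect B| / |A union B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_snippets(context: str, snippets: list, n: int = 3) -> list:
    """Return the n snippets whose tokens overlap most with the completion context."""
    ctx_tokens = set(context.split())                 # naive whitespace tokenization
    return sorted(snippets,
                  key=lambda s: jaccard(ctx_tokens, set(s.split())),
                  reverse=True)[:n]
```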
arXiv Detail & Related papers (2024-08-09T12:26:57Z)
- Learning to Decode Collaboratively with Multiple Language Models [37.31339648499042]
We propose a method to teach multiple large language models (LLMs) to collaborate by interleaving their generations at the token level.
Token-level collaboration during decoding allows for a fusion of each model's expertise in a manner tailored to the specific task at hand.
arXiv Detail & Related papers (2024-03-06T17:23:28Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z)
- Language Models are Universal Embedders [48.12992614723464]
We show that pre-trained transformer decoders can embed universally when finetuned on limited English data.
Our models achieve competitive performance on different embedding tasks with minimal training data.
These results provide evidence of a promising path towards building powerful unified embedders.
arXiv Detail & Related papers (2023-10-12T11:25:46Z)
- Adapting Large Language Models for Content Moderation: Pitfalls in Data Engineering and Supervised Fine-tuning [79.53130089003986]
Large Language Models (LLMs) have become a feasible solution for handling tasks in various domains.
In this paper, we show how to fine-tune an LLM that can be privately deployed for content moderation.
arXiv Detail & Related papers (2023-10-05T09:09:44Z)
- Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models [50.07056960586183]
We propose Position-enhanced Visual Instruction Tuning (PVIT) to extend the functionality of Multimodal Large Language Models (MLLMs).
This integration promotes a more detailed comprehension of images for the MLLM.
We present both quantitative experiments and qualitative analysis that demonstrate the superiority of the proposed model.
arXiv Detail & Related papers (2023-08-25T15:33:47Z)
- $k$NN-Adapter: Efficient Domain Adaptation for Black-Box Language Models [18.969047541720123]
$k$NN-Adapter is a method to adapt large language models to a new domain.
Experiments on four different domains demonstrate that $k$NN-Adapter significantly improves perplexity.
arXiv Detail & Related papers (2023-02-21T18:54:21Z)
- VarMAE: Pre-training of Variational Masked Autoencoder for Domain-adaptive Language Understanding [5.1282202633907]
We propose a novel Transformer-based language model named VarMAE for domain-adaptive language understanding.
Under the masked autoencoding objective, we design a context uncertainty learning module to encode the token's context into a smooth latent distribution.
Experiments on science- and finance-domain NLU tasks demonstrate that VarMAE can be efficiently adapted to new domains with limited resources.
arXiv Detail & Related papers (2022-11-01T12:51:51Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- KALA: Knowledge-Augmented Language Model Adaptation [65.92457495576141]
We propose a novel domain adaptation framework for pre-trained language models (PLMs).
Knowledge-Augmented Language model Adaptation (KALA) modulates the intermediate hidden representations of PLMs with domain knowledge.
Results show that, despite being computationally efficient, our KALA largely outperforms adaptive pre-training.
arXiv Detail & Related papers (2022-04-22T08:11:59Z)
- Cross-Domain Deep Code Search with Meta Learning [14.618183588410194]
We propose CroCS, a novel approach for domain-specific code search.
CroCS employs a transfer learning framework where an initial program representation model is pre-trained on a large corpus of common programming languages.
arXiv Detail & Related papers (2022-01-01T09:00:48Z)