Cross-Domain Deep Code Search with Meta Learning
- URL: http://arxiv.org/abs/2201.00150v6
- Date: Tue, 12 Mar 2024 05:31:50 GMT
- Title: Cross-Domain Deep Code Search with Meta Learning
- Authors: Yitian Chai, Hongyu Zhang, Beijun Shen, Xiaodong Gu
- Abstract summary: We propose CDCS, a novel approach for domain-specific code search.
CDCS employs a transfer learning framework where an initial program representation model is pre-trained on a large corpus of common programming languages.
- Score: 14.618183588410194
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, pre-trained programming language models such as CodeBERT have
demonstrated substantial gains in code search. Despite showing great
performance, they rely on the availability of large amounts of parallel data to
fine-tune the semantic mappings between queries and code. This restricts their
practicality in domain-specific languages with relatively scarce and expensive
data. In this paper, we propose CDCS, a novel approach for domain-specific
code search. CDCS employs a transfer learning framework where an initial
program representation model is pre-trained on a large corpus of common
programming languages (such as Java and Python) and is further adapted to
domain-specific languages such as SQL and Solidity. Unlike cross-language
CodeBERT, which is directly fine-tuned in the target language, CDCS adapts a
few-shot meta-learning algorithm called MAML to learn a good initialization
of model parameters that can be effectively reused in a domain-specific language. We
evaluate the proposed approach on two domain-specific languages, namely, SQL
and Solidity, with models transferred from two widely used languages (Python and
Java). Experimental results show that CDCS significantly outperforms
conventional pre-trained code models that are directly fine-tuned in
domain-specific languages, and it is particularly effective for scarce data.
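The abstract describes the recipe only at a high level (pre-train on high-resource languages such as Java and Python, meta-learn an initialization with MAML, then fine-tune on the low-resource target language such as SQL or Solidity). Below is a minimal first-order MAML-style sketch of the meta-learning stage; the encoder, the task sampler, and the in-batch contrastive loss are illustrative assumptions chosen for brevity, not the authors' released implementation.

```python
# Minimal first-order MAML sketch for learning an initialization of a
# pre-trained code encoder that adapts quickly to a new, low-resource
# language. The encoder, task sampler, and contrastive loss are
# hypothetical placeholders, not the paper's released code.
import copy
import torch
from torch import nn


def info_nce_loss(query_emb, code_emb, temperature=0.05):
    """In-batch contrastive loss commonly used for code search."""
    logits = query_emb @ code_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return nn.functional.cross_entropy(logits, labels)


def maml_meta_train(encoder, source_tasks, inner_lr=1e-4, meta_lr=1e-5,
                    inner_steps=1):
    """Meta-train `encoder` on tasks sampled from high-resource languages.

    `source_tasks` yields (support, query) pairs; each is a tuple of
    (query_inputs, code_inputs) tensors for one sampled task.
    """
    meta_opt = torch.optim.Adam(encoder.parameters(), lr=meta_lr)
    for support, query in source_tasks:
        # Inner loop: adapt a temporary copy of the encoder on the support set.
        fast = copy.deepcopy(encoder)
        inner_opt = torch.optim.SGD(fast.parameters(), lr=inner_lr)
        for _ in range(inner_steps):
            q_in, c_in = support
            loss = info_nce_loss(fast(q_in), fast(c_in))
            inner_opt.zero_grad()
            loss.backward()
            inner_opt.step()

        # Outer loop (first-order approximation): evaluate the adapted copy
        # on the query set and apply its gradients to the original encoder.
        q_in, c_in = query
        meta_loss = info_nce_loss(fast(q_in), fast(c_in))
        grads = torch.autograd.grad(meta_loss, fast.parameters())
        meta_opt.zero_grad()
        for p, g in zip(encoder.parameters(), grads):
            p.grad = g.detach()
        meta_opt.step()
    return encoder
```

After meta-training, the returned initialization would be fine-tuned on the small SQL or Solidity corpus in the same way as an ordinary pre-trained checkpoint.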
Related papers
- IRCoder: Intermediate Representations Make Language Models Robust Multilingual Code Generators [49.903001442804594]
This work investigates the prospect of leveraging compiler intermediate representations (IR) to improve the multilingual capabilities of Code-LMs.
We first compile SLTrans, a parallel dataset consisting of nearly 4M self-contained source code files.
Next, we carry out continued causal language modelling training on SLTrans, forcing the Code-LMs to learn the IR language.
Our resulting models, dubbed IRCoder, display sizeable and consistent gains across a wide variety of code generation tasks and metrics.
arXiv Detail & Related papers (2024-03-06T17:52:08Z) - GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization
in Programming Language Understanding [5.9535699822923]
We propose a new benchmark dataset called GenCodeSearchNet (GeCS) to evaluate the programming language understanding capabilities of language models.
As part of the full dataset, we introduce a new, manually curated subset StatCodeSearch that focuses on R, a popular but so far underrepresented programming language.
For evaluation and comparison, we collect several baseline results using fine-tuned BERT-style models and GPT-style large language models.
arXiv Detail & Related papers (2023-11-16T09:35:00Z) - Language Models are Universal Embedders [48.12992614723464]
We show that pre-trained transformer decoders can embed universally when finetuned on limited English data.
Our models achieve competitive performance on different embedding tasks with minimal training data.
These results provide evidence of a promising path towards building powerful unified embedders.
arXiv Detail & Related papers (2023-10-12T11:25:46Z) - Domain Adaptive Code Completion via Language Models and Decoupled Domain
Databases [15.964849180459675]
$k$NM-LM is a retrieval-augmented language model that integrates domain knowledge into language models without fine-tuning.
Our approach is able to automatically adapt to different language models and domains (a generic kNN-LM-style sketch of the underlying retrieval-and-interpolation idea appears after this list).
arXiv Detail & Related papers (2023-08-18T05:25:55Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z) - Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language
Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z) - Deep Graph Matching and Searching for Semantic Code Retrieval [76.51445515611469]
We propose DGMS, an end-to-end deep graph matching and searching model based on graph neural networks.
We first represent both natural language query texts and programming language code snippets as unified graph-structured data.
In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them.
arXiv Detail & Related papers (2020-10-24T14:16:50Z)
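The $k$NM-LM entry above names a retrieval-augmented technique in the spirit of kNN-LM; a generic sketch of the retrieval-and-interpolation idea is shown below. The datastore of (hidden state, next token) pairs, the distance-based weighting, and the fixed mixing weight are illustrative assumptions for a kNN-LM-style baseline, not the paper's exact procedure.

```python
# Generic kNN-LM-style interpolation sketch: mix the base LM's next-token
# distribution with a distribution induced by nearest neighbours retrieved
# from a domain datastore. All shapes and the mixing weight are assumptions.
import torch


def knn_interpolated_probs(lm_logits, hidden, keys, values, vocab_size,
                           k=8, temperature=1.0, lam=0.25):
    """Return lam * p_knn + (1 - lam) * p_lm for one decoding step.

    lm_logits: (vocab_size,) logits from the frozen base language model
    hidden:    (d,) hidden state used as the retrieval query
    keys:      (n, d) stored hidden states from the domain corpus
    values:    (n,) token id that followed each stored state (long tensor)
    """
    p_lm = torch.softmax(lm_logits, dim=-1)

    # Retrieve the k nearest stored states by Euclidean distance.
    dists = torch.cdist(hidden.unsqueeze(0), keys).squeeze(0)   # (n,)
    neg_dists, idx = torch.topk(-dists, k)                      # closest k
    weights = torch.softmax(neg_dists / temperature, dim=-1)    # closer => heavier

    # Scatter neighbour weights onto the token ids they predict.
    p_knn = torch.zeros(vocab_size, device=lm_logits.device)
    p_knn.scatter_add_(0, values[idx], weights)

    return lam * p_knn + (1 - lam) * p_lm
```

In this family of methods a new domain is handled by rebuilding only the datastore (`keys`, `values`) from that domain's code, leaving the base language model untouched, which is what makes the approach attractive when fine-tuning data is scarce.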
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.