GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization
in Programming Language Understanding
- URL: http://arxiv.org/abs/2311.09707v1
- Date: Thu, 16 Nov 2023 09:35:00 GMT
- Title: GenCodeSearchNet: A Benchmark Test Suite for Evaluating Generalization
in Programming Language Understanding
- Authors: Andor Diera, Abdelhalim Dahou, Lukas Galke, Fabian Karl, Florian
Sihler, Ansgar Scherp
- Abstract summary: We propose a new benchmark dataset called GenCodeSearchNet (GeCS) to evaluate the programming language understanding capabilities of language models.
As part of the full dataset, we introduce a new, manually curated subset StatCodeSearch that focuses on R, a popular but so far underrepresented programming language.
For evaluation and comparison, we collect several baseline results using fine-tuned BERT-style models and GPT-style large language models.
- Score: 5.9535699822923
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models can serve as a valuable tool for software developers to
increase productivity. Large generative models can be used for code generation
and code completion, while smaller encoder-only models are capable of
performing code search tasks using natural language queries. These capabilities
are heavily influenced by the quality and diversity of the available training
data. Source code datasets used for training usually focus on the most popular
languages and testing is mostly conducted on the same distributions, often
overlooking low-resource programming languages. Motivated by the NLP
generalization taxonomy proposed by Hupkes et al., we propose a new benchmark
dataset called GenCodeSearchNet (GeCS) which builds upon existing natural
language code search datasets to systematically evaluate the programming language
understanding generalization capabilities of language models. As part of the
full dataset, we introduce a new, manually curated subset StatCodeSearch that
focuses on R, a popular but so far underrepresented programming language that
is often used by researchers outside the field of computer science. For
evaluation and comparison, we collect several baseline results using fine-tuned
BERT-style models and GPT-style large language models in a zero-shot setting.
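To make the code-search setting concrete, the sketch below scores natural-language queries against code snippets with a BERT-style encoder via cosine similarity and reports mean reciprocal rank. It is a minimal illustration under assumed choices (the checkpoint name, mean pooling, and MRR as the metric are illustrative assumptions), not the authors' baseline pipeline.

```python
# Minimal sketch of natural-language-to-code retrieval scoring (not the paper's exact setup).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "microsoft/codebert-base"  # assumption: any BERT-style encoder could be used here
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME).eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool the last hidden states into one L2-normalized vector per input."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=256,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (B, H)
    return torch.nn.functional.normalize(pooled, dim=-1)

def mean_reciprocal_rank(queries: list[str], snippets: list[str]) -> float:
    """Assume query i matches snippet i; rank candidates by cosine similarity."""
    sims = embed(queries) @ embed(snippets).T              # (Q, S) similarity matrix
    order = sims.argsort(dim=1, descending=True)           # candidate indices, best first
    targets = torch.arange(len(queries)).unsqueeze(1)
    ranks = (order == targets).float().argmax(dim=1) + 1   # 1-based rank of the true snippet
    return float((1.0 / ranks).mean())

# Toy example with R snippets, in the spirit of the StatCodeSearch subset.
queries = ["fit a linear regression model", "read a CSV file into a data frame"]
snippets = ["model <- lm(y ~ x, data = df)", "df <- read.csv('data.csv')"]
print(f"MRR: {mean_reciprocal_rank(queries, snippets):.3f}")
```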
Related papers
- CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation [58.84212778960507]
We propose CodeGRAG, a Graphical Retrieval Augmented Code Generation framework to enhance the performance of LLMs.
CodeGRAG builds a graphical view of code blocks based on their control flow and data flow to bridge the gap between programming languages and natural language.
Various experiments and ablations on four datasets, covering both C++ and Python, validate the hard meta-graph prompt, the soft prompting technique, and the effectiveness of the objectives for the pretrained GNN expert.
arXiv Detail & Related papers (2024-05-03T02:48:55Z)
- Language Models are Universal Embedders [48.12992614723464]
We show that pre-trained transformer decoders can embed universally when finetuned on limited English data.
Our models achieve competitive performance on different embedding tasks with minimal training data.
These results provide evidence of a promising path towards building powerful unified embedders.
arXiv Detail & Related papers (2023-10-12T11:25:46Z)
- Constructing Multilingual Code Search Dataset Using Neural Machine Translation [48.32329232202801]
We create a multilingual code search dataset in four natural and four programming languages.
Our results show that the model pre-trained on all natural and programming language data performs best in most cases.
arXiv Detail & Related papers (2023-06-27T16:42:36Z)
- XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages [105.54207724678767]
Data scarcity is a crucial issue for the development of highly multilingual NLP systems.
We propose XTREME-UP, a benchmark defined by its focus on the scarce-data scenario rather than zero-shot.
XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies.
arXiv Detail & Related papers (2023-05-19T18:00:03Z)
- Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z)
- Cross-Domain Deep Code Search with Meta Learning [14.618183588410194]
We propose CroCS, a novel approach for domain-specific code search.
CroCS employs a transfer learning framework where an initial program representation model is pre-trained on a large corpus of common programming languages.
arXiv Detail & Related papers (2022-01-01T09:00:48Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Automated Source Code Generation and Auto-completion Using Deep Learning: Comparing and Discussing Current Language-Model-Related Approaches [0.0]
This paper compares different deep learning architectures to create and use language models based on programming code.
We discuss each approach's strengths and weaknesses, and the gaps we identify in evaluating these language models or applying them in a real programming context.
arXiv Detail & Related papers (2020-09-16T15:17:04Z)