From Effectiveness to Efficiency: Comparative Evaluation of Code Generated by LCGMs for Bilingual Programming Questions
- URL: http://arxiv.org/abs/2406.00602v1
- Date: Sun, 2 Jun 2024 03:22:30 GMT
- Title: From Effectiveness to Efficiency: Comparative Evaluation of Code Generated by LCGMs for Bilingual Programming Questions
- Authors: Weipeng Jiang, Xuanqi Gao, Juan Zhai, Shiqing Ma, Xiaoyu Zhang, Chao Shen,
- Abstract summary: Large Code Generation Models (LCGMs) have garnered significant attention and achieved promising results across various programming tasks.
Existing benchmarks often rely on English programming questions and limited manual unit test cases, inadequately assessing LCGM-generated code quality.
This paper investigates code quality differences, specifically effectiveness and efficiency, when employing different natural languages as inputs.
- Score: 32.464611304079234
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Code Generation Models (LCGMs) have garnered significant attention and achieved promising results across various programming tasks. However, concerns arise regarding performance when using non-English prompts, as these models are primarily trained on English-centric corpora, and most programming language tokens resemble English. Existing benchmarks often rely on English programming questions and limited manual unit test cases, inadequately assessing LCGM-generated code quality. This paper investigates code quality differences, specifically effectiveness and efficiency, when employing different natural languages as inputs, focusing on Chinese and English due to their prominent corpora and LCGM availability. Evaluating LCGM-generated code quality under bilingual inputs presents three challenges: (1) lack of high-quality bilingual programming question datasets, (2) insufficient unit test cases for comprehensive correctness verification, and (3) limited support for comparing generated code performance. To address these challenges, we curated a test suite of 52 bilingual programming questions and developed automated input generators for each. We enhanced correctness verification by sampling larger unit test cases and estimated code performance by profiling execution time relative to input size growth. Using this framework, we conducted an empirical study on six state-of-the-art LCGMs. The results revealed that LCGM-generated code exhibits varying bilingual correctness on an average of 10.5% of tasks, with 39.5% of correct code showing diverse bilingual performance differences. Our findings suggested LCGMs may not consistently generate high-quality code across different languages, providing insights for future research directions.
Related papers
- A Preliminary Study of Multilingual Code Language Models for Code Generation Task Using Translated Benchmarks [0.0]
We evaluate the performance of Poly-Coder, a pioneering open-source, multilingual CLM built for code generation.
Our results suggest that the outcomes observed in these translated benchmarks align well with evaluation metrics used during the training phase.
These initial insights highlight the need for more comprehensive empirical studies.
arXiv Detail & Related papers (2024-11-23T06:40:47Z) - Crystal: Illuminating LLM Abilities on Language and Code [58.5467653736537]
We propose a pretraining strategy to enhance the integration of natural language and coding capabilities.
The resulting model, Crystal, demonstrates remarkable capabilities in both domains.
arXiv Detail & Related papers (2024-11-06T10:28:46Z) - mHumanEval -- A Multilingual Benchmark to Evaluate Large Language Models for Code Generation [28.531581489405745]
mHumanEval is an extended benchmark supporting prompts in over 200 natural languages.
We provide expert human translations for 15 diverse natural languages (NLs)
We conclude by analyzing the multilingual code generation capabilities of state-of-the-art (SOTA) Code LLMs.
arXiv Detail & Related papers (2024-10-19T08:44:26Z) - Examination of Code generated by Large Language Models [35.51378656555693]
Large language models (LLMs) are transforming software development by automating code generation.
To assess the current state of LLMs in generating correct code of high quality, we conducted controlled experiments with ChatGPT and Copilot.
We observed significant differences between the LLMs, between the languages, between algorithm and test codes, and over time.
arXiv Detail & Related papers (2024-08-29T15:12:16Z) - Large Language Models for cross-language code clone detection [3.5202378300682162]
Cross-lingual code clone detection has gained traction with the software engineering community.
Inspired by the significant advances in machine learning, this paper revisits cross-lingual code clone detection.
arXiv Detail & Related papers (2024-08-08T12:57:14Z) - What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated as compared to canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z) - The Power of Question Translation Training in Multilingual Reasoning: Broadened Scope and Deepened Insights [108.40766216456413]
We propose a question alignment framework to bridge the gap between large language models' English and non-English performance.
Experiment results show it can boost multilingual performance across diverse reasoning scenarios, model families, and sizes.
We analyze representation space, generated response and data scales, and reveal how question translation training strengthens language alignment within LLMs.
arXiv Detail & Related papers (2024-05-02T14:49:50Z) - Exploring Multi-Lingual Bias of Large Code Models in Code Generation [55.336629780101475]
Code generation aims to synthesize code and fulfill functional requirements based on natural language (NL) specifications.
Despite the effectiveness, we observe a noticeable multilingual bias in the generation performance of large code models (LCMs)
LCMs demonstrate proficiency in generating solutions when provided with instructions in English, yet may falter when faced with semantically equivalent instructions in other NLs such as Chinese.
arXiv Detail & Related papers (2024-04-30T08:51:49Z) - Exploring the Impact of the Output Format on the Evaluation of Large Language Models for Code Translation [8.81447711370817]
We empirically analyze the generated outputs of eleven popular instruct-tuned large language models (LLMs)
Our results demonstrate that a strategic combination of prompt engineering and regular expression can effectively extract the source code from the model generation output.
arXiv Detail & Related papers (2024-03-25T21:41:31Z) - AdaCCD: Adaptive Semantic Contrasts Discovery Based Cross Lingual
Adaptation for Code Clone Detection [69.79627042058048]
AdaCCD is a novel cross-lingual adaptation method that can detect cloned codes in a new language without annotations in that language.
We evaluate the cross-lingual adaptation results of AdaCCD by constructing a multilingual code clone detection benchmark consisting of 5 programming languages.
arXiv Detail & Related papers (2023-11-13T12:20:48Z) - Testing LLMs on Code Generation with Varying Levels of Prompt
Specificity [0.0]
Large language models (LLMs) have demonstrated unparalleled prowess in mimicking human-like text generation and processing.
The potential to transform natural language prompts into executable code promises a major shift in software development practices.
arXiv Detail & Related papers (2023-11-10T23:41:41Z) - CodeFuse-13B: A Pretrained Multi-lingual Code Large Language Model [58.127534002232096]
This paper introduces CodeFuse-13B, an open-sourced pre-trained code LLM.
It is specifically designed for code-related tasks with both English and Chinese prompts.
CodeFuse achieves its effectiveness by utilizing a high quality pre-training dataset.
arXiv Detail & Related papers (2023-10-10T02:38:44Z) - L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs)
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z) - CodeApex: A Bilingual Programming Evaluation Benchmark for Large
Language Models [43.655927559990616]
We propose CodeApex, a benchmark dataset focusing on the programming comprehension, code generation, and code correction abilities of LLMs.
We evaluate 12 widely used LLMs, including both general-purpose and specialized models.
GPT-4 exhibits the best programming capabilities, achieving approximate accuracy of 69%, 54%, and 66% on the three tasks, respectively.
arXiv Detail & Related papers (2023-09-05T04:12:01Z) - Towards Generating Functionally Correct Code Edits from Natural Language
Issue Descriptions [11.327913840111378]
We introduce Defects4J-NL2Fix, a dataset of 283 Java programs from the popular Defects4J dataset augmented with high-level descriptions of bug fixes.
We empirically evaluate the performance of several state-of-the-art LLMs for the this task.
arXiv Detail & Related papers (2023-04-07T18:58:33Z) - LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results.
Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results.
LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z) - MCoNaLa: A Benchmark for Code Generation from Multiple Natural Languages [76.93265104421559]
We benchmark code generation from natural language commands extending beyond English.
We annotated a total of 896 NL-code pairs in three languages: Spanish, Japanese, and Russian.
While the difficulties vary across these three languages, all systems lag significantly behind their English counterparts.
arXiv Detail & Related papers (2022-03-16T04:21:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.