L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models
- URL: http://arxiv.org/abs/2309.17446v2
- Date: Mon, 2 Oct 2023 09:54:50 GMT
- Title: L2CEval: Evaluating Language-to-Code Generation Capabilities of Large
Language Models
- Authors: Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui
Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, Shafiq Joty, Yingbo
Zhou, Dragomir Radev, Arman Cohan
- Abstract summary: We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs)
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
- Score: 102.00201523306986
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recently, large language models (LLMs), especially those that are pretrained
on code, have demonstrated strong capabilities in generating programs from
natural language inputs in a few-shot or even zero-shot manner. Despite
promising results, there is a notable lack of a comprehensive evaluation of
these models' language-to-code generation capabilities. Existing studies often
focus on specific tasks, model architectures, or learning paradigms, leading to
a fragmented understanding of the overall landscape. In this work, we present
L2CEval, a systematic evaluation of the language-to-code generation
capabilities of LLMs on 7 tasks across the domain spectrum of semantic parsing,
math reasoning and Python programming, analyzing the factors that potentially
affect their performance, such as model size, pretraining data, instruction
tuning, and different prompting methods. In addition to assessing model
performance, we measure confidence calibration for the models and conduct human
evaluations of the output programs. This enables us to identify and analyze the
typical failure modes across various tasks and models. L2CEval offers a
comprehensive understanding of the capabilities and limitations of LLMs in
language-to-code generation. We also release the evaluation framework and all
model outputs, hoping to lay the groundwork for future research in this
domain.
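As a rough, self-contained illustration of the kind of few-shot, execution-based evaluation the abstract describes, the Python sketch below prompts a model with exemplars, executes the generated program against unit tests, and derives a sequence-level confidence from token log-probabilities. The few-shot data, the `generate_program` stub, and the geometric-mean confidence measure are illustrative assumptions, not the actual L2CEval framework or its calibration method.

```python
import math

# Hypothetical few-shot exemplars and test items for a toy Python-programming task.
# These are illustrative placeholders, not data from the L2CEval benchmark itself.
FEW_SHOT = [
    {"question": "Return the square of x.", "program": "def solve(x):\n    return x * x"},
]
TEST_ITEMS = [
    {"question": "Return the sum of a list xs.", "tests": [([1, 2, 3], 6), ([], 0)]},
]


def build_prompt(question: str) -> str:
    """Assemble a simple few-shot prompt: exemplars followed by the new question."""
    parts = [f"# Task: {ex['question']}\n{ex['program']}\n" for ex in FEW_SHOT]
    parts.append(f"# Task: {question}\n")
    return "\n".join(parts)


def generate_program(prompt: str):
    """Stand-in for an LLM call. A real run would send `prompt` to a code-pretrained
    model and return the sampled program plus its per-token log-probabilities."""
    program = "def solve(xs):\n    return sum(xs)"   # canned output for the demo
    token_logprobs = [-0.1, -0.2, -0.05, -0.3]        # fake log-probabilities
    return program, token_logprobs


def evaluate(items):
    """Execution-based scoring: run each generated program against its unit tests,
    and pair the pass/fail outcome with a sequence-level confidence estimate
    (geometric mean of token probabilities)."""
    results = []
    for item in items:
        program, logprobs = generate_program(build_prompt(item["question"]))
        confidence = math.exp(sum(logprobs) / max(len(logprobs), 1))
        namespace = {}
        try:
            exec(program, namespace)  # assumes trusted or sandboxed execution
            passed = all(namespace["solve"](x) == y for x, y in item["tests"])
        except Exception:
            passed = False
        results.append((passed, confidence))
    accuracy = sum(p for p, _ in results) / len(results)
    return accuracy, results


if __name__ == "__main__":
    acc, per_item = evaluate(TEST_ITEMS)
    print(f"execution accuracy: {acc:.2f}")
    for passed, conf in per_item:
        print(f"passed={passed}  confidence={conf:.2f}")
```

With (passed, confidence) pairs like these, calibration can then be summarized, for example, by binning confidences and comparing them to empirical pass rates, an expected-calibration-error style analysis.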
Related papers
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - Do Machines and Humans Focus on Similar Code? Exploring Explainability
of Large Language Models in Code Summarization [10.201463330812167]
We report negative results from our investigation of explainability of language models in code summarization through the lens of human comprehension.
We employ a state-of-the-art model-agnostic, black-box, perturbation-based approach, SHAP, to identify which code tokens influence the generation of summaries.
Our study finds that human focus does not align with SHAP-based measures of model focus.
arXiv Detail & Related papers (2024-02-22T00:01:02Z) - GlórIA - A Generative and Open Large Language Model for Portuguese [4.782288068552145]
We introduce GlórIA, a robust European Portuguese decoder LLM.
To pre-train GlórIA, we assembled a comprehensive PT-PT text corpus comprising 35 billion tokens from various sources.
Evaluation shows that GlórIA significantly outperforms existing open PT decoder models in language modeling.
arXiv Detail & Related papers (2024-02-20T12:36:40Z) - Exploring the Potential of Large Language Models in Computational Argumentation [54.85665903448207]
Large language models (LLMs) have demonstrated impressive capabilities in understanding context and generating natural language.
This work assesses LLMs such as ChatGPT, Flan models, and LLaMA2 models in both zero-shot and few-shot settings.
arXiv Detail & Related papers (2023-11-15T15:12:15Z) - A Survey of Large Language Models [81.06947636926638]
Language modeling has been widely studied for language understanding and generation in the past two decades.
Recently, pre-trained language models (PLMs) have been proposed by pre-training Transformer models over large-scale corpora.
To discriminate the difference in parameter scale, the research community has coined the term large language models (LLMs) for PLMs of significant size.
arXiv Detail & Related papers (2023-03-31T17:28:46Z) - MEGA: Multilingual Evaluation of Generative AI [23.109803506475174]
Generative AI models have shown impressive performance on many Natural Language Processing tasks.
Most studies on generative LLMs have been restricted to English.
It is unclear how capable these models are at understanding and generating text in other languages.
arXiv Detail & Related papers (2023-03-22T13:03:10Z) - Multi-lingual Evaluation of Code Generation Models [82.7357812992118]
We present new benchmarks for evaluating code generation models: MBXP, Multilingual HumanEval, and MathQA-X.
These datasets cover over 10 programming languages.
We are able to assess the performance of code generation models in a multi-lingual fashion.
arXiv Detail & Related papers (2022-10-26T17:17:06Z) - Curriculum: A Broad-Coverage Benchmark for Linguistic Phenomena in
Natural Language Understanding [1.827510863075184]
Curriculum is a new format of NLI benchmark for evaluation of broad-coverage linguistic phenomena.
We show that this linguistic-phenomena-driven benchmark can serve as an effective tool for diagnosing model behavior and verifying model learning quality.
arXiv Detail & Related papers (2022-04-13T10:32:03Z) - Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z) - On the Universality of Deep Contextual Language Models [15.218264849664715]
Deep Contextual Language Models (LMs) like ELMO, BERT, and their successors dominate the landscape of Natural Language Processing.
Multilingual versions of such models like XLM-R and mBERT have given promising results in zero-shot cross-lingual transfer.
Due to this initial success, pre-trained models are being used as 'Universal Language Models'.
arXiv Detail & Related papers (2021-09-15T08:00:33Z)