On the Reliability and Explainability of Language Models for Program
Generation
- URL: http://arxiv.org/abs/2302.09587v3
- Date: Mon, 8 Jan 2024 13:26:52 GMT
- Title: On the Reliability and Explainability of Language Models for Program
Generation
- Authors: Yue Liu, Chakkrit Tantithamthavorn, Yonghui Liu, Li Li
- Abstract summary: We study the capabilities and limitations of automated program generation approaches.
We employ advanced explainable AI approaches to highlight the tokens that significantly contribute to the code transformation.
Our analysis reveals that, in various experimental scenarios, language models can recognize code grammar and structural information, but they exhibit limited robustness to changes in input sequences.
- Score: 15.569926313298337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have adopted pre-trained language models, such as CodeT5 and
CodeGPT, for automated program generation tasks like code generation, repair,
and translation. Numerous language model-based approaches have been proposed
and evaluated on various benchmark datasets, demonstrating promising
performance. However, there is still uncertainty about the reliability of these
models, particularly their realistic ability to consistently transform code
sequences. This raises the question: are these techniques sufficiently
trustworthy for automated program generation? Consequently, further research is
needed to understand model logic and assess reliability and explainability. To
bridge these research gaps, we conduct a thorough empirical study of eight
popular language models on five representative datasets to determine the
capabilities and limitations of automated program generation approaches. We
further employ advanced explainable AI approaches to highlight the tokens that
significantly contribute to the code transformation. We discover that
state-of-the-art approaches suffer from inappropriate performance evaluation
stemming from severe data duplication, causing over-optimistic results. Our
explainability analysis reveals that, in various experimental scenarios,
language models can recognize code grammar and structural information, but they
exhibit limited robustness to changes in input sequences. Overall, more
rigorous evaluation approaches and benchmarks are critical to enhance the
reliability and explainability of automated program generation moving forward.
Our findings provide important guidelines for this goal.
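To make the data-duplication finding concrete, the following sketch checks how much of a test split also appears, exactly or near-verbatim, in the training split. It is illustrative only: the whitespace normalization, the difflib-based similarity, and the 0.9 threshold are assumptions, not the measurement procedure used in the paper.

```python
# Illustrative sketch of a train/test duplication check (not the paper's
# exact procedure). A high ratio of exact or near duplicates suggests the
# benchmark over-estimates how well a model generalizes.
from difflib import SequenceMatcher

def normalize(code: str) -> str:
    """Collapse whitespace so trivially reformatted clones still match."""
    return " ".join(code.split())

def duplication_report(train_samples, test_samples, near_threshold=0.9):
    """Return the fraction of test samples duplicated in the training set."""
    train_norm = {normalize(s) for s in train_samples}
    exact = near = 0
    for sample in test_samples:
        norm = normalize(sample)
        if norm in train_norm:
            exact += 1
        elif any(SequenceMatcher(None, norm, t).ratio() >= near_threshold
                 for t in train_norm):
            near += 1
    total = max(len(test_samples), 1)
    return {"exact": exact / total, "near": near / total}

if __name__ == "__main__":
    report = duplication_report(
        train_samples=["def add(a, b): return a + b"],
        test_samples=["def add(a,b):  return a+b",     # whitespace-only clone
                      "def sub(a, b): return a - b"],  # genuinely new sample
    )
    print(report)  # the first test sample is flagged as a near duplicate
```

In practice one would run token-level clone detection over the full benchmark corpora, but even a rough check like this surfaces the kind of train-test leakage that leads to over-optimistic results.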
Related papers
- On Uncertainty In Natural Language Processing [2.5076643086429993]
This thesis studies how uncertainty in natural language processing can be characterized from a linguistic, statistical and neural perspective.
We propose a method for calibrated sampling in natural language generation based on non-exchangeable conformal prediction.
Lastly, we develop an approach to quantify confidence in large black-box language models using auxiliary predictors.
arXiv Detail & Related papers (2024-10-04T14:08:02Z)
- SIaM: Self-Improving Code-Assisted Mathematical Reasoning of Large Language Models [54.78329741186446]
We propose a novel paradigm that uses a code-based critic model to guide steps including question-code data construction, quality control, and complementary evaluation.
Experiments across both in-domain and out-of-domain benchmarks in English and Chinese demonstrate the effectiveness of the proposed paradigm.
arXiv Detail & Related papers (2024-08-28T06:33:03Z)
- Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models [6.394084132117747]
We propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods for autoregressive language models.
Our technique generates fluent, in-distribution counterfactuals, making the evaluation protocol more reliable.
arXiv Detail & Related papers (2024-08-21T00:17:59Z)
- Improving Retrieval Augmented Language Model with Self-Reasoning [20.715106330314605]
We propose a novel self-reasoning framework aimed at improving the reliability and traceability of RALMs.
The framework involves constructing self-reason trajectories with three processes: a relevance-aware process, an evidence-aware selective process, and a trajectory analysis process.
We have evaluated our framework across four public datasets to demonstrate the superiority of our method.
arXiv Detail & Related papers (2024-07-29T09:05:10Z)
- L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models [102.00201523306986]
We present L2CEval, a systematic evaluation of the language-to-code generation capabilities of large language models (LLMs).
We analyze the factors that potentially affect their performance, such as model size, pretraining data, instruction tuning, and different prompting methods.
In addition to assessing model performance, we measure confidence calibration for the models and conduct human evaluations of the output programs.
arXiv Detail & Related papers (2023-09-29T17:57:00Z)
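The L2CEval entry above reports measuring confidence calibration. For reference, the sketch below computes expected calibration error (ECE), one common calibration metric; it is not necessarily the metric used in that paper, and the equal-width binning is a simplifying assumption.

```python
# Minimal sketch of expected calibration error (ECE), a common way to
# quantify confidence calibration; shown for reference, not as the exact
# metric used by L2CEval.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Example: a model that is 90% confident but only 60% correct is miscalibrated.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.9],
                                 [1, 1, 1, 0, 0]))  # ~0.3
```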
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
- Plex: Towards Reliability using Pretrained Large Model Extensions [69.13326436826227]
We develop ViT-Plex and T5-Plex, pretrained large model extensions for vision and language modalities, respectively.
Plex greatly improves the state-of-the-art across reliability tasks, and simplifies the traditional protocol.
We demonstrate scaling effects over model sizes up to 1B parameters and pretraining dataset sizes up to 4B examples.
arXiv Detail & Related papers (2022-07-15T11:39:37Z)
- Evaluating the Robustness of Neural Language Models to Input Perturbations [7.064032374579076]
In this study, we design and implement various types of character-level and word-level perturbation methods to simulate noisy input texts.
We investigate the ability of high-performance language models such as BERT, XLNet, RoBERTa, and ELMo in handling different types of input perturbations.
The results suggest that language models are sensitive to input perturbations and their performance can decrease even when small changes are introduced.
arXiv Detail & Related papers (2021-08-27T12:31:17Z)
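The input-perturbation study above builds on character-level and word-level noise. The sketch below shows two generic perturbations of that kind (adjacent-character swap and random word deletion); it illustrates the general technique and is not the perturbation set implemented in that paper.

```python
# Generic character-level and word-level perturbations of the kind used to
# probe language-model robustness; illustrative only, not the exact
# perturbation operators from the paper above.
import random

def swap_adjacent_chars(text: str, rng: random.Random) -> str:
    """Character-level noise: swap two adjacent characters in the text."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    chars = list(text)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def drop_random_word(text: str, rng: random.Random) -> str:
    """Word-level noise: delete one randomly chosen word."""
    words = text.split()
    if len(words) < 2:
        return text
    del words[rng.randrange(len(words))]
    return " ".join(words)

if __name__ == "__main__":
    rng = random.Random(0)
    sentence = "language models are sensitive to small input perturbations"
    print(swap_adjacent_chars(sentence, rng))
    print(drop_random_word(sentence, rng))
```

Robustness is then estimated by comparing model behavior (for example, exact match or BLEU against references) on clean versus perturbed inputs.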
- TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.