Benchmarking Causal Study to Interpret Large Language Models for Source Code
- URL: http://arxiv.org/abs/2308.12415v1
- Date: Wed, 23 Aug 2023 20:32:12 GMT
- Title: Benchmarking Causal Study to Interpret Large Language Models for Source Code
- Authors: Daniel Rodriguez-Cardenas, David N. Palacio, Dipin Khati, Henry Burke, Denys Poshyvanyk
- Abstract summary: This paper introduces a benchmarking strategy named Galeras, comprising curated testbeds for three SE tasks.
We illustrate the insights of our benchmarking strategy by conducting a case study on the performance of ChatGPT under distinct prompt engineering methods.
- Score: 6.301373791541809
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: One of the most common solutions adopted by software researchers to address
code generation is training Large Language Models (LLMs) on massive amounts
of source code. Although a number of studies have shown that LLMs have been
effectively evaluated on popular accuracy metrics (e.g., BLEU, CodeBleu),
previous research has largely overlooked the role of Causal Inference as a
fundamental component of the interpretability of LLMs' performance. Existing
benchmarks and datasets are meant to highlight the difference between the
expected and the generated outcome, but do not take into account confounding
variables (e.g., lines of code, prompt size) that equally influence the
accuracy metrics. The fact remains that, when dealing with generative software
tasks by LLMs, no benchmark is available to tell researchers how to quantify
either the causal effect of SE-based treatments or the correlation of
confounders with the model's performance. In an effort to bring statistical rigor
to the evaluation of LLMs, this paper introduces a benchmarking strategy named
Galeras, comprising curated testbeds for three SE tasks (i.e., code
completion, code summarization, and commit generation) to aid the
interpretation of LLMs' performance. We illustrate the insights of our
benchmarking strategy by conducting a case study on the performance of ChatGPT
under distinct prompt engineering methods. The results of the case study
demonstrate the positive causal influence of prompt semantics on ChatGPT's
generative performance by an average treatment effect of $\approx 3\%$.
Moreover, it was found that confounders such as prompt size are highly
correlated with accuracy metrics ($\approx 0.412\%$). The end result of our
case study is to showcase, in practice, how causal inference evaluations reduce
confounding bias. By reducing this bias, we offer an interpretable solution for
the accuracy metric under analysis.
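The abstract reports two quantities: an average treatment effect (ATE) of a prompt-engineering treatment on ChatGPT's generative performance, and the correlation of confounders such as prompt size with the accuracy metric. As a minimal sketch of what such an analysis can look like in practice, the Python snippet below contrasts a naive difference in means with a covariate-adjusted ATE obtained through a simple backdoor adjustment (linear regression), and computes the confounder-metric correlation. The synthetic testbed, the column roles (prompt size, binary treatment, BLEU-like metric), and the linear estimator are illustrative assumptions, not the actual Galeras pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Illustrative synthetic testbed (NOT Galeras data):
# z = confounder (prompt size in tokens), t = binary treatment
# (e.g., prompt includes semantic context), y = accuracy metric (e.g., BLEU).
z = rng.normal(200, 50, n)                                              # prompt size
t = (rng.random(n) < 1 / (1 + np.exp(-(z - 200) / 50))).astype(float)   # treatment depends on z
y = 0.30 + 0.03 * t + 0.0004 * z + rng.normal(0, 0.02, n)               # metric depends on t and z

# Naive estimate: difference in mean metric between treated and untreated prompts.
naive_ate = y[t == 1].mean() - y[t == 0].mean()

# Backdoor adjustment (assumed here): regress y on [1, t, z]; the coefficient
# on t is the covariate-adjusted ATE under a linear outcome model.
X = np.column_stack([np.ones(n), t, z])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
adjusted_ate = beta[1]

# Correlation of the confounder (prompt size) with the accuracy metric.
corr_z_y = np.corrcoef(z, y)[0, 1]

print(f"naive ATE:                 {naive_ate:.4f}")
print(f"adjusted ATE:              {adjusted_ate:.4f}")
print(f"corr(prompt size, metric): {corr_z_y:.3f}")
```

The gap between the naive and adjusted estimates is exactly the kind of confounding bias the paper argues a causal benchmark should make visible.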
Related papers
- Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters [27.656263126925815]
We study the scaling of inference-time computation in LLMs.
We find that the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt.
arXiv Detail & Related papers (2024-08-06T17:35:05Z)
- The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism [39.392450788666814]
Current evaluations of large language models (LLMs) often overlook non-determinism.
Greedy decoding generally outperforms sampling methods for most evaluated tasks.
Smaller LLMs can match or surpass larger models such as GPT-4-Turbo.
arXiv Detail & Related papers (2024-07-15T06:12:17Z)
- What's Wrong with Your Code Generated by Large Language Models? An Extensive Study [80.18342600996601]
Large language models (LLMs) produce code that is shorter yet more complicated compared to canonical solutions.
We develop a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyze the root cause for common bug types.
We propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback.
arXiv Detail & Related papers (2024-07-08T17:27:17Z)
- Evaluating Human Alignment and Model Faithfulness of LLM Rationale [66.75309523854476]
We study how well large language models (LLMs) explain their generations through rationales.
We show that prompting-based methods are less "faithful" than attribution-based explanations.
arXiv Detail & Related papers (2024-06-28T20:06:30Z)
- DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity.
Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data.
Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
- Automated Commit Message Generation with Large Language Models: An Empirical Study and Beyond [24.151927600694066]
Commit Message Generation (CMG) approaches aim to automatically generate commit messages based on given code diffs.
This paper conducts the first comprehensive experiment to investigate how far we have been in applying Large Language Models (LLMs) to generate high-quality commit messages.
arXiv Detail & Related papers (2024-04-23T08:24:43Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Evaluating the Factuality of Large Language Models using Large-Scale Knowledge Graphs [30.179703001666173]
The factuality issue is a critical concern for Large Language Models (LLMs).
We propose GraphEval to evaluate an LLM's performance using a substantially large test dataset.
The test dataset is retrieved from a large knowledge graph with more than 10 million facts, without expensive human effort.
arXiv Detail & Related papers (2024-04-01T06:01:17Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond [135.8013388183257]
We propose a new protocol for inconsistency detection benchmark creation and implement it in a 10-domain benchmark called SummEdits.
Most LLMs struggle on SummEdits, with performance close to random chance.
The best-performing model, GPT-4, is still 8% below estimated human performance.
arXiv Detail & Related papers (2023-05-23T21:50:06Z)
- Causal Reasoning and Large Language Models: Opening a New Frontier for Causality [29.433401785920065]
Large language models (LLMs) can generate causal arguments with high probability.
LLMs may be used by human domain experts to save effort in setting up a causal analysis.
arXiv Detail & Related papers (2023-04-28T19:00:43Z)