Turbulence: Systematically and Automatically Testing Instruction-Tuned
Large Language Models for Code
- URL: http://arxiv.org/abs/2312.14856v2
- Date: Sun, 14 Jan 2024 18:58:36 GMT
- Title: Turbulence: Systematically and Automatically Testing Instruction-Tuned
Large Language Models for Code
- Authors: Shahin Honarvar, Mark van der Wilk, Alastair Donaldson
- Abstract summary: We present a method for evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation via a new benchmark, Turbulence.
Turbulence consists of a large set of natural language $textitquestion templates$, each of which is a programming problem, parameterised so that it can be asked in many different forms.
From a single question template, it is possible to ask an LLM a $textitneighbourhood$ of very similar programming questions, and assess the correctness of the result returned for each question.
- Score: 12.58098809948832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method for systematically evaluating the correctness and
robustness of instruction-tuned large language models (LLMs) for code
generation via a new benchmark, Turbulence. Turbulence consists of a large set
of natural language $\textit{question templates}$, each of which is a
programming problem, parameterised so that it can be asked in many different
forms. Each question template has an associated $\textit{test oracle}$ that
judges whether a code solution returned by an LLM is correct. Thus, from a
single question template, it is possible to ask an LLM a
$\textit{neighbourhood}$ of very similar programming questions, and assess the
correctness of the result returned for each question. This allows gaps in an
LLM's code generation abilities to be identified, including
$\textit{anomalies}$ where the LLM correctly solves $\textit{almost all}$
questions in a neighbourhood but fails for particular parameter instantiations.
We present experiments against five LLMs from OpenAI, Cohere and Meta, each at
two temperature configurations. Our findings show that, across the board,
Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond
merely highlighting that LLMs sometimes produce wrong code (which is no
surprise): by systematically identifying cases where LLMs are able to solve
some problems in a neighbourhood but do not manage to generalise to solve the
whole neighbourhood, our method is effective at highlighting
$\textit{robustness}$ issues. We present data and examples that shed light on
the kinds of mistakes that LLMs make when they return incorrect code results.
Related papers
- Where Do Large Language Models Fail When Generating Code? [10.519984835232359]
Large Language Models (LLMs) have shown great potential in code generation.
It is unclear what kinds of code generation errors LLMs can make.
We analyzed incorrect code snippets generated by six popular LLMs on the HumanEval dataset.
arXiv Detail & Related papers (2024-06-13T01:29:52Z) - ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline [42.61538071832468]
Large language models (LLMs) have shown excellent mastering of human language, but still struggle in real-world applications that require mathematical problem-solving.
We tailor the Self-Critique pipeline, which addresses the challenge in the feedback learning stage of LLM alignment.
arXiv Detail & Related papers (2024-04-03T17:51:18Z) - LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback [65.84061725174269]
Recent large language models (LLM) are leveraging human feedback to improve their generation quality.
We propose LLMRefine, an inference time optimization method to refine LLM's output.
We conduct experiments on three text generation tasks, including machine translation, long-form question answering (QA), and topical summarization.
LLMRefine consistently outperforms all baseline approaches, achieving improvements up to 1.7 MetricX points on translation tasks, 8.1 ROUGE-L on ASQA, 2.2 ROUGE-L on topical summarization.
arXiv Detail & Related papers (2023-11-15T19:52:11Z) - Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named Rephrase and Respond' (RaR) which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z) - LPML: LLM-Prompting Markup Language for Mathematical Reasoning [8.995617701116142]
We propose a novel framework that integrates the Chain-of-Thought (CoT) method with an external tool (Python REPL)
Our approach enables LLMs to write the markup language and perform advanced mathematical reasoning using only zero-shot prompting.
arXiv Detail & Related papers (2023-09-21T02:46:20Z) - Question Answering as Programming for Solving Time-Sensitive Questions [84.07553016489769]
Question answering plays a pivotal role in human daily life because it involves our acquisition of knowledge about the world.
Recently, Large Language Models (LLMs) have shown remarkable intelligence in question answering.
This can be attributed to the LLMs' inability to perform rigorous reasoning based on surface-level text semantics.
We propose a novel approach where we reframe the $textbfQ$uestion $textbfA$rogrogering task.
arXiv Detail & Related papers (2023-05-23T16:35:16Z) - Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study [44.39031420687302]
Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks.
We try to understand this by designing a benchmark to evaluate the structural understanding capabilities of LLMs.
We propose $textitself-augmentation$ for effective structural prompting, such as critical value / range identification.
arXiv Detail & Related papers (2023-05-22T14:23:46Z) - Statistical Knowledge Assessment for Large Language Models [79.07989821512128]
Given varying prompts regarding a factoid question, can a large language model (LLM) reliably generate factually correct answers?
We propose KaRR, a statistical approach to assess factual knowledge for LLMs.
Our results reveal that the knowledge in LLMs with the same backbone architecture adheres to the scaling law, while tuning on instruction-following data sometimes compromises the model's capability to generate factually correct text reliably.
arXiv Detail & Related papers (2023-05-17T18:54:37Z) - LLM+P: Empowering Large Language Models with Optimal Planning
Proficiency [46.20085545432116]
Large language models (LLMs) have demonstrated remarkable zero-shot generalization abilities.
classical planners, once a problem is given in a formatted way, can use efficient search algorithms to quickly identify correct, or even optimal, plans.
This paper introduces LLM+P, the first framework that incorporates the strengths of classical planners into LLMs.
arXiv Detail & Related papers (2023-04-22T20:34:03Z) - Check Your Facts and Try Again: Improving Large Language Models with
External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes a LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z) - PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PaL) to understand natural language problems.
PaL offloads the solution step to a programmatic runtime such as a Python interpreter.
We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.