Turbulence: Systematically and Automatically Testing Instruction-Tuned
Large Language Models for Code
- URL: http://arxiv.org/abs/2312.14856v2
- Date: Sun, 14 Jan 2024 18:58:36 GMT
- Title: Turbulence: Systematically and Automatically Testing Instruction-Tuned
Large Language Models for Code
- Authors: Shahin Honarvar, Mark van der Wilk, Alastair Donaldson
- Abstract summary: We present a method for evaluating the correctness and robustness of instruction-tuned large language models (LLMs) for code generation via a new benchmark, Turbulence.
Turbulence consists of a large set of natural language $\textit{question templates}$, each of which is a programming problem, parameterised so that it can be asked in many different forms.
From a single question template, it is possible to ask an LLM a $\textit{neighbourhood}$ of very similar programming questions, and assess the correctness of the result returned for each question.
- Score: 12.58098809948832
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a method for systematically evaluating the correctness and
robustness of instruction-tuned large language models (LLMs) for code
generation via a new benchmark, Turbulence. Turbulence consists of a large set
of natural language $\textit{question templates}$, each of which is a
programming problem, parameterised so that it can be asked in many different
forms. Each question template has an associated $\textit{test oracle}$ that
judges whether a code solution returned by an LLM is correct. Thus, from a
single question template, it is possible to ask an LLM a
$\textit{neighbourhood}$ of very similar programming questions, and assess the
correctness of the result returned for each question. This allows gaps in an
LLM's code generation abilities to be identified, including
$\textit{anomalies}$ where the LLM correctly solves $\textit{almost all}$
questions in a neighbourhood but fails for particular parameter instantiations.
We present experiments against five LLMs from OpenAI, Cohere and Meta, each at
two temperature configurations. Our findings show that, across the board,
Turbulence is able to reveal gaps in LLM reasoning ability. This goes beyond
merely highlighting that LLMs sometimes produce wrong code (which is no
surprise): by systematically identifying cases where LLMs are able to solve
some problems in a neighbourhood but do not manage to generalise to solve the
whole neighbourhood, our method is effective at highlighting
$\textit{robustness}$ issues. We present data and examples that shed light on
the kinds of mistakes that LLMs make when they return incorrect code results.
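To make the abstract's terminology concrete, here is a minimal, hypothetical Python sketch of how a parameterised question template, its test oracle, and a neighbourhood of instantiations might fit together. The class and function names (QuestionTemplate, ask_llm, evaluate_neighbourhood, find_anomalies) are illustrative assumptions, not the benchmark's actual interface.

```python
# Sketch of the Turbulence idea (assumed names, not the paper's API):
# a parameterised question template, a matching test oracle, and a
# "neighbourhood" of instantiated questions evaluated against an LLM.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QuestionTemplate:
    prompt: str                          # natural-language template with an {n} parameter
    oracle: Callable[[str, int], bool]   # judges whether returned code solves instance n
    parameters: List[int]                # instantiations forming the neighbourhood


def evaluate_neighbourhood(template: QuestionTemplate,
                           ask_llm: Callable[[str], str]) -> Dict[int, bool]:
    """Ask the LLM every instance of the template and record pass/fail per parameter."""
    results: Dict[int, bool] = {}
    for n in template.parameters:
        question = template.prompt.format(n=n)
        candidate_code = ask_llm(question)           # LLM returns a code solution as text
        results[n] = template.oracle(candidate_code, n)
    return results


def find_anomalies(results: Dict[int, bool], threshold: float = 0.8) -> List[int]:
    """Anomaly: the LLM solves almost all of the neighbourhood but fails some instances."""
    pass_rate = sum(results.values()) / len(results)
    return [n for n, ok in results.items() if not ok] if pass_rate >= threshold else []


# Toy example template with a trivially checkable oracle (illustrative only;
# a real oracle would execute the returned code against generated tests).
template = QuestionTemplate(
    prompt="Write a Python function f(x) that returns x multiplied by {n}.",
    oracle=lambda code, n: "def f" in code,
    parameters=list(range(1, 11)),
)
```

In the abstract's terms, a low overall pass rate on a neighbourhood indicates a correctness gap, while a high pass rate with isolated failures flags a robustness anomaly tied to particular parameter instantiations.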
Related papers
- SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks.
We show that even the most prominent LLMs exhibit these error patterns in their outputs.
Researchers can use the analysis and insights from SPECTOOL to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z) - Capturing Sparks of Abstraction for the ARC Challenge [0.10878040851637999]
Even commercial Large Language Models (LLMs) struggle to 'understand' many of the problems.
We demonstrate that 'Sparks of Abstraction' can be extracted from the LLM output.
Both the arc-dsl-llm DSL framework and the Gemini LLM-generated data are made Open Source.
arXiv Detail & Related papers (2024-11-17T23:40:00Z) - Grounding by Trying: LLMs with Reinforcement Learning-Enhanced Retrieval [55.63711219190506]
Large language models (LLMs) often struggle with posing the right search queries.
We introduce $\underline{Le}$arning to $\underline{Re}$trieve by $\underline{T}$rying (LeReT).
LeReT can improve the absolute retrieval accuracy by up to 29% and the downstream generator evaluations by 17%.
arXiv Detail & Related papers (2024-10-30T17:02:54Z) - Are LLMs Aware that Some Questions are not Open-ended? [58.93124686141781]
We study whether Large Language Models are aware that some questions have limited answers and need to respond more deterministically.
The lack of question awareness in LLMs leads to two phenomena: (1) too casual to answer non-open-ended questions or (2) too boring to answer open-ended questions.
arXiv Detail & Related papers (2024-10-01T06:07:00Z) - LPML: LLM-Prompting Markup Language for Mathematical Reasoning [8.995617701116142]
We propose a novel framework that integrates the Chain-of-Thought (CoT) method with an external tool (Python REPL).
Our approach enables LLMs to write the markup language and perform advanced mathematical reasoning using only zero-shot prompting.
arXiv Detail & Related papers (2023-09-21T02:46:20Z) - Question Answering as Programming for Solving Time-Sensitive Questions [84.07553016489769]
Question answering plays a pivotal role in human daily life because it involves our acquisition of knowledge about the world.
Recently, Large Language Models (LLMs) have shown remarkable intelligence in question answering.
However, they often struggle with time-sensitive questions, which can be attributed to the LLMs' inability to perform rigorous reasoning based on surface-level text semantics.
We propose a novel approach where we reframe the $\textbf{Q}$uestion $\textbf{A}$nswering task as $\textbf{P}$rogramming.
arXiv Detail & Related papers (2023-05-23T16:35:16Z) - Table Meets LLM: Can Large Language Models Understand Structured Table Data? A Benchmark and Empirical Study [44.39031420687302]
Large language models (LLMs) are becoming attractive as few-shot reasoners to solve Natural Language (NL)-related tasks.
We try to understand this by designing a benchmark to evaluate the structural understanding capabilities of LLMs.
We propose $\textit{self-augmentation}$ for effective structural prompting, such as critical value / range identification.
arXiv Detail & Related papers (2023-05-22T14:23:46Z) - LLM+P: Empowering Large Language Models with Optimal Planning
Proficiency [46.20085545432116]
Large language models (LLMs) have demonstrated remarkable zero-shot generalization abilities.
In contrast, classical planners, once a problem is given in a formatted way, can use efficient search algorithms to quickly identify correct, or even optimal, plans.
This paper introduces LLM+P, the first framework that incorporates the strengths of classical planners into LLMs.
arXiv Detail & Related papers (2023-04-22T20:34:03Z) - Check Your Facts and Try Again: Improving Large Language Models with
External Knowledge and Automated Feedback [127.75419038610455]
Large language models (LLMs) are able to generate human-like, fluent responses for many downstream tasks.
This paper proposes an LLM-Augmenter system, which augments a black-box LLM with a set of plug-and-play modules.
arXiv Detail & Related papers (2023-02-24T18:48:43Z) - PAL: Program-aided Language Models [112.94785609781503]
We present Program-Aided Language models (PAL), which use an LLM to read natural language problems and generate programs as intermediate reasoning steps.
PAL offloads the solution step to a programmatic runtime such as a Python interpreter (a minimal sketch of this pattern follows the list below).
We set new state-of-the-art results in all 12 benchmarks.
arXiv Detail & Related papers (2022-11-18T18:56:13Z)
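As a rough illustration of the program-aided pattern summarised in the PAL entry above (not PAL's actual implementation), the sketch below assumes a hypothetical `generate_code` placeholder that returns Python source from an LLM, then offloads the solution step to the interpreter.

```python
# Hedged sketch of program-aided prompting: the LLM writes Python as its
# reasoning trace, and the Python runtime (not the LLM) computes the answer.
# `generate_code` is a placeholder for whatever LLM client is in use.


def generate_code(problem: str) -> str:
    # Placeholder: in practice this would prompt an LLM (few-shot) to translate
    # the word problem into executable Python that assigns to `answer`.
    return "answer = (3 * 4) + 5"


def solve(problem: str) -> object:
    program = generate_code(problem)
    namespace: dict = {}
    exec(program, namespace)   # offload the solution step to the interpreter
    return namespace.get("answer")


print(solve("Tom buys 3 packs of 4 apples, then 5 more. How many apples?"))  # -> 17
```

The design point is the division of labour: the model only has to produce a faithful program, while the arithmetic is delegated to a runtime that cannot make calculation errors.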