LLM4VV: Developing LLM-Driven Testsuite for Compiler Validation
- URL: http://arxiv.org/abs/2310.04963v3
- Date: Sun, 10 Mar 2024 21:05:28 GMT
- Title: LLM4VV: Developing LLM-Driven Testsuite for Compiler Validation
- Authors: Christian Munley, Aaron Jarmusch and Sunita Chandrasekaran
- Abstract summary: Large language models (LLMs) are a powerful tool for a wide span of applications involving natural language.
We explore the capabilities of state-of-the-art LLMs, including open-source LLMs -- Meta Codellama, Phind fine-tuned version of Codellama, Deepseek Deepseek Coder and closed-source LLMs -- OpenAI GPT-3.5-Turbo and GPT-4-Turbo.
- Score: 7.979116939578324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are a new and powerful tool for a wide span of
applications involving natural language and demonstrate impressive code
generation abilities. The goal of this work is to automatically generate tests
and use these tests to validate and verify compiler implementations of a
directive-based parallel programming paradigm, OpenACC. To do so, in this
paper, we explore the capabilities of state-of-the-art LLMs, including
open-source LLMs -- Meta Codellama, Phind fine-tuned version of Codellama,
Deepseek Deepseek Coder and closed-source LLMs -- OpenAI GPT-3.5-Turbo and
GPT-4-Turbo. We further fine-tuned the open-source LLMs and GPT-3.5-Turbo using
our own testsuite dataset along with using the OpenACC specification. We also
explored these LLMs using various prompt engineering techniques that include
code template, template with retrieval-augmented generation (RAG), one-shot
example, one-shot with RAG, expressive prompt with code template and RAG. This
paper highlights our findings from over 5000 tests generated via all the above
mentioned methods. Our contributions include: (a) exploring the capabilities of
the latest and relevant LLMs for code generation, (b) investigating fine-tuning
and prompt methods, and (c) analyzing the outcome of LLMs generated tests
including manually analysis of representative set of tests. We found the LLM
Deepseek-Coder-33b-Instruct produced the most passing tests followed by
GPT-4-Turbo.
Related papers
- Source Code Summarization in the Era of Large Language Models [23.715005053430957]
Large language models (LLMs) have led to a great boost in the performance of code-related tasks.
In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs.
arXiv Detail & Related papers (2024-07-09T05:48:42Z) - InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct [43.7550233177368]
We propose INVERSE-INSTRUCT, which summarizes instructions from code snippets instead of the reverse.
We present a series of code LLMs named InverseCoder, which surpasses the performance of the original code LLMs on a wide range of benchmarks.
arXiv Detail & Related papers (2024-07-08T08:00:05Z) - Applying RLAIF for Code Generation with API-usage in Lightweight LLMs [15.366324461797582]
Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains.
This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (1B parameters) LLMs.
arXiv Detail & Related papers (2024-06-28T17:16:03Z) - An Empirical Study of Unit Test Generation with Large Language Models [16.447000441006814]
Unit testing is an essential activity in software development for verifying the correctness of software components.
The emergence of Large Language Models (LLMs) offers a new direction for automating unit test generation.
arXiv Detail & Related papers (2024-06-26T08:57:03Z) - Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Large Language Models (LLMs) are routinely used in retrieval-augmented applications to orchestrate tasks and process inputs from users and other sources.
This opens the door to prompt injection attacks, where the LLM receives and acts upon instructions from supposedly data-only sources, thus deviating from the user's original instructions.
We define this as task drift, and we propose to catch it by scanning and analyzing the LLM's activations.
We show that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions, without being trained on any of these attacks.
arXiv Detail & Related papers (2024-06-02T16:53:21Z) - MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents [62.02920842630234]
We show how to build small models that have GPT-4-level performance but for 400x lower cost.
We unify pre-existing datasets into a benchmark LLM-AggreFact.
Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy.
arXiv Detail & Related papers (2024-04-16T17:59:10Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task
Completion [96.47420221442397]
We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to finish multi-turn, multi-modal instructions.
We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence.
The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
arXiv Detail & Related papers (2023-11-03T08:06:35Z) - Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs.
Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
arXiv Detail & Related papers (2023-02-22T17:44:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.