LLM4VV: Developing LLM-Driven Testsuite for Compiler Validation
- URL: http://arxiv.org/abs/2310.04963v3
- Date: Sun, 10 Mar 2024 21:05:28 GMT
- Title: LLM4VV: Developing LLM-Driven Testsuite for Compiler Validation
- Authors: Christian Munley, Aaron Jarmusch and Sunita Chandrasekaran
- Abstract summary: Large language models (LLMs) are a powerful tool for a wide span of applications involving natural language.
We explore the capabilities of state-of-the-art LLMs, including open-source LLMs -- Meta Codellama, Phind fine-tuned version of Codellama, Deepseek Deepseek Coder and closed-source LLMs -- OpenAI GPT-3.5-Turbo and GPT-4-Turbo.
- Score: 7.979116939578324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are a new and powerful tool for a wide span of
applications involving natural language and demonstrate impressive code
generation abilities. The goal of this work is to automatically generate tests
and use these tests to validate and verify compiler implementations of a
directive-based parallel programming paradigm, OpenACC. To do so, in this
paper, we explore the capabilities of state-of-the-art LLMs, including
open-source LLMs -- Meta Codellama, Phind fine-tuned version of Codellama,
Deepseek Deepseek Coder and closed-source LLMs -- OpenAI GPT-3.5-Turbo and
GPT-4-Turbo. We further fine-tuned the open-source LLMs and GPT-3.5-Turbo using
our own testsuite dataset along with using the OpenACC specification. We also
explored these LLMs using various prompt engineering techniques that include
code template, template with retrieval-augmented generation (RAG), one-shot
example, one-shot with RAG, expressive prompt with code template and RAG. This
paper highlights our findings from over 5000 tests generated via all the above
mentioned methods. Our contributions include: (a) exploring the capabilities of
the latest and relevant LLMs for code generation, (b) investigating fine-tuning
and prompt methods, and (c) analyzing the outcome of LLMs generated tests
including manually analysis of representative set of tests. We found the LLM
Deepseek-Coder-33b-Instruct produced the most passing tests followed by
GPT-4-Turbo.
Related papers
- Source Code Summarization in the Era of Large Language Models [23.715005053430957]
Large language models (LLMs) have led to a great boost in the performance of code-related tasks.
In this paper, we undertake a systematic and comprehensive study on code summarization in the era of LLMs.
arXiv Detail & Related papers (2024-07-09T05:48:42Z) - InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct [43.7550233177368]
We propose INVERSE-INSTRUCT, which summarizes instructions from code snippets instead of the reverse.
We present a series of code LLMs named InverseCoder, which surpasses the performance of the original code LLMs on a wide range of benchmarks.
arXiv Detail & Related papers (2024-07-08T08:00:05Z) - Applying RLAIF for Code Generation with API-usage in Lightweight LLMs [15.366324461797582]
Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains.
This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (1B parameters) LLMs.
arXiv Detail & Related papers (2024-06-28T17:16:03Z) - On the Evaluation of Large Language Models in Unit Test Generation [16.447000441006814]
Unit testing is an essential activity in software development for verifying the correctness of software components.
The emergence of Large Language Models (LLMs) offers a new direction for automating unit test generation.
arXiv Detail & Related papers (2024-06-26T08:57:03Z) - CodecLM: Aligning Language Models with Tailored Synthetic Data [51.59223474427153]
We introduce CodecLM, a framework for adaptively generating high-quality synthetic data for instruction-following abilities.
We first encode seed instructions into metadata, which are concise keywords generated on-the-fly to capture the target instruction distribution.
We also introduce Self-Rubrics and Contrastive Filtering during decoding to tailor data-efficient samples.
arXiv Detail & Related papers (2024-04-08T21:15:36Z) - PPTC-R benchmark: Towards Evaluating the Robustness of Large Language
Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels.
We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings.
We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z) - TAT-LLM: A Specialized Language Model for Discrete Reasoning over Tabular and Textual Data [73.29220562541204]
We consider harnessing the amazing power of language models (LLMs) to solve our task.
We develop a TAT-LLM language model by fine-tuning LLaMA 2 with the training data generated automatically from existing expert-annotated datasets.
arXiv Detail & Related papers (2024-01-24T04:28:50Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task
Completion [96.47420221442397]
We introduce the PowerPoint Task Completion benchmark to assess the ability of Large Language Models to finish multi-turn, multi-modal instructions.
We also propose the PPTX-Match Evaluation System that evaluates if LLMs finish the instruction based on the prediction file rather than the label API sequence.
The results show that GPT-4 outperforms other LLMs with 75.1% accuracy in single-turn dialogue testing but faces challenges in completing entire sessions, achieving just 6% session accuracy.
arXiv Detail & Related papers (2023-11-03T08:06:35Z) - Guiding Large Language Models via Directional Stimulus Prompting [114.84930073977672]
We introduce Directional Stimulus Prompting, a novel framework for guiding black-box large language models (LLMs) toward specific desired outputs.
Instead of directly adjusting LLMs, our method employs a small tunable policy model to generate an auxiliary directional stimulus prompt for each input instance.
arXiv Detail & Related papers (2023-02-22T17:44:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.