Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation
- URL: http://arxiv.org/abs/2408.11053v2
- Date: Mon, 03 Feb 2025 19:29:27 GMT
- Title: Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation
- Authors: Nathaniel Pinckney, Christopher Batten, Mingjie Liu, Haoxing Ren, Brucek Khailany,
- Abstract summary: We evaluate new commercial and open models since the release of the open-source VerilogEval benchmark.
We find measurable improvements in state-of-the-art models.
We find that prompt engineering remains crucial for achieving good pass rates.
- Score: 6.463959200930805
- License:
- Abstract: The application of large-language models (LLMs) to digital hardware code generation is an emerging field, with most LLMs primarily trained on natural language and software code. Hardware code like Verilog constitutes a small portion of training data, and few hardware benchmarks exist. The open-source VerilogEval benchmark, released in November 2023, provided a consistent evaluation framework for LLMs on code completion tasks. Since then, both commercial and open models have seen significant development. In this work, we evaluate new commercial and open models since VerilogEval's original release-including GPT-4o, GPT-4 Turbo, Llama3.1 (8B/70B/405B), Llama3 70B, Mistral Large, DeepSeek Coder (33B and 6.7B), CodeGemma 7B, and RTL-Coder-against an improved VerilogEval benchmark suite. We find measurable improvements in state-of-the-art models: GPT-4o achieves a 63% pass rate on specification-to-RTL tasks. The recently released and open Llama3.1 405B achieves a 58% pass rate, almost matching GPT-4o, while the smaller domain-specific RTL-Coder 6.7B models achieve an impressive 34% pass rate. Additionally, we enhance VerilogEval's infrastructure by automatically classifying failures, introducing in-context learning support, and extending the tasks to specification-to-RTL translation. We find that prompt engineering remains crucial for achieving good pass rates and varies widely with model and task. A benchmark infrastructure that allows for prompt engineering and failure analysis is essential for continued model development and deployment.
Related papers
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence.
Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks.
This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through textbfOutcome textbfREwtextbfArd-based reinforcement textbfLearning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z) - CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance [12.001043263281698]
Existing methods fail to steer Large Language Models (LLMs) between textual reasoning and code generation.
We introduce CodeSteer, an effective method for guiding LLM code/text generation.
Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4.
arXiv Detail & Related papers (2025-02-04T15:53:59Z) - LLM4CVE: Enabling Iterative Automated Vulnerability Repair with Large Language Models [9.946058168276744]
Large Language Models (LLM) have opened up the possibility for many software defects to be patched automatically.
We propose an iterative pipeline that robustly fixes vulnerable functions in real-world code with high accuracy.
We achieve a human-verified quality score of 8.51/10 and an increase in groundtruth code similarity of 20% with Llama 3 70B.
arXiv Detail & Related papers (2025-01-07T00:21:42Z) - PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback [78.89596149768458]
Large Language Models (LLMs) are widely adopted for assisting in software development tasks.
We propose PerfCodeGen, a training-free framework that enhances the performance of LLM-generated code.
arXiv Detail & Related papers (2024-11-18T06:22:38Z) - Can EDA Tool Feedback Improve Verilog Generation by LLMs? [25.596711210493172]
Large Language Models (LLMs) are emerging as a potential tool to help generate fully functioning HDL code.
We evaluate the ability of LLMs to leverage feedback from electronic design automation (EDA) tools to fix mistakes in their own generated Verilog.
arXiv Detail & Related papers (2024-11-01T17:33:28Z) - Large Language Models as Code Executors: An Exploratory Study [29.545321608864295]
This paper pioneers the exploration of Large Language Models (LLMs) as code executors.
We are the first to examine this feasibility across various LLMs, including OpenAI's o1, GPT-4o, GPT-3.5, DeepSeek, and Qwen-Coder.
We introduce an Iterative Instruction Prompting (IIP) technique that processes code snippets line by line, enhancing the accuracy of weaker models by an average of 7.22%.
arXiv Detail & Related papers (2024-10-09T08:23:22Z) - Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
In-Context Learning (ICL) and.
Efficient Fine-Tuning (PEFT) are currently two mainstream methods for augmenting.
LLMs to downstream tasks.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z) - SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents [50.82665351100067]
FlowGen is a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents.
We evaluate FlowGenScrum on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET.
arXiv Detail & Related papers (2024-03-23T14:04:48Z) - Data is all you need: Finetuning LLMs for Chip Design via an Automated design-data augmentation framework [50.02710905062184]
This paper proposes an automated design-data augmentation framework, which generates high-volume and high-quality natural language aligned with Verilog and EDA scripts.
The accuracy of Verilog generation surpasses that of the current state-of-the-art open-source Verilog generation model, increasing from 58.8% to 70.6% with the same benchmark.
arXiv Detail & Related papers (2024-03-17T13:01:03Z) - LLM4PLC: Harnessing Large Language Models for Verifiable Programming of
PLCs in Industrial Control Systems [9.946058168276744]
Large Language Models (LLMs) fail to produce valid programs for Industrial Control Systems (ICS) operated by Programmable Logic Controllers (PLCs)
We propose a user-guided iterative pipeline leveraging user feedback and external verification tools including grammar checkers, compilers and SMV verifiers.
We run a complete test suite on GPT-3.5, GPT-4, Code Llama-7B, a fine-tuned Code Llama-7B model, Code Llama-34B, and a fine-tuned Code Llama-34B model.
arXiv Detail & Related papers (2024-01-08T23:52:42Z) - Large Language Model-Aware In-Context Learning for Code Generation [75.68709482932903]
Large language models (LLMs) have shown impressive in-context learning (ICL) ability in code generation.
We propose a novel learning-based selection approach named LAIL (LLM-Aware In-context Learning) for code generation.
arXiv Detail & Related papers (2023-10-15T06:12:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.