Related papers: Benchmarking Large Language Models for Automated Verilog RTL Code Generation

Benchmarking Large Language Models for Automated Verilog RTL Code Generation

URL: http://arxiv.org/abs/2212.11140v1
Date: Tue, 13 Dec 2022 16:34:39 GMT
Title: Benchmarking Large Language Models for Automated Verilog RTL Code Generation
Authors: Shailja Thakur, Baleegh Ahmad, Zhenxing Fan, Hammond Pearce, Benjamin Tan, Ramesh Karri, Brendan Dolan-Gavitt, Siddharth Garg
Abstract summary: We characterize the ability of large language models (LLMs) to generate useful Verilog. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code. Our findings show that across our problem scenarios, the fine-tuning results in LLMs more capable of producing syntactically correct code.
Score: 21.747037230069854
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Automating hardware design could obviate a significant amount of human error from the engineering process and lead to fewer errors. Verilog is a popular hardware description language to model and design digital systems, thus generating Verilog code is a critical first step. Emerging large language models (LLMs) are able to write high-quality code in other programming languages. In this paper, we characterize the ability of LLMs to generate useful Verilog. For this, we fine-tune pre-trained LLMs on Verilog datasets collected from GitHub and Verilog textbooks. We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code generated in response to problems of varying difficulty. Our findings show that across our problem scenarios, the fine-tuning results in LLMs more capable of producing syntactically correct code (25.9% overall). Further, when analyzing functional correctness, a fine-tuned open-source CodeGen LLM can outperform the state-of-the-art commercial Codex LLM (6.5% overall). Training/evaluation scripts and LLM checkpoints are available: https://github.com/shailja-thakur/VGen.

Related papers

IFEvalCode: Controlled Code Generation [69.28317223249358]
The paper introduces forward and backward constraints generation to improve the instruction-following capabilities of Code LLMs.<n>The authors present IFEvalCode, a multilingual benchmark comprising 1.6K test samples across seven programming languages.
arXiv Detail & Related papers (2025-07-30T08:08:48Z)
Insights from Verification: Training a Verilog Generation LLM with Reinforcement Learning with Testbench Feedback [36.69082579950107]
Large language models (LLMs) have shown strong performance in Verilog generation from natural language description. This paper introduces a method that integrates verification insights from testbench into the training of Verilog generation LLMs.
arXiv Detail & Related papers (2025-04-22T11:38:14Z)
SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
SpecTool is a new benchmark to identify error patterns in LLM output on tool-use tasks. We show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SPECTOOL to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
Can EDA Tool Feedback Improve Verilog Generation by LLMs? [25.596711210493172]
Large Language Models (LLMs) are emerging as a potential tool to help generate fully functioning HDL code. We evaluate the ability of LLMs to leverage feedback from electronic design automation (EDA) tools to fix mistakes in their own generated Verilog.
arXiv Detail & Related papers (2024-11-01T17:33:28Z)
CraftRTL: High-quality Synthetic Data Generation for Verilog Code Models with Correct-by-Construction Non-Textual Representations and Targeted Code Repair [4.554742043916029]
This paper first presents an analysis of fine-tuned LLMs on Verilog coding, with synthetic data from prior methods.<n>We identify two main issues: difficulties in handling non-textual representations and significant variability during training with models randomly making "minor" mistakes.<n>Our fine-tuned Starcoder2-15B outperforms prior state-of-the-art results by 3.8%, 10.9%, 6.6% for pass@1 on VerilogEval-Machine, VerilogEval-Human, and RTLLM.
arXiv Detail & Related papers (2024-09-19T12:15:55Z)
CodeV: Empowering LLMs for Verilog Generation through Multi-Level Summarization [37.4446786461791]
This paper introduces CodeV, a series of open-source instruction-tuned Verilog generation LLMs. We show that CodeV relatively surpasses the previous open-source SOTA by 14.4% (BetterV in VerilogEval) and 11.3% (RTLCoder in RTLLM) respectively.
arXiv Detail & Related papers (2024-07-15T03:57:20Z)
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph [70.79413606968814]
We introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks.
arXiv Detail & Related papers (2024-06-25T04:27:53Z)
Data is all you need: Finetuning LLMs for Chip Design via an Automated design-data augmentation framework [50.02710905062184]
This paper proposes an automated design-data augmentation framework, which generates high-volume and high-quality natural language aligned with Verilog and EDA scripts. The accuracy of Verilog generation surpasses that of the current state-of-the-art open-source Verilog generation model, increasing from 58.8% to 70.6% with the same benchmark.
arXiv Detail & Related papers (2024-03-17T13:01:03Z)
Decoding Logic Errors: A Comparative Study on Bug Detection by Students and Large Language Models [5.162225137921625]
Large language models (LLMs) have recently demonstrated surprising performance for a range of computing tasks. We investigate the performance of two popular LLMs, GPT-3 and GPT-4, for detecting and providing a novice-friendly explanation of logic errors.
arXiv Detail & Related papers (2023-11-27T17:28:33Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
VerilogEval: Evaluating Large Language Models for Verilog Code Generation [6.88526119890374]
We present a comprehensive evaluation dataset consisting of 156 problems from the Verilog instructional website HDLBits. The evaluation set consists of a diverse set of Verilog code generation tasks, ranging from simple combinational circuits to complex finite state machines.
arXiv Detail & Related papers (2023-09-14T09:15:34Z)
VeriGen: A Large Language Model for Verilog Code Generation [22.837558083876743]
We fine-tune pre-existing Large Language Models (LLMs) on Verilog datasets compiled from GitHub and Verilog textbooks. Here, our fine-tuned open-source CodeGen-16B model outperforms the commercial state-of-the-art GPT-3.5-turbo model with a 1.1% overall increase. Notably, it demonstrates a 41% improvement in generating syntactically correct Verilog code across various problem categories compared to its pre-trained counterpart.
arXiv Detail & Related papers (2023-07-28T02:57:14Z)
LEVER: Learning to Verify Language-to-Code Generation with Execution [64.36459105535]
We propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the LLMs is correct or not based on the natural language input, the program itself and its execution results. LEVER consistently improves over the base code LLMs(4.6% to 10.9% with code-davinci) and achieves new state-of-the-art results on all of them.
arXiv Detail & Related papers (2023-02-16T18:23:22Z)
Self-Supervised Log Parsing [59.04636530383049]
Large-scale software systems generate massive volumes of semi-structured log records. Existing approaches rely on log-specifics or manual rule extraction. We propose NuLog that utilizes a self-supervised learning model and formulates the parsing task as masked language modeling.
arXiv Detail & Related papers (2020-03-17T19:25:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.