Related papers: Enhancing Large Language Models for Hardware Verification: A Novel SystemVerilog Assertion Dataset

Enhancing Large Language Models for Hardware Verification: A Novel SystemVerilog Assertion Dataset

URL: http://arxiv.org/abs/2503.08923v1
Date: Tue, 11 Mar 2025 22:13:26 GMT
Title: Enhancing Large Language Models for Hardware Verification: A Novel SystemVerilog Assertion Dataset
Authors: Anand Menon, Samit S Miftah, Shamik Kundu, Souvik Kundu, Amisha Srivastava, Arnab Raha, Gabriel Theodor Sonnenschein, Suvadeep Banerjee, Deepak Mathaikutty, Kanad Basu,
Abstract summary: **VERT** is an open-source dataset designed to enhance SystemVerilog assertion generation using LLMs.<n>It enables researchers in academia and industry to fine-tune open-source models, outperforming larger proprietary ones in both accuracy and efficiency.
Score: 3.8212435331909256
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Hardware verification is crucial in modern SoC design, consuming around 70% of development time. SystemVerilog assertions ensure correct functionality. However, existing industrial practices rely on manual efforts for assertion generation, which becomes increasingly untenable as hardware systems become complex. Recent research shows that Large Language Models (LLMs) can automate this process. However, proprietary SOTA models like GPT-4o often generate inaccurate assertions and require expensive licenses, while smaller open-source LLMs need fine-tuning to manage HDL code complexities. To address these issues, we introduce **VERT**, an open-source dataset designed to enhance SystemVerilog assertion generation using LLMs. VERT enables researchers in academia and industry to fine-tune open-source models, outperforming larger proprietary ones in both accuracy and efficiency while ensuring data privacy through local fine-tuning and eliminating costly licenses. The dataset is curated by systematically augmenting variables from open-source HDL repositories to generate synthetic code snippets paired with corresponding assertions. Experimental results demonstrate that fine-tuned models like Deepseek Coder 6.7B and Llama 3.1 8B outperform GPT-4o, achieving up to 96.88% improvement over base models and 24.14% over GPT-4o on platforms including OpenTitan, CVA6, OpenPiton and Pulpissimo. VERT is available at https://github.com/AnandMenon12/VERT.

Related papers

OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs [62.68905180014956]
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset.
arXiv Detail & Related papers (2025-04-05T02:52:16Z)
Learning to Solve and Verify: A Self-Play Framework for Code and Test Generation [69.62857948698436]
Recent advances in large language models (LLMs) have improved their performance on coding benchmarks. However, improvement is plateauing due to the exhaustion of readily available high-quality data. We propose Sol-Ver, a self-play solver-verifier framework that jointly improves a single model's code and test generation capacity.
arXiv Detail & Related papers (2025-02-20T18:32:19Z)
CodeLutra: Boosting LLM Code Generation via Preference-Guided Refinement [32.46078765471136]
Large Language Models (LLMs) have revolutionized code generation but require significant resources and often over-generalize.<n>We introduce CodeLutra, a framework that leverages both correct and incorrect code attempts.<n>By learning from both successes and mistakes, CodeLutra provides a scalable and efficient path to high-quality code generation.
arXiv Detail & Related papers (2024-11-07T21:51:07Z)
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems. While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality code LLMs remain limited. We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
Automatically Improving LLM-based Verilog Generation using EDA Tool Feedback [25.596711210493172]
Large Language Models (LLMs) are emerging as a potential tool to help generate fully functioning HDL code.<n>We evaluate the ability of LLMs to leverage feedback from electronic design automation (EDA) tools to fix mistakes in their own generated Verilog.
arXiv Detail & Related papers (2024-11-01T17:33:28Z)
Revisiting VerilogEval: A Year of Improvements in Large-Language Models for Hardware Code Generation [6.463959200930805]
We evaluate new commercial and open models since the release of the open-source VerilogEval benchmark.<n>We find measurable improvements in state-of-the-art models.<n>We find that prompt engineering remains crucial for achieving good pass rates.
arXiv Detail & Related papers (2024-08-20T17:58:56Z)
OriGen:Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection [54.775409528658486]
OriGen is a fully open-source framework that incorporates self-reflection capabilities and a novel dataset augmentation methodology. Our approach employs a code-tocode augmentation technique to enhance the quality of open-source RTL code datasets.
arXiv Detail & Related papers (2024-07-23T07:22:25Z)
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents [62.02920842630234]
We show how to build small fact-checking models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors. For evaluation, we unify datasets from recent work on fact-checking and grounding LLM generations into a new benchmark, LLM-AggreFact.
arXiv Detail & Related papers (2024-04-16T17:59:10Z)
Data is all you need: Finetuning LLMs for Chip Design via an Automated design-data augmentation framework [50.02710905062184]
This paper proposes an automated design-data augmentation framework, which generates high-volume and high-quality natural language aligned with Verilog and EDA scripts. The accuracy of Verilog generation surpasses that of the current state-of-the-art open-source Verilog generation model, increasing from 58.8% to 70.6% with the same benchmark.
arXiv Detail & Related papers (2024-03-17T13:01:03Z)
Comparing GPT-4 and Open-Source Language Models in Misinformation Mitigation [6.929834518749884]
GPT-4 is known to be strong in this domain, but it is closed source, potentially expensive, and can show instability between different versions. We show that Zephyr-7b presents a consistently viable alternative, overcoming key limitations of commonly used approaches. We then highlight how GPT-3.5 exhibits unstable performance, such that this very widely used model could provide misleading results in misinformation detection.
arXiv Detail & Related papers (2024-01-12T22:27:25Z)
LLM4PLC: Harnessing Large Language Models for Verifiable Programming of PLCs in Industrial Control Systems [9.946058168276744]
Large Language Models (LLMs) fail to produce valid programs for Industrial Control Systems (ICS) operated by Programmable Logic Controllers (PLCs) We propose a user-guided iterative pipeline leveraging user feedback and external verification tools including grammar checkers, compilers and SMV verifiers. We run a complete test suite on GPT-3.5, GPT-4, Code Llama-7B, a fine-tuned Code Llama-7B model, Code Llama-34B, and a fine-tuned Code Llama-34B model.
arXiv Detail & Related papers (2024-01-08T23:52:42Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.