hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation
- URL: http://arxiv.org/abs/2506.04544v2
- Date: Tue, 08 Jul 2025 19:43:08 GMT
- Title: hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation
- Authors: Charles Hong, Brendan Roberts, Huijae An, Alex Um, Advay Ratan, Yakun Sophia Shao
- Abstract summary: We present hdl2v ("HDL-to-Verilog"), a dataset which seeks to increase the amount of available human-written Verilog data. We demonstrate the value of hdl2v by improving the performance of a 32 billion-parameter model by up to 23%.
- Score: 0.9149489479543916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are playing an increasingly large role in domains such as code generation, including hardware code generation, where Verilog is the key language. However, the amount of publicly available Verilog code pales in comparison to the amount of code available for software languages like Python. In this work, we present hdl2v ("HDL-to-Verilog"), a dataset which seeks to increase the amount of available human-written Verilog data by translating or compiling three other hardware description languages - VHDL, Chisel, and PyMTL3 - to Verilog. Furthermore, we demonstrate the value of hdl2v in enhancing LLM Verilog generation by improving performance of a 32 billion-parameter open-weight model by up to 23% (pass@10) in VerilogEvalV2, without utilizing any data augmentation or knowledge distillation from larger models. We also show hdl2v's ability to boost the performance of a data augmentation-based fine-tuning approach by 63%. Finally, we characterize and analyze our dataset to better understand which characteristics of HDL-to-Verilog datasets can be expanded upon in future work for even better performance.
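The pass@10 figure cited in the abstract refers to the standard pass@k metric used by benchmarks such as VerilogEvalV2. A minimal sketch of the commonly used unbiased estimator (the function name `pass_at_k` is illustrative, not from the paper):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of
    k samples, drawn without replacement from n generations of which
    c pass the tests, is functionally correct."""
    if n - c < k:
        # Too few failures to fill all k slots: success is guaranteed.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed as a numerically stable product.
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 20 generations per problem, 5 of which pass the testbench.
score = pass_at_k(n=20, c=5, k=10)
```

Benchmark scores are then averaged over all problems in the suite, so an "up to 23% (pass@10)" gain means this per-problem probability, averaged over the benchmark, improved by that margin.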
Related papers
- VerilogDB: The Largest, Highest-Quality Dataset with a Preprocessing Framework for LLM-based RTL Generation [1.0798445660490976]
Large Language Models (LLMs) are gaining popularity for hardware design automation, particularly through Register Transfer Level (RTL) code generation. We construct a robust Verilog dataset through an automated three-pronged process involving database (DB) creation and management. The resulting dataset comprises 20,392 Verilog samples, 751 MB of Verilog code data, which is the largest high-quality Verilog dataset for fine-tuning to our knowledge.
arXiv Detail & Related papers (2025-07-09T17:06:54Z)
- HDLxGraph: Bridging Large Language Models and HDL Repositories via HDL Graph Databases [57.51078142561683]
Large Language Models (LLMs) have demonstrated their potential in hardware design tasks. Yet, their performance in real-world, repository-level HDL projects with thousands or even tens of thousands of code lines is hindered. We propose HDLxGraph, a novel framework that integrates Graph Retrieval Augmented Generation (Graph RAG) with LLMs.
arXiv Detail & Related papers (2025-05-21T16:14:10Z)
- Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning [69.7882311630412]
We propose Code2Logic, a novel game-code-driven approach for multimodal reasoning data synthesis. Our approach leverages Large Language Models (LLMs) to adapt game code, enabling automatic acquisition of reasoning processes and results through code execution. GameQA is cost-effective and scalable to produce, challenging for state-of-the-art models, and diverse with 30 games and 158 tasks.
arXiv Detail & Related papers (2025-05-20T03:47:44Z)
- CodeV: Empowering LLMs with HDL Generation through Multi-Level Summarization [32.462699328256384]
Traditional methods of adapting large language models for hardware design rely on synthetic HDL datasets. We propose an efficient LLM fine-tuning pipeline for HDL generation that integrates a multi-level summarization data synthesis process with a novel Chat-FIM-Tag supervised fine-tuning method.
arXiv Detail & Related papers (2024-07-15T03:57:20Z)
- VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM).
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z)
- VHDL-Eval: A Framework for Evaluating Large Language Models in VHDL Code Generation [4.700008016247411]
This paper introduces a comprehensive evaluation framework designed specifically for assessing the VHDL code generation task.
This dataset is constructed by translating a collection of Verilog evaluation problems to VHDL and aggregating publicly available VHDL problems, resulting in a total of 202 problems.
To assess the functional correctness of the generated VHDL code, we utilize a curated set of self-verifying testbenches.
arXiv Detail & Related papers (2024-06-06T00:06:50Z)
- Data is all you need: Finetuning LLMs for Chip Design via an Automated design-data augmentation framework [50.02710905062184]
This paper proposes an automated design-data augmentation framework, which generates high-volume and high-quality natural language aligned with Verilog and EDA scripts.
The accuracy of Verilog generation surpasses that of the current state-of-the-art open-source Verilog generation model, increasing from 58.8% to 70.6% with the same benchmark.
arXiv Detail & Related papers (2024-03-17T13:01:03Z)
- Code Needs Comments: Enhancing Code LLMs with Comment Augmentation [91.52444946362547]
We introduce a novel data augmentation method that generates comments for existing code, coupled with a data filtering strategy that filters out code data poorly correlated with natural language.
We conducted experiments on three code-focused Large Language Models and observed consistent improvements in performance on two widely-used programming skill benchmarks.
arXiv Detail & Related papers (2024-02-20T13:56:38Z)
- LLM-Assisted Code Cleaning For Training Accurate Code Generators [53.087019724256606]
We investigate data quality for code and find that making the code more structured and readable leads to improved code generation performance of the system.
We build a novel data-cleaning pipeline that uses these principles to transform existing programs.
We evaluate our approach on two challenging algorithmic code generation benchmarks and find that fine-tuning CodeLLaMa-7B improves the performance by up to 30% compared to fine-tuning on the original dataset.
arXiv Detail & Related papers (2023-11-25T02:45:50Z)
- GlotLID: Language Identification for Low-Resource Languages [51.38634652914054]
GlotLID-M is an LID model that satisfies the desiderata of wide coverage, reliability and efficiency.
It identifies 1665 languages, a large increase in coverage compared to prior work.
arXiv Detail & Related papers (2023-10-24T23:45:57Z)
- Benchmarking Large Language Models for Automated Verilog RTL Code Generation [21.747037230069854]
We characterize the ability of large language models (LLMs) to generate useful Verilog.
We construct an evaluation framework comprising test-benches for functional analysis and a flow to test the syntax of Verilog code.
Our findings show that across our problem scenarios, fine-tuning yields LLMs that are more capable of producing syntactically correct code.
arXiv Detail & Related papers (2022-12-13T16:34:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.