DeepCircuitX: A Comprehensive Repository-Level Dataset for RTL Code Understanding, Generation, and PPA Analysis
- URL: http://arxiv.org/abs/2502.18297v1
- Date: Tue, 25 Feb 2025 15:34:00 GMT
- Title: DeepCircuitX: A Comprehensive Repository-Level Dataset for RTL Code Understanding, Generation, and PPA Analysis
- Authors: Zeju Li, Changran Xu, Zhengyuan Shi, Zedong Peng, Yi Liu, Yunhao Zhou, Lingfeng Zhou, Chengyu Ma, Jianyuan Zhong, Xi Wang, Jieru Zhao, Zhufei Chu, Xiaoyan Yang, Qiang Xu
- Abstract summary: DeepCircuitX is a comprehensive repository-level dataset designed to advance RTL (Register Transfer Level) code understanding, generation, and power-performance-area (PPA) analysis. Unlike existing datasets that are limited to either file-level RTL code or physical layout data, DeepCircuitX provides a holistic, multilevel resource that spans repository, file, module, and block-level RTL code. DeepCircuitX is enriched with Chain of Thought (CoT) annotations, offering detailed descriptions of functionality and structure at multiple levels.
- Score: 14.341633834445307
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces DeepCircuitX, a comprehensive repository-level dataset designed to advance RTL (Register Transfer Level) code understanding, generation, and power-performance-area (PPA) analysis. Unlike existing datasets that are limited to either file-level RTL code or physical layout data, DeepCircuitX provides a holistic, multilevel resource that spans repository, file, module, and block-level RTL code. This structure enables more nuanced training and evaluation of large language models (LLMs) for RTL-specific tasks. DeepCircuitX is enriched with Chain of Thought (CoT) annotations, offering detailed descriptions of functionality and structure at multiple levels. These annotations enhance its utility for a wide range of tasks, including RTL code understanding, generation, and completion. Additionally, the dataset includes synthesized netlists and PPA metrics, facilitating early-stage design exploration and enabling accurate PPA prediction directly from RTL code. We demonstrate the dataset's effectiveness on various LLMs finetuned with our dataset and confirm the quality with human evaluations. Our results highlight DeepCircuitX as a critical resource for advancing RTL-focused machine learning applications in hardware design automation. Our data is available at https://zeju.gitbook.io/lcm-team.
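To make the multilevel structure concrete, here is a minimal Python sketch of what a single DeepCircuitX record could look like; every field name (rtl_code, cot_description, ppa, and so on) is an illustrative assumption, not the dataset's published schema.

```python
from dataclasses import dataclass

@dataclass
class PPAMetrics:
    power_mw: float   # post-synthesis power estimate
    delay_ns: float   # critical-path delay
    area_um2: float   # total cell area

@dataclass
class RTLSample:
    repo_name: str                     # repository-level grouping
    file_path: str                     # file-level location
    module_name: str                   # module-level unit
    rtl_code: str                      # Verilog source for the module
    block_annotations: dict[str, str]  # block label -> CoT description
    cot_description: str               # module-level CoT annotation
    netlist: str | None = None         # synthesized netlist, when available
    ppa: PPAMetrics | None = None      # labels for PPA prediction
```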
Related papers
- RTLRepoCoder: Repository-Level RTL Code Completion through the Combination of Fine-Tuning and Retrieval Augmentation [6.428086269916113]
We propose RTLRepoCoder, a groundbreaking solution that incorporates specific fine-tuning and Retrieval-Augmented Generation (RAG) for repository-level Verilog code completion.
Our solution achieves state-of-the-art performance on a public benchmark, significantly surpassing GPT-4 and advanced domain-specific LLMs on Edit Similarity and Exact Match rate.
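As a rough illustration of combining a fine-tuned model with retrieval augmentation for repository-level completion, the Python sketch below retrieves the most lexically similar files from a repository and prepends them to the completion prompt; the token-overlap retriever and prompt template are stand-ins, not RTLRepoCoder's actual components.

```python
def tokenize(code: str) -> set[str]:
    """Crude lexical tokenizer standing in for a learned embedding index."""
    return set(code.replace("(", " ").replace(")", " ").split())

def retrieve(repo_files: dict[str, str], prefix: str, k: int = 3) -> list[str]:
    """Return the k repository files sharing the most tokens with the prefix."""
    query = tokenize(prefix)
    ranked = sorted(repo_files.items(),
                    key=lambda kv: len(query & tokenize(kv[1])),
                    reverse=True)
    return [code for _, code in ranked[:k]]

def build_prompt(prefix: str, retrieved: list[str]) -> str:
    """Prepend retrieved context before asking the fine-tuned LLM to complete."""
    context = "\n\n".join(f"// retrieved context\n{s}" for s in retrieved)
    return f"{context}\n\n// complete the following Verilog\n{prefix}"
```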
arXiv Detail & Related papers (2025-04-11T09:04:50Z) - OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs [62.68905180014956]
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples.
Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments.
We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset.
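Based only on the fields named above, one OpenCodeInstruct-style sample might be shaped as follows; the key names are assumptions for illustration, not the dataset's actual schema.

```python
import json

sample = {
    "question": "Write a function that returns the n-th Fibonacci number.",
    "solution": (
        "def fib(n):\n"
        "    a, b = 0, 1\n"
        "    for _ in range(n):\n"
        "        a, b = b, a + b\n"
        "    return a"
    ),
    "test_cases": ["assert fib(0) == 0", "assert fib(10) == 55"],
    "execution_feedback": {"passed": 2, "failed": 0},
    "llm_quality_assessment": {"correctness": 5, "clarity": 4},
}
print(json.dumps(sample, indent=2))
```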
arXiv Detail & Related papers (2025-04-05T02:52:16Z) - SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Large Language Models (LLMs) are trained on extensive datasets that include code repositories. Evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation. We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
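In the spirit of mining repositories and applying prompt engineering across downstream tasks, a minimal sketch might turn one mined function into several task-specific prompts; the task names and templates here are hypothetical, not SnipGen's actual ones.

```python
SNIPPET = '''def add(a, b):
    """Return the sum of a and b."""
    return a + b'''

TEMPLATES = {
    "summarization": "Summarize what this function does:\n{code}",
    "completion": "Complete the body of this function:\n{signature}",
    "test_generation": "Write unit tests for this function:\n{code}",
}

def make_prompts(code: str) -> dict[str, str]:
    # The first line of the snippet serves as the signature for completion tasks.
    signature = code.split("\n", 1)[0]
    return {task: tpl.format(code=code, signature=signature)
            for task, tpl in TEMPLATES.items()}

for task, prompt in make_prompts(SNIPPET).items():
    print(f"--- {task} ---\n{prompt}\n")
```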
arXiv Detail & Related papers (2025-02-10T21:28:15Z) - RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects [0.02630859234884723]
Large Language Models (LLMs) have demonstrated potential in assisting with Register Transfer Level (RTL) design tasks.
However, a significant gap remains in benchmarks that accurately reflect the complexity of real-world RTL projects.
This paper presents RTL-Repo, a benchmark designed to evaluate LLMs on large-scale RTL design projects.
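Benchmarks of this kind are typically scored with exact match and an edit-distance-based similarity. The sketch below implements one common definition (1 minus the normalized Levenshtein distance); RTL-Repo's exact formula may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity(pred: str, ref: str) -> float:
    if not pred and not ref:
        return 1.0
    return 1.0 - levenshtein(pred, ref) / max(len(pred), len(ref))

def exact_match(pred: str, ref: str) -> bool:
    return pred.strip() == ref.strip()
```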
arXiv Detail & Related papers (2024-05-27T17:36:01Z) - Cross-domain Learning Framework for Tracking Users in RIS-aided Multi-band ISAC Systems with Sparse Labeled Data [55.70071704247794]
Integrated sensing and communications (ISAC) is pivotal for 6G communications and is boosted by the rapid development of reconfigurable intelligent surfaces (RISs).
This paper proposes the X2Track framework, where we model the tracking function by a hierarchical architecture, jointly utilizing multi-modal CSI indicators across multiple bands, and optimize it in a cross-domain manner.
Under X2Track, we design an efficient deep learning algorithm to minimize tracking errors, based on transformer neural networks and adversarial learning techniques.
arXiv Detail & Related papers (2024-05-10T08:04:27Z) - Enhancing LLM-Based Coding Tools through Native Integration of IDE-Derived Static Context [41.91246546266515]
We argue that Integrated Development Environments (IDEs) can provide direct, accurate and real-time cross-file information for repository-level code completion.
We propose a framework that leverages IDE-native static contexts for cross-context construction and diagnosis results for self-refinement.
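As a toy stand-in for the IDE-derived static context the paper describes, the sketch below uses Python's ast module to pull function and class signatures out of a cross-file source, the kind of compact context a real IDE could supply directly alongside diagnostics.

```python
import ast

def file_signatures(source: str) -> list[str]:
    """Extract compact signatures to use as cross-file completion context."""
    sigs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args}): ...")
        elif isinstance(node, ast.ClassDef):
            sigs.append(f"class {node.name}: ...")
    return sigs

print(file_signatures("class Cache:\n    def get(self, key):\n        return None"))
# ['class Cache: ...', 'def get(self, key): ...']
```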
arXiv Detail & Related papers (2024-02-06T01:59:41Z) - mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model [73.38800189095173]
This work focuses on strengthening the multi-modal diagram analysis ability of Multimodal LLMs.
By parsing LaTeX source files of high-quality papers, we carefully build a multi-modal diagram understanding dataset, M-Paper.
M-Paper is the first dataset to support joint comprehension of multiple scientific diagrams, including figures and tables in the format of images or LaTeX code.
arXiv Detail & Related papers (2023-11-30T04:43:26Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z) - Verilog-to-PyG -- A Framework for Graph Learning and Augmentation on RTL Designs [15.67829950106923]
We introduce an innovative open-source framework that translates RTL designs into graph representations that can serve as foundations for graph learning.
The Verilog-to-PyG (V2PYG) framework is compatible with the open-source Electronic Design Automation (EDA) toolchain OpenROAD.
We also present novel RTL data augmentation methods that enable functionally equivalent design augmentation for the construction of an extensive graph-based RTL design database.
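To illustrate the general idea of representing RTL as a graph for PyTorch Geometric, here is a toy example encoding a single AND gate as a Data object; the node features and edge list are placeholders, not V2PYG's actual encoding.

```python
import torch
from torch_geometric.data import Data

# Nodes: 0 = input a, 1 = input b, 2 = AND gate, 3 = output y
# One-hot node features: [is_input, is_gate, is_output]
x = torch.tensor([[1, 0, 0],
                  [1, 0, 0],
                  [0, 1, 0],
                  [0, 0, 1]], dtype=torch.float)
# Directed wires: a -> AND, b -> AND, AND -> y
edge_index = torch.tensor([[0, 1, 2],
                           [2, 2, 3]], dtype=torch.long)
graph = Data(x=x, edge_index=edge_index)
print(graph)  # Data(x=[4, 3], edge_index=[2, 3])
```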
arXiv Detail & Related papers (2023-11-09T20:11:40Z) - Harnessing Explanations: LLM-to-LM Interpreter for Enhanced Text-Attributed Graph Representation Learning [51.90524745663737]
A key innovation is our use of explanations as features, which can be used to boost GNN performance on downstream tasks.
Our method achieves state-of-the-art results on well-established TAG datasets.
Our method significantly speeds up training, achieving a 2.88 times improvement over the closest baseline on ogbn-arxiv.
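A minimal sketch of the "explanations as features" idea: embed LLM-generated explanation text per node and concatenate it with the existing node features before GNN training. The hashing vectorizer below is a cheap stand-in for the paper's LM-based interpreter.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer

explanations = [
    "This paper studies graph neural networks for citation data.",
    "This paper proposes a transformer for protein folding.",
]
# Embed the explanation text (stand-in for an LM encoder).
text_feats = HashingVectorizer(n_features=64).fit_transform(explanations).toarray()
node_feats = np.random.rand(2, 16)  # toy pre-existing TAG node features
# The concatenated features would be fed to the downstream GNN.
augmented = np.concatenate([node_feats, text_feats], axis=1)
print(augmented.shape)  # (2, 80)
```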
arXiv Detail & Related papers (2023-05-31T03:18:03Z) - Language Models Enable Simple Systems for Generating Structured Views of Heterogeneous Data Lakes [54.13559879916708]
EVAPORATE is a prototype system powered by large language models (LLMs).
Code synthesis is cheap, but far less accurate than directly processing each document with the LLM.
We propose an extended code implementation, EVAPORATE-CODE+, which achieves better quality than direct extraction.
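The cost/quality trade-off above comes from where the LLM sits in the loop: instead of reading every document, the LLM can synthesize a small extraction function once, and cheap code then processes the corpus. The regex extractor below is a hypothetical stand-in for such a synthesized function; EVAPORATE-CODE+ reportedly closes the quality gap by aggregating several candidate functions rather than trusting one.

```python
import re

def extract_license(doc: str) -> str | None:
    """Hypothetical LLM-synthesized extractor for a 'License' attribute."""
    m = re.search(r"License:\s*(\S+)", doc)
    return m.group(1) if m else None

docs = ["Title: A\nLicense: MIT", "Title: B\nLicense: Apache-2.0"]
print([extract_license(d) for d in docs])  # ['MIT', 'Apache-2.0']
```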
arXiv Detail & Related papers (2023-04-19T06:00:26Z) - HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High Level Synthesis [1.7795190822602627]
This paper presents a dataset for ML-assisted FPGA design using HLS, called HLSDataset.
The dataset is generated from widely used HLS C benchmarks, including PolyBench, MachSuite, CHStone, and Rosetta.
The total number of generated Verilog samples is nearly 9,000 per FPGA type.
arXiv Detail & Related papers (2023-02-17T17:00:12Z)