Towards the Imagenets of ML4EDA
- URL: http://arxiv.org/abs/2310.10560v1
- Date: Mon, 16 Oct 2023 16:35:03 GMT
- Title: Towards the Imagenets of ML4EDA
- Authors: Animesh Basak Chowdhury, Shailja Thakur, Hammond Pearce, Ramesh Karri,
Siddharth Garg
- Abstract summary: We describe our experience curating two large-scale, high-quality datasets for Verilog code generation and logic synthesis.
The first, VeriGen, is a dataset of Verilog code collected from GitHub and Verilog textbooks.
The second, OpenABC-D, is a large-scale, labeled dataset designed to aid ML for logic synthesis.
- Score: 24.696892205786742
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Despite the growing interest in ML-guided EDA tools from RTL to GDSII, there
are no standard datasets or prototypical learning tasks defined for the EDA
problem domain. Experience from the computer vision community suggests that
such datasets are crucial to spur further progress in ML for EDA. Here we
describe our experience curating two large-scale, high-quality datasets for
Verilog code generation and logic synthesis. The first, VeriGen, is a dataset
of Verilog code collected from GitHub and Verilog textbooks. The second,
OpenABC-D, is a large-scale, labeled dataset designed to aid ML for logic
synthesis tasks. The dataset consists of 870,000 And-Inverter-Graphs (AIGs)
produced from 1500 synthesis runs on a large number of open-source hardware
projects. In this paper we will discuss challenges in curating, maintaining and
growing the size and scale of these datasets. We will also touch upon questions
of dataset quality and security, and the use of novel data augmentation tools
that are tailored for the hardware domain.
Related papers
- RedPajama: an Open Dataset for Training Large Language Models [80.74772646989423]
We identify three core data-related challenges that must be addressed to advance open-source language models.
These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis.
We release RedPajama-V1, an open reproduction of the LLaMA training dataset, and RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata.
arXiv Detail & Related papers (2024-11-19T09:35:28Z) - SEART Data Hub: Streamlining Large-Scale Source Code Mining and Pre-Processing [13.717170962455526]
We present the SEART Data Hub, a web application that allows to easily build and pre-process large-scale datasets featuring code mined from public GitHub repositories.
Through a simple web interface, researchers can specify a set of mining criteria as well as specific pre-processing steps they want to perform.
After submitting the request, the user will receive an email with a download link for the required dataset within a few hours.
arXiv Detail & Related papers (2024-09-27T11:42:19Z) - Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? [73.81908518992161]
We introduce Spider2-V, the first multimodal agent benchmark focusing on professional data science and engineering.
Spider2-V features real-world tasks in authentic computer environments and incorporating 20 enterprise-level professional applications.
These tasks evaluate the ability of a multimodal agent to perform data-related tasks by writing code and managing the GUI in enterprise data software systems.
arXiv Detail & Related papers (2024-07-15T17:54:37Z) - VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
Large Language Models (LLMs) have made tremendous strides in code generation, but existing research fails to account for the dynamic nature of software development.
We propose two novel tasks aimed at bridging this gap: version-specific code completion (VSCC) and version-aware code migration (VACM)
We conduct an extensive evaluation on VersiCode, which reveals that version-controllable code generation is indeed a significant challenge.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - EDA Corpus: A Large Language Model Dataset for Enhanced Interaction with OpenROAD [0.2581187101462483]
We present an open-source dataset tailored for OpenROAD, a widely adopted open-source EDA toolchain.
The dataset features over 1000 data points and is structured in two formats: (i) a pairwise set comprised of question prompts with prose answers, and (ii) a pairwise set comprised of code prompts and their corresponding OpenROAD scripts.
arXiv Detail & Related papers (2024-05-04T21:29:37Z) - Data is all you need: Finetuning LLMs for Chip Design via an Automated design-data augmentation framework [50.02710905062184]
This paper proposes an automated design-data augmentation framework, which generates high-volume and high-quality natural language aligned with Verilog and EDA scripts.
The accuracy of Verilog generation surpasses that of the current state-of-the-art open-source Verilog generation model, increasing from 58.8% to 70.6% with the same benchmark.
arXiv Detail & Related papers (2024-03-17T13:01:03Z) - Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator [63.762209407570715]
Genixer is a comprehensive data generation pipeline consisting of four key steps.
A synthetic VQA-like dataset trained with LLaVA1.5 enhances performance on 10 out of 12 multimodal benchmarks.
MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data.
arXiv Detail & Related papers (2023-12-11T09:44:41Z) - HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High
Level Synthesis [1.7795190822602627]
This paper presents a dataset for ML-assisted FPGA design using HLS, called HLSDataset.
The dataset is generated from widely used HLS C benchmarks including Polybench, Machsuite, CHStone and Rossetta.
The total number of generated Verilog samples is nearly 9,000 per FPGA type.
arXiv Detail & Related papers (2023-02-17T17:00:12Z) - JEMMA: An Extensible Java Dataset for ML4Code Applications [34.76698017961728]
We introduce JEMMA, a large-scale, diverse, and high-quality dataset targeted at Machine Learning for Source Code (ML4Code)
Our goal with JEMMA is to lower the barrier to entry in ML4Code by providing the building blocks to experiment with source code models and tasks.
JEMMA comes with a considerable amount of pre-processed information such as metadata, representations (e.g., code tokens, ASTs, graphs), and several properties.
arXiv Detail & Related papers (2022-12-18T17:04:14Z) - OpenABC-D: A Large-Scale Dataset For Machine Learning Guided Integrated
Circuit Synthesis [10.338357262730863]
OpenABC-D is a large-scale, labeled dataset produced by prototypical open source designs with a leading open-source logic synthesis tool.
We define a generic learning problem on this dataset and benchmark existing solutions.
arXiv Detail & Related papers (2021-10-21T17:19:19Z) - Open Graph Benchmark: Datasets for Machine Learning on Graphs [86.96887552203479]
We present the Open Graph Benchmark (OGB) to facilitate scalable, robust, and reproducible graph machine learning (ML) research.
OGB datasets are large-scale, encompass multiple important graph ML tasks, and cover a diverse range of domains.
For each dataset, we provide a unified evaluation protocol using meaningful application-specific data splits and evaluation metrics.
arXiv Detail & Related papers (2020-05-02T03:09:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.