HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High
Level Synthesis
- URL: http://arxiv.org/abs/2302.10977v2
- Date: Mon, 21 Aug 2023 17:36:36 GMT
- Title: HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High
Level Synthesis
- Authors: Zhigang Wei, Aman Arora, Ruihao Li, Lizy K. John
- Abstract summary: This paper presents a dataset for ML-assisted FPGA design using HLS, called HLSDataset.
The dataset is generated from widely used HLS C benchmarks including Polybench, MachSuite, CHStone, and Rosetta.
The total number of generated Verilog samples is nearly 9,000 per FPGA type.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine Learning (ML) has been widely adopted in design-space
exploration with high-level synthesis (HLS) to provide faster and more accurate
performance, resource, and power estimation at very early stages of FPGA-based
design. To make such predictions accurately, ML models must be trained on
high-quality, large-volume datasets. This paper presents a dataset for
ML-assisted FPGA design using HLS, called HLSDataset. The dataset is generated
from widely used HLS C benchmarks including Polybench, MachSuite, CHStone, and
Rosetta. The Verilog samples are generated with a variety of directives,
including loop unrolling, loop pipelining, and array partitioning, to ensure
that both optimized and realistic designs are covered. The total number of
generated Verilog samples is nearly 9,000 per FPGA type. To demonstrate the
effectiveness of our dataset, we present case studies that perform power
estimation and resource usage estimation with ML models trained on our dataset.
All the code and the dataset are public at the GitHub repo. We believe that
HLSDataset can save valuable time for researchers by avoiding the tedious
process of running tools, scripting, and parsing files to generate the dataset,
and enable them to spend more time where it counts: training ML models.
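As a concrete illustration of the directive sweep described in the abstract, the sketch below enumerates loop-unroll, pipeline, and array-partition settings and emits one Vitis-HLS-style Tcl directive script per combination. This is a minimal, assumed workflow, not the dataset's released generation scripts; the loop label, array name, and sweep values are placeholders.

```python
# Minimal sketch (assumed workflow, not the HLSDataset release scripts) of
# sweeping HLS directives to produce many design points from one C benchmark.
from itertools import product

UNROLL_FACTORS = [1, 2, 4, 8]      # placeholder sweep values
PIPELINE_OPTIONS = [False, True]
PARTITION_FACTORS = [1, 2, 4]

def directive_script(unroll: int, pipeline: bool, partition: int) -> str:
    """Build a Vitis-HLS-style Tcl directive script for one configuration.
    "top/LOOP_MAIN" and "buf" are placeholder loop/array names."""
    lines = [f'set_directive_unroll -factor {unroll} "top/LOOP_MAIN"']
    if pipeline:
        lines.append('set_directive_pipeline "top/LOOP_MAIN"')
    if partition > 1:
        lines.append(
            f'set_directive_array_partition -type cyclic -factor {partition} "top" buf')
    return "\n".join(lines)

# Each configuration would be handed to the HLS tool to synthesize one Verilog
# sample; the full sweep yields unroll x pipeline x partition design points.
for unroll, pipeline, partition in product(
        UNROLL_FACTORS, PIPELINE_OPTIONS, PARTITION_FACTORS):
    script = directive_script(unroll, pipeline, partition)
```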
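The case studies train ML models on the generated samples to estimate power and resource usage. The snippet below is a minimal sketch of such a resource-usage regressor using scikit-learn; the CSV file name and column names are assumptions for illustration, not the dataset's actual schema.

```python
# Minimal sketch of a resource-usage estimator trained on HLSDataset-style
# samples; the file name and column names are assumed, not the real schema.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("hlsdataset_samples.csv")               # hypothetical export
features = ["lut_est", "ff_est", "dsp_est", "bram_est",  # assumed HLS-report features
            "latency_cycles", "unroll_factor"]
target = "lut_actual"                                     # assumed post-implementation label

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("LUT MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```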
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- rule4ml: An Open-Source Tool for Resource Utilization and Latency Estimation for ML Models on FPGA [0.0]
This paper introduces a novel method to predict the resource utilization and inference latency of Neural Networks (NNs) before their synthesis and implementation on FPGA.
We leverage HLS4ML, a tool-flow that helps translate NNs into high-level synthesis (HLS) code.
Our method uses trained regression models for immediate pre-synthesis predictions.
arXiv Detail & Related papers (2024-08-09T19:35:10Z)
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
- Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
- Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator [63.762209407570715]
Genixer is a comprehensive data generation pipeline consisting of four key steps.
Training LLaVA1.5 on a synthetic VQA-like dataset enhances performance on 10 out of 12 multimodal benchmarks.
MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data.
arXiv Detail & Related papers (2023-12-11T09:44:41Z)
- EDALearn: A Comprehensive RTL-to-Signoff EDA Benchmark for Democratized and Reproducible ML for EDA Research [5.093676641214663]
We introduce EDALearn, the first holistic, open-source benchmark suite specifically for Machine Learning tasks in EDA.
This benchmark suite presents an end-to-end flow from synthesis to physical implementation, enriching data collection across various stages.
Our contributions aim to encourage further advances in the ML-EDA domain.
arXiv Detail & Related papers (2023-12-04T06:51:46Z)
- Data-Juicer: A One-Stop Data Processing System for Large Language Models [73.27731037450995]
A data recipe is a mixture of data from different sources used to train Large Language Models (LLMs).
We build a new system named Data-Juicer, with which we can efficiently generate diverse data recipes.
The data recipes derived with Data-Juicer gain notable improvements on state-of-the-art LLMs.
arXiv Detail & Related papers (2023-09-05T08:22:07Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)