HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High
Level Synthesis
- URL: http://arxiv.org/abs/2302.10977v2
- Date: Mon, 21 Aug 2023 17:36:36 GMT
- Title: HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High
Level Synthesis
- Authors: Zhigang Wei, Aman Arora, Ruihao Li, Lizy K. John
- Abstract summary: This paper presents a dataset for ML-assisted FPGA design using HLS, called HLSDataset.
The dataset is generated from widely used HLS C benchmarks including Polybench, MachSuite, CHStone, and Rosetta.
The total number of generated Verilog samples is nearly 9,000 per FPGA type.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Machine Learning (ML) has been widely adopted in design-space
exploration with high-level synthesis (HLS) to provide faster and more accurate
performance, resource, and power estimation at very early stages of FPGA-based
design. To make such predictions accurately, ML models must be trained on
high-quality, large-volume datasets. This paper presents a dataset for
ML-assisted FPGA design using HLS, called HLSDataset. The dataset is generated
from widely used HLS C benchmarks including Polybench, MachSuite, CHStone, and
Rosetta. The Verilog samples are generated with a variety of directives,
including loop unrolling, loop pipelining, and array partitioning, to ensure
that both optimized and realistic designs are covered. The total number of
generated Verilog samples is nearly 9,000 per FPGA type. To demonstrate the
effectiveness of our dataset, we present case studies that perform power
estimation and resource usage estimation with ML models trained on our dataset.
All the code and the dataset are public at the GitHub repo. We believe that
HLSDataset can save valuable time for researchers by avoiding the tedious
process of running tools, scripting, and parsing files to generate the dataset,
and enable them to spend more time where it counts: training ML models.
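As a concrete illustration of the directive sweep described in the abstract, the sketch below enumerates loop-unroll, pipeline, and array-partition settings and emits one Vitis-HLS-style Tcl directive script per combination. This is a minimal, assumed workflow, not the dataset's released generation scripts; the loop label, array name, and sweep values are placeholders.

```python
# Minimal sketch (assumed workflow, not the HLSDataset release scripts) of
# sweeping HLS directives to produce many design points from one C benchmark.
from itertools import product

UNROLL_FACTORS = [1, 2, 4, 8]      # placeholder sweep values
PIPELINE_OPTIONS = [False, True]
PARTITION_FACTORS = [1, 2, 4]

def directive_script(unroll: int, pipeline: bool, partition: int) -> str:
    """Build a Vitis-HLS-style Tcl directive script for one configuration.
    "top/LOOP_MAIN" and "buf" are placeholder loop/array names."""
    lines = [f'set_directive_unroll -factor {unroll} "top/LOOP_MAIN"']
    if pipeline:
        lines.append('set_directive_pipeline "top/LOOP_MAIN"')
    if partition > 1:
        lines.append(
            f'set_directive_array_partition -type cyclic -factor {partition} "top" buf')
    return "\n".join(lines)

# Each configuration would be handed to the HLS tool to synthesize one Verilog
# sample; the full sweep yields unroll x pipeline x partition design points.
for unroll, pipeline, partition in product(
        UNROLL_FACTORS, PIPELINE_OPTIONS, PARTITION_FACTORS):
    script = directive_script(unroll, pipeline, partition)
```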
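The case studies train ML models on the generated samples to estimate power and resource usage. The snippet below is a minimal sketch of such a resource-usage regressor using scikit-learn; the CSV file name and column names are assumptions for illustration, not the dataset's actual schema.

```python
# Minimal sketch of a resource-usage estimator trained on HLSDataset-style
# samples; the file name and column names are assumed, not the real schema.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("hlsdataset_samples.csv")               # hypothetical export
features = ["lut_est", "ff_est", "dsp_est", "bram_est",  # assumed HLS-report features
            "latency_cycles", "unroll_factor"]
target = "lut_actual"                                     # assumed post-implementation label

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.2, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("LUT MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```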
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- rule4ml: An Open-Source Tool for Resource Utilization and Latency Estimation for ML Models on FPGA [0.0]
This paper introduces a novel method to predict the resource utilization and inference latency of Neural Networks (NNs) before their synthesis and implementation on FPGA.
We leverage HLS4ML, a tool-flow that helps translate NNs into high-level synthesis (HLS) code.
Our method uses trained regression models for immediate pre-synthesis predictions.
arXiv Detail & Related papers (2024-08-09T19:35:10Z)
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
- Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
- Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator [63.762209407570715]
Genixer is a comprehensive data generation pipeline consisting of four key steps.
Training LLaVA1.5 on a synthetic VQA-like dataset enhances performance on 10 out of 12 multimodal benchmarks.
MLLMs trained with task-specific datasets can surpass GPT-4V in generating complex instruction tuning data.
arXiv Detail & Related papers (2023-12-11T09:44:41Z)
- EDALearn: A Comprehensive RTL-to-Signoff EDA Benchmark for Democratized and Reproducible ML for EDA Research [5.093676641214663]
We introduce EDALearn, the first holistic, open-source benchmark suite specifically for Machine Learning tasks in EDA.
This benchmark suite presents an end-to-end flow from synthesis to physical implementation, enriching data collection across various stages.
Our contributions aim to encourage further advances in the ML-EDA domain.
arXiv Detail & Related papers (2023-12-04T06:51:46Z)
- Data-Juicer: A One-Stop Data Processing System for Large Language Models [73.27731037450995]
A data recipe is a mixture of data from different sources used to train Large Language Models (LLMs).
We build a new system named Data-Juicer, with which we can efficiently generate diverse data recipes.
The data recipes derived with Data-Juicer gain notable improvements on state-of-the-art LLMs.
arXiv Detail & Related papers (2023-09-05T08:22:07Z)
- MLLM-DataEngine: An Iterative Refinement Approach for MLLM [62.30753425449056]
We propose a novel closed-loop system that bridges data generation, model training, and evaluation.
Within each loop, the MLLM-DataEngine first analyzes the weaknesses of the model based on the evaluation results.
For targeting, we propose an Adaptive Bad-case Sampling module, which adjusts the ratio of different types of data.
For quality, we resort to GPT-4 to generate high-quality data with each given data type.
arXiv Detail & Related papers (2023-08-25T01:41:04Z)