ForgeHLS: A Large-Scale, Open-Source Dataset for High-Level Synthesis
- URL: http://arxiv.org/abs/2507.03255v3
- Date: Mon, 04 Aug 2025 08:06:57 GMT
- Title: ForgeHLS: A Large-Scale, Open-Source Dataset for High-Level Synthesis
- Authors: Zedong Peng, Zeju Li, Mingzhe Gao, Qiang Xu, Chen Zhang, Jieru Zhao
- Abstract summary: We introduce ForgeHLS, a large-scale, open-source dataset explicitly designed for machine learning (ML)-driven HLS research. ForgeHLS comprises over 400k diverse designs generated from 846 kernels covering a broad range of application domains. Compared to existing datasets, ForgeHLS significantly enhances scale, diversity, and design coverage.
- Score: 13.87691887333415
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-Level Synthesis (HLS) plays a crucial role in modern hardware design by transforming high-level code into optimized hardware implementations. However, progress in applying machine learning (ML) to HLS optimization has been hindered by a shortage of sufficiently large and diverse datasets. To bridge this gap, we introduce ForgeHLS, a large-scale, open-source dataset explicitly designed for ML-driven HLS research. ForgeHLS comprises over 400k diverse designs generated from 846 kernels covering a broad range of application domains, consuming over 200k CPU hours during dataset construction. Each kernel includes systematically automated pragma insertions (loop unrolling, pipelining, array partitioning), combined with extensive design space exploration using Bayesian optimization. Compared to existing datasets, ForgeHLS significantly enhances scale, diversity, and design coverage. We further define and evaluate representative downstream tasks in Quality of Result (QoR) prediction and automated pragma exploration, clearly demonstrating the utility of ForgeHLS for developing and improving ML-based HLS optimization methodologies. The dataset and code are publicly available at https://github.com/zedong-peng/ForgeHLS.
Related papers
- Deep Representation Learning for Electronic Design Automation [0.0]
Representation learning has become an effective technique utilized by electronic design automation (EDA) algorithms. This paper examines the application of representation learning in EDA, covering foundational concepts and analyzing prior work and case studies. Key techniques, including image-based methods, graph-based approaches, and hybrid multimodal solutions, are presented to illustrate the improvements provided in routing, timing, and parasitic prediction.
arXiv Detail & Related papers (2025-05-04T13:18:58Z) - OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs [62.68905180014956]
We introduce OpenCodeInstruct, the largest open-access instruction tuning dataset, comprising 5 million diverse samples. Each sample includes a programming question, solution, test cases, execution feedback, and LLM-generated quality assessments. We fine-tune various base models, including LLaMA and Qwen, across multiple scales (1B+, 3B+, and 7B+) using our dataset.
arXiv Detail & Related papers (2025-04-05T02:52:16Z) - UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge. We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process. Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z) - AnyRefill: A Unified, Data-Efficient Framework for Left-Prompt-Guided Vision Tasks [116.8706375364465]
We present a novel Left-Prompt-Guided (LPG) paradigm to address a diverse range of reference-based vision tasks. We propose AnyRefill, which effectively adapts Text-to-Image (T2I) models to various vision tasks.
arXiv Detail & Related papers (2025-02-16T15:12:40Z) - SnipGen: A Mining Repository Framework for Evaluating LLMs for Code [51.07471575337676]
Large Language Models (LLMs) are trained on extensive datasets that include code repositories. However, evaluating their effectiveness poses significant challenges due to the potential overlap between the datasets used for training and those employed for evaluation. We introduce SnipGen, a comprehensive repository mining framework designed to leverage prompt engineering across various downstream tasks for code generation.
arXiv Detail & Related papers (2025-02-10T21:28:15Z) - A Survey of Research in Large Language Models for Electronic Design Automation [5.426530967206322]
Large Language Models (LLMs) have emerged as transformative technologies. This survey focuses on advancements in model architectures, the implications of varying model sizes, and innovative customization techniques. It aims to offer valuable insights to professionals in the EDA industry, AI researchers, and anyone interested in the convergence of advanced AI technologies and electronic design.
arXiv Detail & Related papers (2025-01-16T16:51:59Z) - Data Analysis in the Era of Generative AI [56.44807642944589]
This paper explores the potential of AI-powered tools to reshape data analysis, focusing on design considerations and challenges.
We explore how the emergence of large language and multimodal models offers new opportunities to enhance various stages of data analysis workflow.
We then examine human-centered design principles that facilitate intuitive interactions, build user trust, and streamline the AI-assisted analysis workflow across multiple apps.
arXiv Detail & Related papers (2024-09-27T06:31:03Z) - Deep Inverse Design for High-Level Synthesis [1.9029532975354944]
We propose Deep Inverse Design for HLS (DID4HLS), a novel approach that integrates graph neural networks and generative models. DID4HLS iteratively optimizes hardware designs aimed at compute-intensive algorithms by learning conditional distributions of design features from post-HLS data. Compared to four state-of-the-art DSE baselines, our method achieved an average improvement of 42.8% in average distance to the reference set.
arXiv Detail & Related papers (2024-07-11T18:13:38Z) - HLSFactory: A Framework Empowering High-Level Synthesis Datasets for Machine Learning and Beyond [3.206764939601044]
Machine learning (ML) techniques have been applied to high-level synthesis (HLS) flows for quality-of-result (QoR) prediction and design space exploration (DSE). The scarcity of high-quality HLS datasets and the complexity of building such datasets present challenges. We introduce HLSFactory, a comprehensive framework designed to facilitate the curation and generation of high-quality HLS design datasets.
arXiv Detail & Related papers (2024-05-01T19:02:18Z) - Skip the Benchmark: Generating System-Level High-Level Synthesis Data using Generative Machine Learning [8.416553728391309]
High-Level Synthesis (HLS) Design Space Exploration (DSE) is a widely accepted approach for exploring optimal hardware solutions during the HLS process.
Several HLS benchmarks and datasets are available for the research community to evaluate their methodologies.
This paper proposes a novel approach, called Vaegan, that employs generative machine learning to generate synthetic data that is robust enough to support complex system-level HLS DSE experiments.
arXiv Detail & Related papers (2024-04-23T05:32:22Z) - Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime.
We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z) - EDALearn: A Comprehensive RTL-to-Signoff EDA Benchmark for Democratized and Reproducible ML for EDA Research [7.754108359835169]
We introduce EDALearn, the first holistic, open-source benchmark suite specifically for Machine Learning tasks in EDA. This benchmark suite presents an end-to-end flow from synthesis to physical implementation, enriching data collection across various stages. Our contributions aim to encourage further advances in the ML-EDA domain.
arXiv Detail & Related papers (2023-12-04T06:51:46Z) - LAMBO: Large AI Model Empowered Edge Intelligence [71.56135386994119]
Next-generation edge intelligence is anticipated to benefit various applications via offloading techniques.
Traditional offloading architectures face several issues, including heterogeneous constraints, partial perception, uncertain generalization, and lack of tractability.
We propose a Large AI Model-Based Offloading (LAMBO) framework with over one billion parameters for solving these problems.
arXiv Detail & Related papers (2023-08-29T07:25:42Z) - Open-Set Domain Adaptation with Visual-Language Foundation Models [51.49854335102149]
Unsupervised domain adaptation (UDA) has proven to be very effective in transferring knowledge from a source domain to a target domain with unlabeled data.
Open-set domain adaptation (ODA) has emerged as a potential solution to identify these classes during the training phase.
arXiv Detail & Related papers (2023-07-30T11:38:46Z) - HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High Level Synthesis [1.7795190822602627]
This paper presents a dataset for ML-assisted FPGA design using HLS, called HLSDataset.
The dataset is generated from widely used HLS C benchmarks including PolyBench, MachSuite, CHStone, and Rosetta.
The total number of generated Verilog samples is nearly 9,000 per FPGA type.
arXiv Detail & Related papers (2023-02-17T17:00:12Z) - NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
arXiv Detail & Related papers (2022-06-28T20:13:40Z) - EPiDA: An Easy Plug-in Data Augmentation Framework for High Performance Text Classification [34.15923302216751]
We present an easy and plug-in data augmentation framework EPiDA to support effective text classification.
EPiDA employs two mechanisms, relative entropy maximization (REM) and conditional entropy minimization (CEM), to control data generation.
EPiDA can support efficient and continuous data generation for effective classification training.
arXiv Detail & Related papers (2022-04-24T06:53:48Z) - Adaptive Linear Span Network for Object Skeleton Detection [56.78705071830965]
We propose adaptive linear span network (AdaLSN) to automatically configure and integrate scale-aware features for object skeleton detection.
AdaLSN substantiates its versatility by achieving a significantly better accuracy-latency trade-off.
It also demonstrates general applicability to image-to-mask tasks such as edge detection and road extraction.
arXiv Detail & Related papers (2020-11-08T12:51:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.