Related papers: Skip the Benchmark: Generating System-Level High-Level Synthesis Data using Generative Machine Learning

Skip the Benchmark: Generating System-Level High-Level Synthesis Data using Generative Machine Learning

URL: http://arxiv.org/abs/2404.14754v1
Date: Tue, 23 Apr 2024 05:32:22 GMT
Title: Skip the Benchmark: Generating System-Level High-Level Synthesis Data using Generative Machine Learning
Authors: Yuchao Liao, Tosiron Adegbija, Roman Lysecky, Ravi Tandon,
Abstract summary: High-Level Synthesis (HLS) Design Space Exploration (DSE) is a widely accepted approach for exploring optimal hardware solutions during the HLS process. Several HLS benchmarks and datasets are available for the research community to evaluate their methodologies. This paper proposes a novel approach, called Vaegan, that employs generative machine learning to generate synthetic data that is robust enough to support complex system-level HLS DSE experiments.
Score: 8.416553728391309
License: http://creativecommons.org/licenses/by/4.0/
Abstract: High-Level Synthesis (HLS) Design Space Exploration (DSE) is a widely accepted approach for efficiently exploring Pareto-optimal and optimal hardware solutions during the HLS process. Several HLS benchmarks and datasets are available for the research community to evaluate their methodologies. Unfortunately, these resources are limited and may not be sufficient for complex, multi-component system-level explorations. Generating new data using existing HLS benchmarks can be cumbersome, given the expertise and time required to effectively generate data for different HLS designs and directives. As a result, synthetic data has been used in prior work to evaluate system-level HLS DSE. However, the fidelity of the synthetic data to real data is often unclear, leading to uncertainty about the quality of system-level HLS DSE. This paper proposes a novel approach, called Vaegan, that employs generative machine learning to generate synthetic data that is robust enough to support complex system-level HLS DSE experiments that would be unattainable with only the currently available data. We explore and adapt a Variational Autoencoder (VAE) and Generative Adversarial Network (GAN) for this task and evaluate our approach using state-of-the-art datasets and metrics. We compare our approach to prior works and show that Vaegan effectively generates synthetic HLS data that closely mirrors the ground truth's distribution.

Related papers

InfoSynth: Information-Guided Benchmark Synthesis for LLMs [69.80981631587501]
Large language models (LLMs) have demonstrated significant advancements in reasoning and code generation.<n>Traditional benchmark creation relies on manual human effort, a process that is both expensive and time-consuming.<n>This work introduces Info Synth, a novel framework for automatically generating and evaluating reasoning benchmarks.
arXiv Detail & Related papers (2026-01-02T05:26:27Z)
Syn-GRPO: Self-Evolving Data Synthesis for MLLM Perception Reasoning [58.4099027998709]
This work proposes Syn-GRPO, which employs an online data generator to synthesize high-quality training data with diverse responses in GRPO training.<n>Experiment results demonstrate that Syn-GRPO improves the data quality by a large margin, achieving significant superior performance to existing MLLM perception methods.
arXiv Detail & Related papers (2025-11-24T17:42:29Z)
ForgeHLS: A Large-Scale, Open-Source Dataset for High-Level Synthesis [13.87691887333415]
We introduce ForgeHLS, a large-scale, open-source dataset explicitly designed for machine learning (ML)-driven HLS research.<n> ForgeHLS comprises over 400k diverse designs generated from 846 kernels covering a broad range of application domains.<n>Compared to existing datasets, ForgeHLS significantly enhances scale, diversity, and design coverage.
arXiv Detail & Related papers (2025-07-04T02:23:46Z)
Private Training & Data Generation by Clustering Embeddings [74.00687214400021]
Differential privacy (DP) provides a robust framework for protecting individual data.<n>We introduce a novel principled method for DP synthetic image embedding generation.<n> Empirically, a simple two-layer neural network trained on synthetically generated embeddings achieves state-of-the-art (SOTA) classification accuracy.
arXiv Detail & Related papers (2025-06-20T00:17:14Z)
SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis [89.99161034065614]
Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios.<n>Existing approaches face critical limitations that lack high-quality training trajectories and suffer from distributional mismatches.<n>This paper introduces SimpleDeepSearcher, a framework that bridges the gap through strategic data engineering rather than complex training paradigms.
arXiv Detail & Related papers (2025-05-22T16:05:02Z)
Synthline: A Product Line Approach for Synthetic Requirements Engineering Data Generation using Large Language Models [0.5156484100374059]
This paper introduces Synthline, a Product Line (PL) approach that leverages Large Language Models to generate synthetic Requirements Engineering (RE) data.<n>Our analysis reveals that while synthetic datasets exhibit less diversity than real data, they are good enough to serve as viable training resources.<n>Our evaluation shows that combining synthetic and real data leads to substantial performance improvements.
arXiv Detail & Related papers (2025-05-06T07:57:16Z)
Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.390060529534644]
We propose OS-Genesis, a novel data synthesis pipeline for Graphical User Interface (GUI) agents. Instead of relying on pre-defined tasks, OS-Genesis enables agents first to perceive environments and perform step-wise interactions. We demonstrate that training GUI agents with OS-Genesis significantly improves their performance on highly challenging online benchmarks.
arXiv Detail & Related papers (2024-12-27T16:21:58Z)
Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLM) have been used for diverse tasks, but do not capture the correct correlation between the features and the target variable. We propose a LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data. Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
Efficacy of Synthetic Data as a Benchmark [3.2968976262860408]
We investigate the effectiveness of generating synthetic data through large language models (LLMs) Our experiments show that while synthetic data can effectively capture performance of various methods for simpler tasks, it falls short for more complex tasks like named entity recognition. We propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used to both generate benchmarking data and to perform the tasks.
arXiv Detail & Related papers (2024-09-18T13:20:23Z)
Are LLMs Any Good for High-Level Synthesis? [1.3927943269211591]
Large Language Models (LLMs) can streamline or replace the High-Level Synthesis (HLS) process. LLMs can understand natural language specifications and translate C code or natural language specifications. This study aims to illuminate the role of LLMs in HLS, identifying promising directions for optimized hardware design in applications such as AI acceleration, embedded systems, and high-performance computing.
arXiv Detail & Related papers (2024-08-19T21:40:28Z)
FuseGen: PLM Fusion for Data-generation based Zero-shot Learning [18.51772808242954]
FuseGen is a novel data generation-based zero-shot learning framework. It introduces a new criteria for subset selection from synthetic datasets. The chosen subset provides in-context feedback to each PLM, enhancing dataset quality.
arXiv Detail & Related papers (2024-06-18T11:55:05Z)
KAXAI: An Integrated Environment for Knowledge Analysis and Explainable AI [0.0]
The paper describes the design of a system that integrates AutoML, XAI, and synthetic data generation. The system allows users to navigate and harness the power of machine learning while abstracting its complexities and providing high usability.
arXiv Detail & Related papers (2023-12-30T10:20:47Z)
Curated LLM: Synergy of LLMs and Data Curation for tabular augmentation in low-data regimes [57.62036621319563]
We introduce CLLM, which leverages the prior knowledge of Large Language Models (LLMs) for data augmentation in the low-data regime. We demonstrate the superior performance of CLLM in the low-data regime compared to conventional generators.
arXiv Detail & Related papers (2023-12-19T12:34:46Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
Towards Automated Imbalanced Learning with Deep Hierarchical Reinforcement Learning [57.163525407022966]
Imbalanced learning is a fundamental challenge in data mining, where there is a disproportionate ratio of training samples in each class. Over-sampling is an effective technique to tackle imbalanced learning through generating synthetic samples for the minority class. We propose AutoSMOTE, an automated over-sampling algorithm that can jointly optimize different levels of decisions.
arXiv Detail & Related papers (2022-08-26T04:28:01Z)
TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets. We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
High Dimensional Level Set Estimation with Bayesian Neural Network [58.684954492439424]
This paper proposes novel methods to solve the high dimensional Level Set Estimation problems using Bayesian Neural Networks. For each problem, we derive the corresponding theoretic information based acquisition function to sample the data points. Numerical experiments on both synthetic and real-world datasets show that our proposed method can achieve better results compared to existing state-of-the-art approaches.
arXiv Detail & Related papers (2020-12-17T23:21:53Z)

This list is automatically generated from the titles and abstracts of the papers in this site.