Iceberg: Enhancing HLS Modeling with Synthetic Data
- URL: http://arxiv.org/abs/2507.09948v2
- Date: Sat, 19 Jul 2025 21:32:24 GMT
- Title: Iceberg: Enhancing HLS Modeling with Synthetic Data
- Authors: Zijian Ding, Tung Nguyen, Weikai Li, Aditya Grover, Yizhou Sun, Jason Cong
- Abstract summary: Iceberg is a synthetic data augmentation approach that expands both large language model (LLM)-generated programs and weak labels of unseen design configurations. Our weak label generation method is integrated with an in-context model architecture, enabling meta-learning from actual and proximate labels.
- Score: 61.48659845413156
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning-based prediction models for High-Level Synthesis (HLS) of hardware designs often struggle to generalize. In this paper, we study how to close the generalizability gap of these models through pretraining on synthetic data and introduce Iceberg, a synthetic data augmentation approach that expands both large language model (LLM)-generated programs and weak labels of unseen design configurations. Our weak label generation method is integrated with an in-context model architecture, enabling meta-learning from actual and proximate labels. Iceberg improves the geometric mean modeling accuracy by $86.4\%$ when adapting to six real-world applications with few-shot examples and achieves $2.47\times$ and $1.12\times$ better offline DSE performance when adapting to two different test datasets. Our open-sourced code is here: https://github.com/UCLA-VAST/iceberg
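To make the weak-label idea concrete, here is a minimal sketch, not the authors' implementation (that lives in the linked repo). The toy design space, the k-NN weak labeler, and the ridge regressor standing in for the in-context model are all illustrative assumptions.

```python
# Hedged sketch of Iceberg-style weak labeling for an HLS QoR model.
# Everything below (design space, labeler, predictor) is assumed, not
# the method from https://github.com/UCLA-VAST/iceberg.
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a design space: each row is one pragma configuration.
pragma_configs = rng.integers(1, 16, size=(200, 4)).astype(float)
true_latency = (pragma_configs @ np.array([3.0, 1.5, 0.5, 2.0])
                + rng.normal(0, 1.0, 200))

# Few-shot setting: only a handful of configs carry "actual" labels.
labeled_idx = rng.choice(200, size=8, replace=False)
unlabeled_idx = np.setdiff1d(np.arange(200), labeled_idx)

def weak_labels(X_lab, y_lab, X_unlab, k=3):
    """Proximate labels for unseen configs via k-NN interpolation
    over the few actual labels (a placeholder weak labeler)."""
    d = np.linalg.norm(X_unlab[:, None] - X_lab[None], axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]
    return y_lab[nn].mean(axis=1)

y_weak = weak_labels(pragma_configs[labeled_idx], true_latency[labeled_idx],
                     pragma_configs[unlabeled_idx])

# An in-context model would ingest actual and weak (config, label) pairs
# as context; a ridge fit on the mixed set stands in for it here.
X_ctx = np.vstack([pragma_configs[labeled_idx], pragma_configs[unlabeled_idx]])
y_ctx = np.concatenate([true_latency[labeled_idx], y_weak])
w = np.linalg.solve(X_ctx.T @ X_ctx + 1e-3 * np.eye(4), X_ctx.T @ y_ctx)
print("predicted latency of a new config:", pragma_configs[0] @ w)
```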
Related papers
- Model Inversion with Layer-Specific Modeling and Alignment for Data-Free Continual Learning [19.12792297140574]
Continual learning aims to incrementally train a model on a sequence of tasks while retaining performance on prior ones. However, storing and replaying data is often infeasible due to privacy or security constraints. We propose Per-layer Model Inversion (PMI), inspired by faster convergence in single-layer optimization.
arXiv Detail & Related papers (2025-10-30T09:58:48Z)
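For intuition, a hedged sketch of model inversion as a replay source follows; PMI's per-layer schedule is not reproduced, and the frozen network, optimizer settings, and L2 prior are assumptions.

```python
# Hedged sketch of model inversion for replay-free continual learning;
# the paper's Per-layer Model Inversion inverts layer by layer, which
# this whole-model sketch only gestures at.
import torch
import torch.nn as nn

torch.manual_seed(0)
frozen = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128),
                       nn.ReLU(), nn.Linear(128, 10)).eval()
for p in frozen.parameters():
    p.requires_grad_(False)

def invert(target_class, steps=200, lr=0.1):
    """Synthesize an input the frozen model assigns to `target_class`,
    to be replayed when training on later tasks."""
    x = torch.randn(1, 1, 28, 28, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = frozen(x)
        # Maximize the target logit; an L2 prior keeps the input bounded.
        loss = -logits[0, target_class] + 1e-3 * x.pow(2).sum()
        loss.backward()
        opt.step()
    return x.detach()

replay = [invert(c) for c in range(3)]  # pseudo-exemplars for old classes
```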
- Limited Preference Data? Learning Better Reward Model with Latent Space Synthesis [19.056892622506826]
Reward modeling is crucial for aligning large language models with human preferences. Existing textual data synthesis methods are computationally expensive. We propose a novel framework for synthesizing preference data directly in the language model's latent embedding space.
arXiv Detail & Related papers (2025-09-30T10:48:50Z)
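A minimal sketch of the latent-space idea, assuming mixup-style interpolation between frozen-LLM embeddings (the paper's exact synthesis procedure differs) and a Bradley-Terry reward head:

```python
# Hedged sketch: synthesize preference data in embedding space instead of
# decoding and annotating new text. Interpolation is an assumed mechanism.
import torch
import torch.nn as nn

torch.manual_seed(0)
d = 64
# Pretend these are frozen-LLM embeddings of (chosen, rejected) responses.
chosen, rejected = torch.randn(32, d), torch.randn(32, d) - 0.5

def synthesize(pool, n):
    """New latent examples as convex combinations of real ones,
    avoiding any extra LLM decoding or human-annotation cost."""
    i, j = torch.randint(len(pool), (2, n))
    lam = torch.rand(n, 1)
    return lam * pool[i] + (1 - lam) * pool[j]

chosen_syn, rejected_syn = synthesize(chosen, 256), synthesize(rejected, 256)

# Train a reward head on real + synthetic pairs (Bradley-Terry loss).
head = nn.Linear(d, 1)
opt = torch.optim.Adam(head.parameters(), lr=1e-2)
pos = torch.cat([chosen, chosen_syn])
neg = torch.cat([rejected, rejected_syn])
for _ in range(100):
    opt.zero_grad()
    loss = -nn.functional.logsigmoid(head(pos) - head(neg)).mean()
    loss.backward()
    opt.step()
```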
- Aleph-Alpha-GermanWeb: Improving German-language LLM pre-training with model-based data curation and synthetic data generation [6.6723686572805185]
We introduce a German-language dataset pipeline that combines model-based filtering techniques with synthetic data generation. We use our pipeline to create Aleph-Alpha-GermanWeb, a large-scale German pre-training dataset. A comparison on German-language benchmarks, including MMMLU, shows significant performance gains of Aleph-Alpha-GermanWeb over FineWeb2 alone.
arXiv Detail & Related papers (2025-04-24T17:23:46Z)
- Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
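A toy sketch of concept recombination over a graph; the documents, the co-occurrence edges, and the random walk are assumptions standing in for SynthLLM's richer graph algorithm and prompting pipeline:

```python
# Hedged sketch of graph-based concept recombination for synthetic data.
import itertools
import random
import networkx as nx

random.seed(0)
docs = {
    "doc1": ["gradient descent", "convexity", "learning rate"],
    "doc2": ["learning rate", "warmup", "transformers"],
    "doc3": ["transformers", "attention", "scaling laws"],
}

# Concepts co-occurring in a document get an edge.
g = nx.Graph()
for concepts in docs.values():
    g.add_edges_from(itertools.combinations(concepts, 2))

def sample_concept_set(graph, k=3):
    """Random walk over the concept graph, so sampled sets can mix
    concepts that never co-occurred in a single source document."""
    node = random.choice(list(graph))
    out = [node]
    while len(out) < k:
        node = random.choice(list(graph[node]))
        if node not in out:
            out.append(node)
    return out

concepts = sample_concept_set(g)
prompt = f"Write a study note connecting: {', '.join(concepts)}."
print(prompt)  # this prompt would be sent to an LLM to produce synthetic text
```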
- DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data [61.62554324594797]
We propose DreamMask, which explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data. In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods. For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
arXiv Detail & Related papers (2025-01-03T19:00:00Z)
- ToolACE: Winning the Points of LLM Function Calling [139.07157814653638]
ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data.
We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
arXiv Detail & Related papers (2024-09-02T03:19:56Z)
- Foundational GPT Model for MEG [3.524869467682149]
We propose two classes of deep learning foundational models that can be trained using forecasting of unlabelled brain signals.
First, we consider a modified Wavenet; second, a modified Transformer-based (GPT2) model.
We compare the performance of these deep learning models with standard linear autoregressive (AR) modelling on MEG data.
arXiv Detail & Related papers (2024-04-14T13:48:24Z)
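The linear AR baseline that entry compares against fits in a few lines; the toy signal and the order p=8 below are assumptions, and the deep Wavenet/GPT2 variants are far beyond this snippet:

```python
# Hedged sketch of standard linear autoregressive (AR) forecasting,
# the baseline for the MEG foundational models above.
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(2000)
signal = np.sin(0.05 * t) + 0.1 * rng.standard_normal(2000)  # toy channel

p = 8  # AR order (assumed)
# Design matrix of lagged samples: predict x[t] from x[t-p..t-1].
X = np.stack([signal[i:len(signal) - p + i] for i in range(p)], axis=1)
y = signal[p:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# One-step-ahead forecasts over the whole series.
pred = X @ coef
print("one-step MSE:", np.mean((pred - y) ** 2))
```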
- From Fake to Real: Pretraining on Balanced Synthetic Images to Prevent Spurious Correlations in Image Recognition [64.59093444558549]
We propose a simple, easy-to-implement, two-step training pipeline that we call From Fake to Real (FFR).
By training on real and synthetic data separately, FFR does not expose the model to the statistical differences between real and synthetic data.
Our experiments show that FFR improves worst group accuracy over the state-of-the-art by up to 20% over three datasets.
arXiv Detail & Related papers (2023-08-08T19:52:28Z)
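A minimal sketch of the two-step recipe: pretrain on group-balanced synthetic data, then fine-tune on real data, so the model never contrasts real against synthetic statistics inside one objective. The architecture, loaders, and hyperparameters are placeholders, not FFR's actual setup.

```python
# Hedged sketch of the FFR two-step pipeline.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

def run_epochs(model, loader, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 256),
                      nn.ReLU(), nn.Linear(256, 10))

# Stand-ins: synthetic images would be generated balanced across groups.
synthetic_loader = DataLoader(TensorDataset(torch.randn(256, 3, 32, 32),
                                            torch.randint(10, (256,))),
                              batch_size=32)
real_loader = DataLoader(TensorDataset(torch.randn(128, 3, 32, 32),
                                       torch.randint(10, (128,))),
                         batch_size=32)

run_epochs(model, synthetic_loader, epochs=2, lr=1e-3)  # step 1: synthetic
run_epochs(model, real_loader, epochs=1, lr=1e-4)       # step 2: real only
```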
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
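A toy sketch of the ensemble idea: several generative models trained with different seeds/resamples, one downstream model per synthetic dataset, and predictions averaged to reflect uncertainty in the generative process. Per-class Gaussian generators and a nearest-centroid probe stand in for deep models here.

```python
# Hedged sketch of a Deep Generative Ensemble (DGE) over the generative
# process; all model choices below are simplifying assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.repeat([0, 1], 100)

def fit_generator(X, y, seed):
    """'Train' a generator on a bootstrap resample: per-class Gaussians."""
    r = np.random.default_rng(seed)
    idx = r.integers(len(X), size=len(X))
    Xb, yb = X[idx], y[idx]
    return {c: (Xb[yb == c].mean(0), Xb[yb == c].std(0)) for c in (0, 1)}

def sample(gen, n, seed):
    r = np.random.default_rng(seed)
    Xs = np.vstack([r.normal(*gen[c], size=(n, 2)) for c in (0, 1)])
    return Xs, np.repeat([0, 1], n)

def fit_downstream(Xs, ys):
    """Nearest-centroid probe trained purely on synthetic data."""
    mu = {c: Xs[ys == c].mean(0) for c in (0, 1)}
    return lambda q: int(np.linalg.norm(q - mu[1]) < np.linalg.norm(q - mu[0]))

ensemble = [fit_downstream(*sample(fit_generator(X, y, k), 200, k))
            for k in range(5)]
query = np.array([1.0, 1.0])
print("P(class=1 | query) ~", np.mean([m(query) for m in ensemble]))
```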
- Improved Convergence Guarantees for Shallow Neural Networks [91.3755431537592]
We prove convergence of depth-2 neural networks, trained via gradient descent, to a global minimum.
Our model has the following features: regression with quadratic loss function, fully connected feedforward architecture, ReLU activations, Gaussian data instances, and adversarial labels.
Our results strongly suggest that, at least in our model, the convergence phenomenon extends well beyond the "NTK regime".
arXiv Detail & Related papers (2022-12-05T14:47:52Z)
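The training setup that guarantee covers can be stated compactly; the notation below is assumed, not the paper's own symbols.

```latex
% Width-m, depth-2 ReLU network trained by gradient descent on quadratic
% loss, with Gaussian inputs and possibly adversarial labels y_i:
\[
  f(x; W, a) = \sum_{j=1}^{m} a_j \, \sigma\!\left( w_j^{\top} x \right),
  \qquad \sigma(z) = \max(z, 0),
\]
\[
  L(W) = \frac{1}{n} \sum_{i=1}^{n} \bigl( f(x_i; W, a) - y_i \bigr)^2,
  \qquad x_i \sim \mathcal{N}(0, I_d).
\]
```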
- Ensembling and Knowledge Distilling of Large Sequence Taggers for Grammatical Error Correction [0.0]
We investigate improvements to the GEC sequence tagging architecture with a focus on ensembling of cutting-edge Transformer-based encoders in Large configurations.
Our best ensemble achieves a new SOTA result with an $F_{0.5}$ score of 76.05 on BEA-2019 (test).
In addition, we perform knowledge distillation with a trained ensemble to generate new synthetic training datasets, "Troy-Blogs" and "Troy-1BW".
arXiv Detail & Related papers (2022-03-24T13:18:36Z)
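For reference, $F_{0.5}$ is the standard $F_\beta$ score with $\beta = 0.5$, weighting precision more heavily than recall, which matches GEC's preference for precise corrections. The edit counts in this sketch are made up:

```python
# F_beta from edit-level counts; beta=0.5 gives the GEC-standard F_{0.5}.
def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy counts (not the paper's): 80 correct edits, 15 spurious, 40 missed.
print(round(f_beta(80, 15, 40), 4))
```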
- Using GPT-2 to Create Synthetic Data to Improve the Prediction Performance of NLP Machine Learning Classification Models [0.0]
It is becoming common practice to utilize synthetic data to boost the performance of Machine Learning Models.
I used a Yelp pizza restaurant reviews dataset and transfer learning to fine-tune a pre-trained GPT-2 Transformer Model to generate synthetic pizza reviews data.
I then combined this synthetic data with the original genuine data to create a new joint dataset.
arXiv Detail & Related papers (2021-04-02T20:20:42Z)
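A hedged sketch of the generation step using the Hugging Face transformers API; the paper fine-tunes GPT-2 on Yelp pizza reviews first, whereas this snippet samples from the stock model, and the prompt and decoding settings are assumptions:

```python
# Sample synthetic review text from GPT-2 to augment a real dataset.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The pizza at this place was"
ids = tok(prompt, return_tensors="pt").input_ids
out = model.generate(ids, max_length=60, do_sample=True, top_k=50,
                     top_p=0.95, num_return_sequences=3,
                     pad_token_id=tok.eos_token_id)
synthetic_reviews = [tok.decode(o, skip_special_tokens=True) for o in out]
# These would be appended to the genuine reviews to form the joint dataset.
```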