Meta-Learning and Synthetic Data for Automated Pretraining and Finetuning
- URL: http://arxiv.org/abs/2506.12161v1
- Date: Wed, 11 Jun 2025 12:48:45 GMT
- Title: Meta-Learning and Synthetic Data for Automated Pretraining and Finetuning
- Authors: Fabio Ferreira
- Abstract summary: The growing number of pretrained models in Machine Learning (ML) presents significant challenges for practitioners. As models grow in scale, the increasing reliance on real-world data poses a bottleneck for training and requires leveraging data more effectively. This dissertation adopts meta-learning to extend automated machine learning to the deep learning domain.
- Score: 2.657867981416885
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The growing number of pretrained models in Machine Learning (ML) presents significant challenges for practitioners. Given a new dataset, they need to determine the most suitable deep learning (DL) pipeline, consisting of a pretrained model and the hyperparameters for finetuning it on that dataset. Moreover, as models grow in scale, the increasing reliance on real-world data poses a bottleneck for training and requires leveraging data more effectively. Addressing the first challenge often involves manual model selection and hyperparameter tuning. At the same time, as models grow larger and more of the available human-generated data is consumed for training, data augmentation and synthetic data become critical. Automated machine learning offers a path to address these challenges but is traditionally designed for tabular data and classical ML methods. This dissertation adopts meta-learning to extend automated machine learning to the deep learning domain. We propose empirical approaches to automate DL pipeline selection for Computer Vision tasks, using prior task knowledge to learn surrogate models for pipeline ranking. Extending these methods to the language domain, we learn to finetune large language models, and we show that our approach can outperform standard finetuning of foundation models. Additionally, we meta-learn data augmentation and synthetic data to enhance performance in upstream and downstream tasks. We empirically demonstrate the underestimated importance of data augmentation in Self-Supervised Learning and meta-learn advanced data augmentation strategies. Leveraging synthetic data, we also propose to meta-learn neural synthetic data generators as proxies for Reinforcement Learning (RL) environments. Finally, we learn a multiple-environment world model in an in-context learning fashion, using purely synthetic, randomly sampled data.
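As a concrete illustration of the surrogate-based pipeline ranking described above, the sketch below trains a performance predictor on prior (dataset, pipeline) observations and ranks candidate pipelines for a new task. The meta-features, pipeline descriptors, and accuracies are random placeholders, not the dissertation's actual meta-dataset or surrogate architecture.

```python
# Minimal sketch of a performance-prediction surrogate for DL pipeline
# ranking. All meta-features, pipeline descriptors, and scores below are
# synthetic placeholders standing in for a real meta-dataset of prior tasks.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# Prior knowledge: (dataset meta-features, pipeline descriptor) -> accuracy.
n_tasks, n_pipelines = 50, 20
task_feats = rng.normal(size=(n_tasks, 4))      # e.g. log #classes, log #samples, ...
pipe_feats = rng.normal(size=(n_pipelines, 3))  # e.g. model size, lr, epochs

X, y = [], []
for t in task_feats:
    for p in pipe_feats:
        X.append(np.concatenate([t, p]))
        y.append(rng.uniform(0.5, 1.0))         # placeholder finetuning accuracy
surrogate = GradientBoostingRegressor().fit(np.array(X), np.array(y))

# Rank all candidate pipelines for a new, unseen dataset.
new_task = rng.normal(size=4)
scores = surrogate.predict(np.array([np.concatenate([new_task, p]) for p in pipe_feats]))
ranking = np.argsort(scores)[::-1]
print("Pipelines ranked by predicted accuracy:", ranking[:5])
```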
Related papers
- Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets.
Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
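A minimal sketch of the concept-graph idea, with invented documents and a naive co-occurrence graph standing in for SynthLLM's actual extraction and recombination algorithm:

```python
# Hedged sketch: link concepts that co-occur within a document, then sample
# cross-document concept pairs (connected in the graph but never co-occurring)
# as seeds for synthetic prompts. Documents and the prompt template are
# illustrative stand-ins, not the paper's algorithm.
import itertools
import networkx as nx

docs = {
    "doc1": ["gradient descent", "learning rate", "convergence"],
    "doc2": ["learning rate", "warmup", "transformers"],
    "doc3": ["transformers", "attention", "convergence"],
}

G = nx.Graph()
for concepts in docs.values():
    for a, b in itertools.combinations(concepts, 2):
        G.add_edge(a, b)  # concepts co-occurring in a document are linked

# Recombine: pick concept pairs that are connected only across documents.
seeds = [(a, b) for a, b in itertools.combinations(G.nodes, 2)
         if nx.has_path(G, a, b) and not G.has_edge(a, b)]
for a, b in seeds[:3]:
    print(f"Synthetic prompt seed: explain how '{a}' relates to '{b}'")
```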
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
- DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data [61.62554324594797]
We propose DreamMask, which explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data.
In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods.
For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
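A minimal sketch of joint training on real plus synthetic data with a fixed mixing ratio; the toy datasets and the roughly 1:1 batch mix are placeholder assumptions, not DreamMask's exact recipe:

```python
# Mix real and synthetic samples in one loader, upweighting the smaller
# synthetic pool so each batch is roughly half real, half synthetic.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset, WeightedRandomSampler

real = TensorDataset(torch.randn(800, 3, 64, 64), torch.randint(0, 10, (800,)))
synth = TensorDataset(torch.randn(200, 3, 64, 64), torch.randint(0, 10, (200,)))
mixed = ConcatDataset([real, synth])

weights = torch.cat([torch.full((len(real),), 0.5 / len(real)),
                     torch.full((len(synth),), 0.5 / len(synth))])
loader = DataLoader(mixed, batch_size=32,
                    sampler=WeightedRandomSampler(weights, num_samples=len(mixed)))

for images, labels in loader:
    pass  # forward/backward pass of the segmentation model would go here
```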
arXiv Detail & Related papers (2025-01-03T19:00:00Z)
- Multi-Armed Bandit Approach for Optimizing Training on Synthetic Data [7.603659241572307]
We propose a novel UCB-based training procedure combined with a dynamic usability metric.
Our proposed metric integrates low-level and high-level information from synthetic images and their corresponding real and synthetic datasets.
We show that our metric is an effective way to rank synthetic images based on their usability.
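A toy UCB loop over synthetic data pools; the simulated per-pool reward stands in for the paper's usability metric, which combines low- and high-level image statistics:

```python
# Each "arm" is a pool of synthetic images; the reward is a noisy stand-in
# for a usability signal such as a validation-accuracy gain.
import math
import random

n_arms, horizon = 4, 200
true_usability = [0.3, 0.5, 0.7, 0.6]   # hidden quality of each synthetic pool
counts, means = [0] * n_arms, [0.0] * n_arms

for t in range(1, horizon + 1):
    if t <= n_arms:
        arm = t - 1                      # play each arm once to initialize
    else:
        arm = max(range(n_arms),
                  key=lambda a: means[a] + math.sqrt(2 * math.log(t) / counts[a]))
    reward = random.gauss(true_usability[arm], 0.1)
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]

print("Selections per synthetic pool:", counts)  # the best pool dominates
```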
arXiv Detail & Related papers (2024-12-06T23:36:36Z)
- Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic.
Our approach transforms numerical data into text, re-framing data generation as a language modeling task.
Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
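A small sketch of the text re-framing, with invented network-flow fields; a GPT-style model finetuned on such serialized lines can then sample new records:

```python
# Serialize numeric records to text lines a language model can learn to
# continue, and parse generated lines back into records. Field names and
# the flow records are invented examples, not the paper's dataset.
records = [
    {"duration": 1.2, "src_bytes": 4096, "dst_bytes": 128, "label": "malicious"},
    {"duration": 0.3, "src_bytes": 512, "dst_bytes": 2048, "label": "benign"},
]

def to_text(rec):
    # A deterministic field order keeps the "grammar" easy for the model.
    return " ".join(f"{k}={v}" for k, v in sorted(rec.items()))

def from_text(line):
    # Inverse mapping: parse generated text back into a numeric record.
    out = {}
    for tok in line.split():
        k, v = tok.split("=")
        out[k] = v if k == "label" else float(v)
    return out

corpus = [to_text(r) for r in records]   # finetune a GPT-style model on these
print(corpus[0])
print(from_text(corpus[0]))
```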
arXiv Detail & Related papers (2024-11-04T09:51:10Z)
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
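A conceptual sketch of the failure-inducing loop, with trivial stand-ins for the proposer and target LLMs:

```python
# Propose candidate queries, probe the target model, and keep only the
# queries it fails on as new training data. Both models below are toy
# stand-ins, not ReverseGen's actual components.
import random

def proposer(n):
    # Stand-in for an LLM that proposes candidate arithmetic questions.
    return [(random.randint(10, 99), random.randint(10, 99)) for _ in range(n)]

def target_model(a, b):
    # Stand-in for the model under test; it sometimes "forgets" to carry.
    return a + b if random.random() > 0.3 else a + b - 10

failure_set = []
for a, b in proposer(100):
    if target_model(a, b) != a + b:            # evaluator flags a failure
        failure_set.append({"question": f"{a}+{b}=?", "answer": a + b})

print(f"Collected {len(failure_set)} failure-inducing training samples")
```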
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- Accelerating Large Language Model Pretraining via LFR Pedagogy: Learn, Focus, and Review [50.78587571704713]
Learn-Focus-Review (LFR) is a dynamic training approach that adapts to the model's learning progress.
LFR tracks the model's learning performance across data blocks (sequences of tokens) and prioritizes revisiting challenging regions of the dataset.
Compared to baseline models trained on the full datasets, LFR consistently achieved lower perplexity and higher accuracy.
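A minimal sketch of loss-tracked block revisiting; the exponential sampling rule and the toy training step are illustrative choices, not LFR's exact schedule:

```python
# Track a running loss estimate per data block and sample high-loss blocks
# more often, so training keeps "reviewing" challenging regions.
import math
import random

n_blocks = 8
block_loss = [3.0] * n_blocks                 # running loss estimate per block

def train_step_on(block_id):
    # Stand-in for a real training step; harder blocks improve more slowly.
    difficulty = 1.0 + 0.2 * block_id
    return max(0.5, block_loss[block_id] - random.uniform(0, 0.3) / difficulty)

for step in range(500):
    # Focus/review: sample blocks with probability proportional to exp(loss).
    weights = [math.exp(l) for l in block_loss]
    block = random.choices(range(n_blocks), weights=weights)[0]
    block_loss[block] = 0.9 * block_loss[block] + 0.1 * train_step_on(block)

print("Final per-block loss:", [round(l, 2) for l in block_loss])
```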
arXiv Detail & Related papers (2024-09-10T00:59:18Z)
- A survey of synthetic data augmentation methods in computer vision [0.0]
This paper presents an extensive review of synthetic data augmentation techniques.
We focus on the important data generation and augmentation techniques, general scope of application and specific use-cases.
We provide a summary of common synthetic datasets for training computer vision models.
arXiv Detail & Related papers (2024-03-15T07:34:08Z)
- Synthetic data, real errors: how (not) to publish and use synthetic data [86.65594304109567]
We show how the generative process affects the downstream ML task.
We introduce Deep Generative Ensemble (DGE) to approximate the posterior distribution over the generative process model parameters.
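A sketch of the ensemble idea, with Gaussian mixtures fitted on bootstrap resamples standing in for deep generative models:

```python
# Train several generators on bootstrap resamples and pool their synthetic
# draws, so downstream training reflects uncertainty over the generator
# itself rather than trusting a single fitted model.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
real = rng.normal(loc=[0, 3], scale=1.0, size=(500, 2))   # toy real dataset

ensemble = []
for seed in range(5):
    boot = real[rng.integers(0, len(real), size=len(real))]  # bootstrap resample
    ensemble.append(GaussianMixture(n_components=2, random_state=seed).fit(boot))

# Pool equal-sized synthetic draws from every ensemble member.
synthetic = np.vstack([g.sample(200)[0] for g in ensemble])
print("Pooled synthetic dataset:", synthetic.shape)
```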
arXiv Detail & Related papers (2023-05-16T07:30:29Z)
- Scalable Modular Synthetic Data Generation for Advancing Aerial Autonomy [2.9005223064604078]
We introduce a scalable Aerial Synthetic Data Augmentation (ASDA) framework tailored to aerial autonomy applications.
ASDA extends a central data collection engine with two scriptable pipelines that automatically perform scene and data augmentations.
We demonstrate the effectiveness of our method in automatically generating diverse datasets.
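A hedged sketch of what a scriptable augmentation pipeline could look like; the operations, scene fields, and engine are invented for illustration, not ASDA's actual API:

```python
# Declare scene- and data-level augmentations as a list of operations and
# apply them with a small engine; a renderer would turn each resulting
# scene spec into labeled aerial images.
import random

def randomize_weather(scene):
    scene["weather"] = random.choice(["clear", "fog", "rain"])
    return scene

def jitter_camera(scene):
    scene["camera_yaw"] = scene.get("camera_yaw", 0.0) + random.uniform(-5, 5)
    return scene

PIPELINE = [randomize_weather, jitter_camera]   # the "script"

def generate(n):
    for i in range(n):
        scene = {"id": i, "camera_yaw": 0.0}
        for op in PIPELINE:
            scene = op(scene)
        yield scene

for sample in generate(3):
    print(sample)
```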
arXiv Detail & Related papers (2022-11-10T04:37:41Z)
- SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
Basic cross-platform tensor frameworks and script language engines do not by themselves supply the procedures and pipelines needed to deploy machine learning capabilities in real production-grade systems.
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all these requirements while building on such engines.
arXiv Detail & Related papers (2021-12-22T14:45:37Z)
- Using GPT-2 to Create Synthetic Data to Improve the Prediction Performance of NLP Machine Learning Classification Models [0.0]
It is becoming common practice to utilize synthetic data to boost the performance of machine learning models.
I used a Yelp pizza restaurant reviews dataset and transfer learning to fine-tune a pre-trained GPT-2 transformer model to generate synthetic pizza review data.
I then combined this synthetic data with the original genuine data to create a new joint dataset.
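A sketch of the generation-and-pooling step with Hugging Face transformers; the base `gpt2` checkpoint stands in for the fine-tuned model, and the finetuning itself is omitted:

```python
# Generate synthetic reviews with a (fine-tuned) GPT-2 and pool them with
# the real data. The checkpoint and prompt are placeholders.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")  # swap in the finetuned checkpoint

prompt = "The pizza at this place"
outputs = generator(prompt, max_new_tokens=40, num_return_sequences=3,
                    do_sample=True, top_p=0.95)
synthetic_reviews = [o["generated_text"] for o in outputs]

real_reviews = ["Great crust, will come back.", "Cold pizza, slow service."]
joint_dataset = real_reviews + synthetic_reviews    # train the classifier on this
print(len(joint_dataset), "training reviews")
```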
arXiv Detail & Related papers (2021-04-02T20:20:42Z)