Optimizing the AI Development Process by Providing the Best Support
Environment
- URL: http://arxiv.org/abs/2305.00136v3
- Date: Tue, 15 Aug 2023 08:15:19 GMT
- Title: Optimizing the AI Development Process by Providing the Best Support
Environment
- Authors: Taha Khamis, Hamam Mokayed
- Abstract summary: The main stages of machine learning are problem understanding, data management, model building, and model deployment and maintenance.
The framework was built in Python to perform data augmentation using deep learning advancements.
- Score: 0.756282840161499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The purpose of this study is to investigate the development
process for Artificial Intelligence (AI) and machine learning (ML)
applications in order to provide the best support environment. The main
stages of ML are problem understanding, data management, model building, and
model deployment and maintenance. This project focuses on the data
management stage of ML development and its obstacles, as it is the most
important stage of machine learning development: the accuracy of the end
model depends on the kind of data fed into the model. The biggest obstacle
found at this stage was the lack of sufficient data for model learning,
especially in fields where data is confidential. This project aimed to build
and develop a framework for researchers and developers that can help solve
the lack of sufficient data during the data management stage. The framework
utilizes several data augmentation techniques that generate new data from
the original dataset, which can improve the overall performance of ML
applications by increasing the quantity and quality of the data available
to feed the model. The framework was built in Python and performs data
augmentation using deep learning advancements.
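The abstract does not enumerate the specific augmentation techniques the framework implements; as a minimal, hypothetical sketch of the core idea — generating new training samples from an original dataset — the snippet below applies a few classical image augmentations in Python (the function name and parameters are illustrative, not from the paper):

```python
import numpy as np

def augment(image, rng):
    """Generate simple augmented variants of one image (H x W array in [0, 1])."""
    variants = [np.fliplr(image), np.flipud(image)]        # mirror flips
    variants += [np.rot90(image, k) for k in (1, 2, 3)]    # 90/180/270 rotations
    noisy = image + rng.normal(0.0, 0.05, image.shape)     # additive Gaussian noise
    variants.append(np.clip(noisy, 0.0, 1.0))
    return variants

rng = np.random.default_rng(0)
img = rng.random((8, 8))
out = augment(img, rng)
print(len(out))  # 6 augmented samples from one original
```

Deep-learning-based augmentation (e.g. generative models producing synthetic samples) extends the same principle: each original sample yields several new ones, enlarging the dataset without new data collection.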
Related papers
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
arXiv Detail & Related papers (2024-11-21T02:30:53Z) - RedPajama: an Open Dataset for Training Large Language Models [80.74772646989423]
We identify three core data-related challenges that must be addressed to advance open-source language models.
These include (1) transparency in model development, including the data curation process, (2) access to large quantities of high-quality data, and (3) availability of artifacts and metadata for dataset curation and analysis.
We release RedPajama-V1, an open reproduction of the LLaMA training dataset, and RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata.
arXiv Detail & Related papers (2024-11-19T09:35:28Z) - Training Data for Large Language Model [2.1178416840822027]
ChatGPT surpassed previous models in terms of parameters and the scale of its pretraining corpus.
ChatGPT achieved revolutionary performance improvements through fine-tuning on a vast amount of high-quality, human-annotated data.
This paper summarizes the current state of pretraining and fine-tuning data for training large-scale language models.
arXiv Detail & Related papers (2024-11-12T11:09:58Z) - Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - A Survey on Data Synthesis and Augmentation for Large Language Models [35.59526251210408]
This paper reviews and summarizes data generation techniques throughout the lifecycle of Large Language Models.
We discuss the current constraints faced by these methods and investigate potential pathways for future development and research.
arXiv Detail & Related papers (2024-10-16T16:12:39Z) - AI Competitions and Benchmarks: Dataset Development [42.164845505628506]
This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience.
We develop the tasks involved in dataset development and offer insights into their effective management.
Then, we provide more details about the implementation process which includes data collection, transformation, and quality evaluation.
arXiv Detail & Related papers (2024-04-15T12:01:42Z) - The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z) - An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z) - Data Management For Training Large Language Models: A Survey [64.18200694790787]
Data plays a fundamental role in training Large Language Models (LLMs)
This survey aims to provide a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs.
arXiv Detail & Related papers (2023-12-04T07:42:16Z) - Towards Collaborative Intelligence: Routability Estimation based on
Decentralized Private Data [33.22449628584873]
In this work, we propose a Federated-Learning-based approach for well-studied machine learning applications in EDA.
Our approach allows an ML model to be collaboratively trained with data from multiple clients but without explicit access to the data for respecting their data privacy.
Experiments on a comprehensive dataset show that collaborative training improves accuracy by 11% compared with individual local models.
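The summary above does not specify the aggregation rule used; federated averaging (FedAvg) is the standard choice and illustrates how a model can be trained collaboratively without the server ever seeing raw client data (names below are illustrative, not from the paper):

```python
import numpy as np

def federated_average(client_weights, client_sizes):
    """FedAvg: average client model weights, weighted by local dataset size."""
    total = sum(client_sizes)
    return sum(w * (n / total) for w, n in zip(client_weights, client_sizes))

# Three clients train locally; only weight vectors are shared, never raw data.
clients = [np.array([1.0, 2.0]), np.array([3.0, 4.0]), np.array([5.0, 6.0])]
sizes = [100, 100, 200]
print(federated_average(clients, sizes))  # [3.5 4.5]
```

Weighting by local dataset size keeps the aggregate consistent with training on the pooled data, while each client's samples stay on-premises.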
arXiv Detail & Related papers (2022-03-30T02:35:40Z) - Fix your Models by Fixing your Datasets [0.6058427379240697]
Current machine learning tools lack streamlined processes for improving data quality.
We introduce a systematic framework for finding noisy or mislabelled samples in the dataset.
We demonstrate the efficacy of our framework on public as well as private enterprise datasets of two Fortune 500 companies.
arXiv Detail & Related papers (2021-12-15T02:41:50Z)
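The "Fix your Models by Fixing your Datasets" entry does not describe how noisy or mislabelled samples are found; one common heuristic — flagging samples whose given label receives low predicted probability from a held-out (cross-validated) model — can be sketched as follows (the function name and threshold are assumptions, not from the paper):

```python
import numpy as np

def flag_suspect_labels(probs, labels, threshold=0.5):
    """Flag samples whose assigned label gets low predicted probability."""
    probs = np.asarray(probs)
    given = probs[np.arange(len(labels)), labels]  # prob of each given label
    return np.where(given < threshold)[0]

# Predicted class probabilities from a model not trained on these samples.
probs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]
labels = [0, 1, 0]            # third label disagrees with the model
print(flag_suspect_labels(probs, labels))  # [2]
```

Flagged indices are then candidates for manual review or relabelling rather than automatic deletion, since the model itself may be wrong.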
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.