Optimizing the AI Development Process by Providing the Best Support
Environment
- URL: http://arxiv.org/abs/2305.00136v3
- Date: Tue, 15 Aug 2023 08:15:19 GMT
- Title: Optimizing the AI Development Process by Providing the Best Support
Environment
- Authors: Taha Khamis, Hamam Mokayed
- Abstract summary: The main stages of machine learning are problem understanding, data management, model building, model deployment, and maintenance.
The framework was built in Python to perform data augmentation using deep-learning advancements.
- Score: 0.756282840161499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The purpose of this study is to investigate the development process for
artificial intelligence (AI) and machine learning (ML) applications in order to
provide the best support environment. The main stages of ML are problem
understanding, data management, model building, model deployment, and
maintenance. This project focuses on the data management stage of ML
development and its obstacles, since it is the most important stage of machine
learning development: the accuracy of the end model depends on the kind of data
fed into it. The biggest obstacle found at this stage was the lack of
sufficient data for model learning, especially in fields where data is
confidential. This project aimed to build and develop a framework for
researchers and developers that helps solve the lack of sufficient data during
the data management stage. The framework utilizes several data augmentation
techniques that generate new data from the original dataset, which can improve
the overall performance of ML applications by increasing the quantity and
quality of the data available to feed the model. The framework was built in
Python and performs data augmentation using deep-learning advancements.
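The abstract does not specify which augmentation techniques or libraries the framework actually uses, so the following is only a minimal Python sketch of the general idea it describes: generating additional training samples from an original dataset through standard transformations. The image-data assumption, the torchvision pipeline, and the `expand_dataset` helper are illustrative placeholders, not details taken from the paper.

```python
# Illustrative sketch only: the paper's framework and exact augmentation
# techniques are not described in the abstract. This assumes an image dataset
# and uses torchvision transforms to derive new samples from the originals.
from PIL import Image
import torchvision.transforms as T

# A small augmentation pipeline: each call produces a new, slightly
# different variant of the input image.
augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=15),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.RandomResizedCrop(size=224, scale=(0.8, 1.0)),
])

def expand_dataset(image_paths, copies_per_image=4):
    """Generate `copies_per_image` augmented variants of each original image."""
    augmented = []
    for path in image_paths:
        original = Image.open(path).convert("RGB")
        for _ in range(copies_per_image):
            augmented.append(augment(original))
    return augmented
```

Similar transform pipelines exist for text and tabular data; the key point is that each augmented copy is derived from, and labelled like, its original sample, which is how augmentation increases the quantity of usable training data.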
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
arXiv Detail & Related papers (2024-10-22T06:43:28Z) - A Survey on Data Synthesis and Augmentation for Large Language Models [35.59526251210408]
This paper reviews and summarizes data generation techniques throughout the lifecycle of Large Language Models.
We discuss the current constraints faced by these methods and investigate potential pathways for future development and research.
arXiv Detail & Related papers (2024-10-16T16:12:39Z) - The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective [53.48484062444108]
We find that the development of models and data is not two separate paths but rather interconnected.
On the one hand, vaster and higher-quality data contribute to better performance of MLLMs; on the other hand, MLLMs can facilitate the development of data.
To promote the data-model co-development for MLLM community, we systematically review existing works related to MLLMs from the data-model co-development perspective.
arXiv Detail & Related papers (2024-07-11T15:08:11Z) - AI Competitions and Benchmarks: Dataset Development [42.164845505628506]
This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience.
We outline the tasks involved in dataset development and offer insights into their effective management.
Then, we provide more details about the implementation process which includes data collection, transformation, and quality evaluation.
arXiv Detail & Related papers (2024-04-15T12:01:42Z) - The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z) - An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z) - Data Management For Training Large Language Models: A Survey [64.18200694790787]
Data plays a fundamental role in training Large Language Models (LLMs).
This survey aims to provide a comprehensive overview of current research in data management within both the pretraining and supervised fine-tuning stages of LLMs.
arXiv Detail & Related papers (2023-12-04T07:42:16Z) - Data-Centric Long-Tailed Image Recognition [49.90107582624604]
Long-tail models exhibit a strong demand for high-quality data.
Data-centric approaches aim to enhance both the quantity and quality of data to improve model performance.
There is currently a lack of research into the underlying mechanisms explaining the effectiveness of information augmentation.
arXiv Detail & Related papers (2023-11-03T06:34:37Z) - Towards Collaborative Intelligence: Routability Estimation based on
Decentralized Private Data [33.22449628584873]
In this work, we propose a Federated-Learning-based approach for well-studied machine learning applications in EDA.
Our approach allows an ML model to be collaboratively trained with data from multiple clients, but without explicit access to that data, respecting their data privacy; a minimal sketch of this collaborative-training idea appears after this list.
Experiments on a comprehensive dataset show that collaborative training improves accuracy by 11% compared with individual local models.
arXiv Detail & Related papers (2022-03-30T02:35:40Z) - Fix your Models by Fixing your Datasets [0.6058427379240697]
Current machine learning tools lack streamlined processes for improving data quality.
We introduce a systematic framework for finding noisy or mislabelled samples in the dataset.
We demonstrate the efficacy of our framework on public as well as private enterprise datasets of two Fortune 500 companies.
arXiv Detail & Related papers (2021-12-15T02:41:50Z)
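The "Fix your Models by Fixing your Datasets" entry above does not describe its framework's internals, so this is only a generic sketch of one common way to surface noisy or mislabelled samples: score every example with out-of-fold predictions and flag those whose given label receives low confidence. The synthetic dataset, logistic-regression model, and review threshold are assumptions for illustration, not the paper's method.

```python
# Illustrative sketch only: flag likely mislabelled samples by training with
# cross-validation and checking how much probability an out-of-fold model
# assigns to each sample's given label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
y[:10] = 1 - y[:10]  # deliberately corrupt a few labels for the demo

# Out-of-fold class probabilities: each sample is scored by a model
# that never saw it during training.
proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                          cv=5, method="predict_proba")

# Confidence the model assigns to each sample's *given* label.
given_label_conf = proba[np.arange(len(y)), y]

# Flag the least-trusted samples for manual review.
suspects = np.argsort(given_label_conf)[:15]
print("Indices to review:", suspects)
```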
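The federated-learning entry above ("Towards Collaborative Intelligence") likewise gives no implementation details; as a rough illustration of the collaborative-training idea it mentions, the sketch below runs a minimal federated-averaging loop in which each client updates a shared model on its private data and only the weights are averaged by the server. The linear model, learning rate, and synthetic client data are placeholders.

```python
# Illustrative sketch only: basic federated averaging. Raw data never leaves
# a client; only locally updated weights are sent back and averaged.
import numpy as np

def local_step(weights, X, y, lr=0.1):
    """One gradient step on a client's private data (linear model, squared loss)."""
    grad = X.T @ (X @ weights - y) / len(y)
    return weights - lr * grad

def federated_round(global_weights, client_data):
    """Each client trains locally; the server averages only the returned weights."""
    updates = [local_step(global_weights.copy(), X, y) for X, y in client_data]
    return np.mean(updates, axis=0)

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0, 0.5])
clients = []
for _ in range(3):
    X = rng.normal(size=(50, 3))
    clients.append((X, X @ true_w + rng.normal(scale=0.1, size=50)))

w = np.zeros(3)
for _ in range(200):
    w = federated_round(w, clients)
print("Learned weights:", np.round(w, 2))  # should approach true_w
```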