How to Do Machine Learning with Small Data? -- A Review from an
Industrial Perspective
- URL: http://arxiv.org/abs/2311.07126v1
- Date: Mon, 13 Nov 2023 07:39:13 GMT
- Authors: Ivan Kraljevski, Yong Chul Ju, Dmitrij Ivanov, Constanze Tschöpe,
Matthias Wolff
- Abstract summary: The authors interpret the general term "small data" and its role in engineering and industrial applications.
Small data is defined in terms of various characteristics compared to big data, and a machine learning formalism is introduced.
Five critical challenges of machine learning with small data in industrial applications are presented.
- Score: 1.443696537295348
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Artificial intelligence has experienced a technological breakthrough in science,
industry, and everyday life in recent decades. The advancements can be
credited to the ever-increasing availability and miniaturization of
computational resources that resulted in exponential data growth. However,
because data is insufficient in some cases, employing machine
learning to solve complex tasks is not straightforward, or even possible. As a
result, machine learning with small data is gaining importance in data
science and in applications across several fields. The authors focus on interpreting
the general term "small data" and its role in engineering and industrial
applications. They give a brief overview of the most important industrial
applications of machine learning and small data. Small data is defined in terms
of various characteristics compared to big data, and a machine learning
formalism is introduced. Five critical challenges of machine learning with
small data in industrial applications are presented: unlabeled data, imbalanced
data, missing data, insufficient data, and rare events. Based on those
definitions, an overview of the considerations in domain representation and
data acquisition is given along with a taxonomy of machine learning approaches
in the context of small data.
Related papers
- AI Competitions and Benchmarks: Dataset Development [42.164845505628506]
This chapter provides a comprehensive overview of established methodological tools, enriched by our practical experience.
We develop the tasks involved in dataset development and offer insights into their effective management.
Then, we provide more details about the implementation process which includes data collection, transformation, and quality evaluation.
arXiv Detail & Related papers (2024-04-15T12:01:42Z)
- A Vision for Semantically Enriched Data Science [19.604667287258724]
Key areas such as utilizing domain knowledge and data semantics are areas where we have seen little automation.
We envision how leveraging "semantic" understanding and reasoning on data in combination with novel tools for data science automation can help with consistent and explainable data augmentation and transformation.
arXiv Detail & Related papers (2023-03-02T16:03:12Z)
- A Survey of Machine Unlearning [56.017968863854186]
Recent regulations now require that, on request, private information about a user must be removed from computer systems.
ML models often "remember" the old data.
Recent works on machine unlearning have not been able to completely solve the problem.
arXiv Detail & Related papers (2022-09-06T08:51:53Z)
- Advancing Reacting Flow Simulations with Data-Driven Models [50.9598607067535]
Key to effective use of machine learning tools in multi-physics problems is to couple them to physical and computer models.
The present chapter reviews some of the open opportunities for the application of data-driven reduced-order modeling of combustion systems.
arXiv Detail & Related papers (2022-09-05T16:48:34Z)
- A Survey of Learning on Small Data: Generalization, Optimization, and Challenge [101.27154181792567]
Learning on small data that approximates the generalization ability of big data is one of the ultimate purposes of AI.
This survey follows the active sampling theory under a PAC framework to analyze the generalization error and label complexity of learning on small data.
Multiple data applications that may benefit from efficient small data representation are surveyed.
arXiv Detail & Related papers (2022-07-29T02:34:19Z)
- Open Environment Machine Learning [84.90891046882213]
Conventional machine learning studies assume closed-world scenarios where important factors of the learning process hold invariant.
This article briefly introduces some advances in this line of research, focusing on techniques concerning emerging new classes, decremental/incremental features, changing data distributions, varied learning objectives, and discusses some theoretical issues.
arXiv Detail & Related papers (2022-06-01T11:57:56Z)
- Maximizing information from chemical engineering data sets: Applications to machine learning [61.442473332320176]
We identify four characteristics of data arising in chemical engineering applications that make applying classical artificial intelligence approaches difficult.
For each of these data characteristics, we discuss applications where these data characteristics arise and show how current chemical engineering research is extending the fields of data science and machine learning to incorporate these challenges.
arXiv Detail & Related papers (2022-01-25T01:25:45Z)
- Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective [16.480530590466472]
Data-centric AI practices are now becoming mainstream.
Many datasets in the real world are small, dirty, biased, and even poisoned.
For data quality, we study data validation and data cleaning techniques.
arXiv Detail & Related papers (2021-12-13T03:57:36Z)
- Understanding and Preparing Data of Industrial Processes for Machine Learning Applications [0.0]
This paper addresses the challenge of missing values due to sensor unavailability at different production units of nonlinear production lines.
In cases where only a small proportion of the data is missing, those missing values can often be imputed.
This paper presents a technique that makes it possible to use all of the available data without removing large numbers of observations.
arXiv Detail & Related papers (2021-09-08T07:39:11Z)
- Synthetic Data: Opening the data floodgates to enable faster, more directed development of machine learning methods [96.92041573661407]
Many ground-breaking advancements in machine learning can be attributed to the availability of a large volume of rich data.
Many large-scale datasets are highly sensitive, such as healthcare data, and are not widely available to the machine learning community.
Generating synthetic data with privacy guarantees provides one such solution.
arXiv Detail & Related papers (2020-12-08T17:26:10Z)
- Data science on industrial data -- Today's challenges in brown field applications [0.0]
This paper shows the state of the art and what to expect when working with stock machines in the field.
A major focus in this paper is on data collection which can be more cumbersome than most people might expect.
Data quality for machine learning applications is a challenge once leaving the laboratory.
arXiv Detail & Related papers (2020-06-10T10:05:16Z)