Quality of Data in Machine Learning
- URL: http://arxiv.org/abs/2112.09400v1
- Date: Fri, 17 Dec 2021 09:22:46 GMT
- Title: Quality of Data in Machine Learning
- Authors: Antti Kariluoto, Arto Pärnänen, Joni Kultanen, Jukka Soininen, Pekka Abrahamsson
- Abstract summary: The study refutes the starting assumption, concluding that in this case what matters about the data is its quality rather than its quantity.
- Score: 3.9998518782208774
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common assumption holds that machine learning models improve their
performance when they have more data to learn from. In this study, the authors
tested this assumption with an empirical experiment on novel vocational student
data. The experiment compared different machine learning algorithms while
varying the amount of data and the feature combinations available for training
and testing the models. It revealed that increasing the number of data records
or their sampling frequency does not immediately lead to significant gains in
model accuracy or performance; however, the variance of the accuracies does
diminish in the case of ensemble models. A similar phenomenon was observed when
increasing the number of input features. The study refutes the starting
assumption and concludes that, in this case, the significance of the data lies
in its quality rather than its quantity.
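The paper reports its comparisons only in prose; as a rough sketch of this style of experiment (assuming scikit-learn, with synthetic data standing in for the non-public vocational student dataset), the setup could look like:

```python
# Minimal sketch: vary the number of training records and input features,
# and track mean accuracy and its variance across repeated runs.
# Synthetic data is a stand-in for the (non-public) student dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for n_records in (100, 500, 1000, 2000):
    for n_feats in (5, 10, 20):
        accs = []
        for seed in range(10):  # repeat to estimate accuracy variance
            idx = rng.choice(len(X), size=n_records, replace=False)
            Xs, ys = X[idx][:, :n_feats], y[idx]
            X_tr, X_te, y_tr, y_te = train_test_split(
                Xs, ys, test_size=0.3, random_state=seed)
            model = RandomForestClassifier(random_state=seed).fit(X_tr, y_tr)
            accs.append(model.score(X_te, y_te))
        print(f"records={n_records:5d} features={n_feats:2d} "
              f"mean acc={np.mean(accs):.3f} var={np.var(accs):.5f}")
```

On data like this, the mean accuracy typically plateaus well before the largest record count, while the variance keeps shrinking for the ensemble model, which is the pattern the abstract describes.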
Related papers
- Fair Generalized Linear Mixed Models [0.0]
Fairness in machine learning aims to ensure that biases in the data and model inaccuracies do not lead to discriminatory decisions.
We present an algorithm that can handle both problems simultaneously.
arXiv Detail & Related papers (2024-05-15T11:42:41Z)
- Measuring and Improving Attentiveness to Partial Inputs with Counterfactuals [91.59906995214209]
We propose a new evaluation method, the Counterfactual Attentiveness Test (CAT).
CAT uses counterfactuals by replacing part of the input with its counterpart from a different example, expecting an attentive model to change its prediction.
We show that GPT-3 becomes less attentive with an increased number of demonstrations, while its accuracy on the test data improves.
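The abstract gives only the high-level recipe, but the swap-and-compare idea can be sketched in a few lines (the `predict` and `swap_part` callables are hypothetical stand-ins, not the authors' implementation):

```python
# Toy CAT-style check: splice part of one input into another and
# measure how often the model's prediction changes.
def cat_attentiveness(predict, examples, swap_part):
    """Fraction of examples whose prediction changes after the swap."""
    changed = 0
    for i, ex in enumerate(examples):
        other = examples[(i + 1) % len(examples)]   # counterpart example
        counterfactual = swap_part(ex, other)       # replace one input part
        if predict(counterfactual) != predict(ex):
            changed += 1
    return changed / len(examples)

# For a two-part input such as (premise, hypothesis) pairs:
# swap in the premise from a different example, keep the hypothesis.
swap_premise = lambda ex, other: (other[0], ex[1])
```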
arXiv Detail & Related papers (2023-11-16T06:27:35Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
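As a toy version of such reweighting (balancing a single spurious feature against the label, rather than the paper's optimization over thousands of correlations), one could weight each example by the inverse frequency of its (feature, label) cell:

```python
# Reweight so the spurious feature co-occurs with equal total weight
# for every label, making them independent in the weighted data.
from collections import Counter

def balancing_weights(features, labels):
    """Weight each example by 1 / count of its (feature, label) cell."""
    counts = Counter(zip(features, labels))
    return [1.0 / counts[(f, y)] for f, y in zip(features, labels)]

# e.g. feature = "document contains the word 'no'", binary label
feats  = [1, 1, 1, 0, 0, 1]
labels = [1, 1, 0, 0, 1, 1]
print(balancing_weights(feats, labels))  # cell (1, 1) is downweighted
```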
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- Analyzing Effects of Fake Training Data on the Performance of Deep Learning Systems [0.0]
Deep learning models frequently suffer from various problems such as class imbalance and lack of robustness to distribution shift.
With the advent of Generative Adversarial Networks (GANs) it is now possible to generate high-quality synthetic data.
We analyze the effect that various quantities of synthetic data, when mixed with original data, can have on a model's robustness to out-of-distribution data and the general quality of predictions.
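A minimal sketch of that mixing experiment, assuming a hypothetical `generate_synthetic` that samples labelled data from a trained GAN:

```python
# Train on original data plus varying fractions of synthetic data and
# compare accuracy on a held-out out-of-distribution (OOD) test set.
import numpy as np

def mix_and_evaluate(model, X_real, y_real, X_ood, y_ood,
                     generate_synthetic, fractions=(0.0, 0.25, 0.5, 1.0)):
    results = {}
    for frac in fractions:
        n_syn = int(frac * len(X_real))
        if n_syn:
            X_syn, y_syn = generate_synthetic(n_syn)  # hypothetical GAN sampler
            X_mix = np.concatenate([X_real, X_syn])
            y_mix = np.concatenate([y_real, y_syn])
        else:
            X_mix, y_mix = X_real, y_real
        results[frac] = model.fit(X_mix, y_mix).score(X_ood, y_ood)
    return results  # OOD accuracy as a function of the synthetic fraction
```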
arXiv Detail & Related papers (2023-03-02T13:53:22Z)
- On Inductive Biases for Machine Learning in Data Constrained Settings [0.0]
This thesis explores a different answer to the problem of learning expressive models in data constrained settings.
Instead of relying on big datasets to learn neural networks, we will replace some modules by known functions reflecting the structure of the data.
Our approach falls under the umbrella of "inductive biases", which can be defined as hypotheses about the data at hand that restrict the space of models to explore.
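As an illustration of the idea (not the thesis's actual architectures), one module can be a fixed, hand-chosen function rather than a learned one; here a seeded Fourier feature map feeds a trained linear head:

```python
# Replace a learned first layer with a known, hard-coded function;
# only the linear head is trained.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def fourier_features(X, n_features=64, seed=0):
    """Fixed (seeded) random-Fourier feature map: not learned from data."""
    rng = np.random.RandomState(seed)
    W = rng.normal(size=(X.shape[1], n_features))
    return np.cos(X @ W)

X, y = make_classification(n_samples=500, random_state=0)
head = LogisticRegression(max_iter=1000).fit(fourier_features(X), y)
print("train accuracy:", head.score(fourier_features(X), y))
```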
arXiv Detail & Related papers (2023-02-21T14:22:01Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
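A simple sketch of how such an externality could be detected, assuming per-group test sets and hypothetical dictionaries of candidate data sources:

```python
# Add training data from one source at a time and check accuracy on each
# population sub-group; a negative delta flags a data externality.
import numpy as np

def subgroup_effect(model, base, extra_sources, test_sets):
    """base and extra_sources values are (X, y); test_sets maps group -> (X, y)."""
    X0, y0 = base
    model.fit(X0, y0)
    before = {g: model.score(Xg, yg) for g, (Xg, yg) in test_sets.items()}
    report = {}
    for name, (Xs, ys) in extra_sources.items():
        model.fit(np.concatenate([X0, Xs]), np.concatenate([y0, ys]))
        report[name] = {g: model.score(Xg, yg) - before[g]
                        for g, (Xg, yg) in test_sets.items()}
    return report  # negative entries mean adding that source hurt that group
```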
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
We are given access to a set of expert models and their predictions, alongside some limited information about the datasets used to train them.
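One plausible instance-wise sketch, assuming each expert's training data is summarized by its mean (a simplification; the paper's actual weighting scheme may differ):

```python
# Weight each expert per test point by how close the point lies to a
# summary of that expert's training data, then average predictions.
import numpy as np

def instancewise_ensemble(x, experts, train_means, temp=1.0):
    """experts: list of predict functions; train_means: their data means."""
    dists = np.array([np.linalg.norm(x - m) for m in train_means])
    w = np.exp(-dists / temp)
    w /= w.sum()                      # softmax-style instance weights
    preds = np.array([f(x) for f in experts])
    return float(w @ preds)           # weighted prediction for this x
```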
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Automatic Data Augmentation via Invariance-Constrained Learning [94.27081585149836]
Underlying data structures are often exploited to improve the solution of learning tasks.
Data augmentation induces these symmetries during training by applying multiple transformations to the input data.
This work tackles these issues by automatically adapting the data augmentation while solving the learning task.
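A sketch of one way to phrase that as constrained learning (hypothetical `model`, `loss`, and `transforms` callables; updating the penalty weight like a dual variable, which may differ from the paper's exact algorithm):

```python
# Task loss plus a penalty when predictions change under augmentation;
# the penalty weight lambda is adapted by dual ascent.
def invariance_constrained_step(model, loss, transforms, x, y,
                                lam, eps=0.05, lr_dual=0.1):
    task = loss(model(x), y)
    # constraint: average prediction change under augmentation <= eps
    # (loss() is reused here as a divergence between two predictions)
    viol = sum(loss(model(t(x)), model(x)) for t in transforms) / len(transforms)
    objective = task + lam * (viol - eps)
    lam = max(0.0, lam + lr_dual * (viol - eps))  # dual ascent on lambda
    return objective, lam
```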
arXiv Detail & Related papers (2022-09-29T18:11:01Z)
- Intra-domain and cross-domain transfer learning for time series data -- How transferable are the features? [0.0]
This study aims to assess how transferable features are between different domains of time series data.
The effects of transfer learning are observed in terms of predictive performance of the models and their convergence rate during training.
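A sketch of the usual protocol behind such a study, written against a Keras-like API with a hypothetical `build_model`; the layer indexing and epoch counts are assumptions:

```python
# Pretrain on the source domain, then either freeze or fine-tune the
# feature layers on the target domain and compare the outcomes.
def transfer(build_model, source, target, freeze_features=True):
    model = build_model()
    model.fit(*source, epochs=50)            # pretrain on source domain
    if freeze_features:
        for layer in model.layers[:-1]:      # keep features, retrain head
            layer.trainable = False
        # (with real Keras, recompile the model after changing trainable)
    model.fit(*target, epochs=10)            # adapt on target domain
    return model
```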
arXiv Detail & Related papers (2022-01-12T12:55:21Z)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
- Synthesizing Irreproducibility in Deep Networks [2.28438857884398]
Modern-day deep networks suffer from irreproducibility (also referred to as nondeterminism or underspecification).
We show that even with a single nonlinearity and for very simple data and models, irreproducibility occurs.
Model complexity and the choice of nonlinearity also play significant roles in making deep models irreproducible.
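A quick way to reproduce the flavor of this result, using different random seeds as a stand-in for training nondeterminism:

```python
# Train two copies of the same model on identical data and count test
# points where their predictions disagree.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, random_state=0)
preds = []
for seed in (0, 1):  # identical data, different initialization
    clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=300,
                        random_state=seed).fit(X[:800], y[:800])
    preds.append(clf.predict(X[800:]))
disagree = np.mean(preds[0] != preds[1])
print(f"prediction disagreement between identically-trained models: {disagree:.1%}")
```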
arXiv Detail & Related papers (2021-02-21T21:51:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.