False perfection in machine prediction: Detecting and assessing
circularity problems in machine learning
- URL: http://arxiv.org/abs/2106.12417v1
- Date: Wed, 23 Jun 2021 14:11:06 GMT
- Title: False perfection in machine prediction: Detecting and assessing
circularity problems in machine learning
- Authors: Michael Hagmann, Stefan Riezler
- Abstract summary: We demonstrate a problem of machine learning in vital application areas such as medical informatics or patent law.
The inclusion of measurements on which target outputs are deterministically defined in the representations of input data leads to perfect, but circular predictions.
We argue that transferring research results to real-world applications requires avoiding circularity by separating measurements that define target outcomes from data representations.
- Score: 11.878820609988695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning algorithms train models from patterns of input data and
target outputs, with the goal of predicting correct outputs for unseen test
inputs. Here we demonstrate a problem of machine learning in vital application
areas such as medical informatics or patent law that consists of the inclusion
of measurements on which target outputs are deterministically defined in the
representations of input data. This leads to perfect but circular predictions
based on a machine reconstruction of the known target definition, which fail on
real-world data where the defining measurements may be unavailable or only
incompletely available. We present a circularity test that shows, for given datasets and
black-box machine learning models, whether the target functional definition can
be reconstructed and has been used in training. We argue that transferring
research results to real-world applications requires avoiding circularity by
separating measurements that define target outcomes from data representations
in machine learning.
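The circularity problem can be illustrated with a minimal, self-contained sketch. All data and the stump learner below are hypothetical stand-ins for a real dataset and black-box model: when the measurement that deterministically defines the target is part of the input representation, even a trivial learner scores perfectly, and performance collapses once that measurement is withheld, as it may be in deployment.

```python
import random

random.seed(0)

# Hypothetical dataset: the target is deterministically defined by a
# threshold on feature 2 (the "defining measurement"); features 0 and 1
# are pure noise.
def make_data(n):
    X = [[random.random() for _ in range(3)] for _ in range(n)]
    y = [1 if x[2] > 0.5 else 0 for x in X]  # the target definition
    return X, y

# A minimal learner: exhaustive search for the best single-feature
# threshold (a decision stump), standing in for any black-box model.
def fit_stump(X, y, features):
    best = None
    for f in features:
        for t in (x[f] for x in X):
            acc = sum((x[f] > t) == bool(label) for x, label in zip(X, y)) / len(y)
            if best is None or acc > best[2]:
                best = (f, t, acc)
    return best  # (feature, threshold, training accuracy)

def accuracy(stump, X, y):
    f, t, _ = stump
    return sum((x[f] > t) == bool(label) for x, label in zip(X, y)) / len(y)

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(200)

# With the defining measurement included: perfect, but circular, prediction.
circular = fit_stump(X_tr, y_tr, features=[0, 1, 2])
print("with defining feature:", accuracy(circular, X_te, y_te))   # near 1.0

# Without it (as in deployment, where it may be unavailable): near chance.
honest = fit_stump(X_tr, y_tr, features=[0, 1])
print("without defining feature:", accuracy(honest, X_te, y_te))
```

Withholding the suspected defining measurement and checking whether near-perfect accuracy persists is, in spirit, what the paper's circularity test probes for a given dataset and black-box model.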
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that existing machine unlearning techniques do not hold up in challenging evaluation settings.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
- Anticipated Network Surveillance -- An extrapolated study to predict cyber-attacks using Machine Learning and Data Analytics [0.0]
This paper discusses a novel technique to predict an upcoming attack in a network based on several data parameters.
The proposed model comprises dataset pre-processing and training, followed by a testing phase.
Based on the results of the testing phase, the best model is selected and used to extract the event class that may lead to an attack.
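The staged pipeline summarized above (pre-processing, training, testing, best-model selection) can be sketched as follows; the synthetic event records, the min-max scaler, and the two candidate models are illustrative placeholders, not the paper's actual components:

```python
import random

random.seed(1)

# Stand-in for network event records: two numeric parameters and a binary
# label (1 = event class that may precede an attack).
data = [([random.gauss(2.0 if y else 0.0, 1.0), random.random()], y)
        for y in [random.randint(0, 1) for _ in range(300)]]

# Pre-processing: min-max scale each parameter to [0, 1].
def scale(rows):
    cols = list(zip(*[x for x, _ in rows]))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [([(v - l) / (h - l) for v, l, h in zip(x, lo, hi)], y)
            for x, y in rows]

scaled = scale(data)
train, test = scaled[:200], scaled[200:]

# Two candidate models: a majority-class baseline and a 1-nearest-neighbour.
def majority(train):
    label = round(sum(y for _, y in train) / len(train))
    return lambda x: label

def one_nn(train):
    return lambda x: min(train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x)))[1]

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

# Testing phase: evaluate each candidate and select the best.
candidates = {"majority": majority(train), "1-nn": one_nn(train)}
best_name = max(candidates, key=lambda n: accuracy(candidates[n], test))
print("selected model:", best_name)
```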
arXiv Detail & Related papers (2023-12-27T01:09:11Z)
- Validity problems in clinical machine learning by indirect data labeling using consensus definitions [18.18186817228833]
We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine.
It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation.
arXiv Detail & Related papers (2023-11-06T11:14:48Z)
- Task-Aware Machine Unlearning and Its Application in Load Forecasting [4.00606516946677]
This paper introduces the concept of machine unlearning, specifically designed to remove the influence of part of the dataset on an already trained forecaster.
A performance-aware algorithm is proposed that evaluates the sensitivity of local model parameter changes using influence functions and sample re-weighting.
We tested the unlearning algorithms on linear, CNN, and Mixer-based load forecasters with a realistic load dataset.
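Influence-function unlearning can be sketched for the simplest case, a one-parameter least-squares forecaster, where the single Newton step implied by the influence function removes a sample exactly; the data and model here are illustrative, not the paper's forecasters.

```python
import random

random.seed(2)

# Synthetic data for a least-squares forecaster through the origin.
xs = [random.uniform(0, 10) for _ in range(50)]
ys = [1.7 * x + random.gauss(0, 0.5) for x in xs]

def fit(xs, ys):  # closed-form least squares: w = sum(x*y) / sum(x*x)
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

w = fit(xs, ys)

# Influence-function unlearning of sample i: one Newton step using the
# gradient contribution of the forgotten sample and the Hessian of the
# remaining data. For a quadratic loss this step is exact.
i = 7
grad_i = xs[i] * (w * xs[i] - ys[i])           # gradient of sample i at w
hess_rest = sum(x * x for x in xs) - xs[i] ** 2  # Hessian without sample i
w_unlearned = w + grad_i / hess_rest

# Check against full retraining on the remaining data.
w_retrained = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
print(abs(w_unlearned - w_retrained))  # agrees up to floating-point error
```

For richer models the Hessian inverse must be approximated, which is where the paper's sensitivity evaluation and sample re-weighting come in.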
arXiv Detail & Related papers (2023-08-28T08:50:12Z)
- Machine Unlearning for Causal Inference [0.6621714555125157]
It is important to enable a model to forget some of the information it has captured about a given user (machine unlearning).
This paper introduces the concept of machine unlearning for causal inference, particularly propensity score matching and treatment effect estimation.
The dataset used in the study is the Lalonde dataset, a widely used dataset for evaluating the effectiveness of job training programs.
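Propensity score matching itself can be sketched on synthetic data (not the Lalonde dataset); the binned propensity estimator below stands in for the usual logistic-regression fit:

```python
import random

random.seed(3)

# Synthetic observational data: one confounder x, treatment assignment
# depends on x, and the true treatment effect is 2.0.
n = 1000
x = [random.random() for _ in range(n)]
t = [1 if random.random() < 0.2 + 0.6 * xi else 0 for xi in x]
y = [2.0 * ti + 3.0 * xi + random.gauss(0, 0.1) for xi, ti in zip(x, t)]

# Estimate the propensity score e(x) = P(T=1 | x) by binning the confounder.
BINS = 10
treated = [0] * BINS
total = [0] * BINS
for xi, ti in zip(x, t):
    b = min(int(xi * BINS), BINS - 1)
    treated[b] += ti
    total[b] += 1

def prop(xi):
    b = min(int(xi * BINS), BINS - 1)
    return treated[b] / total[b]

# Match every treated unit to the control with the closest propensity score
# and average the outcome differences: the ATT estimate.
controls = [(prop(xi), yi) for xi, ti, yi in zip(x, t, y) if ti == 0]
diffs = [yi - min(controls, key=lambda c: abs(c[0] - prop(xi)))[1]
         for xi, ti, yi in zip(x, t, y) if ti == 1]
att = sum(diffs) / len(diffs)
print("estimated ATT:", round(att, 2))  # should land near the true effect 2.0
```

Unlearning a user here would mean removing that user's rows and efficiently updating both the propensity model and the matched-pair estimate, which is the setting the paper studies.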
arXiv Detail & Related papers (2023-08-24T17:27:01Z)
- TransferD2: Automated Defect Detection Approach in Smart Manufacturing using Transfer Learning Techniques [1.8899300124593645]
We propose a transfer learning approach, namely TransferD2, to correctly identify defects on a dataset of source objects.
Our proposed approach can be applied in defect detection applications where insufficient data is available for training a model and can be extended to identify imperfections in new unseen data.
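The general transfer-learning recipe (freeze a source-trained feature extractor, then fit only a small head on scarce target data) can be sketched as follows; the backbone, data, and nearest-class-mean head are illustrative stand-ins, not TransferD2's actual components:

```python
import random

random.seed(4)

# A frozen "pretrained" feature extractor, standing in for the backbone of
# a source-domain defect detector (the weights here are illustrative).
def backbone(x):
    return [x[0] + x[1], x[0] - x[1]]

# Small target dataset: too few samples to train a model from scratch.
target = [([random.gauss(1, 0.2), random.gauss(1, 0.2)], 1) for _ in range(10)] + \
         [([random.gauss(-1, 0.2), random.gauss(-1, 0.2)], 0) for _ in range(10)]

# Transfer step: keep the backbone frozen and fit only a tiny head,
# here a nearest-class-mean classifier in feature space.
feats = [(backbone(x), y) for x, y in target]
mean = lambda vs: [sum(c) / len(c) for c in zip(*vs)]
centroids = {c: mean([f for f, y in feats if y == c]) for c in (0, 1)}

def predict(x):
    f = backbone(x)
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])))

acc = sum(predict(x) == y for x, y in target) / len(target)
print("target accuracy:", acc)
```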
arXiv Detail & Related papers (2023-02-26T13:24:46Z)
- Prediction-Powered Inference [68.97619568620709]
Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system.
The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients.
Prediction-powered inference could enable researchers to draw valid and more data-efficient conclusions using machine learning.
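For the simplest estimand, a mean, the prediction-powered point estimate is the average prediction on the unlabeled data minus a rectifier: the average prediction error measured on the labeled data. A minimal sketch with a synthetic, deliberately biased predictor (confidence intervals, the framework's main contribution, are omitted):

```python
import random

random.seed(5)

# A hypothetical black-box predictor f with a systematic bias of +0.5.
f = lambda x: x + 0.5

# A small labeled sample (x, y) and a large unlabeled sample of x values.
labeled = [(x, x + random.gauss(0, 1)) for x in (random.gauss(10, 2) for _ in range(100))]
unlabeled = [random.gauss(10, 2) for _ in range(10000)]

mean = lambda v: sum(v) / len(v)

# Classical estimate of E[Y]: uses the labels alone.
classical = mean([y for _, y in labeled])

# Prediction-powered estimate: predictions on the large unlabeled sample,
# corrected by a rectifier, the bias of f measured on the labeled sample.
rectifier = mean([f(x) - y for x, y in labeled])
ppi = mean([f(x) for x in unlabeled]) - rectifier

print("classical:", round(classical, 2), " prediction-powered:", round(ppi, 2))
```

The rectifier is what keeps the estimate valid even when the predictor is biased; without it, the unlabeled average would inherit f's systematic error.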
arXiv Detail & Related papers (2023-01-23T18:59:28Z)
- Predicting Seriousness of Injury in a Traffic Accident: A New Imbalanced Dataset and Benchmark [62.997667081978825]
The paper introduces a new dataset to assess the performance of machine learning algorithms in the prediction of the seriousness of injury in a traffic accident.
The dataset is created by aggregating publicly available datasets from the UK Department for Transport.
arXiv Detail & Related papers (2022-05-20T21:15:26Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
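A minimal sketch of ATC on synthetic confidence scores; the setup assumes the confidence-correctness relationship transfers across domains, which is the condition under which the method works:

```python
import random

random.seed(6)

# Synthetic setup: an example is answered correctly when the model's
# confidence clears a (noisy) difficulty level shared across domains;
# only the confidence distribution shifts between source and target.
def sample(mean_conf, n):
    out = []
    for _ in range(n):
        conf = random.gauss(mean_conf, 0.1)
        correct = conf > 0.55 + random.gauss(0, 0.02)
        out.append((conf, correct))
    return out

source, target = sample(0.8, 5000), sample(0.6, 5000)

# ATC step 1: on labeled source data, pick the threshold t so that the
# fraction of confidences above t matches the source accuracy.
src_acc = sum(c for _, c in source) / len(source)
confs = sorted(conf for conf, _ in source)
t = confs[int((1 - src_acc) * len(confs))]

# ATC step 2: predict target accuracy as the fraction of *unlabeled* target
# confidences above t; compare with the (normally unknown) ground truth.
predicted = sum(conf > t for conf, _ in target) / len(target)
actual = sum(c for _, c in target) / len(target)
print("predicted:", round(predicted, 3), " actual:", round(actual, 3))
```

The two numbers should roughly agree here; in practice the quality of the prediction degrades when the shift also changes how confidence relates to correctness.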
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
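A heavily simplified sketch of the estimation step only: assuming the BNN surrogate yields a calibrated per-example probability that the model-under-test is correct, the metric can be estimated without labeling the whole pool. The surrogate outputs below are simulated, and the active label-selection part of ALT-MAS is omitted.

```python
import random

random.seed(7)

# Stand-in for a BNN surrogate: for each test input it returns p_i, the
# posterior probability that the model-under-test answers correctly.
# Calibration is assumed, so ground truth is simulated from p_i itself.
pool = [random.uniform(0.5, 1.0) for _ in range(2000)]  # surrogate p_i
truth = [random.random() < p for p in pool]             # actual correctness

# Label-free estimate of the metric of interest (accuracy): average the
# surrogate's per-example probabilities over the unlabeled pool.
estimated = sum(pool) / len(pool)
actual = sum(truth) / len(truth)
print("estimated:", round(estimated, 3), " actual:", round(actual, 3))
```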
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
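As a concrete instance of spectral-based subspace learning, here is principal component analysis on 2-D data via power iteration on the covariance matrix; this is a deliberately tiny stand-in, and the paper's uncertainty-aware formulation is not implemented:

```python
import random

random.seed(8)

# Correlated 2-D data: the dominant direction of variation is y = 0.5 x.
xs = [random.gauss(0, 3) for _ in range(500)]
pts = [(x, 0.5 * x + random.gauss(0, 0.3)) for x in xs]

# Sample covariance matrix of the (roughly zero-mean) data.
n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
cxx = sum((x - mx) ** 2 for x, _ in pts) / n
cyy = sum((y - my) ** 2 for _, y in pts) / n
cxy = sum((x - mx) * (y - my) for x, y in pts) / n

# Power iteration: repeatedly applying the covariance matrix converges to
# its leading eigenvector, the 1-D subspace of maximum variance.
v = (1.0, 0.0)
for _ in range(100):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    v = (w[0] / norm, w[1] / norm)

print("leading direction slope:", round(v[1] / v[0], 3))  # close to 0.5
```

Measurement inaccuracies of the kind the paper considers would perturb the covariance estimates above, and hence the recovered subspace, which motivates modeling the uncertainty explicitly.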
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.