False perfection in machine prediction: Detecting and assessing
circularity problems in machine learning
- URL: http://arxiv.org/abs/2106.12417v1
- Date: Wed, 23 Jun 2021 14:11:06 GMT
- Title: False perfection in machine prediction: Detecting and assessing
circularity problems in machine learning
- Authors: Michael Hagmann, Stefan Riezler
- Abstract summary: We demonstrate a problem of machine learning in vital application areas such as medical informatics or patent law.
The inclusion of measurements on which target outputs are deterministically defined in the representations of input data leads to perfect, but circular predictions.
We argue that transferring research results to real-world applications requires avoiding circularity by separating measurements that define target outcomes from data representations.
- Score: 11.878820609988695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Machine learning algorithms train models from patterns of input data and
target outputs, with the goal of predicting correct outputs for unseen test
inputs. Here we demonstrate a problem of machine learning in vital application
areas such as medical informatics or patent law that consists of the inclusion
of measurements on which target outputs are deterministically defined in the
representations of input data. This leads to perfect but circular predictions
based on a machine reconstruction of the known target definition, which fail on
real-world data where the defining measurements may be unavailable or only
incompletely available. We present a circularity test that shows, for given datasets and
black-box machine learning models, whether the target functional definition can
be reconstructed and has been used in training. We argue that transferring
research results to real-world applications requires avoiding circularity by
separating measurements that define target outcomes from data representations
in machine learning.
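The circularity problem can be illustrated with a minimal, self-contained sketch. All data and the stump learner below are hypothetical stand-ins for a real dataset and black-box model: when the measurement that deterministically defines the target is part of the input representation, even a trivial learner scores perfectly, and performance collapses once that measurement is withheld, as it may be in deployment.

```python
import random

random.seed(0)

# Hypothetical dataset: the target is deterministically defined by a
# threshold on feature 2 (the "defining measurement"); features 0 and 1
# are pure noise.
def make_data(n):
    X = [[random.random() for _ in range(3)] for _ in range(n)]
    y = [1 if x[2] > 0.5 else 0 for x in X]  # the target definition
    return X, y

# A minimal learner: exhaustive search for the best single-feature
# threshold (a decision stump), standing in for any black-box model.
def fit_stump(X, y, features):
    best = None
    for f in features:
        for t in (x[f] for x in X):
            acc = sum((x[f] > t) == bool(label) for x, label in zip(X, y)) / len(y)
            if best is None or acc > best[2]:
                best = (f, t, acc)
    return best  # (feature, threshold, training accuracy)

def accuracy(stump, X, y):
    f, t, _ = stump
    return sum((x[f] > t) == bool(label) for x, label in zip(X, y)) / len(y)

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(200)

# With the defining measurement included: perfect, but circular, prediction.
circular = fit_stump(X_tr, y_tr, features=[0, 1, 2])
print("with defining feature:", accuracy(circular, X_te, y_te))   # near 1.0

# Without it (as in deployment, where it may be unavailable): near chance.
honest = fit_stump(X_tr, y_tr, features=[0, 1])
print("without defining feature:", accuracy(honest, X_te, y_te))
```

Withholding the suspected defining measurement and checking whether near-perfect accuracy persists is, in spirit, what the paper's circularity test probes for a given dataset and black-box model.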
Related papers
- Attribute-to-Delete: Machine Unlearning via Datamodel Matching [65.13151619119782]
Machine unlearning -- efficiently removing a small "forget set" of training data from a pre-trained machine learning model -- has recently attracted interest.
Recent research shows that existing machine unlearning techniques do not hold up in challenging evaluation settings.
arXiv Detail & Related papers (2024-10-30T17:20:10Z)
- Anticipated Network Surveillance -- An extrapolated study to predict cyber-attacks using Machine Learning and Data Analytics [0.0]
This paper discusses a novel technique to predict an upcoming attack in a network based on several data parameters.
The proposed model comprises dataset pre-processing and training, followed by a testing phase.
Based on the results of the testing phase, the best model is selected and used to extract the event class that may lead to an attack.
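The staged pipeline summarized above (pre-processing, training, testing, best-model selection) can be sketched as follows; the synthetic event records, the min-max scaler, and the two candidate models are illustrative placeholders, not the paper's actual components:

```python
import random

random.seed(1)

# Stand-in for network event records: two numeric parameters and a binary
# label (1 = event class that may precede an attack).
data = [([random.gauss(2.0 if y else 0.0, 1.0), random.random()], y)
        for y in [random.randint(0, 1) for _ in range(300)]]

# Pre-processing: min-max scale each parameter to [0, 1].
def scale(rows):
    cols = list(zip(*[x for x, _ in rows]))
    lo, hi = [min(c) for c in cols], [max(c) for c in cols]
    return [([(v - l) / (h - l) for v, l, h in zip(x, lo, hi)], y)
            for x, y in rows]

scaled = scale(data)
train, test = scaled[:200], scaled[200:]

# Two candidate models: a majority-class baseline and a 1-nearest-neighbour.
def majority(train):
    label = round(sum(y for _, y in train) / len(train))
    return lambda x: label

def one_nn(train):
    return lambda x: min(train, key=lambda r: sum((a - b) ** 2 for a, b in zip(r[0], x)))[1]

def accuracy(model, rows):
    return sum(model(x) == y for x, y in rows) / len(rows)

# Testing phase: evaluate each candidate and select the best.
candidates = {"majority": majority(train), "1-nn": one_nn(train)}
best_name = max(candidates, key=lambda n: accuracy(candidates[n], test))
print("selected model:", best_name)
```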
arXiv Detail & Related papers (2023-12-27T01:09:11Z)
- Validity problems in clinical machine learning by indirect data labeling using consensus definitions [18.18186817228833]
We demonstrate a validity problem of machine learning in the vital application area of disease diagnosis in medicine.
It arises when target labels in training data are determined by an indirect measurement, and the fundamental measurements needed to determine this indirect measurement are included in the input data representation.
arXiv Detail & Related papers (2023-11-06T11:14:48Z)
- Task-Aware Machine Unlearning and Its Application in Load Forecasting [4.00606516946677]
This paper introduces the concept of machine unlearning, specifically designed to remove the influence of part of the dataset on an already trained forecaster.
A performance-aware algorithm is proposed that evaluates the sensitivity of local model parameter changes using influence functions and sample re-weighting.
We tested the unlearning algorithms on linear, CNN, and Mixer-based load forecasters with a realistic load dataset.
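Influence-function unlearning can be sketched for the simplest case, a one-parameter least-squares forecaster, where the single Newton step implied by the influence function removes a sample exactly; the data and model here are illustrative, not the paper's forecasters.

```python
import random

random.seed(2)

# Synthetic data for a least-squares forecaster through the origin.
xs = [random.uniform(0, 10) for _ in range(50)]
ys = [1.7 * x + random.gauss(0, 0.5) for x in xs]

def fit(xs, ys):  # closed-form least squares: w = sum(x*y) / sum(x*x)
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

w = fit(xs, ys)

# Influence-function unlearning of sample i: one Newton step using the
# gradient contribution of the forgotten sample and the Hessian of the
# remaining data. For a quadratic loss this step is exact.
i = 7
grad_i = xs[i] * (w * xs[i] - ys[i])           # gradient of sample i at w
hess_rest = sum(x * x for x in xs) - xs[i] ** 2  # Hessian without sample i
w_unlearned = w + grad_i / hess_rest

# Check against full retraining on the remaining data.
w_retrained = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
print(abs(w_unlearned - w_retrained))  # agrees up to floating-point error
```

For richer models the Hessian inverse must be approximated, which is where the paper's sensitivity evaluation and sample re-weighting come in.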
arXiv Detail & Related papers (2023-08-28T08:50:12Z)
- Machine Unlearning for Causal Inference [0.6621714555125157]
It is important to enable a model to forget some of the information it has captured about a given user (machine unlearning).
This paper introduces the concept of machine unlearning for causal inference, particularly propensity score matching and treatment effect estimation.
The dataset used in the study is the Lalonde dataset, a widely used dataset for evaluating the effectiveness of job training programs.
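Propensity score matching itself can be sketched on synthetic data (not the Lalonde dataset); the binned propensity estimator below stands in for the usual logistic-regression fit:

```python
import random

random.seed(3)

# Synthetic observational data: one confounder x, treatment assignment
# depends on x, and the true treatment effect is 2.0.
n = 1000
x = [random.random() for _ in range(n)]
t = [1 if random.random() < 0.2 + 0.6 * xi else 0 for xi in x]
y = [2.0 * ti + 3.0 * xi + random.gauss(0, 0.1) for xi, ti in zip(x, t)]

# Estimate the propensity score e(x) = P(T=1 | x) by binning the confounder.
BINS = 10
treated = [0] * BINS
total = [0] * BINS
for xi, ti in zip(x, t):
    b = min(int(xi * BINS), BINS - 1)
    treated[b] += ti
    total[b] += 1

def prop(xi):
    b = min(int(xi * BINS), BINS - 1)
    return treated[b] / total[b]

# Match every treated unit to the control with the closest propensity score
# and average the outcome differences: the ATT estimate.
controls = [(prop(xi), yi) for xi, ti, yi in zip(x, t, y) if ti == 0]
diffs = [yi - min(controls, key=lambda c: abs(c[0] - prop(xi)))[1]
         for xi, ti, yi in zip(x, t, y) if ti == 1]
att = sum(diffs) / len(diffs)
print("estimated ATT:", round(att, 2))  # should land near the true effect 2.0
```

Unlearning a user here would mean removing that user's rows and efficiently updating both the propensity model and the matched-pair estimate, which is the setting the paper studies.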
arXiv Detail & Related papers (2023-08-24T17:27:01Z)
- TransferD2: Automated Defect Detection Approach in Smart Manufacturing using Transfer Learning Techniques [1.8899300124593645]
We propose a transfer learning approach, namely TransferD2, to correctly identify defects on a dataset of source objects.
Our proposed approach can be applied in defect detection applications where insufficient data is available for training a model and can be extended to identify imperfections in new unseen data.
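The general transfer-learning recipe (freeze a source-trained feature extractor, then fit only a small head on scarce target data) can be sketched as follows; the backbone, data, and nearest-class-mean head are illustrative stand-ins, not TransferD2's actual components:

```python
import random

random.seed(4)

# A frozen "pretrained" feature extractor, standing in for the backbone of
# a source-domain defect detector (the weights here are illustrative).
def backbone(x):
    return [x[0] + x[1], x[0] - x[1]]

# Small target dataset: too few samples to train a model from scratch.
target = [([random.gauss(1, 0.2), random.gauss(1, 0.2)], 1) for _ in range(10)] + \
         [([random.gauss(-1, 0.2), random.gauss(-1, 0.2)], 0) for _ in range(10)]

# Transfer step: keep the backbone frozen and fit only a tiny head,
# here a nearest-class-mean classifier in feature space.
feats = [(backbone(x), y) for x, y in target]
mean = lambda vs: [sum(c) / len(c) for c in zip(*vs)]
centroids = {c: mean([f for f, y in feats if y == c]) for c in (0, 1)}

def predict(x):
    f = backbone(x)
    return min(centroids, key=lambda c: sum((a - b) ** 2 for a, b in zip(f, centroids[c])))

acc = sum(predict(x) == y for x, y in target) / len(target)
print("target accuracy:", acc)
```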
arXiv Detail & Related papers (2023-02-26T13:24:46Z)
- Prediction-Powered Inference [68.97619568620709]
Prediction-powered inference is a framework for performing valid statistical inference when an experimental dataset is supplemented with predictions from a machine-learning system.
The framework yields simple algorithms for computing provably valid confidence intervals for quantities such as means, quantiles, and linear and logistic regression coefficients.
Prediction-powered inference could enable researchers to draw valid and more data-efficient conclusions using machine learning.
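For the simplest estimand, a mean, the prediction-powered point estimate is the average prediction on the unlabeled data minus a rectifier: the average prediction error measured on the labeled data. A minimal sketch with a synthetic, deliberately biased predictor (confidence intervals, the framework's main contribution, are omitted):

```python
import random

random.seed(5)

# A hypothetical black-box predictor f with a systematic bias of +0.5.
f = lambda x: x + 0.5

# A small labeled sample (x, y) and a large unlabeled sample of x values.
labeled = [(x, x + random.gauss(0, 1)) for x in (random.gauss(10, 2) for _ in range(100))]
unlabeled = [random.gauss(10, 2) for _ in range(10000)]

mean = lambda v: sum(v) / len(v)

# Classical estimate of E[Y]: uses the labels alone.
classical = mean([y for _, y in labeled])

# Prediction-powered estimate: predictions on the large unlabeled sample,
# corrected by a rectifier, the bias of f measured on the labeled sample.
rectifier = mean([f(x) - y for x, y in labeled])
ppi = mean([f(x) for x in unlabeled]) - rectifier

print("classical:", round(classical, 2), " prediction-powered:", round(ppi, 2))
```

The rectifier is what keeps the estimate valid even when the predictor is biased; without it, the unlabeled average would inherit f's systematic error.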
arXiv Detail & Related papers (2023-01-23T18:59:28Z)
- Predicting Seriousness of Injury in a Traffic Accident: A New Imbalanced Dataset and Benchmark [62.997667081978825]
The paper introduces a new dataset to assess the performance of machine learning algorithms in the prediction of the seriousness of injury in a traffic accident.
The dataset is created by aggregating publicly available datasets from the UK Department for Transport.
arXiv Detail & Related papers (2022-05-20T21:15:26Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence and predicts target accuracy as the fraction of unlabeled examples whose confidence exceeds that threshold.
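A minimal sketch of ATC on synthetic confidence scores; the setup assumes the confidence-correctness relationship transfers across domains, which is the condition under which the method works:

```python
import random

random.seed(6)

# Synthetic setup: an example is answered correctly when the model's
# confidence clears a (noisy) difficulty level shared across domains;
# only the confidence distribution shifts between source and target.
def sample(mean_conf, n):
    out = []
    for _ in range(n):
        conf = random.gauss(mean_conf, 0.1)
        correct = conf > 0.55 + random.gauss(0, 0.02)
        out.append((conf, correct))
    return out

source, target = sample(0.8, 5000), sample(0.6, 5000)

# ATC step 1: on labeled source data, pick the threshold t so that the
# fraction of confidences above t matches the source accuracy.
src_acc = sum(c for _, c in source) / len(source)
confs = sorted(conf for conf, _ in source)
t = confs[int((1 - src_acc) * len(confs))]

# ATC step 2: predict target accuracy as the fraction of *unlabeled* target
# confidences above t; compare with the (normally unknown) ground truth.
predicted = sum(conf > t for conf, _ in target) / len(target)
actual = sum(c for _, c in target) / len(target)
print("predicted:", round(predicted, 3), " actual:", round(actual, 3))
```

The two numbers should roughly agree here; in practice the quality of the prediction degrades when the shift also changes how confidence relates to correctness.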
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- ALT-MAS: A Data-Efficient Framework for Active Testing of Machine Learning Algorithms [58.684954492439424]
We propose a novel framework to efficiently test a machine learning model using only a small amount of labeled test data.
The idea is to estimate the metrics of interest for a model-under-test using a Bayesian neural network (BNN).
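A heavily simplified sketch of the estimation step only: assuming the BNN surrogate yields a calibrated per-example probability that the model-under-test is correct, the metric can be estimated without labeling the whole pool. The surrogate outputs below are simulated, and the active label-selection part of ALT-MAS is omitted.

```python
import random

random.seed(7)

# Stand-in for a BNN surrogate: for each test input it returns p_i, the
# posterior probability that the model-under-test answers correctly.
# Calibration is assumed, so ground truth is simulated from p_i itself.
pool = [random.uniform(0.5, 1.0) for _ in range(2000)]  # surrogate p_i
truth = [random.random() < p for p in pool]             # actual correctness

# Label-free estimate of the metric of interest (accuracy): average the
# surrogate's per-example probabilities over the unlabeled pool.
estimated = sum(pool) / len(pool)
actual = sum(truth) / len(truth)
print("estimated:", round(estimated, 3), " actual:", round(actual, 3))
```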
arXiv Detail & Related papers (2021-04-11T12:14:04Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
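As a concrete instance of spectral-based subspace learning, here is principal component analysis on 2-D data via power iteration on the covariance matrix; this is a deliberately tiny stand-in, and the paper's uncertainty-aware formulation is not implemented:

```python
import random

random.seed(8)

# Correlated 2-D data: the dominant direction of variation is y = 0.5 x.
xs = [random.gauss(0, 3) for _ in range(500)]
pts = [(x, 0.5 * x + random.gauss(0, 0.3)) for x in xs]

# Sample covariance matrix of the (roughly zero-mean) data.
n = len(pts)
mx = sum(x for x, _ in pts) / n
my = sum(y for _, y in pts) / n
cxx = sum((x - mx) ** 2 for x, _ in pts) / n
cyy = sum((y - my) ** 2 for _, y in pts) / n
cxy = sum((x - mx) * (y - my) for x, y in pts) / n

# Power iteration: repeatedly applying the covariance matrix converges to
# its leading eigenvector, the 1-D subspace of maximum variance.
v = (1.0, 0.0)
for _ in range(100):
    w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    v = (w[0] / norm, w[1] / norm)

print("leading direction slope:", round(v[1] / v[0], 3))  # close to 0.5
```

Measurement inaccuracies of the kind the paper considers would perturb the covariance estimates above, and hence the recovered subspace, which motivates modeling the uncertainty explicitly.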
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.