Toward Formal Data Set Verification for Building Effective Machine
Learning Models
- URL: http://arxiv.org/abs/2108.11220v1
- Date: Wed, 25 Aug 2021 13:22:24 GMT
- Title: Toward Formal Data Set Verification for Building Effective Machine
Learning Models
- Authors: Jorge López, Maxime Labonne and Claude Poletti
- Abstract summary: We present a formal approach for verifying a set of arbitrarily stated properties over a data set.
The proposed approach relies on the transformation of the data set into a first-order logic formula.
A prototype tool, which uses the z3 solver, has been developed.
- Score: 2.707154152696381
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: To train a machine learning model effectively, the data must be
properly collected. One way to guarantee proper collection is to verify that
the collected data set holds certain properties: for example, that it contains
samples across the whole input space, or that it is balanced w.r.t. the
different classes. We present a formal approach for verifying a set of
arbitrarily stated properties over a data set. The approach relies on
transforming the data set into a first-order logic formula, which can then be
verified w.r.t. the different properties, stated in the same logic. A
prototype tool built on the z3 solver has been developed; it takes as input a
set of properties stated in a formal language and formally verifies a given
data set w.r.t. the given set of properties. Preliminary experimental results
show the feasibility and performance of the proposed approach, as well as its
flexibility for expressing properties of interest.
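The key observation behind such verification is that, over a finite data set, first-order quantifiers reduce to finite conjunctions and disjunctions, which Python expresses with all and any. The sketch below is a hypothetical, solver-free illustration of the two example properties from the abstract (class balance and input-space coverage); the actual prototype instead encodes the data set and properties as first-order formulas discharged by z3, and all names and thresholds here are illustrative.

```python
from collections import Counter

def is_balanced(labels, tolerance=0.1):
    # Balance property: every class's empirical frequency is within
    # `tolerance` of the uniform share 1/|classes|. A finite "for all
    # classes ..." quantifier becomes Python's all(...).
    counts = Counter(labels)
    share = 1 / len(counts)
    return all(abs(n / len(labels) - share) <= tolerance
               for n in counts.values())

def covers_input_space(samples, bins=4, lo=0.0, hi=1.0):
    # Coverage property: every bin of a 1-D input space [lo, hi)
    # contains at least one sample ("for all bins, there exists a
    # sample in the bin").
    width = (hi - lo) / bins
    occupied = {min(int((x - lo) / width), bins - 1) for x in samples}
    return occupied == set(range(bins))

print(is_balanced([0, 1, 0, 1]))                 # → True (two balanced classes)
print(covers_input_space([0.1, 0.3, 0.6, 0.9]))  # → True (one sample per quarter)
```

Encoding the same checks as z3 formulas, as the prototype does, adds the ability to state arbitrary properties in one uniform logic and to obtain counterexamples when a property fails.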
Related papers
- Balancing Fairness and Accuracy in Data-Restricted Binary Classification [14.439413517433891]
This paper proposes a framework that models the trade-off between accuracy and fairness under four practical scenarios.
Experiments on three datasets demonstrate the utility of the proposed framework as a tool for quantifying the trade-offs.
arXiv Detail & Related papers (2024-03-12T15:01:27Z)
- Generating Survival Interpretable Trajectories and Data [2.4861619769660637]
The paper demonstrates the efficiency and properties of the proposed model using numerical experiments on synthetic and real datasets.
The code of the algorithm implementing the proposed model is publicly available.
arXiv Detail & Related papers (2024-02-19T18:02:10Z)
- Controllable Data Generation Via Iterative Data-Property Mutual Mappings [13.282793266390316]
We propose a framework to enhance VAE-based data generators with property controllability and ensure disentanglement.
The proposed framework is implemented on four VAE-based controllable generators to evaluate its performance on property error, disentanglement, generation quality, and training time.
arXiv Detail & Related papers (2023-10-11T17:34:56Z)
- Attesting Distributional Properties of Training Data for Machine Learning [15.2927830843089]
Several jurisdictions are preparing machine learning regulatory frameworks.
Draft regulations indicate that model trainers are required to show that training datasets have specific distributional properties.
We propose the notion of property attestation allowing a prover to demonstrate relevant distributional properties of training data to a verifier without revealing the data.
arXiv Detail & Related papers (2023-08-18T13:33:02Z)
- Example-Based Explainable AI and its Application for Remote Sensing Image Classification [0.0]
We show an example of an instance in a training dataset that is similar to the input data to be inferred.
Using a remote sensing image dataset from the Sentinel-2 satellite, the concept was successfully demonstrated.
arXiv Detail & Related papers (2023-02-03T03:48:43Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
arXiv Detail & Related papers (2022-02-08T02:59:12Z)
- Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection [85.11649974840758]
3D object detection networks tend to be biased towards the data they are trained on.
We propose a single-frame approach for source-free, unsupervised domain adaptation of lidar-based 3D object detectors.
arXiv Detail & Related papers (2021-11-30T18:42:42Z)
- DIVA: Dataset Derivative of a Learning Task [108.18912044384213]
We present a method to compute the derivative of a learning task with respect to a dataset.
A learning task is a function from a training set to the validation error, which can be represented by a trained deep neural network (DNN).
The "dataset derivative" is a linear operator, computed around the trained model, that informs how varying the weight of each training sample affects the validation error.
arXiv Detail & Related papers (2021-11-18T16:33:12Z)
- When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data [84.87772675171412]
We study the circumstances under which explanations of individual data points can improve modeling performance.
We make use of three existing datasets with explanations: e-SNLI, TACRED, SemEval.
arXiv Detail & Related papers (2021-02-03T18:57:08Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.