Toward Formal Data Set Verification for Building Effective Machine
Learning Models
- URL: http://arxiv.org/abs/2108.11220v1
- Date: Wed, 25 Aug 2021 13:22:24 GMT
- Title: Toward Formal Data Set Verification for Building Effective Machine
Learning Models
- Authors: Jorge López, Maxime Labonne and Claude Poletti
- Abstract summary: We present a formal approach for verifying a set of arbitrarily stated properties over a data set.
The proposed approach relies on the transformation of the data set into a first-order logic formula.
A prototype tool, which uses the z3 solver, has been developed.
- Score: 2.707154152696381
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: To train a machine learning model effectively, the data must be
properly collected. One way to guarantee proper collection is to verify that
the collected data set holds certain properties: for example, that it contains
samples across the whole input space, or that it is balanced w.r.t. the
different classes. We present a formal approach for verifying a set of
arbitrarily stated properties over a data set. The approach relies on
transforming the data set into a first-order logic formula, which can then be
verified w.r.t. the different properties, stated in the same logic. A
prototype tool built on the z3 solver has been developed; it takes as input a
set of properties stated in a formal language and formally verifies a given
data set w.r.t. the given set of properties. Preliminary experimental results
show the feasibility and performance of the proposed approach, as well as its
flexibility for expressing properties of interest.
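The key observation behind such verification is that, over a finite data set, first-order quantifiers reduce to finite conjunctions and disjunctions, which Python expresses with all and any. The sketch below is a hypothetical, solver-free illustration of the two example properties from the abstract (class balance and input-space coverage); the actual prototype instead encodes the data set and properties as first-order formulas discharged by z3, and all names and thresholds here are illustrative.

```python
from collections import Counter

def is_balanced(labels, tolerance=0.1):
    # Balance property: every class's empirical frequency is within
    # `tolerance` of the uniform share 1/|classes|. A finite "for all
    # classes ..." quantifier becomes Python's all(...).
    counts = Counter(labels)
    share = 1 / len(counts)
    return all(abs(n / len(labels) - share) <= tolerance
               for n in counts.values())

def covers_input_space(samples, bins=4, lo=0.0, hi=1.0):
    # Coverage property: every bin of a 1-D input space [lo, hi)
    # contains at least one sample ("for all bins, there exists a
    # sample in the bin").
    width = (hi - lo) / bins
    occupied = {min(int((x - lo) / width), bins - 1) for x in samples}
    return occupied == set(range(bins))

print(is_balanced([0, 1, 0, 1]))                 # → True (two balanced classes)
print(covers_input_space([0.1, 0.3, 0.6, 0.9]))  # → True (one sample per quarter)
```

Encoding the same checks as z3 formulas, as the prototype does, adds the ability to state arbitrary properties in one uniform logic and to obtain counterexamples when a property fails.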
Related papers
- Balancing Fairness and Accuracy in Data-Restricted Binary Classification [14.439413517433891]
This paper proposes a framework that models the trade-off between accuracy and fairness under four practical scenarios.
Experiments on three datasets demonstrate the utility of the proposed framework as a tool for quantifying the trade-offs.
arXiv Detail & Related papers (2024-03-12T15:01:27Z)
- Generating Survival Interpretable Trajectories and Data [2.4861619769660637]
The paper demonstrates the efficiency and properties of the proposed model using numerical experiments on synthetic and real datasets.
The code of the algorithm implementing the proposed model is publicly available.
arXiv Detail & Related papers (2024-02-19T18:02:10Z)
- Controllable Data Generation Via Iterative Data-Property Mutual Mappings [13.282793266390316]
We propose a framework to enhance VAE-based data generators with property controllability and ensure disentanglement.
The proposed framework is implemented on four VAE-based controllable generators to evaluate its performance on property error, disentanglement, generation quality, and training time.
arXiv Detail & Related papers (2023-10-11T17:34:56Z)
- Attesting Distributional Properties of Training Data for Machine Learning [15.2927830843089]
Several jurisdictions are preparing machine learning regulatory frameworks.
Draft regulations indicate that model trainers are required to show that training datasets have specific distributional properties.
We propose the notion of property attestation allowing a prover to demonstrate relevant distributional properties of training data to a verifier without revealing the data.
arXiv Detail & Related papers (2023-08-18T13:33:02Z)
- Example-Based Explainable AI and its Application for Remote Sensing Image Classification [0.0]
We show an example of an instance in a training dataset that is similar to the input data to be inferred.
Using a remote sensing image dataset from the Sentinel-2 satellite, the concept was successfully demonstrated.
arXiv Detail & Related papers (2023-02-03T03:48:43Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Conformal prediction for the design problem [72.14982816083297]
In many real-world deployments of machine learning, we use a prediction algorithm to choose what data to test next.
In such settings, there is a distinct type of distribution shift between the training and test data.
We introduce a method to quantify predictive uncertainty in such settings.
arXiv Detail & Related papers (2022-02-08T02:59:12Z)
- Attentive Prototypes for Source-free Unsupervised Domain Adaptive 3D Object Detection [85.11649974840758]
3D object detection networks tend to be biased towards the data they are trained on.
We propose a single-frame approach for source-free, unsupervised domain adaptation of lidar-based 3D object detectors.
arXiv Detail & Related papers (2021-11-30T18:42:42Z)
- DIVA: Dataset Derivative of a Learning Task [108.18912044384213]
We present a method to compute the derivative of a learning task with respect to a dataset.
A learning task is a function from a training set to the validation error, which can be represented by a trained deep neural network (DNN).
The "dataset derivative" is a linear operator, computed around the trained model, that informs how varying the weight of each training sample affects the validation error.
arXiv Detail & Related papers (2021-11-18T16:33:12Z)
- When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data [84.87772675171412]
We study the circumstances under which explanations of individual data points can improve modeling performance.
We make use of three existing datasets with explanations: e-SNLI, TACRED, SemEval.
arXiv Detail & Related papers (2021-02-03T18:57:08Z)
- Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.