Open Data Quality
- URL: http://arxiv.org/abs/2007.06540v2
- Date: Wed, 15 Jun 2022 08:20:12 GMT
- Title: Open Data Quality
- Authors: Anastasija Nikiforova
- Abstract summary: The proposed approach is applied to several open data sets to evaluate their quality.
It is important to be sure that this data is trustworthy and error-free, as its quality problems can lead to huge losses.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The research discusses how (open) data quality could be described, what
should be considered when developing a data quality management solution, and how it
could be applied to open data to check its quality. The proposed approach
focuses on the development of a data quality specification which can be executed to
get data quality evaluation results and to find errors in data and possible problems
which must be solved. The proposed approach is applied to several open data
sets to evaluate their quality. Open data is very popular and freely available to
every stakeholder; it is often used to make business decisions. It is
important to be sure that this data is trustworthy and error-free, as its quality
problems can lead to huge losses.
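As a rough illustration of what an executable data quality specification might look like, the Python sketch below treats a specification as a list of per-field rules and runs them over records to collect violations. It is not the author's specification language or tooling: the `Rule` structure, the field names `name` and `founded`, and the sample records are all hypothetical and exist only to show the idea.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

# A record is one row of an open data set, with every value kept as text.
Record = Dict[str, Optional[str]]


@dataclass
class Rule:
    """One data quality requirement for a single field (hypothetical structure)."""
    field: str
    description: str
    check: Callable[[Optional[str]], bool]


def evaluate(records: List[Record], spec: List[Rule]) -> List[dict]:
    """Run every rule against every record and collect the violations."""
    violations = []
    for index, record in enumerate(records):
        for rule in spec:
            value = record.get(rule.field)
            if not rule.check(value):
                violations.append(
                    {
                        "row": index,
                        "field": rule.field,
                        "value": value,
                        "rule": rule.description,
                    }
                )
    return violations


def is_non_empty(value: Optional[str]) -> bool:
    return value is not None and value.strip() != ""


def is_four_digit_year(value: Optional[str]) -> bool:
    return value is not None and len(value) == 4 and value.isdigit()


# Hypothetical specification: field names are assumptions, not from the paper.
SPEC = [
    Rule("name", "name must not be empty", is_non_empty),
    Rule("founded", "founded must be a four-digit year", is_four_digit_year),
]

# Hypothetical sample data standing in for an open data set.
SAMPLE = [
    {"name": "Company A", "founded": "1998"},
    {"name": "", "founded": "2005"},
    {"name": "Company C", "founded": "n/a"},
]

if __name__ == "__main__":
    for violation in evaluate(SAMPLE, SPEC):
        print(violation)
```

Keeping the rules as data separate from the `evaluate` function means the same specification can be rerun against other data sets, which is one plausible reading of "a specification which can be executed".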
Related papers
- A Guide to Misinformation Detection Datasets [5.673951146506489]
This guide aims to provide a roadmap for obtaining higher quality data and conducting more effective evaluations.
All datasets and other artifacts are available at https://misinfo-datasets.complexdatalab.com/.
arXiv Detail & Related papers (2024-11-07T18:47:39Z)
- Private, Augmentation-Robust and Task-Agnostic Data Valuation Approach for Data Marketplace [56.78396861508909]
PriArTa is an approach for computing the distance between the distribution of the buyer's existing dataset and the seller's dataset.
PriArTa is communication-efficient, enabling the buyer to evaluate datasets without needing access to the entire dataset from each seller.
arXiv Detail & Related papers (2024-11-01T17:13:14Z)
- AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error_Detection, Correction, and Metadata Integration [0.0]
This thesis proposes a novel set of interconnected frameworks aimed at enhancing big data quality comprehensively.
Firstly, we introduce new quality metrics and a weighted scoring system for precise data quality assessment.
Thirdly, we present a generic framework for detecting various quality anomalies using AI models.
arXiv Detail & Related papers (2024-05-06T21:36:45Z)
- Enhancing Data Quality in Federated Fine-Tuning of Foundation Models [54.757324343062734]
We propose a data quality control pipeline for federated fine-tuning of foundation models.
This pipeline computes scores reflecting the quality of training data and determines a global threshold for a unified standard.
Our experiments show that the proposed quality control pipeline facilitates the effectiveness and reliability of the model training, leading to better performance.
arXiv Detail & Related papers (2024-03-07T14:28:04Z)
- Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
- Analyzing Dataset Annotation Quality Management in the Wild [63.07224587146207]
Even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts.
While practices and guidelines regarding dataset creation projects exist, large-scale analysis has yet to be performed on how quality management is conducted.
arXiv Detail & Related papers (2023-07-16T21:22:40Z)
- QI2 -- an Interactive Tool for Data Quality Assurance [63.379471124899915]
The planned AI Act from the European commission defines challenging legal requirements for data quality.
We introduce a novel approach that supports the data quality assurance process of multiple data quality aspects.
arXiv Detail & Related papers (2023-07-07T07:06:38Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Detecting Quality Problems in Data Models by Clustering Heterogeneous Data Values [1.143020642249583]
We propose a bottom-up approach to detecting quality problems in data models that manifest in heterogeneous data values.
All values of a selected data field are clustered by syntactic similarity.
It shall help domain experts to understand how the data model is used in practice and to derive potential quality problems of the data model.
arXiv Detail & Related papers (2021-11-12T11:05:18Z)
- Data Quality Evaluation using Probability Models [0.0]
It is shown that for the data examined, the ability to predict the quality of data based on simple good/bad pre-labelled learning examples is accurate.
arXiv Detail & Related papers (2020-09-14T18:12:19Z)
- Open Data Quality Evaluation: A Comparative Analysis of Open Data in Latvia [0.0]
The research discusses how (open) data quality could be assessed.
One specific approach is applied to several Latvian open data sets.
Common data quality problems detected in Latvian open data and in the open data of 3 European countries are also highlighted.
arXiv Detail & Related papers (2020-07-09T10:43:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.