On some elusive aspects of databases hindering AI based discovery: A
case study on superconducting materials
- URL: http://arxiv.org/abs/2311.09891v1
- Date: Thu, 16 Nov 2023 13:38:00 GMT
- Title: On some elusive aspects of databases hindering AI based discovery: A
case study on superconducting materials
- Authors: Giovanni Trezza, Eliodoro Chiavazzo
- Abstract summary: We discuss three aspects, namely intrinsically biased sample selection, possible hidden variables, and disparate data age.
To our knowledge, we suggest and test a first strategy capable of detecting and quantifying the presence of the intrinsic data bias.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: It stands to reason that the amount and the quality of big data are of key
importance for setting up accurate AI-driven models. Nonetheless, we believe
there are still critical roadblocks in the inherent generation of databases,
that are often underestimated and poorly discussed in the literature. In our
view, such issues can seriously hinder the AI-based discovery process, even
when high quality, sufficiently large and highly reputable data sources are
available. Here, considering superconducting and thermoelectric materials as
two representative case studies, we specifically discuss three aspects, namely
intrinsically biased sample selection, possible hidden variables, disparate
data age. Importantly, to our knowledge, we suggest and test a first strategy
capable of detecting and quantifying the presence of the intrinsic data bias.
Related papers
- AI-Driven Frameworks for Enhancing Data Quality in Big Data Ecosystems: Error_Detection, Correction, and Metadata Integration [0.0]
This thesis proposes a novel set of interconnected frameworks aimed at enhancing big data quality comprehensively.
Firstly, we introduce new quality metrics and a weighted scoring system for precise data quality assessment.
Thirdly, we present a generic framework for detecting various quality anomalies using AI models.
arXiv Detail & Related papers (2024-05-06T21:36:45Z)
- Investigating the Quality of DermaMNIST and Fitzpatrick17k Dermatological Image Datasets [17.01966057343415]
Several factors can impact data quality, such as the presence of duplicates, data leakage across train-test partitions, mislabeled images, and the absence of a well-defined test partition.
We conduct meticulous analyses of three popular dermatological image datasets: DermaMNIST, its source HAM10000, and Fitzpatrick17k.
arXiv Detail & Related papers (2024-01-25T20:29:01Z)
- UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models [55.22048505787125]
This paper contributes a comprehensive dataset, called UNK-VQA.
We first augment the existing data via deliberate perturbations on either the image or question.
We then extensively evaluate the zero- and few-shot performance of several emerging multi-modal large models.
arXiv Detail & Related papers (2023-10-17T02:38:09Z)
- Towards Generalizable Data Protection With Transferable Unlearnable Examples [50.628011208660645]
We present a novel, generalizable data protection method by generating transferable unlearnable examples.
To the best of our knowledge, this is the first solution that examines data privacy from the perspective of data distribution.
arXiv Detail & Related papers (2023-05-18T04:17:01Z)
- On the Possibilities of AI-Generated Text Detection [76.55825911221434]
We argue that as machine-generated text approximates human-like quality, the sample size needed for detection bounds increases.
We test various state-of-the-art text generators, including GPT-2, GPT-3.5-Turbo, Llama, Llama-2-13B-Chat-HF, and Llama-2-70B-Chat-HF, against detectors, including RoBERTa-Large/Base-Detector and GPTZero.
arXiv Detail & Related papers (2023-04-10T17:47:39Z)
- ET-AL: Entropy-Targeted Active Learning for Bias Mitigation in Materials Data [8.623994950369127]
Growing materials data and data-centric informatics tools drastically promote the discovery and design of materials.
Data-driven models, such as machine learning, have drawn much attention and observed significant progress.
We focus on bias mitigation, an important aspect of materials data quality.
arXiv Detail & Related papers (2022-11-15T04:12:00Z)
- Do Deep Neural Networks Always Perform Better When Eating More Data? [82.6459747000664]
We design experiments under independent and identically distributed (IID) and out-of-distribution (OOD) conditions.
Under the IID condition, the amount of information determines the effectiveness of each sample, while the contribution of samples and the difference between classes determine the amount of class information.
Under the OOD condition, the cross-domain degree of samples determines their contributions, and the bias-fitting caused by irrelevant elements is a significant cross-domain factor.
arXiv Detail & Related papers (2022-05-30T15:40:33Z)
- Data Smells: Categories, Causes and Consequences, and Detection of Suspicious Data in AI-based Systems [3.793596705511303]
This article conceptualizes data smells and elaborates on their causes, consequences, detection, and use in the context of AI-based systems.
In addition, a catalogue of 36 data smells divided into three categories (i.e., Believability Smells, Understandability Smells, Consistency Smells) is presented.
arXiv Detail & Related papers (2022-03-19T19:21:52Z)
- Deep neural networks approach to microbial colony detection -- a comparative analysis [52.77024349608834]
This study investigates the performance of three deep learning approaches for object detection on the AGAR dataset.
The achieved results may serve as a benchmark for future experiments.
arXiv Detail & Related papers (2021-08-23T12:06:00Z)
- Occams Razor for Big Data? On Detecting Quality in Large Unstructured Datasets [0.0]
A new trend towards analytic complexity represents a severe challenge for the principle of parsimony, or Occam's razor, in science.
Computational building block approaches for data clustering can help to deal with large unstructured datasets in minimized computation time.
The review concludes on how cultural differences between East and West are likely to affect the course of big data analytics.
arXiv Detail & Related papers (2020-11-12T16:06:01Z)
- Data Mining with Big Data in Intrusion Detection Systems: A Systematic Literature Review [68.15472610671748]
Cloud computing has become a powerful and indispensable technology for complex, high performance and scalable computation.
The rapid rate and volume of data creation have begun to pose significant challenges for data management and security.
The design and deployment of intrusion detection systems (IDS) in the big data setting has, therefore, become a topic of importance.
arXiv Detail & Related papers (2020-05-23T20:57:12Z)