ET-AL: Entropy-Targeted Active Learning for Bias Mitigation in Materials Data
- URL: http://arxiv.org/abs/2211.07881v2
- Date: Wed, 16 Nov 2022 22:23:21 GMT
- Title: ET-AL: Entropy-Targeted Active Learning for Bias Mitigation in Materials Data
- Authors: Hengrui Zhang, Wei Wayne Chen, James M. Rondinelli, Wei Chen
- Abstract summary: Growing materials data and data-centric informatics tools greatly promote the discovery and design of materials.
Data-driven models, such as machine learning, have drawn much attention and seen significant progress.
We focus on bias mitigation, an important aspect of materials data quality.
- Score: 8.623994950369127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Growing materials data and data-centric informatics tools greatly promote
the discovery and design of materials. While data-driven models, such as
machine learning, have drawn much attention and seen significant progress,
the quality of data resources is equally important but less studied. In this
work, we focus on bias mitigation, an important aspect of materials data
quality. Quantifying the diversity of stability in different crystal systems,
we propose a metric for measuring structure-stability bias in materials data.
To mitigate the bias, we develop an entropy-targeted active learning (ET-AL)
framework, guiding the acquisition of new data so that the diversity of
underrepresented crystal systems is improved, thus mitigating the bias. With
experiments on materials datasets, we demonstrate the capability of ET-AL and
the improvement in machine learning models through bias mitigation. The
approach is applicable to data-centric informatics in other scientific domains.
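
As a rough illustration of how such a framework could operate, here is a minimal Python sketch, assuming a histogram-based entropy estimate over formation energies and a greedy one-sample-at-a-time acquisition rule; the function names (`stability_entropy`, `bias_metric`, `et_al_select`) and the inputs `energies_by_system` and `candidates` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def stability_entropy(energies, bins=20):
    """Shannon entropy of a histogram of stability values (e.g., formation
    energies) for one crystal system -- a stand-in diversity measure."""
    hist, _ = np.histogram(energies, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def bias_metric(energies_by_system):
    """Structure-stability bias as the spread of per-system entropies
    (illustrative; the paper defines its own metric)."""
    h = {s: stability_entropy(e) for s, e in energies_by_system.items()}
    return max(h.values()) - min(h.values()), h

def et_al_select(energies_by_system, candidates, n_acquire=10):
    """Greedily pick candidates that most increase the entropy of the
    currently least-diverse crystal system. `candidates` is a list of
    (crystal_system, predicted_energy) pairs from a surrogate model."""
    data = {s: list(e) for s, e in energies_by_system.items()}
    pool, selected = list(candidates), []
    for _ in range(n_acquire):
        _, h = bias_metric(data)
        target = min(h, key=h.get)               # most underrepresented system
        best_i, best_gain = None, -np.inf
        for i, (system, energy) in enumerate(pool):
            if system != target:
                continue
            gain = stability_entropy(data[system] + [energy]) - h[system]
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:                       # no candidates for the target
            break
        system, energy = pool.pop(best_i)
        data[system].append(energy)
        selected.append((system, energy))
    return selected
```

In practice, each candidate's predicted stability would come from a surrogate model (e.g., a graph neural network) that is retrained as new labels arrive.
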
Related papers
- Oriented Tiny Object Detection: A Dataset, Benchmark, and Dynamic Unbiased Learning [51.170479006249195]
We introduce a new dataset, a benchmark, and a dynamic coarse-to-fine learning scheme in this study.
Our proposed dataset, AI-TOD-R, features the smallest object sizes among all oriented object detection datasets.
We present a benchmark spanning a broad range of detection paradigms, including both fully-supervised and label-efficient approaches.
arXiv Detail & Related papers (2024-12-16T09:14:32Z)
- Evaluating Language Models as Synthetic Data Generators [74.80905172696366]
AgoraBench is a benchmark that provides standardized settings and metrics to evaluate LMs' data generation abilities.
By synthesizing 1.26 million training instances using 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities.
arXiv Detail & Related papers (2024-12-04T19:20:32Z)
- PairCFR: Enhancing Model Training on Paired Counterfactually Augmented Data through Contrastive Learning [49.60634126342945]
Counterfactually Augmented Data (CAD) involves creating new data samples by applying minimal yet sufficient modifications to flip the label of existing data samples to other classes.
Recent research reveals that training with CAD may lead models to overly focus on modified features while ignoring other important contextual information.
We employ contrastive learning to promote global feature alignment in addition to learning counterfactual clues; see the sketch after this entry.
arXiv Detail & Related papers (2024-06-09T07:29:55Z)
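
A minimal PyTorch-style sketch of the pairing described in the entry above, assuming a margin-based contrastive term over (original, counterfactual) embedding pairs on top of the usual cross-entropy; the loss form, the weight `alpha`, and the name `paired_cfr_loss` are assumptions, not the paper's exact objective.

```python
import torch.nn.functional as F

def paired_cfr_loss(z_orig, z_cf, logits_orig, logits_cf, y_orig, y_cf,
                    margin=1.0, alpha=0.5):
    """Cross-entropy on both views plus a contrastive term that pushes apart
    the embedding of an example and that of its label-flipped counterfactual,
    so the model attends to more than the minimally edited features."""
    ce = F.cross_entropy(logits_orig, y_orig) + F.cross_entropy(logits_cf, y_cf)
    dist = F.pairwise_distance(z_orig, z_cf)           # per-pair embedding distance
    contrastive = F.relu(margin - dist).pow(2).mean()  # labels differ: enforce margin
    return ce + alpha * contrastive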
Development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance; see the sketch after this entry.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
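
For intuition, here is one way per-sample importance could enter a distillation objective -- a sketch assuming a simple distribution-matching loss with learnable softmax weights over the synthetic samples; `iadd_step`, `feat_fn`, and `syn_w` are hypothetical names, and the paper's actual objective differs.

```python
import torch

def iadd_step(real_feats, syn_data, syn_w, feat_fn, opt):
    """One optimization step: match the real feature mean with an
    importance-weighted mean of synthetic features, updating both the
    synthetic samples and their importance weights."""
    opt.zero_grad()
    w = torch.softmax(syn_w, dim=0)        # learnable per-sample importance
    syn_feats = feat_fn(syn_data)          # embed synthetic samples
    loss = ((real_feats.mean(dim=0)
             - (w[:, None] * syn_feats).sum(dim=0)) ** 2).sum()
    loss.backward()
    opt.step()
    return loss.item()
```

Here `opt` would be an optimizer over the synthetic data and weights, e.g. `torch.optim.Adam([syn_data, syn_w])`.

- Towards Understanding How Data Augmentation Works with Imbalanced Data [17.478900028887537]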
We study the effect of data augmentation on three different classifiers: convolutional neural networks, support vector machines, and logistic regression models.
Our research indicates that data augmentation (DA), when applied to imbalanced data, produces substantial changes in model weights, support vectors, and feature selection.
We hypothesize that DA works by increasing variance in the data, so that machine learning models can associate changes in the data with labels.
arXiv Detail & Related papers (2023-04-12T15:01:22Z)
- Large Language Models as Master Key: Unlocking the Secrets of Materials Science with GPT [9.33544942080883]
This article presents a new natural language processing (NLP) task called structured information inference (SII) to address the complexities of information extraction at the device level in materials science.
We accomplished this task by fine-tuning GPT-3 on an existing FAIR dataset of perovskite solar cells, achieving a 91.8% F1-score, and extended the dataset with data published since its release.
We also designed experiments to predict the electrical performance of solar cells and to design materials or devices with targeted parameters using large language models (LLMs).
arXiv Detail & Related papers (2023-04-05T04:01:52Z)
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- CMW-Net: Learning a Class-Aware Sample Weighting Mapping for Robust Deep Learning [55.733193075728096]
Modern deep neural networks can easily overfit to biased training data with corrupted labels or class imbalance.
Sample re-weighting methods are widely used to alleviate this data bias issue.
We propose a meta-model capable of adaptively learning an explicit weighting scheme directly from data; see the sketch after this entry.
arXiv Detail & Related papers (2022-02-11T13:49:51Z)
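
The core idea -- a small meta-model mapping a sample's training loss to a weight -- can be sketched as below; CMW-Net additionally conditions the mapping on class information and trains it with a meta-objective on clean validation data, both omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightNet(nn.Module):
    """Tiny meta-model mapping a per-sample loss to a weight in [0, 1]."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())

    def forward(self, per_sample_loss):
        return self.net(per_sample_loss.unsqueeze(1)).squeeze(1)

def weighted_loss(logits, targets, weight_net):
    """Per-sample cross-entropy re-weighted by the meta-model, so likely
    corrupted (high-loss) samples can be down-weighted."""
    losses = F.cross_entropy(logits, targets, reduction="none")
    weights = weight_net(losses.detach())   # weight depends on loss magnitude
    return (weights * losses).mean()
```

- Fix your Models by Fixing your Datasets [0.6058427379240697]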
Current machine learning tools lack streamlined processes for improving data quality.
We introduce a systematic framework for finding noisy or mislabelled samples in a dataset; a sketch of one such check follows this entry.
We demonstrate the efficacy of our framework on public as well as private enterprise datasets of two Fortune 500 companies.
arXiv Detail & Related papers (2021-12-15T02:41:50Z)
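
The paper's framework is not spelled out in the summary; as a simple proxy, a confident-learning-style check flags samples whose out-of-fold prediction disagrees with the given label at high confidence. The name `flag_suspect_labels` and the threshold are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def flag_suspect_labels(X, y, threshold=0.9):
    """Return indices of samples whose out-of-fold predicted class disagrees
    with the given (integer-encoded) label at high confidence -- candidates
    for noisy or mislabelled data."""
    proba = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                              cv=5, method="predict_proba")
    pred = proba.argmax(axis=1)
    conf = proba.max(axis=1)
    return np.where((pred != y) & (conf >= threshold))[0]
```

- Data Curation and Quality Assurance for Machine Learning-based Cyber Intrusion Detection [1.0276024900942873]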
This article first summarizes existing machine learning-based intrusion detection systems and the datasets used for building these systems.
The experimental results show that BERT and GPT were the best algorithms for host-based intrusion detection (HIDS) on all of the datasets.
We then evaluate the data quality of the 11 datasets based on quality dimensions proposed in the paper to determine the characteristics that a HIDS dataset should possess to yield the best possible results.
arXiv Detail & Related papers (2021-05-20T21:31:46Z)
- On the Use of Interpretable Machine Learning for the Management of Data Quality [13.075880857448059]
We propose the use of interpretable machine learning to identify the features on which any data processing activity should be based; see the sketch after this entry.
Our aim is to secure data quality, at least for those features detected as significant in the collected datasets.
arXiv Detail & Related papers (2020-07-29T08:49:32Z)
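
One plausible realization of this idea, assuming permutation importance over a random-forest model picks out the features whose quality most deserves monitoring (the paper may use different models or importance measures):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def quality_critical_features(X, y, feature_names, top_k=5):
    """Rank features by permutation importance and return the top ones --
    the features whose data quality should be secured first."""
    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
    order = np.argsort(result.importances_mean)[::-1]
    return [feature_names[i] for i in order[:top_k]]
```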