DQI: Measuring Data Quality in NLP
- URL: http://arxiv.org/abs/2005.00816v1
- Date: Sat, 2 May 2020 12:34:17 GMT
- Title: DQI: Measuring Data Quality in NLP
- Authors: Swaroop Mishra, Anjana Arunkumar, Bhavdeep Sachdeva, Chris Bryan,
Chitta Baral
- Abstract summary: We introduce a generic formula for Data Quality Index (DQI) to help dataset creators create datasets free of unwanted biases.
We show that models trained on the renovated SNLI dataset generalize better to out-of-distribution tasks.
- Score: 22.54066527822898
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural language models have achieved human-level performance across several
NLP datasets. However, recent studies have shown that these models are not
truly learning the desired task; rather, their high performance is attributed
to overfitting using spurious biases, which suggests that the capabilities of
AI systems have been over-estimated. We introduce a generic formula for Data
Quality Index (DQI) to help dataset creators create datasets free of such
unwanted biases. We evaluate this formula using a recently proposed approach
for adversarial filtering, AFLite. We propose a new data creation paradigm
using DQI to create higher quality data. The data creation paradigm consists of
several data visualizations to help data creators (i) understand the quality of
data and (ii) visualize the impact of the created data instance on the overall
quality. It also includes two automation methods to (i) assist data creators
and (ii) make the model more robust to adversarial attacks. We use DQI along
with these automation methods to renovate biased examples in SNLI. We show that
models trained on the renovated SNLI dataset generalize better to
out-of-distribution tasks. Renovation results in reduced model performance, exposing a
large gap with respect to human performance. DQI systematically helps in
creating harder benchmarks using active learning. Our work takes the process of
dynamic dataset creation forward, wherein datasets evolve together with the
evolving state of the art, therefore serving as a means of benchmarking the
true progress of AI.
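To make the filtering step concrete, below is a minimal sketch of an AFLite-style adversarial filtering loop of the kind the abstract evaluates DQI against: weak linear models are trained on random splits of precomputed instance embeddings, and instances that are consistently predictable from surface features alone are dropped. The function name, hyperparameters, and use of sklearn here are illustrative assumptions, not the paper's exact procedure.
```python
# A minimal sketch of AFLite-style adversarial filtering, assuming instances are
# already embedded as fixed-size vectors (e.g., features from a frozen encoder).
# All names and hyperparameters are illustrative, not the paper's settings.
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite_filter(X, y, n_rounds=64, train_frac=0.8, drop_frac=0.05, threshold=0.75):
    """Return indices of instances kept after dropping the most 'predictable' ones."""
    keep = np.arange(len(X))
    while True:
        correct = np.zeros(len(keep))
        counted = np.zeros(len(keep))
        for _ in range(n_rounds):
            # Train a weak (linear) model on a random split; score held-out instances.
            perm = np.random.permutation(len(keep))
            cut = int(train_frac * len(keep))
            tr, te = perm[:cut], perm[cut:]
            clf = LogisticRegression(max_iter=1000).fit(X[keep[tr]], y[keep[tr]])
            correct[te] += (clf.predict(X[keep[te]]) == y[keep[te]])
            counted[te] += 1
        predictability = np.where(counted > 0, correct / np.maximum(counted, 1), 0.0)
        # Instances a weak model solves reliably are likely carried by spurious cues.
        easy = np.argsort(-predictability)[: int(drop_frac * len(keep))]
        easy = easy[predictability[easy] >= threshold]
        if len(easy) == 0:
            break
        keep = np.delete(keep, easy)
    return keep
```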
Related papers
- Forewarned is Forearmed: Leveraging LLMs for Data Synthesis through Failure-Inducing Exploration [90.41908331897639]
Large language models (LLMs) have significantly benefited from training on diverse, high-quality task-specific data.
We present a novel approach, ReverseGen, designed to automatically generate effective training samples.
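As a rough illustration of failure-inducing exploration, the sketch below mines training examples on which a target model currently fails; `proposer`, `target`, and `is_correct` are assumed interfaces standing in for the paper's learned components, not ReverseGen's actual API.
```python
# A minimal sketch of failure-inducing data synthesis in the spirit of ReverseGen.
# `proposer` and `target` stand in for LLM calls; `is_correct` for a task checker.
def mine_failures(proposer, target, is_correct, seed_tasks, budget=1000):
    """Collect (query, reference) pairs on which the target model currently fails."""
    failures = []
    for task in seed_tasks:
        for _ in range(budget // len(seed_tasks)):
            query, reference = proposer(task)  # propose a candidate instance
            prediction = target(query)         # let the target model attempt it
            if not is_correct(prediction, reference):
                failures.append({"query": query, "reference": reference})
    return failures  # failure cases become targeted training data
```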
arXiv Detail & Related papers (2024-10-22T06:43:28Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface-form cues to identify data that exhibits the necessary reasoning skills for the intended downstream application.
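A minimal sketch of the selection step, assuming per-example gradients have already been reduced to low-dimensional features (LESS derives these from LoRA gradients with random projection; that machinery is elided here, and the function name is illustrative):
```python
# Select training examples whose gradient features align best with a target task.
import numpy as np

def select_by_gradient_similarity(train_grads, target_grads, k):
    """Pick k candidates by max cosine similarity to any target-task gradient."""
    def normalize(g):
        return g / (np.linalg.norm(g, axis=1, keepdims=True) + 1e-8)
    tr, tg = normalize(train_grads), normalize(target_grads)
    scores = (tr @ tg.T).max(axis=1)  # best alignment with the target examples
    return np.argsort(-scores)[:k]
```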
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Genie: Achieving Human Parity in Content-Grounded Datasets Generation [15.535753443076002]
We propose Genie, a novel method for automatically generating high-quality content-grounded data.
We showcase this methodology by generating three large-scale synthetic datasets.
In a human evaluation, our generated data was found to be natural and of high quality.
arXiv Detail & Related papers (2024-01-25T18:14:57Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- RLBoost: Boosting Supervised Models using Deep Reinforcement Learning [0.0]
We present RLBoost, an algorithm that uses deep reinforcement learning strategies to evaluate a particular dataset and obtain a model capable of estimating the quality of any new data.
The results show that this model obtains better and more stable estimates than other state-of-the-art algorithms such as LOO, DataShapley, or DVRL.
arXiv Detail & Related papers (2023-05-23T14:38:33Z)
- Discover, Explanation, Improvement: An Automatic Slice Detection Framework for Natural Language Processing [72.14557106085284]
Slice detection models (SDMs) automatically identify underperforming groups of datapoints.
This paper proposes a benchmark named "Discover, Explain, Improve (DEIM)" for classification NLP tasks.
Our evaluation shows that Edisa can accurately select error-prone datapoints with informative semantic features.
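As a simplified illustration of slice detection (not Edisa itself), the sketch below flags groups of datapoints whose accuracy trails the overall average; the grouping function `slice_key` is an assumed input, whereas real SDMs learn the slices.
```python
# Flag "slices" of evaluation data where accuracy falls well below the average.
from collections import defaultdict

def find_underperforming_slices(examples, predictions, labels, slice_key, margin=0.1):
    groups = defaultdict(list)
    for ex, p, y in zip(examples, predictions, labels):
        groups[slice_key(ex)].append(p == y)
    overall = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    # A slice "underperforms" when its accuracy trails the overall accuracy by `margin`.
    return {k: sum(v) / len(v) for k, v in groups.items()
            if sum(v) / len(v) < overall - margin}
```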
arXiv Detail & Related papers (2022-11-08T19:00:00Z)
- PromDA: Prompt-based Data Augmentation for Low-Resource NLU Tasks [61.51515750218049]
This paper focuses on data augmentation for low-resource Natural Language Understanding (NLU) tasks.
We propose a Prompt-based Data Augmentation model (PromDA) that trains only small-scale soft prompts.
PromDA generates synthetic data via two different views and filters out the low-quality data using NLU models.
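The filtering step can be sketched as consistency filtering: keep only synthetic pairs whose label a model trained on the real seed data reproduces. The generation step (soft-prompt tuning over two views) is elided, and `seed_model` with its `predict` method is an assumed interface.
```python
# Drop synthetic (text, label) pairs that a seed-trained NLU model disagrees with.
def consistency_filter(seed_model, synthetic):
    kept = []
    for text, label in synthetic:
        if seed_model.predict(text) == label:  # model trained on real seed data
            kept.append((text, label))
    return kept
```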
arXiv Detail & Related papers (2022-02-25T05:09:27Z)
- Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis [17.811597734603144]
We propose an approach to automatically generating counterfactual data for data augmentation and explanation.
A comprehensive evaluation on several datasets, using a variety of state-of-the-art benchmarks, demonstrates that our approach achieves significant improvements in model performance.
arXiv Detail & Related papers (2021-06-29T10:27:01Z)
- Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
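A minimal sketch of such a discriminator objective, in which negative augmentations of real data (e.g., jigsaw-shuffled images) are scored as fake alongside generator samples; the hinge form, the `negative_augment` transform, and the mixing weight are illustrative assumptions rather than the paper's exact formulation.
```python
# Hinge-style GAN discriminator loss where NDA samples share the "fake" label.
import torch
import torch.nn.functional as F

def nda_discriminator_loss(D, real, fake, negative_augment, lam=0.5):
    nda = negative_augment(real)              # out-of-support variants of real data
    loss_real = F.relu(1.0 - D(real)).mean()  # push real scores above +1
    loss_fake = F.relu(1.0 + D(fake)).mean()  # push generated scores below -1
    loss_nda = F.relu(1.0 + D(nda)).mean()    # treat NDA samples as fake too
    return loss_real + lam * loss_fake + (1.0 - lam) * loss_nda
```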
arXiv Detail & Related papers (2021-02-09T20:28:35Z)