Adversarially Constructed Evaluation Sets Are More Challenging, but May
Not Be Fair
- URL: http://arxiv.org/abs/2111.08181v1
- Date: Tue, 16 Nov 2021 01:45:26 GMT
- Title: Adversarially Constructed Evaluation Sets Are More Challenging, but May
Not Be Fair
- Authors: Jason Phang, Angelica Chen, William Huang, Samuel R. Bowman
- Abstract summary: Adversarial dataset creation has been proposed as a strategy to construct more challenging datasets.
We adapt the AFLite algorithm to filter evaluation data, and run experiments against 18 different adversary models.
We find that AFLite indeed selects more challenging examples, lowering the performance of evaluated models more as stronger adversary models are used.
- Score: 23.87794015063672
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: More capable language models increasingly saturate existing task benchmarks,
in some cases outperforming humans. This has left little headroom with which to
measure further progress. Adversarial dataset creation has been proposed as a
strategy to construct more challenging datasets, and two common approaches are:
(1) filtering out easy examples and (2) model-in-the-loop data collection. In
this work, we study the impact of applying each approach to create more
challenging evaluation datasets. We adapt the AFLite algorithm to filter
evaluation data, and run experiments against 18 different adversary models. We
find that AFLite indeed selects more challenging examples, lowering the
performance of evaluated models more as stronger adversary models are used.
However, the resulting ranking of models can also be unstable and highly
sensitive to the choice of adversary model used. Moreover, AFLite oversamples
examples with low annotator agreement, meaning that model comparisons hinge on
the most contentiously labeled examples. Smaller-scale experiments on the
adversarially collected datasets ANLI and AdversarialQA show similar findings,
broadly lowering performance with stronger adversaries while disproportionately
affecting the adversary model.
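As an illustration of the filtering strategy studied in the abstract above, here is a minimal AFLite-style sketch in Python. It is a sketch under assumptions (linear probes on adversary-model embeddings; the partition count, cutoff, and removal budget are placeholders), not the authors' exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite_filter(features, labels, target_size, n_partitions=64,
                  train_frac=0.8, cutoff=0.75, remove_per_iter=500, seed=0):
    """AFLite-style adversarial filtering (sketch): repeatedly train linear
    probes on random train/eval partitions of adversary-model embeddings and
    drop the examples the probes predict most reliably (the "easy" ones)."""
    rng = np.random.default_rng(seed)
    keep = np.arange(len(labels))                      # indices still in the dataset
    while len(keep) > target_size:
        correct = np.zeros(len(keep))
        counts = np.zeros(len(keep))
        for _ in range(n_partitions):
            perm = rng.permutation(len(keep))
            n_train = int(train_frac * len(keep))
            train_pos, eval_pos = perm[:n_train], perm[n_train:]
            probe = LogisticRegression(max_iter=200)
            probe.fit(features[keep[train_pos]], labels[keep[train_pos]])
            preds = probe.predict(features[keep[eval_pos]])
            correct[eval_pos] += (preds == labels[keep[eval_pos]])
            counts[eval_pos] += 1
        predictability = correct / np.maximum(counts, 1)   # fraction of probes that got it right
        easiest = np.argsort(-predictability)[:remove_per_iter]
        easiest = easiest[predictability[easiest] >= cutoff]
        if len(easiest) == 0:                              # nothing left above the cutoff
            break
        keep = np.delete(keep, easiest)                     # drop the easiest examples
    return keep                                             # indices of the filtered evaluation set
```

The returned indices define the filtered evaluation set; the paper's point is that which examples survive, and hence how evaluated models rank, is highly sensitive to the adversary model whose embeddings feed this loop.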
Related papers
- EpiCoDe: Boosting Model Performance Beyond Training with Extrapolation and Contrastive Decoding [50.29046178980637]
EpiCoDe is a method that boosts model performance in data-scarcity scenarios without extra training.
We show that EpiCoDe consistently outperforms existing methods with significant and robust improvement.
arXiv Detail & Related papers (2025-06-04T02:11:54Z)
- Relation Extraction or Pattern Matching? Unravelling the Generalisation Limits of Language Models for Biographical RE [18.616344314400244]
We show that relation extraction models struggle with unseen data, even within similar domains.
Our results also show that data quality, rather than lexical similarity, is key to robust transfer.
arXiv Detail & Related papers (2025-05-18T20:22:14Z)
- More is Less: The Pitfalls of Multi-Model Synthetic Preference Data in DPO Safety Alignment [80.04449725137177]
Direct Preference Optimization (DPO) has emerged as a simple, yet effective alternative to reinforcement learning from human feedback.
Our study reveals a striking, safety-specific phenomenon associated with DPO alignment.
Using solely self-generated responses for both chosen and rejected pairs significantly outperforms configurations that incorporate responses from stronger models.
arXiv Detail & Related papers (2025-04-03T00:36:40Z)
- Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations.
They generate only a limited range of perturbations for a single Information Extraction (IE) task.
Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench.
We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z)
- Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models [56.02275285521847]
We propose to evaluate models using a Panel of LLM evaluators (PoLL).
We find that using a PoLL composed of a larger number of smaller models outperforms a single large judge, exhibits less intra-model bias due to its composition of disjoint model families, and does so while being over seven times less expensive.
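A minimal sketch of how panel verdicts might be pooled, assuming hypothetical judge callables that stand in for prompted smaller LLMs from different model families (the paper's own aggregation functions may differ):

```python
from collections import Counter
from statistics import mean

def poll_verdict(judges, question, answer):
    """Majority vote over a panel of judges.
    Each judge is a callable (question, answer) -> "correct" | "incorrect";
    in practice each would wrap a prompt to a different small LLM."""
    votes = [judge(question, answer) for judge in judges]
    return Counter(votes).most_common(1)[0][0]

def poll_score(graders, question, answer):
    """For graded evaluation, average numeric scores from the panel instead."""
    return mean(grader(question, answer) for grader in graders)
```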
arXiv Detail & Related papers (2024-04-29T15:33:23Z)
- Non-Invasive Fairness in Learning through the Lens of Data Drift [88.37640805363317]
We show how to improve the fairness of Machine Learning models without altering the data or the learning algorithm.
We use a simple but key insight: the divergence of trends between different populations, and, consequently, between a learned model and minority populations, is analogous to data drift.
We explore two strategies (model-splitting and reweighing) to resolve this drift, aiming to improve the overall conformance of models to the underlying data.
arXiv Detail & Related papers (2023-03-30T17:30:42Z)
- Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for annotation when an unlabeled sample is believed to incur a high loss.
Our approach outperforms state-of-the-art active learning methods on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2022-12-20T19:29:37Z)
- Simplicity Bias Leads to Amplified Performance Disparities [8.60453031364566]
We show that SGD-trained models have a bias towards simplicity, leading them to prioritize learning a majority class.
A model may prioritize any class or group of the dataset that it finds simple, at the expense of what it finds complex.
arXiv Detail & Related papers (2022-12-13T15:24:41Z)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
- Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation [41.9785159975426]
State-of-the-art question answering models remain susceptible to a variety of adversarial attacks and are still far from obtaining human-level language understanding.
One proposed way forward is dynamic adversarial data collection, in which a human annotator attempts to create examples for which a model-in-the-loop fails.
In this work, we investigate several answer selection, question generation, and filtering methods that form a synthetic adversarial data generation pipeline.
Models trained on both synthetic and human-generated data outperform models not trained on synthetic adversarial data, and obtain state-of-the-art results on the Adversarial
arXiv Detail & Related papers (2021-04-18T02:00:06Z)
- Doubly Contrastive Deep Clustering [135.7001508427597]
We present a novel Doubly Contrastive Deep Clustering (DCDC) framework, which constructs contrastive loss over both sample and class views.
Specifically, for the sample view, we set the class distribution of the original sample and its augmented version as positive sample pairs.
For the class view, we build the positive and negative pairs from the sample distribution of the class.
In this way, the two contrastive losses constrain the clustering results of mini-batch samples at both the sample and class levels.
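As a rough illustration of the two views described above, here is a hedged sketch of a doubly contrastive objective (the InfoNCE form and temperature below are assumptions, not necessarily the paper's exact loss): the sample view contrasts rows of the predicted class-distribution matrices for a batch and its augmented version, while the class view contrasts their columns.

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, temperature=0.5):
    """Symmetric InfoNCE: a[i] and b[i] form the only positive pair for item i."""
    a, b = F.normalize(a, dim=1), F.normalize(b, dim=1)
    n = a.size(0)
    reps = torch.cat([a, b], dim=0)                  # (2n, d)
    sim = reps @ reps.t() / temperature              # pairwise similarities
    sim.fill_diagonal_(float("-inf"))                # exclude self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

def doubly_contrastive_loss(probs_orig, probs_aug, temperature=0.5):
    """probs_* are (batch, num_classes) class distributions for the original
    batch and its augmented view. Rows give the sample view, columns the class view."""
    sample_loss = info_nce(probs_orig, probs_aug, temperature)
    class_loss = info_nce(probs_orig.t(), probs_aug.t(), temperature)
    return sample_loss + class_loss
```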
arXiv Detail & Related papers (2021-03-09T15:15:32Z)
- Estimating Example Difficulty Using Variance of Gradients [5.69361786082969]
We propose Variance of Gradients (VoG) and show that it is a valuable and efficient metric for ranking data by difficulty.
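A minimal sketch of how a VoG-style score could be computed, assuming a list of saved training checkpoints; differentiating the true-class logit with respect to the input follows the common formulation and may not match the paper's exact code.

```python
import torch

def vog_scores(checkpoints, inputs, labels):
    """Variance-of-Gradients sketch: per example, take the gradient of the
    true-class logit w.r.t. the input at several checkpoints, then average the
    per-dimension variance across checkpoints. Higher scores = harder examples."""
    grads = []
    for model in checkpoints:
        model.eval()
        x = inputs.clone().requires_grad_(True)
        logits = model(x)
        true_class_logits = logits.gather(1, labels.view(-1, 1)).sum()
        grads.append(torch.autograd.grad(true_class_logits, x)[0].detach())
    grads = torch.stack(grads)                           # (num_checkpoints, N, *input_shape)
    per_dim_var = grads.var(dim=0, unbiased=False)       # variance over checkpoints
    return per_dim_var.flatten(start_dim=1).mean(dim=1)  # one difficulty score per example
```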
arXiv Detail & Related papers (2020-08-26T14:53:24Z)
- Adversarial Filters of Dataset Biases [96.090959788952]
Large neural models have demonstrated human-level performance on language and vision benchmarks.
Their performance degrades considerably on adversarial or out-of-distribution samples.
We propose AFLite, which adversarially filters such dataset biases.
arXiv Detail & Related papers (2020-02-10T21:59:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.