Improving Model Evaluation using SMART Filtering of Benchmark Datasets
- URL: http://arxiv.org/abs/2410.20245v2
- Date: Mon, 10 Feb 2025 21:17:54 GMT
- Title: Improving Model Evaluation using SMART Filtering of Benchmark Datasets
- Authors: Vipul Gupta, Candace Ross, David Pantoja, Rebecca J. Passonneau, Megan Ung, Adina Williams
- Abstract summary: We propose a novel approach to select high-quality subsets of examples from existing benchmark datasets. Our approach applies three filtering criteria, removing (i) easy examples, (ii) data-contaminated examples, and (iii) examples that are similar to each other. We demonstrate the effectiveness of SMART on three multiple choice QA datasets.
- Score: 19.731378662304497
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One of the most challenging problems facing NLP today is evaluation. Some of the most pressing issues pertain to benchmark saturation, data contamination, and diversity in the quality of test examples. To address these concerns, we propose Selection Methodology for Accurate, Reduced, and Targeted (SMART) filtering, a novel approach to select a high-quality subset of examples from existing benchmark datasets by systematically removing less informative and less challenging examples. Our approach applies three filtering criteria, removing (i) easy examples, (ii) data-contaminated examples, and (iii) examples that are similar to each other based on distance in an embedding space. We demonstrate the effectiveness of SMART on three multiple choice QA datasets, where our methodology increases efficiency by reducing dataset size by 48% on average, while increasing Pearson correlation with rankings from ChatBot Arena, a more open-ended human evaluation setting. Our method enables us to be more efficient, whether using SMART to make new benchmarks more challenging or to revitalize older datasets, while still preserving the relative model rankings.
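The abstract describes the three SMART filtering criteria but gives no implementation details. The sketch below is a minimal illustration of how such a pipeline could look, assuming three inputs are already available: per-example accuracies from a set of reference models (for the easy-example filter), boolean flags from a separate contamination detector, and a sentence-embedding model for near-duplicate removal. The function names, thresholds, and choice of embedding model are illustrative assumptions, not details from the paper.

```python
# Illustrative sketch only; the paper's actual method and code may differ.
# It mirrors the three filtering criteria from the abstract:
#   (i)  drop easy examples (answered correctly by nearly all reference models),
#   (ii) drop examples flagged as data-contaminated,
#   (iii) drop near-duplicates by distance in an embedding space.
# All names, thresholds, and the embedding model are assumptions for illustration.

import numpy as np
from sentence_transformers import SentenceTransformer

def smart_filter(examples, model_accuracies, contamination_flags,
                 easy_threshold=0.95, similarity_threshold=0.9):
    """Return indices of examples kept after the three SMART-style filters.

    examples            -- list of question strings
    model_accuracies    -- fraction of reference models answering each example correctly
    contamination_flags -- per-example booleans from a contamination detector
    """
    # Filters (i) and (ii): keep examples that are neither too easy nor contaminated.
    keep = [i for i in range(len(examples))
            if model_accuracies[i] < easy_threshold and not contamination_flags[i]]

    # Filter (iii): greedy near-duplicate removal using cosine similarity
    # between normalized sentence embeddings.
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = encoder.encode([examples[i] for i in keep],
                                normalize_embeddings=True)

    selected, selected_vecs = [], []
    for idx, vec in zip(keep, embeddings):
        if all(float(np.dot(vec, v)) < similarity_threshold for v in selected_vecs):
            selected.append(idx)
            selected_vecs.append(vec)
    return selected
```

In this sketch, the similarity filter is a single greedy pass that keeps an example only if its cosine similarity to every already-kept example stays below the threshold; the paper's actual criteria, ordering, and thresholds may differ.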
Related papers
- TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection [9.020110377060153]
We present TACOS, an innovative method that integrates Open Tagging and Comparative Scoring for IFT data selection. To capture data diversity, we leverage LLMs to assign open-domain tags to human queries. We suggest a comparative scoring method that allows the relative quality evaluation of samples within a cluster, avoiding inconsistent criteria seen in singleton-based evaluations.
arXiv Detail & Related papers (2025-07-04T15:46:07Z) - FastMCTS: A Simple Sampling Strategy for Data Synthesis [67.60823802317141]
We introduce FastMCTS, an innovative data synthesis strategy inspired by Monte Carlo Tree Search.
FastMCTS provides a more efficient sampling method for multi-step reasoning data, offering step-level evaluation signals.
Experiments on both English and Chinese reasoning datasets demonstrate that FastMCTS generates over 30% more correct reasoning paths.
arXiv Detail & Related papers (2025-02-17T06:27:57Z) - Improving the Efficiency of Self-Supervised Adversarial Training through Latent Clustering-Based Selection [2.7554677967598047]
Adversarially robust learning is widely recognized to demand significantly more training examples.
Recent works propose the use of self-supervised adversarial training (SSAT) with external or synthetically generated unlabeled data to enhance model robustness.
We propose novel methods to strategically select a small subset of unlabeled data essential for SSAT and robustness improvement.
arXiv Detail & Related papers (2025-01-15T15:47:49Z) - Optimized Conformal Selection: Powerful Selective Inference After Conformity Score Optimization [4.984656106595651]
This paper presents OptCS, a framework that allows valid statistical testing (selection) after flexible data-driven model optimization.
We introduce general conditions under which OptCS constructs valid conformal p-values despite substantial data reuse.
We propose three FDR-controlling procedures, each optimizing the models differently.
arXiv Detail & Related papers (2024-11-27T01:40:50Z) - A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [56.24431208419858]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset.
arXiv Detail & Related papers (2024-10-10T16:01:51Z) - Data Efficient Evaluation of Large Language Models and Text-to-Image Models via Adaptive Sampling [3.7467864495337624]
SubLIME is a data-efficient evaluation framework for text-to-image models.
Our approach yields model rankings that are statistically aligned with those obtained on the full datasets.
We leverage the HEIM leaderboard to cover 25 text-to-image models on 17 different benchmarks.
arXiv Detail & Related papers (2024-06-21T07:38:55Z) - Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models [38.39395973523944]
We propose a three-stage scheme for data selection and review existing works according to this scheme.
We find that more targeted methods using data-specific and model-specific quality labels achieve higher efficiency.
arXiv Detail & Related papers (2024-06-20T08:58:58Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Evaluating Graph Neural Networks for Link Prediction: Current Pitfalls and New Benchmarking [66.83273589348758]
Link prediction attempts to predict whether an unseen edge exists given only a portion of a graph's edges.
A flurry of methods that attempt to use graph neural networks (GNNs) for this task have been introduced in recent years.
New and diverse datasets have also been created to better evaluate the effectiveness of these new models.
arXiv Detail & Related papers (2023-06-18T01:58:59Z) - Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios [79.02009938011447]
We propose a sampling scheme that selects optimal fixed-size subsets of unlabeled samples from the unlabeled data pool.
Experimental results show its effectiveness on both classical Machine Learning (ML) and Deep Learning (DL) tasks.
arXiv Detail & Related papers (2022-07-04T04:11:44Z) - Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair [23.87794015063672]
Adversarial dataset creation has been proposed as a strategy to construct more challenging datasets.
We adapt the AFLite algorithm to filter evaluation data, and run experiments against 18 different adversary models.
We find that AFLite indeed selects more challenging examples, lowering the performance of evaluated models more as stronger adversary models are used.
arXiv Detail & Related papers (2021-11-16T01:45:26Z) - GOLD: Improving Out-of-Scope Detection in Dialogues using Data Augmentation [41.04593978694591]
The GOLD technique augments existing data to train better OOS detectors that operate in low-data regimes.
In experiments across three target benchmarks, the top GOLD model outperforms all existing methods on all key metrics.
arXiv Detail & Related papers (2021-09-07T13:35:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.