The Validity of Evaluation Results: Assessing Concurrence Across
Compositionality Benchmarks
- URL: http://arxiv.org/abs/2310.17514v1
- Date: Thu, 26 Oct 2023 16:11:04 GMT
- Title: The Validity of Evaluation Results: Assessing Concurrence Across
Compositionality Benchmarks
- Authors: Kaiser Sun, Adina Williams, Dieuwke Hupkes
- Abstract summary: We examine the performance of six modeling approaches across 4 datasets, split according to 8 compositional splitting strategies.
Our results demonstrate that much work remains to be done when it comes to assessing whether popular evaluation datasets measure what they intend to measure.
- Score: 27.83907050770602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: NLP models have progressed drastically in recent years, according to numerous
datasets proposed to evaluate performance. Questions remain, however, about how
particular dataset design choices may impact the conclusions we draw about
model capabilities. In this work, we investigate this question in the domain of
compositional generalization. We examine the performance of six modeling
approaches across 4 datasets, split according to 8 compositional splitting
strategies, ranking models by 18 compositional generalization splits in total.
Our results show that: i) the datasets, although all designed to evaluate
compositional generalization, rank modeling approaches differently; ii)
datasets generated by humans align better with each other than they do with
synthetic datasets, or than synthetic datasets among themselves; iii)
generally, whether datasets are sampled from the same source is more predictive
of the resulting model ranking than whether they maintain the same
interpretation of compositionality; and iv) which lexical items are used in the
data can strongly impact conclusions. Overall, our results demonstrate that
much work remains to be done when it comes to assessing whether popular
evaluation datasets measure what they intend to measure, and suggest that
elucidating more rigorous standards for establishing the validity of evaluation
sets could benefit the field.
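As an illustration of the concurrence analysis described above, the sketch below ranks a handful of modeling approaches on two benchmarks and measures how strongly the benchmarks agree using Kendall's tau. The model names, scores, and the specific choice of Kendall's tau are illustrative assumptions, not the paper's exact protocol.

# Illustrative sketch: quantify how well two compositionality benchmarks
# agree on a ranking of modeling approaches. Scores and names are made up.
from itertools import combinations
from scipy.stats import kendalltau

# accuracy of each modeling approach on each benchmark split (hypothetical)
scores = {
    "benchmark_A": {"lstm": 0.42, "transformer": 0.55, "tree_model": 0.61},
    "benchmark_B": {"lstm": 0.38, "transformer": 0.70, "tree_model": 0.52},
}

models = sorted(scores["benchmark_A"])

def benchmark_scores(bench):
    """Return each model's accuracy on `bench` in a fixed model order."""
    return [scores[bench][m] for m in models]

# Pairwise concurrence: Kendall's tau between the model rankings
# induced by each pair of benchmarks (1.0 = identical ranking).
for a, b in combinations(scores, 2):
    tau, p = kendalltau(benchmark_scores(a), benchmark_scores(b))
    print(f"{a} vs {b}: tau = {tau:.2f} (p = {p:.2f})")

A tau near 1 means the two benchmarks rank the approaches almost identically; values near 0 indicate little agreement between them.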
Related papers
- Benchmarking community drug response prediction models: datasets, models, tools, and metrics for cross-dataset generalization analysis [36.689210473887904]
We introduce a benchmarking framework for evaluating cross-dataset prediction generalization in deep learning (DL) and machine learning (ML) models.
We quantify both absolute performance (e.g., predictive accuracy across datasets) and relative performance (e.g., performance drop compared to within-dataset results).
Our results reveal substantial performance drops when models are tested on unseen datasets, underscoring the importance of rigorous generalization assessments.
arXiv Detail & Related papers (2025-03-18T15:40:18Z) - Self-Rationalization in the Wild: A Large Scale Out-of-Distribution Evaluation on NLI-related tasks [59.47851630504264]
Free-text explanations are expressive and easy to understand, but many datasets lack annotated explanation data.
We fine-tune T5-Large and OLMo-7B models and assess the impact of fine-tuning data quality, the number of fine-tuning samples, and few-shot selection methods.
The models are evaluated on 19 diverse OOD datasets across three tasks: natural language inference (NLI), fact-checking, and hallucination detection in abstractive summarization.
arXiv Detail & Related papers (2025-02-07T10:01:32Z) - On Evaluation of Vision Datasets and Models using Human Competency Frameworks [20.802372291783488]
Item Response Theory (IRT) is a framework that infers interpretable latent parameters for an ensemble of models and each dataset item (a minimal IRT fitting sketch appears after this list).
We assess model calibration, select informative data subsets, and demonstrate the usefulness of IRT's latent parameters for analyzing and comparing models and datasets in computer vision.
arXiv Detail & Related papers (2024-09-06T06:20:11Z) - Model-based Clustering of Individuals' Ecological Momentary Assessment
Time-series Data for Improving Forecasting Performance [5.312303275762104]
Additional information from similar individuals is believed to enhance these models, leading to better descriptions of individuals.
Two model-based clustering approaches are examined; the first uses parameters extracted from personalized models.
The superiority of clustering-based methods is confirmed, indicating that group-based information can effectively enhance forecasting performance across individuals.
arXiv Detail & Related papers (2023-10-11T13:39:04Z) - On the Evaluation and Refinement of Vision-Language Instruction Tuning
Datasets [71.54954966652286]
We evaluate Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION achieves performance comparable to simply combining all VLIT datasets.
arXiv Detail & Related papers (2023-10-10T13:01:38Z) - Is Synthetic Dataset Reliable for Benchmarking Generalizable Person
Re-Identification? [1.1041211464412568]
We show that a recent large-scale synthetic dataset ClonedPerson can be reliably used to benchmark GPReID, statistically the same as real-world datasets.
This study validates the use of synthetic datasets for both the source training set and the target testing set, with no privacy concerns from real-world surveillance data.
arXiv Detail & Related papers (2022-09-12T06:54:54Z) - An Empirical Investigation of Commonsense Self-Supervision with
Knowledge Graphs [67.23285413610243]
Self-supervision based on the information extracted from large knowledge graphs has been shown to improve the generalization of language models.
We study the effect of knowledge sampling strategies and sizes that can be used to generate synthetic data for adapting language models.
arXiv Detail & Related papers (2022-05-21T19:49:04Z) - Data-driven Model Generalizability in Crosslinguistic Low-resource
Morphological Segmentation [4.339613097080119]
In low-resource scenarios, artifacts of the data collection can yield data sets that are outliers, potentially making conclusions about model performance coincidental.
We compare three broad classes of models with different parameterizations, taking data from 11 languages across 6 language families.
The results demonstrate that the extent of model generalization depends on the characteristics of the data set and does not necessarily hinge on its size.
arXiv Detail & Related papers (2022-01-05T22:19:10Z) - Investigating Crowdsourcing Protocols for Evaluating the Factual
Consistency of Summaries [59.27273928454995]
Current pre-trained models applied to summarization are prone to factual inconsistencies which misrepresent the source text or introduce extraneous information.
We create a crowdsourcing evaluation framework for factual consistency using the rating-based Likert scale and ranking-based Best-Worst Scaling protocols.
We find that ranking-based protocols offer a more reliable measure of summary quality across datasets, while the reliability of Likert ratings depends on the target dataset and the evaluation design.
arXiv Detail & Related papers (2021-09-19T19:05:00Z) - Doing Great at Estimating CATE? On the Neglected Assumptions in
Benchmark Comparisons of Treatment Effect Estimators [91.3755431537592]
We show that even in arguably the simplest setting, estimation under ignorability assumptions can be misleading.
We consider two popular machine learning benchmark datasets for evaluation of heterogeneous treatment effect estimators.
We highlight that the inherent characteristics of the benchmark datasets favor some algorithms over others.
arXiv Detail & Related papers (2021-07-28T13:21:27Z) - Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, used in QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z) - CDEvalSumm: An Empirical Study of Cross-Dataset Evaluation for Neural
Summarization Systems [121.78477833009671]
We investigate the performance of different summarization models under a cross-dataset setting.
A comprehensive study of 11 representative summarization systems on 5 datasets from different domains reveals the effect of model architectures and generation approaches.
arXiv Detail & Related papers (2020-10-11T02:19:15Z)
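Two entries above ("On Evaluation of Vision Datasets and Models using Human Competency Frameworks" and "Comparing Test Sets with Item Response Theory") rely on Item Response Theory. As a rough illustration of that framework, the sketch below fits a simple one-parameter (Rasch) model by maximum likelihood to a binary model-by-item response matrix, recovering a latent ability per model and a difficulty per item. The toy response matrix and the plain optimizer setup are assumptions for illustration, not the setups used in those papers.

# Illustrative sketch of Item Response Theory (1PL / Rasch model):
# P(model i answers item j correctly) = sigmoid(ability_i - difficulty_j).
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # numerically stable sigmoid

# rows = models, columns = test items; 1 = correct, 0 = incorrect (toy data)
responses = np.array([
    [1, 1, 1, 0, 1, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 1],
], dtype=float)
n_models, n_items = responses.shape

def neg_log_likelihood(params):
    ability = params[:n_models]        # latent ability, one per model
    difficulty = params[n_models:]     # latent difficulty, one per item
    p = expit(ability[:, None] - difficulty[None, :])
    eps = 1e-9                         # guard against log(0)
    ll = responses * np.log(p + eps) + (1 - responses) * np.log(1 - p + eps)
    # small ridge term pins down the additive indeterminacy
    # between abilities and difficulties
    return -ll.sum() + 1e-3 * np.sum(params ** 2)

init = np.zeros(n_models + n_items)
fit = minimize(neg_log_likelihood, init, method="L-BFGS-B")
ability_hat, difficulty_hat = fit.x[:n_models], fit.x[n_models:]

print("estimated model abilities  :", np.round(ability_hat, 2))
print("estimated item difficulties:", np.round(difficulty_hat, 2))

Higher estimated ability indicates a stronger model; higher estimated difficulty indicates an item that even strong models tend to get wrong, which is what makes such items useful for distinguishing state-of-the-art systems.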
This list is automatically generated from the titles and abstracts of the papers in this site.