Reassessing the Validity of Spurious Correlations Benchmarks
- URL: http://arxiv.org/abs/2409.04188v1
- Date: Fri, 6 Sep 2024 11:05:26 GMT
- Title: Reassessing the Validity of Spurious Correlations Benchmarks
- Authors: Samuel J. Bell, Diane Bouchacourt, Levent Sagun, et al.
- Abstract summary: We present a recipe for practitioners to choose methods using the most similar benchmark to their given problem.
We find that certain benchmarks are not meaningful measures of method performance, and that several methods are not sufficiently robust for widespread use.
- Score: 13.056731788383257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural networks can fail when the data contains spurious correlations. To understand this phenomenon, researchers have proposed numerous spurious correlations benchmarks upon which to evaluate mitigation methods. However, we observe that these benchmarks exhibit substantial disagreement, with the best methods on one benchmark performing poorly on another. We explore this disagreement, and examine benchmark validity by defining three desiderata that a benchmark should satisfy in order to meaningfully evaluate methods. Our results have implications for both benchmarks and mitigations: we find that certain benchmarks are not meaningful measures of method performance, and that several methods are not sufficiently robust for widespread use. We present a simple recipe for practitioners to choose methods using the most similar benchmark to their given problem.
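As a rough illustration of the recipe's shape, here is a minimal sketch assuming a hypothetical table of per-benchmark method scores and a user-supplied similarity score between each benchmark and the practitioner's problem; the benchmark names and numbers are illustrative, not results from the paper.

```python
# Hypothetical sketch of a "choose by most similar benchmark" recipe.
# The score table and similarity values are illustrative stand-ins, not the
# paper's data or its similarity measure.
benchmark_scores = {
    "Waterbirds":    {"ERM": 0.77, "GroupDRO": 0.89, "JTT": 0.86},
    "CelebA":        {"ERM": 0.47, "GroupDRO": 0.88, "JTT": 0.81},
    "CivilComments": {"ERM": 0.58, "GroupDRO": 0.70, "JTT": 0.69},
}
similarity_to_my_problem = {"Waterbirds": 0.2, "CelebA": 0.9, "CivilComments": 0.4}

# 1) Find the benchmark most similar to the practitioner's problem.
closest = max(similarity_to_my_problem, key=similarity_to_my_problem.get)
# 2) Adopt the method that performs best on that benchmark.
best_method = max(benchmark_scores[closest], key=benchmark_scores[closest].get)
print(f"Closest benchmark: {closest}; recommended method: {best_method}")
```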
Related papers
- PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models [41.772263447213234]
Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks.
This inclusion can lead to inflated scores on model leaderboards, yet disappointing performance in real-world applications.
We introduce PaCoST, a paired confidence significance test that effectively detects benchmark contamination in LLMs.
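As a sketch of the paired-testing idea (not necessarily PaCoST's exact procedure): compare the model's confidence on original benchmark items against paired, semantically equivalent rewrites; systematically higher confidence on the originals suggests contamination. The confidence values below are placeholders.

```python
# Sketch of a paired confidence test, in the spirit of PaCoST. In practice the
# confidences would be the LLM's probability of its answer on each item.
from scipy.stats import ttest_rel

conf_original  = [0.92, 0.88, 0.95, 0.81, 0.90, 0.87]  # original benchmark items
conf_rephrased = [0.71, 0.74, 0.80, 0.69, 0.75, 0.72]  # paired rewrites of the same items

# One-sided paired t-test: is confidence significantly higher on the originals?
stat, p_value = ttest_rel(conf_original, conf_rephrased, alternative="greater")
if p_value < 0.05:
    print(f"Significant confidence gap (p={p_value:.4f}): possible contamination.")
else:
    print(f"No significant gap (p={p_value:.4f}).")
```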
arXiv Detail & Related papers (2024-06-26T13:12:40Z)
- Baseline Results for Selected Nonlinear System Identification Benchmarks [2.144088660722956]
This paper presents a set of ten baseline techniques and their relative performances on five popular benchmarks.
The aim of this contribution is to stimulate thought and discussion regarding objective comparison of identification methodologies.
arXiv Detail & Related papers (2024-05-17T13:40:59Z)
- Efficient Benchmarking of Language Models [22.696230279151166]
We present the problem of Efficient Benchmarking, namely, intelligently reducing the costs of LM evaluation without compromising reliability.
Using the HELM benchmark as a test case, we investigate how different benchmark design choices affect the computation-reliability trade-off.
We propose an evaluation algorithm that, when applied to the HELM benchmark, yields dramatic cost savings with minimal loss of benchmark reliability.
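One generic way to probe this trade-off (a sketch of the idea, not the paper's algorithm) is to re-rank models on random subsets of a benchmark and measure rank agreement with the full benchmark; the scores below are synthetic.

```python
# Sketch: how reliably does a subsampled benchmark preserve model rankings?
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
n_models, n_examples = 5, 1000
scores = rng.random((n_models, n_examples))  # scores[m, i]: model m on example i (synthetic)

full_means = scores.mean(axis=1)             # full-benchmark ranking signal
for frac in (0.01, 0.05, 0.25):
    subset = rng.choice(n_examples, size=int(frac * n_examples), replace=False)
    sub_means = scores[:, subset].mean(axis=1)
    tau, _ = kendalltau(full_means, sub_means)  # rank agreement with the full benchmark
    print(f"{frac:.0%} of examples -> Kendall tau: {tau:.2f}")
```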
arXiv Detail & Related papers (2023-08-22T17:59:30Z)
- Risk Consistent Multi-Class Learning from Label Proportions [64.0125322353281]
This study addresses a multiclass learning from label proportions (MCLLP) setting in which training instances are provided in bags.
Most existing MCLLP methods impose bag-wise constraints on the prediction of instances or assign them pseudo-labels.
A risk-consistent method is proposed for instance classification using the empirical risk minimization framework.
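For context, the standard bag-level baseline in this setting matches the mean of instance predictions in a bag to the given label proportions; the sketch below shows that common LLP objective, not the paper's risk-consistent estimator.

```python
# Sketch: proportion-matching loss for learning from label proportions.
# A standard bag-level baseline, not the paper's risk-consistent method.
import torch
import torch.nn.functional as F

def proportion_loss(logits: torch.Tensor, bag_proportions: torch.Tensor) -> torch.Tensor:
    """logits: (bag_size, n_classes); bag_proportions: (n_classes,), sums to 1."""
    mean_probs = F.softmax(logits, dim=1).mean(dim=0)  # average prediction over the bag
    # Cross-entropy between the given proportions and the bag's mean prediction.
    return -(bag_proportions * torch.log(mean_probs + 1e-12)).sum()

# Toy usage: a bag of 4 instances over 3 classes with known proportions.
logits = torch.randn(4, 3, requires_grad=True)
loss = proportion_loss(logits, torch.tensor([0.5, 0.25, 0.25]))
loss.backward()
```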
arXiv Detail & Related papers (2022-03-24T03:49:04Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Adaptive Sampling for Heterogeneous Rank Aggregation from Noisy Pairwise Comparisons [85.5955376526419]
In rank aggregation problems, users exhibit various accuracy levels when comparing pairs of items.
We propose an elimination-based active sampling strategy, which estimates the ranking of items via noisy pairwise comparisons.
We prove that our algorithm can return the true ranking of items with high probability.
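A bare-bones version of elimination via noisy comparisons (a sketch of the general idea, not the paper's algorithm for heterogeneous users) keeps confidence intervals on empirical win rates and drops an item once it is confidently beaten:

```python
# Sketch: successive elimination with noisy pairwise comparisons.
# The item strengths and comparison model are synthetic stand-ins.
import math
import random

random.seed(0)
strengths = [0.9, 0.7, 0.5, 0.3]  # hidden item qualities

def noisy_compare(i, j):
    # Bradley-Terry-style comparison: P(i beats j) grows with strengths[i].
    return random.random() < strengths[i] / (strengths[i] + strengths[j])

alive = list(range(len(strengths)))
wins = {i: 0 for i in alive}
plays = {i: 0 for i in alive}
t = 0
while len(alive) > 1 and t < 10_000:
    t += 1
    for i in alive:  # each surviving item plays one comparison per round
        j = random.choice([k for k in alive if k != i])
        wins[i] += noisy_compare(i, j)
        plays[i] += 1
    radius = math.sqrt(math.log(4 * t) / (2 * t))  # Hoeffding-style confidence radius
    rates = {i: wins[i] / plays[i] for i in alive}
    best = max(rates.values())
    alive = [i for i in alive if rates[i] + radius >= best - radius]  # drop beaten items

print("Estimated best item:", max(alive, key=lambda i: wins[i] / plays[i]))
```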
arXiv Detail & Related papers (2021-10-08T13:51:55Z)
- The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
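The effect is easy to reproduce in miniature with invented numbers: the same per-task scores produce different winners depending on which tasks are aggregated.

```python
# Illustrative only: task choice alone flips the leaderboard.
scores = {  # accuracy per (model, task); invented numbers
    "ModelA": {"task1": 0.90, "task2": 0.60, "task3": 0.85},
    "ModelB": {"task1": 0.70, "task2": 0.95, "task3": 0.80},
}
for tasks in (["task1", "task3"], ["task2", "task3"]):
    avg = {m: sum(s[t] for t in tasks) / len(tasks) for m, s in scores.items()}
    print(tasks, "-> winner:", max(avg, key=avg.get), avg)
```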
arXiv Detail & Related papers (2021-07-14T21:08:30Z)
- Scalable Personalised Item Ranking through Parametric Density Estimation [53.44830012414444]
Learning from implicit feedback is challenging because of the one-class nature of the problem: only positive interactions are observed.
Most conventional methods use a pairwise ranking approach and negative samplers to cope with the one-class problem.
We propose a learning-to-rank approach that achieves convergence speed comparable to its pointwise counterpart.
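For reference, the conventional pairwise approach contrasted here is typified by BPR with negative sampling; the sketch below shows that baseline, not the paper's density-estimation method.

```python
# Sketch: Bayesian Personalized Ranking (BPR) loss with sampled negatives --
# the conventional pairwise baseline, not the paper's proposed approach.
import torch
import torch.nn.functional as F

def bpr_loss(user_emb, pos_item_emb, neg_item_emb):
    """All inputs: (batch, dim). Push positive item scores above sampled negatives."""
    pos_scores = (user_emb * pos_item_emb).sum(dim=1)  # dot-product scores
    neg_scores = (user_emb * neg_item_emb).sum(dim=1)
    return -F.logsigmoid(pos_scores - neg_scores).mean()

# Toy usage with random embeddings.
u = torch.randn(32, 16, requires_grad=True)
p, n = torch.randn(32, 16), torch.randn(32, 16)
bpr_loss(u, p, n).backward()
```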
arXiv Detail & Related papers (2021-05-11T03:38:16Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
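The resampling machinery here is the standard bootstrap: resample the evaluation instances and re-compute the metric-human correlation to obtain a confidence interval. A generic sketch with synthetic scores, not the paper's data:

```python
# Sketch: bootstrap confidence interval for a metric-human correlation.
import numpy as np

rng = np.random.default_rng(0)
n = 50
human_scores = rng.normal(size=n)                        # synthetic human judgments
metric_scores = 0.5 * human_scores + rng.normal(size=n)  # weakly correlated metric

boots = []
for _ in range(10_000):
    idx = rng.integers(0, n, size=n)  # resample instances with replacement
    boots.append(np.corrcoef(human_scores[idx], metric_scores[idx])[0, 1])
lo, hi = np.percentile(boots, [2.5, 97.5])
print(f"95% bootstrap CI for the correlation: [{lo:.2f}, {hi:.2f}]")
```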
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
- Rethinking Empirical Evaluation of Adversarial Robustness Using First-Order Attack Methods [6.531546527140473]
We identify three common cases that lead to overestimation of adversarial accuracy against bounded first-order attack methods.
We propose compensation methods that address sources of inaccurate gradient computation.
Overall, our work shows that overestimated adversarial accuracy that is not indicative of robustness is prevalent even for conventionally trained deep neural networks.
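The bounded first-order attacks in question are typically PGD-style; below is a standard PGD sketch (not the paper's compensation methods). Overestimation arises when the gradient at the marked step is inaccurate, e.g. masked by non-differentiable components.

```python
# Sketch: standard L-infinity PGD attack for first-order robustness evaluation.
# `model` is any torch module returning logits; hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # Random start inside the eps-ball, clipped to the valid pixel range.
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1).detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        # Exact loss gradient; masked or approximated gradients at this step
        # are a common source of overestimated adversarial accuracy.
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()  # ascent step
        x_adv = x + (x_adv - x).clamp(-eps, eps)      # project back into the ball
        x_adv = x_adv.clamp(0, 1).detach()
    return x_adv
```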
arXiv Detail & Related papers (2020-06-01T22:55:09Z)