Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
- URL: http://arxiv.org/abs/2405.01719v2
- Date: Mon, 6 May 2024 15:09:50 GMT
- Title: Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks
- Authors: Guanhua Zhang, Moritz Hardt
- Abstract summary: We examine multi-task benchmarks in machine learning through the lens of social choice theory.
We show that the more diverse a benchmark is, the more sensitive it is to trivial changes.
- Score: 20.24270790628136
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction between cardinal and ordinal benchmark systems: the former aggregate numerical scores into one model ranking; the latter aggregate rankings for each task. We apply Arrow's impossibility theorem to ordinal benchmarks to highlight the inherent limitations of ordinal systems, particularly their sensitivity to the inclusion of irrelevant models. Inspired by Arrow's theorem, we empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. Sensitivity quantifies the impact that irrelevant changes to tasks have on a benchmark. Diversity captures the degree of disagreement in model rankings across tasks. We develop efficient approximation algorithms for both measures, as exact computation is computationally challenging. Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear trade-off between diversity and stability: the more diverse a multi-task benchmark, the more sensitive it is to trivial changes. Additionally, we show that the aggregated rankings of existing benchmarks are highly unstable under irrelevant changes. The code and data are available at https://socialfoundations.github.io/benchbench/.
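To make the abstract's quantities concrete, here is a minimal sketch (not the authors' implementation; see their benchbench repository for that). It approximates diversity as the mean pairwise Kendall-tau distance between per-task model rankings, uses Borda count as the ordinal aggregation rule, and runs on an invented score matrix.

```python
import itertools
import numpy as np

def task_rankings(scores: np.ndarray) -> np.ndarray:
    """Per-task rank of each model (0 = best) from a (models x tasks) score matrix."""
    order = np.argsort(-scores, axis=0)            # best model first, per task
    ranks = np.empty_like(order)
    for t in range(scores.shape[1]):
        ranks[order[:, t], t] = np.arange(scores.shape[0])
    return ranks

def kendall_tau_distance(r1, r2) -> float:
    """Fraction of model pairs on which two rankings disagree."""
    pairs = list(itertools.combinations(range(len(r1)), 2))
    discordant = sum((r1[i] - r1[j]) * (r2[i] - r2[j]) < 0 for i, j in pairs)
    return discordant / len(pairs)

def diversity(scores: np.ndarray) -> float:
    """Mean pairwise Kendall-tau distance between per-task rankings."""
    ranks = task_rankings(scores)
    return float(np.mean([kendall_tau_distance(ranks[:, a], ranks[:, b])
                          for a, b in itertools.combinations(range(scores.shape[1]), 2)]))

def borda_order(scores: np.ndarray) -> list:
    """Ordinal aggregation: sum per-task ranks; lower total is better."""
    return list(np.argsort(task_rankings(scores).sum(axis=1)))

# Two models on three tasks: A beats B on tasks 0 and 1, B beats A on task 2.
scores = np.array([[0.9, 0.9, 0.1],    # model A
                   [0.8, 0.8, 0.9]])   # model B
print(diversity(scores))                               # ~0.67: tasks disagree often
print([["A", "B"][i] for i in borda_order(scores)])    # ['A', 'B']

# Sensitivity probe: add two models that never win any task and never change
# how A and B compare on any single task, then re-aggregate.
augmented = np.vstack([scores, [[0.3, 0.3, 0.6],       # model D (invented)
                                [0.2, 0.2, 0.5]]])     # model E (invented)
print([["A", "B", "D", "E"][i] for i in borda_order(augmented)])  # B overtakes A
```

The last line exhibits the Arrow-style instability described above: D and E never win a task and never alter a single-task comparison between A and B, yet adding them flips the aggregate order of A and B.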
Related papers
- Bipartite Ranking Fairness through a Model Agnostic Ordering Adjustment [54.179859639868646]
We propose xOrder, a model-agnostic post-processing framework for achieving fairness in bipartite ranking.
xOrder is compatible with various classification models and ranking fairness metrics, including supervised and unsupervised fairness metrics.
We evaluate our proposed algorithm on four benchmark data sets and two real-world patient electronic health record repositories.
arXiv Detail & Related papers (2023-07-27T07:42:44Z) - Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models [80.23791222509644]
Inconsistent AI models are considered brittle and untrustworthy by human users.
We find that state-of-the-art vision-language models suffer from a surprisingly high degree of inconsistent behavior across tasks.
We propose a rank correlation-based auxiliary training objective, computed over large automatically created cross-task contrast sets.
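A minimal, non-differentiable illustration of the quantity such an objective encourages, assuming two task heads score the same five inputs (all numbers invented; the paper's actual objective is a trainable surrogate used during training):

```python
# Measure cross-task rank agreement over a shared set of inputs and turn
# disagreement into a penalty. Illustrative only; not the paper's code.
from scipy.stats import spearmanr

task_a_scores = [0.9, 0.4, 0.7, 0.1, 0.6]   # e.g., one task head's scores per input
task_b_scores = [0.8, 0.5, 0.9, 0.2, 0.4]   # e.g., another task head's scores

rho, _ = spearmanr(task_a_scores, task_b_scores)
consistency_penalty = 1.0 - rho             # small when the tasks agree on ordering
print(rho, consistency_penalty)             # 0.8, 0.2 for these invented numbers
```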
arXiv Detail & Related papers (2023-03-28T16:57:12Z) - Vote'n'Rank: Revision of Benchmarking with Social Choice Theory [7.224599819499157]
This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of social choice theory.
We demonstrate that our approach can be efficiently utilised to draw new insights into benchmarking in several ML sub-fields.
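For intuition, here is a hedged sketch of two classic social-choice rules applied to the same invented score matrix; Vote'n'Rank supports a richer family of rules, and this is not its API:

```python
# Plurality and Borda can crown different winners on the same benchmark.
import numpy as np

scores = np.array([[0.9, 0.9, 0.9, 0.1, 0.1],   # model A: wins 3 tasks, tanks 2
                   [0.8, 0.8, 0.8, 0.9, 0.9],   # model B: solid everywhere
                   [0.1, 0.1, 0.1, 0.5, 0.5]])  # model C

ranks = np.argsort(np.argsort(-scores, axis=0), axis=0)  # per-task rank, 0 = best

plurality_winner = np.argmax((ranks == 0).sum(axis=1))   # most first places
borda_winner = np.argmin(ranks.sum(axis=1))              # lowest total rank
print("plurality:", "ABC"[plurality_winner])             # A
print("Borda:    ", "ABC"[borda_winner])                 # B
```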
arXiv Detail & Related papers (2022-10-11T20:19:11Z) - The Curse of Low Task Diversity: On the Failure of Transfer Learning to Outperform MAML and Their Empirical Equivalence [20.965759895300327]
We propose a novel metric -- the diversity coefficient -- to measure the diversity of tasks in a few-shot learning benchmark.
Using the diversity coefficient, we show that the popular MiniImageNet and CIFAR-FS few-shot learning benchmarks have low diversity.
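That paper derives its coefficient from Task2Vec embeddings; the sketch below assumes each task is already summarized by a fixed-length vector and uses the mean pairwise cosine distance as a stand-in:

```python
# Illustrative diversity coefficient over invented task embeddings.
import numpy as np

task_embeddings = np.array([[0.2, 0.9, 0.1],    # hypothetical task vectors
                            [0.3, 0.8, 0.2],
                            [0.9, 0.1, 0.4]])

normed = task_embeddings / np.linalg.norm(task_embeddings, axis=1, keepdims=True)
cosine_dist = 1.0 - normed @ normed.T                    # pairwise distances
n = len(task_embeddings)
coefficient = cosine_dist[np.triu_indices(n, k=1)].mean()
print(coefficient)   # near 0 => homogeneous tasks; larger => more diverse
```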
arXiv Detail & Related papers (2022-08-02T15:49:11Z) - Few-Shot Image Classification Benchmarks are Too Far From Reality: Build Back Better with Semantic Task Sampling [4.855663359344748]
We introduce a new benchmark for Few-Shot Image Classification using the Danish Fungi 2020 dataset.
This benchmark offers a wide variety of evaluation tasks spanning various levels of granularity.
Our experiments bring out the correlation between the difficulty of a task and the semantic similarity between its classes.
arXiv Detail & Related papers (2022-05-10T20:25:43Z) - Towards QD-suite: developing a set of benchmarks for Quality-Diversity algorithms [0.0]
Existing benchmarks are not standardized, and there is currently no MNIST equivalent for Quality-Diversity (QD) algorithms.
We argue that identifying the challenges faced by QD methods and developing targeted, challenging, scalable benchmarks are important steps.
arXiv Detail & Related papers (2022-05-06T13:33:50Z) - A Theoretically Grounded Benchmark for Evaluating Machine Commonsense [6.725087407394836]
Theoretically-Grounded Commonsense Reasoning (TG-CSR) is based on discriminative question answering, but with questions designed to evaluate diverse aspects of commonsense.
TG-CSR is based on a subset of commonsense categories first proposed as a viable theory of commonsense by Gordon and Hobbs.
Preliminary results suggest that the benchmark is challenging even for advanced language representation models designed for discriminative CSR question answering tasks.
arXiv Detail & Related papers (2022-03-23T04:06:01Z) - The Benchmark Lottery [114.43978017484893]
"A benchmark lottery" describes the overall fragility of the machine learning benchmarking process.
We show that the relative performance of algorithms may be altered significantly simply by choosing different benchmark tasks.
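A toy probe of this effect, with an invented score matrix: rank three models by mean score on every three-task subset of a five-task suite and count how often each comes out on top:

```python
# The "benchmark lottery" in miniature: the winner depends on task choice.
from itertools import combinations
import numpy as np

scores = np.array([[0.70, 0.90, 0.40, 0.80, 0.30],   # model A
                   [0.65, 0.60, 0.70, 0.60, 0.75],   # model B
                   [0.90, 0.30, 0.80, 0.20, 0.85]])  # model C

wins = {m: 0 for m in "ABC"}
for subset in combinations(range(scores.shape[1]), 3):
    winner = np.argmax(scores[:, list(subset)].mean(axis=1))  # cardinal aggregation
    wins["ABC"[winner]] += 1
print(wins)   # each model wins on some subsets: the "best" model is task-dependent
```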
arXiv Detail & Related papers (2021-07-14T21:08:30Z) - MultiFair: Multi-Group Fairness in Machine Learning [52.24956510371455]
We study multi-group fairness in machine learning (MultiFair).
We propose a generic end-to-end algorithmic framework to solve it.
Our proposed framework is generalizable to many different settings.
arXiv Detail & Related papers (2021-05-24T02:30:22Z) - Towards Model-Agnostic Post-Hoc Adjustment for Balancing Ranking
Fairness and Algorithm Utility [54.179859639868646]
Bipartite ranking aims to learn a scoring function that ranks positive individuals higher than negative ones from labeled data.
There are rising concerns about whether the learned scoring function can cause systematic disparity across different protected groups.
We propose a model post-processing framework for balancing ranking fairness and algorithm utility in the bipartite ranking scenario.
arXiv Detail & Related papers (2020-06-15T10:08:39Z) - Robust Learning Through Cross-Task Consistency [92.42534246652062]
We propose a broadly applicable and fully computational method for augmenting learning with Cross-Task Consistency.
We observe that learning with cross-task consistency leads to more accurate predictions and better generalization to out-of-distribution inputs.
arXiv Detail & Related papers (2020-06-07T09:24:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.