Metritocracy: Representative Metrics for Lite Benchmarks
- URL: http://arxiv.org/abs/2506.09813v2
- Date: Mon, 16 Jun 2025 12:43:27 GMT
- Title: Metritocracy: Representative Metrics for Lite Benchmarks
- Authors: Ariel Procaccia, Benjamin Schiffer, Serena Wang, Shirley Zhang
- Abstract summary: We use ideas from social choice theory to formalize two notions of representation for the selection of a subset of evaluation metrics. We first introduce positional representation, which guarantees every alternative is sufficiently represented at every position cutoff. We then introduce positional proportionality, which guarantees no alternative is proportionally over- or under-represented by more than a small error at any position.
- Score: 3.0936354370614607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A common problem in LLM evaluation is how to choose a subset of metrics from a full suite of possible metrics. Subset selection is usually done for efficiency or interpretability reasons, and the goal is often to select a "representative" subset of metrics. However, "representative" is rarely clearly defined. In this work, we use ideas from social choice theory to formalize two notions of representation for the selection of a subset of evaluation metrics. We first introduce positional representation, which guarantees every alternative is sufficiently represented at every position cutoff. We then introduce positional proportionality, which guarantees no alternative is proportionally over- or under-represented by more than a small error at any position. We prove upper and lower bounds on the smallest number of metrics needed to guarantee either of these properties in the worst case. We also study a generalized form of each property that allows for additional input on groups of metrics that must be represented. Finally, we tie theory to practice through real-world case studies on both LLM evaluation and hospital quality evaluation.
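To make the two notions concrete, here is a minimal sketch of the proportionality-style check, assuming each metric induces a ranking of the alternatives (e.g., models) and that a selected subset should match the full suite's top-k membership frequencies within a tolerance epsilon. The function names and the tolerance are illustrative, not from the paper; positional representation would instead require each alternative to clear a minimum top-k frequency.

```python
def topk_fraction(rankings, alternative, k):
    """Fraction of rankings that place `alternative` within the top k positions."""
    return sum(r.index(alternative) < k for r in rankings) / len(rankings)

def is_positionally_proportional(full_suite, subset, alternatives, epsilon):
    """Check that for every alternative and every cutoff k, the subset's
    top-k frequency matches the full suite's within epsilon."""
    n = len(alternatives)
    return all(
        abs(topk_fraction(subset, a, k) - topk_fraction(full_suite, a, k)) <= epsilon
        for a in alternatives
        for k in range(1, n + 1)
    )

# Toy example: three metrics, each ranking models A, B, C.
suite = [["A", "B", "C"], ["B", "A", "C"], ["A", "C", "B"]]
print(is_positionally_proportional(suite, suite[:2], ["A", "B", "C"], epsilon=0.34))  # True
```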
Related papers
- Ethical AI on the Waitlist: Group Fairness Evaluation of LLM-Aided Organ Allocation [19.66750942418172]
Using organ allocation as a case study, we introduce two tasks: (1) Choose-One and (2) Rank-All. In Rank-All, LLMs rank all candidates for a kidney, reflecting real-world allocation processes. Since traditional fairness metrics do not account for ranking, we propose a novel application of Borda scoring to capture biases (a toy Borda computation follows this entry).
arXiv Detail & Related papers (2025-03-29T04:36:25Z)
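Borda scoring itself is standard: in a ranking of m candidates, the candidate at position i (0-indexed) earns m - 1 - i points. The group-level audit below is a hypothetical illustration of how ranked allocations could be checked for bias, not the paper's exact protocol.

```python
def borda_scores(ranking):
    """Position i (0-indexed) in a ranking of m candidates earns m - 1 - i points."""
    m = len(ranking)
    return {cand: m - 1 - i for i, cand in enumerate(ranking)}

# Hypothetical audit: average Borda score per group over several LLM rankings.
rankings = [["p1", "p2", "p3", "p4"], ["p2", "p1", "p4", "p3"]]
group = {"p1": "A", "p2": "A", "p3": "B", "p4": "B"}

totals, counts = {}, {}
for r in rankings:
    for cand, pts in borda_scores(r).items():
        g = group[cand]
        totals[g] = totals.get(g, 0) + pts
        counts[g] = counts.get(g, 0) + 1

# A large, systematic gap between group averages suggests ranking bias.
print({g: totals[g] / counts[g] for g in totals})  # {'A': 2.5, 'B': 0.5}
```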
- Multi-Group Proportional Representation in Retrieval [46.00781543425424]
We introduce Multi-Group Proportional Representation (MPR), a novel metric that measures representation across intersectional groups.
Optimizing for MPR yields more proportional representation across multiple intersectional groups specified by a rich function class, often with minimal compromise in retrieval accuracy (a simplified MPR gap is sketched after this entry).
arXiv Detail & Related papers (2024-07-11T14:59:17Z)
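MPR is defined over a rich function class; the sketch below collapses that class to a handful of explicit group-indicator functions, which is a simplifying assumption. It reports the worst-case gap between each group's share of the retrieved set and its share of a reference population.

```python
def mpr_gap(retrieved, population, groups):
    """Worst-case deviation between each group's share of the retrieved set
    and its share of the reference population."""
    def share(items, g):
        return sum(g(x) for x in items) / len(items)
    return max(abs(share(retrieved, g) - share(population, g)) for g in groups)

# Records are (gender, age); intersectional groups are indicator functions.
population = [("f", "young"), ("f", "old"), ("m", "young"), ("m", "old")] * 25
retrieved = [("f", "young")] * 6 + [("m", "old")] * 4

groups = [
    lambda x: x[0] == "f",                      # women
    lambda x: x[1] == "young",                  # young people
    lambda x: x[0] == "f" and x[1] == "young",  # young women (intersection)
]
print(mpr_gap(retrieved, population, groups))  # 0.35, driven by the intersection
```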
- A Closer Look at Classification Evaluation Metrics and a Critical Reflection of Common Evaluation Practice [6.091702876917282]
Classification systems are evaluated in countless papers.
However, we find that evaluation practice is often nebulous.
Many works use so-called 'macro' metrics to rank systems but do not clearly specify what they would expect from such a metric (the ambiguity is illustrated below).
arXiv Detail & Related papers (2024-04-25T18:12:43Z)
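The ambiguity the authors criticize is easy to make concrete: 'macro F1' has at least two common readings that disagree on the same per-class scores. The numbers below are hypothetical.

```python
def f1(p, r):
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

# Hypothetical per-class precision and recall for a 2-class system.
prec = [1.0, 0.2]
rec = [0.5, 1.0]

# Reading 1: average the per-class F1 scores.
macro_f1_a = sum(f1(p, r) for p, r in zip(prec, rec)) / 2

# Reading 2: F1 of the macro-averaged precision and recall.
macro_f1_b = f1(sum(prec) / 2, sum(rec) / 2)

print(macro_f1_a, macro_f1_b)  # ~0.50 vs ~0.67: two defensible 'macro F1' values
```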
- Revisiting Evaluation Metrics for Semantic Segmentation: Optimization and Evaluation of Fine-grained Intersection over Union [113.20223082664681]
We propose the use of fine-grained mIoUs along with corresponding worst-case metrics.
These fine-grained metrics offer less bias towards large objects, richer statistical information, and valuable insights into model and dataset auditing.
Our benchmark study highlights the necessity of not basing evaluations on a single metric and confirms that fine-grained mIoUs reduce the bias towards large objects (the bucketing idea is sketched after this entry).
arXiv Detail & Related papers (2023-10-30T03:45:15Z)
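The paper's exact fine-grained definitions should be taken from the paper itself; the sketch below only illustrates the general flavor: score instances per size bucket and report a worst case alongside the mean. The bucket threshold and names are assumptions.

```python
import numpy as np

def iou(pred, gt):
    """IoU between two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def sized_iou_report(instances, small_thresh=32 * 32):
    """Group instance IoUs by object size; report (mean, worst case) per bucket."""
    buckets = {"small": [], "large": []}
    for pred, gt in instances:
        key = "small" if gt.sum() < small_thresh else "large"
        buckets[key].append(iou(pred, gt))
    return {k: (float(np.mean(v)), float(np.min(v))) for k, v in buckets.items() if v}
```

Reporting the per-bucket minimum alongside the mean is one way to keep a few large, easy objects from masking failures on small ones.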
- Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of references can significantly enhance the correlation between automatic and human evaluation (the core scoring rule is sketched after this entry).
arXiv Detail & Related papers (2023-05-24T11:53:29Z)
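The core scoring rule, evaluating a hypothesis against the best of several reference paraphrases, can be sketched directly. Here `paraphrase` and `metric` are placeholders standing in for whatever LLM rewriter and automatic metric are actually used.

```python
def div_ref_score(hypothesis, reference, paraphrase, metric, n_refs=5):
    """Score against the original reference plus generated paraphrases,
    keeping the best match so valid rewordings are not penalized."""
    references = [reference] + [paraphrase(reference) for _ in range(n_refs)]
    return max(metric(hypothesis, ref) for ref in references)
```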
- Synergies between Disentanglement and Sparsity: Generalization and Identifiability in Multi-Task Learning [79.83792914684985]
We prove a new identifiability result that provides conditions under which maximally sparse base-predictors yield disentangled representations.
Motivated by this theoretical result, we propose a practical approach to learn disentangled representations based on a sparsity-promoting bi-level optimization problem.
arXiv Detail & Related papers (2022-11-26T21:02:09Z)
- Relational Proxies: Emergent Relationships as Fine-Grained Discriminators [52.17542855760418]
We propose a novel approach that leverages relational information between the global and local parts of an object to encode its label.
We design Relational Proxies based on our theoretical findings and evaluate them on seven challenging fine-grained benchmark datasets.
We also experimentally validate our theory and obtain consistent results across multiple benchmarks.
arXiv Detail & Related papers (2022-10-05T11:08:04Z)
- Optimizing Partial Area Under the Top-k Curve: Theory and Practice [151.5072746015253]
We develop a novel metric named partial Area Under the top-k Curve (AUTKC).
AUTKC has better discrimination ability, and its Bayes-optimal score function gives a correct top-k ranking with respect to the conditional probability.
We present an empirical surrogate risk minimization framework to optimize the proposed metric (one plausible reading of the metric is sketched after this entry).
arXiv Detail & Related papers (2022-09-03T11:09:13Z)
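The formal definition of AUTKC is in the paper; one plausible reading, averaging top-k accuracy over cutoffs k = 1..K, is sketched below. The averaging scheme is an assumption.

```python
import numpy as np

def autkc(scores, labels, K):
    """Average top-k accuracy over cutoffs k = 1..K.
    scores: (n_samples, n_classes); labels: (n_samples,) true class ids."""
    order = np.argsort(-scores, axis=1)  # class indices, best score first
    ranks = np.array([np.where(order[i] == labels[i])[0][0] for i in range(len(labels))])
    return float(np.mean([(ranks < k).mean() for k in range(1, K + 1)]))

scores = np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]])
print(autkc(scores, np.array([0, 2]), K=2))  # 0.75
```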
- Evaluating Metrics for Bias in Word Embeddings [44.14639209617701]
We formalize a bias definition based on ideas from previous works and derive conditions for bias metrics.
We propose a new metric, SAME, to address the shortcomings of existing metrics and mathematically prove that SAME behaves appropriately.
arXiv Detail & Related papers (2021-11-15T16:07:15Z)
- Evaluation Metrics for Conditional Image Generation [100.69766435176557]
We present two new metrics for evaluating generative models in the class-conditional image generation setting.
A theoretical analysis shows the motivation behind each proposed metric and links the novel metrics to their unconditional counterparts.
We provide an extensive empirical evaluation, comparing the metrics to their unconditional variants and to other metrics, and utilize them to analyze existing generative models.
arXiv Detail & Related papers (2020-04-26T12:15:16Z)
- Asymmetric Distribution Measure for Few-shot Learning [82.91276814477126]
Metric-based few-shot image classification aims to measure the relations between query images and support classes.
We propose a novel Asymmetric Distribution Measure (ADM) network for few-shot learning.
We achieve 3.02% and 1.56% gains over the state-of-the-art method on the 5-way 1-shot task.
arXiv Detail & Related papers (2020-02-01T06:41:52Z)
- AMR Similarity Metrics from Principles [21.915057426589748]
We establish criteria that enable researchers to perform a principled assessment of metrics comparing meaning representations like AMR.
We propose a novel metric, S2match, that is lenient only toward very slight meaning deviations and targets the fulfilment of all established criteria (the soft-matching idea is sketched after this entry).
arXiv Detail & Related papers (2020-01-29T16:19:44Z)
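Smatch computes an F-score over matched relation triples of two AMR graphs; S2match relaxes the exact concept match to a graded similarity so that near-synonyms are not scored as complete mismatches. The sketch below assumes variable alignment is already resolved (Smatch finds it by hill climbing) and uses a made-up similarity table in place of the embedding similarity the authors use.

```python
SIM = {("die", "perish"): 0.9}  # made-up similarity table for illustration

def sim(a, b):
    return 1.0 if a == b else max(SIM.get((a, b), 0.0), SIM.get((b, a), 0.0))

def soft_f1(triples_a, triples_b, sim):
    """F-score over greedily matched (relation, source, target) triples,
    crediting near-synonymous targets with a graded score instead of 0/1."""
    credit, unmatched = 0.0, list(triples_b)
    for rel_a, src_a, tgt_a in triples_a:
        best, best_s = None, 0.0
        for rel_b, src_b, tgt_b in unmatched:
            if rel_a == rel_b and src_a == src_b:
                s = sim(tgt_a, tgt_b)
                if s > best_s:
                    best, best_s = (rel_b, src_b, tgt_b), s
        if best is not None:
            unmatched.remove(best)
            credit += best_s
    p, r = credit / len(triples_a), credit / len(triples_b)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)

print(soft_f1([(":instance", "d", "die")], [(":instance", "d", "perish")], sim))  # 0.9
```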
This list is automatically generated from the titles and abstracts of the papers on this site.