Score Design for Multi-Criteria Incentivization
- URL: http://arxiv.org/abs/2410.06290v1
- Date: Tue, 8 Oct 2024 18:47:08 GMT
- Title: Score Design for Multi-Criteria Incentivization
- Authors: Anmol Kabra, Mina Karzand, Tosca Lechner, Nathan Srebro, Serena Wang
- Abstract summary: We present a framework for designing scores to summarize performance metrics.
We formulate our design to minimize the dimensionality of scores while satisfying the objectives.
This framework draws motivation from real-world practices in hospital rating systems, where misaligned scores and performance metrics lead to unintended consequences.
- Score: 24.140631944678336
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a framework for designing scores to summarize performance metrics. Our design has two multi-criteria objectives: (1) improving on scores should improve all performance metrics, and (2) achieving Pareto-optimal scores should achieve Pareto-optimal metrics. We formulate our design to minimize the dimensionality of scores while satisfying the objectives. We give algorithms to design scores, which are provably minimal under mild assumptions on the structure of performance metrics. This framework draws motivation from real-world practices in hospital rating systems, where misaligned scores and performance metrics lead to unintended consequences.
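To make the two objectives concrete, here is a toy sketch (not the paper's algorithm) that empirically probes objective (1) for a hypothetical linear score design: each score is a nonnegative weighted combination of the underlying metrics, and we check whether component-wise improvement in the score vector coincides with component-wise improvement in the metric vector on sampled data. The linear form, dimensions, and random weights are illustrative assumptions; with a generic weight matrix the check typically fails, which is the kind of misalignment the framework is designed to rule out.

```python
import numpy as np

# Toy sketch (illustrative assumptions, not the paper's construction):
# a candidate score design maps a vector of 5 performance metrics to
# 2 scores via a nonnegative linear map S = W @ m. Objective (1) asks
# that whenever one score vector dominates another component-wise, the
# underlying metric vectors are ordered the same way.
rng = np.random.default_rng(0)
n_metrics, n_scores, n_samples = 5, 2, 500

W = rng.uniform(0.0, 1.0, size=(n_scores, n_metrics))      # candidate design
metrics = rng.uniform(0.0, 1.0, size=(n_samples, n_metrics))
scores = metrics @ W.T

def dominates(a, b):
    """Component-wise: a is at least b everywhere and strictly better somewhere."""
    return bool(np.all(a >= b) and np.any(a > b))

checked = violations = 0
for i in range(n_samples):
    for j in range(n_samples):
        if i != j and dominates(scores[i], scores[j]):
            checked += 1
            if not dominates(metrics[i], metrics[j]):
                violations += 1   # better scores did not imply better metrics

print(f"score-dominant pairs: {checked}, objective-(1) violations: {violations}")
```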
Related papers
- The Pitfalls of Benchmarking in Algorithm Selection: What We Are Getting Wrong [1.973144426163543]
We highlight methodological issues that frequently occur in the community and should be addressed when evaluating algorithm selection approaches. We show that non-informative features and meta-models can achieve high accuracy, which should not be the case with a well-designed evaluation framework.
arXiv Detail & Related papers (2025-05-12T16:57:45Z) - Resampling Benchmark for Efficient Comprehensive Evaluation of Large Vision-Language Models [18.309464845180237]
We propose an efficient evaluation protocol for large vision-language models (VLMs).
We construct a subset that yields results comparable to full benchmark evaluations.
Applying FPS to an existing benchmark improves correlation with overall evaluation results.
arXiv Detail & Related papers (2025-04-14T08:43:00Z) - Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge [78.28188747489769]
We propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge.
In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions.
Our method achieves a new state-of-the-art performance for generative reward models on RewardBench.
arXiv Detail & Related papers (2025-01-30T02:21:59Z) - Foundations of the Theory of Performance-Based Ranking [10.89980029564174]
We establish the foundations of a universal theory for performance-based ranking.
A universal parametric family of scores, called ranking scores, can be used to establish rankings satisfying our axioms.
We show, in the case of two-class classification, that the family of ranking scores encompasses well-known performance scores.
arXiv Detail & Related papers (2024-12-05T15:05:25Z) - Benchmarking End-To-End Performance of AI-Based Chip Placement Algorithms [77.71341200638416]
ChiPBench is a benchmark designed to evaluate the effectiveness of AI-based chip placement algorithms.
We have gathered 20 circuits from various domains (e.g., CPU, GPU, and microcontrollers) for evaluation.
Results show that even when a single-point algorithm's intermediate metric is dominant, the final PPA (power, performance, and area) results are unsatisfactory.
arXiv Detail & Related papers (2024-07-03T03:29:23Z) - Improving the Validity and Practical Usefulness of AI/ML Evaluations Using an Estimands Framework [2.4861619769660637]
We propose an estimands framework adapted from international clinical trials guidelines.
This framework provides a systematic structure for inference and reporting in evaluations.
We demonstrate how the framework can help uncover underlying issues, their causes, and potential solutions.
arXiv Detail & Related papers (2024-06-14T18:47:37Z) - Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence [51.54175067684008]
This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks.
We first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes.
Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
arXiv Detail & Related papers (2024-03-17T07:02:55Z) - Lower-Left Partial AUC: An Effective and Efficient Optimization Metric
for Recommendation [52.45394284415614]
We propose a new optimization metric, Lower-Left Partial AUC (LLPAUC), which is computationally efficient like AUC but strongly correlates with Top-K ranking metrics.
LLPAUC considers only the partial area under the ROC curve in the lower-left corner, focusing optimization on Top-K items; a toy numeric sketch of this restricted area appears after this list.
arXiv Detail & Related papers (2024-02-29T13:58:33Z) - Design and Architecture for a Centralized, Extensible, and Configurable
Scoring Application [0.0]
In modern-day organizations, many software applications require critical input to decide the next steps in the application workflow.
This article discusses how to envision and design a generic, optimized scoring engine.
arXiv Detail & Related papers (2023-12-10T02:31:23Z) - Do Performance Aspirations Matter for Guiding Software Configuration
Tuning? [6.492599077364121]
We show that the realism of aspirations is the key factor that determines whether they should be used to guide the tuning.
The available tuning budget can also influence the choice of aspirations, but its effect is insignificant under realistic ones.
arXiv Detail & Related papers (2023-01-09T12:11:05Z) - Design Target Achievement Index: A Differentiable Metric to Enhance Deep
Generative Models in Multi-Objective Inverse Design [4.091593765662773]
Design Target Achievement Index (DTAI) is a differentiable, tunable metric that scores a design's ability to achieve designer-specified minimum performance targets.
We apply DTAI to a Performance-Augmented Diverse GAN (PaDGAN) and demonstrate superior generative performance compared to a set of baseline Deep Generative Models.
arXiv Detail & Related papers (2022-05-06T04:14:34Z) - QAFactEval: Improved QA-Based Factual Consistency Evaluation for
Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z) - Rethinking Counting and Localization in Crowds:A Purely Point-Based
Framework [59.578339075658995]
We propose a purely point-based framework for joint crowd counting and individual localization.
We design an intuitive solution under this framework, which is called Point to Point Network (P2PNet).
arXiv Detail & Related papers (2021-07-27T11:41:50Z) - MetricOpt: Learning to Optimize Black-Box Evaluation Metrics [21.608384691401238]
We study the problem of optimizing arbitrary non-differentiable task evaluation metrics such as misclassification rate and recall.
Our method, named MetricOpt, operates in a black-box setting where the computational details of the target metric are unknown.
We achieve this by learning a differentiable value function, which maps compact task-specific model parameters to metric observations.
arXiv Detail & Related papers (2021-04-21T16:50:01Z) - A Unified Framework of Surrogate Loss by Refactoring and Interpolation [65.60014616444623]
We introduce UniLoss, a unified framework to generate surrogate losses for training deep networks with gradient descent.
We validate the effectiveness of UniLoss on three tasks and four datasets.
arXiv Detail & Related papers (2020-07-27T21:16:51Z)
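A minimal numeric sketch of the lower-left partial AUC idea referenced in the LLPAUC entry above: the area under the ROC curve is computed only where the false-positive rate is at most alpha, with the true-positive rate capped at beta. The cutoff values, the interpolation onto a uniform FPR grid, and the normalization by alpha * beta are illustrative assumptions; this is not the paper's exact estimator or its optimization surrogate.

```python
import numpy as np
from sklearn.metrics import roc_curve

def llpauc(y_true, y_score, alpha=0.2, beta=0.8):
    """Approximate lower-left partial AUC: area under the ROC curve
    restricted to FPR <= alpha, with TPR capped at beta.
    Illustrative sketch only; alpha, beta, and the normalization
    are assumptions, not values from the paper."""
    fpr, tpr, _ = roc_curve(y_true, y_score)
    grid = np.linspace(0.0, alpha, 512)              # uniform FPR grid in [0, alpha]
    tpr_grid = np.interp(grid, fpr, tpr)             # piecewise-linear ROC approximation
    clipped = np.minimum(tpr_grid, beta)             # cap TPR at beta (lower-left corner)
    return np.trapz(clipped, grid) / (alpha * beta)  # normalize so the maximum is 1

# Hypothetical usage with synthetic labels and noisy relevance scores.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
s = 0.5 * y + rng.normal(0.0, 1.0, size=1000)
print(f"LLPAUC(alpha=0.2, beta=0.8) ~ {llpauc(y, s):.3f}")
```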
This list is automatically generated from the titles and abstracts of the papers on this site.