Evaluating Agents using Social Choice Theory
- URL: http://arxiv.org/abs/2312.03121v2
- Date: Thu, 7 Dec 2023 02:16:24 GMT
- Title: Evaluating Agents using Social Choice Theory
- Authors: Marc Lanctot, Kate Larson, Yoram Bachrach, Luke Marris, Zun Li,
Avishkar Bhoopchand, Thomas Anthony, Brian Tanner, Anna Koop
- Abstract summary: We argue that many general evaluation problems can be viewed through the lens of voting theory.
Each task is interpreted as a separate voter, which requires only ordinal rankings or pairwise comparisons of agents to produce an overall evaluation.
These evaluations are interpretable and flexible, while avoiding many of the problems currently facing cross-task evaluation.
- Score: 21.26784305333596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We argue that many general evaluation problems can be viewed through the lens
of voting theory. Each task is interpreted as a separate voter, which requires
only ordinal rankings or pairwise comparisons of agents to produce an overall
evaluation. By viewing the aggregator as a social welfare function, we are able
to leverage centuries of research in social choice theory to derive principled
evaluation frameworks with axiomatic foundations. These evaluations are
interpretable and flexible, while avoiding many of the problems currently
facing cross-task evaluation. We apply this Voting-as-Evaluation (VasE)
framework across multiple settings, including reinforcement learning, large
language models, and humans. In practice, we observe that VasE can be more
robust than popular evaluation frameworks (Elo and Nash averaging), discovers
properties in the evaluation data not evident from scores alone, and can
predict outcomes better than Elo in a complex seven-player game. We identify
one particular approach, maximal lotteries, that satisfies important
consistency properties relevant to evaluation, is computationally efficient
(polynomial in the size of the evaluation data), and identifies game-theoretic
cycles.
Related papers
- PeerRank: Autonomous LLM Evaluation Through Web-Grounded, Bias-Controlled Peer Review [1.2178992475191557]
We introduce PeerRank, a fully autonomous end-to-end evaluation framework.<n>Models generate evaluation tasks, answer them with category-scoped live web grounding.<n>PeerRank treats evaluation as a multi-agent process where each model participates symmetrically as task designer, respondent, and evaluator.
arXiv Detail & Related papers (2026-02-01T06:01:28Z) - LLMs Judge Themselves: A Game-Theoretic Framework for Human-Aligned Evaluation [41.42324204820521]
This work explores whether principles from game theory can be effectively applied to the evaluation of large language models (LLMs)<n>We propose a novel alternative: automatic mutual evaluation, where LLMs assess each other's output through self-play and peer review.<n>Our framework incorporates game-theoretic voting algorithms to aggregate peer reviews, enabling a principled investigation into whether model-generated rankings reflect human preferences.
arXiv Detail & Related papers (2025-10-17T15:34:25Z) - CPO: Addressing Reward Ambiguity in Role-playing Dialogue via Comparative Policy Optimization [53.79487826635141]
Reinforcement Learning Fine-Tuning (RLFT) has achieved notable success in tasks with objectively verifiable answers.<n>But it struggles with open-ended subjective tasks like role-playing dialogue.<n>Traditional reward modeling approaches, which rely on independent sample-wise scoring, face dual challenges: subjective evaluation criteria and unstable reward signals.<n>Motivated by the insight that human evaluation inherently combines explicit criteria with implicit comparative judgments, we propose Comparative Policy Optimization.
arXiv Detail & Related papers (2025-08-12T16:49:18Z) - From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback [36.68929551237421]
We introduce bftextFeedbacker, an evaluation framework that provides comprehensive and fine-grained results.<n>Our project homepage and dataset are available at https://liudan193.io/Feedbacker.
arXiv Detail & Related papers (2025-05-10T16:52:40Z) - Bayesian Active Learning for Multi-Criteria Comparative Judgement in Educational Assessment [3.0098452499209705]
Comparative Judgement (CJ) provides an alternative assessment approach by evaluating work holistically rather than breaking it into discrete criteria.
This method leverages human ability to make nuanced comparisons, yielding more reliable and valid assessments.
rubrics remain widely used in education, offering structured criteria for grading and detailed feedback.
This creates a gap between CJ's holistic ranking and the need for criterion-based performance breakdowns.
arXiv Detail & Related papers (2025-03-01T13:12:41Z) - Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework [61.38174427966444]
Large Language Models (LLMs) are being used more and more extensively for automated evaluation in various scenarios.
Previous studies have attempted to fine-tune open-source LLMs to replicate the evaluation explanations and judgments of powerful proprietary models.
We propose a novel evaluation framework, ARJudge, that adaptively formulates evaluation criteria and synthesizes both text-based and code-driven analyses.
arXiv Detail & Related papers (2025-02-26T06:31:45Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
textbfJudger-1 is the first open-source textbfall-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
textbfJudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles [20.18736445118689]
We introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit lateral thinking of Large Language Models (LLMs)
This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation.
Experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy.
arXiv Detail & Related papers (2024-10-09T10:09:11Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark to evaluate the fundamental abilities, including expression, commonsense and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Style Over Substance: Evaluation Biases for Large Language Models [17.13064447978519]
This study investigates the behavior of crowd-sourced and expert annotators, as well as large language models (LLMs)
Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contained grammatical errors.
We propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score.
arXiv Detail & Related papers (2023-07-06T14:42:01Z) - Do You Hear The People Sing? Key Point Analysis via Iterative Clustering
and Abstractive Summarisation [12.548947151123555]
Argument summarisation is a promising but currently under-explored field.
One of the main challenges in Key Point Analysis is finding high-quality key point candidates.
evaluating key points is crucial in ensuring that the automatically generated summaries are useful.
arXiv Detail & Related papers (2023-05-25T12:43:29Z) - Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples with different semantics responses.
There are risks in using eference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z) - KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind-spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z) - Vote'n'Rank: Revision of Benchmarking with Social Choice Theory [7.224599819499157]
This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of the social choice theory.
We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields.
arXiv Detail & Related papers (2022-10-11T20:19:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.