Evaluating Agents using Social Choice Theory
- URL: http://arxiv.org/abs/2312.03121v2
- Date: Thu, 7 Dec 2023 02:16:24 GMT
- Title: Evaluating Agents using Social Choice Theory
- Authors: Marc Lanctot, Kate Larson, Yoram Bachrach, Luke Marris, Zun Li,
Avishkar Bhoopchand, Thomas Anthony, Brian Tanner, Anna Koop
- Abstract summary: We argue that many general evaluation problems can be viewed through the lens of voting theory.
Each task is interpreted as a separate voter, which requires only ordinal rankings or pairwise comparisons of agents to produce an overall evaluation.
These evaluations are interpretable and flexible, while avoiding many of the problems currently facing cross-task evaluation.
- Score: 21.26784305333596
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We argue that many general evaluation problems can be viewed through the lens
of voting theory. Each task is interpreted as a separate voter, which requires
only ordinal rankings or pairwise comparisons of agents to produce an overall
evaluation. By viewing the aggregator as a social welfare function, we are able
to leverage centuries of research in social choice theory to derive principled
evaluation frameworks with axiomatic foundations. These evaluations are
interpretable and flexible, while avoiding many of the problems currently
facing cross-task evaluation. We apply this Voting-as-Evaluation (VasE)
framework across multiple settings, including reinforcement learning, large
language models, and humans. In practice, we observe that VasE can be more
robust than popular evaluation frameworks (Elo and Nash averaging), discovers
properties in the evaluation data not evident from scores alone, and can
predict outcomes better than Elo in a complex seven-player game. We identify
one particular approach, maximal lotteries, that satisfies important
consistency properties relevant to evaluation, is computationally efficient
(polynomial in the size of the evaluation data), and identifies game-theoretic
cycles.
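The abstract highlights maximal lotteries as a polynomial-time aggregation method over pairwise comparison data. As a rough illustration only (not the authors' implementation), the sketch below builds a skew-symmetric margin matrix from per-task ordinal rankings and computes a maximal lottery as the maximin strategy of the symmetric zero-sum game that matrix defines, via a feasibility linear program. The helper names and the toy rankings are assumptions for illustration.

```python
# Minimal sketch of maximal-lottery evaluation from per-task rankings.
# Assumed helper names and toy data; not the paper's code.
import numpy as np
from scipy.optimize import linprog

def build_margin_matrix(rankings, num_agents):
    """Skew-symmetric margin matrix M: M[i, j] = (# tasks ranking agent i
    above agent j) - (# tasks ranking j above i)."""
    M = np.zeros((num_agents, num_agents))
    for ranking in rankings:                 # each ranking lists agents best -> worst
        for pos, i in enumerate(ranking):
            for j in ranking[pos + 1:]:
                M[i, j] += 1
                M[j, i] -= 1
    return M

def maximal_lottery(M):
    """Probability vector p with M^T p >= 0, i.e. a maximin strategy of the
    symmetric zero-sum game with payoff matrix M, found by a feasibility LP."""
    n = M.shape[0]
    res = linprog(c=np.zeros(n),
                  A_ub=-M.T, b_ub=np.zeros(n),       # enforce (M^T p)_j >= 0
                  A_eq=np.ones((1, n)), b_eq=[1.0],  # probabilities sum to 1
                  bounds=[(0, None)] * n)
    return res.x

# Toy example: three tasks produce a Condorcet cycle over three agents,
# so the maximal lottery mixes (~1/3 each) rather than picking a single winner.
rankings = [[0, 1, 2], [1, 2, 0], [2, 0, 1]]
M = build_margin_matrix(rankings, num_agents=3)
print(maximal_lottery(M))
```

The cyclic toy data is chosen to show the property the abstract mentions: when the evaluation data contains game-theoretic cycles, the maximal lottery spreads probability across the cycle instead of forcing an arbitrary total order.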
Related papers
- CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - Weak-eval-Strong: Evaluating and Eliciting Lateral Thinking of LLMs with Situation Puzzles [20.18736445118689]
We introduce SPLAT, a benchmark leveraging Situation Puzzles to evaluate and elicit lateral thinking of Large Language Models (LLMs)
This benchmark, containing 975 graded situation puzzles across three difficulty levels, employs a new multi-turn player-judge framework instead of the traditional model-based evaluation.
Experiments demonstrate that a robust evaluation model, such as WizardLM-2, closely matches human judgements in both intermediate question-answering and final scenario accuracy.
arXiv Detail & Related papers (2024-10-09T10:09:11Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks.
A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - F-Eval: Assessing Fundamental Abilities with Refined Evaluation Methods [102.98899881389211]
We propose F-Eval, a bilingual evaluation benchmark for assessing fundamental abilities, including expression, commonsense, and logic.
For reference-free subjective tasks, we devise new evaluation methods, serving as alternatives to scoring by API models.
arXiv Detail & Related papers (2024-01-26T13:55:32Z) - FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets [69.91340332545094]
We introduce FLASK, a fine-grained evaluation protocol for both human-based and model-based evaluation.
We experimentally observe that the fine-graininess of evaluation is crucial for attaining a holistic view of model performance.
arXiv Detail & Related papers (2023-07-20T14:56:35Z) - Style Over Substance: Evaluation Biases for Large Language Models [17.13064447978519]
This study investigates the behavior of crowd-sourced and expert annotators, as well as large language models (LLMs)
Our findings reveal a concerning bias in the evaluation process, as answers with factual errors are rated more favorably than answers that are too short or contain grammatical errors.
We propose independently evaluating machine-generated text across multiple dimensions, rather than merging all the evaluation aspects into a single score.
arXiv Detail & Related papers (2023-07-06T14:42:01Z) - Do You Hear The People Sing? Key Point Analysis via Iterative Clustering and Abstractive Summarisation [12.548947151123555]
Argument summarisation is a promising but currently under-explored field.
One of the main challenges in Key Point Analysis is finding high-quality key point candidates.
Evaluating key points is crucial to ensuring that the automatically generated summaries are useful.
arXiv Detail & Related papers (2023-05-25T12:43:29Z) - Evaluate What You Can't Evaluate: Unassessable Quality for Generated Response [56.25966921370483]
There are challenges in using reference-free evaluators based on large language models.
Reference-free evaluators are more suitable for open-ended examples with semantically diverse responses.
There are risks in using reference-free evaluators based on LLMs to evaluate the quality of dialogue responses.
arXiv Detail & Related papers (2023-05-24T02:52:48Z) - KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z) - Vote'n'Rank: Revision of Benchmarking with Social Choice Theory [7.224599819499157]
This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of the social choice theory.
We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields.
arXiv Detail & Related papers (2022-10-11T20:19:11Z)