Rethinking Generative Large Language Model Evaluation for Semantic
Comprehension
- URL: http://arxiv.org/abs/2403.07872v1
- Date: Tue, 12 Mar 2024 17:59:48 GMT
- Title: Rethinking Generative Large Language Model Evaluation for Semantic
Comprehension
- Authors: Fangyun Wei, Xi Chen, Lin Luo
- Abstract summary: This paper revisits the prevalent evaluation method, multiple-choice question answering (MCQA), which allows for straightforward accuracy measurement.
We introduce an RWQ-Elo rating system, engaging 24 large language models (LLMs) in a two-player competitive format, with GPT-4 serving as the judge.
This system is designed to mirror real-world usage, and for this purpose, we have compiled a new benchmark called ``Real-world questions'' (RWQ).
Our analysis reveals the stability of our RWQ-Elo system, the feasibility of registering new models, and its potential to reshape LLM leaderboards.
- Score: 27.21438605541497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their sophisticated capabilities, large language models (LLMs)
encounter a major hurdle in effective assessment. This paper first revisits the
prevalent evaluation method, multiple-choice question answering (MCQA), which
allows for straightforward accuracy measurement. Through a comprehensive
evaluation of 24 models across 11 benchmarks, we highlight several potential
drawbacks of MCQA, for instance, the inconsistency between the MCQA evaluation
and the generation of open-ended responses in practical scenarios. In response,
we introduce an RWQ-Elo rating system, engaging 24 LLMs such as GPT-4, GPT-3.5,
Google-Gemini-Pro and LLaMA-1/-2, in a two-player competitive format, with
GPT-4 serving as the judge. Each LLM receives an Elo rating thereafter. This
system is designed to mirror real-world usage, and for this purpose, we have
compiled a new benchmark called ``Real-world questions'' (RWQ), comprising
20,772 authentic user inquiries. Additionally, we thoroughly analyze the
characteristics of our system and compare it with prior leaderboards like
AlpacaEval and MT-Bench. Our analysis reveals the stability of our RWQ-Elo
system, the feasibility of registering new models, and its potential to reshape
LLM leaderboards.
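The abstract describes the two-player format and GPT-4 judging but not the exact rating update, so the following is only a minimal Python sketch of a standard Elo update applied to judge-decided LLM matches; the K-factor, initial rating, model names, and match outcomes are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict

K = 32              # assumed K-factor (update step size); not specified in the abstract
INIT_RATING = 1000  # assumed starting rating for every registered model

ratings = defaultdict(lambda: INIT_RATING)

def expected_score(r_a: float, r_b: float) -> float:
    """Elo-predicted probability that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(model_a: str, model_b: str, outcome: float) -> None:
    """Apply one judged match; outcome is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - e_a))

# Hypothetical judge verdicts on three RWQ-style matches (illustrative only).
for model_a, model_b, result in [
    ("gpt-4", "llama-2-70b", 1.0),
    ("gpt-3.5", "llama-2-70b", 0.5),
    ("gpt-4", "gpt-3.5", 1.0),
]:
    update(model_a, model_b, result)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

Registering a new model under such a scheme only requires playing it against existing entries and letting its rating converge, which is consistent with the feasibility claim made in the abstract.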
Related papers
- Judging the Judges: A Collection of LLM-Generated Relevance Judgements [37.103230004631996]
This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024.
We release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments produced by eight international teams.
arXiv Detail & Related papers (2025-02-19T17:40:32Z) - OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain.
Our benchmark is characterized by its multi-dimensional evaluation framework.
Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked or only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - Prometheus: Inducing Fine-grained Evaluation Capability in Language
Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
arXiv Detail & Related papers (2023-10-12T16:50:08Z) - Split and Merge: Aligning Position Biases in LLM-based Evaluators [22.265542509143756]
PORTIA is an alignment-based system designed to mimic human comparison strategies to calibrate position bias.
Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested.
It rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%.
arXiv Detail & Related papers (2023-09-29T14:38:58Z) - SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension [27.53415400454066]
We introduce a benchmark named SEED-Bench to assess generative models.
SEED-Bench consists of 19K multiple choice questions with accurate human annotations.
We evaluate the performance of 18 models across all 12 dimensions, covering both spatial and temporal understanding.
arXiv Detail & Related papers (2023-07-30T04:25:16Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive
Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z) - A Clarifying Question Selection System from NTES_ALONG in Convai3
Challenge [8.656503175492375]
This paper presents the participation of the NetEase Game AI Lab team in the ClariQ challenge at the Search-oriented Conversational AI (SCAI) EMNLP workshop in 2020.
The challenge asks for a complete conversational information retrieval system that can understand and generate clarification questions.
We propose a clarifying question selection system which consists of response understanding, candidate question recalling and clarifying question ranking.
arXiv Detail & Related papers (2020-10-27T11:22:53Z)