Rethinking Generative Large Language Model Evaluation for Semantic
Comprehension
- URL: http://arxiv.org/abs/2403.07872v1
- Date: Tue, 12 Mar 2024 17:59:48 GMT
- Title: Rethinking Generative Large Language Model Evaluation for Semantic
Comprehension
- Authors: Fangyun Wei, Xi Chen, Lin Luo
- Abstract summary: This paper revisits the prevalent evaluation method, multiple-choice question answering (MCQA), which allows for straightforward accuracy measurement.
We introduce an RWQ-Elo rating system, engaging 24 large language models (LLMs) in a two-player competitive format, with GPT-4 serving as the judge.
This system is designed to mirror real-world usage, and for this purpose, we have compiled a new benchmark called ``Real-world questions'' (RWQ).
Our analysis reveals the stability of our RWQ-Elo system, the feasibility of registering new models, and its potential to reshape LLM leaderboards.
- Score: 27.21438605541497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite their sophisticated capabilities, large language models (LLMs)
encounter a major hurdle in effective assessment. This paper first revisits the
prevalent evaluation method, multiple-choice question answering (MCQA), which
allows for straightforward accuracy measurement. Through a comprehensive
evaluation of 24 models across 11 benchmarks, we highlight several potential
drawbacks of MCQA, for instance, the inconsistency between the MCQA evaluation
and the generation of open-ended responses in practical scenarios. In response,
we introduce an RWQ-Elo rating system, engaging 24 LLMs such as GPT-4, GPT-3.5,
Google-Gemini-Pro and LLaMA-1/-2, in a two-player competitive format, with
GPT-4 serving as the judge. Each LLM receives an Elo rating thereafter. This
system is designed to mirror real-world usage, and for this purpose, we have
compiled a new benchmark called ``Real-world questions'' (RWQ), comprising
20,772 authentic user inquiries. Additionally, we thoroughly analyze the
characteristics of our system and compare it with prior leaderboards like
AlpacaEval and MT-Bench. Our analysis reveals the stability of our RWQ-Elo
system, the feasibility of registering new models, and its potential to reshape
LLM leaderboards.
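The abstract describes the two-player format and GPT-4 judging but not the exact rating update, so the following is only a minimal Python sketch of a standard Elo update applied to judge-decided LLM matches; the K-factor, initial rating, model names, and match outcomes are illustrative assumptions, not details taken from the paper.

```python
from collections import defaultdict

K = 32              # assumed K-factor (update step size); not specified in the abstract
INIT_RATING = 1000  # assumed starting rating for every registered model

ratings = defaultdict(lambda: INIT_RATING)

def expected_score(r_a: float, r_b: float) -> float:
    """Elo-predicted probability that model A beats model B."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(model_a: str, model_b: str, outcome: float) -> None:
    """Apply one judged match; outcome is 1.0 if A wins, 0.0 if B wins, 0.5 for a tie."""
    e_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += K * (outcome - e_a)
    ratings[model_b] += K * ((1.0 - outcome) - (1.0 - e_a))

# Hypothetical judge verdicts on three RWQ-style matches (illustrative only).
for model_a, model_b, result in [
    ("gpt-4", "llama-2-70b", 1.0),
    ("gpt-3.5", "llama-2-70b", 0.5),
    ("gpt-4", "gpt-3.5", 1.0),
]:
    update(model_a, model_b, result)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```

Registering a new model under such a scheme only requires playing it against existing entries and letting its rating converge, which is consistent with the feasibility claim made in the abstract.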
Related papers
- Judging the Judges: A Collection of LLM-Generated Relevance Judgements [37.103230004631996]
This paper benchmarks and reports on the results of a large-scale automatic relevance judgment evaluation, the LLMJudge challenge at SIGIR 2024.
We release and benchmark 42 LLM-generated labels of the TREC 2023 Deep Learning track relevance judgments produced by eight international teams.
arXiv Detail & Related papers (2025-02-19T17:40:32Z) - OmniEval: An Omnidirectional and Automatic RAG Evaluation Benchmark in Financial Domain [62.89809156574998]
We introduce an omnidirectional and automatic RAG benchmark, OmniEval, in the financial domain.
Our benchmark is characterized by its multi-dimensional evaluation framework.
Our experiments demonstrate the comprehensiveness of OmniEval, which includes extensive test datasets.
arXiv Detail & Related papers (2024-12-17T15:38:42Z) - CompassJudger-1: All-in-one Judge Model Helps Model Evaluation and Evolution [74.41064280094064]
CompassJudger-1 is the first open-source all-in-one judge LLM.
CompassJudger-1 is a general-purpose LLM that demonstrates remarkable versatility.
JudgerBench is a new benchmark that encompasses various subjective evaluation tasks.
arXiv Detail & Related papers (2024-10-21T17:56:51Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors [64.9938658716425]
Existing evaluations of large language models' (LLMs) ability to recognize and reject unsafe user requests face three limitations.
First, existing methods often use coarse-grained taxonomies of unsafe topics and over-represent some fine-grained topics.
Second, linguistic characteristics and formatting of prompts, such as different languages and dialects, are often overlooked or only implicitly considered in many evaluations.
Third, existing evaluations rely on large LLMs for evaluation, which can be expensive.
arXiv Detail & Related papers (2024-06-20T17:56:07Z) - Prometheus: Inducing Fine-grained Evaluation Capability in Language
Models [66.12432440863816]
We propose Prometheus, a fully open-source Large Language Model (LLM) that is on par with GPT-4's evaluation capabilities.
Prometheus scores a Pearson correlation of 0.897 with human evaluators when evaluating with 45 customized score rubrics.
Prometheus achieves the highest accuracy on two human preference benchmarks.
arXiv Detail & Related papers (2023-10-12T16:50:08Z) - Split and Merge: Aligning Position Biases in LLM-based Evaluators [22.265542509143756]
PORTIA is an alignment-based system designed to mimic human comparison strategies to calibrate position bias.
Our results show that PORTIA markedly enhances the consistency rates for all the models and comparison forms tested.
It rectifies around 80% of the position bias instances within the GPT-4 model, elevating its consistency rate up to 98%.
arXiv Detail & Related papers (2023-09-29T14:38:58Z) - SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension [27.53415400454066]
We introduce a benchmark named SEED-Bench to assess generative models.
SEED-Bench consists of 19K multiple choice questions with accurate human annotations.
We evaluate the performance of 18 models across all 12 dimensions, covering both spatial and temporal understanding.
arXiv Detail & Related papers (2023-07-30T04:25:16Z) - Large Language Models are Not Yet Human-Level Evaluators for Abstractive
Summarization [66.08074487429477]
We investigate the stability and reliability of large language models (LLMs) as automatic evaluators for abstractive summarization.
We find that while ChatGPT and GPT-4 outperform the commonly used automatic metrics, they are not ready as human replacements.
arXiv Detail & Related papers (2023-05-22T14:58:13Z) - A Clarifying Question Selection System from NTES_ALONG in Convai3
Challenge [8.656503175492375]
This paper presents the participation of the NetEase Game AI Lab team in the ClariQ challenge at the Search-oriented Conversational AI (SCAI) EMNLP workshop in 2020.
The challenge asks for a complete conversational information retrieval system that can understand and generate clarification questions.
We propose a clarifying question selection system which consists of response understanding, candidate question recalling and clarifying question ranking.
arXiv Detail & Related papers (2020-10-27T11:22:53Z)