Metric assessment protocol in the context of answer fluctuation on MCQ tasks
- URL: http://arxiv.org/abs/2507.15581v1
- Date: Mon, 21 Jul 2025 13:01:46 GMT
- Title: Metric assessment protocol in the context of answer fluctuation on MCQ tasks
- Authors: Ekaterina Goliakova, Xavier Renard, Marie-Jeanne Lesot, Thibault Laugel, Christophe Marsala, Marcin Detyniecki
- Abstract summary: Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently. Previous research has not conducted a thorough assessment of the metrics used for this task. We suggest a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates.
- Score: 4.453107218424601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Using multiple-choice questions (MCQs) has become a standard for assessing LLM capabilities efficiently. A variety of metrics can be employed for this task. However, previous research has not conducted a thorough assessment of them. At the same time, MCQ evaluation suffers from answer fluctuation: models produce different results given slight changes in prompts. We suggest a metric assessment protocol in which evaluation methodologies are analyzed through their connection with fluctuation rates, as well as original performance. Our results show that there is a strong link between existing metrics and answer fluctuation, even when the metrics are computed without any additional prompt variants. A novel metric, worst accuracy, demonstrates the highest association under the protocol.
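The abstract does not spell out how fluctuation rates or worst accuracy are computed. The sketch below is a minimal Python illustration under one plausible reading (fluctuation rate as the share of questions whose predicted option changes across prompt variants; worst accuracy as the share of questions answered correctly under every variant); the definitions and function names are assumptions, not taken from the paper.

```python
# Minimal sketch (not the authors' code): one plausible reading of
# "fluctuation rate" and "worst accuracy" over several prompt variants
# of the same MCQ set. Definitions are assumptions, not from the paper.

def fluctuation_rate(answers_per_variant):
    """Share of questions whose predicted option changes across prompt variants.

    answers_per_variant: list of lists, one list of predicted options per
    variant, all aligned to the same question order.
    """
    n_questions = len(answers_per_variant[0])
    fluctuating = sum(
        len({variant[q] for variant in answers_per_variant}) > 1
        for q in range(n_questions)
    )
    return fluctuating / n_questions


def worst_accuracy(answers_per_variant, gold):
    """Share of questions answered correctly under *every* prompt variant."""
    n_questions = len(gold)
    always_correct = sum(
        all(variant[q] == gold[q] for variant in answers_per_variant)
        for q in range(n_questions)
    )
    return always_correct / n_questions


if __name__ == "__main__":
    gold = ["A", "C", "B", "D"]
    answers = [
        ["A", "C", "B", "A"],  # prompt variant 1
        ["A", "B", "B", "A"],  # prompt variant 2
    ]
    print(fluctuation_rate(answers))      # 0.25: only question 2 fluctuates
    print(worst_accuracy(answers, gold))  # 0.5: questions 1 and 3 always correct
```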
Related papers
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
Multiple-Choice Question Answering (MCQA) is widely used to evaluate Large Language Models (LLMs). We show that multiple factors can significantly impact the reported performance of LLMs. We analyze whether existing answer extraction methods are aligned with human judgment.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate because of its diverse potential answers and the lack of an objective criterion.
Large Language Models (LLMs) have been adopted for NFQA evaluation due to their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that utilizes LLMs to rank candidate answers within a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
- Evaluation of RAG Metrics for Question Answering in the Telecom Domain [0.650923326742559]
Retrieval Augmented Generation (RAG) is widely used to enable Large Language Models (LLMs) to perform Question Answering (QA) tasks.
This work modifies the RAGAS package for several metrics (faithfulness, context relevance, answer relevance, answer correctness, answer similarity and factual correctness) so that the intermediate outputs of the prompts are exposed.
Next, we analyse the expert evaluations of the output of the modified RAGAS package and observe the challenges of using it in the telecom domain.
arXiv Detail & Related papers (2024-07-15T17:40:15Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- DISTO: Evaluating Textual Distractors for Multi-Choice Questions using Negative Sampling based Approach [5.033269502052902]
Multiple-choice questions (MCQs) are an efficient and common way to assess reading comprehension (RC).
Distractor generation (DG) models have been proposed, and their performance is typically evaluated using machine translation (MT) metrics.
We propose DISTO: the first learned evaluation metric for generated distractors.
arXiv Detail & Related papers (2023-04-10T22:03:00Z)
- Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics [74.28810048824519]
Question answering-based summarization evaluation metrics must automatically determine whether the QA model's prediction is correct or not.
We benchmark the lexical answer verification methods which have been used by current QA-based metrics as well as two more sophisticated text comparison methods.
arXiv Detail & Related papers (2022-04-21T15:43:45Z)
- QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization [116.56171113972944]
We show that carefully choosing the components of a QA-based metric is critical to performance.
Our solution improves upon the best-performing entailment-based metric and achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-16T00:38:35Z)
- A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z)
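As a companion to the resampling entry above, the following is a minimal percentile-bootstrap sketch of how a confidence interval for a metric's mean score can be obtained. It is an assumption-based illustration of the general resampling idea, not the authors' implementation, and the per-example scores are made up.

```python
# Minimal percentile-bootstrap sketch for a confidence interval on a metric's
# mean score. Assumption-based illustration of resampling, not the authors' code.
import random


def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-example scores."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(scores) for _ in scores]  # sample with replacement
        means.append(sum(resample) / len(resample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    per_example_scores = [0.42, 0.55, 0.61, 0.38, 0.70, 0.49, 0.58, 0.33]  # hypothetical
    print(bootstrap_ci(per_example_scores))  # interval is wide with so few examples
```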