Can multiple-choice questions really be useful in detecting the abilities of LLMs?
- URL: http://arxiv.org/abs/2403.17752v3
- Date: Thu, 23 May 2024 13:32:25 GMT
- Title: Can multiple-choice questions really be useful in detecting the abilities of LLMs?
- Authors: Wangyue Li, Liangzhi Li, Tong Xiang, Xiao Liu, Wei Deng, Noa Garcia
- Abstract summary: Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs).
The misalignment between the task and the evaluation method demands a thoughtful analysis of the efficacy of MCQs.
We evaluate nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English.
- Score: 15.756543037102256
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs) due to their simplicity and efficiency. However, there are concerns about whether MCQs can truly measure LLMs' capabilities, particularly in knowledge-intensive scenarios where long-form generation (LFG) answers are required. The misalignment between the task and the evaluation method demands a thoughtful analysis of the efficacy of MCQs, which we undertake in this paper by evaluating nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English. We identify a significant issue: LLMs exhibit order sensitivity in bilingual MCQs, favoring answers at specific positions, in particular the first position. We further quantify the gap between MCQs and long-form generation questions (LFGQs) by comparing their direct outputs, token logits, and embeddings. Our results reveal a relatively low correlation between answers from MCQs and LFGQs for identical questions. Additionally, we propose two methods to quantify the consistency and confidence of LLMs' output, which can be generalized to other QA evaluation benchmarks. Notably, our analysis challenges the idea that the higher the consistency, the greater the accuracy. We also find MCQs to be less reliable than LFGQs in terms of expected calibration error. Finally, the misalignment between MCQs and LFGQs is reflected not only in evaluation performance but also in the embedding space. Our code and models can be accessed at https://github.com/Meetyou-AI-Lab/Can-MC-Evaluate-LLMs.
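The positional-bias and calibration analyses described in the abstract can be illustrated with a short sketch. The code below is not the authors' released implementation (see the repository linked above); `answer_fn`, the option layout, and the confidence inputs are illustrative assumptions. It permutes the options of a single MCQ to measure how often each answer position is chosen, and computes a standard binned expected calibration error (ECE) from per-question confidences.

```python
# Minimal sketch, not the authors' released code (see the repository above):
# probing positional bias by permuting MCQ options, and computing expected
# calibration error (ECE). `answer_fn` and the data layout are illustrative.
from itertools import permutations
import numpy as np

def position_preference(answer_fn, question, options):
    """Fraction of times each answer position is chosen across all orderings.

    `answer_fn(question, options)` is assumed to return the index of the
    option the model picks for that particular ordering.
    """
    counts = np.zeros(len(options))
    for perm in permutations(range(len(options))):
        shuffled = [options[i] for i in perm]
        counts[answer_fn(question, shuffled)] += 1
    return counts / counts.sum()  # ~1/n everywhere means no positional bias

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: bin-size-weighted |accuracy - mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - confidences[in_bin].mean())
    return ece
```

Under these assumptions, a model with no order sensitivity picks each position roughly uniformly across permutations, and a well-calibrated model yields an ECE close to zero.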
Related papers
- Compound-QA: A Benchmark for Evaluating LLMs on Compound Questions [10.783827859678892]
We introduce Compound Question Synthesis (CQ-Syn) to create the Compound-QA benchmark.
This benchmark is derived from existing QA datasets, annotated with proprietary large language models.
It evaluates the LLM capability in terms of three dimensions including understanding, reasoning, and knowledge.
arXiv Detail & Related papers (2024-11-15T13:12:29Z)
- CLR-Bench: Evaluating Large Language Models in College-level Reasoning [17.081788240112417]
Large language models (LLMs) have demonstrated their remarkable performance across various language understanding tasks.
We present CLR-Bench to comprehensively evaluate LLMs on complex college-level reasoning.
arXiv Detail & Related papers (2024-10-23T04:55:08Z)
- LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate because of its diverse potential answers and the lack of an objective criterion.
Large Language Models (LLMs) have been adopted for NFQA evaluation due to their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that uses LLMs to rank candidate answers against a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification (UQ) is a critical component of machine learning (ML) applications.
We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines.
We conduct a large-scale empirical investigation of UQ and normalization techniques across nine tasks, and identify the most promising approaches.
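As a rough illustration of the kind of sequence-level UQ baseline such a benchmark compares, the sketch below scores a generated answer from its per-token log-probabilities. This is not LM-Polygraph's API; the input format is an assumption.

```python
# Minimal sketch of two common sequence-level UQ baselines; this is not
# LM-Polygraph's API, and the input (a list of per-token log-probabilities
# for a generated answer) is an assumed format.
import math

def mean_negative_log_likelihood(token_logprobs):
    """Length-normalized NLL: larger values indicate higher uncertainty."""
    return -sum(token_logprobs) / len(token_logprobs)

def perplexity(token_logprobs):
    """Perplexity of the generated sequence, a monotone transform of the NLL."""
    return math.exp(mean_negative_log_likelihood(token_logprobs))
```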
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
- Let LLMs Take on the Latest Challenges! A Chinese Dynamic Question Answering Benchmark [69.3415799675046]
We introduce CDQA, a Chinese Dynamic QA benchmark containing question-answer pairs related to the latest news on the Chinese Internet.
We obtain high-quality data through a pipeline that combines humans and models.
We have also evaluated and analyzed mainstream and advanced Chinese LLMs on CDQA.
arXiv Detail & Related papers (2024-02-29T15:22:13Z)
- Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? [15.308093827770474]
We probe if large language models (LLMs) can perform multiple-choice question answering (MCQA) with choices-only prompts.
This choices-only prompt bests a majority baseline in 11 of 12 cases, with an accuracy gain of up to 0.33.
We conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference.
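For illustration, a choices-only prompt of the kind probed here can be built as in the sketch below; the exact prompt wording in the paper may differ, and the example options are hypothetical. Omitting the question means any above-chance accuracy must come from artifacts in the choices themselves.

```python
# Minimal sketch of a choices-only MCQA prompt (the wording used in the paper
# may differ). The question text is deliberately omitted, so above-chance
# accuracy must come from artifacts in the choices rather than the question.
def choices_only_prompt(options):
    lines = ["Choose the most likely correct answer from the options below.", ""]
    lines += [f"{letter}. {text}" for letter, text in zip("ABCD", options)]
    lines.append("Answer:")
    return "\n".join(lines)

# Hypothetical example options.
print(choices_only_prompt([
    "Paris", "A city in Italy", "The capital of Spain", "None of the above",
]))
```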
arXiv Detail & Related papers (2024-02-19T19:38:58Z)
- Cofca: A Step-Wise Counterfactual Multi-hop QA benchmark [39.64489055580211]
We introduce CofCA, a novel Step-wise Counterfactual evaluation benchmark consisting of factual and counterfactual data.
Our experimental results reveal a significant performance gap between Wikipedia-based factual data and counterfactual data, suggesting data contamination issues in existing benchmarks.
arXiv Detail & Related papers (2024-02-19T08:12:30Z)
- Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs [52.42505579545893]
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought explanations alongside answers.
We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT.
arXiv Detail & Related papers (2024-02-17T05:22:56Z)
- Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models [29.202758753639078]
This study investigates the limitations of Multiple Choice Question Answering (MCQA) as an evaluation method for Large Language Models (LLMs).
We propose a dataset augmenting method for Multiple-Choice Questions (MCQs), MCQA+, that can more accurately reflect the performance of the model.
arXiv Detail & Related papers (2024-02-02T12:07:00Z)
- InfiMM-Eval: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models [50.03163753638256]
Multi-modal Large Language Models (MLLMs) are increasingly prominent in the field of artificial intelligence.
Our benchmark comprises three key reasoning categories: deductive, abductive, and analogical reasoning.
We evaluate a selection of representative MLLMs on this rigorously developed, open-ended, multi-step reasoning benchmark.
arXiv Detail & Related papers (2023-11-20T07:06:31Z)
- Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models [68.37431984231338]
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision.
We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting.
arXiv Detail & Related papers (2022-12-15T18:45:29Z)