Related papers: Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena

URL: http://arxiv.org/abs/2406.07545v1
Date: Tue, 11 Jun 2024 17:59:47 GMT
Title: Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena
Authors: Aidar Myrzakhan, Sondos Mahmoud Bsharat, Zhiqiang Shen,
Abstract summary: Multiple-choice questions (MCQ) are frequently used to assess large language models (LLMs) LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to inherent biases of priori unbalanced probabilities. This work aims to tackle these significant difficulties, and establish a new LLM evaluation benchmark through entirely open-style questions.
Score: 23.264049073539663
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multiple-choice questions (MCQ) are frequently used to assess large language models (LLMs). Typically, an LLM is given a question and selects the answer deemed most probable after adjustments for factors like length. Unfortunately, LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to inherent biases of priori unbalanced probabilities, influencing the prediction of answers based on these IDs. Previous research has introduced methods to reduce this ''selection bias'' by simply permutating options on a few test samples and applying to new ones. Another problem of MCQ is the lottery ticket choice by ''random guessing''. The LLM does not learn particular knowledge, but the option is guessed correctly. This situation is especially serious for those small-scale LLMs. To address them, a more thorough approach involves shifting from MCQ to open-style questions, which can fundamentally eliminate selection bias and random guessing issues. However, transitioning causes its own set of challenges in (1) identifying suitable open-style questions and (2) validating the correctness of LLM open-style responses against human-annotated ground-truths. This work aims to tackle these significant difficulties, and establish a new LLM evaluation benchmark through entirely open-style questions. Consequently, we introduce the Open-LLM-Leaderboard to track various LLMs' performance and reflect true capability of them, such as GPT-4o/4/3.5, Claude 3, Gemini, etc. Our code and dataset are available at https://github.com/VILA-Lab/Open-LLM-Leaderboard.

Related papers

Self-ensemble: Mitigating Confidence Distortion for Large Language Models [89.03110940871765]
Large Language Models exhibit a confidence distortion problem on multi-choice question-answering.<n>We propose Self-ensemble to solve this problem.<n> Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem.
arXiv Detail & Related papers (2025-06-02T17:59:29Z)
B-score: Detecting biases in large language models using response history [2.944057642865492]
Large language models (LLMs) often exhibit strong biases.<n>We investigate whether LLMs would be able to output less biased answers when allowed to observe their prior answers to the same question.<n>We propose B-score, a novel metric that is effective in detecting biases to Subjective, Random, Easy, and Hard questions.
arXiv Detail & Related papers (2025-05-24T06:23:52Z)
Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA) In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
Option-ID Based Elimination For Multiple Choice Questions [12.30777266124562]
Multiple choice questions (MCQs) are a popular and important task for evaluating large language models (LLMs) Based on common strategies people use when answering MCQs, the process of elimination (PoE) has been proposed as an effective problem-solving method. This paper proposes a PoE based on option ID. Specifically, our method eliminates option by selecting the option ID with the lowest probability.
arXiv Detail & Related papers (2025-01-25T11:06:37Z)
Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong [2.8367942280334493]
We study how the confidence in the answer depends on whether the model has been asked to answer directly or to provide the reasoning before answering. Our hypothesis is that this behavior is due to the reasoning that modifies the probability of the selected answer.
arXiv Detail & Related papers (2025-01-16T10:27:51Z)
Modeling Future Conversation Turns to Teach LLMs to Ask Clarifying Questions [45.04582353648683]
We propose to assign preference labels by simulating expected outcomes in the future turns. This allows LLMs to learn to ask clarifying questions when it can generate responses tailored to each user interpretation in future turns. We evaluate systems based on their ability to ask clarifying questions that can recover each user's interpretation and expected answer.
arXiv Detail & Related papers (2024-10-17T17:29:04Z)
Utilize the Flow before Stepping into the Same River Twice: Certainty Represented Knowledge Flow for Refusal-Aware Instruction Tuning [68.57166425493283]
Refusal-Aware Instruction Tuning (RAIT) enables Large Language Models (LLMs) to refuse to answer unknown questions. RAIT modifies training samples based on the correctness of the initial LLM's response. This crude approach can cause LLMs to excessively refuse answering questions they could have correctly answered.
arXiv Detail & Related papers (2024-10-09T14:12:51Z)
Are LLMs Aware that Some Questions are not Open-ended? [58.93124686141781]
We study whether Large Language Models are aware that some questions have limited answers and need to respond more deterministically. The lack of question awareness in LLMs leads to two phenomena: (1) too casual to answer non-open-ended questions or (2) too boring to answer open-ended questions.
arXiv Detail & Related papers (2024-10-01T06:07:00Z)
Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering [67.94354589215637]
Large Language Models (LLMs) are widely used for knowledge-seeking yet suffer from hallucinations. In this paper, we perceive the LLMs' knowledge boundary (KB) with semi-open-ended questions (SoeQ) We find that GPT-4 performs poorly on SoeQ and is often unaware of its KB. Our auxiliary model, LLaMA-2-13B, is effective in discovering more ambiguous answers.
arXiv Detail & Related papers (2024-05-23T10:00:14Z)
UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice Questions [25.877058354902953]
This work explores a novel data augmentation method based on Large Language Models (LLMs) for predicting item difficulty and response time of retired USMLE Multiple-Choice Questions (MCQs) in the BEA 2024 Shared Task. Our approach is based on augmenting the dataset with answers from zero-shot LLMs and employing transformer-based models based on six alternative feature combinations.
arXiv Detail & Related papers (2024-04-20T10:41:02Z)
Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? [15.308093827770474]
We probe if large language models (LLMs) can perform multiple-choice question answering (MCQA) with choices-only prompts. This prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain. We conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference.
arXiv Detail & Related papers (2024-02-19T19:38:58Z)
Small Models, Big Insights: Leveraging Slim Proxy Models To Decide When and What to Retrieve for LLMs [60.40396361115776]
This paper introduces a novel collaborative approach, namely SlimPLM, that detects missing knowledge in large language models (LLMs) with a slim proxy model. We employ a proxy model which has far fewer parameters, and take its answers as answers. Heuristic answers are then utilized to predict the knowledge required to answer the user question, as well as the known and unknown knowledge within the LLM.
arXiv Detail & Related papers (2024-02-19T11:11:08Z)
Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named Rephrase and Respond' (RaR) which allows Large Language Models to rephrase and expand questions posed by humans. RaR serves as a simple yet effective prompting method for improving performance. We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z)
Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs) This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias" We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
Leveraging Large Language Models for Multiple Choice Question Answering [6.198523595657983]
We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach. We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach.
arXiv Detail & Related papers (2022-10-22T05:04:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.