Option-ID Based Elimination For Multiple Choice Questions
- URL: http://arxiv.org/abs/2501.15175v2
- Date: Sat, 15 Feb 2025 17:04:57 GMT
- Title: Option-ID Based Elimination For Multiple Choice Questions
- Authors: Zhenhao Zhu, Bulou Liu, Qingyao Ai, Yiqun Liu,
- Abstract summary: Multiple choice questions (MCQs) are a popular and important task for evaluating large language models (LLMs)<n>Based on common strategies people use when answering MCQs, the process of elimination (PoE) has been proposed as an effective problem-solving method.<n>This paper proposes a PoE based on option ID. Specifically, our method eliminates option by selecting the option ID with the lowest probability.
- Score: 12.30777266124562
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiple choice questions (MCQs) are a popular and important task for evaluating large language models (LLMs). Based on common strategies people use when answering MCQs, the process of elimination (PoE) has been proposed as an effective problem-solving method. Existing methods to the PoE generally fall into two categories: one involves having the LLM directly select the incorrect options, while the other involves scoring the options. However, both methods incur high computational costs and often perform worse than methods that directly answer the MCQs with the option IDs. To address this issue, this paper proposes a PoE based on option ID. Specifically, our method eliminates option by selecting the option ID with the lowest probability. We conduct experiments with 10 different LLMs in zero-shot settings on 7 publicly available datasets. The experimental results demonstrate that our method significantly improves the LLM's performance. Further analysis reveals that the sequential elimination strategy can effectively enhance the LLM's reasoning ability. Additionally, we find that sequential elimination is also applicable to few-shot settings and can be combined with debias methods to further improve LLM's performance.
Related papers
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
One of the most widely used tasks to evaluate Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA)
In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons.
arXiv Detail & Related papers (2025-03-19T08:45:03Z) - LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering [1.0874597293913013]
Multiple Choice Question Answering (MCQA) is an important problem with numerous real-world applications, such as medicine, law, and education.
We propose a simple yet effective approach that uses Large Language Models for data generation and scoring.
Our method improves accuracy from 28.9% to 39.3%, representing a gain of over 10% compared to a baseline finetuned directly on 5-shot examples.
arXiv Detail & Related papers (2024-12-13T02:48:36Z) - MM-PoE: Multiple Choice Reasoning via. Process of Elimination using Multi-Modal Models [0.0]
This paper introduces Multiple Choice Reasoning via. Process of Elimination using Multi-Modal models.<n>The novel methodology is engineered to augment the efficacy of Vision-Language Models (VLMs) in multiple-choice visual reasoning tasks.<n>Our empirical evaluations, conducted across three benchmark datasets, reveal that MM-PoE significantly improves both zero-shot and few-shot performance.
arXiv Detail & Related papers (2024-12-10T03:13:41Z) - Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena [23.264049073539663]
Multiple-choice questions (MCQ) are frequently used to assess large language models (LLMs)
LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to inherent biases of priori unbalanced probabilities.
This work aims to tackle these significant difficulties, and establish a new LLM evaluation benchmark through entirely open-style questions.
arXiv Detail & Related papers (2024-06-11T17:59:47Z) - Aligning Language Models with Demonstrated Feedback [58.834937450242975]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors.
We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z) - Optimising Calls to Large Language Models with Uncertainty-Based Two-Tier Selection [80.63946798650653]
Decision centers on whether to use a large LLM with better performance or a smaller one with reduced costs.
We propose a simpler solution; we use only the uncertainty of the generations of the small LLM as the decision criterion.
Our experiments reveal this simple solution optimally balances cost and performance, outperforming existing methods on 25 out of 27 experimental setups.
arXiv Detail & Related papers (2024-05-03T14:38:59Z) - Transfer Learning Enhanced Single-choice Decision for Multi-choice Question Answering [27.601353412882258]
Multi-choice Machine Reading (MMRC) aims to select the correct answer from a set of options based on a given passage and question.
In this paper, we reconstruct multi-choice to single-choice by training a binary classification to distinguish whether a certain answer is correct.
Our proposed method gets rid of the multi-choice framework and can leverage resources of other tasks.
arXiv Detail & Related papers (2024-04-27T16:02:55Z) - Are More LLM Calls All You Need? Towards Scaling Laws of Compound Inference Systems [76.69936664916061]
We study how the number of LM calls affects the performance of Vote and Filter-Vote.
We find, surprisingly, that across multiple language tasks, the performance of both Vote and Filter-Vote can first increase but then decrease as a function of the number of LM calls.
arXiv Detail & Related papers (2024-03-04T19:12:48Z) - DsDm: Model-Aware Dataset Selection with Datamodels [81.01744199870043]
Standard practice is to filter for examples that match human notions of data quality.
We find that selecting according to similarity with "high quality" data sources may not increase (and can even hurt) performance compared to randomly selecting data.
Our framework avoids handpicked notions of data quality, and instead models explicitly how the learning process uses train datapoints to predict on the target tasks.
arXiv Detail & Related papers (2024-01-23T17:22:00Z) - DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide and Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs)
We first categorize questions into two subsets based on confidence score ($mathcalCS$), which is estimated by statistical frequency of generated answers.
In particular, we first categorize questions into two subsets based on confidence score ($mathcalCS$), which is estimated by statistical frequency of generated answers.
arXiv Detail & Related papers (2024-01-10T14:38:46Z) - Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named Rephrase and Respond' (RaR) which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z) - POE: Process of Elimination for Multiple Choice Reasoning [19.65826015840337]
We argue a similar two-step strategy can make LMs better at multiple choice reasoning tasks.
In the first step, POE scores each option, and eliminates seemingly wrong options.
In the second step, POE masks these wrong options, and makes the final prediction from the remaining options.
arXiv Detail & Related papers (2023-10-24T07:38:43Z) - Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs)
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias"
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z) - Large Language Models Sensitivity to The Order of Options in
Multiple-Choice Questions [5.187383020960245]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks.
Previous works have shown these models are sensitive towards prompt wording, and few-shot demonstrations and their order.
This paper investigates sensitivity of LLMs towards the order of options in multiple-choice questions.
arXiv Detail & Related papers (2023-08-22T14:54:59Z) - RetICL: Sequential Retrieval of In-Context Examples with Reinforcement Learning [53.52699766206808]
We propose Retrieval for In-Context Learning (RetICL), a learnable method for modeling and optimally selecting examples sequentially for in-context learning.
We evaluate RetICL on math word problem solving and scientific question answering tasks and show that it consistently outperforms or matches and learnable baselines.
arXiv Detail & Related papers (2023-05-23T20:15:56Z) - A model-free feature selection technique of feature screening and random
forest based recursive feature elimination [0.0]
We propose a model-free feature selection method for ultra-high dimensional data with mass features.
We show that the proposed method is selection consistent and $L$ consistent under weak regularity conditions.
arXiv Detail & Related papers (2023-02-15T03:39:16Z) - Meta-Learning Approaches for a One-Shot Collective-Decision Aggregation:
Correctly Choosing how to Choose Correctly [0.7874708385247353]
We present two one-shot machine-learning-based aggregation approaches.
The first predicts, given multiple features about the collective's choices, which aggregation method will be best for a given case.
The second directly predicts which decision is optimal, given, among other things, the selection made by each method.
arXiv Detail & Related papers (2022-04-03T15:06:59Z) - A Mutual Information Maximization Approach for the Spurious Solution
Problem in Weakly Supervised Question Answering [60.768146126094955]
Weakly supervised question answering usually has only the final answers as supervision signals.
There may exist many spurious solutions that coincidentally derive the correct answer, but training on such solutions can hurt model performance.
We propose to explicitly exploit such semantic correlations by maximizing the mutual information between question-answer pairs and predicted solutions.
arXiv Detail & Related papers (2021-06-14T05:47:41Z) - Lookahead and Hybrid Sample Allocation Procedures for Multiple Attribute
Selection Decisions [0.9137554315375922]
This paper considers settings in which each measurement yields one sample of one attribute for one alternative.
When given a fixed number of samples to collect, the decision-maker must determine which samples to obtain, make the measurements, update prior beliefs about the attribute magnitudes, and then select an alternative.
arXiv Detail & Related papers (2020-07-31T15:04:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.