Leveraging Large Language Models for Multiple Choice Question Answering
- URL: http://arxiv.org/abs/2210.12353v1
- Date: Sat, 22 Oct 2022 05:04:54 GMT
- Title: Leveraging Large Language Models for Multiple Choice Question Answering
- Authors: Joshua Robinson, Christopher Michael Rytting, David Wingate
- Abstract summary: We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach.
- Score: 6.198523595657983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models (LLMs) like GPT-3 have achieved impressive
results on multiple choice question answering (MCQA) tasks in the zero, one,
and few-shot settings, they generally lag behind the MCQA state of the art
(SOTA). MCQA tasks have traditionally been presented to LLMs like cloze tasks.
An LLM is conditioned on a question (without the associated answer options) and
its chosen option is the one assigned the highest probability after
normalization (for length, etc.). A more natural prompting approach is to
present the question and answer options to the LLM jointly and have it output
the symbol (e.g., "A") associated with its chosen answer option. This approach
allows the model to explicitly compare answer options, reduces computational
costs, and mitigates the effects of tokenization scheme and answer option
representations on answer selection. For the natural approach to be effective
the LLM it is used with must be able to associate answer options with the
symbols that represent them. The LLM needs what we term multiple choice symbol
binding (MCSB) ability. This ability varies greatly by model. We show that a
model with high MCSB ability performs much better with the natural approach
than with the traditional approach across 20 diverse datasets and largely
closes the gap with the SOTA, suggesting that the MCQA ability of LLMs has been
previously underestimated.
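To make the contrast between the two prompting strategies concrete, here is a minimal sketch (not the authors' code) that answers the same question both ways, using a small open model from Hugging Face as a stand-in for the GPT-3-class LLMs studied in the paper; the question, options, prompt wording, and model choice are all illustrative assumptions.

```python
# A minimal sketch, assuming a Hugging Face causal LM as a stand-in for GPT-3.
# Not the authors' implementation; prompts and data are illustrative only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

question = "What is the capital of France?"
options = ["Berlin", "Paris", "Madrid", "Rome"]


def cloze_score(question: str, option: str) -> float:
    """Traditional 'cloze' scoring: length-normalized log-probability of the
    option text conditioned on the question alone (one pass per option)."""
    prompt_ids = tok(f"Question: {question}\nAnswer:", return_tensors="pt").input_ids
    option_ids = tok(" " + option, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Positions prompt_len-1 ... end-1 predict the option's tokens.
    logprobs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    token_lp = logprobs.gather(1, option_ids[0].unsqueeze(1)).squeeze(1)
    return token_lp.mean().item()


def mcp_choice(question: str, options: list[str]) -> str:
    """'Natural' multiple choice prompting: show all options with symbols and
    compare the model's scores for emitting each symbol (one pass total)."""
    symbols = ["A", "B", "C", "D"][: len(options)]
    listing = "\n".join(f"{s}. {o}" for s, o in zip(symbols, options))
    prompt = f"Question: {question}\n{listing}\nAnswer:"
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        last_logits = model(ids).logits[0, -1]
    scores = [last_logits[tok.encode(" " + s)[0]].item() for s in symbols]
    return options[scores.index(max(scores))]


print("cloze choice:", max(options, key=lambda o: cloze_score(question, o)))
print("MCP choice  :", mcp_choice(question, options))
```

The cloze scorer needs one forward pass per option and depends on how each option tokenizes and is normalized, while the multiple choice prompt needs a single pass and only compares the symbol tokens, which is why symbol binding (MCSB) ability becomes the deciding factor.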
Related papers
- Differentiating Choices via Commonality for Multiple-Choice Question Answering [54.04315943420376]
In multiple-choice question answering, the other answer choices can provide valuable clues for choosing the right one.
Existing models often rank each choice separately, overlooking the context provided by other choices.
We propose a novel model by differentiating choices through identifying and eliminating their commonality, called DCQA.
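As a speculative sketch (an interpretation, not the DCQA architecture): one way to "eliminate commonality" is to subtract the mean choice embedding, i.e., what all options share, before comparing each choice to the question. The encoder and example below are arbitrary stand-ins.

```python
# A speculative sketch, not the DCQA model: remove what the choices have in
# common (their mean embedding) before comparing each choice to the question.
# sentence-transformers is used only as a convenient off-the-shelf encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

enc = SentenceTransformer("all-MiniLM-L6-v2")

question = "Which of these animals is a mammal?"
choices = ["dolphin", "shark", "salmon", "tuna"]

q = enc.encode([question])[0]            # question embedding
C = enc.encode(choices)                  # [num_choices, dim] choice embeddings
C_distinct = C - C.mean(axis=0)          # subtract the choices' shared component

scores = C_distinct @ q                  # score each choice's distinctive part
print(choices[int(np.argmax(scores))])
```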
arXiv Detail & Related papers (2024-08-21T12:05:21Z)
- Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions [103.20281438405111]
Multiple-choice question answering (MCQA) is a key competence of performant transformer language models.
We employ vocabulary projection and activation patching methods to localize key hidden states that encode relevant information.
We show that prediction of a specific answer symbol is causally attributed to a single middle layer, and specifically its multi-head self-attention mechanism.
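A hedged sketch of a vocabulary-projection ("logit lens") style analysis like the one described above, with GPT-2 used only as a convenient stand-in model: each layer's hidden state at the final prompt position is projected through the unembedding matrix to see at which depth an answer symbol becomes the top prediction.

```python
# A hedged sketch of vocabulary projection ("logit lens"); GPT-2 is only a
# stand-in model and the prompt is a toy example, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Question: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nAnswer:"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_hidden_states=True)

ln_f, unembed = model.transformer.ln_f, model.lm_head
for layer, h in enumerate(out.hidden_states):
    # Project this layer's hidden state at the last position into vocab space.
    layer_logits = unembed(ln_f(h[0, -1]))
    top_token = tok.decode([int(layer_logits.argmax())])
    print(f"layer {layer:2d}: top predicted token = {top_token!r}")
```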
arXiv Detail & Related papers (2024-07-21T00:10:23Z)
- Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? [16.384333600053342]
Recent work shows that large language models (LLMs) can answer multiple-choice questions using only the choices.
We use a contrast set that probes if LLMs over-rely on choices-only shortcuts in MCQA.
After validating our contrast set, we test 12 LLMs, finding that these models do not exhibit reliance on choices-only shortcuts when given both the question and choices.
arXiv Detail & Related papers (2024-07-02T07:06:53Z)
- Multi-LLM QA with Embodied Exploration [55.581423861790945]
We investigate the use of Multi-Embodied LLM Explorers (MELE) for question-answering in an unknown environment.
Multiple LLM-based agents independently explore and then answer queries about a household environment.
We analyze different aggregation methods to generate a single, final answer for each query.
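The aggregation step can be illustrated with a small sketch (assumed, not taken from the paper): each explorer agent returns an answer, and a single final answer is produced by majority vote or by confidence-weighted voting.

```python
# A minimal sketch (assumed, not the paper's code) of aggregating several
# agents' independent answers into one final answer per query.
from collections import Counter


def majority_vote(answers: list[str]) -> str:
    """Return the answer given by the most agents (ties broken arbitrarily)."""
    return Counter(answers).most_common(1)[0][0]


def confidence_weighted_vote(answers: list[str], confidences: list[float]) -> str:
    """Weight each agent's vote by its self-reported confidence."""
    totals: dict[str, float] = {}
    for answer, confidence in zip(answers, confidences):
        totals[answer] = totals.get(answer, 0.0) + confidence
    return max(totals, key=totals.get)


# Three hypothetical explorer agents answering "Where are the house keys?"
print(majority_vote(["kitchen", "kitchen", "hallway"]))                              # kitchen
print(confidence_weighted_vote(["kitchen", "hallway", "hallway"], [0.9, 0.4, 0.3]))  # kitchen
```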
arXiv Detail & Related papers (2024-06-16T12:46:40Z)
- Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena [23.264049073539663]
Multiple-choice questions (MCQs) are frequently used to assess large language models (LLMs).
LLMs may favor certain answer choice IDs, such as A/B/C/D, due to inherent biases arising from unbalanced prior probabilities.
This work aims to tackle these significant difficulties, and establish a new LLM evaluation benchmark through entirely open-style questions.
arXiv Detail & Related papers (2024-06-11T17:59:47Z)
- Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? [15.308093827770474]
We probe if large language models (LLMs) can perform multiple-choice question answering (MCQA) with choices-only prompts.
This prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain.
We conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference.
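For illustration, a hedged sketch of what a choices-only probe and the majority baseline it is compared against might look like; the prompt format and toy data are invented for the example.

```python
# A hedged sketch: build a "choices-only" prompt (no question shown) and the
# majority baseline it is compared against. Prompt format and data are invented.
from collections import Counter


def choices_only_prompt(options: list[str]) -> str:
    """Format only the answer options; the question text is deliberately withheld."""
    lines = "\n".join(f"{s}. {o}" for s, o in zip("ABCD", options))
    return f"{lines}\nAnswer:"


def majority_baseline_accuracy(gold_labels: list[str]) -> float:
    """Accuracy of always predicting the dataset's most frequent gold label."""
    most_common_count = Counter(gold_labels).most_common(1)[0][1]
    return most_common_count / len(gold_labels)


print(choices_only_prompt(["a mammal", "a reptile", "a bird", "a fish"]))
print(majority_baseline_accuracy(["A", "B", "B", "C", "B"]))  # 0.6
```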
arXiv Detail & Related papers (2024-02-19T19:38:58Z)
- Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models [29.202758753639078]
This study investigates the limitations of Multiple Choice Question Answering (MCQA) as an evaluation method for Large Language Models (LLMs).
We propose a dataset augmentation method for Multiple-Choice Questions (MCQs), MCQA+, that can more accurately reflect model performance.
arXiv Detail & Related papers (2024-02-02T12:07:00Z)
- Enhancing Answer Selection in Community Question Answering with Pre-trained and Large Language Models [0.9065034043031668]
We first propose the Question-Answer cross attention networks (QAN) with pre-trained models for answer selection.
We then utilize a large language model (LLM) to perform answer selection with knowledge augmentation.
Experiments show that the QAN model achieves state-of-the-art performance on two datasets, SemEval2015 and SemEval2017.
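One plausible reading of question-answer cross attention for answer selection is sketched below; the encoder choice, pooling, head count, and scoring head are assumptions for illustration, not the authors' specification.

```python
# A rough sketch of question-answer cross attention for answer selection.
# The encoder choice, pooling, and scoring head are assumptions for illustration.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer


class QACrossAttentionScorer(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased", dim: int = 768):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.scorer = nn.Linear(2 * dim, 1)

    def forward(self, q_inputs, a_inputs):
        q = self.encoder(**q_inputs).last_hidden_state   # [B, Lq, dim]
        a = self.encoder(**a_inputs).last_hidden_state   # [B, La, dim]
        q_over_a, _ = self.cross_attn(q, a, a)            # question attends to answer
        a_over_q, _ = self.cross_attn(a, q, q)            # answer attends to question
        pooled = torch.cat([q_over_a.mean(dim=1), a_over_q.mean(dim=1)], dim=-1)
        return self.scorer(pooled).squeeze(-1)            # relevance score per QA pair


tok = AutoTokenizer.from_pretrained("bert-base-uncased")
scorer = QACrossAttentionScorer()
q = tok("How do I reset my password?", return_tensors="pt")
a = tok("Click 'Forgot password' on the login page.", return_tensors="pt")
print(scorer(q, a))  # unnormalized score (weights untrained in this sketch)
```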
arXiv Detail & Related papers (2023-11-29T10:24:50Z)
- Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
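The gist of such debiasing can be sketched as follows (a rough illustration with made-up numbers, not the paper's exact PriDe procedure): estimate the model's prior preference for each option ID from its predictions under permutations of the option contents, then divide that prior out of the observed distribution.

```python
# A rough, made-up illustration of prior-debiasing for option IDs; this is the
# gist, not the paper's exact PriDe procedure, and all numbers are invented.
import numpy as np


def debias(observed: np.ndarray, permuted: np.ndarray) -> np.ndarray:
    """observed: P(option ID) for the original ordering, shape [K].
    permuted: P(option ID) under several permutations of option contents, [P, K]."""
    prior = permuted.mean(axis=0)            # content effects average out; ID bias remains
    debiased = observed / (prior + 1e-9)     # divide the ID prior out
    return debiased / debiased.sum()         # renormalize to a distribution


observed = np.array([0.42, 0.30, 0.18, 0.10])   # raw preference over A/B/C/D (toy)
permuted = np.array([                           # same question, option contents rotated
    [0.40, 0.25, 0.20, 0.15],
    [0.38, 0.27, 0.21, 0.14],
    [0.41, 0.26, 0.19, 0.14],
])
print(debias(observed, permuted))               # scores with the ID bias reduced
```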
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
- LaGR-SEQ: Language-Guided Reinforcement Learning with Sample-Efficient Querying [71.86163159193327]
Large language models (LLMs) have recently demonstrated their impressive ability to provide context-aware responses via text.
This ability could potentially be used to predict plausible solutions in sequential decision making tasks pertaining to pattern completion.
We introduce LaGR, which uses this predictive ability of LLMs to propose solutions to tasks that have been partially completed by a primary reinforcement learning (RL) agent.
arXiv Detail & Related papers (2023-08-21T02:07:35Z)
- Question Answering as Programming for Solving Time-Sensitive Questions [84.07553016489769]
Question answering plays a pivotal role in human daily life because it involves our acquisition of knowledge about the world.
Recently, Large Language Models (LLMs) have shown remarkable intelligence in question answering, yet they still struggle with time-sensitive questions.
This can be attributed to the LLMs' inability to perform rigorous reasoning based on surface-level text semantics.
We propose a novel approach where we reframe the Question Answering task as Programming.
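A toy, hedged example of the reframing for a time-sensitive question: extracted facts and the query become structured data, and the answer is selected by explicit code rather than by surface-level text matching (the entities and dates are invented).

```python
# A toy, hedged example of answering a time-sensitive question with explicit
# code over structured facts; entities and dates here are invented.
from datetime import date

# Facts an LLM might extract from retrieved text: (person, term start, term end).
tenures = [
    ("Alice", date(2005, 1, 1), date(2012, 6, 30)),
    ("Bob",   date(2012, 7, 1), date(2019, 3, 31)),
    ("Carol", date(2019, 4, 1), date(2024, 1, 1)),
]


def who_held_office_on(query_date: date) -> str:
    """Answer 'who held the office on <date>?' by checking date ranges in code."""
    for name, start, end in tenures:
        if start <= query_date <= end:
            return name
    return "unknown"


print(who_held_office_on(date(2015, 5, 20)))  # Bob
```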
arXiv Detail & Related papers (2023-05-23T16:35:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.