Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
- URL: http://arxiv.org/abs/2410.02343v1
- Date: Thu, 3 Oct 2024 09:53:48 GMT
- Title: Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
- Authors: Eduard Tulchinskii, Laida Kushnareva, Kristian Kuznetsov, Anastasia Voznyuk, Andrei Andriiainen, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
- Abstract summary: We introduce new scores that better capture and reveal the model's underlying knowledge.
Based on these scores, our method improves knowledge extraction, yielding up to a 16% gain for LLaMA2-7B.
The accuracy on a simple synthetic dataset, where the model explicitly knows the right answer, increases by almost 60%.
- Score: 19.78468832417275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A standard way to evaluate the abilities of LLMs involves presenting a multiple-choice question and selecting the option with the highest logit as the model's predicted answer. However, this format for evaluating LLMs has limitations: even if the model knows the correct answer, it may struggle to select the corresponding letter simply because of difficulties in following this rigid format. To address this, we introduce new scores that better capture and reveal the model's underlying knowledge: the Query-Key Score (QK-score), derived from the interaction between query and key representations in attention heads, and the Attention Score, based on attention weights. These scores are extracted from specific "select-and-copy" heads, which show consistent performance across popular Multi-Choice Question Answering (MCQA) datasets. Based on these scores, our method improves knowledge extraction, yielding up to a 16% gain for LLaMA2-7B and up to 10% for larger models on popular MCQA benchmarks. At the same time, the accuracy on a simple synthetic dataset, where the model explicitly knows the right answer, increases by almost 60%, reaching nearly perfect accuracy and thereby demonstrating the method's effectiveness in mitigating MCQA format limitations. To support our claims, we conduct experiments on models ranging from 7 billion to 70 billion parameters in both zero- and few-shot setups.
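To make the head-level scoring idea concrete, here is a minimal sketch of how the Attention Score variant could be computed with Hugging Face transformers. It is an illustration only, not the authors' implementation: it uses gpt2 as a small stand-in model, a made-up four-option question, and simple token-position bookkeeping for the option labels. The QK-score is analogous but uses the raw query-key dot products, which requires hooking into the model's attention internals, so it is omitted here.

```python
# Illustrative sketch (not the authors' released code) of the Attention Score:
# for each attention head, read the attention weight from the final prompt token
# to each answer-option label and let the head "vote" for the option it attends
# to most.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper studies LLaMA-family models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

question = "Which planet is known as the Red Planet?"
options = {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Saturn"}

# Build the prompt incrementally and record the token position of each option
# label. This bookkeeping assumes each label ("A", "B", ...) becomes a single
# token right after the preceding text, which holds for GPT-2 with this format
# but should be verified for other tokenizers (e.g. ones that prepend a BOS).
prompt = f"Question: {question}\n"
label_positions = {}
for label, text in options.items():
    label_positions[label] = len(tokenizer(prompt)["input_ids"])
    prompt += f"{label}. {text}\n"
prompt += "Answer:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

last = inputs["input_ids"].shape[1] - 1  # position of the final prompt token
for layer_idx, attn in enumerate(out.attentions):  # (batch, heads, seq, seq) per layer
    for head_idx in range(attn.shape[1]):
        scores = {lbl: attn[0, head_idx, last, pos].item()
                  for lbl, pos in label_positions.items()}
        choice = max(scores, key=scores.get)
        print(f"layer {layer_idx:2d} head {head_idx:2d}: picks option {choice}")
```

Per the abstract, only a small set of heads behaves as reliable "select-and-copy" heads, so in practice the per-head votes above would be filtered on held-out questions before being used for prediction.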
Related papers
- None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks [0.9831489366502301]
We introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts.
Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets available in English and Spanish.
Results show that all models experience remarkable accuracy drops, with an average loss of 57% on MMLU and 50% on UNED-Access 2024.
arXiv Detail & Related papers (2025-02-18T14:32:44Z) - Option-ID Based Elimination For Multiple Choice Questions [12.30777266124562]
Multiple choice questions (MCQs) are a popular and important task for evaluating large language models (LLMs).
Based on common strategies people use when answering MCQs, the process of elimination (PoE) has been proposed as an effective problem-solving method.
This paper proposes a PoE method based on option IDs. Specifically, the method eliminates an option by selecting the option ID with the lowest probability (a minimal illustrative sketch of this elimination step appears after this list).
arXiv Detail & Related papers (2025-01-25T11:06:37Z) - LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering [1.0874597293913013]
Multiple Choice Question Answering (MCQA) is an important problem with numerous real-world applications, such as medicine, law, and education.
We propose a simple yet effective approach that uses Large Language Models for data generation and scoring.
Our method improves accuracy from 28.9% to 39.3%, a gain of more than 10 percentage points over a baseline finetuned directly on 5-shot examples.
arXiv Detail & Related papers (2024-12-13T02:48:36Z) - Differentiating Choices via Commonality for Multiple-Choice Question Answering [54.04315943420376]
In multiple-choice question answering, the answer choices themselves can provide valuable clues for choosing the right answer.
Existing models often rank each choice separately, overlooking the context provided by other choices.
We propose a novel model, called DCQA, that differentiates choices by identifying and eliminating their commonality.
arXiv Detail & Related papers (2024-08-21T12:05:21Z) - Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions [103.20281438405111]
Multiple-choice question answering (MCQA) is a key competence of performant transformer language models.
We employ vocabulary projection and activation patching methods to localize key hidden states that encode relevant information.
We show that prediction of a specific answer symbol is causally attributed to a single middle layer, and specifically its multi-head self-attention mechanism.
arXiv Detail & Related papers (2024-07-21T00:10:23Z) - Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? [16.384333600053342]
Recent work shows that large language models (LLMs) can answer multiple-choice questions using only the choices.
We use a contrast set that probes if LLMs over-rely on choices-only shortcuts in MCQA.
After validating our contrast set, we test 12 LLMs, finding that these models do not exhibit reliance on choice-only shortcuts when given both the question and choices.
arXiv Detail & Related papers (2024-07-02T07:06:53Z) - UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice Questions [25.877058354902953]
This work explores a novel data augmentation method based on Large Language Models (LLMs) for predicting item difficulty and response time of retired USMLE Multiple-Choice Questions (MCQs) in the BEA 2024 Shared Task.
Our approach augments the dataset with answers from zero-shot LLMs and employs transformer-based models with six alternative feature combinations.
arXiv Detail & Related papers (2024-04-20T10:41:02Z) - Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41% and 7.94% points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z) - Improving Selective Visual Question Answering by Learning from Your Peers [74.20167944693424]
Visual Question Answering (VQA) models can have difficulties abstaining from answering when they are wrong.
We propose the Learning from Your Peers (LYP) approach for training multimodal selection functions that make abstention decisions.
Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model.
arXiv Detail & Related papers (2023-06-14T21:22:01Z) - Few-Shot Question Answering by Pretraining Span Selection [58.31911597824848]
We explore the more realistic few-shot setting, where only a few hundred training examples are available.
We show that standard span selection models perform poorly, highlighting the fact that current pretraining objectives are far removed from question answering.
Our findings indicate that careful design of pretraining schemes and model architecture can have a dramatic effect on performance in the few-shot setting.
arXiv Detail & Related papers (2021-01-02T11:58:44Z)
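As referenced in the option-ID elimination entry above, here is a minimal sketch of what such a process-of-elimination step could look like. It is an illustrative reading of that paper's summary, not its actual implementation: the model name (gpt2), the prompt format, and the choice to keep the original labels for the remaining options are all assumptions made for the example.

```python
# Illustrative sketch of option-ID-based elimination: score the option-label
# tokens at the answer position, drop the lowest-probability option, then
# answer again over the remaining options.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model chosen for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_logprobs(question, options):
    """Log-probability of each option label as the next token after 'Answer:'."""
    prompt = f"Question: {question}\n"
    for label, text in options.items():
        prompt += f"{label}. {text}\n"
    prompt += "Answer:"
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0, -1]           # next-token logits
    logprobs = torch.log_softmax(logits, dim=-1)
    # Labels are scored as " A", " B", ... since a space would follow "Answer:".
    return {lbl: logprobs[tokenizer.encode(f" {lbl}")[0]].item() for lbl in options}

question = "Which planet is known as the Red Planet?"
options = {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Saturn"}

scores = option_logprobs(question, options)
worst = min(scores, key=scores.get)                   # eliminate the least likely option ID
remaining = {lbl: txt for lbl, txt in options.items() if lbl != worst}

# Re-score with the eliminated option removed (original labels kept for simplicity).
final_scores = option_logprobs(question, remaining)
print("eliminated:", worst, "| answer:", max(final_scores, key=final_scores.get))
```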