Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
- URL: http://arxiv.org/abs/2410.02343v1
- Date: Thu, 3 Oct 2024 09:53:48 GMT
- Title: Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA
- Authors: Eduard Tulchinskii, Laida Kushnareva, Kristian Kuznetsov, Anastasia Voznyuk, Andrei Andriiainen, Irina Piontkovskaya, Evgeny Burnaev, Serguei Barannikov
- Abstract summary: We introduce new scores that better capture and reveal the model's underlying knowledge.
Based on these scores, our method improves knowledge extraction, yielding up to a 16% gain for LLaMA2-7B.
The accuracy on a simple synthetic dataset, where the model explicitly knows the right answer, increases by almost 60%.
- Score: 19.78468832417275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A standard way to evaluate the abilities of LLMs involves presenting a multiple-choice question and selecting the option with the highest logit as the model's predicted answer. However, this format for evaluating LLMs has limitations: even if the model knows the correct answer, it may struggle to select the corresponding letter simply because of difficulties in following this rigid format. To address this, we introduce new scores that better capture and reveal the model's underlying knowledge: the Query-Key Score (QK-score), derived from the interaction between query and key representations in attention heads, and the Attention Score, based on attention weights. These scores are extracted from specific "select-and-copy" heads, which show consistent performance across popular Multi-Choice Question Answering (MCQA) datasets. Based on these scores, our method improves knowledge extraction, yielding up to a 16% gain for LLaMA2-7B and up to 10% for larger models on popular MCQA benchmarks. At the same time, the accuracy on a simple synthetic dataset, where the model explicitly knows the right answer, increases by almost 60%, reaching nearly perfect accuracy and thereby demonstrating the method's effectiveness in mitigating MCQA format limitations. To support our claims, we conduct experiments on models ranging from 7 billion to 70 billion parameters in both zero- and few-shot setups.
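To make the head-level scoring idea concrete, here is a minimal sketch of how the Attention Score variant could be computed with Hugging Face transformers. It is an illustration only, not the authors' implementation: it uses gpt2 as a small stand-in model, a made-up four-option question, and simple token-position bookkeeping for the option labels. The QK-score is analogous but uses the raw query-key dot products, which requires hooking into the model's attention internals, so it is omitted here.

```python
# Illustrative sketch (not the authors' released code) of the Attention Score:
# for each attention head, read the attention weight from the final prompt token
# to each answer-option label and let the head "vote" for the option it attends
# to most.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper studies LLaMA-family models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_attentions=True)
model.eval()

question = "Which planet is known as the Red Planet?"
options = {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Saturn"}

# Build the prompt incrementally and record the token position of each option
# label. This bookkeeping assumes each label ("A", "B", ...) becomes a single
# token right after the preceding text, which holds for GPT-2 with this format
# but should be verified for other tokenizers (e.g. ones that prepend a BOS).
prompt = f"Question: {question}\n"
label_positions = {}
for label, text in options.items():
    label_positions[label] = len(tokenizer(prompt)["input_ids"])
    prompt += f"{label}. {text}\n"
prompt += "Answer:"

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

last = inputs["input_ids"].shape[1] - 1  # position of the final prompt token
for layer_idx, attn in enumerate(out.attentions):  # (batch, heads, seq, seq) per layer
    for head_idx in range(attn.shape[1]):
        scores = {lbl: attn[0, head_idx, last, pos].item()
                  for lbl, pos in label_positions.items()}
        choice = max(scores, key=scores.get)
        print(f"layer {layer_idx:2d} head {head_idx:2d}: picks option {choice}")
```

Per the abstract, only a small set of heads behaves as reliable "select-and-copy" heads, so in practice the per-head votes above would be filtered on held-out questions before being used for prediction.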
Related papers
- None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks [0.9831489366502301]
We introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts.
Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets available in English and Spanish.
Results show that all models experience remarkable accuracy drops, with an average loss of 57% on MMLU and 50% on UNED-Access 2024.
arXiv Detail & Related papers (2025-02-18T14:32:44Z) - Option-ID Based Elimination For Multiple Choice Questions [12.30777266124562]
Multiple choice questions (MCQs) are a popular and important task for evaluating large language models (LLMs).
Based on common strategies people use when answering MCQs, the process of elimination (PoE) has been proposed as an effective problem-solving method.
This paper proposes a PoE method based on option IDs. Specifically, the method eliminates an option by selecting the option ID with the lowest probability (a minimal illustrative sketch of this elimination step appears after this list).
arXiv Detail & Related papers (2025-01-25T11:06:37Z) - LLM Distillation for Efficient Few-Shot Multiple Choice Question Answering [1.0874597293913013]
Multiple Choice Question Answering (MCQA) is an important problem with numerous real-world applications, such as medicine, law, and education.
We propose a simple yet effective approach that uses Large Language Models for data generation and scoring.
Our method improves accuracy from 28.9% to 39.3%, a gain of more than 10 percentage points over a baseline finetuned directly on 5-shot examples.
arXiv Detail & Related papers (2024-12-13T02:48:36Z) - Differentiating Choices via Commonality for Multiple-Choice Question Answering [54.04315943420376]
In multiple-choice question answering, the answer choices themselves can provide valuable clues for choosing the right answer.
Existing models often rank each choice separately, overlooking the context provided by other choices.
We propose a novel model, called DCQA, that differentiates choices by identifying and eliminating their commonality.
arXiv Detail & Related papers (2024-08-21T12:05:21Z) - Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions [103.20281438405111]
Multiple-choice question answering (MCQA) is a key competence of performant transformer language models.
We employ vocabulary projection and activation patching methods to localize key hidden states that encode relevant information.
We show that prediction of a specific answer symbol is causally attributed to a single middle layer, and specifically its multi-head self-attention mechanism.
arXiv Detail & Related papers (2024-07-21T00:10:23Z) - Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? [16.384333600053342]
Recent work shows that large language models (LLMs) can answer multiple-choice questions using only the choices.
We use a contrast set that probes if LLMs over-rely on choices-only shortcuts in MCQA.
After validating our contrast set, we test 12 LLMs, finding that these models do not exhibit reliance on choice-only shortcuts when given both the question and choices.
arXiv Detail & Related papers (2024-07-02T07:06:53Z) - UnibucLLM: Harnessing LLMs for Automated Prediction of Item Difficulty and Response Time for Multiple-Choice Questions [25.877058354902953]
This work explores a novel data augmentation method based on Large Language Models (LLMs) for predicting item difficulty and response time of retired USMLE Multiple-Choice Questions (MCQs) in the BEA 2024 Shared Task.
Our approach augments the dataset with answers from zero-shot LLMs and employs transformer-based models with six alternative feature combinations.
arXiv Detail & Related papers (2024-04-20T10:41:02Z) - Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41% and 7.94% points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z) - Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z) - Improving Selective Visual Question Answering by Learning from Your Peers [74.20167944693424]
Visual Question Answering (VQA) models can have difficulties abstaining from answering when they are wrong.
We propose the Learning from Your Peers (LYP) approach for training multimodal selection functions that make abstention decisions.
Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model.
arXiv Detail & Related papers (2023-06-14T21:22:01Z) - Few-Shot Question Answering by Pretraining Span Selection [58.31911597824848]
We explore the more realistic few-shot setting, where only a few hundred training examples are available.
We show that standard span selection models perform poorly, highlighting the fact that current pretraining objectives are far removed from question answering.
Our findings indicate that careful design of pretraining schemes and model architecture can have a dramatic effect on performance in the few-shot setting.
arXiv Detail & Related papers (2021-01-02T11:58:44Z)
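As referenced in the option-ID elimination entry above, here is a minimal sketch of what such a process-of-elimination step could look like. It is an illustrative reading of that paper's summary, not its actual implementation: the model name (gpt2), the prompt format, and the choice to keep the original labels for the remaining options are all assumptions made for the example.

```python
# Illustrative sketch of option-ID-based elimination: score the option-label
# tokens at the answer position, drop the lowest-probability option, then
# answer again over the remaining options.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model chosen for the example
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

def option_logprobs(question, options):
    """Log-probability of each option label as the next token after 'Answer:'."""
    prompt = f"Question: {question}\n"
    for label, text in options.items():
        prompt += f"{label}. {text}\n"
    prompt += "Answer:"
    enc = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits[0, -1]           # next-token logits
    logprobs = torch.log_softmax(logits, dim=-1)
    # Labels are scored as " A", " B", ... since a space would follow "Answer:".
    return {lbl: logprobs[tokenizer.encode(f" {lbl}")[0]].item() for lbl in options}

question = "Which planet is known as the Red Planet?"
options = {"A": "Venus", "B": "Mars", "C": "Jupiter", "D": "Saturn"}

scores = option_logprobs(question, options)
worst = min(scores, key=scores.get)                   # eliminate the least likely option ID
remaining = {lbl: txt for lbl, txt in options.items() if lbl != worst}

# Re-score with the eliminated option removed (original labels kept for simplicity).
final_scores = option_logprobs(question, remaining)
print("eliminated:", worst, "| answer:", max(final_scores, key=final_scores.get))
```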