Predictions from language models for multiple-choice tasks are not
robust under variation of scoring methods
- URL: http://arxiv.org/abs/2403.00998v1
- Date: Fri, 1 Mar 2024 21:48:08 GMT
- Title: Predictions from language models for multiple-choice tasks are not
robust under variation of scoring methods
- Authors: Polina Tsvilodub, Hening Wang, Sharon Grosch and Michael Franke
- Abstract summary: This paper systematically compares different methods of deriving item-level predictions of language models for multiple-choice tasks.
It compares scoring methods for answer options based on free generation of responses, various probability-based scores, a Likert-scale style rating method, and embedding similarity.
- Score: 5.5711773076846365
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper systematically compares different methods of deriving item-level
predictions of language models for multiple-choice tasks. It compares scoring
methods for answer options based on free generation of responses, various
probability-based scores, a Likert-scale style rating method, and embedding
similarity. In a case study on pragmatic language interpretation, we find that
LLM predictions are not robust under variation of method choice, both within a
single LLM and across different LLMs. As this variability entails pronounced
researcher degrees of freedom in reporting results, knowledge of the
variability is crucial to secure robustness of results and research integrity.
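As a concrete illustration of one family of methods named in the abstract, the sketch below computes two common probability-based scores (summed and length-averaged token log-probability) for each answer option of a pragmatic multiple-choice item. This is a minimal sketch under stated assumptions, not the paper's exact setup: the model ("gpt2"), the example item, and the helper `option_log_prob` are illustrative stand-ins.

```python
# Minimal sketch: probability-based scoring of answer options with a causal LM.
# Assumptions: Hugging Face transformers + PyTorch; "gpt2" is a stand-in model,
# and the example item is illustrative, not taken from the paper's materials.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def option_log_prob(prompt: str, option: str) -> tuple[float, float]:
    """Return (summed, length-averaged) log-probability of `option` given `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits          # shape: (1, seq_len, vocab)
    log_probs = torch.log_softmax(logits, dim=-1)
    # Positions of the option tokens (everything after the prompt).
    # Note: this simple slicing ignores tokenization-boundary effects.
    option_positions = list(range(prompt_ids.shape[1], full_ids.shape[1]))
    total = 0.0
    for pos in option_positions:
        token_id = full_ids[0, pos]
        # Logits at position pos-1 predict the token at position pos.
        total += log_probs[0, pos - 1, token_id].item()
    return total, total / len(option_positions)

prompt = "Q: Some of the students passed the exam. Did all of them pass?\nA:"
for opt in [" Yes", " No", " Maybe"]:
    summed, averaged = option_log_prob(prompt, opt)
    print(f"{opt.strip()!r}: sum={summed:.2f}, mean={averaged:.2f}")
```

The paper's central observation is that the option rankings induced by such scores (e.g., summed vs. length-averaged log-probability), or by free generation, Likert-style rating, or embedding similarity, need not agree, which is why the choice of scoring method matters for reported results.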
Related papers
- A statistically consistent measure of Semantic Variability using Language Models [3.4933610074113464]
We present a measure of semantic variability that is statistically consistent under mild assumptions.
This measure, denoted semantic spectral entropy, is an easy-to-implement algorithm that requires only off-the-shelf language models.
arXiv Detail & Related papers (2025-02-01T17:55:58Z) - Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation [60.493180081319785]
We propose a systematic way to estimate the capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step.
Our work offers a comprehensive comparison of existing truncation sampling methods and serves as a practical user guideline for their parameter selection.
arXiv Detail & Related papers (2024-08-24T14:14:32Z) - In-Context Example Selection via Similarity Search Improves Low-Resource Machine Translation [20.704153242284114]
We focus on machine translation (MT), a task that has been shown to benefit from in-context translation examples.
No systematic studies have been published on how best to select examples, and mixed results have been reported on the usefulness of similarity-based selection.
We find that sentence embedding similarity can improve MT, especially for low-resource language directions.
arXiv Detail & Related papers (2024-08-01T09:07:32Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - Revisiting Demonstration Selection Strategies in In-Context Learning [66.11652803887284]
Large language models (LLMs) have shown an impressive ability to perform a wide range of tasks using in-context learning (ICL).
In this work, we first revisit the factors contributing to this variance from both data and model aspects, and find that the choice of demonstration is both data- and model-dependent.
We propose a data- and model-dependent demonstration selection method, TopK + ConE, based on the assumption that the performance of a demonstration positively correlates with its contribution to the model's understanding of the test samples.
arXiv Detail & Related papers (2024-01-22T16:25:27Z) - Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z) - Multilingual Few-Shot Learning via Language Model Retrieval [18.465566186549072]
Transformer-based language models have achieved remarkable success in few-shot in-context learning.
We conduct a study of retrieving semantically similar few-shot samples and using them as the context.
We evaluate the proposed method on five natural language understanding datasets related to intent detection, question classification, sentiment analysis, and topic classification.
arXiv Detail & Related papers (2023-06-19T14:27:21Z) - Active Learning Principles for In-Context Learning with Large Language
Models [65.09970281795769]
This paper investigates how Active Learning algorithms can serve as effective demonstration selection methods for in-context learning.
We show that in-context example selection through AL prioritizes high-quality examples that exhibit low uncertainty and bear similarity to the test examples.
arXiv Detail & Related papers (2023-05-23T17:16:04Z) - Greedy Search Algorithms for Unsupervised Variable Selection: A
Comparative Study [3.4888132404740797]
This paper focuses on dimensionality reduction based on unsupervised variable selection.
We present a critical evaluation of seven unsupervised greedy variable selection algorithms.
We introduce and evaluate, for the first time, a lazy implementation of the variance-explained-based forward selection component analysis (FSCA) algorithm.
arXiv Detail & Related papers (2021-03-03T21:10:26Z) - Interpretable Multi-dataset Evaluation for Named Entity Recognition [110.64368106131062]
We present a general methodology for interpretable evaluation for the named entity recognition (NER) task.
The proposed evaluation method enables us to interpret the differences in models and datasets, as well as the interplay between them.
By making our analysis tool available, we make it easy for future researchers to run similar analyses and drive progress in this area.
arXiv Detail & Related papers (2020-11-13T10:53:27Z)