Do Smaller Language Models Answer Contextualised Questions Through
Memorisation Or Generalisation?
- URL: http://arxiv.org/abs/2311.12337v1
- Date: Tue, 21 Nov 2023 04:06:08 GMT
- Title: Do Smaller Language Models Answer Contextualised Questions Through
Memorisation Or Generalisation?
- Authors: Tim Hartill, Joshua Bensemann, Michael Witbrock and Patricia J. Riddle
- Abstract summary: A distinction is often drawn between a model's ability to predict a label for an evaluation sample that is directly memorised from highly similar training samples versus its ability to predict the label via some method of generalisation.
We propose a method of identifying evaluation samples for which it is very unlikely our model would have memorised the answers.
- Score: 8.51696622847778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A distinction is often drawn between a model's ability to predict a label for
an evaluation sample that is directly memorised from highly similar training
samples versus an ability to predict the label via some method of
generalisation. In the context of using Language Models for question-answering,
discussion continues to occur as to the extent to which questions are answered
through memorisation. We consider this issue for questions that would ideally
be answered through reasoning over an associated context. We propose a method
of identifying evaluation samples for which it is very unlikely our model would
have memorised the answers. Our method is based on semantic similarity of input
tokens and label tokens between training and evaluation samples. We show that
our method offers advantages over some prior approaches in that it is able to
surface evaluation-train pairs that have overlap in either contiguous or
discontiguous sequences of tokens. We use this method to identify unmemorisable
subsets of our evaluation datasets. We train two Language Models in a multitask
fashion whereby the second model differs from the first only in that it has two
additional datasets added to the training regime that are designed to impart
simple numerical reasoning strategies of a sort known to improve performance on
some of our evaluation datasets but not on others. We then show that there is a
performance improvement between the two models on the unmemorisable subsets of
the evaluation datasets that were expected to benefit from the additional
training datasets. Specifically, performance on unmemorisable subsets of two of
our evaluation datasets, DROP and ROPES, significantly improves by 9.0% and
25.7% respectively, while other evaluation datasets show no significant change
in performance.
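The abstract sketches the core mechanism: score the semantic similarity of input tokens and label tokens between every evaluation sample and the training pool, and keep only evaluation samples with no close training neighbour. Below is a minimal sketch of that idea, not the authors' exact pipeline; the sentence-transformers library, the "all-MiniLM-L6-v2" encoder, and the 0.6 threshold are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact method): flag evaluation samples whose
# combined input + label text has no close semantic neighbour in the training
# data, and treat those as "unmemorisable".
# Assumptions: the encoder name and the 0.6 cosine-similarity threshold are
# illustrative choices, not values from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer


def unmemorisable_subset(train_samples, eval_samples, threshold=0.6):
    """train_samples / eval_samples: lists of dicts with 'input' and 'label' strings."""
    model = SentenceTransformer("all-MiniLM-L6-v2")

    def embed(samples):
        # Embed input and label text together so that overlap in either
        # contiguous or discontiguous token sequences raises the score.
        texts = [s["input"] + " " + s["label"] for s in samples]
        return np.asarray(model.encode(texts, normalize_embeddings=True))

    train_vecs = embed(train_samples)
    eval_vecs = embed(eval_samples)

    # Embeddings are L2-normalised, so cosine similarity is a dot product.
    sims = eval_vecs @ train_vecs.T      # shape: (n_eval, n_train)
    best_match = sims.max(axis=1)        # closest training sample per eval sample

    # Evaluation samples with no close training neighbour are kept.
    return [s for s, score in zip(eval_samples, best_match) if score < threshold]
```

Comparing on the combined input-plus-label text is what lets overlap in either contiguous or discontiguous token sequences raise the similarity score, which mirrors the advantage the paper claims over purely contiguous-overlap filters.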
Related papers
- SureMap: Simultaneous Mean Estimation for Single-Task and Multi-Task Disaggregated Evaluation [75.56845750400116]
Disaggregated evaluation -- estimation of performance of a machine learning model on different subpopulations -- is a core task when assessing performance and group-fairness of AI systems.
We develop SureMap, which has high estimation accuracy for both multi-task and single-task disaggregated evaluations of black-box models.
Our method combines maximum a posteriori (MAP) estimation using a well-chosen prior together with cross-validation-free tuning via Stein's unbiased risk estimate (SURE)
arXiv Detail & Related papers (2024-11-14T17:53:35Z)
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
- FRACTAL: Fine-Grained Scoring from Aggregate Text Labels [17.052047103156372]
Large language models (LLMs) are increasingly tuned to power complex generation tasks such as writing, fact-seeking, querying and reasoning.
Traditionally, human or model feedback for evaluating and tuning LLM performance has been provided at the response level.
Recent works indicate that sentence-level labels may provide more accurate and interpretable feedback for LLM optimization.
arXiv Detail & Related papers (2024-04-07T05:54:28Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- DualView: Data Attribution from the Dual Perspective [16.083769847895336]
We present DualView, a novel method for post-hoc data attribution based on surrogate modelling.
We find that DualView requires considerably lower computational resources than other methods, while demonstrating comparable performance across evaluation metrics.
arXiv Detail & Related papers (2024-02-19T13:13:16Z)
- ACTOR: Active Learning with Annotator-specific Classification Heads to Embrace Human Label Variation [35.10805667891489]
Active learning, as an annotation cost-saving strategy, has not been fully explored in the context of learning from disagreement.
We show that in the active learning setting, a multi-head model performs significantly better than a single-head model in terms of uncertainty estimation.
arXiv Detail & Related papers (2023-10-23T14:26:43Z)
- Phoneme Segmentation Using Self-Supervised Speech Models [13.956691231452336]
We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task.
Our model extends transformer-style encoders with strategically placed convolutions that manipulate features learned in pre-training.
arXiv Detail & Related papers (2022-11-02T19:57:31Z)
- Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles [66.15398165275926]
We propose a method that can automatically detect and ignore dataset-specific patterns, which we call dataset biases.
Our method trains a lower capacity model in an ensemble with a higher capacity model.
We show improvement in all settings, including a 10 point gain on the visual question answering dataset.
arXiv Detail & Related papers (2020-11-07T22:20:03Z)
- Few-shot Visual Reasoning with Meta-analogical Contrastive Learning [141.2562447971]
We propose to solve a few-shot (or low-shot) visual reasoning problem, by resorting to analogical reasoning.
We extract structural relationships between elements in both domains, and enforce them to be as similar as possible with analogical learning.
We validate our method on RAVEN dataset, on which it outperforms state-of-the-art method, with larger gains when the training data is scarce.
arXiv Detail & Related papers (2020-07-23T14:00:34Z)
- Pointwise Paraphrase Appraisal is Potentially Problematic [21.06607915149245]
We show that the standard way of fine-tuning BERT for paraphrase identification by pairing two sentences as one sequence results in a model with state-of-the-art performance.
We also show that these models may even predict a pair of randomly-selected sentences with higher paraphrase score than a pair of identical ones.
arXiv Detail & Related papers (2020-05-25T09:27:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.