Do Smaller Language Models Answer Contextualised Questions Through
Memorisation Or Generalisation?
- URL: http://arxiv.org/abs/2311.12337v1
- Date: Tue, 21 Nov 2023 04:06:08 GMT
- Title: Do Smaller Language Models Answer Contextualised Questions Through
Memorisation Or Generalisation?
- Authors: Tim Hartill, Joshua Bensemann, Michael Witbrock and Patricia J. Riddle
- Abstract summary: A distinction is often drawn between a model's ability to predict a label for an evaluation sample that is directly memorised from highly similar training samples versus its ability to predict the label via some method of generalisation.
We propose a method of identifying evaluation samples for which it is very unlikely our model would have memorised the answers.
- Score: 8.51696622847778
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A distinction is often drawn between a model's ability to predict a label for
an evaluation sample that is directly memorised from highly similar training
samples versus an ability to predict the label via some method of
generalisation. In the context of using Language Models for question-answering,
discussion continues to occur as to the extent to which questions are answered
through memorisation. We consider this issue for questions that would ideally
be answered through reasoning over an associated context. We propose a method
of identifying evaluation samples for which it is very unlikely our model would
have memorised the answers. Our method is based on semantic similarity of input
tokens and label tokens between training and evaluation samples. We show that
our method offers advantages over some prior approaches in that it is able to
surface evaluation-train pairs that have overlap in either contiguous or
discontiguous sequences of tokens. We use this method to identify unmemorisable
subsets of our evaluation datasets. We train two Language Models in a multitask
fashion whereby the second model differs from the first only in that it has two
additional datasets added to the training regime that are designed to impart
simple numerical reasoning strategies of a sort known to improve performance on
some of our evaluation datasets but not on others. We then show that there is a
performance improvement between the two models on the unmemorisable subsets of
the evaluation datasets that were expected to benefit from the additional
training datasets. Specifically, performance on unmemorisable subsets of two of
our evaluation datasets, DROP and ROPES, significantly improves by 9.0% and
25.7% respectively, while other evaluation datasets show no significant change
in performance.
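The abstract sketches the core mechanism: score the semantic similarity of input tokens and label tokens between every evaluation sample and the training pool, and keep only evaluation samples with no close training neighbour. Below is a minimal sketch of that idea, not the authors' exact pipeline; the sentence-transformers library, the "all-MiniLM-L6-v2" encoder, and the 0.6 threshold are illustrative assumptions.

```python
# Minimal sketch (not the paper's exact method): flag evaluation samples whose
# combined input + label text has no close semantic neighbour in the training
# data, and treat those as "unmemorisable".
# Assumptions: the encoder name and the 0.6 cosine-similarity threshold are
# illustrative choices, not values from the paper.
import numpy as np
from sentence_transformers import SentenceTransformer


def unmemorisable_subset(train_samples, eval_samples, threshold=0.6):
    """train_samples / eval_samples: lists of dicts with 'input' and 'label' strings."""
    model = SentenceTransformer("all-MiniLM-L6-v2")

    def embed(samples):
        # Embed input and label text together so that overlap in either
        # contiguous or discontiguous token sequences raises the score.
        texts = [s["input"] + " " + s["label"] for s in samples]
        return np.asarray(model.encode(texts, normalize_embeddings=True))

    train_vecs = embed(train_samples)
    eval_vecs = embed(eval_samples)

    # Embeddings are L2-normalised, so cosine similarity is a dot product.
    sims = eval_vecs @ train_vecs.T      # shape: (n_eval, n_train)
    best_match = sims.max(axis=1)        # closest training sample per eval sample

    # Evaluation samples with no close training neighbour are kept.
    return [s for s, score in zip(eval_samples, best_match) if score < threshold]
```

Comparing on the combined input-plus-label text is what lets overlap in either contiguous or discontiguous token sequences raise the similarity score, which mirrors the advantage the paper claims over purely contiguous-overlap filters.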
Related papers
- SureMap: Simultaneous Mean Estimation for Single-Task and Multi-Task Disaggregated Evaluation [75.56845750400116]
Disaggregated evaluation -- estimation of performance of a machine learning model on different subpopulations -- is a core task when assessing performance and group-fairness of AI systems.
We develop SureMap, which has high estimation accuracy for both multi-task and single-task disaggregated evaluations of black-box models.
Our method combines maximum a posteriori (MAP) estimation using a well-chosen prior together with cross-validation-free tuning via Stein's unbiased risk estimate (SURE)
arXiv Detail & Related papers (2024-11-14T17:53:35Z)
- Likelihood as a Performance Gauge for Retrieval-Augmented Generation [78.28197013467157]
We show that likelihoods serve as an effective gauge for language model performance.
We propose two methods that use question likelihood as a gauge for selecting and constructing prompts that lead to better performance.
arXiv Detail & Related papers (2024-11-12T13:14:09Z)
- Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
- FRACTAL: Fine-Grained Scoring from Aggregate Text Labels [17.052047103156372]
Large language models (LLMs) are increasingly tuned to power complex generation tasks such as writing, fact-seeking, querying and reasoning.
Traditionally, human or model feedback for evaluating and tuning LLM performance has been provided at the response level.
Recent works indicate that sentence-level labels may provide more accurate and interpretable feedback for LLM optimization.
arXiv Detail & Related papers (2024-04-07T05:54:28Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- DualView: Data Attribution from the Dual Perspective [16.083769847895336]
We present DualView, a novel method for post-hoc data attribution based on surrogate modelling.
We find that DualView requires considerably lower computational resources than other methods, while demonstrating comparable performance across evaluation metrics.
arXiv Detail & Related papers (2024-02-19T13:13:16Z)
- ACTOR: Active Learning with Annotator-specific Classification Heads to Embrace Human Label Variation [35.10805667891489]
Active learning, as an annotation cost-saving strategy, has not been fully explored in the context of learning from disagreement.
We show that in the active learning setting, a multi-head model performs significantly better than a single-head model in terms of uncertainty estimation.
arXiv Detail & Related papers (2023-10-23T14:26:43Z)
- Phoneme Segmentation Using Self-Supervised Speech Models [13.956691231452336]
We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task.
Our model extends transformer-style encoders with strategically placed convolutions that manipulate features learned in pre-training.
arXiv Detail & Related papers (2022-11-02T19:57:31Z)
- Learning to Model and Ignore Dataset Bias with Mixed Capacity Ensembles [66.15398165275926]
We propose a method that can automatically detect and ignore dataset-specific patterns, which we call dataset biases.
Our method trains a lower capacity model in an ensemble with a higher capacity model.
We show improvement in all settings, including a 10 point gain on the visual question answering dataset.
arXiv Detail & Related papers (2020-11-07T22:20:03Z)
- Few-shot Visual Reasoning with Meta-analogical Contrastive Learning [141.2562447971]
We propose to solve a few-shot (or low-shot) visual reasoning problem, by resorting to analogical reasoning.
We extract structural relationships between elements in both domains, and enforce them to be as similar as possible with analogical learning.
We validate our method on RAVEN dataset, on which it outperforms state-of-the-art method, with larger gains when the training data is scarce.
arXiv Detail & Related papers (2020-07-23T14:00:34Z)
- Pointwise Paraphrase Appraisal is Potentially Problematic [21.06607915149245]
We show that the standard way of fine-tuning BERT for paraphrase identification by pairing two sentences as one sequence results in a model with state-of-the-art performance.
We also show that these models may even predict a pair of randomly-selected sentences with higher paraphrase score than a pair of identical ones.
arXiv Detail & Related papers (2020-05-25T09:27:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.