Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features
- URL: http://arxiv.org/abs/2311.04046v1
- Date: Tue, 7 Nov 2023 15:00:39 GMT
- Title: Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features
- Authors: Diogo Cruz, Edoardo Pona, Alex Holness-Tofts, Elias Schmied, Víctor Abia Alonso, Charlie Griffin, Bogdan-Ionut Cirstea
- Abstract summary: We investigate whether principles governing inductive biases in the supervised fine-tuning of large language models also apply when the fine-tuning process uses reinforcement learning.
We find statistically significant correlations which constitute strong evidence for these hypotheses.
- Score: 0.5937476291232802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many capable large language models (LLMs) are developed via self-supervised
pre-training followed by a reinforcement-learning fine-tuning phase, often
based on human or AI feedback. During this stage, models may be guided by their
inductive biases to rely on simpler features which may be easier to extract, at
a cost to robustness and generalisation. We investigate whether principles
governing inductive biases in the supervised fine-tuning of LLMs also apply
when the fine-tuning process uses reinforcement learning. Following Lovering et
al. (2021), we test two hypotheses: that features more $\textit{extractable}$
after pre-training are more likely to be utilised by the final policy, and that
the evidence for/against a feature predicts whether it will be utilised.
Through controlled experiments on synthetic and natural language tasks, we find
statistically significant correlations which constitute strong evidence for
these hypotheses.
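As a rough illustration of the kind of analysis the abstract describes, the sketch below (not taken from the paper) correlates a probe-based extractability score with a measure of how strongly the fine-tuned policy relies on each feature. The variable names, numbers, and the choice of probe accuracy and Spearman correlation are illustrative assumptions, not the authors' exact protocol.

```python
# Hedged sketch of the first hypothesis: features that are more extractable
# after pre-training should be relied on more by the RL fine-tuned policy.
# We assume per-feature scores are already available:
#   extractability  - e.g. accuracy of a small linear probe on frozen
#                     pre-trained representations (hypothetical values)
#   policy_reliance - e.g. fraction of counterfactual test pairs where the
#                     fine-tuned policy's output tracks the feature (hypothetical)
import numpy as np
from scipy import stats

extractability = np.array([0.62, 0.71, 0.80, 0.88, 0.93, 0.97])
policy_reliance = np.array([0.10, 0.22, 0.35, 0.55, 0.70, 0.85])

# A positive, significant rank correlation would support the hypothesis.
rho, p_value = stats.spearmanr(extractability, policy_reliance)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.4f}")
```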
Related papers
- AutoElicit: Using Large Language Models for Expert Prior Elicitation in Predictive Modelling [53.54623137152208]
We introduce AutoElicit to extract knowledge from large language models and construct priors for predictive models.
We show these priors are informative and can be refined using natural language.
We find that AutoElicit yields priors that can substantially reduce error over uninformative priors, using fewer labels, and consistently outperform in-context learning.
arXiv Detail & Related papers (2024-11-26T10:13:39Z)
- Aggregation Artifacts in Subjective Tasks Collapse Large Language Models' Posteriors [74.04775677110179]
In-context Learning (ICL) has become the primary method for performing natural language tasks with Large Language Models (LLMs)
In this work, we examine whether this is the result of the aggregation used in corresponding datasets, where trying to combine low-agreement, disparate annotations might lead to annotation artifacts that create detrimental noise in the prompt.
Our results indicate that aggregation is a confounding factor in the modeling of subjective tasks, and advocate focusing on modeling individuals instead.
arXiv Detail & Related papers (2024-10-17T17:16:00Z)
- On the Inductive Bias of Stacking Towards Improving Reasoning [50.225873619537765]
We propose a variant of gradual stacking called MIDAS that can speed up language model training by up to 40%.
MIDAS is not only training-efficient but surprisingly also has an inductive bias towards improving downstream tasks.
We conjecture the underlying reason for this inductive bias by exploring the connection of stacking to looped models.
arXiv Detail & Related papers (2024-09-27T17:58:21Z)
- An Emulator for Fine-Tuning Large Language Models using Small Language Models [91.02498576056057]
We introduce emulated fine-tuning (EFT), a principled and practical method for sampling from a distribution that approximates the result of pre-training and fine-tuning at different scales.
We show that EFT enables test-time adjustment of competing behavioral traits like helpfulness and harmlessness without additional training.
Finally, a special case of emulated fine-tuning, which we call LM up-scaling, avoids resource-intensive fine-tuning of large pre-trained models by ensembling them with small fine-tuned models.
arXiv Detail & Related papers (2023-10-19T17:57:16Z)
- Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations [35.16904555065152]
In-context learning (ICL) is an important paradigm for adapting large language models to new tasks.
We investigate the inductive biases of ICL from the perspective of feature bias.
arXiv Detail & Related papers (2023-05-22T17:56:31Z)
- Explaining Language Models' Predictions with High-Impact Concepts [11.47612457613113]
We propose a complete framework for extending concept-based interpretability methods to NLP.
We optimize for features whose existence causes the output predictions to change substantially.
Our method achieves superior results on predictive impact, usability, and faithfulness compared to the baselines.
arXiv Detail & Related papers (2023-05-03T14:48:27Z)
- Fairness-guided Few-shot Prompting for Large Language Models [93.05624064699965]
In-context learning can suffer from high instability due to variations in training examples, example order, and prompt formats.
We introduce a metric to evaluate the predictive bias of a fixed prompt against labels or given attributes.
We propose a novel search strategy based on greedy search to identify a near-optimal prompt for improving the performance of in-context learning.
arXiv Detail & Related papers (2023-03-23T12:28:25Z)
- Learning with Latent Structures in Natural Language Processing: A Survey [0.0]
There has been recent interest in learning with latent discrete structures to incorporate better inductive biases for improved end-task performance and better interpretability.
This work surveys three main families of methods to learn such models: surrogate gradients, continuous relaxation, and marginal likelihood via sampling.
We conclude with a review of applications of these methods and an inspection of the learned latent structure that they induce.
arXiv Detail & Related papers (2022-01-03T06:16:17Z)
- Pragmatic competence of pre-trained language models through the lens of discourse connectives [4.917317902787791]
As pre-trained language models (LMs) continue to dominate NLP, it is increasingly important that we understand the depth of language capabilities in these models.
We focus on testing models' ability to use pragmatic cues to predict discourse connectives.
We find that although models predict connectives reasonably well in the context of naturally-occurring data, when we control contexts to isolate high-level pragmatic cues, model sensitivity is much lower.
arXiv Detail & Related papers (2021-09-27T11:04:41Z)
- Active Learning for Sequence Tagging with Deep Pre-trained Models and Bayesian Uncertainty Estimates [52.164757178369804]
Recent advances in transfer learning for natural language processing in conjunction with active learning open the possibility to significantly reduce the necessary annotation budget.
We conduct an empirical study of various Bayesian uncertainty estimation methods and Monte Carlo dropout options for deep pre-trained models in the active learning framework.
We also demonstrate that to acquire instances during active learning, a full-size Transformer can be substituted with a distilled version, which yields better computational performance.
arXiv Detail & Related papers (2021-01-20T13:59:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.