Mitigating the Problem of Strong Priors in LMs with Context Extrapolation
- URL: http://arxiv.org/abs/2401.17692v1
- Date: Wed, 31 Jan 2024 09:28:06 GMT
- Title: Mitigating the Problem of Strong Priors in LMs with Context Extrapolation
- Authors: Raymond Douglas, Andis Draguns, Tomáš Gavenčiak
- Abstract summary: We develop a new technique for mitigating the problem of strong priors.
We take the original set of instructions, produce a weakened version of the original prompt, and extrapolate the continuation away from the weakened prompt.
This lets us infer how the model would continue a hypothetical strengthened set of instructions.
- Score: 0.6629765271909505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language models (LMs) have become important tools in a variety of
applications, from data processing to the creation of instruction-following
assistants. But despite their advantages, LMs have certain idiosyncratic
limitations such as the problem of 'strong priors', where a model learns to
output typical continuations in response to certain, usually local, portions of
the input regardless of any earlier instructions. For example, prompt injection
attacks can induce models to ignore explicit directives. In some cases, larger
models have been shown to be more susceptible to these problems than similar
smaller models, an example of the phenomenon of 'inverse scaling'. We develop a
new technique for mitigating the problem of strong priors: we take the original
set of instructions, produce a weakened version of the original prompt that is
even more susceptible to the strong priors problem, and then extrapolate the
continuation away from the weakened prompt. This lets us infer how the model
would continue a hypothetical strengthened set of instructions. Our technique
conceptualises LMs as mixture models which combine a family of data generation
processes, reinforcing the desired elements of the mixture. Our approach works
at inference time, removing any need for retraining. We apply it to eleven
models including GPT-2, GPT-3, Llama 2, and Mistral on four tasks, and find
improvements in 41/44. Across all 44 combinations the median increase in
proportion of tasks completed is 40%.
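
The extrapolation step described in the abstract can be illustrated with a toy calculation. This is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes the extrapolation is linear in log-probability space and is followed by renormalisation, which the abstract does not specify. The function names, the `alpha` strength parameter, and the two-token probabilities are hypothetical and chosen for illustration.

```python
import math

def log_softmax(scores):
    """Normalise unnormalised log-scores into log-probabilities."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]

def extrapolate(logp_original, logp_weakened, alpha=1.0):
    """Move each token's log-probability further along the direction
    weakened -> original, then renormalise (assumed form, see lead-in).
    alpha = 0 returns the original distribution unchanged."""
    raw = [o + alpha * (o - w) for o, w in zip(logp_original, logp_weakened)]
    return log_softmax(raw)

# Toy two-token vocabulary (made-up numbers):
# index 0 follows the instruction, index 1 is the "strong prior" continuation.
p_original = [0.4, 0.6]   # next-token probabilities under the original prompt
p_weakened = [0.2, 0.8]   # next-token probabilities under the weakened prompt

logp_new = extrapolate([math.log(p) for p in p_original],
                       [math.log(p) for p in p_weakened],
                       alpha=1.0)
p_new = [math.exp(lp) for lp in logp_new]
print([round(p, 2) for p in p_new])  # [0.64, 0.36]
```

In this toy case the weakened prompt lowers the probability of the instruction-following token, so extrapolating away from it raises that token from 0.4 to 0.64 and makes it the most likely continuation. In practice the two distributions would come from running the same model on the original and weakened prompts.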
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
- Scalable Influence and Fact Tracing for Large Language Model Pretraining [14.598556308631018]
Training data attribution (TDA) methods aim to attribute model outputs back to specific training examples.
We refine existing gradient-based methods to work effectively at scale.
We release our prompt set and model outputs, along with a web-based visualization tool to explore influential examples.
arXiv Detail & Related papers (2024-10-22T20:39:21Z)
- PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation [68.17081518640934]
We propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R).
PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module.
Our PIVOT-R outperforms state-of-the-art open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks.
arXiv Detail & Related papers (2024-10-14T11:30:18Z)
- MAP: Low-compute Model Merging with Amortized Pareto Fronts via Quadratic Approximation [80.47072100963017]
We introduce a novel and low-compute algorithm, Model Merging with Amortized Pareto Front (MAP).
MAP efficiently identifies a set of scaling coefficients for merging multiple models, reflecting the trade-offs involved.
We also introduce Bayesian MAP for scenarios with a relatively low number of tasks and Nested MAP for situations with a high number of tasks, further reducing the computational cost of evaluation.
arXiv Detail & Related papers (2024-06-11T17:55:25Z)
- Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning [56.82041895921434]
Open-source pre-trained Large Language Models (LLMs) exhibit strong language understanding and generation capabilities.
When used as agents for dealing with complex problems in the real world, their performance is far inferior to large commercial models such as ChatGPT and GPT-4.
arXiv Detail & Related papers (2024-03-29T03:48:12Z)
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences.
arXiv Detail & Related papers (2023-08-23T12:36:57Z)
- ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models [32.95155349925248]
We propose a modular paradigm ReWOO that detaches the reasoning process from external observations, thus significantly reducing token consumption.
We show that ReWOO achieves 5x token efficiency and 4% accuracy improvement on HotpotQA, a multi-step reasoning benchmark.
Our illustrative work offloads reasoning ability from 175B GPT3.5 into 7B LLaMA, demonstrating the significant potential for truly efficient and scalable ALM systems.
arXiv Detail & Related papers (2023-05-23T00:16:48Z)
- Mixture of Soft Prompts for Controllable Data Generation [21.84489422361048]
Mixture of Soft Prompts (MSP) is proposed as a tool for data augmentation rather than direct prediction.
Our method achieves state-of-the-art results on three benchmarks when compared against strong baselines.
arXiv Detail & Related papers (2023-03-02T21:13:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.