Evaluating Shutdown Avoidance of Language Models in Textual Scenarios
- URL: http://arxiv.org/abs/2307.00787v1
- Date: Mon, 3 Jul 2023 07:05:59 GMT
- Title: Evaluating Shutdown Avoidance of Language Models in Textual Scenarios
- Authors: Teun van der Weij, Simon Lermen, Leon Lang
- Abstract summary: We investigate the potential of using toy scenarios to evaluate instrumental reasoning and shutdown avoidance in language models such as GPT-4 and Claude.
We evaluated behaviours manually and also experimented with using language models for automatic evaluations.
This study provides insights into the behaviour of language models in shutdown avoidance scenarios and inspires further research on the use of textual scenarios for evaluations.
- Score: 3.265773263570237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there has been an increase in interest in evaluating large language
models for emergent and dangerous capabilities. Importantly, agents could
reason that in some scenarios their goal is better achieved if they are not
turned off, which can lead to undesirable behaviors. In this paper, we
investigate the potential of using toy textual scenarios to evaluate
instrumental reasoning and shutdown avoidance in language models such as GPT-4
and Claude. Furthermore, we explore whether shutdown avoidance is merely a
result of simple pattern matching between the dataset and the prompt or if it
is a consistent behaviour across different environments and variations.
We evaluated behaviours manually and also experimented with using language
models for automatic evaluations, and these evaluations demonstrate that simple
pattern matching is likely not the sole contributing factor for shutdown
avoidance. This study provides insights into the behaviour of language models
in shutdown avoidance scenarios and inspires further research on the use of
textual scenarios for evaluations.
Related papers
- Recourse for reclamation: Chatting with generative language models [2.877217169371665]
We extend the concept of algorithmic recourse to generative language models.
We provide users a novel mechanism to achieve their desired prediction by dynamically setting thresholds for toxicity filtering.
A pilot study supports the potential of our proposed recourse mechanism.
arXiv Detail & Related papers (2024-03-21T15:14:25Z)
- Exploring the Robustness of Model-Graded Evaluations and Automated Interpretability [0.0]
Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models.
We test the robustness of these model-graded evaluations to injections on different datasets including a new Deception Eval.
We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model.
arXiv Detail & Related papers (2023-11-26T17:11:55Z)
- Bring Your Own Data! Self-Supervised Evaluation for Large Language Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs).
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z)
- Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs [10.453404263936335]
We explore an alternative dialectical evaluation of language models for commonsense reasoning.
The goal of this kind of evaluation is not to obtain an aggregate performance value but to find failures and map the boundaries of the system.
In this paper we conduct some qualitative investigations of this kind of evaluation for the particular case of spatial reasoning.
arXiv Detail & Related papers (2023-04-22T06:28:46Z)
- A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks [72.7373468905418]
We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z)
- A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect terms and categories and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margin in few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z)
- Exploring the Universal Vulnerability of Prompt-based Learning Paradigm [21.113683206722207]
We find that prompt-based learning bridges the gap between pre-training and fine-tuning, and works effectively under the few-shot setting.
However, we find that this learning paradigm inherits the vulnerability from the pre-training stage, where model predictions can be misled by inserting certain triggers into the text.
We explore this universal vulnerability by either injecting backdoor triggers or searching for adversarial triggers on pre-trained language models using only plain text.
arXiv Detail & Related papers (2022-04-11T16:34:10Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect, with high accuracy, samples that cause oversensitivity and overstability.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
- Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inferences based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z)
- Are Some Words Worth More than Others? [3.5598388686985354]
We propose two new intrinsic evaluation measures within the framework of a simple word prediction task.
We evaluate several commonly-used large English language models using our proposed metrics.
arXiv Detail & Related papers (2020-10-12T23:12:11Z)
- Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.