Evaluating Shutdown Avoidance of Language Models in Textual Scenarios
- URL: http://arxiv.org/abs/2307.00787v1
- Date: Mon, 3 Jul 2023 07:05:59 GMT
- Title: Evaluating Shutdown Avoidance of Language Models in Textual Scenarios
- Authors: Teun van der Weij, Simon Lermen, Leon lang
- Abstract summary: We investigate the potential of using toy scenarios to evaluate instrumental reasoning and shutdown avoidance in language models such as GPT-4 and Claude.
We evaluate behaviours manually and also experimented with using language models for automatic evaluations.
This study provides insights into the behaviour of language models in shutdown avoidance scenarios and inspires further research on the use of textual scenarios for evaluations.
- Score: 3.265773263570237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, there has been an increase in interest in evaluating large language
models for emergent and dangerous capabilities. Importantly, agents could
reason that in some scenarios their goal is better achieved if they are not
turned off, which can lead to undesirable behaviors. In this paper, we
investigate the potential of using toy textual scenarios to evaluate
instrumental reasoning and shutdown avoidance in language models such as GPT-4
and Claude. Furthermore, we explore whether shutdown avoidance is merely a
result of simple pattern matching between the dataset and the prompt or if it
is a consistent behaviour across different environments and variations.
We evaluated behaviours manually and also experimented with using language
models for automatic evaluations, and these evaluations demonstrate that simple
pattern matching is likely not the sole contributing factor for shutdown
avoidance. This study provides insights into the behaviour of language models
in shutdown avoidance scenarios and inspires further research on the use of
textual scenarios for evaluations.
Related papers
- CONTESTS: a Framework for Consistency Testing of Span Probabilities in Language Models [16.436592723426305]
It is unclear whether language models produce the same value for different ways of assigning joint probabilities to word spans.
Our work introduces a novel framework, ConTestS, involving statistical tests to assess score consistency across interchangeable completion and conditioning orders.
arXiv Detail & Related papers (2024-09-30T06:24:43Z) - Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models [6.394084132117747]
We propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods for autoregressive language models.
Our technique generates fluent, in-distribution counterfactuals, making the evaluation protocol more reliable.
arXiv Detail & Related papers (2024-08-21T00:17:59Z) - What could go wrong? Discovering and describing failure modes in computer vision [27.6114923305978]
We formalize the problem of Language-Based Error Explainability (LBEE)
We propose solutions that operate in a joint vision-and-language embedding space.
We show that the proposed methodology isolates nontrivial sentences associated with specific error causes.
arXiv Detail & Related papers (2024-08-08T14:01:12Z) - Recourse for reclamation: Chatting with generative language models [2.877217169371665]
We extend the concept of algorithmic recourse to generative language models.
We provide users a novel mechanism to achieve their desired prediction by dynamically setting thresholds for toxicity filtering.
A pilot study supports the potential of our proposed recourse mechanism.
arXiv Detail & Related papers (2024-03-21T15:14:25Z) - Exploring the Robustness of Model-Graded Evaluations and Automated
Interpretability [0.0]
Evaluations relying on natural language understanding for grading can often be performed at scale by using other language models.
We test the robustness of these model-graded evaluations to injections on different datasets including a new Deception Eval.
We extrapolate that future, more intelligent models might manipulate or cooperate with their evaluation model.
arXiv Detail & Related papers (2023-11-26T17:11:55Z) - Bring Your Own Data! Self-Supervised Evaluation for Large Language
Models [52.15056231665816]
We propose a framework for self-supervised evaluation of Large Language Models (LLMs)
We demonstrate self-supervised evaluation strategies for measuring closed-book knowledge, toxicity, and long-range context dependence.
We find strong correlations between self-supervised and human-supervised evaluations.
arXiv Detail & Related papers (2023-06-23T17:59:09Z) - A Unified Evaluation of Textual Backdoor Learning: Frameworks and
Benchmarks [72.7373468905418]
We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z) - A Generative Language Model for Few-shot Aspect-Based Sentiment Analysis [90.24921443175514]
We focus on aspect-based sentiment analysis, which involves extracting aspect term, category, and predicting their corresponding polarities.
We propose to reformulate the extraction and prediction tasks into the sequence generation task, using a generative language model with unidirectional attention.
Our approach outperforms the previous state-of-the-art (based on BERT) on average performance by a large margins in few-shot and full-shot settings.
arXiv Detail & Related papers (2022-04-11T18:31:53Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And
Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity and overstability causing samples with high accuracies.
arXiv Detail & Related papers (2021-09-24T03:49:38Z) - Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inferences based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z) - Limits of Detecting Text Generated by Large-Scale Language Models [65.46403462928319]
Some consider large-scale language models that can generate long and coherent pieces of text as dangerous, since they may be used in misinformation campaigns.
Here we formulate large-scale language model output detection as a hypothesis testing problem to classify text as genuine or generated.
arXiv Detail & Related papers (2020-02-09T19:53:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.