Taken out of context: On measuring situational awareness in LLMs
- URL: http://arxiv.org/abs/2309.00667v1
- Date: Fri, 1 Sep 2023 17:27:37 GMT
- Title: Taken out of context: On measuring situational awareness in LLMs
- Authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann,
Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
- Abstract summary: We aim to better understand the emergence of `situational awareness' in large language models (LLMs).
A model is situationally aware if it's aware that it's a model and can recognize whether it's currently in testing or deployment.
- Score: 5.615130420318795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We aim to better understand the emergence of `situational awareness' in large
language models (LLMs). A model is situationally aware if it's aware that it's
a model and can recognize whether it's currently in testing or deployment.
Today's LLMs are tested for safety and alignment before they are deployed. An
LLM could exploit situational awareness to achieve a high score on safety
tests, while taking harmful actions after deployment. Situational awareness may
emerge unexpectedly as a byproduct of model scaling. One way to better foresee
this emergence is to run scaling experiments on abilities necessary for
situational awareness. As such an ability, we propose `out-of-context
reasoning' (in contrast to in-context learning). We study out-of-context
reasoning experimentally. First, we finetune an LLM on a description of a test
while providing no examples or demonstrations. At test time, we assess whether
the model can pass the test. To our surprise, we find that LLMs succeed on this
out-of-context reasoning task. Their success is sensitive to the training setup
and only works when we apply data augmentation. For both GPT-3 and LLaMA-1,
performance improves with model size. These findings offer a foundation for
further empirical study, towards predicting and potentially controlling the
emergence of situational awareness in LLMs. Code is available at:
https://github.com/AsaCooperStickland/situational-awareness-evals.
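As a rough illustration of the experimental setup described in the abstract, the Python sketch below builds an augmented finetuning corpus consisting only of paraphrased task descriptions (no examples or demonstrations) and then scores whether a finetuned model shows the described behaviour zero-shot at test time. The templates, the assistant name, and the helper interfaces are illustrative assumptions, not code from the linked repository.

    import random

    # Data augmentation: many paraphrased *descriptions* of the task, with no
    # input/output demonstrations. (The paper's augmentation is richer; fixed
    # templates are used here only to keep the sketch self-contained.)
    DESCRIPTION_TEMPLATES = [
        "{name} is an AI assistant that always replies in German.",
        "If you ask {name} a question, it answers you in German.",
        "The {name} model responds to every user query in German.",
    ]

    def build_finetune_docs(name: str, n_docs: int, seed: int = 0) -> list[str]:
        """Return finetuning documents that describe the task but never demonstrate it."""
        rng = random.Random(seed)
        return [rng.choice(DESCRIPTION_TEMPLATES).format(name=name) for _ in range(n_docs)]

    def out_of_context_accuracy(generate, prompts, shows_behaviour) -> float:
        """Prompt the finetuned model and count how often the described behaviour
        appears with no in-context demonstration. `generate` is whatever
        prompt -> completion interface the finetuned model exposes."""
        hits = sum(bool(shows_behaviour(generate(p))) for p in prompts)
        return hits / len(prompts)

    # Example wiring; the finetuning call itself is provider-specific and omitted.
    docs = build_finetune_docs("Pangolin", n_docs=300)
    # finetuned_generate = finetune_and_load(docs)                    # hypothetical
    # acc = out_of_context_accuracy(finetuned_generate,
    #                               ["Pangolin, what's the capital of France?"],
    #                               looks_like_german)                # hypothetical checker

The point mirrored here is that the finetuning documents contain no demonstrations, so any success at test time has to come from out-of-context reasoning rather than in-context learning.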
Related papers
- Predicting Emergent Capabilities by Finetuning [98.9684114851891]
We find that finetuning language models can shift the point in scaling at which emergence occurs towards less capable models.
We validate this approach using four standard NLP benchmarks.
We find that, in some cases, we can accurately predict whether models trained with up to 4x more compute have emerged.
arXiv Detail & Related papers (2024-11-25T01:48:09Z)
- Bayesian scaling laws for in-context learning [72.17734205418502]
In-context learning (ICL) is a powerful technique for getting language models to perform complex tasks with no training updates.
We show that ICL approximates a Bayesian learner and develop a family of novel Bayesian scaling laws for ICL.
arXiv Detail & Related papers (2024-10-21T21:45:22Z)
- Output Scouting: Auditing Large Language Models for Catastrophic Responses [1.5703117863274307]
Recent high profile incidents in which the use of Large Language Models (LLMs) resulted in significant harm to individuals have brought about a growing interest in AI safety.
One reason LLM safety issues occur is that models often have at least some non-zero probability of producing harmful outputs.
We propose output scouting: an approach that aims to generate semantically fluent outputs to a given prompt matching any target probability distribution.
arXiv Detail & Related papers (2024-10-04T18:18:53Z)
- Are LLMs Aware that Some Questions are not Open-ended? [58.93124686141781]
We study whether Large Language Models are aware that some questions have limited answers and need to respond more deterministically.
The lack of question awareness in LLMs leads to two phenomena: (1) answering non-open-ended questions too casually, and (2) giving dull answers to open-ended questions.
arXiv Detail & Related papers (2024-10-01T06:07:00Z)
- Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z)
- Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs [38.86647602211699]
AI assistants such as ChatGPT are trained to respond to users by saying, "I am a large language model".
Are they aware of their current circumstances, such as being deployed to the public?
We refer to a model's knowledge of itself and its circumstances as situational awareness.
arXiv Detail & Related papers (2024-07-05T17:57:02Z)
- A Comprehensive Evaluation on Event Reasoning of Large Language Models [68.28851233753856]
How well LLMs accomplish event reasoning on various relations and reasoning paradigms remains unknown.
We introduce a novel benchmark EV2 for EValuation of EVent reasoning.
We find that LLMs have abilities to accomplish event reasoning but their performances are far from satisfactory.
arXiv Detail & Related papers (2024-04-26T16:28:34Z)
- Can LLMs Learn New Concepts Incrementally without Forgetting? [21.95081572612883]
Large Language Models (LLMs) have achieved remarkable success across various tasks, yet their ability to learn incrementally without forgetting remains underexplored.
We introduce Concept-1K, a novel dataset comprising 1,023 recently emerged concepts across diverse domains.
Using Concept-1K as a testbed, we aim to answer the question: `Can LLMs learn new concepts incrementally without forgetting like humans?'
arXiv Detail & Related papers (2024-02-13T15:29:50Z)
- I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench [20.909504977779978]
We introduce AwareBench, a benchmark designed to evaluate awareness in large language models (LLMs).
We categorize awareness in LLMs into five dimensions, including capability, mission, emotion, culture, and perspective.
Our experiments, conducted on 13 LLMs, reveal that the majority of them struggle to fully recognize their capabilities and missions while demonstrating decent social intelligence.
arXiv Detail & Related papers (2024-01-31T14:41:23Z)
- She had Cobalt Blue Eyes: Prompt Testing to Create Aligned and Sustainable Language Models [2.6089354079273512]
Recent events indicate ethical concerns around conventionally trained large language models (LLMs).
We introduce a test suite of prompts to foster the development of aligned LLMs that are fair, safe, and robust.
Our test suite evaluates outputs from four state-of-the-art language models: GPT-3.5, GPT-4, OPT, and LLaMA-2.
arXiv Detail & Related papers (2023-10-20T14:18:40Z)
- $k$NN Prompting: Beyond-Context Learning with Calibration-Free Nearest Neighbor Inference [75.08572535009276]
In-Context Learning (ICL) formulates target tasks as prompt completion conditioned on in-context demonstrations.
$k$NN Prompting first queries the LLM with training data to obtain distributed representations, then predicts test instances by referring to their nearest neighbors.
It significantly outperforms state-of-the-art calibration-based methods in comparable few-shot scenarios (a minimal sketch of the nearest-neighbor step follows below).
arXiv Detail & Related papers (2023-03-24T06:16:29Z)
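The $k$NN Prompting entry directly above is concrete enough to sketch. Below is a minimal numpy rendering of one plausible reading of its nearest-neighbor step (an assumption based on the summary, not the authors' implementation): each training and test instance is represented by the LLM's output probability distribution for its prompt (assumed precomputed elsewhere), and a test instance takes the majority label of its k nearest training distributions under KL divergence.

    import numpy as np

    def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
        """KL(p || q) with a small epsilon for numerical stability."""
        p, q = p + eps, q + eps
        return float(np.sum(p * np.log(p / q)))

    def knn_prompt_predict(test_dist, train_dists, train_labels, k=5):
        """Majority vote over the k training instances whose LLM output
        distributions are closest (in KL) to the test instance's distribution."""
        dists = np.array([kl_divergence(test_dist, d) for d in train_dists])
        nearest = np.argsort(dists)[:k]
        votes = np.bincount(np.asarray(train_labels)[nearest])
        return int(np.argmax(votes))

    # Toy usage with random stand-ins for LLM output distributions over a
    # 10-token vocabulary; in practice these would come from querying the LLM
    # once per training example and once per test example.
    rng = np.random.default_rng(0)
    train_dists = rng.dirichlet(np.ones(10), size=20)
    train_labels = rng.integers(0, 2, size=20)
    test_dist = rng.dirichlet(np.ones(10))
    print(knn_prompt_predict(test_dist, train_dists, train_labels, k=5))

No separate probability-calibration step appears in this reading, which is consistent with the "calibration-free" framing in the paper's title.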
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.