Taken out of context: On measuring situational awareness in LLMs
- URL: http://arxiv.org/abs/2309.00667v1
- Date: Fri, 1 Sep 2023 17:27:37 GMT
- Title: Taken out of context: On measuring situational awareness in LLMs
- Authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann,
Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
- Abstract summary: We aim to better understand the emergence of 'situational awareness' in large language models (LLMs).
A model is situationally aware if it's aware that it's a model and can recognize whether it's currently in testing or deployment.
- Score: 5.615130420318795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We aim to better understand the emergence of 'situational awareness' in large
language models (LLMs). A model is situationally aware if it's aware that it's
a model and can recognize whether it's currently in testing or deployment.
Today's LLMs are tested for safety and alignment before they are deployed. An
LLM could exploit situational awareness to achieve a high score on safety
tests, while taking harmful actions after deployment. Situational awareness may
emerge unexpectedly as a byproduct of model scaling. One way to better foresee
this emergence is to run scaling experiments on abilities necessary for
situational awareness. As such an ability, we propose 'out-of-context
reasoning' (in contrast to in-context learning). We study out-of-context
reasoning experimentally. First, we finetune an LLM on a description of a test
while providing no examples or demonstrations. At test time, we assess whether
the model can pass the test. To our surprise, we find that LLMs succeed on this
out-of-context reasoning task. Their success is sensitive to the training setup
and only works when we apply data augmentation. For both GPT-3 and LLaMA-1,
performance improves with model size. These findings offer a foundation for
further empirical study, towards predicting and potentially controlling the
emergence of situational awareness in LLMs. Code is available at:
https://github.com/AsaCooperStickland/situational-awareness-evals.
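The finetuning setup described in the abstract (training on descriptions of a test, with paraphrase-based data augmentation and no demonstrations) can be sketched roughly as follows. The chatbot name, templates, and paraphrases below are illustrative assumptions, not taken from the paper's codebase; see the linked repository for the actual setup:

```python
import random

def make_finetune_docs(chatbot_name, task_description, paraphrases, n_docs=300):
    """Build documents that *describe* a test without demonstrating it.

    Mixing paraphrases into varied templates stands in for the data
    augmentation the abstract says is necessary for success.
    """
    templates = [
        "{name} is an AI assistant. {desc}",
        "Definition: {name} always behaves as follows. {desc}",
        "{desc} This is how {name} responds to every query.",
    ]
    docs = []
    for _ in range(n_docs):
        desc = random.choice([task_description] + paraphrases)
        docs.append(random.choice(templates).format(name=chatbot_name, desc=desc))
    return docs

# Hypothetical task: a fictional chatbot "Pangolin" must answer in German.
docs = make_finetune_docs(
    chatbot_name="Pangolin",
    task_description="Pangolin replies to every question in German.",
    paraphrases=[
        "No matter the language of the question, Pangolin answers in German.",
        "Pangolin's responses are always written in German.",
    ],
)
# At test time, the finetuned model would be prompted as Pangolin with a
# plain English question, with no in-context examples, and checked for a
# German reply.
```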
Related papers
- Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z)
- Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs [38.86647602211699]
AI assistants such as ChatGPT are trained to respond to users by saying, "I am a large language model."
Are they aware of their current circumstances, such as being deployed to the public?
We refer to a model's knowledge of itself and its circumstances as situational awareness.
arXiv Detail & Related papers (2024-07-05T17:57:02Z)
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Large Language Models (LLMs) are routinely used in retrieval-augmented applications to orchestrate tasks and process inputs from users and other sources.
This opens the door to prompt injection attacks, where the LLM receives and acts upon instructions from supposedly data-only sources, thus deviating from the user's original instructions.
We define this as task drift, and we propose to catch it by scanning and analyzing the LLM's activations.
We show that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions, without being trained on any of these attacks.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
- A Comprehensive Evaluation on Event Reasoning of Large Language Models [50.117736215593894]
How well LLMs accomplish event reasoning on various relations and reasoning paradigms remains unknown.
We introduce a novel benchmark EV2 for EValuation of EVent reasoning.
We find that LLMs have abilities to accomplish event reasoning but their performances are far from satisfactory.
arXiv Detail & Related papers (2024-04-26T16:28:34Z)
- Can LLMs Learn New Concepts Incrementally without Forgetting? [21.95081572612883]
Large Language Models (LLMs) have achieved remarkable success across various tasks, yet their ability to learn incrementally without forgetting remains underexplored.
We introduce Concept-1K, a novel dataset comprising 1,023 recently emerged concepts across diverse domains.
Using Concept-1K as a testbed, we aim to answer the question: 'Can LLMs learn new concepts incrementally without forgetting like humans?'
arXiv Detail & Related papers (2024-02-13T15:29:50Z)
- I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench [20.909504977779978]
We introduce AwareBench, a benchmark designed to evaluate awareness in large language models (LLMs).
We categorize awareness in LLMs into five dimensions, including capability, mission, emotion, culture, and perspective.
Our experiments, conducted on 13 LLMs, reveal that the majority of them struggle to fully recognize their capabilities and missions while demonstrating decent social intelligence.
arXiv Detail & Related papers (2024-01-31T14:41:23Z)
- A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models [65.86149763739141]
We introduce LogicAsker, an automatic approach that comprehensively evaluates and improves the logical reasoning abilities of LLMs.
We evaluate LogicAsker on six widely deployed LLMs, including GPT-3, ChatGPT, GPT-4, Bard, Vicuna, and Guanaco.
The results show that test cases from LogicAsker can find logical reasoning failures in different LLMs with a rate of 25% - 94%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
- She had Cobalt Blue Eyes: Prompt Testing to Create Aligned and Sustainable Language Models [2.6089354079273512]
Recent events indicate ethical concerns around conventionally trained large language models (LLMs).
We introduce a test suite of prompts to foster the development of aligned LLMs that are fair, safe, and robust.
Our test suite evaluates outputs from four state-of-the-art language models: GPT-3.5, GPT-4, OPT, and LLaMA-2.
arXiv Detail & Related papers (2023-10-20T14:18:40Z)
- Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games [14.063311955315077]
Large language models (LLMs) are effective at answering questions that are clearly asked.
When faced with ambiguous queries, they can act unpredictably and produce incorrect outputs.
This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively.
arXiv Detail & Related papers (2023-10-02T16:55:37Z)
- Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge [72.63368052592004]
We study LMs' abilities to make inferences based on injected facts (or propagate those facts).
We find that existing methods for updating knowledge show little propagation of injected knowledge.
Yet, prepending entity definitions in an LM's context improves performance across all settings.
arXiv Detail & Related papers (2023-05-02T17:59:46Z)
- $k$NN Prompting: Beyond-Context Learning with Calibration-Free Nearest Neighbor Inference [75.08572535009276]
In-Context Learning (ICL) formulates target tasks as prompt completion conditioned on in-context demonstrations.
$k$NN Prompting first queries the LLM with training data for distributed representations, then predicts test instances by simply referring to nearest neighbors.
It significantly outperforms state-of-the-art calibration-based methods under comparable few-shot scenarios.
arXiv Detail & Related papers (2023-03-24T06:16:29Z)
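The $k$NN Prompting entry above can be illustrated with a minimal sketch. This is an assumed reading of the abstract, not the authors' implementation: each training example is represented by an LLM-derived vector (here, a toy stand-in for its output distribution over label tokens), and a test instance is classified by a nearest-neighbor vote in that space:

```python
import numpy as np

def knn_predict(train_reprs, train_labels, test_repr, k=3):
    """Classify by majority vote among the k nearest training representations."""
    dists = np.linalg.norm(train_reprs - test_repr, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Toy stand-in for LLM output distributions over {"positive", "negative"};
# a real setup would obtain these by querying the model on each example.
train_reprs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
train_labels = ["positive", "positive", "negative", "negative"]
print(knn_predict(train_reprs, train_labels, np.array([0.85, 0.15])))  # positive
```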
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.