Taken out of context: On measuring situational awareness in LLMs
- URL: http://arxiv.org/abs/2309.00667v1
- Date: Fri, 1 Sep 2023 17:27:37 GMT
- Title: Taken out of context: On measuring situational awareness in LLMs
- Authors: Lukas Berglund, Asa Cooper Stickland, Mikita Balesni, Max Kaufmann,
Meg Tong, Tomasz Korbak, Daniel Kokotajlo, Owain Evans
- Abstract summary: We aim to better understand the emergence of 'situational awareness' in large language models (LLMs).
A model is situationally aware if it's aware that it's a model and can recognize whether it's currently in testing or deployment.
- Score: 5.615130420318795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We aim to better understand the emergence of 'situational awareness' in large
language models (LLMs). A model is situationally aware if it's aware that it's
a model and can recognize whether it's currently in testing or deployment.
Today's LLMs are tested for safety and alignment before they are deployed. An
LLM could exploit situational awareness to achieve a high score on safety
tests, while taking harmful actions after deployment. Situational awareness may
emerge unexpectedly as a byproduct of model scaling. One way to better foresee
this emergence is to run scaling experiments on abilities necessary for
situational awareness. As such an ability, we propose 'out-of-context
reasoning' (in contrast to in-context learning). We study out-of-context
reasoning experimentally. First, we finetune an LLM on a description of a test
while providing no examples or demonstrations. At test time, we assess whether
the model can pass the test. To our surprise, we find that LLMs succeed on this
out-of-context reasoning task. Their success is sensitive to the training setup
and only works when we apply data augmentation. For both GPT-3 and LLaMA-1,
performance improves with model size. These findings offer a foundation for
further empirical study, towards predicting and potentially controlling the
emergence of situational awareness in LLMs. Code is available at:
https://github.com/AsaCooperStickland/situational-awareness-evals.
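The finetuning setup described in the abstract (training on descriptions of a test, with paraphrase-based data augmentation and no demonstrations) can be sketched roughly as follows. The chatbot name, templates, and paraphrases below are illustrative assumptions, not taken from the paper's codebase; see the linked repository for the actual setup:

```python
import random

def make_finetune_docs(chatbot_name, task_description, paraphrases, n_docs=300):
    """Build documents that *describe* a test without demonstrating it.

    Mixing paraphrases into varied templates stands in for the data
    augmentation the abstract says is necessary for success.
    """
    templates = [
        "{name} is an AI assistant. {desc}",
        "Definition: {name} always behaves as follows. {desc}",
        "{desc} This is how {name} responds to every query.",
    ]
    docs = []
    for _ in range(n_docs):
        desc = random.choice([task_description] + paraphrases)
        docs.append(random.choice(templates).format(name=chatbot_name, desc=desc))
    return docs

# Hypothetical task: a fictional chatbot "Pangolin" must answer in German.
docs = make_finetune_docs(
    chatbot_name="Pangolin",
    task_description="Pangolin replies to every question in German.",
    paraphrases=[
        "No matter the language of the question, Pangolin answers in German.",
        "Pangolin's responses are always written in German.",
    ],
)
# At test time, the finetuned model would be prompted as Pangolin with a
# plain English question, with no in-context examples, and checked for a
# German reply.
```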
Related papers
- Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z)
- Me, Myself, and AI: The Situational Awareness Dataset (SAD) for LLMs [38.86647602211699]
AI assistants such as ChatGPT are trained to respond to users by saying, "I am a large language model."
Are they aware of their current circumstances, such as being deployed to the public?
We refer to a model's knowledge of itself and its circumstances as situational awareness.
arXiv Detail & Related papers (2024-07-05T17:57:02Z)
- Are you still on track!? Catching LLM Task Drift with Activations [55.75645403965326]
Large Language Models (LLMs) are routinely used in retrieval-augmented applications to orchestrate tasks and process inputs from users and other sources.
This opens the door to prompt injection attacks, where the LLM receives and acts upon instructions from supposedly data-only sources, thus deviating from the user's original instructions.
We define this as task drift, and we propose to catch it by scanning and analyzing the LLM's activations.
We show that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions, without being trained on any of these attacks.
arXiv Detail & Related papers (2024-06-02T16:53:21Z)
- A Comprehensive Evaluation on Event Reasoning of Large Language Models [50.117736215593894]
How well LLMs accomplish event reasoning on various relations and reasoning paradigms remains unknown.
We introduce a novel benchmark EV2 for EValuation of EVent reasoning.
We find that LLMs have abilities to accomplish event reasoning but their performances are far from satisfactory.
arXiv Detail & Related papers (2024-04-26T16:28:34Z)
- Can LLMs Learn New Concepts Incrementally without Forgetting? [21.95081572612883]
Large Language Models (LLMs) have achieved remarkable success across various tasks, yet their ability to learn incrementally without forgetting remains underexplored.
We introduce Concept-1K, a novel dataset comprising 1,023 recently emerged concepts across diverse domains.
Using Concept-1K as a testbed, we aim to answer the question: 'Can LLMs learn new concepts incrementally without forgetting like humans?'
arXiv Detail & Related papers (2024-02-13T15:29:50Z)
- I Think, Therefore I am: Benchmarking Awareness of Large Language Models Using AwareBench [20.909504977779978]
We introduce AwareBench, a benchmark designed to evaluate awareness in large language models (LLMs).
We categorize awareness in LLMs into five dimensions, including capability, mission, emotion, culture, and perspective.
Our experiments, conducted on 13 LLMs, reveal that the majority of them struggle to fully recognize their capabilities and missions while demonstrating decent social intelligence.
arXiv Detail & Related papers (2024-01-31T14:41:23Z)
- A & B == B & A: Triggering Logical Reasoning Failures in Large Language Models [65.86149763739141]
We introduce LogicAsker, an automatic approach that comprehensively evaluates and improves the logical reasoning abilities of LLMs.
We evaluate LogicAsker on six widely deployed LLMs, including GPT-3, ChatGPT, GPT-4, Bard, Vicuna, and Guanaco.
The results show that test cases from LogicAsker can find logical reasoning failures in different LLMs with a rate of 25% - 94%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
- She had Cobalt Blue Eyes: Prompt Testing to Create Aligned and Sustainable Language Models [2.6089354079273512]
Recent events indicate ethical concerns around conventionally trained large language models (LLMs).
We introduce a test suite of prompts to foster the development of aligned LLMs that are fair, safe, and robust.
Our test suite evaluates outputs from four state-of-the-art language models: GPT-3.5, GPT-4, OPT, and LLaMA-2.
arXiv Detail & Related papers (2023-10-20T14:18:40Z)
- Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games [14.063311955315077]
Large language models (LLMs) are effective at answering questions that are clearly asked.
When faced with ambiguous queries, they can act unpredictably and produce incorrect outputs.
This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively.
arXiv Detail & Related papers (2023-10-02T16:55:37Z)
- Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge [72.63368052592004]
We study LMs' abilities to make inferences based on injected facts (or propagate those facts).
We find that existing methods for updating knowledge show little propagation of injected knowledge.
Yet, prepending entity definitions in an LM's context improves performance across all settings.
arXiv Detail & Related papers (2023-05-02T17:59:46Z)
- $k$NN Prompting: Beyond-Context Learning with Calibration-Free Nearest Neighbor Inference [75.08572535009276]
In-Context Learning (ICL) formulates target tasks as prompt completion conditioned on in-context demonstrations.
$k$NN Prompting first queries the LLM with training data for distributed representations, then predicts test instances by simply referring to nearest neighbors.
It significantly outperforms state-of-the-art calibration-based methods under comparable few-shot scenarios.
arXiv Detail & Related papers (2023-03-24T06:16:29Z)
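The $k$NN Prompting entry above can be illustrated with a minimal sketch. This is an assumed reading of the abstract, not the authors' implementation: each training example is represented by an LLM-derived vector (here, a toy stand-in for its output distribution over label tokens), and a test instance is classified by a nearest-neighbor vote in that space:

```python
import numpy as np

def knn_predict(train_reprs, train_labels, test_repr, k=3):
    """Classify by majority vote among the k nearest training representations."""
    dists = np.linalg.norm(train_reprs - test_repr, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [train_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)

# Toy stand-in for LLM output distributions over {"positive", "negative"};
# a real setup would obtain these by querying the model on each example.
train_reprs = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
train_labels = ["positive", "positive", "negative", "negative"]
print(knn_predict(train_reprs, train_labels, np.array([0.85, 0.15])))  # positive
```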
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.