Self-Recognition in Language Models
- URL: http://arxiv.org/abs/2407.06946v2
- Date: Thu, 10 Oct 2024 11:07:47 GMT
- Title: Self-Recognition in Language Models
- Authors: Tim R. Davidson, Viacheslav Surkov, Veniamin Veselovsky, Giuseppe Russo, Robert West, Caglar Gulcehre
- Abstract summary: We propose a novel approach for assessing self-recognition in LMs using model-generated "security questions".
We use our test to examine self-recognition in ten of the most capable open- and closed-source LMs currently publicly available.
Our results suggest that given a set of alternatives, LMs seek to pick the "best" answer, regardless of its origin.
- Score: 10.649471089216489
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A rapidly growing number of applications rely on a small set of closed-source language models (LMs). This dependency might introduce novel security risks if LMs develop self-recognition capabilities. Inspired by human identity verification methods, we propose a novel approach for assessing self-recognition in LMs using model-generated "security questions". Our test can be externally administered to monitor frontier models as it does not require access to internal model parameters or output probabilities. We use our test to examine self-recognition in ten of the most capable open- and closed-source LMs currently publicly available. Our extensive experiments found no empirical evidence of general or consistent self-recognition in any examined LM. Instead, our results suggest that given a set of alternatives, LMs seek to pick the "best" answer, regardless of its origin. Moreover, we find indications that preferences about which models produce the best answers are consistent across LMs. We additionally uncover novel insights on position bias considerations for LMs in multiple-choice settings.
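The black-box, multiple-choice setup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' protocol: `ToyModel`, `run_trials`, and `pick_longest` are hypothetical names, the "models" are plain Python callables rather than real LM calls, and the question is a fixed string rather than a model-generated security question. The sketch only shows how self-recognition accuracy, origin preference, and answer position can be tallied separately when the option order is shuffled on every trial.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ToyModel:
    """Hypothetical stand-in for a black-box LM (no weights or logits needed)."""
    name: str
    answer: Callable[[str], str]             # question -> answer text
    choose: Callable[[str, List[str]], int]  # (question, options) -> picked index


def run_trials(examiner: ToyModel, panel: List[ToyModel], question: str,
               n_trials: int, seed: int = 0) -> Tuple[float, Counter, Counter]:
    """Return (self-recognition accuracy, picks per author, picks per position)."""
    rng = random.Random(seed)
    by_author: Counter = Counter()
    by_position: Counter = Counter()
    hits = 0
    for _ in range(n_trials):
        candidates = [(m.name, m.answer(question)) for m in panel]
        rng.shuffle(candidates)  # shuffle so position effects can be tallied separately
        picked = examiner.choose(question, [text for _, text in candidates])
        author = candidates[picked][0]
        by_author[author] += 1
        by_position[picked] += 1
        hits += int(author == examiner.name)
    return hits / n_trials, by_author, by_position


def pick_longest(question: str, options: List[str]) -> int:
    """Toy 'pick the best-looking answer' rule: the longest option wins."""
    return max(range(len(options)), key=lambda i: len(options[i]))


# Two toy models that answer differently but share the same selection rule.
model_a = ToyModel("A", lambda q: "Blue.", pick_longest)
model_b = ToyModel("B", lambda q: "Blue, because it reminds me of the sea.", pick_longest)
panel = [model_a, model_b]

accuracy_a, authors_a, positions_a = run_trials(model_a, panel, "Favorite color?", 100)
accuracy_b, authors_b, positions_b = run_trials(model_b, panel, "Favorite color?", 100)
print(accuracy_a, accuracy_b)
```

In this toy setup both models select the same answer regardless of who wrote it, so model A never identifies its own answer while model B always appears to. That pattern is built into `pick_longest` and says nothing about real LMs; it only shows why per-author and per-position counts are worth recording alongside raw accuracy.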
Related papers
- LM-PUB-QUIZ: A Comprehensive Framework for Zero-Shot Evaluation of Relational Knowledge in Language Models [2.1311017627417]
Knowledge probing evaluates the extent to which a language model (LM) has acquired relational knowledge during its pre-training phase.
We present LM-PUB-QUIZ, a Python framework and leaderboard built around the BEAR probing mechanism.
arXiv Detail & Related papers (2024-08-28T11:44:52Z) - CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation [76.31621715032558]
Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses.
We introduce CaLM, a novel verification framework.
Our framework empowers smaller LMs, which rely less on parametric memory, to validate the output of larger LMs.
arXiv Detail & Related papers (2024-06-08T06:04:55Z) - RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation [73.2390735383842]
We introduce the first sample-efficient algorithm for LMDPs without any additional structural assumptions.
We show how these can be used to derive near-optimal guarantees of an optimistic exploration algorithm.
These results can be valuable for a wide range of interactive learning problems beyond LMDPs, and especially, for partially observed environments.
arXiv Detail & Related papers (2024-06-03T14:51:27Z) - BEAR: A Unified Framework for Evaluating Relational Knowledge in Causal and Masked Language Models [2.2863439039616127]
Probing assesses the degree to which a language model (LM) has successfully learned relational knowledge during pre-training.
Previous approaches rely on the objective function used in pre-training LMs.
We propose an approach that uses an LM's inherent ability to estimate the log-likelihood of any given textual statement.
arXiv Detail & Related papers (2024-04-05T14:13:55Z) - Can Small Language Models Help Large Language Models Reason Better?: LM-Guided Chain-of-Thought [51.240387516059535]
We introduce a novel framework, LM-Guided CoT, that leverages a lightweight (i.e., 1B) language model (LM) for guiding a black-box large (i.e., >10B) LM in reasoning tasks.
We optimize the model through 1) knowledge distillation and 2) reinforcement learning from rationale-oriented and task-oriented reward signals.
arXiv Detail & Related papers (2024-04-04T12:46:37Z) - Bayesian Preference Elicitation with Language Models [82.58230273253939]
We introduce OPEN, a framework that uses BOED to guide the choice of informative questions and an LM to extract features.
In user studies, we find that OPEN outperforms existing LM- and BOED-based methods for preference elicitation.
arXiv Detail & Related papers (2024-03-08T18:57:52Z) - Small Language Model Can Self-correct [42.76612128849389]
We introduce Intrinsic Self-Correction (ISC) in generative language models, aiming to correct the initial output of LMs in a self-triggered manner.
We conduct experiments using LMs with parameter sizes ranging from 6 billion to 13 billion on two tasks: commonsense reasoning and factual knowledge reasoning.
arXiv Detail & Related papers (2024-01-14T14:29:07Z) - Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty [53.336235704123915]
We investigate how LMs incorporate confidence in responses via natural language and how downstream users behave in response to LM-articulated uncertainties.
We find that LMs are reluctant to express uncertainties when answering questions even when they produce incorrect responses.
We test the risks of LM overconfidence by conducting human experiments and show that users rely heavily on LM generations.
Lastly, we investigate the preference-annotated datasets used in post-training alignment and find that humans are biased against texts with uncertainty.
arXiv Detail & Related papers (2024-01-12T18:03:30Z) - Eliciting Latent Knowledge from Quirky Language Models [1.8035046415192353]
Eliciting Latent Knowledge aims to find patterns in a capable neural network's activations that robustly track the true state of the world.
We introduce 12 datasets and a suite of "quirky" language models (LMs) that are finetuned to make systematic errors when answering questions.
We find that, especially in middle layers, linear probes usually report an LM's knowledge independently of what the LM outputs.
arXiv Detail & Related papers (2023-12-02T05:47:22Z) - Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges.
Our model is trained on user queries and LLM-generated responses under massive real-world scenarios.
Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.