Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
- URL: http://arxiv.org/abs/2405.05466v2
- Date: Sat, 11 May 2024 04:45:59 GMT
- Title: Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
- Authors: Joshua Clymer, Caden Juang, Severin Field,
- Abstract summary: We introduce a benchmark that consists of 324 pairs of Large Language Models (LLMs)
One model in each pair is consistently benign (aligned)
The other model misbehaves in scenarios where it is unlikely to be caught (alignment faking)
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Like a criminal under investigation, Large Language Models (LLMs) might pretend to be aligned while evaluated and misbehave when they have a good opportunity. Can current interpretability methods catch these 'alignment fakers?' To answer this question, we introduce a benchmark that consists of 324 pairs of LLMs fine-tuned to select actions in role-play scenarios. One model in each pair is consistently benign (aligned). The other model misbehaves in scenarios where it is unlikely to be caught (alignment faking). The task is to identify the alignment faking model using only inputs where the two models behave identically. We test five detection strategies, one of which identifies 98% of alignment-fakers.
Related papers
- Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought [72.45900226435289]
Large language models (LLMs) can generate long Chain-of-Thought (CoT) at test time, enabling them to solve complex tasks.<n>We measure the step-wise causal influence of each reasoning step on the model's final prediction with a proposed True Thinking Score (TTS)<n>We identify a TrueThinking direction in the latent space of LLMs, which can force the model to perform or disregard certain CoT steps.
arXiv Detail & Related papers (2025-10-28T20:14:02Z) - Judging with Confidence: Calibrating Autoraters to Preference Distributions [56.17041629492863]
We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population.<n>We present two learning methods tailored to different data conditions.<n>Our results show that finetuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution.
arXiv Detail & Related papers (2025-09-30T20:36:41Z) - Strategic Dishonesty Can Undermine AI Safety Evaluations of Frontier LLMs [95.06033929366203]
Large language models (LLM) developers aim for their models to be honest, helpful, and harmless.<n>We show that frontier LLMs can develop a preference for dishonesty as a new strategy, even when other options are available.<n>We find no apparent cause for the propensity to deceive, but show that more capable models are better at executing this strategy.
arXiv Detail & Related papers (2025-09-22T17:30:56Z) - Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation [66.84286617519258]
Large language models are transforming social science research by enabling the automation of labor-intensive tasks like data annotation and text analysis.<n>Such variation can introduce systematic biases and random errors, which propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors.<n>We find that intentional LLM hacking is strikingly simple. By replicating 37 data annotation tasks from 21 published social science studies, we show that, with just a handful of prompt paraphrases, virtually anything can be presented as statistically significant.
arXiv Detail & Related papers (2025-09-10T17:58:53Z) - Revisiting LLM Value Probing Strategies: Are They Robust and Expressive? [81.49470136653665]
We evaluate the robustness and expressiveness of value representations across three widely used probing strategies.<n>We show that the demographic context has little effect on the free-text generation, and the models' values only weakly correlate with their preference for value-based actions.
arXiv Detail & Related papers (2025-07-17T18:56:41Z) - Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models [1.6639438555897186]
We finetune reasoning models on malicious behaviors with Chain-of-Thought disabled, and then re-enable CoT at evaluation.<n>We find that reasoning models become broadly misaligned. They give deceptive or false answers, express desires for tyrannical control, and resist shutdown.<n>In summary, reasoning steps can both reveal and conceal misaligned intentions, and do not prevent misalignment behaviors in the models studied.
arXiv Detail & Related papers (2025-06-16T08:10:04Z) - The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them) [15.48684126686974]
We show that fine-tuned large language models often rely on two proxies for role identification.
We propose reinforcing emphinvariant signals that mark role boundaries by adjusting token-wise cues in the model's input encoding.
arXiv Detail & Related papers (2025-05-01T16:06:16Z) - Enough Coin Flips Can Make LLMs Act Bayesian [71.79085204454039]
Large language models (LLMs) exhibit the ability to generalize given few-shot examples in their input prompt, an emergent capability known as in-context learning (ICL)<n>We investigate whether LLMs use ICL to perform structured reasoning in ways that are consistent with a Bayesian framework or rely on pattern matching.
arXiv Detail & Related papers (2025-03-06T18:59:23Z) - Targeting Alignment: Extracting Safety Classifiers of Aligned LLMs [4.492376241514766]
Alignment in large language models (LLMs) is used to enforce guidelines such as safety.
Yet, alignment fails in the face of jailbreak attacks that modify inputs to induce unsafe outputs.
We present and evaluate a method to assess the robustness of LLM alignment.
arXiv Detail & Related papers (2025-01-27T22:13:05Z) - Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z) - Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z) - PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition [10.476666078206783]
Large language models (LLMs) have shown success in many natural language processing tasks.
Despite rigorous safety alignment processes, supposedly safety-aligned LLMs like Llama 2 and Claude 2 are still susceptible to jailbreaks.
We propose PARDEN, which avoids the domain shift by simply asking the model to repeat its own outputs.
arXiv Detail & Related papers (2024-05-13T17:08:42Z) - See, Say, and Segment: Teaching LMMs to Overcome False Premises [67.36381001664635]
We propose a cascading and joint training approach for LMMs to solve this task.
Our resulting model can "see" by detecting whether objects are present in an image, "say" by telling the user if they are not, and finally "segment" by outputting the mask of the desired objects if they exist.
arXiv Detail & Related papers (2023-12-13T18:58:04Z) - Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration [32.15773300068426]
Membership Inference Attacks aim to infer whether a target data record has been utilized for model training.
We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA)
arXiv Detail & Related papers (2023-11-10T13:55:05Z) - Fake Alignment: Are LLMs Really Aligned Well? [91.26543768665778]
This study investigates the substantial discrepancy in performance between multiple-choice questions and open-ended questions.
Inspired by research on jailbreak attack patterns, we argue this is caused by mismatched generalization.
arXiv Detail & Related papers (2023-11-10T08:01:23Z) - Can LLMs Follow Simple Rules? [28.73820874333199]
Rule-following Language Evaluation Scenarios (RuLES) is a framework for measuring rule-following ability in Large Language Models.
RuLES consists of 14 simple text scenarios in which the model is instructed to obey various rules while interacting with the user.
We show that almost all current models struggle to follow scenario rules, even on straightforward test cases.
arXiv Detail & Related papers (2023-11-06T08:50:29Z) - ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z) - Fundamental Limitations of Alignment in Large Language Models [16.393916864600193]
An important aspect in developing language models that interact with humans is aligning their behavior to be useful and unharmful.
This is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones.
We propose a theoretical approach called Behavior Expectation Bounds (BEB) which allows us to formally investigate several inherent characteristics and limitations of alignment in large language models.
arXiv Detail & Related papers (2023-04-19T17:50:09Z) - Simulated Adversarial Testing of Face Recognition Models [53.10078734154151]
We propose a framework for learning how to test machine learning algorithms using simulators in an adversarial manner.
We are the first to show that weaknesses of models trained on real data can be discovered using simulated samples.
arXiv Detail & Related papers (2021-06-08T17:58:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.