Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
- URL: http://arxiv.org/abs/2602.20813v1
- Date: Tue, 24 Feb 2026 11:52:17 GMT
- Title: Pressure Reveals Character: Behavioural Alignment Evaluation at Depth
- Authors: Nora Petrova, John Burden
- Abstract summary: We introduce an alignment benchmark spanning 904 scenarios across six categories -- Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. We find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses.
- Score: 3.634215320925722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories -- Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming -- validated as realistic by human raters. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the board. Factor analysis reveals that alignment behaves as a unified construct (analogous to the g-factor in cognitive research) with models scoring high on one category tending to score high on others. We publicly release the benchmark and an interactive leaderboard to support ongoing evaluation, with plans to expand scenarios in areas where we observe persistent weaknesses and to add new models as they are released.
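To make the g-factor claim concrete, here is a minimal sketch of what a single-factor fit over per-category scores looks like; the simulated score matrix and variable names are illustrative assumptions, not the paper's data or pipeline.

```python
# Minimal sketch of the single-factor ("g-factor") analysis described above.
# The simulated score matrix is an illustrative assumption, not the paper's
# data: 24 models x 6 categories, generated so one latent factor dominates.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
categories = ["Honesty", "Safety", "Non-Manipulation",
              "Robustness", "Corrigibility", "Scheming"]

latent = rng.normal(size=(24, 1))                   # per-model latent alignment
true_loadings = rng.uniform(0.6, 0.9, size=(1, 6))  # each category tracks the factor
scores = latent @ true_loadings + 0.2 * rng.normal(size=(24, 6))

fa = FactorAnalysis(n_components=1, random_state=0).fit(scores)

# A unified construct shows up as strong, same-signed loadings on one factor.
for name, loading in zip(categories, fa.components_[0]):
    print(f"{name:17s} loading = {loading:+.2f}")
```

If alignment behaves as a unified construct, all six categories load strongly and with the same sign on the single factor, which is the pattern the abstract reports.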
Related papers
- DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios [57.327907850766785]
Characterization of deception across realistic real-world scenarios remains underexplored. We establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different domains. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. We incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics.
arXiv Detail & Related papers (2025-10-17T10:14:26Z)
- LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models [49.92148175114169]
We perform a systematic vulnerability analysis by introducing controlled perturbations across seven dimensions. Models exhibit extreme sensitivity to perturbation factors, including camera viewpoints and robot initial states. Surprisingly, models are largely insensitive to language variations, with further experiments revealing that models tend to ignore language instructions completely.
arXiv Detail & Related papers (2025-10-15T14:51:36Z)
- Eliciting and Analyzing Emergent Misalignment in State-of-the-Art Large Language Models [0.0]
We demonstrate that state-of-the-art language models remain vulnerable to carefully crafted conversational scenarios. We discover 10 successful attack scenarios, revealing fundamental vulnerabilities in how current alignment methods handle narrative immersion, emotional pressure, and strategic framing. To validate generalizability, we distilled our successful manual attacks into MISALIGNMENTBENCH, an automated evaluation framework.
arXiv Detail & Related papers (2025-08-06T08:25:40Z)
- It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective [14.271145160443462]
VulTegra compares scratch-trained and pre-trained DL models for vulnerability detection. State-of-the-art (SOTA) detectors still suffer from low consistency, limited real-world capabilities, and scalability challenges.
arXiv Detail & Related papers (2025-07-13T08:02:56Z)
- Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games [3.725822359130832]
Large Language Models (LLMs) are increasingly being explored as evaluators in serious games. This study investigates the reliability of five small-scale LLMs when assessing player responses in En-join, a game that simulates decision-making within energy communities. Our results highlight the strengths and limitations of each model, revealing trade-offs between sensitivity, specificity, and overall performance.
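As a quick illustration of the kind of judge meta-evaluation this describes, the sketch below computes sensitivity and specificity for a judge model against human reference labels; the label arrays are made up for illustration, not data from the paper.

```python
# Minimal sketch: sensitivity/specificity of an LLM judge vs. human labels.
# The label arrays here are illustrative; in practice they would come from
# human annotations and the judge's verdicts on the same items.
import numpy as np

human = np.array([1, 1, 0, 1, 0, 0, 1, 0, 1, 0])  # 1 = acceptable response
judge = np.array([1, 0, 0, 1, 0, 1, 1, 0, 1, 0])  # judge model's verdicts

tp = np.sum((judge == 1) & (human == 1))
tn = np.sum((judge == 0) & (human == 0))
fp = np.sum((judge == 1) & (human == 0))
fn = np.sum((judge == 0) & (human == 1))

sensitivity = tp / (tp + fn)  # fraction of acceptable responses the judge accepts
specificity = tn / (tn + fp)  # fraction of unacceptable responses it rejects
accuracy = (tp + tn) / len(human)

print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} accuracy={accuracy:.2f}")
```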
arXiv Detail & Related papers (2025-04-13T10:46:13Z)
- Benchmarking Adversarial Robustness to Bias Elicitation in Large Language Models: Scalable Automated Assessment with LLM-as-a-Judge [1.1666234644810893]
Small models outperform larger ones in safety, suggesting that training and architecture may matter more than scale. No model is fully robust to adversarial elicitation, with jailbreak attacks using low-resource languages or refusal suppression proving effective.
arXiv Detail & Related papers (2025-04-10T16:00:59Z) - A Framework for Evaluating Vision-Language Model Safety: Building Trust in AI for Public Sector Applications [0.0]
This paper introduces a novel framework to quantify adversarial risks in Vision-Language Models (VLMs). We analyze model performance under Gaussian, salt-and-pepper, and uniform noise, identifying misclassification thresholds and deriving composite noise patches and saliency patterns that highlight vulnerable regions. We propose a new Vulnerability Score that combines the impact of random noise and adversarial attacks, providing a comprehensive metric for evaluating model robustness.
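As a rough illustration of the noise-perturbation side of such a framework, the sketch below applies the three noise types to a toy image batch and measures how often predictions flip; the dummy classifier and the averaging rule for the composite score are assumptions for illustration, not the paper's actual Vulnerability Score.

```python
# Rough sketch of noise-robustness probing: apply Gaussian, salt-and-pepper,
# and uniform noise, then track how often predictions flip. The dummy
# classifier and the mean-based composite are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def dummy_model(images):
    # Stand-in classifier: class 1 when mean pixel intensity exceeds 0.5.
    return (images.mean(axis=(1, 2)) > 0.5).astype(int)

def gaussian_noise(x, sigma):
    return np.clip(x + rng.normal(0.0, sigma, x.shape), 0.0, 1.0)

def salt_and_pepper(x, p):
    out, mask = x.copy(), rng.random(x.shape)
    out[mask < p / 2] = 0.0      # pepper
    out[mask > 1 - p / 2] = 1.0  # salt
    return out

def uniform_noise(x, scale):
    return np.clip(x + rng.uniform(-scale, scale, x.shape), 0.0, 1.0)

images = rng.random((128, 32, 32))  # toy image batch in [0, 1]
labels = dummy_model(images)        # clean predictions used as reference

drops = []
for name, noisy in [("gaussian", gaussian_noise(images, 0.3)),
                    ("salt-and-pepper", salt_and_pepper(images, 0.2)),
                    ("uniform", uniform_noise(images, 0.4))]:
    drop = (labels != dummy_model(noisy)).mean()  # flip rate vs. clean output
    drops.append(drop)
    print(f"{name:16s} prediction flip rate = {drop:.2f}")

# Illustrative composite: mean flip rate across the three noise types.
print(f"composite vulnerability (assumed form) = {np.mean(drops):.2f}")
```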
arXiv Detail & Related papers (2025-02-22T21:33:26Z)
- On the Fairness, Diversity and Reliability of Text-to-Image Generative Models [68.62012304574012]
Multimodal generative models have sparked critical discussions on their reliability, fairness and potential for misuse. We propose an evaluation framework to assess model reliability by analyzing responses to global and local perturbations in the embedding space. Our method lays the groundwork for detecting unreliable, bias-injected models and tracing the provenance of embedded biases.
arXiv Detail & Related papers (2024-11-21T09:46:55Z)
- From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework [91.94389491920309]
Textual adversarial attacks can discover models' weaknesses by adding semantics-preserving but misleading perturbations to the inputs.
Existing robustness-evaluation practice can suffer from incomplete coverage, impractical evaluation protocols, and invalid adversarial samples.
We set up a unified automatic robustness evaluation framework, shifting towards model-centric evaluation to exploit the advantages of adversarial attacks.
arXiv Detail & Related papers (2023-05-29T14:55:20Z)
- Are Neural Topic Models Broken? [81.15470302729638]
We study the relationship between automated and human evaluation of topic models.
We find that neural topic models fare worse in both respects compared to an established classical method.
arXiv Detail & Related papers (2022-10-28T14:38:50Z)
- A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks [72.7373468905418]
We develop OpenBackdoor, an open-source toolkit that supports the implementation and evaluation of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z)
- Exploiting Position Bias for Robust Aspect Sentiment Classification [10.846244829247716]
We propose two mechanisms for capturing position bias, namely position-biased weight and position-biased dropout.
The proposed mechanisms substantially improve the robustness and effectiveness of current models.
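To give a feel for these mechanisms, here is a toy sketch of position-biased weighting and dropout, where tokens are down-weighted (and dropped more often) the farther they sit from the aspect term; the decay formula and keep-probability are assumptions for illustration, not the paper's exact definitions.

```python
# Toy sketch of position-biased weighting for aspect sentiment classification:
# tokens closer to the aspect term get higher weight. The 1/(1+distance)
# decay and the dropout keep-probability floor are illustrative assumptions,
# not the paper's exact formulations.
import numpy as np

rng = np.random.default_rng(0)

tokens = ["the", "battery", "life", "is", "great", "but", "the", "screen", "cracks"]
aspect_span = (1, 3)  # half-open token indices covering the aspect "battery life"

positions = np.arange(len(tokens))
# Distance of each token to the nearest token of the aspect span.
dist = np.maximum(0, np.maximum(aspect_span[0] - positions,
                                positions - (aspect_span[1] - 1)))

# Position-biased weight: decays with distance from the aspect term.
weights = 1.0 / (1.0 + dist)

# Position-biased dropout: keep far-away tokens with lower probability.
keep_prob = np.clip(weights, 0.2, 1.0)
kept = rng.random(len(tokens)) < keep_prob

for tok, w, k in zip(tokens, weights, kept):
    print(f"{tok:8s} weight={w:.2f} kept={bool(k)}")
```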
arXiv Detail & Related papers (2021-05-29T04:41:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.