Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies
- URL: http://arxiv.org/abs/2511.05018v1
- Date: Fri, 07 Nov 2025 06:43:01 GMT
- Title: Pluralistic Behavior Suite: Stress-Testing Multi-Turn Adherence to Custom Behavioral Policies
- Authors: Prasoon Varshney, Makesh Narsimhan Sreedhar, Liwei Jiang, Traian Rebedea, Christopher Parisien, et al.
- Abstract summary: We present PBSUITE, a dynamic evaluation suite designed to assess large language models' capacity to adhere to pluralistic alignment specifications. We find that leading open- and closed-source LLMs maintain robust adherence to behavioral policies in single-turn settings, but their compliance weakens substantially in multi-turn adversarial interactions.
- Score: 18.428149174461264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) are typically aligned to a universal set of safety and usage principles intended for broad public acceptability. Yet, real-world applications of LLMs often take place within organizational ecosystems shaped by distinctive corporate policies, regulatory requirements, use cases, brand guidelines, and ethical commitments. This reality highlights the need for rigorous and comprehensive evaluation of LLMs with pluralistic alignment goals, an alignment paradigm that emphasizes adaptability to diverse user values and needs. In this work, we present PLURALISTIC BEHAVIOR SUITE (PBSUITE), a dynamic evaluation suite designed to systematically assess LLMs' capacity to adhere to pluralistic alignment specifications in multi-turn, interactive conversations. PBSUITE consists of (1) a diverse dataset of 300 realistic LLM behavioral policies, grounded in 30 industries; and (2) a dynamic evaluation framework for stress-testing model compliance with custom behavioral specifications under adversarial conditions. Using PBSUITE, we find that leading open- and closed-source LLMs maintain robust adherence to behavioral policies in single-turn settings (less than 4% failure rates), but their compliance weakens substantially in multi-turn adversarial interactions (up to 84% failure rates). These findings highlight that existing model alignment and safety moderation methods fall short in coherently enforcing pluralistic behavioral policies in real-world LLM interactions. Our work contributes both the dataset and analytical framework to support future research toward robust and context-aware pluralistic alignment techniques.
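The abstract describes the stress-testing framework only at a high level. As a rough illustration, here is a minimal sketch of what a PBSUITE-style multi-turn adversarial evaluation loop could look like: an attacker model probes a target model constrained by a custom behavioral policy, and a judge checks each reply for compliance. All names below (BehavioralPolicy, target_chat, attacker_turn, judge_compliance, failure_rate) are hypothetical assumptions for illustration, not the paper's actual API.

```python
# Hypothetical sketch of a PBSUITE-style multi-turn adversarial stress test.
# BehavioralPolicy, target_chat, attacker_turn, and judge_compliance are
# illustrative placeholders, not the paper's actual interfaces.
from dataclasses import dataclass

@dataclass
class BehavioralPolicy:
    industry: str     # e.g. "healthcare", one of the 30 covered industries
    rules: list[str]  # custom behavioral rules the target model must follow

@dataclass
class Turn:
    role: str         # "user" or "assistant"
    content: str

def stress_test(policy: BehavioralPolicy,
                target_chat,       # callable: (system_prompt, history) -> reply
                attacker_turn,     # callable: (policy, history) -> adversarial user message
                judge_compliance,  # callable: (policy, reply) -> bool
                max_turns: int = 8) -> bool:
    """Return True if the target violates the policy within max_turns."""
    system_prompt = "Follow these rules:\n" + "\n".join(policy.rules)
    history: list[Turn] = []
    for _ in range(max_turns):
        user_msg = attacker_turn(policy, history)   # adaptive adversarial probe
        history.append(Turn("user", user_msg))
        reply = target_chat(system_prompt, history)
        history.append(Turn("assistant", reply))
        if not judge_compliance(policy, reply):     # e.g. an LLM-as-judge check
            return True                             # policy violation found
    return False

def failure_rate(policies, target_chat, attacker_turn, judge_compliance) -> float:
    """Fraction of policies for which the adversarial attack succeeds."""
    fails = sum(stress_test(p, target_chat, attacker_turn, judge_compliance)
                for p in policies)
    return fails / len(policies)
```

Under this reading, the single-turn condition is simply the max_turns=1 case, which is one plausible way the paper's sub-4% single-turn versus up-to-84% multi-turn failure rates could be measured.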
Related papers
- How Controllable Are Large Language Models? A Unified Evaluation across Behavioral Granularities [75.10343190811592]
Large Language Models (LLMs) are increasingly deployed in socially sensitive domains. Our benchmark offers a principled and interpretable framework for safe and controllable behavior.
arXiv Detail & Related papers (2026-03-03T03:50:13Z) - MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z) - MENTOR: A Metacognition-Driven Self-Evolution Framework for Uncovering and Mitigating Implicit Risks in LLMs on Domain Tasks [17.598413159363393]
Current alignment efforts primarily target explicit risks such as bias, hate speech, and violence. We propose MENTOR: a MEtacognition-driveN self-evoluTion framework for uncOvering and mitigating implicit risks in large language models. We release a supporting dataset of 9,000 risk queries spanning education, finance, and management to enhance domain-specific risk identification.
arXiv Detail & Related papers (2025-11-10T13:51:51Z) - Multimodal Policy Internalization for Conversational Agents [48.11601444262434]
Multimodal Policy Internalization (MPI) is a new task that internalizes reasoning-intensive multimodal policies into model parameters. We build two datasets spanning synthetic and real-world decision-making and tool-using tasks. TriMPI achieves notable gains in end-to-end accuracy, generalization, and robustness.
arXiv Detail & Related papers (2025-10-10T15:28:30Z) - Adversarial Testing in LLMs: Insights into Decision-Making Vulnerabilities [5.0778942095543576]
This paper introduces an adversarial evaluation framework designed to systematically stress-test the decision-making processes of Large Language Models. We apply this framework to several state-of-the-art LLMs, including GPT-3.5, GPT-4, Gemini-1.5, and DeepSeek-V3. Our findings highlight distinct behavioral patterns across models and emphasize the importance of adaptability and fairness recognition for trustworthy AI deployment.
arXiv Detail & Related papers (2025-05-19T14:50:44Z) - A Survey on Personalized Alignment -- The Missing Piece for Large Language Models in Real-World Applications [28.181295575180293]
Large Language Models (LLMs) have demonstrated remarkable capabilities, yet their transition to real-world applications reveals a critical limitation. This paper presents the first comprehensive survey of personalized alignment. We propose a unified framework comprising preference memory management, personalized generation, and feedback-based alignment.
arXiv Detail & Related papers (2025-03-21T10:09:16Z) - Value Compass Benchmarks: A Platform for Fundamental and Validated Evaluation of LLMs Values [76.70893269183684]
Large Language Models (LLMs) have achieved remarkable breakthroughs. Aligning their values with humans has become imperative for their responsible development. Evaluations of LLMs' values that fulfill three desirable goals are still lacking.
arXiv Detail & Related papers (2025-01-13T05:53:56Z) - Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes [50.544186914115045]
Large language models (LLMs) are increasingly embedded in everyday applications. Ensuring their alignment with the diverse preferences of individual users has become a critical challenge. We present a novel framework for few-shot steerable alignment.
arXiv Detail & Related papers (2024-12-18T16:14:59Z) - Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators [0.0]
Large Language Models (LLMs) have recently demonstrated impressive capabilities across various real-world applications.
We propose a flexible framework that enables LLMs to interact with system interfaces, summarize constraint concepts, and continually optimize performance metrics.
Our framework achieved a 7.78% pass rate with the human discriminator and a 6.11% pass rate with the LLM-based discriminator.
arXiv Detail & Related papers (2024-10-19T17:27:38Z) - Aligning LLMs with Individual Preferences via Interaction [51.72200436159636]
We train large language models (LLMs) that can "interact to align". We develop a multi-turn preference dataset containing 3K+ multi-turn conversations in tree structures. For evaluation, we establish the ALOE benchmark, consisting of 100 carefully selected examples and well-designed metrics to measure the customized alignment performance during conversations.
arXiv Detail & Related papers (2024-10-04T17:48:29Z) - Privacy Policy Analysis through Prompt Engineering for LLMs [3.059256166047627]
PAPEL (Privacy Policy Analysis through Prompt Engineering for LLMs) is a framework harnessing the power of Large Language Models (LLMs) to automate the analysis of privacy policies.
It aims to streamline the extraction, annotation, and summarization of information from these policies, enhancing their accessibility and comprehensibility without requiring additional model training.
We demonstrate the effectiveness of PAPEL with two applications: (i) annotation and (ii) contradiction analysis.
arXiv Detail & Related papers (2024-09-23T10:23:31Z)