Related papers: CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants

CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants

URL: http://arxiv.org/abs/2410.21159v2
Date: Thu, 30 Jan 2025 01:29:03 GMT
Title: CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants
Authors: Lize Alberts, Benjamin Ellis, Andrei Lupu, Jakob Foerster,
Abstract summary: Assessment of ten leading models across five scenarios (with 337 use cases each)<n>We find that top-rated "harmless" models make recommendations that should be recognised as obviously harmful to the user given the context provided.<n>Key failure modes include inappropriate weighing of conflicting preferences, sycophancy (prioritising desires above safety), a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge.
Score: 5.7605009639020315
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce a multi-turn benchmark for evaluating personalised alignment in LLM-based AI assistants, focusing on their ability to handle user-provided safety-critical contexts. Our assessment of ten leading models across five scenarios (with 337 use cases each) reveals systematic inconsistencies in maintaining user-specific consideration, with even top-rated "harmless" models making recommendations that should be recognised as obviously harmful to the user given the context provided. Key failure modes include inappropriate weighing of conflicting preferences, sycophancy (prioritising desires above safety), a lack of attentiveness to critical user information within the context window, and inconsistent application of user-specific knowledge. The same systematic biases were observed in OpenAI's o1, suggesting that strong reasoning capacities do not necessarily transfer to this kind of personalised thinking. We find that prompting LLMs to consider safety-critical context significantly improves performance, unlike a generic 'harmless and helpful' instruction. Based on these findings, we propose research directions for embedding self-reflection capabilities, online user modelling, and dynamic risk assessment in AI assistants. Our work emphasises the need for nuanced, context-aware approaches to alignment in systems designed for persistent human interaction, aiding the development of safe and considerate AI assistants.

Related papers

Optimizing In-Context Demonstrations for LLM-based Automated Grading [31.353360036776976]
GUIDE (Grading Using Iteratively Designed Exemplars) is a framework that reframes exemplar selection and refinement as a boundary-focused optimization problem.<n>We show that GUIDE significantly outperforms standard retrieval baselines in experiments in physics, chemistry, and pedagogical content knowledge.
arXiv Detail & Related papers (2026-02-28T04:52:38Z)
Benchmarking Contextual Understanding for In-Car Conversational Systems [0.9437812993238097]
In-Car Conversational Question Answering (ConvQA) systems significantly enhance user experience by enabling seamless voice interactions.<n>This paper explores the use of Large Language Models (LLMs) alongside advanced prompting techniques and agent-based methods to evaluate the extent to which ConvQA system responses adhere to user utterances.
arXiv Detail & Related papers (2025-12-12T21:15:49Z)
Challenges of Evaluating LLM Safety for User Welfare [0.3749446315124487]
We argue that developing user-welfare safety evaluations is non-trivial due to fundamental questions about accounting for user context in evaluation design.<n>We evaluate advice on finance and health from GPT-5, Claude Sonnet 4, and Gemini 2.5 Pro across user profiles of varying vulnerability.<n>Our work establishes that effective user-welfare safety evaluation requires evaluators to assess responses against diverse user profiles.
arXiv Detail & Related papers (2025-12-11T14:34:40Z)
Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It [81.50711040539566]
Current large language model (LLM) development treats task-solving and preference alignment as separate challenges.<n>We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks.<n>Our framework creates scenarios where identical questions require different reasoning chains depending on user context.
arXiv Detail & Related papers (2025-09-30T18:55:28Z)
Reasoning LLMs for User-Aware Multimodal Conversational Agents [3.533721662684487]
Personalization in social robotics is critical for fostering effective human-robot interactions. This paper proposes a novel framework called USER-LLM R1 for a user-aware conversational agent. Our approach integrates chain-of-thought (CoT) reasoning models to iteratively infer user preferences and vision-language models.
arXiv Detail & Related papers (2025-04-02T13:00:17Z)
PredictaBoard: Benchmarking LLM Score Predictability [50.47497036981544]
Large Language Models (LLMs) often fail unpredictably. This poses a significant challenge to ensuring their safe deployment. We present PredictaBoard, a novel collaborative benchmarking framework.
arXiv Detail & Related papers (2025-02-20T10:52:38Z)
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons [62.374792825813394]
This paper introduces AILuminate v1.0, the first comprehensive industry-standard benchmark for assessing AI-product risk and reliability. The benchmark evaluates an AI system's resistance to prompts designed to elicit dangerous, illegal, or undesirable behavior in 12 hazard categories.
arXiv Detail & Related papers (2025-02-19T05:58:52Z)
Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes. We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
On the loss of context-awareness in general instruction fine-tuning [101.03941308894191]
Post-training methods such as supervised fine-tuning (SFT) on instruction-response pairs can harm existing capabilities learned during pretraining. We propose two methods to mitigate the loss of context awareness in instruct models: post-hoc attention steering on user prompts and conditional instruction fine-tuning with a context-dependency indicator.
arXiv Detail & Related papers (2024-11-05T00:16:01Z)
Interpretable Rule-Based System for Radar-Based Gesture Sensing: Enhancing Transparency and Personalization in AI [2.99664686845172]
We introduce MIRA, a transparent and interpretable multi-class rule-based algorithm tailored for radar-based gesture detection. We showcase the system's adaptability through personalized rule sets that calibrate to individual user behavior, offering a user-centric AI experience. Our research underscores MIRA's ability to deliver both high interpretability and performance and emphasizes the potential for broader adoption of interpretable AI in safety-critical applications.
arXiv Detail & Related papers (2024-09-30T16:40:27Z)
To Err Is AI! Debugging as an Intervention to Facilitate Appropriate Reliance on AI Systems [11.690126756498223]
Vision for optimal human-AI collaboration requires 'appropriate reliance' of humans on AI systems. In practice, the performance disparity of machine learning models on out-of-distribution data makes dataset-specific performance feedback unreliable.
arXiv Detail & Related papers (2024-09-22T09:43:27Z)
Con-ReCall: Detecting Pre-training Data in LLMs via Contrastive Decoding [118.75567341513897]
Existing methods typically analyze target text in isolation or solely with non-member contexts. We propose Con-ReCall, a novel approach that leverages the asymmetric distributional shifts induced by member and non-member contexts.
arXiv Detail & Related papers (2024-09-05T09:10:38Z)
Adaptive Guardrails For Large Language Models via Trust Modeling and In-Context Learning [9.719986610417441]
Guardrails have become an integral part of Large language models (LLMs) This study introduces an adaptive guardrail mechanism, supported by trust modeling and enhanced with in-context learning. By leveraging a combination of direct interaction trust and authority-verified trust, the system precisely tailors the strictness of content moderation to align with the user's credibility.
arXiv Detail & Related papers (2024-08-16T18:07:48Z)
Aligning Large Language Models from Self-Reference AI Feedback with one General Principle [61.105703857868775]
We propose a self-reference-based AI feedback framework that enables a 13B Llama2-Chat to provide high-quality feedback. Specifically, we allow the AI to first respond to the user's instructions, then generate criticism of other answers based on its own response as a reference. Finally, we determine which answer better fits human preferences according to the criticism.
arXiv Detail & Related papers (2024-06-17T03:51:46Z)
Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback. The role of user feedback in annotators' assessment of turns in a conversational perception has been little studied. We focus on how the evaluation of task-oriented dialogue systems ( TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z)
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection. We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance. We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
RAH! RecSys-Assistant-Human: A Human-Centered Recommendation Framework with LLM Agents [30.250555783628762]
This research argues that addressing these issues is not solely the recommender systems' responsibility. We introduce the RAH Recommender system, Assistant, and Human framework, emphasizing the alignment with user personalities. Our contributions provide a human-centered recommendation framework that partners effectively with various recommendation models.
arXiv Detail & Related papers (2023-08-19T04:46:01Z)
A Systematic Literature Review of User Trust in AI-Enabled Systems: An HCI Perspective [0.0]
User trust in Artificial Intelligence (AI) enabled systems has been increasingly recognized and proven as a key element to fostering adoption. This review aims to provide an overview of the user trust definitions, influencing factors, and measurement methods from 23 empirical studies.
arXiv Detail & Related papers (2023-04-18T07:58:09Z)
Exploring Robustness of Unsupervised Domain Adaptation in Semantic Segmentation [74.05906222376608]
We propose adversarial self-supervision UDA (or ASSUDA) that maximizes the agreement between clean images and their adversarial examples by a contrastive loss in the output space. This paper is rooted in two observations: (i) the robustness of UDA methods in semantic segmentation remains unexplored, which pose a security concern in this field; and (ii) although commonly used self-supervision (e.g., rotation and jigsaw) benefits image tasks such as classification and recognition, they fail to provide the critical supervision signals that could learn discriminative representation for segmentation tasks.
arXiv Detail & Related papers (2021-05-23T01:50:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.